ingest.stores.dense

Data store for dense vector representations of documents

source

DenseStore

 DenseStore (**kwargs)

A factory for built-in DenseStore instances.


source

DenseStore.create

 DenseStore.create (persist_location=None, kind=None, **kwargs)

*Factory method to construct a DenseStore instance. Extra kwargs are passed to object instantiation.

Args:
- persist_location: where the vector database is stored
- kind: one of {chroma, elasticsearch}

Returns:
- DenseStore instance*
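A minimal sketch of the kind-based dispatch that such a factory performs, using hypothetical stub classes in place of the real stores (the default of `"chroma"` is assumed from the example session later on this page):

```python
# Hypothetical stand-in classes; the real stores are ChromaStore and
# ElasticsearchDenseStore from ingest.stores.dense.
class ChromaStub:
    def __init__(self, persist_location=None, **kwargs):
        self.persist_location = persist_location

class ElasticsearchStub:
    def __init__(self, persist_location=None, **kwargs):
        self.persist_location = persist_location

KINDS = {"chroma": ChromaStub, "elasticsearch": ElasticsearchStub}

def create_store(persist_location=None, kind=None, **kwargs):
    # Dispatch on kind; "chroma" is the assumed default.
    cls = KINDS.get(kind or "chroma")
    if cls is None:
        raise ValueError(f"kind must be one of {sorted(KINDS)}")
    return cls(persist_location=persist_location, **kwargs)
```

This is illustrative only; in practice you simply call `DenseStore.create(...)` as shown in the session at the bottom of this page.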


source

ElasticsearchDenseStore

 ElasticsearchDenseStore (dense_vector_field:str='dense_vector', **kwargs)

Elasticsearch store with dense vector search capabilities. Extends DenseStore to provide Elasticsearch-based dense vector storage.


source

ChromaStore

 ChromaStore (persist_location:Optional[str]=None, **kwargs)

A dense vector store based on Chroma.


source

ChromaStore.exists

 ChromaStore.exists ()

Returns True if vector store has been initialized and contains documents.


source

ChromaStore.add_documents

 ChromaStore.add_documents (documents, batch_size:int=41000)

Stores instances of langchain_core.documents.base.Document in the vector database


source

ChromaStore.remove_document

 ChromaStore.remove_document (id_to_delete)

Remove a single document with ID, id_to_delete.


source

ChromaStore.remove_source

 ChromaStore.remove_source (source:str)

*Deletes all documents in a Chroma collection whose source metadata field starts with the given prefix. The source argument can either be a full path to a document or a prefix (e.g., parent folder).

Args: - source: The source value or prefix

Returns: - The number of documents deleted*
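The prefix semantics can be illustrated with plain Python. The `matches_source` helper below is hypothetical; the real method additionally deletes the matching records from the Chroma collection:

```python
# Hypothetical helper showing how remove_source selects documents:
# a document matches if its source metadata starts with the given
# full path or folder prefix.
def matches_source(doc_source: str, source: str) -> bool:
    return doc_source.startswith(source)

docs = [
    {"id": "1", "source": "/data/reports/q1.pdf"},
    {"id": "2", "source": "/data/reports/q2.pdf"},
    {"id": "3", "source": "/data/notes/todo.txt"},
]

# Passing a parent folder as the prefix selects both report documents.
to_delete = [d["id"] for d in docs if matches_source(d["source"], "/data/reports")]
```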


source

ChromaStore.update_documents

 ChromaStore.update_documents (doc_dicts:dict, **kwargs)

Update a set of documents (a document in the index with the same ID will be overwritten)

| | Type | Details |
|---|---|---|
| doc_dicts | dict | dictionary with keys 'page_content', 'source', 'id', etc. |
| kwargs | VAR_KEYWORD | |
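Based on the keys described above, a document dictionary might look like the following (values are hypothetical):

```python
# Hypothetical document dictionary for update_documents; the stored
# document with the same "id" is overwritten.
doc_dict = {
    "id": "doc-001",
    "page_content": "ktrain is a low-code machine learning library.",
    "source": "/data/ktrain_paper/ktrain.pdf",
}
```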

source

ChromaStore.get_all_docs

 ChromaStore.get_all_docs ()

Returns all docs


source

ChromaStore.get_doc

 ChromaStore.get_doc (id)

Retrieve a record by ID


source

ChromaStore.get_size

 ChromaStore.get_size ()

Get total number of records


source

ChromaStore.erase

 ChromaStore.erase (confirm=True)

Resets the collection and removes all stored documents


source

VectorStore.query

 VectorStore.query (query:str, **kwargs)

Generic query method that invokes the store’s search method. This provides a consistent interface across all store types.


source

VectorStore.ingest

 VectorStore.ingest (source_directory:str, chunk_size:int=500,
                     chunk_overlap:int=50,
                     ignore_fn:Optional[Callable]=None,
                     batch_size:int=41000, **kwargs)

Ingests all documents in source_directory (previously ingested documents are ignored). When retrieved, each Document object will have a metadata dict with the absolute path to the file in metadata["source"]. Extra kwargs are fed to ingest.load_single_document.

| | Type | Default | Details |
|---|---|---|---|
| source_directory | str | | path to folder containing document store |
| chunk_size | int | 500 | text is split to this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter |
| chunk_overlap | int | 50 | character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter |
| ignore_fn | Optional | None | optional function that accepts the file path (including file name) as input and returns True if the file path should not be ingested |
| batch_size | int | 41000 | batch size used when processing documents |
| kwargs | VAR_KEYWORD | | |
| **Returns** | **None** | | |
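For example, an ignore_fn that skips hidden files and anything other than PDF or text files might look like this (a hypothetical helper following the documented contract: return True to skip the file):

```python
import os

# Hypothetical ignore_fn for ingest(): return True for files that
# should NOT be ingested (hidden files and non-PDF/TXT files here).
def ignore_fn(filepath: str) -> bool:
    name = os.path.basename(filepath)
    if name.startswith("."):
        return True
    return not name.lower().endswith((".pdf", ".txt"))
```

It would then be passed as `store.ingest(source_directory, ignore_fn=ignore_fn)`.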
import tempfile
temp_dir = tempfile.TemporaryDirectory()
tempfolder = temp_dir.name
store = DenseStore.create(tempfolder)
store.ingest("tests/sample_data/ktrain_paper/")
Creating new vectorstore at /tmp/tmpmftvr854
Loading documents from tests/sample_data/ktrain_paper/
Loading new documents: 100%|██████████████████████| 1/1 [00:00<00:00,  7.85it/s]
Processing and chunking 6 new documents: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 985.74it/s]
Split into 41 chunks of text (max. 500 chars each for text; max. 2000 chars for tables)
Creating embeddings. May take some minutes...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.01it/s]
Ingestion complete! You can now query your documents using the LLM.ask or LLM.chat methods
type(store)
__main__.ChromaStore
store.get_size()
41
a_document = store.get_all_docs()[0]
store.remove_document(a_document['id'])
store.get_size()
40