ingest.stores.dense

Data store for dense vector representations of documents

source

DenseStore


def DenseStore(
    kwargs:VAR_KEYWORD
):

A factory for built-in DenseStore instances.


source

DenseStore.create


def create(
    persist_location:NoneType=None, kind:NoneType=None, kwargs:VAR_KEYWORD
)->DenseStore:

Factory method to construct a DenseStore instance.

Extra kwargs passed to object instantiation.

Args:

- persist_location: where the vector database is stored
- kind: one of {chroma, elasticsearch}

Returns: DenseStore instance
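
To make the dispatch concrete, here is a minimal sketch of a `kind`-based factory in the spirit of `DenseStore.create`. The stand-in classes and the `"chroma"` default are illustrative assumptions, not the library's actual implementation.

```python
# Stand-in store classes; the real ones live in ingest.stores.dense.
class ChromaStore:
    def __init__(self, persist_location=None, **kwargs):
        self.persist_location = persist_location

class ElasticsearchDenseStore:
    def __init__(self, **kwargs):
        pass

def create(persist_location=None, kind=None, **kwargs):
    """Return a store instance for the requested backend."""
    kind = kind or "chroma"  # assumed default backend
    backends = {"chroma": ChromaStore, "elasticsearch": ElasticsearchDenseStore}
    if kind not in backends:
        raise ValueError(f"kind must be one of {sorted(backends)}")
    # extra kwargs are forwarded to the backend's constructor
    return backends[kind](persist_location=persist_location, **kwargs)

store = create(persist_location="/tmp/vectordb")
print(type(store).__name__)  # -> ChromaStore
```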


source

ElasticsearchDenseStore


def ElasticsearchDenseStore(
    dense_vector_field:str='dense_vector', kwargs:VAR_KEYWORD
):

Elasticsearch store with dense vector search capabilities. Extends DenseStore to provide Elasticsearch-based dense vector storage.
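
For orientation, Elasticsearch stores dense vectors in fields of type `dense_vector`. A sketch of a minimal index mapping for the default `dense_vector` field name follows; the 384-dimension size and the `page_content` text field are assumptions (the dimension must match whatever embedding model is used), not values fixed by this class.

```python
# Hypothetical Elasticsearch index mapping for dense vector search;
# dims (384 here) must equal the embedding model's output dimension.
mapping = {
    "mappings": {
        "properties": {
            "dense_vector": {"type": "dense_vector", "dims": 384},
            "page_content": {"type": "text"},
        }
    }
}
```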


source

ChromaStore


def ChromaStore(
    persist_location:Optional=None, kwargs:VAR_KEYWORD
):

A dense vector store based on Chroma.


source

ChromaStore.exists


def exists(
    
):

Returns True if vector store has been initialized and contains documents.


source

ChromaStore.add_documents


def add_documents(
    documents, batch_size:int=41000
):

Stores instances of langchain_core.documents.base.Document in vectordb
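
Chroma limits how many records can be inserted in a single call, which is why `add_documents` accepts a `batch_size` (41000 by default). The helper below is a hypothetical stand-in for that internal batching loop, not the store's actual code.

```python
def batched(items, batch_size):
    """Yield successive batch_size-sized slices of items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# 10 documents inserted in batches of 4 -> batch sizes [4, 4, 2]
docs = [f"doc-{n}" for n in range(10)]
print([len(b) for b in batched(docs, 4)])  # -> [4, 4, 2]
```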


source

ChromaStore.remove_document


def remove_document(
    id_to_delete
):

Remove a single document with ID, id_to_delete.


source

ChromaStore.remove_source


def remove_source(
    source:str
):

Deletes all documents in a Chroma collection whose source metadata field starts with the given prefix. The source argument can either be a full path to a document or a prefix (e.g., parent folder).

Args:

- source: The source value or prefix

Returns:

- The number of documents deleted
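
The prefix-matching rule can be sketched with a plain list of metadata dicts standing in for a Chroma collection. The real method filters on the `source` metadata field inside Chroma; this stand-in only illustrates which records a given prefix removes.

```python
def remove_source(records, source):
    """Delete records whose metadata 'source' starts with `source`;
    return the number of records deleted."""
    keep = [r for r in records if not r["source"].startswith(source)]
    deleted = len(records) - len(keep)
    records[:] = keep  # mutate in place, mimicking a collection delete
    return deleted

records = [
    {"id": 1, "source": "/data/reports/2023.pdf"},
    {"id": 2, "source": "/data/reports/2024.pdf"},
    {"id": 3, "source": "/data/notes/todo.txt"},
]
# passing a parent-folder prefix removes both reports
print(remove_source(records, "/data/reports"))  # -> 2
```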


source

ChromaStore.update_documents


def update_documents(
    doc_dicts:dict, # dictionary with keys 'page_content', 'source', 'id', etc.
    kwargs:VAR_KEYWORD
):

Update a set of documents (a document in the index with the same ID will be overwritten)
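
The overwrite-by-ID semantics can be pictured with a dict keyed on document ID standing in for the index; `doc_dicts` is treated here as a list of the documented dictionaries. The dict-based index is an illustration, not the store's actual data structure.

```python
def update_documents(index, doc_dicts):
    """Upsert doc_dicts into index; same ID overwrites the stored record."""
    for d in doc_dicts:
        index[d["id"]] = d

index = {"a": {"id": "a", "page_content": "old text", "source": "x.txt"}}
update_documents(index, [{"id": "a", "page_content": "new text", "source": "x.txt"}])
print(index["a"]["page_content"])  # -> new text
```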


source

ChromaStore.get_all_docs


def get_all_docs(
    
):

Returns all docs


source

ChromaStore.get_doc


def get_doc(
    id
):

Retrieve a record by ID


source

ChromaStore.get_size


def get_size(
    
):

Get total number of records


source

ChromaStore.erase


def erase(
    confirm:bool=True
):

Resets the collection and removes all stored documents.
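
The `confirm` flag can be pictured as a guard around a destructive reset: with `confirm=True` the caller is asked before the collection is cleared. The prompt callback below is purely illustrative; the library's actual confirmation mechanism may differ.

```python
def erase(store, confirm=True, ask=lambda: input("erase? (y/n) ") == "y"):
    """Clear the store; with confirm=True, ask first. Returns True if erased."""
    if confirm and not ask():
        return False
    store.clear()
    return True

store = {"doc1": "...", "doc2": "..."}
erase(store, confirm=False)  # no prompt when confirm=False
print(store)  # -> {}
```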


source

VectorStore.query


def query(
    query:str, kwargs:VAR_KEYWORD
):

Generic query method that invokes the store’s search method. This provides a consistent interface across all store types.


source

VectorStore.ingest


def ingest(
    source_directory:str, # path to folder containing document store
    chunk_size:int=1000, # text is split to this many characters by [langchain.text_splitter.RecursiveCharacterTextSplitter](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html)
    chunk_overlap:int=100, # character overlap between chunks in `langchain.text_splitter.RecursiveCharacterTextSplitter`
    ignore_fn:Optional=None, # Optional function that accepts the file path (including file name) as input and returns `True` if file path should not be ingested.
    batch_size:int=41000, # batch size used when processing documents
    kwargs:VAR_KEYWORD
)->None:

Ingests all documents in source_directory (previously-ingested documents are ignored). When retrieved, the Document objects will each have a metadata dict with the absolute path to the file in metadata["source"]. Extra kwargs fed to ingest.load_single_document.

```python
import tempfile

temp_dir = tempfile.TemporaryDirectory()
tempfolder = temp_dir.name
store = DenseStore.create(tempfolder)
store.ingest("tests/sample_data/ktrain_paper/")
```

    Creating new vectorstore at /tmp/tmpmftvr854
    Loading documents from tests/sample_data/ktrain_paper/
    Loading new documents: 100%|██████████| 1/1 [00:00<00:00,  7.85it/s]
    Processing and chunking 6 new documents: 100%|██████████| 1/1 [00:00<00:00, 985.74it/s]
    Split into 41 chunks of text (max. 500 chars each for text; max. 2000 chars for tables)
    Creating embeddings. May take some minutes...
    100%|██████████| 1/1 [00:00<00:00,  3.01it/s]
    Ingestion complete! You can now query your documents using the LLM.ask or LLM.chat methods

```python
type(store)
```

    __main__.ChromaStore

```python
store.get_size()
```

    41

```python
a_document = store.get_all_docs()[0]
store.remove_document(a_document['id'])
store.get_size()
```

    40