ingest.stores.dense

vector database for question-answering and other tasks

source

DenseStore

 DenseStore (persist_directory:Optional[str]=None, **kwargs)

A dense vector store for question-answering and other tasks, backed by Chroma.


source

DenseStore.get_db

 DenseStore.get_db ()

Returns the underlying langchain_chroma.Chroma instance.


source

DenseStore.exists

 DenseStore.exists ()

Returns True if the vector store has been initialized and contains documents.


source

DenseStore.add_documents

 DenseStore.add_documents (documents, batch_size:int=41000)

Stores instances of langchain_core.documents.base.Document in the vector database.
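A minimal sketch of building documents to pass to add_documents. The sample text and metadata below are illustrative, not from the library, and the `store` in the final commented line is assumed to be a previously created DenseStore.

```python
# Sketch: constructing Document objects for add_documents.
# The page_content and metadata values here are made up for illustration.
try:
    from langchain_core.documents import Document
except ImportError:
    # Stand-in with the same two fields, so the expected shape is clear
    # even when langchain_core is not installed.
    class Document:
        def __init__(self, page_content, metadata=None):
            self.page_content = page_content
            self.metadata = metadata or {}

docs = [
    Document(
        page_content="ktrain is a low-code library for machine learning.",
        metadata={"source": "/path/to/ktrain_paper.pdf"},
    )
]
# store.add_documents(docs, batch_size=41000)  # stored in batches of up to 41,000
```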


source

DenseStore.remove_document

 DenseStore.remove_document (id_to_delete)

Remove a single document with ID id_to_delete.


source

DenseStore.update_documents

 DenseStore.update_documents (doc_dicts:dict, **kwargs)

Update a set of documents (documents in the index with the same ID will be overwritten).

|           | Type        | Details                                                   |
|-----------|-------------|-----------------------------------------------------------|
| doc_dicts | dict        | dictionary with keys 'page_content', 'source', 'id', etc. |
| kwargs    | VAR_KEYWORD |                                                           |
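A sketch of the doc_dicts shape described above; the ID, content, and source path are hypothetical, and the `store` in the commented call is assumed to be a previously created DenseStore.

```python
# Hypothetical document dictionary matching the keys listed above
# ('page_content', 'source', 'id'); all values are made up for illustration.
doc_dict = {
    "id": "doc-001",
    "page_content": "ktrain is a low-code library for machine learning.",
    "source": "/path/to/ktrain_paper.pdf",
}
# store.update_documents(doc_dict)  # an indexed doc with the same ID is overwritten
```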

source

DenseStore.get_all_docs

 DenseStore.get_all_docs ()

Returns all docs


source

DenseStore.get_doc

 DenseStore.get_doc (id)

Retrieve a record by ID


source

DenseStore.get_size

 DenseStore.get_size ()

Get total number of records


source

DenseStore.erase

 DenseStore.erase (confirm=True)

Resets the collection and removes any stored documents.


source

DenseStore.query

 DenseStore.query (query:str, k:int=4,
                   filters:Optional[Dict[str,str]]=None,
                   where_document:Optional[Dict[str,str]]=None, **kwargs)

Perform a semantic search of the vector DB

|                | Type        | Default | Details                                                                               |
|----------------|-------------|---------|---------------------------------------------------------------------------------------|
| query          | str         |         | query string                                                                          |
| k              | int         | 4       | max number of results to return                                                       |
| filters        | Optional    | None    | filter sources by metadata values using Chroma metadata syntax (e.g., {'table': True}) |
| where_document | Optional    | None    | filter sources by document content in Chroma syntax (e.g., {"$contains": "Canada"})   |
| kwargs         | VAR_KEYWORD |         |                                                                                       |
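A sketch of the Chroma-style filter arguments accepted by query. The query string and k value are illustrative, and the `store` in the commented call is assumed to be a previously created DenseStore.

```python
# Chroma-style filters for a semantic search, per the parameter table above.
filters = {"table": True}                 # metadata filter (Chroma metadata syntax)
where_document = {"$contains": "Canada"}  # content filter (Chroma document syntax)
# results = store.query("What is ktrain?", k=4,
#                       filters=filters, where_document=where_document)
```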

source

VectorStore.check

 VectorStore.check ()

Raises an exception if VectorStore.exists() returns False.


source

VectorStore.ingest

 VectorStore.ingest (source_directory:str, chunk_size:int=500,
                     chunk_overlap:int=50,
                     ignore_fn:Optional[Callable]=None,
                     batch_size:int=41000, **kwargs)

Ingests all documents in source_directory (previously ingested documents are ignored). When retrieved, each Document object will have a metadata dict with the absolute path to the source file in metadata["source"]. Extra kwargs are fed to ingest.load_single_document.

|                  | Type        | Default | Details                                                                                                        |
|------------------|-------------|---------|----------------------------------------------------------------------------------------------------------------|
| source_directory | str         |         | path to folder containing document store                                                                       |
| chunk_size       | int         | 500     | text is split to this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter                |
| chunk_overlap    | int         | 50      | character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter                     |
| ignore_fn        | Optional    | None    | optional function that accepts the file path (including file name) as input and returns True if the file path should not be ingested |
| batch_size       | int         | 41000   | batch size used when processing documents                                                                      |
| kwargs           | VAR_KEYWORD |         |                                                                                                                |
| **Returns**      | **None**    |         |                                                                                                                |
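A hedged sketch of an ignore_fn predicate. The function name and the PDF-only rule are hypothetical, and the `store` in the commented call is assumed to be a previously created DenseStore.

```python
# Hypothetical ignore_fn: skip every file that is not a PDF.
# ingest calls this with each file path and skips the file when it returns True.
def ignore_non_pdfs(filepath: str) -> bool:
    return not filepath.lower().endswith(".pdf")

# store.ingest("tests/sample_data/ktrain_paper/", ignore_fn=ignore_non_pdfs)
```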
import tempfile
temp_dir = tempfile.TemporaryDirectory()
tempfolder = temp_dir.name
store = DenseStore(persist_directory=tempfolder)
store.ingest("tests/sample_data/ktrain_paper/")
Creating new vectorstore at /tmp/tmpmrschkmy
Loading documents from tests/sample_data/ktrain_paper/
Loading new documents: 100%|██████████████████████| 1/1 [00:00<00:00,  3.47it/s]
Processing and chunking 6 new documents: 100%|███████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 814.90it/s]
Split into 41 chunks of text (max. 500 chars each for text; max. 2000 chars for tables)
Creating embeddings. May take some minutes...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.10it/s]
Ingestion complete! You can now query your documents using the LLM.ask or LLM.chat methods
store.get_size()
41
a_document = store.get_all_docs()[0]
store.remove_document(a_document['id'])
store.get_size()
40