```python
import tempfile
```
ingest.stores.dense
DenseStore
DenseStore (persist_directory:Optional[str]=None, **kwargs)
A vector store for dense document embeddings, persisted to persist_directory and backed by langchain_chroma.Chroma.
DenseStore.get_db
DenseStore.get_db ()
Returns the underlying langchain_chroma.Chroma instance.
DenseStore.exists
DenseStore.exists ()
Returns True if vector store has been initialized and contains documents.
DenseStore.add_documents
DenseStore.add_documents (documents, batch_size:int=41000)
Stores instances of langchain_core.documents.base.Document in the vector database.
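For illustration, the documents passed to add_documents carry page_content plus a metadata dict. The stand-in class below only mimics the real langchain_core Document, and the store call itself is left commented out:

```python
# Minimal stand-in for langchain_core.documents.base.Document; the real
# class lives in langchain_core and carries the same two fields.
class Document:
    def __init__(self, page_content: str, metadata: dict):
        self.page_content = page_content
        self.metadata = metadata

docs = [
    Document(page_content="ktrain is a low-code library for machine learning.",
             metadata={"source": "/abs/path/ktrain_paper.pdf"}),
]
# store.add_documents(docs)  # written in batches of batch_size (default 41000)
```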
DenseStore.remove_document
DenseStore.remove_document (id_to_delete)
Removes the single document whose ID is id_to_delete.
DenseStore.update_documents
DenseStore.update_documents (doc_dicts:dict, **kwargs)
Updates a set of documents (a document in the index with the same ID will be overwritten).
| | Type | Details |
|---|---|---|
| doc_dicts | dict | dictionary with keys 'page_content', 'source', 'id', etc. |
| kwargs | VAR_KEYWORD | |
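A hedged sketch of the doc_dicts payload; beyond the key names listed in the table above, the exact shape is an assumption:

```python
# One record; keys follow the table above ('page_content', 'source', 'id').
# Any container shape wrapping multiple records is an assumption.
doc_dicts = {
    "page_content": "Revised chunk text.",
    "source": "/abs/path/paper.pdf",
    "id": "abc123",
}
# store.update_documents(doc_dicts)  # record with id 'abc123' is overwritten
```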
DenseStore.get_all_docs
DenseStore.get_all_docs ()
Returns all docs
DenseStore.get_doc
DenseStore.get_doc (id)
Retrieve a record by ID
DenseStore.get_size
DenseStore.get_size ()
Get total number of records
DenseStore.erase
DenseStore.erase (confirm=True)
Resets the collection and removes all stored documents.
DenseStore.query
DenseStore.query (query:str, k:int=4, filters:Optional[Dict[str,str]]=None, where_document:Optional[Dict[str,str]]=None, **kwargs)
Perform a semantic search of the vector DB
| | Type | Default | Details |
|---|---|---|---|
| query | str | | query string |
| k | int | 4 | max number of results to return |
| filters | Optional | None | filter sources by metadata values using Chroma metadata syntax (e.g., {'table': True}) |
| where_document | Optional | None | filter sources by document content in Chroma syntax (e.g., {"$contains": "Canada"}) |
| kwargs | VAR_KEYWORD | | |
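The filter arguments use Chroma's own syntax. A sketch of a filtered query, with the store call itself commented out (the query string and metadata key are illustrative):

```python
# Metadata filter (Chroma 'where' syntax): restrict results to table chunks
filters = {"table": True}
# Content filter (Chroma 'where_document' syntax): chunk must contain "Canada"
where_document = {"$contains": "Canada"}
# results = store.query("population of Canada", k=4,
#                       filters=filters, where_document=where_document)
```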
DenseStore.semantic_search
DenseStore.semantic_search (*args, **kwargs)
Semantic search is equivalent to query in this class.
VectorStore.check
VectorStore.check ()
Raises an exception if VectorStore.exists() returns False.
VectorStore.ingest
VectorStore.ingest (source_directory:str, chunk_size:int=500, chunk_overlap:int=50, ignore_fn:Optional[Callable]=None, batch_size:int=41000, **kwargs)
Ingests all documents in source_directory (previously-ingested documents are ignored). When retrieved, each Document object will have a metadata dict containing the absolute path to the file in metadata["source"]. Extra kwargs are fed to ingest.load_single_document.
| | Type | Default | Details |
|---|---|---|---|
| source_directory | str | | path to folder containing document store |
| chunk_size | int | 500 | text is split to this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter |
| chunk_overlap | int | 50 | character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter |
| ignore_fn | Optional | None | optional function that accepts the file path (including file name) as input and returns True if the file path should not be ingested |
| batch_size | int | 41000 | batch size used when processing documents |
| kwargs | VAR_KEYWORD | | |
| Returns | None | | |
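The ignore_fn parameter is a predicate over file paths. For example, a function that skips everything except PDFs (the ingest call is commented out):

```python
def ignore_non_pdf(filepath: str) -> bool:
    """Return True if the file at filepath should NOT be ingested."""
    return not filepath.lower().endswith(".pdf")

# store.ingest("tests/sample_data/ktrain_paper/", ignore_fn=ignore_non_pdf)
```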
```python
temp_dir = tempfile.TemporaryDirectory()
tempfolder = temp_dir.name
store = DenseStore(persist_directory=tempfolder)
store.ingest("tests/sample_data/ktrain_paper/")
```
Creating new vectorstore at /tmp/tmpmrschkmy
Loading documents from tests/sample_data/ktrain_paper/
Loading new documents: 100%|██████████████████████| 1/1 [00:00<00:00, 3.47it/s]
Processing and chunking 6 new documents: 100%|███████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 814.90it/s]
Split into 41 chunks of text (max. 500 chars each for text; max. 2000 chars for tables)
Creating embeddings. May take some minutes...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 3.10it/s]
Ingestion complete! You can now query your documents using the LLM.ask or LLM.chat methods
```python
store.get_size()
```

41

```python
a_document = store.get_all_docs()[0]
store.remove_document(a_document['id'])
store.get_size()
```

40