import tempfile
ingest.stores.dense
DenseStore
DenseStore (**kwargs)
A factory for built-in DenseStore instances.
DenseStore.create
DenseStore.create (persist_location=None, kind=None, **kwargs)
Factory method to construct a DenseStore instance. Extra kwargs are passed to object instantiation.
Args:
persist_location: where the vector database is stored
kind: one of {chroma, elasticsearch}
Returns: DenseStore instance
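The dispatch on kind described above can be pictured with a small registry. This is only a sketch under stated assumptions, not the library's implementation: SketchChromaStore, SketchElasticsearchStore, and create_store are hypothetical stand-ins, and the fallback to "chroma" when kind is omitted is an assumption (it is consistent with the example in this page, where DenseStore.create(tempfolder) yields a ChromaStore).

```python
from typing import Optional


class SketchChromaStore:
    """Hypothetical stand-in for a Chroma-backed store (not the real class)."""

    def __init__(self, persist_location: Optional[str] = None, **kwargs):
        self.persist_location = persist_location
        self.options = kwargs


class SketchElasticsearchStore:
    """Hypothetical stand-in for an Elasticsearch-backed store (not the real class)."""

    def __init__(self, persist_location: Optional[str] = None, **kwargs):
        self.persist_location = persist_location
        self.options = kwargs


# Registry mapping each `kind` value to the class that implements it.
_STORE_KINDS = {
    "chroma": SketchChromaStore,
    "elasticsearch": SketchElasticsearchStore,
}


def create_store(persist_location: Optional[str] = None,
                 kind: Optional[str] = None, **kwargs):
    """Return a store of the requested kind; extra kwargs go to its constructor."""
    # Assumption: fall back to "chroma" when no kind is given.
    kind = kind or "chroma"
    if kind not in _STORE_KINDS:
        raise ValueError(f"kind must be one of {sorted(_STORE_KINDS)}, got {kind!r}")
    return _STORE_KINDS[kind](persist_location=persist_location, **kwargs)
```

With this shape, supporting another backend means adding one registry entry, and callers never need to import the concrete classes.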
ElasticsearchDenseStore
ElasticsearchDenseStore (dense_vector_field:str='dense_vector', **kwargs)
Elasticsearch store with dense vector search capabilities. Extends DenseStore to provide Elasticsearch-based dense vector storage.
ChromaStore
ChromaStore (persist_location:Optional[str]=None, **kwargs)
A dense vector store based on Chroma.
ChromaStore.exists
ChromaStore.exists ()
Returns True if vector store has been initialized and contains documents.
ChromaStore.add_documents
ChromaStore.add_documents (documents, batch_size:int=41000)
Stores instances of langchain_core.documents.base.Document in the vector database.
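The source does not say how batch_size is used internally. A plausible reading is that documents are written in slices of at most batch_size items; the helper below, iter_batches, is a hypothetical illustration of that slicing and is not part of the library.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int = 41000) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each holding at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        # The final slice may be shorter than batch_size.
        yield list(items[start:start + batch_size])
```

For example, ten items with batch_size=4 would be written as three batches of sizes 4, 4, and 2.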
ChromaStore.remove_document
ChromaStore.remove_document (id_to_delete)
Remove a single document with ID id_to_delete.
ChromaStore.remove_source
ChromaStore.remove_source (source:str)
Deletes all documents in a Chroma collection whose source metadata field starts with the given prefix. The source argument can either be a full path to a document or a prefix (e.g., parent folder).
Args: source: The source value or prefix
Returns: The number of documents deleted
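The prefix semantics can be illustrated on a plain in-memory list. This sketch assumes each record is a dict with a metadata["source"] entry; remove_by_source_prefix is a hypothetical helper and does not touch Chroma.

```python
from typing import Dict, List


def remove_by_source_prefix(records: List[Dict], source: str) -> int:
    """Delete records whose metadata['source'] starts with `source`; return the count."""
    kept = [
        r for r in records
        if not str(r.get("metadata", {}).get("source", "")).startswith(source)
    ]
    deleted = len(records) - len(kept)
    records[:] = kept  # mutate the caller's list in place
    return deleted
```

Note that with a plain string prefix, "/data/a" also matches "/data/ab/file.pdf". If the library's match behaves the same way, passing the folder with a trailing separator ("/data/a/") avoids that.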
ChromaStore.update_documents
ChromaStore.update_documents (doc_dicts:dict, **kwargs)
Update a set of documents (a document in the index with the same ID will be overwritten).
Parameter | Type | Details
---|---|---
doc_dicts | dict | dictionary with keys ‘page_content’, ‘source’, ‘id’, etc.
kwargs | VAR_KEYWORD |
ChromaStore.get_all_docs
ChromaStore.get_all_docs ()
Returns all docs
ChromaStore.get_doc
ChromaStore.get_doc (id)
Retrieve a record by ID
ChromaStore.get_size
ChromaStore.get_size ()
Get total number of records
ChromaStore.erase
ChromaStore.erase (confirm=True)
Resets the collection and removes any stored documents
VectorStore.query
VectorStore.query (query:str, **kwargs)
Generic query method that invokes the store’s search method. This provides a consistent interface across all store types.
ChromaStore.semantic_search
ChromaStore.semantic_search (*args, **kwargs)
Perform a semantic search of the vector DB. Returns results as LangChain Document objects.
VectorStore.ingest
VectorStore.ingest (source_directory:str, chunk_size:int=500, chunk_overlap:int=50, ignore_fn:Optional[Callable]=None, batch_size:int=41000, **kwargs)
Ingests all documents in source_directory (previously-ingested documents are ignored). When retrieved, the Document objects will each have a metadata dict with the absolute path to the file in metadata["source"]. Extra kwargs are fed to ingest.load_single_document.
Parameter | Type | Default | Details
---|---|---|---
source_directory | str | | path to folder containing document store
chunk_size | int | 500 | text is split to this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter
chunk_overlap | int | 50 | character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter
ignore_fn | Optional | None | Optional function that accepts the file path (including file name) as input and returns True if file path should not be ingested.
batch_size | int | 41000 | batch size used when processing documents
kwargs | VAR_KEYWORD | |
Returns | None | |
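As an illustration of the ignore_fn contract in the table above (takes the file path, returns True to skip the file), here is a minimal sketch. The function name and the particular skip rules (hidden files, .tmp files, editor backups ending in ~) are assumptions chosen for the example.

```python
import os


def ignore_hidden_and_temp(file_path: str) -> bool:
    """Return True when the file should NOT be ingested."""
    name = os.path.basename(file_path)
    # Skip dotfiles, temporary files, and editor backup files.
    return name.startswith(".") or name.endswith((".tmp", "~"))
```

Given the signature above, such a function would be passed as store.ingest(source_directory, ignore_fn=ignore_hidden_and_temp).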
temp_dir = tempfile.TemporaryDirectory()
tempfolder = temp_dir.name
store = DenseStore.create(tempfolder)
store.ingest("tests/sample_data/ktrain_paper/")
Creating new vectorstore at /tmp/tmpmftvr854
Loading documents from tests/sample_data/ktrain_paper/
Loading new documents: 100%|██████████████████████| 1/1 [00:00<00:00, 7.85it/s]
Processing and chunking 6 new documents: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 985.74it/s]
Split into 41 chunks of text (max. 500 chars each for text; max. 2000 chars for tables)
Creating embeddings. May take some minutes...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 3.01it/s]
Ingestion complete! You can now query your documents using the LLM.ask or LLM.chat methods
type(store)
__main__.ChromaStore
store.get_size()
41
a_document = store.get_all_docs()[0]
store.remove_document(a_document['id'])
store.get_size()
40