ingest.stores.dense
DenseStore
def DenseStore(
kwargs:VAR_KEYWORD
):
A factory for built-in DenseStore instances.
DenseStore.create
def create(
persist_location:NoneType=None, kind:NoneType=None, kwargs:VAR_KEYWORD
)->DenseStore:
Factory method to construct a DenseStore instance.
Extra kwargs passed to object instantiation.
Args:
- persist_location: where the vector database is stored
- kind: one of {chroma, elasticsearch}
Returns:
- DenseStore instance
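For example, a Chroma-backed store can be created by pointing the factory at a folder (a minimal sketch; the path is illustrative):
# create a Chroma-backed DenseStore persisted to the given folder
store = DenseStore.create(persist_location='/tmp/my_vectordb', kind='chroma')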
ElasticsearchDenseStore
def ElasticsearchDenseStore(
dense_vector_field:str='dense_vector', kwargs:VAR_KEYWORD
):
Elasticsearch store with dense vector search capabilities. Extends DenseStore to provide Elasticsearch-based dense vector storage.
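A hedged sketch: this backend can be selected through the factory above with kind='elasticsearch'; any Elasticsearch connection settings are assumed to be passed through **kwargs, since their names are not documented on this page.
# dense_vector_field is forwarded to ElasticsearchDenseStore via the factory's extra kwargs
store = DenseStore.create(kind='elasticsearch', dense_vector_field='dense_vector')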
ChromaStore
def ChromaStore(
persist_location:Optional=None, kwargs:VAR_KEYWORD
):
A dense vector store based on Chroma.
ChromaStore.exists
def exists(
):
Returns True if vector store has been initialized and contains documents.
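A ChromaStore can also be constructed directly, and exists() reports whether anything has been ingested yet (a minimal sketch with an illustrative path):
store = ChromaStore(persist_location='/tmp/my_vectordb')
store.exists()  # False until documents have been added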
ChromaStore.add_documents
def add_documents(
documents, batch_size:int=41000
):
Stores instances of langchain_core.documents.base.Document in vectordb
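A minimal sketch of adding LangChain Document objects directly (the content and metadata below are illustrative):
from langchain_core.documents import Document

docs = [Document(page_content='ktrain is a low-code machine learning library.',
                 metadata={'source': '/path/to/ktrain_paper.pdf'})]
store.add_documents(docs)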
ChromaStore.remove_document
def remove_document(
id_to_delete
):
Remove a single document with ID, id_to_delete.
ChromaStore.remove_source
def remove_source(
source:str
):
Deletes all documents in a Chroma collection whose source metadata field starts with the given prefix. The source argument can either be a full path to a document or a prefix (e.g., parent folder).
Args: - source: The source value or prefix
Returns: - The number of documents deleted
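For example, to delete every chunk ingested from a particular folder (the path is illustrative):
# removes all documents whose metadata['source'] starts with this prefix
n_deleted = store.remove_source('/path/to/ktrain_paper/')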
ChromaStore.update_documents
def update_documents(
doc_dicts:dict, # dictionary with keys 'page_content', 'source', 'id', etc.
kwargs:VAR_KEYWORD
):
Update a set of documents (a document in the index with the same ID will be overwritten)
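A hedged sketch, assuming a single dict with the keys noted above is accepted (the docstring's "set of documents" suggests a list of such dicts may also be valid):
doc = store.get_all_docs()[0]
doc['page_content'] = 'corrected text for this chunk'  # illustrative edit
store.update_documents(doc)  # the record with the same ID is overwritten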
ChromaStore.get_all_docs
def get_all_docs(
):
Returns all docs
ChromaStore.get_doc
def get_doc(
id
):
Retrieve a record by ID
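IDs are assigned at ingestion time, so a record can be looked up by the ID found on any stored document (as in the example at the end of this page):
some_id = store.get_all_docs()[0]['id']
record = store.get_doc(some_id)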
ChromaStore.get_size
def get_size(
):
Get total number of records
ChromaStore.erase
def erase(
confirm:bool=True
):
Resets the collection and removes any stored documents.
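A hedged sketch: confirm=True presumably triggers an interactive confirmation, so non-interactive code would pass confirm=False (an assumption based on the parameter name):
store.erase(confirm=False)  # assumption: skips the confirmation prompt
store.get_size()            # 0 after a successful reset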
VectorStore.query
def query(
query:str, kwargs:VAR_KEYWORD
):
Generic query method that invokes the store’s search method. This provides a consistent interface across all store types.
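A minimal sketch of the generic interface (the query text is illustrative); for a dense store this is assumed to delegate to semantic_search:
results = store.query('What is ktrain?')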
ChromaStore.semantic_search
def semantic_search(
args:VAR_POSITIONAL, kwargs:VAR_KEYWORD
):
Perform a semantic search of the vector DB. Returns results as LangChain Document objects.
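A minimal sketch, assuming the results are returned as a list of Documents, each exposing page_content and metadata:
hits = store.semantic_search('low-code machine learning')
for d in hits:
    print(d.metadata['source'], d.page_content[:80])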
VectorStore.ingest
def ingest(
source_directory:str, # path to folder containing document store
chunk_size:int=1000, # text is split to this many characters by [langchain.text_splitter.RecursiveCharacterTextSplitter](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html)
chunk_overlap:int=100, # character overlap between chunks in `langchain.text_splitter.RecursiveCharacterTextSplitter`
ignore_fn:Optional=None, # Optional function that accepts the file path (including file name) as input and returns `True` if file path should not be ingested.
batch_size:int=41000, # batch size used when processing documents
kwargs:VAR_KEYWORD
)->None:
Ingests all documents in source_directory (previously-ingested documents are ignored). When retrieved, the Document objects will each have a metadata dict with the absolute path to the file in metadata["source"]. Extra kwargs fed to ingest.load_single_document.
import tempfile

temp_dir = tempfile.TemporaryDirectory()
tempfolder = temp_dir.name
store = DenseStore.create(tempfolder)
store.ingest("tests/sample_data/ktrain_paper/")

Creating new vectorstore at /tmp/tmpmftvr854
Loading documents from tests/sample_data/ktrain_paper/
Loading new documents: 100%|██████████████████████| 1/1 [00:00<00:00, 7.85it/s]
Processing and chunking 6 new documents: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 985.74it/s]
Split into 41 chunks of text (max. 500 chars each for text; max. 2000 chars for tables)
Creating embeddings. May take some minutes...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 3.01it/s]
Ingestion complete! You can now query your documents using the LLM.ask or LLM.chat methods
type(store)

__main__.ChromaStore

store.get_size()

41

a_document = store.get_all_docs()[0]
store.remove_document(a_document['id'])
store.get_size()

40