ingest.stores.dense

Data store for dense vector representations of documents

source

DenseStore


def DenseStore(
    kwargs:VAR_KEYWORD
):

A factory for built-in DenseStore instances.


source

DenseStore.create


def create(
    persist_location:NoneType=None, kind:NoneType=None, kwargs:VAR_KEYWORD
)->DenseStore:

Factory method to construct a DenseStore instance.

Extra kwargs passed to object instantiation.

Args:

- persist_location: where the vector database is stored
- kind: one of {chroma, elasticsearch}

Returns: DenseStore instance
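
To make the dispatch concrete, here is a minimal sketch of a `kind`-based factory in the spirit of `DenseStore.create`. The stand-in classes and the `"chroma"` default are illustrative assumptions, not the library's actual implementation.

```python
# Stand-in store classes; the real ones live in ingest.stores.dense.
class ChromaStore:
    def __init__(self, persist_location=None, **kwargs):
        self.persist_location = persist_location

class ElasticsearchDenseStore:
    def __init__(self, **kwargs):
        pass

def create(persist_location=None, kind=None, **kwargs):
    """Return a store instance for the requested backend."""
    kind = kind or "chroma"  # assumed default backend
    backends = {"chroma": ChromaStore, "elasticsearch": ElasticsearchDenseStore}
    if kind not in backends:
        raise ValueError(f"kind must be one of {sorted(backends)}")
    # extra kwargs are forwarded to the backend's constructor
    return backends[kind](persist_location=persist_location, **kwargs)

store = create(persist_location="/tmp/vectordb")
print(type(store).__name__)  # -> ChromaStore
```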


source

ElasticsearchDenseStore


def ElasticsearchDenseStore(
    dense_vector_field:str='dense_vector', kwargs:VAR_KEYWORD
):

Elasticsearch store with dense vector search capabilities. Extends DenseStore to provide Elasticsearch-based dense vector storage.
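
For orientation, Elasticsearch stores dense vectors in fields of type `dense_vector`. A sketch of a minimal index mapping for the default `dense_vector` field name follows; the 384-dimension size and the `page_content` text field are assumptions (the dimension must match whatever embedding model is used), not values fixed by this class.

```python
# Hypothetical Elasticsearch index mapping for dense vector search;
# dims (384 here) must equal the embedding model's output dimension.
mapping = {
    "mappings": {
        "properties": {
            "dense_vector": {"type": "dense_vector", "dims": 384},
            "page_content": {"type": "text"},
        }
    }
}
```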


source

ChromaStore


def ChromaStore(
    persist_location:Optional=None, kwargs:VAR_KEYWORD
):

A dense vector store based on Chroma.


source

ChromaStore.exists


def exists(
    
):

Returns True if vector store has been initialized and contains documents.


source

ChromaStore.add_documents


def add_documents(
    documents, batch_size:int=41000
):

Stores instances of langchain_core.documents.base.Document in vectordb
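
Chroma limits how many records can be inserted in a single call, which is why `add_documents` accepts a `batch_size` (41000 by default). The helper below is a hypothetical stand-in for that internal batching loop, not the store's actual code.

```python
def batched(items, batch_size):
    """Yield successive batch_size-sized slices of items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# 10 documents inserted in batches of 4 -> batch sizes [4, 4, 2]
docs = [f"doc-{n}" for n in range(10)]
print([len(b) for b in batched(docs, 4)])  # -> [4, 4, 2]
```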


source

ChromaStore.remove_document


def remove_document(
    id_to_delete
):

Remove a single document with ID, id_to_delete.


source

ChromaStore.remove_source


def remove_source(
    source:str
):

Deletes all documents in a Chroma collection whose source metadata field starts with the given prefix. The source argument can either be a full path to a document or a prefix (e.g., parent folder).

Args:

- source: The source value or prefix

Returns:

- The number of documents deleted
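
The prefix-matching rule can be sketched with a plain list of metadata dicts standing in for a Chroma collection. The real method filters on the `source` metadata field inside Chroma; this stand-in only illustrates which records a given prefix removes.

```python
def remove_source(records, source):
    """Delete records whose metadata 'source' starts with `source`;
    return the number of records deleted."""
    keep = [r for r in records if not r["source"].startswith(source)]
    deleted = len(records) - len(keep)
    records[:] = keep  # mutate in place, mimicking a collection delete
    return deleted

records = [
    {"id": 1, "source": "/data/reports/2023.pdf"},
    {"id": 2, "source": "/data/reports/2024.pdf"},
    {"id": 3, "source": "/data/notes/todo.txt"},
]
# passing a parent-folder prefix removes both reports
print(remove_source(records, "/data/reports"))  # -> 2
```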


source

ChromaStore.update_documents


def update_documents(
    doc_dicts:dict, # dictionary with keys 'page_content', 'source', 'id', etc.
    kwargs:VAR_KEYWORD
):

Update a set of documents (a document in the index with the same ID will be overwritten)
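
The overwrite-by-ID semantics can be pictured with a dict keyed on document ID standing in for the index; `doc_dicts` is treated here as a list of the documented dictionaries. The dict-based index is an illustration, not the store's actual data structure.

```python
def update_documents(index, doc_dicts):
    """Upsert doc_dicts into index; same ID overwrites the stored record."""
    for d in doc_dicts:
        index[d["id"]] = d

index = {"a": {"id": "a", "page_content": "old text", "source": "x.txt"}}
update_documents(index, [{"id": "a", "page_content": "new text", "source": "x.txt"}])
print(index["a"]["page_content"])  # -> new text
```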


source

ChromaStore.get_all_docs


def get_all_docs(
    
):

Returns all docs


source

ChromaStore.get_doc


def get_doc(
    id
):

Retrieve a record by ID


source

ChromaStore.get_size


def get_size(
    
):

Get total number of records


source

ChromaStore.erase


def erase(
    confirm:bool=True
):

Resets the collection and removes all stored documents.
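
The `confirm` flag can be pictured as a guard around a destructive reset: with `confirm=True` the caller is asked before the collection is cleared. The prompt callback below is purely illustrative; the library's actual confirmation mechanism may differ.

```python
def erase(store, confirm=True, ask=lambda: input("erase? (y/n) ") == "y"):
    """Clear the store; with confirm=True, ask first. Returns True if erased."""
    if confirm and not ask():
        return False
    store.clear()
    return True

store = {"doc1": "...", "doc2": "..."}
erase(store, confirm=False)  # no prompt when confirm=False
print(store)  # -> {}
```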


source

VectorStore.query


def query(
    query:str, kwargs:VAR_KEYWORD
):

Generic query method that invokes the store’s search method. This provides a consistent interface across all store types.


source

VectorStore.ingest


def ingest(
    source_directory:str, # path to folder containing document store
    chunk_size:int=1000, # text is split to this many characters by [langchain.text_splitter.RecursiveCharacterTextSplitter](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html)
    chunk_overlap:int=100, # character overlap between chunks in `langchain.text_splitter.RecursiveCharacterTextSplitter`
    ignore_fn:Optional=None, # Optional function that accepts the file path (including file name) as input and returns `True` if file path should not be ingested.
    batch_size:int=41000, # batch size used when processing documents
    kwargs:VAR_KEYWORD
)->None:

Ingests all documents in source_directory (previously-ingested documents are ignored). When retrieved, the Document objects will each have a metadata dict with the absolute path to the file in metadata["source"]. Extra kwargs fed to ingest.load_single_document.

```python
import tempfile

temp_dir = tempfile.TemporaryDirectory()
tempfolder = temp_dir.name
store = DenseStore.create(tempfolder)
store.ingest("tests/sample_data/ktrain_paper/")
```

    Creating new vectorstore at /tmp/tmpmftvr854
    Loading documents from tests/sample_data/ktrain_paper/
    Loading new documents: 100%|██████████| 1/1 [00:00<00:00,  7.85it/s]
    Processing and chunking 6 new documents: 100%|██████████| 1/1 [00:00<00:00, 985.74it/s]
    Split into 41 chunks of text (max. 500 chars each for text; max. 2000 chars for tables)
    Creating embeddings. May take some minutes...
    100%|██████████| 1/1 [00:00<00:00,  3.01it/s]
    Ingestion complete! You can now query your documents using the LLM.ask or LLM.chat methods

```python
type(store)
```

    __main__.ChromaStore

```python
store.get_size()
```

    41

```python
a_document = store.get_all_docs()[0]
store.remove_document(a_document['id'])
store.get_size()
```

    40