ingest.stores.dual

Dual vector store implementation for ingesting documents into both sparse and dense stores


ElasticsearchStore


def ElasticsearchStore(
    dense_vector_field:str='dense_vector', **kwargs
):

A unified Elasticsearch-based dual store that supports both dense vector searches and sparse text searches in a single index. Uses composition to manage both stores.
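For example (a minimal sketch: only `dense_vector_field` is documented above, and any Elasticsearch connection settings would be supplied through `**kwargs`, whose names are not listed on this page):

store = ElasticsearchStore(
    dense_vector_field='dense_vector',
    # connection settings (host, index name, credentials, ...) would be passed
    # here via **kwargs; their exact names are not documented on this page
)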



DualStore


def DualStore(
    dense_kind:str='chroma', dense_persist_location:Optional[str]=None,
    sparse_kind:str='whoosh', sparse_persist_location:Optional[str]=None,
    **kwargs
):

A dual store that manages a dense vector store (default: Chroma) and a sparse text index (default: Whoosh) together, ingesting each document into both so the same collection supports dense vector search and sparse text search.
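For example, a dual store persisting Chroma and Whoosh indexes to local folders might be created as follows (a sketch; the paths are placeholders and the package prefix in the import is assumed from this page's title):

from onprem.ingest.stores.dual import DualStore  # assumed package prefix; adjust to your install

store = DualStore(
    dense_kind='chroma',                            # dense vector store backend (default)
    dense_persist_location='/tmp/my_dense_index',   # placeholder path
    sparse_kind='whoosh',                           # sparse text index backend (default)
    sparse_persist_location='/tmp/my_sparse_index', # placeholder path
)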



DualStore.exists


def exists():

Returns True if either store exists.



DualStore.add_documents


def add_documents(
    documents:Sequence, batch_size:int=1000, **kwargs
):

Add documents to both the dense and sparse stores. If both stores use the same persist_location, documents are only added once.
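A hedged usage sketch, assuming the documents are LangChain `Document` objects (the `metadata["source"]` convention mirrors the `ingest` description further down):

from langchain_core.documents import Document  # assumed Document type

docs = [
    Document(page_content="First chunk of text.",
             metadata={"source": "/data/docs/report.pdf"}),
    Document(page_content="Second chunk of text.",
             metadata={"source": "/data/docs/notes.txt"}),
]
store.add_documents(docs, batch_size=1000)  # written to both the dense and sparse stores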



DualStore.remove_document


def remove_document(
    id_to_delete
):

Remove a document from both stores.



DualStore.remove_source


def remove_source(
    source:str
):

Remove a document by source from both stores.

The source can either be the full path to a document or a parent folder. Returns the number of records deleted.
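For example (the paths are placeholders):

# remove every record that was ingested from a single file ...
n_deleted = store.remove_source('/data/docs/report.pdf')

# ... or from an entire folder of previously ingested documents
n_deleted = store.remove_source('/data/docs')
print(f'{n_deleted} records deleted')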



DualStore.update_documents


def update_documents(
    doc_dicts:dict, **kwargs
):

Update documents in both stores.



DualStore.get_all_docs


def get_all_docs():

Get all documents from the dense store. For simplicity, we only return documents from one store since they should be the same.



DualStore.get_doc


def get_doc(
    id
):

Get a document by ID from the dense store.



DualStore.get_size


def get_size():

Get the size of the dense store.
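Taken together, the accessors above support quick inspection of an existing store (a sketch; the exact structure of the returned records is not specified on this page):

print(store.get_size())                  # number of records in the dense store
doc = store.get_doc('some-document-id')  # placeholder ID
all_docs = store.get_all_docs()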



DualStore.erase


def erase(
    confirm:bool=True
):

Erase both stores.
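For example (the behavior of the `confirm` flag is an assumption; it presumably prompts before deleting when left at its default of `True`):

store.erase(confirm=False)  # assumption: confirm=False skips an interactive prompt
assert not store.exists()   # both stores are expected to be gone afterwards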



VectorStore.query


def query(
    query:str, **kwargs
):

Generic query method that invokes the store’s search method. This provides a consistent interface across all store types.
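A sketch of this generic interface (the shape of the returned results depends on the underlying store and is not specified on this page):

results = store.query('What is the refund policy?')  # placeholder query string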



VectorStore.check


def check():

Raises an exception if `VectorStore.exists()` returns `False`.
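For example:

store.check()  # raises if VectorStore.exists() returns False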



VectorStore.ingest


def ingest(
    source_directory:str, # path to folder containing document store
    chunk_size:int=1000, # text is split to this many characters by [langchain.text_splitter.RecursiveCharacterTextSplitter](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html)
    chunk_overlap:int=100, # character overlap between chunks in `langchain.text_splitter.RecursiveCharacterTextSplitter`
    ignore_fn:Optional[Callable]=None, # Optional function that accepts the file path (including file name) as input and returns `True` if the file path should not be ingested
    batch_size:int=41000, # batch size used when processing documents
    **kwargs
)->None:

Ingests all documents in `source_directory` (previously ingested documents are ignored). When retrieved, the `Document` objects will each have a `metadata` dict with the absolute path to the file in `metadata["source"]`. Extra kwargs are passed to `ingest.load_single_document`.
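A usage sketch, assuming a local folder of documents (the path and the filtering rule below are illustrative placeholders):

import os

def ignore_hidden(path: str) -> bool:
    """Illustrative ignore_fn: skip hidden files."""
    return os.path.basename(path).startswith('.')

store.ingest(
    source_directory='/data/docs',  # placeholder folder of documents to ingest
    chunk_size=1000,
    chunk_overlap=100,
    ignore_fn=ignore_hidden,
)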