ingest.stores.dual
ElasticsearchStore
def ElasticsearchStore(
dense_vector_field:str='dense_vector', **kwargs
):
A unified Elasticsearch-based dual store that supports both dense vector searches and sparse text searches in a single index. Uses composition to manage both stores.
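As a hedged sketch, usage might look like the following. The import path is an assumption based on the module name above, and the assumption that `ElasticsearchStore` shares the `query`/`semantic_search` interface documented below is not confirmed by the signatures shown here:

```python
# Sketch only: the import path and the search methods are assumptions;
# only `dense_vector_field` appears in the signature above.
from ingest.stores.dual import ElasticsearchStore

store = ElasticsearchStore(dense_vector_field="dense_vector")

# Both search styles run against the same Elasticsearch index.
sparse_hits = store.query("exact keyword phrase")               # sparse text search
dense_hits = store.semantic_search("conceptually similar text")  # dense vector search
```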
DualStore
def DualStore(
dense_kind:str='chroma', dense_persist_location:Optional=None, sparse_kind:str='whoosh',
sparse_persist_location:Optional=None, **kwargs
):
A dual store that pairs a dense vector store (for semantic search) with a sparse keyword store (for text search) and keeps the two in sync.
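A minimal construction sketch using the defaults from the signature above (the import path is an assumption; the keyword names come from the signature):

```python
from ingest.stores.dual import DualStore  # import path is an assumption

store = DualStore(
    dense_kind="chroma",                       # dense vector backend
    dense_persist_location="/tmp/dense_db",    # where the dense store persists data
    sparse_kind="whoosh",                      # sparse keyword backend
    sparse_persist_location="/tmp/sparse_db",  # where the sparse index lives
)
```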
DualStore.exists
def exists(
):
Returns True if either store exists.
DualStore.add_documents
def add_documents(
documents:Sequence, batch_size:int=1000, **kwargs
):
Add documents to both dense and sparse stores. If both stores share the same persist_location, documents are added only once.
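For example, assuming LangChain `Document` objects (consistent with the `Document` objects described under `ingest` below) and the `store` from the construction sketch above:

```python
from langchain_core.documents import Document  # assumes LangChain-style documents

docs = [
    Document(page_content="First chunk of text.",
             metadata={"source": "/abs/path/report.pdf"}),
    Document(page_content="Second chunk of text.",
             metadata={"source": "/abs/path/report.pdf"}),
]
store.add_documents(docs, batch_size=1000)  # indexed in both stores (once if shared)
```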
DualStore.remove_document
def remove_document(
id_to_delete
):
Remove a document from both stores.
DualStore.remove_source
def remove_source(
source:str
):
Remove a document by source from both stores.
The source can either be the full path to a document or a parent folder. Returns the number of records deleted.
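Because `source` may be a single file or a parent folder, both calls below are valid; each returns the number of records removed (paths are illustrative):

```python
n_file = store.remove_source("/abs/path/report.pdf")  # remove one document's records
n_dir = store.remove_source("/abs/path/old_reports")  # remove everything under a folder
print(f"removed {n_file + n_dir} records")
```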
DualStore.update_documents
def update_documents(
doc_dicts:dict, **kwargs
):
Update documents in both stores.
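The expected structure of `doc_dicts` is not documented here; purely as an illustrative guess, it might map document IDs to updated fields:

```python
# Hypothetical shape for doc_dicts -- the real structure is not documented above.
doc_dicts = {
    "doc-id-123": {"page_content": "Corrected text.",
                   "metadata": {"source": "/abs/path/report.pdf"}},
}
store.update_documents(doc_dicts)  # applied to both stores
```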
DualStore.get_all_docs
def get_all_docs(
):
Get all documents from the dense store. For simplicity, documents are returned from only one store, since both stores should hold the same content.
DualStore.get_doc
def get_doc(
id
):
Get a document by ID from the dense store.
DualStore.get_size
def get_size(
):
Get the size of the dense store.
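Together, the read accessors support quick inspection of an existing store; a short sketch (the document ID is hypothetical):

```python
if store.exists():
    print(store.get_size())            # document count in the dense store
    doc = store.get_doc("doc-id-123")  # fetch one document by ID (hypothetical ID)
    for d in store.get_all_docs():     # all documents, dense store only
        print(d)
```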
DualStore.erase
def erase(
confirm:bool=True
):
Erase both stores.
VectorStore.query
def query(
query:str, **kwargs
):
Generic query method that invokes the store’s search method. This provides a consistent interface across all store types.
DualStore.semantic_search
def semantic_search(
query:str, **kwargs
):
Perform semantic search using the dense store.
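A sketch contrasting the two query paths on the same store:

```python
hits = store.query("transformer attention")               # generic search via the store's search method
similar = store.semantic_search("transformer attention")  # dense-vector semantic search
```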
VectorStore.check
def check(
):
Raise an exception if `VectorStore.exists()` returns False.
VectorStore.ingest
def ingest(
source_directory:str, # path to folder containing document store
chunk_size:int=1000, # text is split to this many characters by [langchain.text_splitter.RecursiveCharacterTextSplitter](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html)
chunk_overlap:int=100, # character overlap between chunks in `langchain.text_splitter.RecursiveCharacterTextSplitter`
ignore_fn:Optional=None, # Optional function that accepts a file path (including the file name) as input and returns `True` if the file should not be ingested.
batch_size:int=41000, # batch size used when processing documents
**kwargs
)->None:
Ingests all documents in `source_directory` (previously ingested documents are skipped). When retrieved, each `Document` object will have a metadata dict containing the absolute path to the file in `metadata["source"]`. Extra kwargs are passed to `ingest.load_single_document`.
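Putting it together, a hedged end-to-end sketch with a hypothetical `ignore_fn` that skips hidden files, using the `store` from the construction sketch above:

```python
import os

def ignore_hidden(filepath: str) -> bool:
    # Hypothetical helper: return True to skip dotfiles during ingestion.
    return os.path.basename(filepath).startswith(".")

store.ingest(
    source_directory="/abs/path/my_documents",
    chunk_size=1000,          # characters per chunk
    chunk_overlap=100,        # characters shared between adjacent chunks
    ignore_fn=ignore_hidden,  # skip files flagged by the helper
)
results = store.semantic_search("what does the report conclude?")
```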