ingest.stores.sparse

A full-text search engine as a backend vectorstore

SparseStore


def SparseStore(
    **kwargs
):

A factory for built-in SparseStore instances.


SparseStore.create


def create(
    persist_location=None, kind=None, **kwargs
)->SparseStore:

Factory method to construct a SparseStore instance. Extra kwargs passed to object instantiation.

Args:

- persist_location: where the index is stored (for whoosh) or the Elasticsearch URL (for elasticsearch)
- kind: one of {whoosh, elasticsearch}

Elasticsearch-specific kwargs:

- basic_auth: tuple of (username, password) for basic authentication
- verify_certs: whether to verify SSL certificates (default: True)
- ca_certs: path to CA certificate file
- timeout: connection timeout in seconds (default: 30; becomes request_timeout for v9+ compatibility)
- max_retries: maximum number of retries (default: 3)
- retry_on_timeout: whether to retry on timeout (default: True)
- maxsize: maximum number of connections in the pool (default: 25; removed for Elasticsearch v9+ compatibility)

Returns: SparseStore instance
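
For illustration, a minimal usage sketch (assuming the module is importable as onprem.ingest.stores.sparse; the paths, URL, and credentials below are placeholders):

```python
from onprem.ingest.stores.sparse import SparseStore

# Local Whoosh index persisted to disk
store = SparseStore.create(persist_location="/tmp/my_whoosh_index", kind="whoosh")

# Elasticsearch backend; extra kwargs are forwarded to the store's constructor
es_store = SparseStore.create(
    persist_location="http://localhost:9200",   # placeholder URL
    kind="elasticsearch",
    index_name="myindex",
    basic_auth=("elastic", "changeme"),         # placeholder credentials
)
```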


ReadOnlySparseStore


def ReadOnlySparseStore(
    **kwargs
):

A sparse vector store based on a read-only full-text search engine


SharePointStore


def SharePointStore(
    persist_location:str, # SharePoint URL
    username:str, # user name (e.g., CORP\my_username)
    password:str, # your SharePoint password
    chunk_for_semantic_search:bool=True, # whether to do dynamic chunking
    chunk_size:int=500, # chunk size in characters (1 word ~ 4 characters)
    chunk_overlap:int=50, # overlap in characters between chunks
    n_candidates=None, # number of candidate documents to consider for answer. Defaults to limit*10.
    load_web_documents:bool=True, # whether to load content from web URLs when content field is empty
    **kwargs
):

A sparse vector store based on Microsoft SharePoint using a “vectors-on-demand” approach.
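
A minimal construction sketch (the site URL and credentials are placeholders; `limit` follows the search interface described under semantic_search below):

```python
store = SharePointStore(
    persist_location="https://sharepoint.example.com/sites/mysite",  # placeholder URL
    username=r"CORP\my_username",
    password="my_password",                                          # placeholder
)
# Vectors are computed on demand at query time:
docs = store.semantic_search("quarterly revenue", limit=5)
```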


ElasticsearchSparseStore


def ElasticsearchSparseStore(
    persist_location=None, index_name:str='myindex', basic_auth=None, verify_certs:bool=True,
    ca_certs=None, timeout:int=30, max_retries:int=3, retry_on_timeout:bool=True, maxsize:int=25,
    content_field:str='page_content', # Field mapping parameters for existing indices
    source_field='source', id_field:str='id', content_analyzer:str='standard',
    chunk_for_semantic_search:bool=False, # Dynamic chunking for semantic search
    chunk_size:int=500, chunk_overlap:int=50, n_candidates=None, **kwargs
):

A sparse vector store based on Elasticsearch.
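
A sketch of pointing the store at an existing index with custom field names (the URL and credentials are placeholders):

```python
store = ElasticsearchSparseStore(
    persist_location="https://localhost:9200",  # placeholder URL
    index_name="my_existing_index",
    basic_auth=("elastic", "changeme"),         # placeholder credentials
    verify_certs=False,                         # e.g., a dev cluster with self-signed certs
    content_field="body",                       # map to the index's existing text field
    source_field="url",
    id_field="doc_id",
)
```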


WhooshStore


def WhooshStore(
    persist_location=None, index_name:str='myindex',
    chunk_for_semantic_search:bool=False, # Dynamic chunking for semantic search
    chunk_size:int=500, chunk_overlap:int=50, **kwargs
):

A sparse vector store based on the Whoosh full-text search engine.
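
A construction sketch (the index path is a placeholder):

```python
store = WhooshStore(
    persist_location="/tmp/whoosh_index",  # placeholder path
    chunk_for_semantic_search=True,        # chunk large documents at search time
    chunk_size=500,
    chunk_overlap=50,
)
print(store.exists())  # False until documents are indexed
```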


get_field_analyzer


def get_field_analyzer(
    field_name, value, schema=None
):

Get the analyzer that would be used for a field.


create_field_for_value


def create_field_for_value(
    field_name, value
):

Create a field definition based on the value type (same logic as add_documents).


default_schema


def default_schema():

WhooshStore.exists


def exists():

Returns True if documents have been added to the search index


WhooshStore.add_documents


def add_documents(
    docs:Sequence, # list of LangChain Documents
    limitmb:int=1024, # maximum memory in megabytes to use
    verbose:bool=True, # Set to False to disable progress bar
    **kwargs
):

Indexes documents. Extra kwargs supplied to TextStore.ix.writer.
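
An indexing sketch (the import path varies by LangChain version, and the metadata keys shown are illustrative):

```python
from langchain_core.documents import Document  # older versions: langchain.docstore.document

docs = [
    Document(page_content="Whoosh is a pure-Python search engine library.",
             metadata={"source": "/docs/whoosh.txt"}),
    Document(page_content="Elasticsearch is a distributed search engine.",
             metadata={"source": "/docs/es.txt"}),
]
store.add_documents(docs, limitmb=1024, verbose=False)
```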


WhooshStore.remove_document


def remove_document(
    value:str, field:str='id', **kwargs
):

Remove the document whose field matches value. The default field is the id field.


WhooshStore.remove_source


def remove_source(
    source:str
):

Remove all documents associated with source. The source argument can be either the full path to a document or a parent folder. In the latter case, ALL documents in the parent folder will be removed.
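
A removal sketch (the document ID and paths are placeholders):

```python
store.remove_document("a1b2c3")       # remove by document ID (placeholder)
store.remove_source("/docs/es.txt")   # remove a single document by path
store.remove_source("/docs")          # remove EVERYTHING under this folder
```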


WhooshStore.update_documents


def update_documents(
    doc_dicts:dict, # dictionary with keys 'page_content', 'source', 'id', etc.
    **kwargs
):

Update a set of documents (docs in the index with the same ID will be overwritten)


WhooshStore.get_all_docs


def get_all_docs():

Returns a generator to iterate through all indexed documents
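
For example, assuming each record is yielded as a dict of stored fields:

```python
for doc in store.get_all_docs():
    print(doc["id"], doc.get("source"))
```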


WhooshStore.get_doc


def get_doc(
    id:str
):

Get an indexed record by ID


WhooshStore.get_size


def get_size(
    include_deleted:bool=False
)->int:

Gets size of index

If include_deleted is True, the count will include deleted documents (prior to optimization).


WhooshStore.erase


def erase(
    confirm:bool=True
):

Clears index


VectorStore.query


def query(
    query:str, **kwargs
):

Generic query method that invokes the store’s search method. This provides a consistent interface across all store types.
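
A usage sketch; `limit` is assumed to be accepted by the underlying search method, and the result shape follows the contract described under semantic_search below:

```python
results = store.query("climate policy", limit=10)
for hit in results["hits"]:
    print(hit.get("page_content", "")[:80])
print(results["total_hits"])
```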


SparseStore.semantic_search


def semantic_search(
    *args, **kwargs
):

Any subclass of SparseStore can inherit this method for on-the-fly semantic searches. Retrieves results based on semantic similarity to the supplied query. All arguments are forwarded to the store's search method, which is expected to accept "query" as its first positional argument and a "limit" keyword argument. Additional kwargs can be supplied to focus the search (e.g., see the where_document and filters arguments of the search method). Results of the invoked search method are expected to be in the form: {'hits': list_of_dicts, 'total_hits': int}. The result of this method is a list of LangChain Document objects sorted by semantic similarity.

If the subclass supports dynamic chunking (has chunk_for_semantic_search=True), it will chunk large documents and find the best matching chunks per document.

Args:

- return_chunks (bool): If True (default), return individual chunks as Document objects for RAG. If False, return original documents with full content and all chunk scores.
- load_web_documents (bool): If True, attempt to load content from web URLs when the content field is empty (default: False).
- verbose (bool): If True (default), show a progress bar during semantic search processing.
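
A semantic-search sketch using the arguments documented above:

```python
docs = store.semantic_search(
    "how do I reset my password?",
    limit=3,               # forwarded to the store's search method
    return_chunks=True,    # return best-matching chunks for RAG
)
for d in docs:
    print(d.metadata.get("source"), "->", d.page_content[:80])
```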


VectorStore.check


def check():

Raise exception if VectorStore.exists() returns False


VectorStore.ingest


def ingest(
    source_directory:str, # path to folder containing document store
    chunk_size:int=1000, # text is split into chunks of this many characters by [langchain.text_splitter.RecursiveCharacterTextSplitter](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html)
    chunk_overlap:int=100, # character overlap between chunks in `langchain.text_splitter.RecursiveCharacterTextSplitter`
    ignore_fn=None, # Optional function that accepts the file path (including file name) as input and returns `True` if file path should not be ingested.
    batch_size:int=41000, # batch size used when processing documents
    **kwargs
)->None:

Ingests all documents in source_directory (previously-ingested documents are ignored). When retrieved, the Document objects will each have a metadata dict with the absolute path to the file in metadata["source"]. Extra kwargs fed to ingest.load_single_document.
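
An end-to-end sketch (the paths are placeholders):

```python
store = WhooshStore(persist_location="/tmp/whoosh_index")  # placeholder path
store.ingest("/path/to/documents", chunk_size=1000, chunk_overlap=100)
print(store.get_size())  # number of indexed records
```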