ingest.stores.sparse

Full-text search engine as a backend vector store


SparseStore

 SparseStore (**kwargs)

A factory for built-in SparseStore instances.



SparseStore.create

 SparseStore.create (persist_location=None, kind=None, **kwargs)

Factory method to construct a SparseStore instance. Extra kwargs are passed to object instantiation.

Args:
  persist_location: where the index is stored (for whoosh) or the Elasticsearch URL (for elasticsearch)
  kind: one of {whoosh, elasticsearch}

Elasticsearch-specific kwargs:
  basic_auth: tuple of (username, password) for basic authentication
  verify_certs: whether to verify SSL certificates (default: True)
  ca_certs: path to CA certificate file
  timeout: connection timeout in seconds (default: 30, becomes request_timeout for v9+ compatibility)
  max_retries: maximum number of retries (default: 3)
  retry_on_timeout: whether to retry on timeout (default: True)
  maxsize: maximum number of connections in the pool (default: 25, removed for Elasticsearch v9+ compatibility)

Returns: SparseStore instance
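The dispatch described above can be pictured as a small registry keyed by `kind`. The following is a minimal sketch only: `FakeWhooshStore`, `FakeElasticsearchStore`, `BACKENDS`, and `create_store` are hypothetical stand-ins invented for illustration, and the sketch makes no claim about how the library itself resolves `kind` or what it does when `kind` is None.

```python
from typing import Any, Optional


class FakeWhooshStore:
    """Hypothetical stand-in for a local, on-disk index backend."""

    def __init__(self, persist_location: Optional[str] = None, **kwargs: Any):
        self.persist_location = persist_location
        self.options = kwargs


class FakeElasticsearchStore:
    """Hypothetical stand-in for a remote backend addressed by URL."""

    def __init__(self, persist_location: Optional[str] = None, **kwargs: Any):
        self.persist_location = persist_location
        self.options = kwargs


# Registry mapping each supported `kind` to the class that implements it.
BACKENDS = {"whoosh": FakeWhooshStore, "elasticsearch": FakeElasticsearchStore}


def create_store(kind: str, persist_location: Optional[str] = None, **kwargs: Any):
    """Look up the backend for `kind` and forward the remaining kwargs to it."""
    try:
        backend = BACKENDS[kind]
    except KeyError:
        raise ValueError(f"kind must be one of {sorted(BACKENDS)}, got {kind!r}")
    return backend(persist_location=persist_location, **kwargs)


# Backend-specific options travel through **kwargs untouched.
store = create_store(
    "elasticsearch",
    persist_location="https://localhost:9200",
    basic_auth=("user", "secret"),
    verify_certs=False,
)
```

Keeping backend-specific options in `**kwargs` lets the factory signature stay the same as backends are added.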



ReadOnlySparseStore

 ReadOnlySparseStore (**kwargs)

A sparse vector store based on a read-only full-text search engine



SharePointStore

 SharePointStore (persist_location:str, username:str, password:str,
                  chunk_for_semantic_search:bool=True, chunk_size:int=500,
                  chunk_overlap:int=50, n_candidates:Optional[int]=None,
                  load_web_documents:bool=True, **kwargs)

A sparse vector store based on Microsoft SharePoint using a “vectors-on-demand” approach.

| Parameter | Type | Default | Details |
|---|---|---|---|
| persist_location | str | | SharePoint URL |
| username | str | | user name (e.g., CORP_username) |
| password | str | | your SharePoint password |
| chunk_for_semantic_search | bool | True | whether to do dynamic chunking |
| chunk_size | int | 500 | chunk size in characters (1 word ~ 4 characters) |
| chunk_overlap | int | 50 | chunk overlap between chunks |
| n_candidates | Optional[int] | None | number of candidate documents to consider for answer. Defaults to limit*10. |
| load_web_documents | bool | True | whether to load content from web URLs when content field is empty |
| kwargs | VAR_KEYWORD | | |
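To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of character-based chunking with a fixed sliding window. `chunk_text` is a hypothetical helper written for this illustration; it is not the store's own chunking code, which may split on separators rather than at fixed offsets.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# A 1200-character text with the defaults (500 characters, 50 overlap).
text = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(text)
```

With the defaults, each window advances 450 characters, so the last 50 characters of one chunk reappear at the start of the next.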


ElasticsearchSparseStore

 ElasticsearchSparseStore (persist_location:Optional[str]=None,
                           index_name:str='myindex',
                           basic_auth:Optional[tuple]=None,
                           verify_certs:bool=True,
                           ca_certs:Optional[str]=None, timeout:int=30,
                           max_retries:int=3, retry_on_timeout:bool=True,
                           maxsize:int=25,
                           content_field:str='page_content',
                           source_field:Optional[str]='source',
                           id_field:str='id',
                           content_analyzer:str='standard',
                           chunk_for_semantic_search:bool=False,
                           chunk_size:int=500, chunk_overlap:int=50,
                           n_candidates:Optional[int]=None, **kwargs)

A sparse vector store based on Elasticsearch.

| Parameter | Type | Default | Details |
|---|---|---|---|
| persist_location | Optional[str] | None | |
| index_name | str | myindex | |
| basic_auth | Optional[tuple] | None | |
| verify_certs | bool | True | |
| ca_certs | Optional[str] | None | |
| timeout | int | 30 | |
| max_retries | int | 3 | |
| retry_on_timeout | bool | True | |
| maxsize | int | 25 | |
| content_field | str | page_content | Field mapping parameters for existing indices |
| source_field | Optional[str] | source | |
| id_field | str | id | |
| content_analyzer | str | standard | |
| chunk_for_semantic_search | bool | False | Dynamic chunking for semantic search |
| chunk_size | int | 500 | |
| chunk_overlap | int | 50 | |
| n_candidates | Optional[int] | None | |
| kwargs | VAR_KEYWORD | | |


WhooshStore

 WhooshStore (persist_location:Optional[str]=None,
              index_name:str='myindex',
              chunk_for_semantic_search:bool=False, chunk_size:int=500,
              chunk_overlap:int=50, **kwargs)

A sparse vector store based on the Whoosh full-text search engine.

| Parameter | Type | Default | Details |
|---|---|---|---|
| persist_location | Optional[str] | None | |
| index_name | str | myindex | |
| chunk_for_semantic_search | bool | False | Dynamic chunking for semantic search |
| chunk_size | int | 500 | |
| chunk_overlap | int | 50 | |
| kwargs | VAR_KEYWORD | | |


default_schema

 default_schema ()


WhooshStore.exists

 WhooshStore.exists ()

Returns True if documents have been added to search index



WhooshStore.add_documents

 WhooshStore.add_documents (docs:Sequence[langchain_core.documents.base.Document],
                            limitmb:int=1024, optimize:bool=False,
                            verbose:bool=True, **kwargs)

Indexes documents. Extra kwargs supplied to TextStore.ix.writer.

| Parameter | Type | Default | Details |
|---|---|---|---|
| docs | Sequence | | list of LangChain Documents |
| limitmb | int | 1024 | maximum memory in megabytes to use |
| optimize | bool | False | whether or not to also optimize index |
| verbose | bool | True | Set to False to disable progress bar |
| kwargs | VAR_KEYWORD | | |


WhooshStore.remove_document

 WhooshStore.remove_document (value:str, field:str='id',
                              optimize:bool=False)

Remove document with corresponding value and field. Default field is the id field. If optimize is True, index will be optimized.



WhooshStore.remove_source

 WhooshStore.remove_source (source:str, optimize:bool=False)

Remove all documents associated with source. The source argument can be either the full path to a document or a parent folder. In the latter case, ALL documents in the parent folder will be removed. If optimize is True, the index will be optimized.
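The "full path or parent folder" rule can be illustrated with a small predicate. This is a sketch under assumptions: `source_matches` is a hypothetical helper, and the reference does not say how the store itself compares paths. The sketch compares path components rather than raw string prefixes so that a sibling folder such as `reports_old` is not swept up when `reports` is removed.

```python
from pathlib import PurePosixPath


def source_matches(doc_source: str, target: str) -> bool:
    """Return True if doc_source is target itself or lies inside folder target."""
    doc_path = PurePosixPath(doc_source)
    target_path = PurePosixPath(target)
    # `parents` lists every ancestor folder of doc_path, nearest first.
    return doc_path == target_path or target_path in doc_path.parents


indexed = [
    "/data/reports/q1.pdf",
    "/data/reports/archive/q4.pdf",
    "/data/reports_old/q1.pdf",
    "/data/notes/todo.txt",
]

# Passing a folder selects every document beneath it, including subfolders.
to_remove = [s for s in indexed if source_matches(s, "/data/reports")]
```

Passing the full path of a single file selects only that file, while passing a folder selects everything under it.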



WhooshStore.update_documents

 WhooshStore.update_documents (doc_dicts:dict, **kwargs)

Update a set of documents (a document in the index with the same ID will be overwritten).

| Parameter | Type | Details |
|---|---|---|
| doc_dicts | dict | dictionary with keys 'page_content', 'source', 'id', etc. |
| kwargs | VAR_KEYWORD | |


WhooshStore.get_all_docs

 WhooshStore.get_all_docs ()

Returns a generator to iterate through all indexed documents



WhooshStore.get_doc

 WhooshStore.get_doc (id:str)

Get an indexed record by ID



WhooshStore.get_size

 WhooshStore.get_size (include_deleted:bool=False)

Gets the size of the index.

If include_deleted is True, the count will include deleted documents (prior to optimization).



WhooshStore.erase

 WhooshStore.erase (confirm=True)

Clears index



VectorStore.query

 VectorStore.query (query:str, **kwargs)

Generic query method that invokes the store’s search method. This provides a consistent interface across all store types.
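The delegation described here can be sketched in a few lines. Everything in the block is hypothetical and exists only to show the pattern: `BaseStore` and `KeywordStore` are toy classes, and the substring matching stands in for a real search backend.

```python
from typing import Any, Dict, List


class BaseStore:
    """Hypothetical base class: query() simply forwards to search()."""

    def search(self, query: str, **kwargs: Any) -> Any:
        raise NotImplementedError

    def query(self, query: str, **kwargs: Any) -> Any:
        # Callers use query() everywhere; each store decides what search() does.
        return self.search(query, **kwargs)


class KeywordStore(BaseStore):
    """Toy store whose search() does case-insensitive substring matching."""

    def __init__(self, texts: List[str]):
        self.texts = texts

    def search(self, query: str, limit: int = 10, **kwargs: Any) -> Dict[str, Any]:
        matches = [t for t in self.texts if query.lower() in t.lower()]
        return {
            "hits": [{"page_content": t} for t in matches[:limit]],
            "total_hits": len(matches),
        }


store = KeywordStore(
    ["Whoosh is pure Python", "Elasticsearch runs as a service", "whoosh index"]
)
result = store.query("whoosh", limit=1)
```

Code that only calls `query` does not need to know which kind of store it was handed.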



SparseStore.semantic_search

 SparseStore.semantic_search (*args, **kwargs)

Any subclass of SparseStore can inherit this method for on-the-fly semantic searches. It retrieves results based on semantic similarity to the supplied query. All arguments are forwarded to the store's query method. The query method is expected to accept "query" as its first positional argument and a "limit" keyword argument. Results of the query method are expected to be in the form {'hits': list_of_dicts, 'total_hits': int}. The result of this method is a list of LangChain Document objects sorted by semantic similarity.

If the subclass supports dynamic chunking (has chunk_for_semantic_search=True), it will chunk large documents and find the best-matching chunks per document.

Args:
  return_chunks (bool): If True (default), return individual chunks as Document objects for RAG. If False, return original documents with full content and all chunk scores.
  load_web_documents (bool): If True, attempt to load content from web URLs when content field is empty (default: False).
  verbose (bool): If True (default), show progress bar during semantic search processing.
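The contract above (a query method that takes `query` and `limit` and returns a dict with `hits` and `total_hits`) is enough to sketch the general shape of on-the-fly re-ranking. The block below is an illustration under loud assumptions: `embed` is a toy word-count vector standing in for a real embedding model, `rerank` and `toy_query` are hypothetical names, the over-fetch factor of `limit * 10` is borrowed from the `n_candidates` description elsewhere in this reference, and the result is a list of plain dicts rather than LangChain Document objects.

```python
import math
import re
from collections import Counter
from typing import Any, Callable, Dict, List, Optional


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (NOT a real embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank(
    query_fn: Callable[..., Dict[str, Any]],
    query: str,
    limit: int = 3,
    n_candidates: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Fetch keyword candidates, then sort them by similarity to the query."""
    n_candidates = n_candidates or limit * 10  # assumption: over-fetch tenfold
    results = query_fn(query, limit=n_candidates)
    query_vec = embed(query)
    scored = [
        (cosine(query_vec, embed(hit["page_content"])), hit)
        for hit in results["hits"]
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [dict(hit, score=score) for score, hit in scored[:limit]]


CORPUS = [
    {"id": "a", "page_content": "Cats sleep most of the day."},
    {"id": "b", "page_content": "How to index documents for search"},
    {"id": "c", "page_content": "search index"},
]


def toy_query(query: str, limit: int = 10) -> Dict[str, Any]:
    """Hypothetical backend that returns candidates in the expected shape."""
    return {"hits": CORPUS[:limit], "total_hits": len(CORPUS)}


top = rerank(toy_query, "search index", limit=2)
```

Because vectors are computed only for the candidates a keyword query returns, nothing has to be embedded or stored at indexing time.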



VectorStore.check

 VectorStore.check ()

Raise exception if VectorStore.exists() returns False



VectorStore.ingest

 VectorStore.ingest (source_directory:str, chunk_size:int=500,
                     chunk_overlap:int=50,
                     ignore_fn:Optional[Callable]=None,
                     batch_size:int=41000, **kwargs)

Ingests all documents in source_directory (previously-ingested documents are ignored). When retrieved, the Document objects will each have a metadata dict with the absolute path to the file in metadata["source"]. Extra kwargs fed to ingest.load_single_document.

| Parameter | Type | Default | Details |
|---|---|---|---|
| source_directory | str | | path to folder containing document store |
| chunk_size | int | 500 | text is split to this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter |
| chunk_overlap | int | 50 | character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter |
| ignore_fn | Optional[Callable] | None | Optional function that accepts the file path (including file name) as input and returns True if file path should not be ingested. |
| batch_size | int | 41000 | batch size used when processing documents |
| kwargs | VAR_KEYWORD | | |
| Returns | None | | |
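An `ignore_fn` is an ordinary function from a file path to a bool. The sketch below shows one possible filter; `ignore_hidden_and_temp` and the specific rules it applies (hidden files, `~$` lock files, `.tmp` files) are illustrative choices, not defaults of the library.

```python
import os


def ignore_hidden_and_temp(path: str) -> bool:
    """Return True for files that should be skipped during ingestion."""
    name = os.path.basename(path)
    return (
        name.startswith(".")                # hidden files such as .DS_Store
        or name.startswith("~$")            # editor lock files
        or name.lower().endswith(".tmp")    # temporary files
    )


candidates = [
    "/docs/report.pdf",
    "/docs/.DS_Store",
    "/docs/~$draft.docx",
    "/docs/cache.TMP",
]

# Only paths for which the function returns False would be ingested.
kept = [p for p in candidates if not ignore_hidden_and_temp(p)]
```

Such a function would be passed as `ignore_fn=ignore_hidden_and_temp` in the call to `ingest`.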