ingest.stores.sparse

A full-text search engine as a backend vectorstore

SparseStore


def SparseStore(
    **kwargs
):

A factory for built-in SparseStore instances.


SparseStore.create


def create(
    persist_location=None, kind=None, **kwargs
)->SparseStore:

Factory method to construct a SparseStore instance. Extra kwargs passed to object instantiation.

Args:

- persist_location: where the index is stored (for whoosh) or the Elasticsearch URL (for elasticsearch)
- kind: one of {whoosh, elasticsearch}

Elasticsearch-specific kwargs:

- basic_auth: tuple of (username, password) for basic authentication
- verify_certs: whether to verify SSL certificates (default: True)
- ca_certs: path to CA certificate file
- timeout: connection timeout in seconds (default: 30; becomes request_timeout for v9+ compatibility)
- max_retries: maximum number of retries (default: 3)
- retry_on_timeout: whether to retry on timeout (default: True)
- maxsize: maximum number of connections in the pool (default: 25; removed for Elasticsearch v9+ compatibility)

Returns: SparseStore instance
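
For illustration, a minimal usage sketch (assuming the module is importable as onprem.ingest.stores.sparse; the paths, URL, and credentials below are placeholders):

```python
from onprem.ingest.stores.sparse import SparseStore

# Local Whoosh index persisted to disk
store = SparseStore.create(persist_location="/tmp/my_whoosh_index", kind="whoosh")

# Elasticsearch backend; extra kwargs are forwarded to the store's constructor
es_store = SparseStore.create(
    persist_location="http://localhost:9200",   # placeholder URL
    kind="elasticsearch",
    index_name="myindex",
    basic_auth=("elastic", "changeme"),         # placeholder credentials
)
```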


ReadOnlySparseStore


def ReadOnlySparseStore(
    **kwargs
):

A sparse vector store based on a read-only full-text search engine


SharePointStore


def SharePointStore(
    persist_location:str, # SharePoint URL
    username:str, # user name (e.g., CORP\my_username)
    password:str, # your SharePoint password
    chunk_for_semantic_search:bool=True, # whether to do dynamic chunking
    chunk_size:int=500, # chunk size in characters (1 word ~ 4 characters)
    chunk_overlap:int=50, # overlap in characters between chunks
    n_candidates=None, # number of candidate documents to consider for answer. Defaults to limit*10.
    load_web_documents:bool=True, # whether to load content from web URLs when content field is empty
    **kwargs
):

A sparse vector store based on Microsoft SharePoint using a “vectors-on-demand” approach.
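
A minimal construction sketch (the site URL and credentials are placeholders; `limit` follows the search interface described under semantic_search below):

```python
store = SharePointStore(
    persist_location="https://sharepoint.example.com/sites/mysite",  # placeholder URL
    username=r"CORP\my_username",
    password="my_password",                                          # placeholder
)
# Vectors are computed on demand at query time:
docs = store.semantic_search("quarterly revenue", limit=5)
```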


ElasticsearchSparseStore


def ElasticsearchSparseStore(
    persist_location=None, index_name:str='myindex', basic_auth=None, verify_certs:bool=True,
    ca_certs=None, timeout:int=30, max_retries:int=3, retry_on_timeout:bool=True, maxsize:int=25,
    content_field:str='page_content', # Field mapping parameters for existing indices
    source_field='source', id_field:str='id', content_analyzer:str='standard',
    chunk_for_semantic_search:bool=False, # Dynamic chunking for semantic search
    chunk_size:int=500, chunk_overlap:int=50, n_candidates=None, **kwargs
):

A sparse vector store based on Elasticsearch.
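
A sketch of pointing the store at an existing index with custom field names (the URL and credentials are placeholders):

```python
store = ElasticsearchSparseStore(
    persist_location="https://localhost:9200",  # placeholder URL
    index_name="my_existing_index",
    basic_auth=("elastic", "changeme"),         # placeholder credentials
    verify_certs=False,                         # e.g., a dev cluster with self-signed certs
    content_field="body",                       # map to the index's existing text field
    source_field="url",
    id_field="doc_id",
)
```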


WhooshStore


def WhooshStore(
    persist_location=None, index_name:str='myindex',
    chunk_for_semantic_search:bool=False, # Dynamic chunking for semantic search
    chunk_size:int=500, chunk_overlap:int=50, **kwargs
):

A sparse vector store based on the Whoosh full-text search engine.
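
A construction sketch (the index path is a placeholder):

```python
store = WhooshStore(
    persist_location="/tmp/whoosh_index",  # placeholder path
    chunk_for_semantic_search=True,        # chunk large documents at search time
    chunk_size=500,
    chunk_overlap=50,
)
print(store.exists())  # False until documents are indexed
```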


get_field_analyzer


def get_field_analyzer(
    field_name, value, schema=None
):

Get the analyzer that would be used for a field.


create_field_for_value


def create_field_for_value(
    field_name, value
):

Create a field definition based on the value type (same logic as add_documents).


default_schema


def default_schema():

WhooshStore.exists


def exists():

Returns True if documents have been added to the search index


WhooshStore.add_documents


def add_documents(
    docs:Sequence, # list of LangChain Documents
    limitmb:int=1024, # maximum memory in megabytes to use
    verbose:bool=True, # Set to False to disable progress bar
    **kwargs
):

Indexes documents. Extra kwargs supplied to TextStore.ix.writer.
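
An indexing sketch (the import path varies by LangChain version, and the metadata keys shown are illustrative):

```python
from langchain_core.documents import Document  # older versions: langchain.docstore.document

docs = [
    Document(page_content="Whoosh is a pure-Python search engine library.",
             metadata={"source": "/docs/whoosh.txt"}),
    Document(page_content="Elasticsearch is a distributed search engine.",
             metadata={"source": "/docs/es.txt"}),
]
store.add_documents(docs, limitmb=1024, verbose=False)
```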


WhooshStore.remove_document


def remove_document(
    value:str, field:str='id', **kwargs
):

Remove the document whose field matches value. The default field is the id field.


WhooshStore.remove_source


def remove_source(
    source:str
):

Remove all documents associated with source. The source argument can be either the full path to a document or a parent folder. In the latter case, ALL documents in the parent folder will be removed.
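
A removal sketch (the document ID and paths are placeholders):

```python
store.remove_document("a1b2c3")       # remove by document ID (placeholder)
store.remove_source("/docs/es.txt")   # remove a single document by path
store.remove_source("/docs")          # remove EVERYTHING under this folder
```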


WhooshStore.update_documents


def update_documents(
    doc_dicts:dict, # dictionary with keys 'page_content', 'source', 'id', etc.
    **kwargs
):

Update a set of documents (docs in the index with the same ID will be overwritten)


WhooshStore.get_all_docs


def get_all_docs():

Returns a generator to iterate through all indexed documents
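
For example, assuming each record is yielded as a dict of stored fields:

```python
for doc in store.get_all_docs():
    print(doc["id"], doc.get("source"))
```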


WhooshStore.get_doc


def get_doc(
    id:str
):

Get an indexed record by ID


WhooshStore.get_size


def get_size(
    include_deleted:bool=False
)->int:

Gets size of index

If include_deleted is True, the count will include deleted documents (prior to optimization).


WhooshStore.erase


def erase(
    confirm:bool=True
):

Clears index


VectorStore.query


def query(
    query:str, **kwargs
):

Generic query method that invokes the store’s search method. This provides a consistent interface across all store types.
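
A usage sketch; `limit` is assumed to be accepted by the underlying search method, and the result shape follows the contract described under semantic_search below:

```python
results = store.query("climate policy", limit=10)
for hit in results["hits"]:
    print(hit.get("page_content", "")[:80])
print(results["total_hits"])
```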


SparseStore.semantic_search


def semantic_search(
    *args, **kwargs
):

Any subclass of SparseStore can inherit this method for on-the-fly semantic searches. Retrieves results based on semantic similarity to the supplied query. All arguments are forwarded to the store's search method, which is expected to accept "query" as its first positional argument and a "limit" keyword argument. Additional kwargs can be supplied to focus the search (e.g., see the where_document and filters arguments of the search method). Results of the invoked search method are expected to be in the form: {'hits': list_of_dicts, 'total_hits': int}. The result of this method is a list of LangChain Document objects sorted by semantic similarity.

If the subclass supports dynamic chunking (has chunk_for_semantic_search=True), it will chunk large documents and find the best matching chunks per document.

Args:

- return_chunks (bool): If True (default), return individual chunks as Document objects for RAG. If False, return original documents with full content and all chunk scores.
- load_web_documents (bool): If True, attempt to load content from web URLs when the content field is empty (default: False).
- verbose (bool): If True (default), show a progress bar during semantic search processing.
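
A semantic-search sketch using the arguments documented above:

```python
docs = store.semantic_search(
    "how do I reset my password?",
    limit=3,               # forwarded to the store's search method
    return_chunks=True,    # return best-matching chunks for RAG
)
for d in docs:
    print(d.metadata.get("source"), "->", d.page_content[:80])
```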


VectorStore.check


def check():

Raise exception if VectorStore.exists() returns False


VectorStore.ingest


def ingest(
    source_directory:str, # path to folder containing document store
    chunk_size:int=1000, # text is split into chunks of this many characters by [langchain.text_splitter.RecursiveCharacterTextSplitter](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html)
    chunk_overlap:int=100, # character overlap between chunks in `langchain.text_splitter.RecursiveCharacterTextSplitter`
    ignore_fn=None, # Optional function that accepts the file path (including file name) as input and returns `True` if file path should not be ingested.
    batch_size:int=41000, # batch size used when processing documents
    **kwargs
)->None:

Ingests all documents in source_directory (previously-ingested documents are ignored). When retrieved, the Document objects will each have a metadata dict with the absolute path to the file in metadata["source"]. Extra kwargs fed to ingest.load_single_document.
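
An end-to-end sketch (the paths are placeholders):

```python
store = WhooshStore(persist_location="/tmp/whoosh_index")  # placeholder path
store.ingest("/path/to/documents", chunk_size=1000, chunk_overlap=100)
print(store.get_size())  # number of indexed records
```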