ingest.stores.sparse
SparseStore
SparseStore (**kwargs)
A factory for built-in SparseStore instances.
SparseStore.create
SparseStore.create (persist_location=None, kind=None, **kwargs)
*Factory method to construct a SparseStore instance. Extra kwargs are passed to object instantiation.

Args:
- persist_location: where the index is stored (for whoosh) or the Elasticsearch URL (for elasticsearch)
- kind: one of {whoosh, elasticsearch}

Elasticsearch-specific kwargs:
- basic_auth: tuple of (username, password) for basic authentication
- verify_certs: whether to verify SSL certificates (default: True)
- ca_certs: path to CA certificate file
- timeout: connection timeout in seconds (default: 30; becomes request_timeout for v9+ compatibility)
- max_retries: maximum number of retries (default: 3)
- retry_on_timeout: whether to retry on timeout (default: True)
- maxsize: maximum number of connections in the pool (default: 25; removed for Elasticsearch v9+ compatibility)

Returns: SparseStore instance*
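The Elasticsearch-specific kwargs above might be assembled as follows. This is an illustrative sketch: the URL, credentials, and certificate path are hypothetical placeholders, not defaults of the library.

```python
# Hypothetical connection settings mirroring the Elasticsearch-specific
# kwargs documented above; all values here are illustrative only.
es_kwargs = {
    "basic_auth": ("elastic", "changeme"),  # (username, password)
    "verify_certs": True,                   # verify SSL certificates
    "ca_certs": "/path/to/ca.crt",          # CA certificate file
    "timeout": 30,                          # becomes request_timeout on ES v9+
    "max_retries": 3,
    "retry_on_timeout": True,
    "maxsize": 25,                          # removed on ES v9+
}

# These would then be forwarded through the factory, e.g.:
# store = SparseStore.create(persist_location="http://localhost:9200",
#                            kind="elasticsearch", **es_kwargs)
```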
SparseStore.semantic_search
SparseStore.semantic_search (*args, **kwargs)
*Any subclass of SparseStore can inherit this method for on-the-fly semantic searches. Retrieves results based on semantic similarity to the supplied query. All arguments are forwarded to the store’s search method, which is expected to accept “query” as its first positional argument and a “limit” keyword argument. Additional kwargs can be supplied to focus the search (e.g., see the where_document and filters arguments of the search method). Results of the invoked search method are expected to be in the form: {‘hits’: list_of_dicts, ‘total_hits’: int}. The result of this method is a list of LangChain Document objects sorted by semantic similarity.

If the subclass supports dynamic chunking (has chunk_for_semantic_search=True), it will chunk large documents and find the best matching chunks per document.

Args:
- return_chunks (bool): If True (default), return individual chunks as Document objects for RAG. If False, return original documents with full content and all chunk scores.
- load_web_documents (bool): If True, attempt to load content from web URLs when the content field is empty (default: False).
- verbose (bool): If True (default), show a progress bar during semantic search processing.*
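The expected result contract can be illustrated with a minimal sketch. The dict shape comes from the docstring above; the hits, the ‘score’ field, and the sorting below are stand-ins to show the best-first ordering, not the library’s implementation.

```python
# A stand-in search result in the documented form:
# {'hits': list_of_dicts, 'total_hits': int}
results = {
    "hits": [
        {"id": "a", "page_content": "cats and dogs", "score": 0.42},
        {"id": "b", "page_content": "feline behavior", "score": 0.91},
        {"id": "c", "page_content": "dog training", "score": 0.17},
    ],
    "total_hits": 3,
}

# semantic_search re-ranks hits by semantic similarity and returns
# Document objects sorted best-first; here we sort by a hypothetical
# 'score' field purely to illustrate that ordering.
ranked = sorted(results["hits"], key=lambda h: h["score"], reverse=True)
ranked_ids = [h["id"] for h in ranked]
```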
ReadOnlySparseStore
ReadOnlySparseStore (**kwargs)
A sparse vector store based on a read-only full-text search engine
ElasticsearchSparseStore
ElasticsearchSparseStore (persist_location:Optional[str]=None, index_name:str='myindex', basic_auth:Optional[tuple]=None, verify_certs:bool=True, ca_certs:Optional[str]=None, timeout:int=30, max_retries:int=3, retry_on_timeout:bool=True, maxsize:int=25, content_field:str='page_content', source_field:Optional[str]='source', id_field:str='id', content_analyzer:str='standard', chunk_for_semantic_search:bool=False, chunk_size:int=500, chunk_overlap:int=50, n_candidates:Optional[int]=None, **kwargs)
A sparse vector store based on Elasticsearch.
| | Type | Default | Details |
|---|---|---|---|
| persist_location | Optional | None | |
| index_name | str | myindex | |
| basic_auth | Optional | None | |
| verify_certs | bool | True | |
| ca_certs | Optional | None | |
| timeout | int | 30 | |
| max_retries | int | 3 | |
| retry_on_timeout | bool | True | |
| maxsize | int | 25 | |
| content_field | str | page_content | Field mapping parameters for existing indices |
| source_field | Optional | source | |
| id_field | str | id | |
| content_analyzer | str | standard | |
| chunk_for_semantic_search | bool | False | Dynamic chunking for semantic search |
| chunk_size | int | 500 | |
| chunk_overlap | int | 50 | |
| n_candidates | Optional | None | |
| kwargs | VAR_KEYWORD | | |
WhooshStore
WhooshStore (persist_location:Optional[str]=None, index_name:str='myindex', chunk_for_semantic_search:bool=False, chunk_size:int=500, chunk_overlap:int=50, **kwargs)
A sparse vector store based on the Whoosh full-text search engine.
| | Type | Default | Details |
|---|---|---|---|
| persist_location | Optional | None | |
| index_name | str | myindex | |
| chunk_for_semantic_search | bool | False | Dynamic chunking for semantic search |
| chunk_size | int | 500 | |
| chunk_overlap | int | 50 | |
| kwargs | VAR_KEYWORD | | |
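The interaction of chunk_size and chunk_overlap can be sketched with a simplified character window. Note the actual splitting is done by LangChain’s RecursiveCharacterTextSplitter, which also respects separators like paragraphs and sentences; this stand-in only illustrates how the two parameters relate.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=50):
    """Simplified character-window chunking to illustrate how
    chunk_size and chunk_overlap interact. The real store uses
    LangChain's RecursiveCharacterTextSplitter."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A 1200-character document with the default parameters:
chunks = sliding_chunks("x" * 1200, chunk_size=500, chunk_overlap=50)
# Each chunk shares its first 50 characters with the end of the previous one.
```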
get_field_analyzer
get_field_analyzer (field_name, value, schema=None)
Get the analyzer that would be used for a field.
create_field_for_value
create_field_for_value (field_name, value)
Create field definition based on value type - same logic as add_documents.
default_schema
default_schema ()
WhooshStore.exists
WhooshStore.exists ()
Returns True if documents have been added to search index
WhooshStore.add_documents
WhooshStore.add_documents (docs:Sequence[langchain_core.documents.base.Document], limitmb:int=1024, optimize:bool=False, verbose:bool=True, **kwargs)
Indexes documents. Extra kwargs supplied to TextStore.ix.writer.
| | Type | Default | Details |
|---|---|---|---|
| docs | Sequence | list of LangChain Documents | |
| limitmb | int | 1024 | maximum memory in megabytes to use |
| optimize | bool | False | whether or not to also optimize index |
| verbose | bool | True | Set to False to disable progress bar |
| kwargs | VAR_KEYWORD | | |
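add_documents expects LangChain Document objects. When langchain_core is not at hand, their shape can be mimicked with a tiny stand-in; this is illustration only, and a real store should be passed actual langchain_core.documents.Document instances.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Minimal stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

# Hypothetical documents with the metadata fields used by the store:
docs = [
    Doc("First passage of text.", {"source": "/data/a.txt", "id": "1"}),
    Doc("Second passage of text.", {"source": "/data/b.txt", "id": "2"}),
]
# With a real WhooshStore instance:
# store.add_documents(docs, limitmb=1024, verbose=False)
```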
WhooshStore.remove_document
WhooshStore.remove_document (value:str, field:str='id', optimize:bool=False)
Remove document with corresponding value and field. Default field is the id field. If optimize is True, index will be optimized.
WhooshStore.remove_source
WhooshStore.remove_source (source:str, optimize:bool=False)
Remove all documents associated with source. The source argument can be either the full path to a document or a parent folder. In the latter case, ALL documents in the parent folder will be removed. If optimize is True, the index will be optimized.
WhooshStore.update_documents
WhooshStore.update_documents (doc_dicts:dict, **kwargs)
Update a set of documents (a doc in the index with the same ID will be overwritten)
| | Type | Details |
|---|---|---|
| doc_dicts | dict | dictionary with keys ‘page_content’, ‘source’, ‘id’, etc. |
| kwargs | VAR_KEYWORD | |
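An update payload with the keys named above might look like this; the values are hypothetical.

```python
# Hypothetical update payload using the documented keys; a document
# already in the index with the same 'id' will be overwritten.
doc_dict = {
    "id": "42",
    "page_content": "Revised text for this document.",
    "source": "/data/report.txt",
}
# With a real WhooshStore instance:
# store.update_documents(doc_dict)
```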
WhooshStore.get_all_docs
WhooshStore.get_all_docs ()
Returns a generator to iterate through all indexed documents
WhooshStore.get_doc
WhooshStore.get_doc (id:str)
Get an indexed record by ID
WhooshStore.get_size
WhooshStore.get_size (include_deleted:bool=False)
*Gets size of index

If include_deleted is True, will include deleted documents (prior to optimization).*
WhooshStore.erase
WhooshStore.erase (confirm=True)
Clears index
VectorStore.query
VectorStore.query (query:str, **kwargs)
Generic query method that invokes the store’s search method. This provides a consistent interface across all store types.
VectorStore.check
VectorStore.check ()
Raise exception if VectorStore.exists() returns False
VectorStore.ingest
VectorStore.ingest (source_directory:str, chunk_size:int=500, chunk_overlap:int=50, ignore_fn:Optional[Callable]=None, batch_size:int=41000, **kwargs)
Ingests all documents in source_directory (previously-ingested documents are ignored). When retrieved, the Document objects will each have a metadata dict with the absolute path to the file in metadata["source"]. Extra kwargs fed to ingest.load_single_document.
| | Type | Default | Details |
|---|---|---|---|
| source_directory | str | path to folder containing document store | |
| chunk_size | int | 500 | text is split to this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter |
| chunk_overlap | int | 50 | character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter |
| ignore_fn | Optional | None | Optional function that accepts the file path (including file name) as input and returns True if file path should not be ingested. |
| batch_size | int | 41000 | batch size used when processing documents |
| kwargs | VAR_KEYWORD | | |
| Returns | None | | |
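The ignore_fn parameter receives the full file path and returns True when the file should be skipped. A filter that skips hidden files and unwanted extensions might look like this; the extension whitelist is an illustrative policy, not a library default.

```python
import os

def ignore_fn(filepath):
    """Return True if filepath should NOT be ingested.
    Skips hidden files and anything without an allowed extension
    (the extension set here is purely illustrative)."""
    name = os.path.basename(filepath)
    if name.startswith("."):
        return True
    return os.path.splitext(name)[1].lower() not in {".txt", ".md", ".pdf"}

# With a real store instance:
# store.ingest("/data/corpus", ignore_fn=ignore_fn)
```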