ingest.stores.sparse
SparseStore
def SparseStore(
kwargs:VAR_KEYWORD
):
A factory for built-in SparseStore instances.
SparseStore.create
def create(
persist_location:Optional=None, kind:Optional=None, kwargs:VAR_KEYWORD
)->SparseStore:
Factory method to construct a SparseStore instance. Extra kwargs passed to object instantiation.
Args:
- persist_location: where the index is stored (for whoosh) or the Elasticsearch URL (for elasticsearch)
- kind: one of {whoosh, elasticsearch}
Elasticsearch-specific kwargs:
- basic_auth: tuple of (username, password) for basic authentication
- verify_certs: whether to verify SSL certificates (default: True)
- ca_certs: path to CA certificate file
- timeout: connection timeout in seconds (default: 30; becomes request_timeout for v9+ compatibility)
- max_retries: maximum number of retries (default: 3)
- retry_on_timeout: whether to retry on timeout (default: True)
- maxsize: maximum number of connections in the pool (default: 25; removed for Elasticsearch v9+ compatibility)
Returns: SparseStore instance
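For illustration, a minimal usage sketch of the factory. The import path is an assumption based on the module name above, and the URL and credentials are placeholders:

```python
# Import path assumed from the module name shown above; adjust to your installation.
from ingest.stores.sparse import SparseStore

# Whoosh-backed store persisted to a local folder
store = SparseStore.create(persist_location="/tmp/myindex", kind="whoosh")

# Elasticsearch-backed store; extra kwargs are forwarded to object instantiation
es_store = SparseStore.create(
    persist_location="http://localhost:9200",
    kind="elasticsearch",
    basic_auth=("elastic", "changeme"),  # placeholder credentials
    verify_certs=False,
)
```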
SparseStore.semantic_search
def semantic_search(
args:VAR_POSITIONAL, kwargs:VAR_KEYWORD
):
Any subclass of SparseStore can inherit this method for on-the-fly semantic searches: it retrieves results based on semantic similarity to the supplied query. All arguments are forwarded to the store's search method, which is expected to accept "query" as its first positional argument and a "limit" keyword argument. Additional kwargs can be supplied to focus the search (e.g., see the where_document and filters arguments of the search method). The invoked search method is expected to return results in the form: {'hits': list_of_dicts, 'total_hits': int}. The result of this method is a list of LangChain Document objects sorted by semantic similarity.
If the subclass supports dynamic chunking (has chunk_for_semantic_search=True), it will chunk large documents and find the best matching chunks per document.
Args:
- return_chunks (bool): If True (default), return individual chunks as Document objects for RAG. If False, return original documents with full content and all chunk scores.
- load_web_documents (bool): If True, attempt to load content from web URLs when the content field is empty (default: False).
- verbose (bool): If True (default), show a progress bar during semantic search processing.
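A hedged usage sketch, assuming the store created in the earlier example (the query strings and limit are illustrative):

```python
# "query" is the first positional argument; "limit" and other kwargs are
# forwarded to the store's search method.
docs = store.semantic_search("how do I renew a passport?", limit=5)
for d in docs:  # LangChain Document objects, sorted by semantic similarity
    print(d.metadata.get("source"), d.page_content[:80])

# Return original documents (with all chunk scores) instead of RAG-ready chunks:
full_docs = store.semantic_search("passport renewal", limit=5, return_chunks=False)
```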
ReadOnlySparseStore
def ReadOnlySparseStore(
kwargs:VAR_KEYWORD
):
A sparse vector store based on a read-only full-text search engine.
ElasticsearchSparseStore
def ElasticsearchSparseStore(
persist_location:Optional=None, index_name:str='myindex', basic_auth:Optional=None, verify_certs:bool=True,
ca_certs:Optional=None, timeout:int=30, max_retries:int=3, retry_on_timeout:bool=True, maxsize:int=25,
content_field:str='page_content', # Field mapping parameters for existing indices
source_field:Optional='source', id_field:str='id', content_analyzer:str='standard',
chunk_for_semantic_search:bool=False, # Dynamic chunking for semantic search
chunk_size:int=500, chunk_overlap:int=50, n_candidates:Optional=None, kwargs:VAR_KEYWORD
):
A sparse vector store based on Elasticsearch.
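A sketch of pointing the store at an existing index. The field names below (body, url, doc_id) are hypothetical examples of mapping an existing Elasticsearch mapping onto the store's expected fields, and the URL, credentials, and certificate path are placeholders:

```python
from ingest.stores.sparse import ElasticsearchSparseStore  # import path assumed

es_store = ElasticsearchSparseStore(
    persist_location="https://localhost:9200",
    index_name="my_existing_index",
    basic_auth=("elastic", "changeme"),  # placeholder credentials
    ca_certs="/path/to/http_ca.crt",     # placeholder certificate path
    content_field="body",                # existing field holding the text
    source_field="url",                  # existing field holding the source
    id_field="doc_id",                   # existing field holding the ID
    chunk_for_semantic_search=True,      # chunk large docs during semantic search
    chunk_size=500,
    chunk_overlap=50,
)
```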
WhooshStore
def WhooshStore(
persist_location:Optional=None, index_name:str='myindex',
chunk_for_semantic_search:bool=False, # Dynamic chunking for semantic search
chunk_size:int=500, chunk_overlap:int=50, kwargs:VAR_KEYWORD
):
A sparse vector store based on the Whoosh full-text search engine.
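A minimal sketch of a local, file-based store (the path is a placeholder):

```python
from ingest.stores.sparse import WhooshStore  # import path assumed

store = WhooshStore(persist_location="/tmp/whoosh_index")
print(store.exists())  # False until documents are added
```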
get_field_analyzer
def get_field_analyzer(
field_name, value, schema:Optional=None
):
Get the analyzer that would be used for a field.
create_field_for_value
def create_field_for_value(
field_name, value
):
Create a field definition based on the value's type (same logic as add_documents).
default_schema
def default_schema(
):
Returns the default schema for a Whoosh index.
WhooshStore.exists
def exists(
):
Returns True if documents have been added to the search index
WhooshStore.add_documents
def add_documents(
docs:Sequence, # list of LangChain Documents
limitmb:int=1024, # maximum memory in megabytes to use
verbose:bool=True, # Set to False to disable progress bar
kwargs:VAR_KEYWORD
):
Indexes documents. Extra kwargs are supplied to TextStore.ix.writer.
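A sketch of indexing LangChain Documents. The Document import path reflects recent LangChain versions (older versions expose it elsewhere), and the contents and paths are placeholders:

```python
from langchain_core.documents import Document

# store: the WhooshStore created in the earlier sketch
docs = [
    Document(page_content="Whoosh is a pure-Python search library.",
             metadata={"source": "/docs/whoosh.txt"}),
    Document(page_content="Elasticsearch is a distributed search engine.",
             metadata={"source": "/docs/es.txt"}),
]
store.add_documents(docs, limitmb=1024, verbose=False)
```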
WhooshStore.remove_document
def remove_document(
value:str, field:str='id', kwargs:VAR_KEYWORD
):
Remove the document whose field matches the given value. The default field is the id field.
WhooshStore.remove_source
def remove_source(
source:str
):
Remove all documents associated with source. The source argument can be either the full path to a document or a parent folder; in the latter case, ALL documents in the parent folder will be removed.
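For example (paths are placeholders):

```python
store.remove_source("/docs/whoosh.txt")  # remove a single document's records
store.remove_source("/docs")             # remove ALL documents under /docs
```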
WhooshStore.update_documents
def update_documents(
doc_dicts:dict, # dictionary with keys 'page_content', 'source', 'id', etc.
kwargs:VAR_KEYWORD
):
Update a set of documents (documents in the index with the same ID will be overwritten)
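A hedged sketch: despite the dict annotation, the plural name suggests a collection of dicts is accepted, so verify against your installed version. The ID and values are placeholders:

```python
store.update_documents([
    {"id": "abc123",  # an existing record with this ID is overwritten
     "page_content": "Updated text.",
     "source": "/docs/whoosh.txt"},
])
```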
WhooshStore.get_all_docs
def get_all_docs(
):
Returns a generator to iterate through all indexed documents
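For example, assuming each yielded record is a dict of stored fields:

```python
for record in store.get_all_docs():
    print(record["id"], record.get("source"))
```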
WhooshStore.get_doc
def get_doc(
id:str
):
Get an indexed record by ID
WhooshStore.get_size
def get_size(
include_deleted:bool=False
)->int:
Gets the size of the index
If include_deleted is True, the count will include deleted documents (prior to index optimization).
WhooshStore.erase
def erase(
confirm:bool=True
):
Clears index
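A small maintenance sketch tying the methods above together (that confirm=False skips an interactive confirmation prompt is an assumption from the parameter name):

```python
if store.exists():
    print("live docs:", store.get_size())
    print("with deleted:", store.get_size(include_deleted=True))
store.erase(confirm=False)  # clear the index without asking for confirmation
```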
VectorStore.query
def query(
query:str, kwargs:VAR_KEYWORD
):
Generic query method that invokes the store’s search method. This provides a consistent interface across all store types.
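A sketch, assuming the underlying search method returns the {'hits': ..., 'total_hits': ...} shape described under semantic_search:

```python
results = store.query("passport renewal", limit=10)
print(results["total_hits"])
for hit in results["hits"]:  # each hit is a dict of stored fields
    print(hit.get("source"))
```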
VectorStore.check
def check(
):
Raise exception if VectorStore.exists() returns False
VectorStore.ingest
def ingest(
source_directory:str, # path to folder containing document store
chunk_size:int=1000, # text is split to this many characters by [langchain.text_splitter.RecursiveCharacterTextSplitter](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html)
chunk_overlap:int=100, # character overlap between chunks in `langchain.text_splitter.RecursiveCharacterTextSplitter`
ignore_fn:Optional=None, # Optional function that accepts the file path (including file name) as input and returns `True` if file path should not be ingested.
batch_size:int=41000, # batch size used when processing documents
kwargs:VAR_KEYWORD
)->None:
Ingests all documents in source_directory (previously ingested documents are ignored). When retrieved, each Document object will have a metadata dict with the absolute path to the file in metadata["source"]. Extra kwargs are fed to ingest.load_single_document.
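For example, ingesting a folder while skipping hidden files (ignore_fn can be any callable that returns True for paths that should be skipped; the folder path is a placeholder):

```python
import os

store.ingest(
    "/path/to/documents",
    chunk_size=1000,
    chunk_overlap=100,
    ignore_fn=lambda p: os.path.basename(p).startswith("."),
)
```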