ingest.stores.sparse

Full-text search engine as a backend vector store


SparseStore

 SparseStore (**kwargs)

A factory for built-in SparseStore instances.



SparseStore.create

 SparseStore.create (persist_directory=None, kind=None, **kwargs)

*Factory method to construct a SparseStore instance. Extra kwargs passed to object instantiation.

Args:

  • persist_directory: where the index is stored
  • kind: one of {whoosh}

Returns: SparseStore instance*
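The dispatch described above can be illustrated with a minimal, dependency-free sketch. `ToyStore`, `BACKENDS`, and the fallback to `"whoosh"` when `kind` is None are assumptions made for illustration; the source does not show how the library implements `create` or what its default is.

```python
from typing import Dict, Optional, Type


class ToyStore:
    """Hypothetical stand-in for a concrete backend such as WhooshStore."""

    def __init__(self, persist_directory: Optional[str] = None, **kwargs):
        self.persist_directory = persist_directory
        self.options = kwargs  # extra kwargs end up on the instance


# Hypothetical registry mapping a `kind` string to a backend class.
BACKENDS: Dict[str, Type[ToyStore]] = {"whoosh": ToyStore}


def create(persist_directory: Optional[str] = None,
           kind: Optional[str] = None, **kwargs) -> ToyStore:
    """Pick a backend by name and forward the remaining kwargs to it."""
    kind = kind or "whoosh"  # assumed default; the source does not state one
    if kind not in BACKENDS:
        raise ValueError(f"unknown kind {kind!r}; expected one of {sorted(BACKENDS)}")
    return BACKENDS[kind](persist_directory=persist_directory, **kwargs)


store = create(persist_directory="/tmp/index", index_name="myindex")
```

Keeping the `kind` lookup in one place means a new backend only needs a registry entry; callers keep using the same factory call.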



WhooshStore

 WhooshStore (persist_directory:Optional[str]=None,
              index_name:str='myindex', **kwargs)

A sparse vector store based on the Whoosh full-text search engine.



default_schema

 default_schema ()


WhooshStore.exists

 WhooshStore.exists ()

Returns True if documents have been added to the search index.



WhooshStore.add_documents

 WhooshStore.add_documents
                            (docs:Sequence[langchain_core.documents.base.Document],
                            limitmb:int=1024, optimize:bool=False,
                            verbose:bool=True, **kwargs)

Indexes documents. Extra kwargs supplied to TextStore.ix.writer.

| | Type | Default | Details |
|---|---|---|---|
| docs | Sequence | | list of LangChain Documents |
| limitmb | int | 1024 | maximum memory in megabytes to use |
| optimize | bool | False | whether or not to also optimize index |
| verbose | bool | True | Set to False to disable progress bar |
| kwargs | VAR_KEYWORD | | |


WhooshStore.remove_document

 WhooshStore.remove_document (value:str, field:str='id',
                              optimize:bool=False)

Removes the document whose field matches value. The default field is the id field. If optimize is True, the index is also optimized.



WhooshStore.remove_source

 WhooshStore.remove_source (source:str, optimize:bool=False)

Removes all documents associated with source. The source argument can be either the full path to a document or a parent folder. In the latter case, ALL documents in the parent folder are removed. If optimize is True, the index is also optimized.
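The file-versus-folder behavior can be pictured with a small path check. This is only one plausible reading: the helper `matches_source` and the sample paths are hypothetical, and the source does not show how the library actually matches stored `source` values.

```python
from pathlib import PurePosixPath


def matches_source(doc_source: str, source: str) -> bool:
    """True if doc_source is `source` itself or lies inside the folder `source`."""
    doc_path = PurePosixPath(doc_source)
    target = PurePosixPath(source)
    return doc_path == target or target in doc_path.parents


# Hypothetical `source` values stored with indexed documents.
indexed = ["/data/reports/q1.pdf", "/data/reports/q2.pdf", "/data/notes/todo.txt"]

# Passing a full file path selects only that document.
by_file = [s for s in indexed if matches_source(s, "/data/reports/q1.pdf")]

# Passing a parent folder selects everything beneath it.
by_folder = [s for s in indexed if matches_source(s, "/data/reports")]
```

Comparing path components rather than raw string prefixes keeps a folder such as `/data/rep` from accidentally matching `/data/reports/...`.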



WhooshStore.update_documents

 WhooshStore.update_documents (doc_dicts:dict, **kwargs)

Updates a set of documents. A document already in the index with the same ID is overwritten.

| | Type | Details |
|---|---|---|
| doc_dicts | dict | dictionary with keys 'page_content', 'source', 'id', etc. |
| kwargs | VAR_KEYWORD | |
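The overwrite-by-ID semantics can be sketched with a plain dictionary standing in for the index. The `upsert` helper is hypothetical, and passing an iterable of dictionaries is an assumption: the signature annotates `doc_dicts` as `dict`, while the description speaks of a set of documents.

```python
from typing import Dict, Iterable

# Toy in-memory "index" keyed by document ID (stand-in for the real index).
index: Dict[str, dict] = {
    "a1": {"id": "a1", "page_content": "old text", "source": "/docs/a.txt"},
}


def upsert(index: Dict[str, dict], doc_dicts: Iterable[dict]) -> None:
    """Insert each document, replacing any existing entry with the same ID."""
    for doc in doc_dicts:
        index[doc["id"]] = doc


upsert(index, [
    {"id": "a1", "page_content": "new text", "source": "/docs/a.txt"},   # overwrites
    {"id": "b2", "page_content": "another doc", "source": "/docs/b.txt"},  # new entry
])
```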


WhooshStore.get_all_docs

 WhooshStore.get_all_docs ()

Returns a generator to iterate through all indexed documents



WhooshStore.get_doc

 WhooshStore.get_doc (id:str)

Get an indexed record by ID



WhooshStore.get_size

 WhooshStore.get_size (include_deleted:bool=False)

*Gets the size of the index.

If include_deleted is True, the size includes deleted documents (prior to optimization).*
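To see why the two sizes can differ, the toy class below models removal as a soft delete that optimization later purges. `ToyIndex` is a hypothetical illustration of that idea, not the library's or Whoosh's implementation.

```python
class ToyIndex:
    """Toy index where removed documents linger until optimize() is called."""

    def __init__(self):
        self._docs = {}        # doc_id -> text, including soft-deleted entries
        self._deleted = set()  # IDs marked as deleted but not yet purged

    def add(self, doc_id, text):
        self._docs[doc_id] = text
        self._deleted.discard(doc_id)

    def remove(self, doc_id):
        if doc_id in self._docs:
            self._deleted.add(doc_id)  # soft delete only

    def optimize(self):
        for doc_id in self._deleted:
            del self._docs[doc_id]     # physically drop deleted entries
        self._deleted.clear()

    def get_size(self, include_deleted=False):
        total = len(self._docs)
        return total if include_deleted else total - len(self._deleted)


idx = ToyIndex()
for doc_id in ("a", "b", "c"):
    idx.add(doc_id, "text")
idx.remove("b")

live = idx.get_size()                           # excludes the soft-deleted doc
with_deleted = idx.get_size(include_deleted=True)
idx.optimize()
after_optimize = idx.get_size(include_deleted=True)
```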



WhooshStore.erase

 WhooshStore.erase (confirm=True)

Clears index



WhooshStore.query

 WhooshStore.query (query:str, fields:Sequence=['page_content'],
                    highlight:bool=True, limit:int=10, page:int=1,
                    return_dict:bool=True,
                    filters:Optional[Dict[str,str]]=None,
                    where_document:Optional[str]=None,
                    return_generator=False, **kwargs)

*Queries the index

Args

  • query: the query string
  • fields: a list of fields to search
  • highlight: If True, highlight hits
  • limit: results per page
  • page: page of hits to return
  • return_dict: If True, return list of dictionaries instead of LangChain Document objects
  • filters: filter results by field values (e.g., {'extension':'pdf'})
  • where_document: optional query to further filter results
  • return_generator: If True, returns a generator, not a list*
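How `limit` and `page` interact can be sketched with simple slice arithmetic. The `paginate` helper is hypothetical, and it assumes pages are numbered from 1 (suggested by the default `page=1`) and that results use the `{'hits': ..., 'total_hits': ...}` shape mentioned for the query method elsewhere in this reference.

```python
from typing import Dict, List


def paginate(hits: List[dict], limit: int = 10, page: int = 1) -> Dict[str, object]:
    """Return one page of hits plus the total number of matches."""
    if limit < 1 or page < 1:
        raise ValueError("limit and page must be >= 1")
    start = (page - 1) * limit  # pages assumed to be 1-indexed
    return {"hits": hits[start:start + limit], "total_hits": len(hits)}


# 25 fake matches, 10 per page: pages 1 and 2 are full, page 3 holds the rest.
all_hits = [{"id": str(i)} for i in range(25)]
first = paginate(all_hits, limit=10, page=1)
last = paginate(all_hits, limit=10, page=3)
```

Because `total_hits` reports all matches rather than the page length, a caller can compute how many pages remain.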


SparseStore.semantic_search

 SparseStore.semantic_search (*args, **kwargs)

Any subclass of SparseStore can inherit this method for on-the-fly semantic searches. It retrieves results based on semantic similarity to the supplied query. All arguments are forwarded to the store's query method, which is expected to take "query" as its first positional argument and to accept a "limit" keyword argument. The query method is expected to return results in the form {'hits': list_of_dicts, 'total_hits': int}. This method returns a list of LangChain Document objects sorted by semantic similarity.
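The contract described above (call the store's query method, then re-order the hits by similarity to the query) can be sketched without any dependencies. Everything here is illustrative: `toy_embed` is a bag-of-words stand-in for a real embedding model, `fake_query` is a hypothetical query method, and the sketch returns plain dictionaries rather than LangChain Document objects.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def toy_embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real store would use an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def semantic_rerank(query_fn: Callable[..., Dict], query: str, limit: int = 10) -> List[dict]:
    """Forward the query to query_fn, then sort its hits by similarity."""
    results = query_fn(query, limit=limit)  # expects {'hits': [...], 'total_hits': int}
    q_vec = toy_embed(query)
    scored = [(cosine(q_vec, toy_embed(h["page_content"])), h) for h in results["hits"]]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [h for _, h in scored]


def fake_query(query: str, limit: int = 10) -> Dict:
    """Hypothetical query method: 'query' first, 'limit' as a keyword."""
    hits = [
        {"page_content": "stock prices fell sharply"},
        {"page_content": "cats and dogs are pets"},
        {"page_content": "dogs chase cats"},
    ]
    return {"hits": hits[:limit], "total_hits": len(hits)}


ranked = semantic_rerank(fake_query, "cats dogs")
```

The keyword search supplies the candidate set and the similarity score decides the final order, which is why the query method must expose both the query and the limit.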



VectorStore.check

 VectorStore.check ()

Raise exception if VectorStore.exists() returns False



VectorStore.ingest

 VectorStore.ingest (source_directory:str, chunk_size:int=500,
                     chunk_overlap:int=50,
                     ignore_fn:Optional[Callable]=None,
                     batch_size:int=41000, **kwargs)

Ingests all documents in source_directory (previously-ingested documents are ignored). When retrieved, the Document objects will each have a metadata dict with the absolute path to the file in metadata["source"]. Extra kwargs fed to ingest.load_single_document.

| | Type | Default | Details |
|---|---|---|---|
| source_directory | str | | path to folder containing document store |
| chunk_size | int | 500 | text is split to this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter |
| chunk_overlap | int | 50 | character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter |
| ignore_fn | Optional | None | Optional function that accepts the file path (including file name) as input and returns True if file path should not be ingested. |
| batch_size | int | 41000 | batch size used when processing documents |
| kwargs | VAR_KEYWORD | | |
| **Returns** | **None** | | |
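As an illustration of the ignore_fn contract (accept a file path, return True to skip the file), here is a minimal sketch. The specific rules (hidden files, `.tmp` and `.bak` extensions) and the sample paths are assumptions chosen for the example.

```python
import os


def ignore_fn(file_path: str) -> bool:
    """Return True for files that should NOT be ingested."""
    name = os.path.basename(file_path)
    if name.startswith("."):                      # hidden files such as .DS_Store
        return True
    if name.lower().endswith((".tmp", ".bak")):   # scratch and backup files
        return True
    return False


# Hypothetical file paths, used only to show which ones would be kept.
paths = ["/data/report.pdf", "/data/.DS_Store", "/data/draft.BAK"]
kept = [p for p in paths if not ignore_fn(p)]
```

Per the signature above, such a function would be supplied as `ignore_fn=ignore_fn` when calling `ingest`.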