ingest.stores.sparse
SparseStore
SparseStore (**kwargs)
A factory for built-in SparseStore instances.
SparseStore.create
SparseStore.create (persist_directory=None, kind=None, **kwargs)
Factory method to construct a SparseStore instance. Extra kwargs are passed to object instantiation.
Args
- persist_directory: where the index is stored
- kind: one of {whoosh}

Returns: SparseStore instance
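The dispatch on `kind` can be pictured with a small, self-contained sketch. Everything here is hypothetical: `ToySparseStore`, `ToyWhooshStore`, and `_KINDS` are illustrative names, and falling back to "whoosh" when `kind` is None is an assumption rather than documented behavior.

```python
from typing import Optional


class ToySparseStore:
    """Hypothetical stand-in for a store base class (not the library's code)."""

    def __init__(self, persist_directory: Optional[str] = None, **kwargs):
        self.persist_directory = persist_directory
        self.options = kwargs

    @classmethod
    def create(cls, persist_directory=None, kind=None, **kwargs):
        # Assumption: an unspecified kind falls back to the only built-in kind.
        kind = kind or "whoosh"
        try:
            store_cls = _KINDS[kind]
        except KeyError:
            raise ValueError(f"unknown kind: {kind!r}") from None
        # Extra kwargs are handed to the concrete class unchanged.
        return store_cls(persist_directory=persist_directory, **kwargs)


class ToyWhooshStore(ToySparseStore):
    """Hypothetical concrete store selected by kind='whoosh'."""

    def __init__(self, persist_directory=None, index_name="myindex", **kwargs):
        super().__init__(persist_directory=persist_directory, **kwargs)
        self.index_name = index_name


# Registry mapping each kind string to the class that implements it.
_KINDS = {"whoosh": ToyWhooshStore}

store = ToySparseStore.create(persist_directory="/tmp/idx", kind="whoosh", index_name="demo")
```

In this sketch, `index_name="demo"` reaches the concrete class through `**kwargs`, which mirrors the statement that extra kwargs are passed to object instantiation.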
SparseStore.semantic_search
SparseStore.semantic_search (*args, **kwargs)
Any subclass of SparseStore can inherit this method for on-the-fly semantic searches. It retrieves results based on semantic similarity to the supplied query. All arguments are forwarded to the store's query method, which is expected to take "query" as its first positional argument and to accept a "limit" keyword argument. The query method is expected to return results in the form {'hits': list_of_dicts, 'total_hits': int}. This method returns a list of LangChain Document objects sorted by semantic similarity.
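The contract between `semantic_search` and `query` can be sketched without any dependencies. This is an illustration under stated assumptions, not the library's implementation: `ToyStore` and `toy_similarity` are hypothetical, word-count cosine similarity stands in for a real embedding model, and the result is a list of plain dicts rather than LangChain Document objects.

```python
import math
from collections import Counter


def toy_similarity(a: str, b: str) -> float:
    """Cosine similarity over word counts (stand-in for an embedding model)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyStore:
    """Hypothetical store whose query method follows the expected contract."""

    def __init__(self, records):
        self.records = records

    def query(self, query, limit=10, **kwargs):
        # Keyword retrieval: keep records sharing at least one word with the query.
        words = set(query.lower().split())
        hits = [r for r in self.records
                if words & set(r["page_content"].lower().split())]
        return {"hits": hits[:limit], "total_hits": len(hits)}

    def semantic_search(self, *args, **kwargs):
        # Forward everything to query(), then re-rank the hits by similarity.
        query = args[0]
        results = self.query(*args, **kwargs)
        return sorted(results["hits"],
                      key=lambda h: toy_similarity(query, h["page_content"]),
                      reverse=True)


store = ToyStore([
    {"page_content": "cats and dogs and birds and fish"},
    {"page_content": "cats purr"},
    {"page_content": "stock markets fell"},
])
ranked = store.semantic_search("cats purr", limit=5)
```

The keyword step decides which records are candidates; the similarity step only reorders them, so a record that the keyword query misses never appears in the ranked list.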
WhooshStore
WhooshStore (persist_directory:Optional[str]=None, index_name:str='myindex', **kwargs)
A sparse vector store based on the Whoosh full-text search engine.
default_schema
default_schema ()
WhooshStore.exists
WhooshStore.exists ()
Returns True if documents have been added to search index
WhooshStore.add_documents
WhooshStore.add_documents (docs:Sequence[langchain_core.documents.base.Document], limitmb:int=1024, optimize:bool=False, verbose:bool=True, **kwargs)
Indexes documents. Extra kwargs are supplied to TextStore.ix.writer.
| | Type | Default | Details |
|---|---|---|---|
| docs | Sequence | | list of LangChain Documents |
| limitmb | int | 1024 | maximum memory in megabytes to use |
| optimize | bool | False | whether or not to also optimize the index |
| verbose | bool | True | Set to False to disable progress bar |
| kwargs | VAR_KEYWORD | | |
WhooshStore.remove_document
WhooshStore.remove_document (value:str, field:str='id', optimize:bool=False)
Remove document with corresponding value and field. Default field is the id field. If optimize is True, index will be optimized.
WhooshStore.remove_source
WhooshStore.remove_source (source:str, optimize:bool=False)
Removes all documents associated with source. The source argument can be either the full path to a document or a parent folder. In the latter case, ALL documents in the parent folder will be removed. If optimize is True, the index will be optimized.
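The difference between passing a file path and passing a parent folder can be illustrated with a small matching helper. This is only a sketch of one plausible matching rule; `matches_source` is a hypothetical function and the library may match sources differently.

```python
import os


def matches_source(doc_source: str, source: str) -> bool:
    """Return True if a document's source equals `source` or lies beneath it."""
    doc_source = os.path.normpath(doc_source)
    source = os.path.normpath(source)
    # Appending the separator prevents "/data/reports" from matching
    # a sibling folder such as "/data/reports_old".
    return doc_source == source or doc_source.startswith(source + os.sep)


indexed = [
    "/data/reports/q1.pdf",
    "/data/reports/2023/q4.pdf",
    "/data/reports_old/q1.pdf",
]
# Documents that would be removed when the parent folder is given as source.
removed = [p for p in indexed if matches_source(p, "/data/reports")]
```

Under this rule, passing a folder removes every document below it, including those in subfolders, which is why the folder form should be used with care.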
WhooshStore.update_documents
WhooshStore.update_documents (doc_dicts:dict, **kwargs)
Update a set of documents (doc in index with same ID will be over-written)
| | Type | Details |
|---|---|---|
| doc_dicts | dict | dictionary with keys 'page_content', 'source', 'id', etc. |
| kwargs | VAR_KEYWORD | |
WhooshStore.get_all_docs
WhooshStore.get_all_docs ()
Returns a generator to iterate through all indexed documents
WhooshStore.get_doc
WhooshStore.get_doc (id:str)
Get an indexed record by ID
WhooshStore.get_size
WhooshStore.get_size (include_deleted:bool=False)
Gets the size of the index. If include_deleted is True, the count will include deleted documents (prior to optimization).
WhooshStore.erase
WhooshStore.erase (confirm=True)
Clears index
WhooshStore.query
WhooshStore.query (query:str, fields:Sequence=['page_content'], highlight:bool=True, limit:int=10, page:int=1, return_dict:bool=True, filters:Optional[Dict[str,str]]=None, where_document:Optional[str]=None, return_generator=False, **kwargs)
Queries the index
Args
- query: the query string
- fields: a list of fields to search
- highlight: If True, highlight hits
- limit: results per page
- page: page of hits to return
- return_dict: If True, return list of dictionaries instead of LangChain Document objects
- filters: filter results by field values (e.g., {'extension': 'pdf'})
- where_document: optional query to further filter results
- return_generator: If True, returns a generator, not a list
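How `filters`, `limit`, and `page` interact can be sketched in plain Python. The helpers below are hypothetical and make two assumptions: filters are exact matches on field values, and pages are numbered from 1 (suggested by the default `page=1`).

```python
def apply_filters(records, filters=None):
    """Keep records whose fields equal every value in `filters` (assumed exact match)."""
    if not filters:
        return list(records)
    return [r for r in records
            if all(r.get(field) == value for field, value in filters.items())]


def paginate(hits, limit=10, page=1):
    """Return the slice of hits for a 1-based page with `limit` results per page."""
    start = (page - 1) * limit
    return hits[start:start + limit]


# 25 toy records: even ids are PDFs, odd ids are text files.
records = [{"id": str(i), "extension": "pdf" if i % 2 == 0 else "txt"}
           for i in range(25)]
pdfs = apply_filters(records, {"extension": "pdf"})
second_page = paginate(pdfs, limit=10, page=2)
```

With 13 matching records and 10 results per page, the second page holds the remaining 3, and any later page is empty.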
VectorStore.check
VectorStore.check ()
Raises an exception if VectorStore.exists() returns False.
VectorStore.ingest
VectorStore.ingest (source_directory:str, chunk_size:int=500, chunk_overlap:int=50, ignore_fn:Optional[Callable]=None, batch_size:int=41000, **kwargs)
Ingests all documents in source_directory (previously ingested documents are ignored). When retrieved, the Document objects will each have a metadata dict with the absolute path to the file in metadata["source"]. Extra kwargs are fed to ingest.load_single_document.
| | Type | Default | Details |
|---|---|---|---|
| source_directory | str | | path to folder containing document store |
| chunk_size | int | 500 | text is split to this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter |
| chunk_overlap | int | 50 | character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter |
| ignore_fn | Optional | None | Optional function that accepts the file path (including file name) as input and returns True if the file path should not be ingested. |
| batch_size | int | 41000 | batch size used when processing documents |
| kwargs | VAR_KEYWORD | | |
| Returns | None | | |
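An `ignore_fn` is any callable that takes a file path and returns True when the file should be skipped. A minimal sketch follows; the function name and the particular rules (hidden files, `.tmp` and `.bak` suffixes) are illustrative choices, not library defaults.

```python
import os


def ignore_hidden_and_temp(path: str) -> bool:
    """Return True for files that should NOT be ingested."""
    name = os.path.basename(path)
    # Skip hidden files and common temporary or backup suffixes.
    return name.startswith(".") or name.endswith((".tmp", ".bak"))


candidates = [
    "/docs/report.pdf",
    "/docs/.DS_Store",
    "/docs/draft.tmp",
    "/docs/notes.txt",
]
# Files that would remain after applying the ignore function.
kept = [p for p in candidates if not ignore_hidden_and_temp(p)]
```

A function like this would be passed as the `ignore_fn` argument of `ingest`; because it receives the full path, it can also filter on directory names rather than only on file names.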