ingest.stores.sparse
SparseStore
SparseStore (persist_directory:Optional[str]=None, index_name:str='myindex', **kwargs)
Helper class that provides a standard way to create an ABC using inheritance.
default_schema
default_schema ()
SparseStore.get_db
SparseStore.get_db ()
Get raw index
SparseStore.exists
SparseStore.exists ()
Returns True if documents have been added to search index
SparseStore.add_documents
SparseStore.add_documents (docs:Sequence[langchain_core.documents.base.Document], limitmb:int=1024, verbose:bool=True, **kwargs)
Indexes documents. Extra kwargs supplied to TextStore.ix.writer.
Parameter | Type | Default | Details
---|---|---|---
docs | Sequence | | list of LangChain Documents
limitmb | int | 1024 | maximum memory in megabytes to use
verbose | bool | True | Set to False to disable progress bar
kwargs | VAR_KEYWORD | |
SparseStore.remove_document
SparseStore.remove_document (value:str, field:str='id')
Remove document with corresponding value and field. Default field is the id field.
SparseStore.update_documents
SparseStore.update_documents (doc_dicts:dict, **kwargs)
Update a set of documents (a document in the index with the same ID will be overwritten)
Parameter | Type | Details
---|---|---
doc_dicts | dict | dictionary with keys 'page_content', 'source', 'id', etc.
kwargs | VAR_KEYWORD |
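As a minimal sketch of the dictionary shape described above, the helper below builds one document dictionary with the 'page_content', 'source' and 'id' keys. The helper name `make_doc_dict` and the extra-field handling are hypothetical and not part of the library. The signature annotates doc_dicts as a dict while the description speaks of a set of documents, so whether a single dictionary or a collection is expected should be checked before relying on the commented call.

```python
import uuid

def make_doc_dict(page_content, source, doc_id=None, **extra_fields):
    """Build one document dictionary with 'page_content', 'source' and 'id' keys."""
    doc = {
        "page_content": page_content,
        "source": source,
        # Reusing the ID of an indexed document is what triggers the overwrite.
        "id": doc_id if doc_id is not None else uuid.uuid4().hex,
    }
    doc.update(extra_fields)  # any further field names are hypothetical
    return doc

updated = make_doc_dict("Revised text of the report.", "/tmp/report.txt", doc_id="doc-001")
# store.update_documents(updated)  # `store` is assumed to be an existing SparseStore
```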
SparseStore.get_all_docs
SparseStore.get_all_docs ()
Returns a generator to iterate through all indexed documents
SparseStore.get_doc
SparseStore.get_doc (id:str)
Get an indexed record by ID
SparseStore.get_size
SparseStore.get_size ()
Gets size of index
SparseStore.erase
SparseStore.erase (confirm=True)
Clears index
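The methods documented so far can be combined into a simple rebuild routine. The sketch below is duck-typed: it accepts any object exposing exists, erase, add_documents and get_size with the signatures listed above. The function name `rebuild_index` is hypothetical, and it is an assumption that passing confirm=False skips a confirmation step.

```python
def rebuild_index(store, docs):
    """Clear the index if it already holds documents, re-add docs, return the new size."""
    if store.exists():
        # Assumption: confirm=False clears the index without asking for confirmation.
        store.erase(confirm=False)
    store.add_documents(docs, verbose=False)  # verbose=False disables the progress bar
    return store.get_size()
```

With a real SparseStore, `docs` would be a sequence of LangChain Document objects.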
SparseStore.query
SparseStore.query (q:str, fields:Sequence=['page_content'], highlight:bool=True, limit:int=10, page:int=1, return_dict:bool=False, filters:Optional[Dict[str,str]]=None, where_document:Optional[str]=None)
Queries the index
Args
- q: the query string
- fields: a list of fields to search
- highlight: If True, highlight hits
- limit: results per page
- page: page of hits to return
- return_dict: If True, return list of dictionaries instead of LangChain Document objects
- filters: filter results by field values (e.g., {'extension':'pdf'})
- where_document: optional query to further filter results
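The where_document argument takes a query string. The sketch below assembles one in the form `"climate" AND extension:pdf`, which is the example this reference gives for the same parameter of semantic_search. The helper name `build_where_document` is hypothetical, and the quoting rules are an assumption: the exact query syntax accepted by the index is not specified here, and values containing spaces or special characters are not escaped.

```python
def build_where_document(terms=(), field_values=None, operator="AND"):
    """Join quoted terms and field:value constraints into one boolean query string."""
    parts = [f'"{term}"' for term in terms]
    parts += [f"{field}:{value}" for field, value in (field_values or {}).items()]
    return f" {operator} ".join(parts)

where = build_where_document(terms=["climate"], field_values={"extension": "pdf"})
print(where)  # "climate" AND extension:pdf

# Possible use with an existing SparseStore instance named `store`:
# store.query("sea level", limit=10, page=1, where_document=where)
```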
SparseStore.semantic_search
SparseStore.semantic_search (query, k:int=4, n_candidates=50, filters:Optional[Dict[str,str]]=None, where_document:Optional[str]=None, **kwargs)
Retrieves results based on semantic similarity to supplied query.
Parameter | Type | Default | Details
---|---|---|---
query | | |
k | int | 4 | number of records to return based on highest semantic similarity scores.
n_candidates | int | 50 | Number of records to consider (for which we compute embeddings on-the-fly)
filters | Optional | None | filter sources by field values (e.g., {'table':True})
where_document | Optional | None | a boolean query to filter results further (e.g., "climate" AND extension:pdf)
kwargs | VAR_KEYWORD | |
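The relationship between n_candidates and k can be illustrated with a small re-ranking sketch: a pool of candidate texts is embedded on the fly, scored against the query by cosine similarity, and only the k best are kept. This is not the library's implementation. The bag-of-words `toy_embed` function is a stand-in for a real embedding model, and all function names are hypothetical.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in embedding: a bag-of-words count vector (not a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rerank(query, candidates, k=4, embed=toy_embed):
    """Score every candidate against the query and keep the k most similar."""
    q_vec = embed(query)
    scored = [(cosine(q_vec, embed(text)), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# The candidate pool plays the role of the n_candidates records.
candidates = [
    "climate change and rising sea levels",
    "quarterly sales report",
    "sea levels are rising because of climate change",
]
top = rerank("climate change sea levels", candidates, k=2)
```

A larger candidate pool gives the re-ranker more to choose from, at the cost of computing more embeddings per query.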
VectorStore.check
VectorStore.check ()
Raise exception if VectorStore.exists() returns False
VectorStore.ingest
VectorStore.ingest (source_directory:str, chunk_size:int=500, chunk_overlap:int=50, ignore_fn:Optional[Callable]=None, batch_size:int=41000, **kwargs)
Ingests all documents in source_directory (previously-ingested documents are ignored). When retrieved, the Document objects will each have a metadata dict with the absolute path to the file in metadata["source"]. Extra kwargs fed to ingest.load_single_document.
Parameter | Type | Default | Details
---|---|---|---
source_directory | str | | path to folder containing document store
chunk_size | int | 500 | text is split to this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter
chunk_overlap | int | 50 | character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter
ignore_fn | Optional | None | Optional function that accepts the file path (including file name) as input and returns True if file path should not be ingested.
batch_size | int | 41000 | batch size used when processing documents
kwargs | VAR_KEYWORD | |
Returns | None | |
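The ignore_fn contract described above (file path in, True to skip) can be met with a small predicate. The sketch below skips hidden files and a couple of temporary-file extensions. The function name and the chosen extensions are illustrative assumptions rather than library defaults.

```python
import os

def ignore_hidden_and_temp(file_path):
    """Return True for files that should NOT be ingested."""
    name = os.path.basename(file_path)
    # Skip dotfiles (e.g. .DS_Store) and common temporary or backup files.
    return name.startswith(".") or name.endswith((".tmp", ".bak"))

# Possible use with an existing store instance named `store`:
# store.ingest("/path/to/documents", ignore_fn=ignore_hidden_and_temp)
```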