source
MyElmLoader
MyElmLoader (file_path:Union[str,pathlib.Path], mode:str='single',
**unstructured_kwargs:Any)
Wrapper that falls back to text/plain when the default loader does not work.
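A rough sketch of the fallback idea, assuming the parent class is langchain's UnstructuredEmailLoader (this is not necessarily the library's exact code):

from langchain.document_loaders import UnstructuredEmailLoader

class MyElmLoader(UnstructuredEmailLoader):
    """Fall back to the text/plain part when the default load fails."""

    def load(self):
        try:
            return super().load()
        except ValueError as e:
            if 'text/html content not found in email' in str(e):
                # assumption: retry using the plain-text content of the message
                self.unstructured_kwargs['content_source'] = 'text/plain'
                return super().load()
            raise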
source
batchify_chunks
batchify_chunks (texts)
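No docstring is given; the name suggests it splits a list of text chunks into fixed-size batches (e.g., before embedding). A generic sketch of that pattern, where the batch size is purely an assumption:

def batchify_chunks(texts, batch_size=1000):  # batch_size is an assumption
    # yield successive fixed-size slices of the input list
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]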
source
does_vectorstore_exist
does_vectorstore_exist (db)
Checks if vectorstore exists
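A minimal sketch of such a check, assuming db is a langchain Chroma instance (the library's exact criterion may differ):

def does_vectorstore_exist(db) -> bool:
    # treat the store as existing if it already holds any documents
    return len(db.get()['documents']) > 0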
source
process_documents
process_documents (source_directory:str, ignored_files:List[str]=[],
chunk_size:int=500, chunk_overlap:int=50,
ignore_fn:Optional[Callable]=None)
Load documents and split them into chunks.

Args:
  source_directory: path to folder containing the document store
  chunk_size: text is split into chunks of this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter
  chunk_overlap: character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter
  ignore_fn: optional function that accepts the file path (including file name) as input and returns True if the file path should not be ingested

Returns: list of langchain.docstore.document.Document objects
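A minimal usage sketch (the onprem.ingest import path is an assumption):

from onprem.ingest import process_documents

chunks = process_documents('sample_data', chunk_size=500, chunk_overlap=50)
print(f'{len(chunks)} chunks produced')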
source
load_documents
load_documents (source_dir:str, ignored_files:List[str]=[],
ignore_fn:Optional[Callable]=None)
Loads all documents from the source directory, ignoring the specified files.
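For instance, to skip PDFs during loading (ignore_fn receives the full file path, per the docs above):

docs = load_documents('sample_data', ignore_fn=lambda fp: fp.lower().endswith('.pdf'))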
source
load_single_document
load_single_document (file_path:str)
Load a single document (invoked by load_documents).
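For example (the file path here is hypothetical):

docs = load_single_document('sample_data/example.pdf')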
source
Ingester
Ingester (embedding_model_name:str='sentence-transformers/all-MiniLM-L6-v2',
          embedding_model_kwargs:dict={'device': 'cpu'},
          embedding_encode_kwargs:dict={'normalize_embeddings': False},
          persist_directory:Optional[str]=None)
Ingests all documents in source_folder (previously-ingested documents are ignored).

Args:
  embedding_model_name: name of sentence-transformers model
  embedding_model_kwargs: arguments to the embedding model (e.g., {'device': 'cpu'})
  embedding_encode_kwargs: arguments to the encode method of the embedding model (e.g., {'normalize_embeddings': False})
  persist_directory: path to vector database (created if it doesn't exist). Default is onprem_data/vectordb in the user's home directory.

Returns: None
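For example, constructing an Ingester with the defaults written out explicitly (the onprem.ingest import path is an assumption):

from onprem.ingest import Ingester

ingester = Ingester(
    embedding_model_name='sentence-transformers/all-MiniLM-L6-v2',
    embedding_model_kwargs={'device': 'cpu'},
    embedding_encode_kwargs={'normalize_embeddings': False},
    persist_directory=None,  # default: onprem_data/vectordb in the home directory
)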
source
Ingester.get_embedding_model
Ingester.get_embedding_model ()
Returns the langchain.embeddings.huggingface.HuggingFaceEmbeddings instance.
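The returned object supports the standard langchain embeddings API, e.g.:

emb = ingester.get_embedding_model()
vector = emb.embed_query('example query')  # returns a list of floats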
source
Ingester.get_db
Ingester.get_db ()
Returns the langchain.vectorstores.Chroma instance.
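The returned Chroma instance can be queried directly, e.g.:

db = ingester.get_db()
results = db.similarity_search('What is covered in these documents?', k=4)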
source
Ingester.get_ingested_files
Ingester.get_ingested_files ()
Returns a list of files previously added to the vector database (typically via LLM.ingest).
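For example:

for fp in ingester.get_ingested_files():
    print(fp)  # paths of files already in the vector database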
source
Ingester.ingest
Ingester.ingest (source_directory:str, chunk_size:int=500,
chunk_overlap:int=50, ignore_fn:Optional[Callable]=None)
Ingests all documents in source_directory (previously-ingested documents are ignored). When retrieved, the Document objects will each have a metadata dict with the absolute path to the file in metadata["source"].
source_directory (str): path to folder containing document store
chunk_size (int, default 500): text is split to this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter
chunk_overlap (int, default 50): character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter
ignore_fn (Optional[Callable], default None): optional function that accepts the file path (including file name) as input and returns True if the file path should not be ingested
Returns: None
ingester = Ingester()
ingester.ingest("sample_data")
2023-09-12 11:35:20.660565: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Creating new vectorstore at /home/amaiya/onprem_data/vectordb
Loading documents from sample_data
Loading new documents: 100%|██████████████████████| 2/2 [00:00<00:00, 16.76it/s]
Loaded 11 new documents from sample_data
Split into 62 chunks of text (max. 500 chars each)
Creating embeddings. May take some minutes...
Ingestion complete! You can now query your documents using the LLM.ask method
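To verify the metadata["source"] behavior described above, the stored chunks can be inspected through the Chroma instance:

db = ingester.get_db()
for doc in db.similarity_search('sample', k=2):
    print(doc.metadata['source'])  # absolute path to the originating file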