source
MyElmLoader
MyElmLoader (file_path:str, mode:str='single', **unstructured_kwargs:Any)
Wrapper that falls back to text/plain when the default email loader does not work.
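For illustration, here is a minimal sketch of the fallback pattern such a wrapper implements, assuming langchain's UnstructuredEmailLoader as the base class; the class name and the exact exception text checked are assumptions, not the library's actual code:

from typing import List
from langchain.docstore.document import Document
from langchain.document_loaders import UnstructuredEmailLoader

class FallbackEmailLoader(UnstructuredEmailLoader):  # hypothetical stand-in for MyElmLoader
    """Retries extraction with text/plain when the default text/html extraction fails."""
    def load(self) -> List[Document]:
        try:
            return super().load()
        except ValueError as e:
            # Assumption: a missing text/html part surfaces as a ValueError
            if "text/html content not found in email" in str(e):
                self.unstructured_kwargs["content_source"] = "text/plain"
                return super().load()
            raise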
source
does_vectorstore_exist
does_vectorstore_exist (db)
Checks if vectorstore exists
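A hypothetical usage check, assuming Ingester.get_db (documented below) returns the Chroma instance, or None when nothing has been ingested yet:

db = Ingester().get_db()
if db is not None and does_vectorstore_exist(db):
    print("Existing vectorstore found; ingestion will only add new documents.")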
source
process_documents
process_documents (source_directory:str, ignored_files:List[str]=[], chunk_size:int=500, chunk_overlap:int=50)
Load documents and split them into chunks.
Args:
source_directory: path to folder containing document store
chunk_size: text is split into chunks of this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter
chunk_overlap: character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter
Returns: list of langchain.docstore.document.Document objects
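A hypothetical call using the defaults shown above; "sample_data" is a placeholder folder:

texts = process_documents("sample_data", chunk_size=500, chunk_overlap=50)
print(f"Split into {len(texts)} chunks")
print(texts[0].page_content[:80])  # each element is a langchain Document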
source
load_documents
load_documents (source_dir:str, ignored_files:List[str]=[])
Loads all documents from the source documents directory, ignoring specified files
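A hypothetical call; the ignored path is a placeholder and would typically come from the files already stored in the vectorstore:

docs = load_documents("sample_data", ignored_files=["sample_data/already_ingested.pdf"])
print(f"Loaded {len(docs)} documents")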
source
load_single_document
load_single_document (file_path:str)
Load a single document (invoked by load_documents).
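A hypothetical single-file call; the path is a placeholder, and the assumption is that the result is the loaded langchain Document object(s) for that file:

docs = load_single_document("sample_data/report.pdf")  # hypothetical file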
source
Ingester
Ingester (embedding_model_name:str='sentence-transformers/all-MiniLM-L6-v2', embedding_model_kwargs:dict={'device': 'cpu'}, embedding_encode_kwargs:dict={'normalize_embeddings': False}, persist_directory:Optional[str]=None)
Ingests all documents in source_directory (previously-ingested documents are ignored).
Args:
embedding_model_name: name of sentence-transformers model
embedding_model_kwargs: arguments to embedding model (e.g., {'device': 'cpu'})
embedding_encode_kwargs: arguments to the encode method of the embedding model (e.g., {'normalize_embeddings': False})
persist_directory: path to vector database (created if it doesn’t exist). Default is onprem_data/vectordb in the user’s home directory.
Returns: None
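A hypothetical construction with non-default settings; the persist_directory path is a placeholder:

ingester = Ingester(
    embedding_model_name="sentence-transformers/all-MiniLM-L6-v2",
    embedding_model_kwargs={"device": "cpu"},
    embedding_encode_kwargs={"normalize_embeddings": True},
    persist_directory="/tmp/my_vectordb",  # placeholder; defaults to ~/onprem_data/vectordb
)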
source
Ingester.get_embedding_model
Ingester.get_embedding_model ()
Returns the langchain.embeddings.huggingface.HuggingFaceEmbeddings instance.
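A sketch of using the returned embeddings object directly; embed_query is standard langchain HuggingFaceEmbeddings API, and the query string is a placeholder:

embeddings = Ingester().get_embedding_model()
vector = embeddings.embed_query("example query")  # placeholder text
print(len(vector))  # all-MiniLM-L6-v2 produces 384-dimensional vectors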
source
Ingester.get_db
Ingester.get_db ()
Returns an instance to the langchain.vectorstores.Chroma
instance
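A sketch of querying the store directly, assuming get_db returns None until something has been ingested; similarity_search is standard langchain Chroma API, and the query is a placeholder:

db = Ingester().get_db()
if db is not None:
    hits = db.similarity_search("example query", k=4)  # placeholder query
    for doc in hits:
        print(doc.metadata.get("source"), doc.page_content[:60])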
source
Ingester.ingest
Ingester.ingest (source_directory:str, chunk_size:int=500, chunk_overlap:int=50)
Ingests all documents in source_directory (previously-ingested documents are ignored).
Args:
source_directory: path to folder containing document store
chunk_size: text is split into chunks of this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter
chunk_overlap: character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter
Returns: None
ingester = Ingester()
ingester.ingest("sample_data" )
2023-09-12 11:35:20.660565: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Creating new vectorstore at /home/amaiya/onprem_data/vectordb
Loading documents from sample_data
Loading new documents: 100%|██████████████████████| 2/2 [00:00<00:00, 16.76it/s]
Loaded 11 new documents from sample_data
Split into 62 chunks of text (max. 500 chars each)
Creating embeddings. May take some minutes...
Ingestion complete! You can now query your documents using the LLM.ask method
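As the final message notes, the ingested documents can now be queried. A minimal sketch, assuming LLM.ask returns a dictionary containing the generated answer and the supporting source chunks:

from onprem import LLM
llm = LLM()
result = llm.ask("What is covered in the sample documents?")  # placeholder question
print(result["answer"])                 # assumption: 'answer' holds the generated text
for doc in result["source_documents"]:  # assumption: supporting chunks are returned here
    print(doc.metadata.get("source"))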