ingest

Functionality for document ingestion into a vector database for question-answering.

source

MyElmLoader

 MyElmLoader (file_path:Union[str,pathlib.Path], mode:str='single',
              **unstructured_kwargs:Any)

Wrapper that falls back to text/plain when the default loader does not work
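
A minimal usage sketch (the import path and the .eml filename are illustrative assumptions; like other LangChain loaders, load() returns a list of Document objects):

from onprem.ingest import MyElmLoader  # import path assumed

loader = MyElmLoader("message.eml")    # hypothetical email file
docs = loader.load()                   # falls back to text/plain if the default parse fails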


source

batchify_chunks

 batchify_chunks (texts)
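
Purely as an illustration, a helper of this shape typically splits a long list of text chunks into fixed-size batches so the vector store can add them incrementally; the batch size and return shape below are assumptions, not the library's actual behavior:

from typing import Iterable, List

def batchify(texts: List, batch_size: int = 1000) -> Iterable[List]:
    # Yield successive fixed-size batches; some vector stores cap
    # how many records a single add call may contain.
    for i in range(0, len(texts), batch_size):
        yield texts[i : i + batch_size]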

source

does_vectorstore_exist

 does_vectorstore_exist (db)

Checks if vectorstore exists
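
A usage sketch (assumes db is the Chroma instance returned by Ingester.get_db, documented below):

from onprem.ingest import Ingester, does_vectorstore_exist  # import path assumed

db = Ingester().get_db()
if not does_vectorstore_exist(db):
    print("No documents have been ingested yet.")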


source

process_documents

 process_documents (source_directory:str, ignored_files:List[str]=[],
                    chunk_size:int=500, chunk_overlap:int=50)

Load documents and split them into chunks

Args:

  • source_directory: path to folder containing document store
  • ignored_files: list of file paths to skip (e.g., documents that were previously ingested)
  • chunk_size: text is split to this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter
  • chunk_overlap: character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter

Returns: list of langchain.docstore.document.Document objects
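
A minimal sketch (the folder name reuses the sample_data example at the bottom of this page; page_content and metadata are standard attributes of LangChain Document objects):

docs = process_documents("sample_data", chunk_size=500, chunk_overlap=50)
print(len(docs))                  # number of chunks
print(docs[0].page_content[:80])  # start of the first chunk
print(docs[0].metadata)           # e.g., the source file path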


source

load_documents

 load_documents (source_dir:str, ignored_files:List[str]=[])

Loads all documents from the source documents directory, ignoring specified files
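
A usage sketch (the folder name is illustrative):

docs = load_documents("sample_data", ignored_files=[])
print(f"loaded {len(docs)} documents")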


source

load_single_document

 load_single_document (file_path:str)

Load a single document (invoked by load_documents).
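
A usage sketch (the file path is hypothetical; the return value is assumed to be a list of Document objects, as with load_documents):

docs = load_single_document("sample_data/report.pdf")  # hypothetical file
print(docs[0].metadata["source"])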


source

Ingester

 Ingester (embedding_model_name:str='sentence-transformers/all-MiniLM-L6-v2',
           embedding_model_kwargs:dict={'device': 'cpu'},
           embedding_encode_kwargs:dict={'normalize_embeddings': False},
           persist_directory:Optional[str]=None)

Ingests all documents in a given source directory into a vector database (previously-ingested documents are ignored)

Args:

  • embedding_model_name: name of sentence-transformers model
  • embedding_model_kwargs: arguments to embedding model (e.g., {'device': 'cpu'})
  • embedding_encode_kwargs: arguments to encode method of embedding model (e.g., {'normalize_embeddings': False}).
  • persist_directory: Path to vector database (created if it doesn’t exist). Default is onprem_data/vectordb in user’s home directory.

Returns: None
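
A constructor sketch with non-default settings (the persist_directory path is a hypothetical example):

ingester = Ingester(
    embedding_model_name="sentence-transformers/all-MiniLM-L6-v2",
    embedding_model_kwargs={"device": "cpu"},
    embedding_encode_kwargs={"normalize_embeddings": False},
    persist_directory="/tmp/my_vectordb",  # hypothetical location
)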


source

Ingester.get_embedding_model

 Ingester.get_embedding_model ()

Returns the langchain.embeddings.huggingface.HuggingFaceEmbeddings instance
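
A usage sketch (embed_query is a standard method on LangChain embedding classes):

embeddings = ingester.get_embedding_model()
vector = embeddings.embed_query("What is on-premises machine learning?")
print(len(vector))  # embedding dimensionality (384 for all-MiniLM-L6-v2)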


source

Ingester.get_db

 Ingester.get_db ()

Returns the langchain.vectorstores.Chroma instance
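
A usage sketch (similarity_search is a standard method on langchain.vectorstores.Chroma; the "source" metadata key is set by the document loaders):

db = ingester.get_db()
results = db.similarity_search("your question here", k=4)
for doc in results:
    print(doc.metadata["source"])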


source

Ingester.ingest

 Ingester.ingest (source_directory:str, chunk_size:int=500,
                  chunk_overlap:int=50)

Ingests all documents in source_directory (previously-ingested documents are ignored).

Args:

  • source_directory: path to folder containing document store
  • chunk_size: text is split to this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter
  • chunk_overlap: character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter

Returns: None

ingester = Ingester()
ingester.ingest("sample_data")
Creating new vectorstore at /home/amaiya/onprem_data/vectordb
Loading documents from sample_data
Loading new documents: 100%|██████████████████████| 2/2 [00:00<00:00, 16.76it/s]
Loaded 11 new documents from sample_data
Split into 62 chunks of text (max. 500 chars each)
Creating embeddings. May take some minutes...
Ingestion complete! You can now query your documents using the LLM.ask method
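
As the final log line suggests, the ingested documents can then be queried. A follow-up sketch (constructor defaults are assumptions; see the LLM documentation for details):

from onprem import LLM

llm = LLM()  # assumed defaults: default model and default vector database location
result = llm.ask("What is covered in the sample documents?")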