ingest

Functionality for document ingestion into a vector database for question-answering.

MyElmLoader

 MyElmLoader (file_path:Union[str,pathlib.Path], mode:str='single',
              **unstructured_kwargs:Any)

Wrapper that falls back to text/plain when the default email loader does not work.
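
The fallback logic can be sketched as follows, assuming the class wraps langchain's UnstructuredEmailLoader (consistent with the signature above); the exact error string matched is an assumption:

from typing import List
from langchain.document_loaders import UnstructuredEmailLoader
from langchain.docstore.document import Document

class MyElmLoader(UnstructuredEmailLoader):
    """Fall back to the text/plain part of an email when the default fails."""
    def load(self) -> List[Document]:
        try:
            try:
                docs = UnstructuredEmailLoader.load(self)
            except ValueError as e:
                if 'text/html content not found in email' in str(e):
                    # retry, pulling content from the plain-text part instead
                    self.unstructured_kwargs['content_source'] = 'text/plain'
                    docs = UnstructuredEmailLoader.load(self)
                else:
                    raise
        except Exception as e:
            # prepend the offending file path to any error for easier debugging
            raise type(e)(f'{self.file_path}: {e}') from e
        return docs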


batchify_chunks

 batchify_chunks (texts)
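
No description accompanies this helper; from its name and usage it presumably slices a list of text chunks into batches sized for insertion into the vector store. A minimal sketch, where the batch size and generator form are assumptions:

def batchify_chunks(texts, batch_size=1000):  # batch_size is an assumed default
    """Yield successive batches from a list of text chunks."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]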

does_vectorstore_exist

 does_vectorstore_exist (db)

Checks if vectorstore exists
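
Assuming db is the langchain Chroma instance returned by Ingester.get_db, one way to implement this check is to ask whether the collection already holds any documents:

def does_vectorstore_exist(db) -> bool:
    """Return True if the Chroma collection already contains documents."""
    return bool(db.get()['documents'])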


process_documents

 process_documents (source_directory:str, ignored_files:List[str]=[],
                    chunk_size:int=500, chunk_overlap:int=50,
                    ignore_fn:Optional[Callable]=None)

Load documents and split them into chunks

Args:

  • source_directory: path to folder containing the document store
  • ignored_files: list of file paths to skip (e.g., files that were already ingested)
  • chunk_size: text is split into chunks of this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter
  • chunk_overlap: character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter
  • ignore_fn: optional function that accepts the file path (including file name) as input and returns True if the file path should not be ingested

Returns: list of langchain.docstore.document.Document objects
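
For example, to chunk a folder while skipping PDFs (the filter below is illustrative):

chunks = process_documents("sample_data",
                           chunk_size=500, chunk_overlap=50,
                           ignore_fn=lambda p: p.lower().endswith('.pdf'))
print(len(chunks))                   # number of langchain Document chunks
print(chunks[0].metadata['source'])  # absolute path of the originating file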


load_documents

 load_documents (source_dir:str, ignored_files:List[str]=[],
                 ignore_fn:Optional[Callable]=None)

Loads all documents from the source documents directory, ignoring specified files


load_single_document

 load_single_document (file_path:str)

Load a single document (invoked by load_documents).
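
Loaders are typically selected by file extension; a sketch of that dispatch follows, where the mapping shown is a small illustrative subset rather than the library's full list:

import os
from langchain.document_loaders import CSVLoader, PyMuPDFLoader, TextLoader

LOADER_MAPPING = {
    '.csv': (CSVLoader, {}),
    '.pdf': (PyMuPDFLoader, {}),
    '.txt': (TextLoader, {'encoding': 'utf8'}),
    '.eml': (MyElmLoader, {}),  # email files use the fallback wrapper above
}

def load_single_document(file_path: str):
    ext = os.path.splitext(file_path)[1].lower()
    if ext in LOADER_MAPPING:
        loader_class, loader_args = LOADER_MAPPING[ext]
        return loader_class(file_path, **loader_args).load()
    raise ValueError(f"Unsupported file extension '{ext}'")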


Ingester

 Ingester (embedding_model_name:str='sentence-transformers/all-MiniLM-L6-v2',
           embedding_model_kwargs:dict={'device': 'cpu'},
           embedding_encode_kwargs:dict={'normalize_embeddings': False},
           persist_directory:Optional[str]=None)

Ingests all documents in source_folder (previously-ingested documents are ignored)

Args:

  • embedding_model_name: name of sentence-transformers model
  • embedding_model_kwargs: arguments to embedding model (e.g., {'device': 'cpu'})
  • embedding_encode_kwargs: arguments to encode method of embedding model (e.g., {'normalize_embeddings': False}).
  • persist_directory: Path to vector database (created if it doesn’t exist). Default is onprem_data/vectordb in user’s home directory.

Returns: None
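
For example, to store the vector database somewhere other than the default location (the path below is illustrative):

ingester = Ingester(
    embedding_model_name='sentence-transformers/all-MiniLM-L6-v2',
    persist_directory='/tmp/my_vectordb',
)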


Ingester.get_embedding_model

 Ingester.get_embedding_model ()

Returns the langchain.embeddings.huggingface.HuggingFaceEmbeddings instance.
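
The returned object is a standard langchain embeddings instance, so it can be used directly; for example:

embeddings = ingester.get_embedding_model()
vector = embeddings.embed_query('What is machine learning?')
print(len(vector))  # 384 dimensions for all-MiniLM-L6-v2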


Ingester.get_db

 Ingester.get_db ()

Returns the langchain.vectorstores.Chroma instance.
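
Because this is a standard langchain Chroma instance, it supports direct similarity search over the ingested chunks; for example:

db = ingester.get_db()
docs = db.similarity_search('machine learning', k=4)
for doc in docs:
    print(doc.metadata['source'])  # absolute path stored at ingestion time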


Ingester.get_ingested_files

 Ingester.get_ingested_files ()

Returns a list of files previously added to the vector database (typically via LLM.ingest).
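
For example:

for fpath in ingester.get_ingested_files():
    print(fpath)  # absolute path of a previously-ingested file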


Ingester.ingest

 Ingester.ingest (source_directory:str, chunk_size:int=500,
                  chunk_overlap:int=50, ignore_fn:Optional[Callable]=None)

Ingests all documents in source_directory (previously-ingested documents are ignored). When retrieved, the Document objects will each have a metadata dict with the absolute path to the file in metadata["source"].

  • source_directory (str): path to folder containing the document store
  • chunk_size (int, default 500): text is split into chunks of this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter
  • chunk_overlap (int, default 50): character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter
  • ignore_fn (Optional[Callable], default None): optional function that accepts the file path (including file name) as input and returns True if the file path should not be ingested
  • Returns: None

ingester = Ingester()
ingester.ingest("sample_data")
Creating new vectorstore at /home/amaiya/onprem_data/vectordb
Loading documents from sample_data
Loading new documents: 100%|██████████████████████| 2/2 [00:00<00:00, 16.76it/s]
Loaded 11 new documents from sample_data
Split into 62 chunks of text (max. 500 chars each)
Creating embeddings. May take some minutes...
Ingestion complete! You can now query your documents using the LLM.ask method
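
Re-running ingest on the same folder skips files already in the vector database, and an ignore_fn can exclude paths outright. For example (the filter below is illustrative):

ingester.ingest("sample_data", ignore_fn=lambda p: '/drafts/' in p)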