ingest

Functionality for document ingestion into a vector database for question-answering.

source

MyPDFLoader

 MyPDFLoader (file_path:Union[str,List[str],pathlib.Path,List[pathlib.Path]],
              mode:str='single', **unstructured_kwargs:Any)

Custom Unstructured-based PDF Loader
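
A quick sketch of usage, assuming MyPDFLoader is importable from onprem.ingest and follows the standard LangChain loader interface (the file name is hypothetical):

```python
from onprem.ingest import MyPDFLoader

# "sample.pdf" is a hypothetical path; with mode='single',
# the loader returns one Document for the whole file
loader = MyPDFLoader("sample.pdf", mode="single")
docs = loader.load()  # standard LangChain loader interface
print(docs[0].page_content[:200])
```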


source

MyElmLoader

 MyElmLoader (file_path:Union[str,pathlib.Path], mode:str='single',
              **unstructured_kwargs:Any)

Wrapper that falls back to text/plain when the default loading mode does not work
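
A sketch of the same pattern for email files, assuming the class is importable from onprem.ingest (the file name is hypothetical):

```python
from onprem.ingest import MyElmLoader

# "message.eml" is a hypothetical file; if the default parsing mode
# raises an error, the loader retries the file as text/plain
docs = MyElmLoader("message.eml").load()
```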


source

batchify_chunks

 batchify_chunks (texts)

Splits texts into batches sized for Chroma
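
Chroma caps how many records can be added per call, so chunks are batched before insertion. The return value is not documented above; this sketch assumes it returns an iterable of batches together with a batch count (an assumption based on typical use with a progress bar):

```python
# Assumed return shape: (batches, total); this is not documented above
batches, total = batchify_chunks(chunks)
for batch in batches:
    db.add_documents(batch)  # db obtained from Ingester().get_db()
```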


source

does_vectorstore_exist

 does_vectorstore_exist (db)

Checks if vectorstore exists
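
This is used internally to decide between creating a new vectorstore and appending to an existing one. A minimal sketch, assuming both names are importable from onprem.ingest:

```python
from onprem.ingest import Ingester, does_vectorstore_exist

ingester = Ingester()
db = ingester.get_db()
if does_vectorstore_exist(db):
    print("appending to existing vectorstore")
else:
    print("a new vectorstore will be created")
```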


source

process_documents

 process_documents (source_directory:str, chunk_size:int=500,
                    chunk_overlap:int=50, ignored_files:List[str]=[],
                    ignore_fn:Optional[Callable]=None,
                    pdf_use_unstructured:bool=False, **kwargs)

Loads documents and splits them into chunks. Extra kwargs are fed to langchain_community.document_loaders.pdf.UnstructuredPDFLoader when pdf_use_unstructured is True.

| | Type | Default | Details |
|----|----|----|----|
| source_directory | str | | path to folder containing document store |
| chunk_size | int | 500 | text is split into chunks of this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter |
| chunk_overlap | int | 50 | character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter |
| ignored_files | List | [] | list of files to ignore |
| ignore_fn | Optional | None | callable that accepts the file path (including file name) as input; the file is ignored if it returns True |
| pdf_use_unstructured | bool | False | if True, use unstructured for PDF extraction |
| kwargs | | | |
| Returns | List | | |
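
For example, a folder can be chunked without storing anything, which is useful for inspecting the splits (the folder name is hypothetical):

```python
from onprem.ingest import process_documents

# chunk everything under "my_docs", skipping temp files via ignore_fn
chunks = process_documents(
    "my_docs",
    chunk_size=500,
    chunk_overlap=50,
    ignore_fn=lambda path: path.endswith(".tmp"),
)
print(f"{len(chunks)} chunks produced")
```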

source

load_documents

 load_documents (source_dir:str, ignored_files:List[str]=[],
                 ignore_fn:Optional[Callable]=None,
                 pdf_use_unstructured:bool=False, **kwargs)

Loads all documents from the source documents directory, ignoring specified files. Extra kwargs are fed to langchain_community.document_loaders.pdf.UnstructuredPDFLoader when pdf_use_unstructured is True.

| | Type | Default | Details |
|----|----|----|----|
| source_dir | str | | path to folder containing documents |
| ignored_files | List | [] | list of filepaths to ignore |
| ignore_fn | Optional | None | callable that accepts a file path and returns True for ignored files |
| pdf_use_unstructured | bool | False | if True, use unstructured for PDF extraction |
| kwargs | | | |
| Returns | List | | |
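
A brief sketch that loads documents without chunking them (the folder and file names are hypothetical):

```python
from onprem.ingest import load_documents

docs = load_documents("my_docs", ignored_files=["my_docs/skip_me.txt"])
print(f"loaded {len(docs)} documents")
```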

source

load_single_document

 load_single_document (file_path:str, pdf_use_unstructured:bool=False,
                       **kwargs)

Loads a single document, attempting to OCR PDFs if necessary. Extra kwargs are fed to langchain_community.document_loaders.pdf.UnstructuredPDFLoader when pdf_use_unstructured is True.

| | Type | Default | Details |
|----|----|----|----|
| file_path | str | | path to file |
| pdf_use_unstructured | bool | False | if True, use unstructured for PDF extraction |
| kwargs | | | |
| Returns | List | | |
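
For instance (the file name is hypothetical):

```python
from onprem.ingest import load_single_document

# load one PDF using unstructured for extraction; returns a list of Documents
docs = load_single_document("report.pdf", pdf_use_unstructured=True)
```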

source

extract_files

 extract_files (source_dir:str)

Extracts files of supported file types from a folder.
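
A sketch, assuming the function yields the matching file paths (the folder name is hypothetical):

```python
from onprem.ingest import extract_files

for path in extract_files("my_docs"):
    print(path)
```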


source

Ingester

 Ingester (embedding_model_name:str='sentence-transformers/all-MiniLM-L6-v2',
           embedding_model_kwargs:dict={'device': 'cpu'},
           embedding_encode_kwargs:dict={'normalize_embeddings': False},
           persist_directory:Optional[str]=None)

Ingests all documents in source_directory (previously-ingested documents are ignored)

Args:

  • embedding_model_name: name of sentence-transformers model
  • embedding_model_kwargs: arguments to embedding model (e.g., {'device': 'cpu'})
  • embedding_encode_kwargs: arguments to encode method of embedding model (e.g., {'normalize_embeddings': False})
  • persist_directory: path to vector database (created if it doesn’t exist); default is onprem_data/vectordb in user’s home directory

Returns: None
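
The defaults can be overridden at construction; for example, to embed on a GPU and keep the vector database in a custom location (the path is hypothetical):

```python
from onprem.ingest import Ingester

ingester = Ingester(
    embedding_model_kwargs={"device": "cuda"},  # embed on GPU instead of CPU
    persist_directory="/tmp/my_vectordb",       # hypothetical location
)
```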


source

Ingester.get_embedding_model

 Ingester.get_embedding_model ()

Returns the langchain_huggingface.HuggingFaceEmbeddings instance
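
The returned object supports the standard LangChain embeddings interface; for example:

```python
embeddings = Ingester().get_embedding_model()
vector = embeddings.embed_query("What is on-premises machine learning?")
print(len(vector))  # embedding dimensionality (384 for all-MiniLM-L6-v2)
```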


source

Ingester.get_db

 Ingester.get_db ()

Returns the langchain_chroma.Chroma instance
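
The returned store can be queried directly through the langchain_chroma.Chroma API; for example:

```python
db = Ingester().get_db()
results = db.similarity_search("deep learning", k=4)
for doc in results:
    print(doc.metadata["source"])  # absolute path to the source file
```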


source

Ingester.get_ingested_files

 Ingester.get_ingested_files ()

Returns a list of files previously added to the vector database (typically via LLM.ingest)
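
For instance, to see what is already in the store before re-ingesting:

```python
for fname in Ingester().get_ingested_files():
    print(fname)
```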


source

Ingester.ingest

 Ingester.ingest (source_directory:str, chunk_size:int=500,
                  chunk_overlap:int=50, ignore_fn:Optional[Callable]=None,
                  pdf_use_unstructured:bool=False, **kwargs)

Ingests all documents in source_directory (previously-ingested documents are ignored). When retrieved, each Document object will have a metadata dict containing the absolute path to the file in metadata["source"]. Extra kwargs are fed to langchain_community.document_loaders.pdf.UnstructuredPDFLoader when pdf_use_unstructured is True.

| | Type | Default | Details |
|----|----|----|----|
| source_directory | str | | path to folder containing document store |
| chunk_size | int | 500 | text is split into chunks of this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter |
| chunk_overlap | int | 50 | character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter |
| ignore_fn | Optional | None | optional callable that accepts the file path (including file name) as input and returns True if the file should not be ingested |
| pdf_use_unstructured | bool | False | if True, use unstructured for PDF extraction |
| kwargs | | | |
| Returns | None | | |
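
For example, ingestion can be restricted to PDFs with ignore_fn (the folder name is hypothetical):

```python
ingester = Ingester()
ingester.ingest(
    "my_docs",  # hypothetical folder
    chunk_size=800,
    chunk_overlap=80,
    ignore_fn=lambda path: not path.lower().endswith(".pdf"),
)
```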

source

Ingester.store_documents

 Ingester.store_documents (documents)

Stores instances of langchain_core.documents.base.Document in the vector database
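
Documents constructed manually can also be stored directly; a minimal sketch (the source path is hypothetical):

```python
from langchain_core.documents import Document

docs = [
    Document(
        page_content="Some text to index.",
        metadata={"source": "/abs/path/to/file.txt"},  # hypothetical path
    )
]
Ingester().store_documents(docs)
```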

```python
ingester = Ingester()
ingester.ingest("sample_data")
```

```
Creating new vectorstore at /home/amaiya/onprem_data/vectordb
Loading documents from sample_data
Loading new documents: 100%|██████████████████████| 2/2 [00:00<00:00, 16.76it/s]
Loaded 11 new documents from sample_data
Split into 62 chunks of text (max. 500 chars each)
Creating embeddings. May take some minutes...
Ingestion complete! You can now query your documents using the LLM.ask method
```