ingest

Functionality for text extraction and document ingestion into a vector database for question-answering and other tasks.


PDF2MarkdownLoader

 PDF2MarkdownLoader (file_path:str, headers:Optional[Dict]=None,
                     extract_images:bool=False, **kwargs:Any)

Custom PDF to Markdown Loader



MyUnstructuredPDFLoader

 MyUnstructuredPDFLoader (file_path:Union[str,List[str],pathlib.Path,List[pathlib.Path]],
                          mode:str='single', **unstructured_kwargs:Any)

Custom PDF Loader



MyElmLoader

 MyElmLoader (file_path:Union[str,pathlib.Path], mode:str='single',
              **unstructured_kwargs:Any)

Wrapper that falls back to text/plain when the default loader does not work



Ingester

 Ingester (embedding_model_name:str='sentence-transformers/all-MiniLM-L6-v2',
           embedding_model_kwargs:dict={'device': 'cpu'},
           embedding_encode_kwargs:dict={'normalize_embeddings': False},
           persist_directory:Optional[str]=None)

Ingests all documents in source_folder (previously-ingested documents are ignored).

Args:

  • embedding_model_name: name of sentence-transformers model
  • embedding_model_kwargs: arguments to embedding model (e.g., {'device': 'cpu'})
  • embedding_encode_kwargs: arguments to encode method of embedding model (e.g., {'normalize_embeddings': False})
  • persist_directory: path to vector database (created if it doesn’t exist). Default is onprem_data/vectordb in the user’s home directory.

Returns: None



batchify_chunks

 batchify_chunks (texts)

Splits texts into batches specifically for Chroma
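
The batching idea can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name `batchify` and the cap `MAX_BATCH_SIZE` are hypothetical, and the real limit depends on the installed Chroma version.

```python
from typing import Iterator, List, Sequence

# Hypothetical cap, chosen only for illustration. The actual maximum batch
# size accepted by Chroma is not stated in this documentation.
MAX_BATCH_SIZE = 1000


def batchify(texts: Sequence, batch_size: int = MAX_BATCH_SIZE) -> Iterator[List]:
    """Yield consecutive slices of `texts`, each holding at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


batches = list(batchify(["a", "b", "c", "d", "e"], batch_size=2))
print(batches)  # [['a', 'b'], ['c', 'd'], ['e']]
```

Each batch can then be added to the vector store in its own call, so no single call exceeds the store's limit.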



does_vectorstore_exist

 does_vectorstore_exist (db)

Checks if vectorstore exists



process_documents

 process_documents (source_directory:str, chunk_size:int=500,
                    chunk_overlap:int=50, ignored_files:List[str]=[],
                    ignore_fn:Optional[Callable]=None,
                    pdf_unstructured:bool=False, **kwargs)

Loads documents and splits them into chunks. Extra kwargs are passed to ingest.load_single_document.

| Parameter | Type | Default | Details |
|---|---|---|---|
| source_directory | str | | path to folder containing document store |
| chunk_size | int | 500 | text is split to this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter |
| chunk_overlap | int | 50 | character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter |
| ignored_files | List | [] | list of files to ignore |
| ignore_fn | Optional | None | callable that accepts the file path (including file name) as input and returns True if the file should be ignored |
| pdf_unstructured | bool | False | If True, use unstructured for PDF extraction |
| kwargs | | | |
| Returns | List | | |
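
To make the interplay of chunk_size and chunk_overlap concrete, here is a simplified sketch. It uses a fixed-width sliding window, which is not what RecursiveCharacterTextSplitter does (that splitter prefers to break on separators such as paragraphs and sentences). The function name `split_with_overlap` is hypothetical.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Cut `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

The last character of each chunk reappears as the first character of the next, which is the role chunk_overlap plays: text that straddles a boundary remains visible in both neighboring chunks.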


load_documents

 load_documents (source_dir:str, ignored_files:List[str]=[],
                 ignore_fn:Optional[Callable]=None,
                 pdf_unstructured:bool=False, **kwargs)

Loads all documents from the source documents directory, ignoring specified files. Extra kwargs are passed to ingest.load_single_document.

| Parameter | Type | Default | Details |
|---|---|---|---|
| source_dir | str | | path to folder containing documents |
| ignored_files | List | [] | list of filepaths to ignore |
| ignore_fn | Optional | None | callable that accepts file path and returns True for ignored files |
| pdf_unstructured | bool | False | If True, use unstructured for PDF extraction |
| kwargs | | | |
| Returns | List | | |
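
An ignore_fn is simply a callable that takes a file path and returns True for files to skip. The sketch below shows one possible predicate. The function name and the filtering rules (hidden files and a `draft_` prefix) are hypothetical examples, not conventions of the library.

```python
import os


def ignore_drafts_and_hidden(file_path: str) -> bool:
    """Return True for files that should be skipped (hypothetical rules)."""
    name = os.path.basename(file_path)
    return name.startswith(".") or name.lower().startswith("draft_")


# Apply the predicate by hand to see which paths would survive.
paths = ["/data/report.pdf", "/data/.DS_Store", "/data/Draft_notes.txt"]
kept = [p for p in paths if not ignore_drafts_and_hidden(p)]
print(kept)  # ['/data/report.pdf']
```

Per the signatures on this page, such a callable would be supplied as `ignore_fn=ignore_drafts_and_hidden` to load_documents, process_documents, or Ingester.ingest.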


load_single_document

 load_single_document (file_path:str, pdf_unstructured:bool=False,
                       pdf_markdown:bool=False, **kwargs)

Extracts text from a single document. Will attempt to OCR PDFs, if necessary.

Note that extra kwargs can be supplied to configure the behavior of PDF loaders. For instance, supplying infer_table_structure will cause load_single_document to try to infer and extract tables from PDFs. When pdf_unstructured=True and infer_table_structure=True, tables are represented as HTML within the main body of extracted text. In all other cases, inferred tables are represented as Markdown and appended to the end of the extracted text when infer_table_structure=True.

| Parameter | Type | Default | Details |
|---|---|---|---|
| file_path | str | | path to file |
| pdf_unstructured | bool | False | use unstructured for PDF extraction if True (will also OCR if necessary) |
| pdf_markdown | bool | False | Convert PDFs to Markdown instead of plain text if True. |
| kwargs | | | |
| Returns | List | | |
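
The table-handling rules described above amount to a small decision table. The helper below only restates that description in code. The function `table_representation` is hypothetical and is not part of the library.

```python
from typing import Optional


def table_representation(pdf_unstructured: bool, infer_table_structure: bool) -> Optional[str]:
    """Describe how inferred PDF tables appear in the extracted text, per the notes above."""
    if not infer_table_structure:
        return None  # table inference was not requested
    if pdf_unstructured:
        return "HTML, inline in the main body of the extracted text"
    return "Markdown, appended to the end of the extracted text"


for unstructured in (False, True):
    for infer in (False, True):
        print(unstructured, infer, table_representation(unstructured, infer))
```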


extract_files

 extract_files (source_dir:str)

Extract files of supported file types from folder.
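
As a rough sketch of what collecting supported files from a folder can look like, the code below walks a directory tree and keeps files whose extension is in an allow-list. The function name `find_supported_files` and the extension set are hypothetical; the library's actual list of supported types is not given here.

```python
import os
import tempfile
from typing import Iterator

# Hypothetical allow-list, for illustration only.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".md"}


def find_supported_files(source_dir: str, extensions=SUPPORTED_EXTENSIONS) -> Iterator[str]:
    """Yield paths under `source_dir` whose extension (case-insensitive) is allowed."""
    for root, _dirs, files in os.walk(source_dir):
        for name in sorted(files):
            if os.path.splitext(name)[1].lower() in extensions:
                yield os.path.join(root, name)


# Small demonstration on a throwaway folder.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a.pdf", "b.xyz", os.path.join("sub", "c.TXT")):
        open(os.path.join(tmp, rel), "w").close()
    found = sorted(os.path.basename(p) for p in find_supported_files(tmp))

print(found)  # ['a.pdf', 'c.TXT']
```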



Ingester.get_embedding_model

 Ingester.get_embedding_model ()

Returns the langchain_huggingface.HuggingFaceEmbeddings instance



Ingester.get_db

 Ingester.get_db ()

Returns the langchain_chroma.Chroma instance



Ingester.get_ingested_files

 Ingester.get_ingested_files ()

Returns a list of files previously added to vector database (typically via LLM.ingest)



Ingester.ingest

 Ingester.ingest (source_directory:str, chunk_size:int=500,
                  chunk_overlap:int=50, ignore_fn:Optional[Callable]=None,
                  pdf_unstructured:bool=False, **kwargs)

Ingests all documents in source_directory (previously-ingested documents are ignored). When retrieved, the Document objects will each have a metadata dict with the absolute path to the file in metadata["source"]. Extra kwargs are passed to ingest.load_single_document.

| Parameter | Type | Default | Details |
|---|---|---|---|
| source_directory | str | | path to folder containing document store |
| chunk_size | int | 500 | text is split to this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter |
| chunk_overlap | int | 50 | character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter |
| ignore_fn | Optional | None | Optional function that accepts the file path (including file name) as input and returns True if file path should not be ingested. |
| pdf_unstructured | bool | False | If True, use unstructured for PDF extraction |
| kwargs | | | |
| Returns | None | | |
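
Skipping previously-ingested documents can be pictured as a set difference over file paths. The sketch below assumes the comparison is made on absolute paths, which is suggested by the metadata["source"] convention described above but is not confirmed as the library's exact logic. The function name `select_new_files` is hypothetical.

```python
import os
from typing import Iterable, List


def select_new_files(candidates: Iterable[str], already_ingested: Iterable[str]) -> List[str]:
    """Return absolute paths from `candidates` that are not in `already_ingested`."""
    seen = {os.path.abspath(p) for p in already_ingested}
    return [p for p in (os.path.abspath(c) for c in candidates) if p not in seen]


# Hypothetical paths: a.pdf was ingested earlier, b.pdf is new.
ingested = [os.path.abspath("docs/a.pdf")]
new_files = select_new_files(["docs/a.pdf", "docs/b.pdf"], ingested)
print([os.path.basename(p) for p in new_files])  # ['b.pdf']
```

In the real workflow, the list of already-ingested paths would come from Ingester.get_ingested_files.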


Ingester.store_documents

 Ingester.store_documents (documents)

Stores instances of langchain_core.documents.base.Document in vectordb

ingester = Ingester()
ingester.ingest("tests/sample_data")
Creating new vectorstore at /home/amaiya/onprem_data/vectordb
Loading documents from sample_data
Loading new documents: 100%|██████████████████████| 2/2 [00:00<00:00, 16.76it/s]
Loaded 11 new documents from sample_data
Split into 62 chunks of text (max. 500 chars each)
Creating embeddings. May take some minutes...
Ingestion complete! You can now query your documents using the LLM.ask method