ingest.base

Functionality for text extraction and document ingestion into a vector database for question answering and other tasks.

source

PDF2MarkdownLoader

 PDF2MarkdownLoader (file_path:Union[str,pathlib.PurePath],
                     password:Optional[str]=None,
                     mode:Literal['single','page']='page',
                     pages_delimiter:str='\n\x0c',
                     extract_images:bool=False,
                     images_parser:Optional[langchain_community.document_loaders.parsers.images.BaseImageBlobParser]=None,
                     images_inner_format:Literal['text','markdown-img','html-img']='text',
                     extract_tables:Optional[Literal['csv','markdown','html']]=None,
                     headers:Optional[dict]=None,
                     extract_tables_settings:Optional[dict[str,Any]]=None,
                     **kwargs:Any)

Custom PDF to Markdown Loader
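
A minimal usage sketch. The import path and the sample file name are assumptions; the loader is used through the standard LangChain loader interface (`.load()`):

```python
from onprem.ingest.base import PDF2MarkdownLoader  # assumed import path

# 'paper.pdf' is a placeholder; mode='page' yields one Document per page.
loader = PDF2MarkdownLoader("paper.pdf", mode="page")
docs = loader.load()  # standard LangChain loader API
print(len(docs), docs[0].page_content[:200])
```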


source

MyUnstructuredPDFLoader

 MyUnstructuredPDFLoader (file_path:Union[str,pathlib.Path],
                          mode:str='single', **unstructured_kwargs:Any)

Custom PDF Loader
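
A usage sketch under the same assumptions. Extra keyword arguments are forwarded to unstructured (for example, `strategy="hi_res"` on installs that support it):

```python
from onprem.ingest.base import MyUnstructuredPDFLoader  # assumed import path

# 'scanned.pdf' is a placeholder; unstructured can OCR image-only pages.
docs = MyUnstructuredPDFLoader("scanned.pdf", mode="single").load()
```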


source

MyElmLoader

 MyElmLoader (file_path:Union[str,pathlib.Path], mode:str='single',
              **unstructured_kwargs:Any)

Wrapper that falls back to text/plain extraction when the default loader fails
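
A sketch of the fallback behavior. The `.eml` file name is a placeholder, and the assumption that this wraps an unstructured email loader is inferred from the name:

```python
from onprem.ingest.base import MyElmLoader  # assumed import path

# If the default parser raises, the loader retries the file as
# text/plain instead of failing the whole ingestion run.
docs = MyElmLoader("message.eml").load()
```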


source

batchify_chunks

 batchify_chunks (texts, batch_size=41000)

Split texts into batches that respect Chroma's maximum batch size
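
A consumption sketch. It assumes the function returns a `(batches, total_chunk_count)` pair, which should be verified against the source; `texts` and `db` are placeholders for a list of chunked Documents and a Chroma instance:

```python
from onprem.ingest.base import batchify_chunks  # assumed import path

# NOTE: the (batches, total) return shape is an assumption.
batches, total = batchify_chunks(texts, batch_size=41000)
for batch in batches:
    db.add_documents(batch)  # keep each call under Chroma's batch limit
```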


source

does_vectorstore_exist

 does_vectorstore_exist (db)

Checks if vectorstore exists
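
A sketch of the typical ingest-or-append check; `db` is a placeholder for an existing vectorstore handle:

```python
from onprem.ingest.base import does_vectorstore_exist  # assumed import path

if does_vectorstore_exist(db):
    print("Appending to existing vectorstore.")
else:
    print("No vectorstore found; ingesting from scratch.")
```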


source

chunk_documents

 chunk_documents (documents:list, chunk_size:int=500,
                  chunk_overlap:int=50, infer_table_structure:bool=False,
                  **kwargs)

Process a list of Documents by splitting them into chunks.

| | Type | Default | Details |
|---|------|---------|---------|
| documents | list | | list of LangChain Documents |
| chunk_size | int | 500 | text is split to this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter |
| chunk_overlap | int | 50 | character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter |
| infer_table_structure | bool | False | set to True if documents may contain tables (i.e., doc.metadata['table']=True) |
| kwargs | VAR_KEYWORD | | |
| Returns | List | | |
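
A sketch using the documented defaults; `docs` is a placeholder list of LangChain Documents (e.g., the output of load_documents):

```python
from onprem.ingest.base import chunk_documents  # assumed import path

chunks = chunk_documents(docs, chunk_size=500, chunk_overlap=50)
print(f"{len(docs)} documents split into {len(chunks)} chunks")
```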

source

process_folder

 process_folder (source_directory:str, chunk_size:int=500,
                 chunk_overlap:int=50, ignored_files:List[str]=[],
                 ignore_fn:Optional[Callable]=None, batch_size:int=41000,
                 **kwargs)

Load documents from a folder, extract text from them, and split the texts into chunks. Extra kwargs are fed to ingest.load_documents and ingest.load_single_document.

| | Type | Default | Details |
|---|------|---------|---------|
| source_directory | str | | path to folder containing document store |
| chunk_size | int | 500 | text is split to this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter |
| chunk_overlap | int | 50 | character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter |
| ignored_files | List | [] | list of files to ignore |
| ignore_fn | Optional | None | callable that accepts the file path (including file name) and causes the file to be skipped if it returns True |
| batch_size | int | 41000 | batch size used when processing documents |
| kwargs | VAR_KEYWORD | | |
| Returns | List | | |
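
A sketch of a folder-level ingest; the directory name and the ignore rule are illustrative only:

```python
from onprem.ingest.base import process_folder  # assumed import path

chunks = process_folder(
    "my_documents",                          # placeholder folder
    chunk_size=500,
    chunk_overlap=50,
    ignore_fn=lambda p: p.endswith(".tmp"),  # skip files matching a rule
)
```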

source

load_documents

 load_documents (source_dir:str, ignored_files:List[str]=[],
                 ignore_fn:Optional[Callable]=None,
                 caption_tables:bool=False,
                 extract_document_titles:bool=False, llm=None,
                 n_proc:Optional[int]=None, verbose:bool=True, **kwargs)

Loads all documents from the source documents directory, ignoring specified files. Extra kwargs are fed to ingest.load_single_document.

Returns a generator over documents.

| | Type | Default | Details |
|---|------|---------|---------|
| source_dir | str | | path to folder containing documents |
| ignored_files | List | [] | list of filepaths to ignore |
| ignore_fn | Optional | None | callable that accepts the file path and returns True for files to be ignored |
| caption_tables | bool | False | if True, augment table text with LLM-generated table summaries (requires infer_table_structure=True) |
| extract_document_titles | bool | False | if True, infer document title and attach to individual chunks |
| llm | NoneType | None | a reference to the LLM (used by caption_tables and extract_document_titles) |
| n_proc | Optional | None | number of CPU cores to use for text extraction; if None, use all available cores |
| verbose | bool | True | verbosity |
| kwargs | VAR_KEYWORD | | |
| Returns | List | | |
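
Because load_documents returns a generator, materialize it with `list()` when you need the length. The folder name is a placeholder, and infer_table_structure is forwarded via kwargs to load_single_document:

```python
from onprem.ingest.base import load_documents  # assumed import path

docs = list(load_documents("my_documents", infer_table_structure=True))
print(f"loaded {len(docs)} documents")
```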

source

load_single_document

 load_single_document (file_path:str, pdf_unstructured:bool=False,
                       pdf_markdown:bool=False, store_md5:bool=False,
                       store_mimetype:bool=False,
                       store_file_dates:bool=False,
                       file_callables:Optional[dict]=None,
                       text_callables:Optional[dict]=None, **kwargs)

Extract text from a single document. Will attempt to OCR PDFs, if necessary.

Note that extra kwargs can be supplied to configure the behavior of PDF loaders. For instance, supplying infer_table_structure=True will cause load_single_document to try to infer and extract tables from PDFs. When pdf_unstructured=True and infer_table_structure=True, tables are represented as HTML within the main body of extracted text. In all other cases, inferred tables are represented as Markdown and appended to the end of the extracted text.

| | Type | Default | Details |
|---|------|---------|---------|
| file_path | str | | path to file |
| pdf_unstructured | bool | False | use unstructured for PDF extraction if True (will also OCR if necessary) |
| pdf_markdown | bool | False | convert PDFs to Markdown instead of plain text if True |
| store_md5 | bool | False | compute and store the MD5 of the document in metadata |
| store_mimetype | bool | False | guess and store the MIME type of the document in metadata |
| store_file_dates | bool | False | extract and store file dates in metadata |
| file_callables | Optional | None | optional dict mapping keys to functions called with the file path as argument; results are stored as metadata |
| text_callables | Optional | None | optional dict mapping keys to functions called with the file text as argument; results are stored as metadata |
| kwargs | VAR_KEYWORD | | |
| Returns | List | | |
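
A sketch combining the metadata options. The file path is a placeholder, and the file_callables entry illustrates how arbitrary per-file metadata can be computed:

```python
import os
from onprem.ingest.base import load_single_document  # assumed import path

docs = load_single_document(
    "report.pdf",                 # placeholder path
    pdf_markdown=True,            # convert PDF to Markdown
    store_md5=True,               # add MD5 hash to metadata
    file_callables={"size_bytes": os.path.getsize},  # custom metadata
)
print(docs[0].metadata)
```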

source

VectorStore

 VectorStore ()

Abstract base class (ABC) for vector stores.