ingest.base
PDF2MarkdownLoader
PDF2MarkdownLoader (file_path:Union[str,pathlib.PurePath], password:Optional[str]=None, mode:Literal['single','page']='page', pages_delimiter:str='\n\x0c', extract_images:bool=False, images_parser:Optional[langchain_community.document_loaders.parsers.images.BaseImageBlobParser]=None, images_inner_format:Literal['text','markdown-img','html-img']='text', extract_tables:Optional[Literal['csv','markdown','html']]=None, headers:Optional[dict]=None, extract_tables_settings:Optional[dict[str,Any]]=None, **kwargs:Any)
Custom PDF to Markdown Loader
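A rough usage sketch, assuming the loader follows LangChain's standard `BaseLoader` interface with a `.load()` method and that the module is importable as `onprem.ingest.base` (both assumptions, not shown above):

```python
# Hedged sketch: module path and .load() interface are assumed.
from onprem.ingest.base import PDF2MarkdownLoader

loader = PDF2MarkdownLoader(
    "report.pdf",               # path to the PDF (hypothetical file)
    mode="page",                # one Document per page (the default)
    extract_tables="markdown",  # render detected tables as Markdown
)
docs = loader.load()
print(docs[0].page_content[:200])  # Markdown text of the first page
```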
MyUnstructuredPDFLoader
MyUnstructuredPDFLoader (file_path:Union[str,pathlib.Path], mode:str='single', **unstructured_kwargs:Any)
Custom PDF Loader
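Since extra keyword arguments are forwarded to `unstructured`, a hedged sketch of OCR-friendly usage (file name and module path are illustrative):

```python
# Hedged sketch: assumes the same .load() interface as the
# UnstructuredPDFLoader this class customizes; unstructured_kwargs
# are passed through to unstructured's partitioning functions.
from onprem.ingest.base import MyUnstructuredPDFLoader

loader = MyUnstructuredPDFLoader(
    "scanned.pdf",      # hypothetical scanned PDF
    mode="single",      # return the whole file as one Document
    strategy="hi_res",  # unstructured option that enables OCR when needed
)
docs = loader.load()
```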
MyElmLoader
MyElmLoader (file_path:Union[str,pathlib.Path], mode:str='single', **unstructured_kwargs:Any)
Wrapper that falls back to text/plain when the default extraction does not work
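"Elm" here refers to email (`.eml`) files; in comparable ingestion code this class wraps LangChain's `UnstructuredEmailLoader`, so a minimal, hedged sketch:

```python
# Hedged sketch: assumes the loader retries with the text/plain part
# of the message when the default (text/html) extraction fails.
from onprem.ingest.base import MyElmLoader

docs = MyElmLoader("message.eml").load()  # hypothetical email file
```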
batchify_chunks
batchify_chunks (texts, batch_size=41000)
Split texts into batches sized for Chroma, which limits the number of records per insertion
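A hedged sketch of how batching might be used during ingestion; this assumes the function returns a `(batches, total_count)` pair, as in comparable ingestion code (check the source for the actual return shape):

```python
# Hedged sketch: the return shape and db.add_documents call are assumptions.
from onprem.ingest.base import batchify_chunks

batches, total = batchify_chunks(chunks, batch_size=41000)
for batch in batches:
    db.add_documents(batch)  # db: an existing Chroma-backed store (assumed)
```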
does_vectorstore_exist
does_vectorstore_exist (db)
Checks if vectorstore exists
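A sketch of the typical guard this enables, where `db` is assumed to be the underlying vector database object (e.g., a Chroma instance):

```python
# Sketch: append to an existing store instead of recreating it.
from onprem.ingest.base import does_vectorstore_exist

if does_vectorstore_exist(db):
    print("Appending to existing vectorstore")
else:
    print("Creating new vectorstore")
```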
chunk_documents
chunk_documents (documents:list, chunk_size:int=500, chunk_overlap:int=50, infer_table_structure:bool=False, **kwargs)
Process list of Documents by splitting into chunks.
|  | Type | Default | Details |
|---|---|---|---|
| documents | list |  | list of LangChain Documents |
| chunk_size | int | 500 | text is split into chunks of this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter |
| chunk_overlap | int | 50 | character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter |
| infer_table_structure | bool | False | set to True if documents may contain tables (i.e., doc.metadata['table']=True) |
| kwargs | VAR_KEYWORD |  |  |
| **Returns** | **List** |  |  |
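A sketch tying this to `load_single_document` (documented below); parameter names are as listed above, while the overall flow is assumed:

```python
# Sketch: chunk the Documents extracted from one file.
from onprem.ingest.base import load_single_document, chunk_documents

docs = load_single_document("paper.pdf", infer_table_structure=True)
chunks = chunk_documents(docs, chunk_size=500, chunk_overlap=50,
                         infer_table_structure=True)
print(f"{len(chunks)} chunks")
```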
process_folder
process_folder (source_directory:str, chunk_size:int=500, chunk_overlap:int=50, ignored_files:List[str]=[], ignore_fn:Optional[Callable]=None, batch_size:int=41000, **kwargs)
Load documents from a folder, extract text from them, and split the texts into chunks. Extra kwargs are fed to `ingest.load_documents` and `ingest.load_single_document`.
|  | Type | Default | Details |
|---|---|---|---|
| source_directory | str |  | path to folder containing document store |
| chunk_size | int | 500 | text is split into chunks of this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter |
| chunk_overlap | int | 50 | character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter |
| ignored_files | List | [] | list of files to ignore |
| ignore_fn | Optional | None | callable that accepts the file path (including file name) as input; the file is ignored if it returns True |
| batch_size | int | 41000 | batch size used when processing documents |
| kwargs | VAR_KEYWORD |  |  |
| **Returns** | **List** |  |  |
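A hedged sketch; the `pdf_markdown` kwarg is forwarded to `ingest.load_single_document` as stated above, and the lambda illustrates `ignore_fn`:

```python
# Sketch: folder name and the ignore rule are illustrative.
from onprem.ingest.base import process_folder

chunks = process_folder(
    "my_documents/",
    chunk_size=500,
    chunk_overlap=50,
    ignore_fn=lambda path: path.endswith(".tmp"),  # True means "skip this file"
    pdf_markdown=True,  # forwarded to load_single_document
)
```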
load_documents
load_documents (source_dir:str, ignored_files:List[str]=[], ignore_fn:Optional[Callable]=None, caption_tables:bool=False, extract_document_titles:bool=False, llm=None, n_proc:Optional[int]=None, verbose:bool=True, **kwargs)
*Loads all documents from the source documents directory, ignoring specified files. Extra kwargs are fed to `ingest.load_single_document`. Returns a generator over documents.*
|  | Type | Default | Details |
|---|---|---|---|
| source_dir | str |  | path to folder containing documents |
| ignored_files | List | [] | list of file paths to ignore |
| ignore_fn | Optional | None | callable that accepts a file path and returns True for files to ignore |
| caption_tables | bool | False | if True, augment table text with summaries of tables (applies only when infer_table_structure is True) |
| extract_document_titles | bool | False | if True, infer the document title and attach it to individual chunks |
| llm | NoneType | None | a reference to the LLM (used by caption_tables and extract_document_titles) |
| n_proc | Optional | None | number of CPU cores to use for text extraction; if None, use the system maximum |
| verbose | bool | True | verbosity |
| kwargs | VAR_KEYWORD |  |  |
| **Returns** | **List** |  |  |
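Because the docstring says a generator is returned, documents can be consumed lazily; a hedged sketch:

```python
# Sketch: iterate lazily; llm is only needed when caption_tables or
# extract_document_titles is enabled.
from onprem.ingest.base import load_documents

for doc in load_documents("my_documents/", n_proc=2, verbose=False):
    print(doc.metadata.get("source"))  # 'source' key assumed, per LangChain convention
```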
load_single_document
load_single_document (file_path:str, pdf_unstructured:bool=False, pdf_markdown:bool=False, store_md5:bool=False, store_mimetype:bool=False, store_file_dates:bool=False, file_callables:Optional[dict]=None, text_callables:Optional[dict]=None, **kwargs)
*Extract text from a single document. Will attempt to OCR PDFs, if necessary. Note that extra kwargs can be supplied to configure the behavior of the PDF loaders. For instance, supplying `infer_table_structure=True` will cause `load_single_document` to try to infer and extract tables from PDFs. When both `pdf_unstructured=True` and `infer_table_structure=True`, tables are represented as HTML within the main body of extracted text. In all other cases with `infer_table_structure=True`, inferred tables are represented as Markdown and appended to the end of the extracted text.*
|  | Type | Default | Details |
|---|---|---|---|
| file_path | str |  | path to file |
| pdf_unstructured | bool | False | if True, use unstructured for PDF extraction (will also OCR if necessary) |
| pdf_markdown | bool | False | if True, convert PDFs to Markdown instead of plain text |
| store_md5 | bool | False | compute and store the MD5 of the document in metadata |
| store_mimetype | bool | False | guess and store the MIME type of the document in metadata |
| store_file_dates | bool | False | extract and store file dates in metadata |
| file_callables | Optional | None | optional dict mapping metadata keys to functions called with the file path as argument; results are stored as metadata |
| text_callables | Optional | None | optional dict mapping metadata keys to functions called with the file text as argument; results are stored as metadata |
| kwargs | VAR_KEYWORD |  |  |
| **Returns** | **List** |  |  |
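A hedged sketch of the metadata hooks: each entry in `file_callables`/`text_callables` maps a metadata key to a function receiving the file path or extracted text, respectively (the exact metadata key names written by options like `store_md5` are not documented above):

```python
# Sketch: the metadata key names produced by store_md5 are assumptions.
import os
from onprem.ingest.base import load_single_document

docs = load_single_document(
    "contract.pdf",                                         # hypothetical file
    store_md5=True,
    file_callables={"file_size": os.path.getsize},          # called with the file path
    text_callables={"n_words": lambda t: len(t.split())},   # called with the text
)
print(docs[0].metadata)
```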
VectorStore
VectorStore ()
Abstract base class (ABC) defining the common interface that vector store implementations inherit from.