PDF2MarkdownLoader
PDF2MarkdownLoader (file_path:Union[str,pathlib.PurePath], password:Optional[str]=None,
                    mode:Literal['single','page']='page', pages_delimiter:str='\n\x0c',
                    extract_images:bool=False,
                    images_parser:Optional[langchain_community.document_loaders.parsers.images.BaseImageBlobParser]=None,
                    images_inner_format:Literal['text','markdown-img','html-img']='text',
                    extract_tables:Optional[Literal['csv','markdown','html']]=None,
                    headers:Optional[dict]=None,
                    extract_tables_settings:Optional[dict[str,Any]]=None, **kwargs:Any)
Custom PDF to Markdown Loader
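A minimal usage sketch (the import path and file name are assumptions; like other LangChain-style loaders, the class is assumed to expose a load() method):

from onprem.ingest import PDF2MarkdownLoader  # assumed import path

loader = PDF2MarkdownLoader("tests/sample_data/sample.pdf", mode="page")  # hypothetical file
docs = loader.load()  # in 'page' mode, one Document per page, with Markdown-formatted text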
MyUnstructuredPDFLoader
MyUnstructuredPDFLoader (file_path:Union[str,pathlib.Path],
mode:str='single', **unstructured_kwargs:Any)
Custom PDF Loader
MyElmLoader
MyElmLoader (file_path:Union[str,pathlib.Path], mode:str='single',
**unstructured_kwargs:Any)
Wrapper that falls back to text/plain when the default loader does not work
Ingester
Ingester (embedding_model_name:str='sentence-transformers/all-MiniLM-L6-v2',
          embedding_model_kwargs:dict={'device': 'cpu'},
          embedding_encode_kwargs:dict={'normalize_embeddings': False},
          persist_directory:Optional[str]=None)
Ingests all documents in source_folder (previously-ingested documents are ignored).

Args:
  embedding_model : name of sentence-transformers model
  embedding_model_kwargs : arguments to embedding model (e.g., {'device': 'cpu'})
  embedding_encode_kwargs : arguments to encode method of embedding model (e.g., {'normalize_embeddings': False})
  persist_directory : path to vector database (created if it doesn't exist); default is onprem_data/vectordb in the user's home directory

Returns: None
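For example, to run embeddings on CPU and keep the vector database in a custom location (values here are illustrative):

from onprem.ingest import Ingester  # assumed import path

ingester = Ingester(
    embedding_model_name="sentence-transformers/all-MiniLM-L6-v2",
    embedding_model_kwargs={"device": "cpu"},
    persist_directory="/tmp/my_vectordb",  # hypothetical path; default is ~/onprem_data/vectordb
)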
batchify_chunks
batchify_chunks (texts)
split texts into batches specifically for Chroma
does_vectorstore_exist
does_vectorstore_exist (db)
Checks if vectorstore exists
process_documents
process_documents (documents:list, chunk_size:int=500,
chunk_overlap:int=50, pdf_unstructured:bool=False,
**kwargs)
Process a list of Documents by splitting them into chunks. Extra kwargs are fed to ingest.load_single_document.
documents (list): list of LangChain Documents
chunk_size (int, default 500): text is split to this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter
chunk_overlap (int, default 50): character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter
pdf_unstructured (bool, default False): if True, use unstructured for PDF extraction
**kwargs: extra keyword arguments fed to ingest.load_single_document
Returns: List
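A short sketch of chunking documents that were already loaded (assumes docs is a list of LangChain Documents, e.g. from load_documents):

from onprem.ingest import process_documents  # assumed import path

chunks = process_documents(docs, chunk_size=500, chunk_overlap=50)
print(len(chunks))  # each chunk is itself a Document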
process_folder
process_folder (source_directory:str, chunk_size:int=500,
chunk_overlap:int=50, ignored_files:List[str]=[],
ignore_fn:Optional[Callable]=None,
pdf_unstructured:bool=False, **kwargs)
Load documents from a folder, extract text from them, and split the texts into chunks. Extra kwargs are fed to ingest.load_single_document.
source_directory (str): path to folder containing document store
chunk_size (int, default 500): text is split to this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter
chunk_overlap (int, default 50): character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter
ignored_files (List, default []): list of files to ignore
ignore_fn (Optional[Callable], default None): callable that accepts the file path (including file name) as input; the file is ignored if it returns True
pdf_unstructured (bool, default False): if True, use unstructured for PDF extraction
**kwargs: extra keyword arguments fed to ingest.load_single_document
Returns: List
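A sketch of loading and chunking an entire folder in one call (the folder path is the sample data used elsewhere in these docs):

from onprem.ingest import process_folder  # assumed import path

chunks = process_folder("tests/sample_data", chunk_size=500, chunk_overlap=50)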
load_documents
load_documents (source_dir:str, ignored_files:List[str]=[],
ignore_fn:Optional[Callable]=None,
pdf_unstructured:bool=False, caption_tables:bool=False,
extract_document_titles:bool=False, llm=None, **kwargs)
Loads all documents from the source documents directory, ignoring specified files. Extra kwargs are fed to ingest.load_single_document.
source_dir (str): path to folder containing documents
ignored_files (List, default []): list of filepaths to ignore
ignore_fn (Optional[Callable], default None): callable that accepts a file path and returns True for files to ignore
pdf_unstructured (bool, default False): if True, use unstructured for PDF extraction
caption_tables (bool, default False): if True, augment table text with summaries of tables (applies only when infer_table_structure is True)
extract_document_titles (bool, default False): if True, infer document title and attach to individual chunks
llm (default None): a reference to the LLM (used by caption_tables and extract_document_titles)
**kwargs: extra keyword arguments fed to ingest.load_single_document
Returns: List
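A sketch of loading documents without splitting them (the llm argument is only needed when using caption_tables or extract_document_titles):

from onprem.ingest import load_documents  # assumed import path

docs = load_documents("tests/sample_data")
# with an LLM instance available (sketch):
# docs = load_documents("tests/sample_data", caption_tables=True,
#                       extract_document_titles=True, llm=llm)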
load_single_document
load_single_document (file_path:str, pdf_unstructured:bool=False,
pdf_markdown:bool=False, **kwargs)
Extract text from a single document. Will attempt to OCR PDFs, if necessary.

Note that extra kwargs can be supplied to configure the behavior of PDF loaders. For instance, supplying infer_table_structure will cause load_single_document to try to infer and extract tables from PDFs. When pdf_unstructured=True and infer_table_structure=True, tables are represented as HTML within the main body of extracted text. In all other cases, when infer_table_structure=True, inferred tables are represented as Markdown and appended to the end of the extracted text.
file_path (str): path to file
pdf_unstructured (bool, default False): use unstructured for PDF extraction if True (will also OCR if necessary)
pdf_markdown (bool, default False): convert PDFs to Markdown instead of plain text if True
**kwargs: extra keyword arguments that configure the PDF loaders
Returns: List
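For example (the file name is hypothetical):

from onprem.ingest import load_single_document  # assumed import path

docs = load_single_document("tests/sample_data/sample.pdf", pdf_markdown=True)
# asking the loader to also infer and extract tables:
docs = load_single_document("tests/sample_data/sample.pdf", infer_table_structure=True)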
Ingester.get_embedding_model
Ingester.get_embedding_model ()
Returns the langchain_huggingface.HuggingFaceEmbeddings instance
Ingester.get_db
Ingester.get_db ()
Returns the langchain_chroma.Chroma instance
Ingester.get_ingested_files
Ingester.get_ingested_files ()
Returns a list of files previously added to the vector database (typically via LLM.ingest)
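These accessors can be combined to inspect an existing vector database (a sketch; get_db() may not return a populated store before any ingestion):

ingester = Ingester()
emb = ingester.get_embedding_model()   # HuggingFaceEmbeddings instance
db = ingester.get_db()                 # Chroma instance
print(ingester.get_ingested_files())   # files previously added via ingest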
Ingester.ingest
Ingester.ingest (source_directory:str, chunk_size:int=500,
chunk_overlap:int=50, ignore_fn:Optional[Callable]=None,
pdf_unstructured:bool=False, **kwargs)
Ingests all documents in source_directory (previously-ingested documents are ignored). When retrieved, the Document objects will each have a metadata dict with the absolute path to the file in metadata["source"]. Extra kwargs are fed to ingest.load_single_document.
source_directory (str): path to folder containing document store
chunk_size (int, default 500): text is split to this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter
chunk_overlap (int, default 50): character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter
ignore_fn (Optional[Callable], default None): optional function that accepts the file path (including file name) as input and returns True if the file path should not be ingested
pdf_unstructured (bool, default False): if True, use unstructured for PDF extraction
**kwargs: extra keyword arguments fed to ingest.load_single_document
Returns: None
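For example, to ingest only PDFs from a folder (the filter is illustrative; returning True means the file is skipped):

ingester = Ingester()
ingester.ingest(
    "tests/sample_data",
    ignore_fn=lambda path: not path.lower().endswith(".pdf"),
)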
Ingester.store_documents
Ingester.store_documents (documents)
Stores instances of langchain_core.documents.base.Document in the vector database
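A hypothetical sketch of storing pre-built Document objects directly:

from langchain_core.documents import Document

ingester = Ingester()
ingester.store_documents(
    [Document(page_content="Example text.", metadata={"source": "/path/to/file.txt"})]  # illustrative
)

More commonly, ingestion is driven end-to-end from a folder, as in the example below.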
ingester = Ingester()
ingester.ingest("tests/sample_data")
2023-09-12 11:35:20.660565: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Creating new vectorstore at /home/amaiya/onprem_data/vectordb
Loading documents from sample_data
Loading new documents: 100%|██████████████████████| 2/2 [00:00<00:00, 16.76it/s]
Loaded 11 new documents from sample_data
Split into 62 chunks of text (max. 500 chars each)
Creating embeddings. May take some minutes...
Ingestion complete! You can now query your documents using the LLM.ask method