Load documents and split them into chunks. Extra kwargs are fed to `langchain_community.document_loaders.pdf.UnstructuredPDFLoader` when `pdf_use_unstructured` is `True`.
|                  | Type     | Default | Details |
|------------------|----------|---------|---------|
| source_directory | str      |         | path to folder containing document store |
| chunk_size       | int      | 500     | text is split into chunks of this many characters by `langchain.text_splitter.RecursiveCharacterTextSplitter` |
| chunk_overlap    | int      | 50      | character overlap between chunks in `langchain.text_splitter.RecursiveCharacterTextSplitter` |
| ignored_files    | List     | []      | list of files to ignore |
| ignore_fn        | Optional | None    | callable that accepts the file path (including file name) as input; the file is ignored if it returns True |
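A minimal sketch of calling this loader/splitter is shown below. The name `process_documents` is assumed from the description above and may differ in your installed version of `onprem.ingest`; the `ignore_fn` lambda and the `.tmp` extension are illustrative.

```python
# A minimal sketch: load a folder of documents and split them into chunks.
# NOTE: `process_documents` is an assumed name; check onprem.ingest for the
# actual function in your installed version.
from onprem.ingest import process_documents

chunks = process_documents(
    "sample_data",     # source_directory: folder containing the document store
    chunk_size=500,    # max characters per chunk (RecursiveCharacterTextSplitter)
    chunk_overlap=50,  # character overlap between consecutive chunks
    ignore_fn=lambda fpath: fpath.endswith(".tmp"),  # skip files for which this returns True
)
print(f"Produced {len(chunks)} chunks")
```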
Loads all documents from the source documents directory, ignoring specified files. Extra kwargs are fed to `langchain_community.document_loaders.pdf.UnstructuredPDFLoader` when `pdf_use_unstructured` is `True`.
|               | Type     | Default | Details |
|---------------|----------|---------|---------|
| source_dir    | str      |         | path to folder containing documents |
| ignored_files | List     | []      | list of file paths to ignore |
| ignore_fn     | Optional | None    | callable that accepts a file path and returns True for files to ignore |
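A usage sketch follows, assuming this function is exposed as `load_documents` in `onprem.ingest`; the file paths passed to `ignored_files` and the `ignore_fn` condition are hypothetical.

```python
# A minimal sketch, assuming this function is `load_documents` in onprem.ingest.
from onprem.ingest import load_documents

docs = load_documents(
    "sample_data",                                   # source_dir
    ignored_files=["sample_data/already_done.pdf"],  # hypothetical path of a file to skip
    ignore_fn=lambda fpath: "/drafts/" in fpath,     # skip anything in a drafts folder
)
docs = list(docs)  # materialize in case a generator is returned (assumption)
print(f"Loaded {len(docs)} documents")
```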
Load a single document, attempting OCR on PDFs if necessary. Extra kwargs are fed to `langchain_community.document_loaders.pdf.UnstructuredPDFLoader` when `pdf_use_unstructured` is `True`.
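A sketch of loading one file is below; the function name `load_single_document`, the file path, and the assumption that it returns a list of langchain `Document` objects are all unverified illustrations.

```python
# A minimal sketch, assuming this function is `load_single_document` in
# onprem.ingest and that it returns a list of langchain Document objects.
from onprem.ingest import load_single_document

# Hypothetical file; a scanned PDF would exercise the OCR path described above.
docs = load_single_document("sample_data/scanned_report.pdf")
for d in docs:
    print(d.metadata["source"], len(d.page_content))
```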
Ingests all documents in `source_directory` (previously ingested documents are ignored). When retrieved, each `Document` object will have a `metadata` dict containing the absolute path to the file in `metadata["source"]`. Extra kwargs are fed to `langchain_community.document_loaders.pdf.UnstructuredPDFLoader` when `pdf_use_unstructured` is `True`.
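For example, ingesting a folder named `sample_data` looks like the following; the console output below is from such a run (the `LLM` constructor arguments are left at their defaults here):

```python
from onprem import LLM

llm = LLM()                  # default model settings; adjust as needed
llm.ingest("./sample_data")  # previously ingested files are skipped on re-runs
```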
```
2023-09-12 11:35:20.660565: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Creating new vectorstore at /home/amaiya/onprem_data/vectordb
Loading documents from sample_data
Loading new documents: 100%|██████████████████████| 2/2 [00:00<00:00, 16.76it/s]
Loaded 11 new documents from sample_data
Split into 62 chunks of text (max. 500 chars each)
Creating embeddings. May take some minutes...
Ingestion complete! You can now query your documents using the LLM.ask method
```
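Once ingestion completes, the documents can be queried with `LLM.ask`, as the final log line notes. A minimal sketch follows; the structure of the returned value (a dict with an `"answer"` key) is an assumption to verify in your version.

```python
# A minimal sketch of querying the ingested documents; the "answer" key
# in the result dict is an assumption, so inspect `result` in your version.
result = llm.ask("What is this document about?")  # illustrative question
print(result["answer"])
```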