*Extract text from a single document. Will attempt to OCR PDFs, if necessary.
Note that extra kwargs can be supplied to configure the behavior of PDF loaders. For instance, supplying infer_table_structure=True will cause load_single_document to try to infer and extract tables from PDFs. When pdf_unstructured=True and infer_table_structure=True, tables are represented as HTML within the main body of the extracted text. In all other cases where infer_table_structure=True, inferred tables are represented as Markdown and appended to the end of the extracted text.*
|  | Type | Default | Details |
|---|------|---------|---------|
| file_path | str |  | path to file |
| pdf_unstructured | bool | False | use unstructured for PDF extraction if True (will also OCR if necessary) |
| pdf_markdown | bool | False | convert PDFs to Markdown instead of plain text if True |
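A minimal sketch of calling load_single_document with these options. The example file path is hypothetical, and the return value is assumed to be a list of Document objects with page_content and a metadata dict (the metadata["source"] key follows the description below).

```python
# Sketch of single-document extraction; the file path is hypothetical and
# the list-of-Document return value is an assumption.
from onprem.ingest import load_single_document

# Default extraction (PDFs are OCR'd if necessary)
docs = load_single_document("sample_data/report.pdf")

# Extra kwargs configure the PDF loaders: with pdf_unstructured=True and
# infer_table_structure=True, inferred tables appear as HTML within the
# body of the extracted text (see the note above).
docs = load_single_document(
    "sample_data/report.pdf",
    pdf_unstructured=True,
    infer_table_structure=True,
)

for doc in docs:
    print(doc.metadata["source"])
```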
Ingests all documents in source_directory (previously ingested documents are ignored). When retrieved, the Document objects will each have a metadata dict with the absolute path to the file in metadata["source"]. Extra kwargs are fed to ingest.load_single_document.
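A hedged sketch of the ingestion call that produces output like the log below; the default LLM() construction is an assumption, and the sample_data folder matches the log.

```python
# Sketch of folder ingestion; assumes a default LLM() is sufficient here.
from onprem import LLM

llm = LLM()

# Extra kwargs (e.g., pdf_unstructured=True) are forwarded to
# ingest.load_single_document for each file.
llm.ingest("./sample_data")
```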
Creating new vectorstore at /home/amaiya/onprem_data/vectordb
Loading documents from sample_data
Loading new documents: 100%|██████████████████████| 2/2 [00:00<00:00, 16.76it/s]
Loaded 11 new documents from sample_data
Split into 62 chunks of text (max. 500 chars each)
Creating embeddings. May take some minutes...
Ingestion complete! You can now query your documents using the LLM.ask method
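Once ingestion completes, the documents can be queried with LLM.ask. A minimal sketch using the llm instance from the ingestion sketch above; the question string is illustrative and the keys of the returned result ("answer", "source_documents") are assumptions.

```python
# Sketch of querying the ingested documents; the result keys shown here
# ("answer", "source_documents") are assumptions.
result = llm.ask("What is this document about?")
print(result["answer"])
for doc in result["source_documents"]:
    print(doc.metadata["source"])
```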