The load_single_document function can extract text from a range of document formats (e.g., PDFs, Microsoft PowerPoint, and Microsoft Word). It is automatically invoked when calling LLM.ingest. Extracted text is represented as LangChain Document objects, where Document.page_content stores the extracted text and Document.metadata stores any extracted document metadata.
For PDFs, in particular, a number of different options are available depending on your use case.
Fast PDF Extraction (default)
Pro: Fast
Con: Does not infer/retain structure of tables in PDF documents
from onprem.ingest import load_single_document
docs = load_single_document('tests/sample_data/ktrain_paper/ktrain_paper.pdf')
docs[0].metadata
Automatic OCR of PDFs
Pro: Automatically extracts text from scanned PDFs
Con: Slow
The load_single_document function will automatically OCR PDFs that require it (i.e., PDFs that are scans of hard-copy documents). If a document is OCR’ed during extraction, the metadata['ocr'] field is set to True.
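As a rough illustration of how the ocr flag might be used downstream, the sketch below separates OCR’ed documents from the rest. SimpleDoc is a hypothetical stand-in for a LangChain Document (only the page_content and metadata attributes are assumed), and partition_by_ocr is an illustrative helper, not part of onprem. It also assumes that a missing 'ocr' key means the document was not OCR’ed.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a LangChain Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def partition_by_ocr(docs):
    """Split documents into (ocr_docs, other_docs) based on metadata['ocr'].

    Assumption: a missing 'ocr' key is treated as "not OCR'ed".
    """
    ocr_docs, other_docs = [], []
    for doc in docs:
        if doc.metadata.get("ocr"):
            ocr_docs.append(doc)
        else:
            other_docs.append(doc)
    return ocr_docs, other_docs


# Toy documents standing in for the output of load_single_document.
docs = [
    SimpleDoc("text from a scanned page", {"ocr": True}),
    SimpleDoc("text from a born-digital page", {}),
]
ocr_docs, other_docs = partition_by_ocr(docs)
print(len(ocr_docs), len(other_docs))
```

With real data, docs would be the list returned by load_single_document; since OCR’ed text tends to be noisier, flagging it this way may help when reviewing extraction quality.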
Markdown Conversion in PDFs
Converting to Markdown can facilitate downstream tasks like question-answering. For instance, when supplying pdf_markdown=True to LLM.ingest, documents are chunked in a Markdown-aware fashion (e.g., the abstract of a research paper tends to be kept together in a single chunk instead of being split up). Note that Markdown will not be extracted if the document requires OCR.
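To give a feel for what Markdown-aware chunking means, here is a minimal sketch that starts a new chunk at every heading, so a section such as an abstract stays together. This is not the chunking logic used by LLM.ingest; split_markdown_by_heading is a hypothetical helper, and it deliberately ignores complications such as '#' characters inside code blocks or sections too long for a single chunk.

```python
import re


def split_markdown_by_heading(markdown_text):
    """Split Markdown into chunks, starting a new chunk at each heading line.

    Simplified illustration: each heading and the text beneath it stay together.
    """
    chunks, current = [], []
    for line in markdown_text.splitlines():
        # A line starting with 1-6 '#' characters and a space is treated as a heading.
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that ended up empty after stripping whitespace.
    return [c for c in chunks if c]


sample = """# Title

## Abstract
First sentence of the abstract.
Second sentence of the abstract.

## Introduction
Opening paragraph."""

chunks = split_markdown_by_heading(sample)
for chunk in chunks:
    print(repr(chunk))
```

A fixed-size splitter applied to the same text could cut the abstract in half, whereas here both abstract sentences land in the same chunk.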
Inferring Table Structure in PDFs
Pro: Makes it easier for LLMs to analyze information in tables
Con: Slower than default PDF extraction
When supplying infer_table_structure=True to either load_single_document or LLM.ingest, tables are inferred and extracted from PDFs using a TableTransformer model. Tables are represented as Markdown (or HTML if Markdown conversion is not possible).
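For readers unfamiliar with the Markdown table format, the sketch below renders a small table as Markdown. It is not the TableTransformer pipeline and does not reproduce how onprem serializes tables; table_to_markdown is a hypothetical helper, and the cell values are toy data chosen only to show the shape of the output.

```python
def table_to_markdown(header, rows):
    """Render a header and rows of cells as a Markdown (pipe) table."""

    def fmt(cells):
        # Escape pipes so cell text cannot break the column layout.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    # Header row, then the separator row that Markdown tables require.
    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


# Toy table used purely for illustration.
md = table_to_markdown(["Format", "Extension"], [["PDF", ".pdf"], ["Word", ".docx"]])
print(md)
```

Keeping rows and columns explicit like this is what makes tabular content easier for an LLM to read than text in which the cells have been run together.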
Parsing Extracted Text Into Sentences or Paragraphs
For some analyses (e.g., using prompts for information extraction), it may be useful to parse the text extracted from documents into individual sentences or paragraphs. This can be accomplished using the segment function:
from onprem.ingest import load_single_document
from onprem.utils import segment
text = load_single_document('tests/sample_data/sotu/state_of_the_union.txt')[0].page_content
segment(text, unit='paragraph')[0]
'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.'
segment(text, unit='sentence')[0]
'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.'
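One way the segmented text might feed prompt-based information extraction is to fill a prompt template once per sentence. The sketch below assumes the sentences have already been produced (the list is a stand-in for the output of segment, using text from the paragraph shown above); build_prompts and the template wording are hypothetical and not part of onprem.

```python
def build_prompts(units, template):
    """Fill a prompt template once per text unit (sentence or paragraph)."""
    # Skip empty or whitespace-only units.
    return [template.format(text=u.strip()) for u in units if u.strip()]


# Stand-in for the output of segment(text, unit='sentence').
sentences = [
    "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.",
    "Members of Congress and the Cabinet.",
]

# Hypothetical prompt template, for illustration only.
template = "List any people or offices mentioned in this sentence:\n{text}"

prompts = build_prompts(sentences, template)
for prompt in prompts:
    print(prompt)
    print("---")
```

Each resulting prompt could then be sent to a model individually, which keeps every extraction request focused on a single sentence.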