Text Extraction

The load_single_document function can extract text from a range of document formats (e.g., PDF, Microsoft PowerPoint, and Microsoft Word). It is invoked automatically when calling LLM.ingest. Extracted text is represented as LangChain Document objects, where Document.page_content stores the extracted text and Document.metadata stores any extracted document metadata.

For PDFs in particular, several extraction options are available, depending on your use case.

Fast PDF Extraction (default)

Pro: Fast
Con: Does not infer/retain structure of tables in PDF documents

from onprem.ingest import load_single_document

docs = load_single_document('tests/sample_data/ktrain_paper/ktrain_paper.pdf')
docs[0].metadata

{'source': '/home/amaiya/projects/ghub/onprem/nbs/tests/sample_data/ktrain_paper/ktrain_paper.pdf',
 'page': 0,
 'extension': 'pdf',
 'ocr': False,
 'table': False,
 'markdown': False,
 'table_captions': '',
 'document_title': ''}
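
The 'page' value in the metadata above suggests that text-based PDFs are returned as one Document per page. As a minimal sketch under that assumption, the page texts can be rejoined into a single string. The helper name full_text is hypothetical (it is not part of onprem), and the SimpleNamespace objects only stand in for the Document objects that load_single_document returns.

```python
from types import SimpleNamespace

def full_text(docs, sep="\n\n"):
    """Join page-level documents into one string, ordered by page number."""
    # Assumption: each item exposes .page_content and a 'page' metadata key.
    ordered = sorted(docs, key=lambda d: d.metadata.get("page", 0))
    return sep.join(d.page_content for d in ordered)

# Stand-in objects with the same attributes as a Document (made-up content).
pages = [
    SimpleNamespace(page_content="second page", metadata={"page": 1}),
    SimpleNamespace(page_content="first page", metadata={"page": 0}),
]
print(full_text(pages))
```

With real data, the list returned by load_single_document could be passed to full_text in place of pages.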

Automatic OCR of PDFs

Pro: Automatically extracts text from scanned PDFs
Con: Slow

The load_single_document function will automatically OCR PDFs that require it (i.e., PDFs that are scanned hard copies of documents). If a document is OCR’ed during extraction, the metadata['ocr'] field is set to True.

docs = load_single_document('tests/sample_data/ocr_document/lynn1975.pdf')
docs[0].metadata

{'source': '/home/amaiya/projects/ghub/onprem/nbs/tests/sample_data/ocr_document/lynn1975.pdf',
 'extension': 'pdf',
 'ocr': True,
 'table': False,
 'markdown': False,
 'page': -1,
 'table_captions': '',
 'document_title': ''}
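
Because the OCR status is recorded in each Document's metadata, it can be checked after loading, for example to flag slow or lower-quality extractions. The following is a minimal sketch: ocr_sources is a hypothetical helper (not part of onprem), the file names are made up, and the SimpleNamespace objects stand in for real Document objects.

```python
from types import SimpleNamespace

def ocr_sources(docs):
    """Return the sorted source paths of documents that were OCR'ed."""
    # Relies only on the 'source' and 'ocr' metadata keys shown above.
    return sorted({d.metadata["source"] for d in docs if d.metadata.get("ocr")})

# Stand-in objects with the same attributes as a Document (illustrative paths).
docs = [
    SimpleNamespace(page_content="scanned text",
                    metadata={"source": "scan.pdf", "ocr": True}),
    SimpleNamespace(page_content="digital text",
                    metadata={"source": "paper.pdf", "ocr": False}),
]
print(ocr_sources(docs))
```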

Markdown Conversion in PDFs

Pro: Better chunking for QA
Con: Slower than default PDF extraction

The load_single_document function can convert PDFs to Markdown instead of plain text when pdf_markdown=True is supplied as an argument:

docs = load_single_document('your_pdf_document.pdf', 
                            pdf_markdown=True)

Converting to Markdown can facilitate downstream tasks like question-answering. For instance, when supplying pdf_markdown=True to LLM.ingest, documents are chunked in a Markdown-aware fashion (e.g., the abstract of a research paper tends to be kept together in a single chunk instead of being split up). Note that Markdown will not be extracted if the document requires OCR.
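
To illustrate what Markdown-aware chunking means in general, here is a naive sketch that starts a new chunk at every heading line, so that a section such as an abstract stays together. This is not onprem's actual chunking logic. The function name split_markdown_sections and the sample text are assumptions made for illustration, and the sketch does not handle '#' lines inside fenced code blocks.

```python
import re

def split_markdown_sections(markdown_text):
    """Split Markdown into chunks that each begin at a heading line."""
    chunks, current = [], []
    for line in markdown_text.splitlines():
        # A line starting with 1-6 '#' characters and whitespace opens a section.
        if re.match(r"#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that ended up empty (e.g., leading blank lines).
    return [c for c in chunks if c]

# Made-up sample document.
sample = (
    "# Title\n\n"
    "## Abstract\nFirst sentence. Second sentence.\n\n"
    "## Introduction\nBody text."
)
chunks = split_markdown_sections(sample)
for chunk in chunks:
    print(repr(chunk))
```

A plain character-count splitter could cut the abstract in half; splitting on headings avoids that at the cost of uneven chunk sizes.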

Inferring Table Structure in PDFs

Pro: Makes it easier for LLMs to analyze information in tables
Con: Slower than default PDF extraction

When supplying infer_table_structure=True to either load_single_document or LLM.ingest, tables are inferred and extracted from PDFs using a TableTransformer model. Tables are represented as Markdown (or HTML if Markdown conversion is not possible).

docs = load_single_document('your_pdf_document.pdf', 
                            infer_table_structure=True)
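
One practical benefit of a Markdown table representation is that it is easy to turn back into structured records. The sketch below assumes a simple pipe-delimited Markdown table with a header row. parse_markdown_table is a hypothetical helper (not part of onprem), the sample table is made up, and escaped pipes inside cells are not handled.

```python
def parse_markdown_table(table_md):
    """Parse a simple pipe-delimited Markdown table into a list of row dicts."""
    rows = []
    for line in table_md.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip the separator row, which contains only dashes and colons.
        if all(c and set(c) <= set("-:") for c in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]

# Made-up sample table for illustration.
table_md = """
| Model | Accuracy |
|-------|----------|
| A     | 0.91     |
| B     | 0.87     |
"""
records = parse_markdown_table(table_md)
print(records)
```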

Parsing Extracted Text Into Sentences or Paragraphs

For some analyses (e.g., using prompts for information extraction), it may be useful to parse the text extracted from documents into individual sentences or paragraphs. This can be accomplished using the segment function:

from onprem.ingest import load_single_document
from onprem.utils import segment

# Load the text file and take the text of the first Document
text = load_single_document('tests/sample_data/sotu/state_of_the_union.txt')[0].page_content

# First paragraph of the extracted text
segment(text, unit='paragraph')[0]

'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.  Members of Congress and the Cabinet.  Justices of the Supreme Court.  My fellow Americans.'

segment(text, unit='sentence')[0]

'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.'