from onprem import LLM
from onprem.pipelines import Extractor
pipelines.extractor
Pipline for information extraction
Extractor
Extractor (llm, prompt_template:Optional[str]=None, **kwargs)
Extractor
applies a given prompt to each sentence or paragraph in a document and returns the results.
Type | Default | Details | |
---|---|---|---|
llm | An onprem.LLM object |
||
prompt_template | Optional | None | A model specific prompt_template with a single placeholder named “{prompt}”. If supplied, overrides the prompt_template supplied to the LLM constructor. |
kwargs |
Extractor.apply
Extractor.apply (ex_prompt_template:str, fpath:Optional[str]=None, content:Optional[str]=None, unit:str='paragraph', preproc_fn:Optional[Callable]=None, filter_fn:Optional[Callable]=None, clean_fn:Optional[Callable]=None, pdf_pages:List[int]=[], maxchars=2048, stop:list=[], pdf_use_unstructured:bool=False, **kwargs)
Apply the prompt to each unit
(where a “unit” is either a paragraph or sentence) optionally filtered by filter_fn
. Extra kwargs fed directly to langchain_community.document_loaders.pdf.UnstructuredPDFLoader
when pdf_use_unstructured is True. Results are stored in a pandas.Dataframe
.
Type | Default | Details | |
---|---|---|---|
ex_prompt_template | str | A prompt to apply to each unit in document. Should have a single variable, {text} |
|
fpath | Optional | None | A path to to a single file of interest (e.g., a PDF or MS Word document). Mutually-exclusive with content . |
content | Optional | None | Text content of a document of interest. Mutually-exclusive with fpath . |
unit | str | paragraph | One of {‘sentence’, ‘paragraph’}. |
preproc_fn | Optional | None | Function should accept a text string and returns a new preprocessed input. |
filter_fn | Optional | None | A function that accepts a sentence or paragraph and returns True if prompt should be applied to it. |
clean_fn | Optional | None | A function that accepts a sentence or paragraph and returns “cleaned” version of the text. (applied after filter_fn ) |
pdf_pages | List | [] | If fpath is a PDF document, only apply prompt to text on page numbers listed in pdf_pages (starts at 1). |
maxchars | int | 2048 | units (i.e., paragraphs or sentences) larger than maxchars split. |
stop | list | [] | list of characters to trigger the LLM to stop generating. |
pdf_use_unstructured | bool | False | If True, use unstructured package to extract text from PDF. |
kwargs |
= "<s>[INST] {prompt} [/INST]" # prompt template for Mistral
prompt_template = LLM(model_url='https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf',
llm =33, # change based on your system
n_gpu_layers=False, mute_stream=True,
verbose=prompt_template)
prompt_template= Extractor(llm) extractor
/home/amaiya/mambaforge/envs/llm/lib/python3.9/site-packages/langchain_core/language_models/llms.py:239: DeprecationWarning: callback_manager is deprecated. Please use callbacks instead.
warnings.warn(
= """Extract citations from the following sentences. Return #NA# if there are no citations in the text. Here are some examples:
prompt
[SENTENCE]:pretrained BERT text classifier (Devlin et al., 2018), models for sequence tagging (Lample et al., 2016)
[CITATIONS]:(Devlin et al., 2018), (Lample et al., 2016)
[SENTENCE]:Machine learning (ML) is a powerful tool.
[CITATIONS]:#NA#
[SENTENCE]:Following inspiration from a blog post by Rachel Thomas of fast.ai (Howard and Gugger, 2020), we refer to this as Augmented Machine Learning or AugML
[CITATIONS]:(Howard and Gugger, 2020)
[SENTENCE]:{text}
[CITATIONS]:"""
= """
content For instance, the fit_onecycle method employs a 1cycle policy (Smith, 2018).
"""
= extractor.apply(prompt, content=content, stop=['\n'])
df assert df['Extractions'][0].strip().startswith('(Smith, 2018)')
="""In the case of text, this may involve language-specific preprocessing (e.g., tokenization)."""
content = extractor.apply(prompt, content=content, stop=['\n'])
df assert df['Extractions'][0].strip().startswith('#NA#')