Information Extraction

The pipelines module in OnPrem.LLM includes an Extractor to extract information of interest from a document using an LLM. In this notebook, we will show this module in action.

The Extractor runs multiple intermediate prompts and inferences, so we will set verbose=False and mute_stream=True. We will also set temperature=0 for more consistent outputs. Finally, we will use OpenAI’s GPT-3.5-Turbo for this example, as it performs well out of the box on extraction tasks with minimal prompt engineering.

from onprem import LLM
from onprem.pipelines import Extractor
import pandas as pd
pd.set_option('display.max_colwidth', None)
llm = LLM(model_url='openai://gpt-3.5-turbo', verbose=False, mute_stream=True, temperature=0)
extractor = Extractor(llm)
/home/amaiya/projects/ghub/onprem/onprem/core.py:147: UserWarning: The model you supplied is gpt-3.5-turbo, an external service (i.e., not on-premises). Use with caution, as your data and prompts will be sent externally.
  warnings.warn(f'The model you supplied is {self.model_name}, an external service (i.e., not on-premises). '+\

When using a cloud-based model with OnPrem.LLM, a warning will be issued notifying you that your prompts are being sent off-premises.
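If you prefer not to see this notice on every run, it can be suppressed with Python's standard warnings module. This is an optional sketch using the standard library, not an OnPrem.LLM setting:

import warnings
# Silence the off-premises notice shown above (matches the start of the warning message)
warnings.filterwarnings('ignore', message='The model you supplied is')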

Example: Extracting Institutions from Research Papers

Let’s extract the institutions from arXiv research papers using the prompt below.

prompt = """Extract the names of research institutions (e.g., universities, research labs, corporations, etc.) 
from the following sentence delimited by three backticks. If there are no organizations, return NA.
If there are multiple organizations, separate them with commas.
```{text}```
"""
!wget --user-agent="Mozilla" https://arxiv.org/pdf/2104.12871.pdf -O /tmp/mitchell.pdf -q
df = extractor.apply(prompt, fpath='/tmp/mitchell.pdf', pdf_pages=[1], stop=['\n'])
df.loc[df['Extractions'] != 'NA'].Extractions[0]
'Santa Fe Institute'

The apply method returns a dataframe of texts and prompt responses:

df.head()
Extractions Texts
0 Santa Fe Institute arXiv:2104.12871v2 [cs.AI] 28 Apr 2021 Why AI is Harder Than We Think Melanie Mitchell Santa Fe Institute Santa Fe, NM, USA mm@santafe.edu Abstract Since its beginning in the 1950s, the field of artificial intelligence has cycled several times between periods of optimistic predictions and massive investment (“AI spring”) and periods of disappointment, loss of confi- dence, and reduced funding (“AI winter”). Even with today’s seemingly fast pace of AI breakthroughs, the development of long-promised technologies such as self-driving cars, housekeeping robots, and conversational companions has turned out to be much harder than many people expected. One reason for these repeating cycles is our limited understanding of the nature and complexity of intelligence itself. In this paper I describe four fallacies in common assumptions made by AI researchers, which can lead to overconfident predictions about the field. I conclude by discussing the open questions spurred by these fallacies, including the age-old challenge of imbuing machines with humanlike common sense. Introduction The year 2020 was supposed to herald the arrival of self-driving cars. Five years earlier, a headline in The Guardian predicted that “From 2020 you will become a permanent backseat driver” [1]. In 2016 Business Insider assured us that “10 million self-driving cars will be on the road by 2020” [2]. Tesla Motors CEO Elon Musk promised in 2019 that “A year from now, we’ll have over a million cars with full self-driving, software...everything” [3]. And 2020 was the target announced by several automobile companies to bring self-driving cars to market [4, 5, 6]. Despite attempts to redefine “full self-driving” into existence [7], none of these predictions has come true. It’s worth quoting AI expert Drew McDermott on what can happen when over-optimism about AI systems—in particular, self-driving cars—turns out to be wrong: Perhaps expectations are too high, and... this will eventually result in disaster. [S]uppose that five years from now
1 NA [funding] collapses miserably as autonomous vehicles fail to roll. Every startup company fails. And there’s a big backlash so that you can’t get money for anything connected with AI. Everybody hurriedly changes the names of their research projects to something else. This condition [is] called the “AI Winter” [8]. What’s most notable is that McDermott’s warning is from 1984, when, like today, the field of AI was awash with confident optimism about the near future of machine intelligence. McDermott was writing about a cyclical pattern in the field. New, apparent breakthroughs would lead AI practitioners to predict rapid progress, successful commercialization, and the near-term prospects of “true AI.” Governments and companies would get caught up in the enthusiasm, and would shower the field with research and development funding. AI Spring would be in bloom. When progress stalled, the enthusiasm, funding, and jobs would dry up. AI Winter would arrive. Indeed, about five years after McDermott’s warning, a new AI winter set in. In this chapter I explore the reasons for the repeating cycle of overconfidence followed by disappointment in expectations about AI. I argue that over-optimism among the public, the media, and even experts can 1
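Since apply returns a standard pandas DataFrame, the results can be post-processed however you like. For example, here is a small sketch that collects all non-NA extractions into a Python list and saves the full results to a CSV file (plain pandas; the file path is just an illustration):

# Collect every non-NA extraction into a list (assumes the column names shown above)
institutions = df.loc[df['Extractions'] != 'NA'].Extractions.tolist()

# Optionally persist the full results for later review
df.to_csv('/tmp/mitchell_extractions.csv', index=False)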

Let’s try another:

!wget --user-agent="Mozilla" https://arxiv.org/pdf/2004.10703.pdf -O /tmp/ktrain.pdf -q
df = extractor.apply(prompt, fpath='/tmp/ktrain.pdf', pdf_pages=[1], stop=['\n'])
df.loc[df['Extractions'] != 'NA'].Extractions[0]
'Institute for Defense Analyses'

Let’s try a paper with multiple affiliations.

!wget --user-agent="Mozilla" https://arxiv.org/pdf/2310.06643.pdf -O /tmp/multi-example.pdf -q
df = extractor.apply(prompt, fpath='/tmp/multi-example.pdf', pdf_pages=[1], stop=['\n'])
df.loc[df['Extractions'] != 'NA'].Extractions[0]
'Technical University of Denmark, University of Copenhagen'
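Because the prompt asks for multiple organizations to be separated by commas, the extraction string can be split into a list of institutions. A minimal sketch, assuming the model followed the comma-delimiter instruction:

# Split the comma-separated extraction into individual institution names
extraction = df.loc[df['Extractions'] != 'NA'].Extractions[0]
institutions = [name.strip() for name in extraction.split(',')]
# institutions -> ['Technical University of Denmark', 'University of Copenhagen']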