from onprem import LLM, utils as U
import tempfile
from textwrap import wrap
Document Question-Answering
This example of OnPrem.LLM demonstrates retrieval-augmented generation (RAG).
Basic RAG
STEP 1: Set up the LLM instance
In this notebook, we will use a model called Zephyr-7B-beta, which performs well on RAG tasks. When selecting a model, it is important to inspect the model’s home page and identify the correct prompt format. The prompt format for this model is located here, and we will supply it directly to the LLM
constructor along with the URL to the specific model file we want (i.e., zephyr-7b-beta.Q4_K_M.gguf). We will offload layers to our GPU(s) to speed up inference using the n_gpu_layers
parameter. (For more information on GPU acceleration, see here.) For the purposes of this notebook, we also supply temperature=0
so that there is no variability in outputs. You can increase this value for more creativity in the outputs. Finally, we will choose a non-default location for our vector database.
vectordb_path = tempfile.mkdtemp()

llm = LLM(model_url='https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q4_K_M.gguf',
          prompt_template="<|system|>\n</s>\n<|user|>\n{prompt}</s>\n<|assistant|>",
          n_gpu_layers=-1,
          temperature=0,
          store_type='dense',
          vectordb_path=vectordb_path,
          verbose=False)
llama_new_context_with_model: n_ctx_per_seq (3904) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
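The warning above indicates that the default context window is smaller than the 32768-token window the model was trained with. If your hardware allows, a larger context can be requested at construction time. A minimal sketch, assuming the LLM constructor accepts llama-cpp's n_ctx parameter (check the constructor's documented arguments):

llm = LLM(model_url='https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q4_K_M.gguf',
          n_gpu_layers=-1,
          temperature=0,
          n_ctx=8192,  # assumed llama-cpp context-size parameter
          vectordb_path=vectordb_path)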
Since OnPrem.LLM includes built-in support for Zephyr, an easier way to instantiate the LLM with Zephyr is as follows:
llm = LLM(default_model='zephyr',
          n_gpu_layers=-1,
          temperature=0,
          store_type='dense',
          vectordb_path=vectordb_path)
STEP 2: Ingest Documents
When ingesting documents, they can be stored in one of two ways:
1. a dense vector store: a conventional vector database like Chroma
2. a sparse vector store: a keyword-search engine
Sparse vector stores compute embeddings on-the-fly at inference time. As a result, they sacrifice a small amount of inference speed for significantly faster ingestion, which makes them better suited to larger document sets. Note that sparse vector stores impose the constraint that any passage considered as a source for an answer must share at least one word with the question being asked. You can specify the kind of vector store by supplying either store_type="dense" or store_type="sparse" when creating the LLM above. We use a dense vector store in this example, as shown above; a sparse alternative is sketched below.
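For instance, a minimal sketch of the same Zephyr instantiation backed by a sparse store; only the store_type argument changes:

llm = LLM(default_model='zephyr',
          n_gpu_layers=-1,
          temperature=0,
          store_type='sparse',  # keyword-search engine instead of Chroma
          vectordb_path=vectordb_path)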
For this example, we will download the 2024 National Defense Authorization Act (NDAA) report and ingest it.
U.download('https://www.congress.gov/118/crpt/hrpt125/CRPT-118hrpt125.pdf', '/tmp/ndaa/ndaa.pdf', verify=True)
[██████████████████████████████████████████████████]
"/tmp/ndaa/") llm.ingest(
Creating new vectorstore at /tmp/tmpmnt6g6l8/dense
Loading documents from /tmp/ndaa/
Loading new documents: 100%|██████████████████████| 1/1 [00:00<00:00, 1.62it/s]
Processing and chunking 672 new documents: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 10.22it/s]
Split into 5202 chunks of text (max. 500 chars each for text; max. 2000 chars for tables)
Creating embeddings. May take some minutes...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:17<00:00, 2.95s/it]
Ingestion complete! You can now query your documents using the LLM.ask or LLM.chat methods
STEP 3: Asking Questions to Your Documents
= llm.ask("What is said about artificial intelligence training and education?") result
The context provided discusses the implementation of an AI education strategy required by Section 256 of the National Defense Authorization Act for Fiscal Year 2020. The strategy aims to educate servicemembers in relevant occupational fields, with a focus on data literacy across a broader population within the Department of Defense. The committee encourages the Air Force and Space Force to leverage government-owned training platforms informed by private sector expertise to accelerate learning and career path development. Additionally, the committee suggests expanding existing mobile enabled platforms to train and develop the cyber workforce of the Air Force and Space Force. Overall, there is a recognition that AI continues to be central to warfighting and that proper implementation of these new technologies requires a focus on education and training.
The answer is stored in result['answer']. The documents retrieved from the vector store and used to generate the answer are stored in result['source_documents'].
print('ANSWER:')
print("\n".join(wrap(result['answer'])))
print()
print()
print('REFERENCES')
print()
for d in result['source_documents']:
    print(f"On Page {d.metadata['page']} in {d.metadata['source']}:")
    print(d.page_content)
    print('----------------------------------------')
    print()
ANSWER:
The context provided discusses the implementation of an AI education
strategy required by Section 256 of the National Defense Authorization
Act for Fiscal Year 2020. The strategy aims to educate servicemembers
in relevant occupational fields, with a focus on data literacy across
a broader population within the Department of Defense. The committee
encourages the Air Force and Space Force to leverage government-owned
training platforms informed by private sector expertise to accelerate
learning and career path development. Additionally, the committee
suggests expanding existing mobile enabled platforms to train and
develop the cyber workforce of the Air Force and Space Force. Overall,
there is a recognition that AI continues to be central to warfighting
and that proper implementation of these new technologies requires a
focus on education and training.
REFERENCES
On Page 359 in /tmp/ndaa/ndaa.pdf:
‘‘servicemembers in relevant occupational fields on matters relating
to artificial intelligence.’’
Given the continued centrality of AI to warfighting, the com-
mittee directs the Chief Digital and Artificial Intelligence Officer of
the Department of Defense to provide a briefing to the House Com-
mittee on Armed Services not later than March 31, 2024, on the
implementation status of the AI education strategy, with emphasis
on current efforts underway, such as the AI Primer course within
----------------------------------------
On Page 359 in /tmp/ndaa/ndaa.pdf:
intelligence (AI) and machine learning capabilities available within
the Department of Defense. To ensure the proper implementation
of these new technologies, there must be a focus on data literacy
across a broader population within the Department. Section 256 of
the National Defense Authorization Act for Fiscal Year 2020 (Pub-
lic Law 116–92) required the Department of Defense to develop an
AI education strategy, with the stated objective to educate
----------------------------------------
On Page 102 in /tmp/ndaa/ndaa.pdf:
tificial intelligence and machine learning (AI/ML), and cloud com-
puting. The committee encourages the Air Force and Space Force
to leverage government owned training platforms with curricula in-
formed by private sector expertise to accelerate learning and career
path development.
To that end, the committee encourages the Secretary of the Air
Force to expand existing mobile enabled platforms to train and de-
velop the cyber workforce of Air Force and Space Force. To better
----------------------------------------
On Page 109 in /tmp/ndaa/ndaa.pdf:
70
role of senior official with principal responsibility for artificial intel-
ligence and machine learning. In February 2022, the Department
stood up the Chief Digital and Artificial Intelligence Office to accel-
erate the Department’s adoption of AI. The committee encourages
the Department to build upon this progress and sustain efforts to
research, develop, test, and where appropriate, operationalize AI
capabilities.
Artificial intelligence capabilities of foreign adversaries
----------------------------------------
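As the ingestion message noted, documents can also be queried with LLM.chat, which retains conversation history for follow-up questions. A minimal sketch, assuming LLM.chat accepts a question string and returns the same result dictionary as LLM.ask:

result = llm.chat("What is said about artificial intelligence training and education?")
print(result['answer'])  # assumes the same 'answer' key as LLM.ask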
Advanced Example: NSF Awards
The example above used the default dense vector store, Chroma. By supplying store_type="sparse" to LLM, a sparse vector store (i.e., a keyword-search engine) is used instead. Sparse vector stores index documents faster but require keyword matches between the question (or query) and the sources containing answers. Semantic search is still supported through on-demand dense vectorization in OnPrem.LLM.
In this example, we will instantiate a sparse store directly and customize the ingestion process to include custom fields, using a dataset of 2024 NSF awards.
STEP 1: Download and Pre-Process the NSF Data
NSF awards data is distributed as thousands of JSON files. The code below downloads and parses each JSON file.
import os
import zipfile
import requests
import json
from pathlib import Path
from tqdm.notebook import tqdm
# Step 1: Download the ZIP file
url = "https://www.nsf.gov/awardsearch/download?DownloadFileName=2024&All=true&isJson=true"
zip_path = "/tmp/nsf_awards_2024.zip"

if not os.path.exists(zip_path):
    print("Downloading NSF data...")
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(zip_path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    print("Download complete.")
else:
    print("ZIP file already exists.")
# Step 2: Unzip the file
extract_dir = "nsf_awards_2024"

if not os.path.exists(extract_dir):
    print("Extracting ZIP file...")
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_dir)
    print("Extraction complete.")
else:
    print("Already extracted.")
# Step 3: Function to extract fields from JSON
def extract_fields(data):
    title = data.get("awd_titl_txt", "N/A")
    abstract = data.get("awd_abstract_narration", "N/A")

    pgm_ele = data.get("pgm_ele")
    if isinstance(pgm_ele, list) and pgm_ele:
        category = pgm_ele[0].get("pgm_ele_name", "N/A")
    else:
        category = "N/A"

    # Authors
    authors = []
    for pi in data.get("pi", []):
        full_name = pi.get("pi_full_name", "")
        if full_name:
            authors.append(full_name)
    authors_str = ", ".join(authors) if authors else "N/A"

    # Affiliation
    affiliation = data.get("inst", {}).get("inst_name", "N/A")

    # Amount
    raw_amount = data.get("awd_amount", data.get("tot_intn_awd_amt", None))
    try:
        amount = float(raw_amount)
    except (TypeError, ValueError):
        amount = None

    return {
        "title": title or '',
        "abstract": f'{title or ""}' + '\n\n' + f'{abstract or ""}',
        "category": category,
        "authors": authors_str,
        "affiliation": affiliation,
        "amount": amount
    }
# Step 4: Process all JSON files into a list of dictionaries
output_dir = "/tmp/nsf_text_output"
os.makedirs(output_dir, exist_ok=True)

json_files = list(Path(extract_dir).glob("*.json"))

print(f"Processing {len(json_files)} JSON files...")

nsf_data = []
for json_file in tqdm(json_files):
    with open(json_file, 'r', encoding='utf-8') as f:
        try:
            data = json.load(f)
        except json.JSONDecodeError:
            continue  # skip bad files
    fields = extract_fields(data)
    fields['source'] = str(json_file)
    nsf_data.append(fields)

print("All JSON files processed and saved to list of dictionaries.")
ZIP file already exists.
Already extracted.
Processing 11687 JSON files...
All JSON files processed and saved to list of dictionaries.
STEP 2: Ingest Documents
Let’s now store the NSF awards data in a Whoosh-backed sparse vector store. This is equivalent to supplying store_type="sparse" to LLM. However, we will explicitly create the SparseStore instance so that we can customize the ingestion process for the NSF data.
Since award abstracts are not lengthy, we will forgo chunking the documents (e.g., using either onprem.ingest.chunk_documents or another chunking tool like chonkie) and instead store each award as a single record in the index. (A chunking sketch is shown below for reference.)
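If the abstracts had been long enough to warrant chunking, a generic splitter could be applied to the list of Documents before indexing. A purely illustrative sketch using LangChain's RecursiveCharacterTextSplitter (an assumption about available tooling; onprem.ingest.chunk_documents is the package's own utility, whose exact signature is not shown here):

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Hypothetical: split each Document into ~500-character chunks,
# where docs is the list of langchain Documents built in the next cell
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunked_docs = splitter.split_documents(docs)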
from onprem.ingest import VectorStoreFactory, helpers, chunk_documents
store = VectorStoreFactory.create(
    kind='whoosh',
    persist_location='/tmp/nsf_store'
)

docs = []
for d in nsf_data:
    doc = helpers.doc_from_dict(d, content_field='abstract')
    docs.append(doc)
store.add_documents(docs)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 11687/11687 [00:10<00:00, 1119.63it/s]
Let’s examine the total number of awards stored.
store.get_size()
11687
STEP 3: Explore NSF Awards
We can explore NSF awards by either using an LLM or querying the vector store directly.
The NSF buckets awards into different categories. Let’s examine all the material-related categories.
set([d['category'] for d in store.search('category:*material*', limit=100)['hits']])
{'BIOMATERIALS PROGRAM',
'ELECTRONIC/PHOTONIC MATERIALS',
'Mechanics of Materials and Str',
'SOLID STATE & MATERIALS CHEMIS'}
Let’s see how many of the material-related awards mention AI.
One of the advantages of sparse vector stores is the ability to easily use complex boolean queries to target specific documents.
'("machine learning" OR "artificial intelligence") AND category:*material*', limit=100)['total_hits'] store.search(
15
We will now use an LLM to summarize how AI is utilized in this research.
Since NSF awards data are publicly available, we will use OpenAI’s GPT-4o-mini, a cloud LLM.
from onprem import LLM
llm = LLM('openai/gpt-4o-mini')
llm.load_vectorstore(custom_vectorstore=store)
result = llm.ask('How is artificial intelligence and machine learning used in these research projects?',
                 limit=15,
                 where_document='("machine learning" OR "artificial intelligence") AND category:*material*')
Artificial intelligence (AI) and machine learning (ML) are utilized in various research projects described in the provided context in several ways:
1. **Data-Driven Approaches**: Many projects leverage AI techniques to analyze complex datasets and identify patterns that are not easily discernible through traditional methods. For example, in the project on engineered photonic materials, AI is used to develop new materials with tailored properties by consolidating information on material compositions and geometries.
2. **Model Development and Prediction**: AI and ML are employed to create predictive models that can simulate the behavior of materials under different conditions. The project on recycled polymers utilizes AI to predict deformation and failure mechanisms in recyclates, enhancing their mechanical performance.
3. **Optimization**: Machine learning algorithms are used for optimizing the design and synthesis processes of materials. In the project focused on luminescent biomaterials, iterative Bayesian optimization is applied to screen proteins for desired luminescent properties, facilitating the discovery of new materials.
4. **Enhanced Characterization**: In the study of nanofibers, machine learning aids in understanding and mitigating defects that compromise strength by predicting stress fields around nanovoids and assessing the impact of atomic crosslinks on material properties.
5. **Multiscale Simulations**: AI enhances multiscale modeling approaches, where simulations at different scales (from atomic to macro) are integrated to provide insights into material performance. For instance, the project on 3D printed polymer composites combines computational modeling with experimental data to understand fracture mechanisms.
6. **Education and Workforce Development**: Several projects include educational components that involve teaching AI and ML techniques to students from diverse backgrounds, thereby preparing the next generation of engineers and scientists with skills relevant to emerging technologies.
Overall, AI and ML are essential tools in these projects, facilitating advancements in material science, improving predictive capabilities, and enhancing the understanding of complex physical phenomena.
Awards used to answer the question are shown below.
for d in result['source_documents']:
    print(d.metadata['title'])
Conference: Uncertainty Quantification for Machine Learning Integrated Physics Modeling (UQ-MLIP 2024); Arlington, Virginia; 12-14 August 2024
Collaborative Research: DMREF: Accelerating the Design and Development of Engineered Photonic Materials based on a Data-Driven Deep Learning Approach
Collaborative Research: DMREF: Accelerating the Design and Development of Engineered Photonic Materials based on a Data-Driven Deep Learning Approach
EAGER: Generative AI for Learning Emergent Complexity in Mechanics-driven Coupled Physics Problems
CAREER: Investigating the Role of Microstructure in the High Strain Rate Behavior of Stable Nanocrystalline Alloys
Conference: 10th International Conference on Spectroscopic Ellipsometry
CAREER: Informed Testing — From Full-Field Characterization of Mechanically Graded Soft Materials to Student Equity in the Classroom
2024 Solid State Chemistry Gordon Research Conference and Gordon Research Seminar
Designing Luminescent Biomaterials from First Principles
CAREER: Recycled Polymers of Enhanced Strength and Toughness: Predicting Failure and Unraveling Deformation to Enable Circular Transitions
Collaborative Research: DMREF: Organic Materials Architectured for Researching Vibronic Excitations with Light in the Infrared (MARVEL-IR)
Designing Pyrolyzed Nanofibers at the Atomic Level: Toward Synthesis of Ultra-high-strength Nano-carbon
CAREER: Design and synthesis of functional van der Waals magnets
Integrated Multiscale Computational and Experimental Investigations on Fracture of Additively Manufactured Polymer Composites