Document Question-Answering

This example of OnPrem.LLM demonstrates retrieval-augmented generation (RAG).

Basic RAG

STEP 1: Set Up the LLM Instance

In this notebook, we will use a model called Zephyr-7B-beta, which performs well on RAG tasks. When selecting a model, it is important to inspect the model’s home page and identify the correct prompt format. The prompt format for this model is located here, and we will supply it directly to the LLM constructor along with the URL to the specific model file we want (i.e., zephyr-7b-beta.Q4_K_M.gguf). We will offload layers to our GPU(s) to speed up inference using the n_gpu_layers parameter. (For more information on GPU acceleration, see here.) For the purposes of this notebook, we also supply temperature=0 so that there is no variability in outputs. You can increase this value for more creativity in the outputs. Finally, we will choose a non-default location for our vector database.

from onprem import LLM, utils as U
import tempfile
from textwrap import wrap
vectordb_path = tempfile.mkdtemp()

llm = LLM(model_url='https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q4_K_M.gguf', 
          prompt_template= "<|system|>\n</s>\n<|user|>\n{prompt}</s>\n<|assistant|>",
          n_gpu_layers=-1,
          temperature=0,
          store_type='dense',
          vectordb_path=vectordb_path,
          verbose=False)
llama_new_context_with_model: n_ctx_per_seq (3904) < n_ctx_train (32768) -- the full capacity of the model will not be utilized

Since OnPrem.LLM includes built-in support for Zephyr, an easier way to instantiate the LLM with Zephyr is as follows:

llm = LLM(default_model='zephyr', 
          n_gpu_layers=-1,
          temperature=0,
          store_type='dense',
          vectordb_path=vectordb_path)

STEP 2: Ingest Documents

When ingesting documents, they can be stored in one of two ways:

1. a dense vector store: a conventional vector database like Chroma
2. a sparse vector store: a keyword-search engine

Sparse vector stores compute embeddings on-the-fly at inference time. As a result, they sacrifice a small amount of inference speed for significantly faster ingestion, which makes them better suited to larger document sets. Note that sparse vector stores carry the constraint that any passage considered as a source for an answer must share at least one word with the question being asked. You can specify the kind of vector store by supplying either store_type="dense" or store_type="sparse" when creating the LLM above. We use a dense vector store in this example, as shown above; a sparse setup is sketched below for reference.
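A minimal sketch of the sparse alternative (the same constructor as above with store_type changed; not executed in this notebook):

# Sketch: identical setup, but documents would be indexed in a sparse
# (keyword-search) store instead of a dense vector database like Chroma.
sparse_llm = LLM(default_model='zephyr',
                 n_gpu_layers=-1,
                 temperature=0,
                 store_type='sparse',
                 vectordb_path=vectordb_path)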

For this example, we will download the 2024 National Defense Authorization Act (NDAA) report and ingest it.

U.download('https://www.congress.gov/118/crpt/hrpt125/CRPT-118hrpt125.pdf', '/tmp/ndaa/ndaa.pdf', verify=True)
[██████████████████████████████████████████████████]
llm.ingest("/tmp/ndaa/")
Creating new vectorstore at /tmp/tmpmnt6g6l8/dense
Loading documents from /tmp/ndaa/
Loading new documents: 100%|██████████████████████| 1/1 [00:00<00:00,  1.62it/s]
Processing and chunking 672 new documents: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 10.22it/s]
Split into 5202 chunks of text (max. 500 chars each for text; max. 2000 chars for tables)
Creating embeddings. May take some minutes...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:17<00:00,  2.95s/it]
Ingestion complete! You can now query your documents using the LLM.ask or LLM.chat methods

STEP 3: Asking Questions to Your Documents

result = llm.ask("What is said about artificial intelligence training and education?")

The context provided discusses the implementation of an AI education strategy required by Section 256 of the National Defense Authorization Act for Fiscal Year 2020. The strategy aims to educate servicemembers in relevant occupational fields, with a focus on data literacy across a broader population within the Department of Defense. The committee encourages the Air Force and Space Force to leverage government-owned training platforms informed by private sector expertise to accelerate learning and career path development. Additionally, the committee suggests expanding existing mobile enabled platforms to train and develop the cyber workforce of the Air Force and Space Force. Overall, there is a recognition that AI continues to be central to warfighting and that proper implementation of these new technologies requires a focus on education and training.

The answer is stored in result['answer']. The documents retrieved from the vector store and used to generate the answer are stored in result['source_documents'].

print('ANSWER:')
print("\n".join(wrap(result['answer'])))
print()
print()
print('REFERENCES')
print()
for d in result['source_documents']:
    print(f"On Page {d.metadata['page']} in {d.metadata['source']}:")
    print(d.page_content)
    print('----------------------------------------')
    print()
ANSWER:
 The context provided discusses the implementation of an AI education
strategy required by Section 256 of the National Defense Authorization
Act for Fiscal Year 2020. The strategy aims to educate servicemembers
in relevant occupational fields, with a focus on data literacy across
a broader population within the Department of Defense. The committee
encourages the Air Force and Space Force to leverage government-owned
training platforms informed by private sector expertise to accelerate
learning and career path development. Additionally, the committee
suggests expanding existing mobile enabled platforms to train and
develop the cyber workforce of the Air Force and Space Force. Overall,
there is a recognition that AI continues to be central to warfighting
and that proper implementation of these new technologies requires a
focus on education and training.


REFERENCES

On Page 359 in /tmp/ndaa/ndaa.pdf:
‘‘servicemembers in relevant occupational fields on matters relating 
to artificial intelligence.’’ 
Given the continued centrality of AI to warfighting, the com-
mittee directs the Chief Digital and Artificial Intelligence Officer of 
the Department of Defense to provide a briefing to the House Com-
mittee on Armed Services not later than March 31, 2024, on the 
implementation status of the AI education strategy, with emphasis 
on current efforts underway, such as the AI Primer course within
----------------------------------------

On Page 359 in /tmp/ndaa/ndaa.pdf:
intelligence (AI) and machine learning capabilities available within 
the Department of Defense. To ensure the proper implementation 
of these new technologies, there must be a focus on data literacy 
across a broader population within the Department. Section 256 of 
the National Defense Authorization Act for Fiscal Year 2020 (Pub-
lic Law 116–92) required the Department of Defense to develop an 
AI education strategy, with the stated objective to educate
----------------------------------------

On Page 102 in /tmp/ndaa/ndaa.pdf:
tificial intelligence and machine learning (AI/ML), and cloud com-
puting. The committee encourages the Air Force and Space Force 
to leverage government owned training platforms with curricula in-
formed by private sector expertise to accelerate learning and career 
path development. 
To that end, the committee encourages the Secretary of the Air 
Force to expand existing mobile enabled platforms to train and de-
velop the cyber workforce of Air Force and Space Force. To better
----------------------------------------

On Page 109 in /tmp/ndaa/ndaa.pdf:
70 
role of senior official with principal responsibility for artificial intel-
ligence and machine learning. In February 2022, the Department 
stood up the Chief Digital and Artificial Intelligence Office to accel-
erate the Department’s adoption of AI. The committee encourages 
the Department to build upon this progress and sustain efforts to 
research, develop, test, and where appropriate, operationalize AI 
capabilities. 
Artificial intelligence capabilities of foreign adversaries
----------------------------------------
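The ingestion message earlier also mentioned LLM.chat, which, unlike LLM.ask, retains conversational context across questions. A minimal sketch, assuming chat shares ask's calling convention (the follow-up prompt is hypothetical):

# Sketch (assumption): chat keeps conversational history, so a follow-up
# question can refer back to the previous answer.
result = llm.chat("What is said about artificial intelligence training and education?")
followup = llm.chat("Can you summarize that in one sentence?")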

Advanced Example: NSF Awards

The example above employed the default dense vector store, Chroma. By supplying store_type="sparse" to LLM, a sparse vector store (i.e., a keyword search engine) is used instead. Sparse vector stores index documents faster but require keyword matches between the question (or query) and the sources containing answers. Semantic search is still supported through on-demand dense vectorization in OnPrem.LLM.

In this example, we will instantiate a sparse store directly and customize the ingestion process to include custom fields, using a dataset of 2024 NSF awards.

STEP 1: Download and Pre-Process the NSF Data

The NSF awards data is stored as thousands of JSON files. The code below downloads the data and parses each JSON file.

import os
import zipfile
import requests
import json
from pathlib import Path
from tqdm.notebook import tqdm

# Step 1: Download the ZIP file
url = "https://www.nsf.gov/awardsearch/download?DownloadFileName=2024&All=true&isJson=true"
zip_path = "/tmp/nsf_awards_2024.zip"

if not os.path.exists(zip_path):
    print("Downloading NSF data...")
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(zip_path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    print("Download complete.")
else:
    print("ZIP file already exists.")

# Step 2: Unzip the file
extract_dir = "nsf_awards_2024"

if not os.path.exists(extract_dir):
    print("Extracting ZIP file...")
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_dir)
    print("Extraction complete.")
else:
    print("Already extracted.")

# Step 3: Function to extract fields from JSON
def extract_fields(data):
    title = data.get("awd_titl_txt", "N/A")
    abstract = data.get("awd_abstract_narration", "N/A")
    
    pgm_ele = data.get("pgm_ele")
    if isinstance(pgm_ele, list) and pgm_ele:
        category = pgm_ele[0].get("pgm_ele_name", "N/A")
    else:
        category = "N/A"

    # Authors
    authors = []
    for pi in data.get("pi", []):
        full_name = pi.get("pi_full_name", "")
        if full_name:
            authors.append(full_name)
    authors_str = ", ".join(authors) if authors else "N/A"

    # Affiliation
    affiliation = data.get("inst", {}).get("inst_name", "N/A")

    # Amount
    raw_amount = data.get("awd_amount", data.get("tot_intn_awd_amt", None))
    try:
        amount = float(raw_amount)
    except (TypeError, ValueError):
        amount = None

    return {
        "title": title or '',
        "abstract": f'{title or ""}' + '\n\n' + f'{abstract or ""}',
        "category": category,
        "authors": authors_str,
        "affiliation": affiliation,
        "amount": amount
    }

# Step 4: Process all JSON files into a list of dictionaries

json_files = list(Path(extract_dir).glob("*.json"))

print(f"Processing {len(json_files)} JSON files...")

nsf_data = []
for json_file in tqdm(json_files):
    with open(json_file, 'r', encoding='utf-8') as f:
        try:
            data = json.load(f)
        except json.JSONDecodeError:
            continue  # skip bad files

    fields = extract_fields(data)
    fields['source'] = str(json_file)
    nsf_data.append(fields)

print("All JSON files processed and saved to list of dictionaries.")
ZIP file already exists.
Already extracted.
Processing 11687 JSON files...
All JSON files processed and saved to list of dictionaries.

STEP 2: Ingest Documents

Let’s now store this NSF awards data in a Whoosh-backed sparse vector store. This is equivalent to supplying store_type="sparse" to LLM; however, we will explicitly create the SparseStore instance so that we can customize the ingestion process for the NSF data.

Since award abstracts are not lengthy, we will forgo chunking the documents (e.g., using either onprem.ingest.chunk_documents or another chunking tool like chonkie) and instead store each award as a single record in the index. (A hypothetical chunking sketch follows the ingestion code below.)

from onprem.ingest import VectorStoreFactory, helpers, chunk_documents
store = VectorStoreFactory.create(
    kind='whoosh',
    persist_location='/tmp/nsf_store'
)
docs = []
for d in nsf_data:
    doc = helpers.doc_from_dict(d, content_field='abstract')
    docs.append(doc)
store.add_documents(docs)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 11687/11687 [00:10<00:00, 1119.63it/s]
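Had the abstracts been lengthy, they could have been chunked before indexing, as mentioned above. A hypothetical sketch (the chunk_size/chunk_overlap keyword arguments are assumptions about chunk_documents, so consult the onprem.ingest documentation for the exact signature):

# Hypothetical sketch -- not executed in this example.
# Assumes chunk_documents accepts a list of Documents plus size parameters.
chunks = chunk_documents(docs, chunk_size=500, chunk_overlap=50)
store.add_documents(chunks)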

Let’s examine the total number of awards stored.

store.get_size()
11687

STEP 3: Explore NSF Awards

We can explore NSF awards by either using an LLM or querying the vector store directly.

The NSF buckets awards into different categories. Let’s examine all the material-related categories.

set([d['category'] for d in store.search('category:*material*', limit=100)['hits']])
{'BIOMATERIALS PROGRAM',
 'ELECTRONIC/PHOTONIC MATERIALS',
 'Mechanics of Materials and Str',
 'SOLID STATE & MATERIALS CHEMIS'}

Let’s see how many of the material-related awards mention AI.

One of the advantages of sparse vector stores is the ability to easily use complex boolean queries to target specific documents.

store.search('("machine learning" OR "artificial intelligence") AND category:*material*', limit=100)['total_hits']
15
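Other boolean operators and field-scoped queries follow the same syntax. For example (illustrative queries whose hit counts will vary; assumes the title field stored during ingestion is searchable):

# NOT excludes documents containing a term:
store.search('"machine learning" NOT "neural network"', limit=100)['total_hits']
# Scope a phrase match to the indexed title field:
store.search('title:"machine learning"', limit=100)['total_hits']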

We will now use an LLM to summarize how AI is utilized in this research.

Since the NSF awards data are publicly available, we will use OpenAI’s GPT-4o-mini, a cloud LLM.

from onprem import LLM
llm = LLM('openai/gpt-4o-mini')
llm.load_vectorstore(custom_vectorstore=store)
result = llm.ask('How are artificial intelligence and machine learning used in these research projects?', 
                 limit=15,
                 where_document='("machine learning" OR "artificial intelligence") AND category:*material*')
Artificial intelligence (AI) and machine learning (ML) are utilized in various research projects described in the provided context in several ways:

1. **Data-Driven Approaches**: Many projects leverage AI techniques to analyze complex datasets and identify patterns that are not easily discernible through traditional methods. For example, in the project on engineered photonic materials, AI is used to develop new materials with tailored properties by consolidating information on material compositions and geometries.

2. **Model Development and Prediction**: AI and ML are employed to create predictive models that can simulate the behavior of materials under different conditions. The project on recycled polymers utilizes AI to predict deformation and failure mechanisms in recyclates, enhancing their mechanical performance.

3. **Optimization**: Machine learning algorithms are used for optimizing the design and synthesis processes of materials. In the project focused on luminescent biomaterials, iterative Bayesian optimization is applied to screen proteins for desired luminescent properties, facilitating the discovery of new materials.

4. **Enhanced Characterization**: In the study of nanofibers, machine learning aids in understanding and mitigating defects that compromise strength by predicting stress fields around nanovoids and assessing the impact of atomic crosslinks on material properties.

5. **Multiscale Simulations**: AI enhances multiscale modeling approaches, where simulations at different scales (from atomic to macro) are integrated to provide insights into material performance. For instance, the project on 3D printed polymer composites combines computational modeling with experimental data to understand fracture mechanisms.

6. **Education and Workforce Development**: Several projects include educational components that involve teaching AI and ML techniques to students from diverse backgrounds, thereby preparing the next generation of engineers and scientists with skills relevant to emerging technologies.

Overall, AI and ML are essential tools in these projects, facilitating advancements in material science, improving predictive capabilities, and enhancing the understanding of complex physical phenomena.

Awards used to answer the question are shown below.

for d in result['source_documents']:
    print(d.metadata['title'])
Conference: Uncertainty Quantification for Machine Learning Integrated Physics Modeling (UQ-MLIP 2024); Arlington, Virginia; 12-14 August 2024
Collaborative Research: DMREF: Accelerating the Design and Development of Engineered Photonic Materials based on a Data-Driven Deep Learning Approach
Collaborative Research: DMREF: Accelerating the Design and Development of Engineered Photonic Materials based on a Data-Driven Deep Learning Approach
EAGER: Generative AI for Learning Emergent Complexity in  Mechanics-driven Coupled Physics Problems
CAREER: Investigating the Role of Microstructure in the High Strain Rate Behavior of Stable Nanocrystalline Alloys
Conference: 10th International Conference on Spectroscopic Ellipsometry
CAREER: Informed Testing — From Full-Field Characterization of Mechanically Graded Soft Materials to Student Equity in the Classroom
2024 Solid State Chemistry Gordon Research Conference and Gordon Research Seminar
Designing Luminescent Biomaterials from First Principles
CAREER: Recycled Polymers of Enhanced Strength and Toughness: Predicting Failure and Unraveling Deformation to Enable Circular Transitions
Collaborative Research: DMREF: Organic Materials Architectured for Researching Vibronic Excitations with Light in the Infrared (MARVEL-IR)
Designing Pyrolyzed Nanofibers at the Atomic Level: Toward Synthesis of Ultra-high-strength Nano-carbon
CAREER: Design and synthesis of functional van der Waals magnets
Integrated Multiscale Computational and Experimental Investigations on Fracture of Additively Manufactured Polymer Composites