OnPrem.LLM
A toolkit for running large language models on-premises using non-public data
OnPrem.LLM is a simple Python package that makes it easier to apply large language models (LLMs) to non-public data on your own machines (possibly behind corporate firewalls). Inspired largely by the privateGPT GitHub repo, OnPrem.LLM is intended to help integrate local LLMs into practical applications.
The full documentation is here.
A Google Colab demo of installing and using OnPrem.LLM is here.
Latest News 🔥
[2024/12] v0.7.0 released and now includes support for structured outputs.
[2024/12] v0.6.0 released and now includes support for PDF to Markdown conversion (which includes Markdown representations of tables), as shown here.
[2024/11] v0.5.0 released and now includes support for running LLMs with Hugging Face transformers as the backend instead of llama.cpp. See this example.
[2024/11] v0.4.0 released and now includes a default_model parameter to more easily use models like Llama-3.1 and Zephyr-7B-beta.
[2024/10] v0.3.0 released and now includes support for concept-focused summarization.
[2024/09] v0.2.0 released and now includes PDF OCR support and better PDF table handling.
[2024/06] v0.1.0 of OnPrem.LLM has been released. Lots of new updates!
- Ability to use with any OpenAI-compatible API (e.g., vLLM, Ollama, OpenLLM, etc.).
- Pipeline for information extraction from raw documents.
- Pipeline for few-shot text classification (i.e., training a classifier on a tiny number of labeled examples) along with the ability to explain few-shot predictions.
- Default model changed to Mistral-7B-Instruct-v0.2
- API augmentations and bug fixes
Install
Once you have installed PyTorch, you can install OnPrem.LLM with the following steps:
- Install llama-cpp-python:
  - CPU: pip install llama-cpp-python (extra steps required for Microsoft Windows)
  - GPU: Follow the instructions below.
- Install OnPrem.LLM: pip install onprem
On GPU-Accelerated Inference
When installing llama-cpp-python with pip install llama-cpp-python, the LLM will run on your CPU. To generate answers much faster, you can run the LLM on your GPU by building llama-cpp-python for your operating system, as described below.
- Linux:
CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
- Mac:
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
- Windows 11: Follow the instructions here.
- Windows Subsystem for Linux (WSL2): Follow the instructions here.
For Linux and Windows, you will need an up-to-date NVIDIA driver along with the CUDA toolkit installed before running the installation commands above.
After following the instructions above, supply the n_gpu_layers=-1 parameter when instantiating an LLM to use your GPU for fast inference:
llm = LLM(n_gpu_layers=-1, ...)
Quantized models with 8B parameters and below can typically run on GPUs with as little as 6GB of VRAM. If a model does not fit on your GPU (e.g., you get a "CUDA Error: Out-of-Memory" error), you can offload a subset of layers to the GPU by experimenting with different values for the n_gpu_layers parameter (e.g., n_gpu_layers=20), as sketched below. Setting n_gpu_layers=-1, as shown above, offloads all layers to the GPU.
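For example, here is a minimal sketch of partial offloading (the value 20 is only illustrative and should be tuned for your GPU):
from onprem import LLM
# offload only 20 layers to the GPU; the remaining layers run on the CPU
llm = LLM(n_gpu_layers=20)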
See the FAQ for extra tips if you experience issues with llama-cpp-python installation.
Note: Installing llama-cpp-python is optional if either of the following is true:
- You use Hugging Face transformers (instead of llama-cpp-python) as the LLM backend by supplying the model_id parameter when instantiating an LLM, as shown here.
- You are using OnPrem.LLM with an LLM being served through an external REST API (e.g., vLLM, OpenLLM, Ollama).
How to Use
Setup
By default, a 7B-parameter model (Mistral-7B-Instruct-v0.2) is downloaded and used. If default_model='llama' is supplied, then a Llama-3.1-8B-Instruct model is automatically downloaded and used instead (which is useful if the default Mistral model struggles with a particular task):
# Llama 3.1 is downloaded here, and the correct prompt template for Llama-3.1 is automatically configured and used
llm = LLM(default_model='llama')
Similarly, supplying default_model='zephyr' will use Zephyr-7B-beta. Of course, you can also easily supply the URL to an LLM of your choosing to LLM (see the code generation example or the FAQ for examples). Any extra parameters supplied to LLM are forwarded directly to llama-cpp-python.
Note: The default context window size (n_ctx) is set to 3900, and the default output size (max_tokens) is set to 512. Both are configurable parameters to LLM; increase them if you have larger prompts or need longer outputs.
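For example, here is a minimal sketch that raises both limits (the values shown are only illustrative):
from onprem import LLM
# larger context window and longer outputs than the defaults (3900 and 512)
llm = LLM(n_ctx=8192, max_tokens=1024)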
Send Prompts to the LLM to Solve Problems
This is an example of few-shot prompting, where we provide an example of what we want the LLM to do.
= """Extract the names of people in the supplied sentences. Here is an example:
prompt Sentence: James Gandolfini and Paul Newman were great actors.
People:
James Gandolfini, Paul Newman
Sentence:
I like Cillian Murphy's acting. Florence Pugh is great, too.
People:"""
= llm.prompt(prompt) saved_output
Cillian Murphy, Florence Pugh.
Additional prompt examples are shown here.
Talk to Your Documents
Answers are generated from the content of your documents (i.e., retrieval-augmented generation or RAG). Here, we will use GPU offloading to speed up answer generation using the default model. However, the Zephyr-7B model may perform even better and respond faster; it is the model used in our example notebook.
from onprem import LLM
llm = LLM(n_gpu_layers=-1)
Step 1: Ingest the Documents into a Vector Database
"./tests/sample_data") llm.ingest(
Creating new vectorstore at /home/amaiya/onprem_data/vectordb
Loading documents from ./sample_data
Loading new documents: 100%|██████████| 3/3 [00:00<00:00, 13.71it/s]
Loaded 12 new documents from ./sample_data
Split into 153 chunks of text (max. 500 chars each)
Creating embeddings. May take some minutes...
100%|██████████| 1/1 [00:02<00:00,  2.49s/it]
Ingestion complete! You can now query your documents using the LLM.ask or LLM.chat methods
Step 2: Answer Questions About the Documents
= """What is ktrain?"""
question = llm.ask(question) result
Ktrain is a low-code machine learning library designed to facilitate the full machine learning workflow from curating and preprocessing inputs to training, tuning, troubleshooting, and applying models. Ktrain is well-suited for domain experts who may have less experience with machine learning and software coding.
The sources used by the model to generate the answer are stored in result['source_documents']:
print("\nSources:\n")
for i, document in enumerate(result["source_documents"]):
    print(f"\n{i+1}.> " + document.metadata["source"] + ":")
    print(document.page_content)
Sources:
1.> /home/amaiya/projects/ghub/onprem/nbs/sample_data/1/ktrain_paper.pdf:
lection (He et al., 2019). By contrast, ktrain places less emphasis on this aspect of au-
tomation and instead focuses on either partially or fully automating other aspects of the
machine learning (ML) workflow. For these reasons, ktrain is less of a traditional Au-
2
2.> /home/amaiya/projects/ghub/onprem/nbs/sample_data/1/ktrain_paper.pdf:
possible, ktrain automates (either algorithmically or through setting well-performing de-
faults), but also allows users to make choices that best fit their unique application require-
ments. In this way, ktrain uses automation to augment and complement human engineers
rather than attempting to entirely replace them. In doing so, the strengths of both are
better exploited. Following inspiration from a blog post1 by Rachel Thomas of fast.ai
3.> /home/amaiya/projects/ghub/onprem/nbs/sample_data/1/ktrain_paper.pdf:
with custom models and data formats, as well.
Inspired by other low-code (and no-
code) open-source ML libraries such as fastai (Howard and Gugger, 2020) and ludwig
(Molino et al., 2019), ktrain is intended to help further democratize machine learning by
enabling beginners and domain experts with minimal programming or data science experi-
4. http://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups
6
4.> /home/amaiya/projects/ghub/onprem/nbs/sample_data/1/ktrain_paper.pdf:
ktrain: A Low-Code Library for Augmented Machine Learning
toML platform and more of what might be called a βlow-codeβ ML platform. Through
automation or semi-automation, ktrain facilitates the full machine learning workflow from
curating and preprocessing inputs (i.e., ground-truth-labeled training data) to training,
tuning, troubleshooting, and applying models. In this way, ktrain is well-suited for domain
experts who may have less experience with machine learning and software coding. Where
Extract Text from Documents
The load_single_document function can extract text from a range of different document formats (e.g., PDFs, Microsoft PowerPoint, Microsoft Word, etc.). It is automatically invoked when calling LLM.ingest. Extracted text is represented as LangChain Document objects, where Document.page_content stores the extracted text and Document.metadata stores any extracted document metadata.
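As a quick illustration, here is a minimal sketch of inspecting both attributes, using one of the sample PDFs from this repository:
from onprem.ingest import load_single_document
docs = load_single_document('tests/sample_data/ktrain_paper/ktrain_paper.pdf')
print(docs[0].page_content[:250])  # first 250 characters of the extracted text
print(docs[0].metadata)            # the extracted document metadata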
For PDFs, in particular, a number of different options are available depending on your use case.
Fast PDF Extraction (default)
- Pro: Fast
- Con: Does not infer/retain structure of tables in PDF documents
from onprem.ingest import load_single_document
docs = load_single_document('tests/sample_data/ktrain_paper/ktrain_paper.pdf')
docs[0].metadata
{'source': '/home/amaiya/projects/ghub/onprem/nbs/sample_data/1/ktrain_paper.pdf',
'file_path': '/home/amaiya/projects/ghub/onprem/nbs/sample_data/1/ktrain_paper.pdf',
'page': 0,
'total_pages': 9,
'format': 'PDF 1.4',
'title': '',
'author': '',
'subject': '',
'keywords': '',
'creator': 'LaTeX with hyperref',
'producer': 'dvips + GPL Ghostscript GIT PRERELEASE 9.22',
'creationDate': "D:20220406214054-04'00'",
'modDate': "D:20220406214054-04'00'",
'trapped': ''}
Automatic OCR of PDFs
- Pro: Automatically extracts text from scanned PDFs
- Con: Slow
The load_single_document function will automatically OCR PDFs that require it (i.e., PDFs that are scanned hard copies of documents). If a document is OCR'ed during extraction, the metadata['ocr'] field will be populated with True.
docs = load_single_document('tests/sample_data/ocr_document/lynn1975.pdf')
docs[0].metadata
{'source': '/home/amaiya/projects/ghub/onprem/nbs/sample_data/4/lynn1975.pdf',
'ocr': True}
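For instance, here is a minimal sketch of branching on that field after extraction:
# check whether OCR was applied during extraction
if docs[0].metadata.get('ocr', False):
    print('This document was OCR-ed from a scanned PDF.')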
Markdown Conversion in PDFs
- Pro: Better chunking for QA
- Con: Slower than default PDF extraction
The load_single_document function can convert PDFs to Markdown instead of plain text by supplying pdf_markdown=True as an argument:
docs = load_single_document('your_pdf_document.pdf', pdf_markdown=True)
Converting to Markdown can facilitate downstream tasks like question-answering. For instance, when supplying pdf_markdown=True to LLM.ingest, documents are chunked in a Markdown-aware fashion (e.g., the abstract of a research paper tends to be kept together in a single chunk instead of being split up), as sketched below. Note that Markdown will not be extracted if the document requires OCR.
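For example, here is a minimal sketch of Markdown-aware ingestion (the folder path is only illustrative):
from onprem import LLM
llm = LLM()
# documents are chunked in a Markdown-aware fashion during ingestion
llm.ingest('./your_pdf_folder', pdf_markdown=True)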
Inferring Table Structure in PDFs
- Pro: Makes it easier for LLMs to analyze information in tables
- Con: Slower than default PDF extraction
When supplying infer_table_structure=True to either load_single_document or LLM.ingest, tables are inferred and extracted from PDFs using a TableTransformer model. Tables are represented as Markdown (or HTML if Markdown conversion is not possible).
docs = load_single_document('your_pdf_document.pdf', infer_table_structure=True)
Parsing Extracted Text Into Sentences or Paragraphs
For some analyses (e.g., using prompts for information extraction), it may be useful to parse the text extracted from documents into individual sentences or paragraphs. This can be accomplished using the segment function:
from onprem.ingest import load_single_document
from onprem.utils import segment
text = load_single_document('tests/sample_data/sotu/state_of_the_union.txt')[0].page_content
segment(text, unit='paragraph')[0]
'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.'
segment(text, unit='sentence')[0]
'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.'
Summarization Pipeline
Summarize your raw documents (e.g., PDFs, MS Word) with an LLM.
Map-Reduce Summarization
Summarize each chunk in a document and then generate a single summary from the individual summaries.
from onprem import LLM
llm = LLM(n_gpu_layers=-1, verbose=False, mute_stream=True)  # disabling viewing of intermediate summarization prompts/inferences
from onprem.pipelines import Summarizer
summ = Summarizer(llm)
resp = summ.summarize('tests/sample_data/ktrain_paper/ktrain_paper.pdf', max_chunks_to_use=5)  # omit max_chunks_to_use parameter to consider entire document
print(resp['output_text'])
Ktrain is an open-source machine learning library that offers a unified interface for various machine learning tasks. The library supports both supervised and non-supervised machine learning, and includes methods for training models, evaluating models, making predictions on new data, and providing explanations for model decisions. Additionally, the library integrates with various explainable AI libraries such as shap, eli5 with lime, and others to provide more interpretable models.
Concept-Focused Summarization
Summarize a large document with respect to a particular concept of interest.
from onprem import LLM
from onprem.pipelines import Summarizer
llm = LLM(default_model='zephyr', n_gpu_layers=-1, verbose=False, temperature=0)
summ = Summarizer(llm)
summary, sources = summ.summarize_by_concept('tests/sample_data/ktrain_paper/ktrain_paper.pdf', concept_description="question answering")
The context provided describes the implementation of an open-domain question-answering system using ktrain, a low-code library for augmented machine learning. The system follows three main steps: indexing documents into a search engine, locating documents containing words in the question, and extracting candidate answers from those documents using a BERT model pretrained on the SQuAD dataset. Confidence scores are used to sort and prune candidate answers before returning results. The entire workflow can be implemented with only three lines of code using ktrain's SimpleQA module. This system allows for the submission of natural language questions and receives exact answers, as demonstrated in the provided example. Overall, the context highlights the ease and accessibility of building sophisticated machine learning models, including open-domain question-answering systems, through ktrain's low-code interface.
Information Extraction Pipeline
Extract information from raw documents (e.g., PDFs, MS Word documents) with an LLM.
from onprem import LLM
from onprem.pipelines import Extractor
# Notice that we're using a cloud-based, off-premises model here! See "OpenAI" section below.
llm = LLM(model_url='openai://gpt-3.5-turbo', verbose=False, mute_stream=True, temperature=0)
extractor = Extractor(llm)
prompt = """Extract the names of research institutions (e.g., universities, research labs, corporations, etc.)
from the following sentence delimited by three backticks. If there are no organizations, return NA.
If there are multiple organizations, separate them with commas.
```{text}```
"""
df = extractor.apply(prompt, fpath='tests/sample_data/ktrain_paper/ktrain_paper.pdf', pdf_pages=[1], stop=['\n'])
df.loc[df['Extractions'] != 'NA'].Extractions[0]
/home/amaiya/projects/ghub/onprem/onprem/core.py:159: UserWarning: The model you supplied is gpt-3.5-turbo, an external service (i.e., not on-premises). Use with caution, as your data and prompts will be sent externally.
warnings.warn(f'The model you supplied is {self.model_name}, an external service (i.e., not on-premises). '+\
'Institute for Defense Analyses'
Few-Shot Classification
Make accurate text classification predictions using only a tiny number of labeled examples.
# create classifier
from onprem.pipelines import FewShotClassifier
clf = FewShotClassifier(use_smaller=True)

# Fetching data
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
import numpy as np
classes = ["soc.religion.christian", "sci.space"]
newsgroups = fetch_20newsgroups(subset="all", categories=classes)
corpus, group_labels = np.array(newsgroups.data), np.array(newsgroups.target_names)[newsgroups.target]

# Wrangling data into a dataframe and selecting training examples
data = pd.DataFrame({"text": corpus, "label": group_labels})
train_df = data.groupby("label").sample(5)
test_df = data.drop(index=train_df.index)

# X_sample only contains 5 examples of each class!
X_sample, y_sample = train_df['text'].values, train_df['label'].values

# test set
X_test, y_test = test_df['text'].values, test_df['label'].values

# train
clf.train(X_sample, y_sample, max_steps=20)

# evaluate
print(clf.evaluate(X_test, y_test)['accuracy'])
# output: 0.98

# make predictions
clf.predict(['Elon Musk likes launching satellites.']).tolist()[0]
# output: sci.space
Using Hugging Face Transformers Instead of Llama.cpp
By default, the LLM backend employed by OnPrem.LLM is llama-cpp-python, which requires models in GGUF format. As of v0.5.0, it is now possible to use Hugging Face transformers as the LLM backend instead. This is accomplished by using the model_id parameter (instead of supplying a model_url argument). In the example below, we run the Llama-3.1-8B model.
# llama-cpp-python does NOT need to be installed when using model_id parameter
= LLM(model_id="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4", device_map='cuda') llm
This allows you to more easily use any model on the Hugging Face Hub in SafeTensors format, provided it can be loaded with the Hugging Face transformers.pipeline. Note that, when using the model_id parameter, the prompt_template is set automatically by transformers.
The Llama-3.1 model loaded above was quantized using AWQ, which allows the model to fit onto smaller GPUs (e.g., laptop GPUs with 6GB of VRAM), similar to the default GGUF format. AWQ models require the autoawq package to be installed: pip install autoawq (AWQ only supports Linux systems, including Windows Subsystem for Linux). If you need to load a model that is not quantized, you can supply a quantization configuration at load time (known as "inflight quantization"). In the following example, we load an unquantized Zephyr-7B-beta model that will be quantized during loading to fit on GPUs with as little as 6GB of VRAM:
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True,
)
llm = LLM(model_id="HuggingFaceH4/zephyr-7b-beta", device_map='cuda',
          model_kwargs={"quantization_config": quantization_config})
When supplying a quantization_config, the bitsandbytes library (a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and 8-bit and 4-bit quantization functions) is used. There are ongoing efforts by the bitsandbytes team to support multiple backends in addition to CUDA. If you receive errors related to bitsandbytes, please refer to the bitsandbytes documentation.
Connecting to LLMs Served Through REST APIs
OnPrem.LLM can be used with LLMs being served through any OpenAI-compatible REST API. This means you can easily use OnPrem.LLM with tools like vLLM, OpenLLM, Ollama, and the llama.cpp server.
For instance, using vLLM, you can serve a LLaMA 3 model as follows:
python -m vllm.entrypoints.openai.api_server --model NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
You can then connect OnPrem.LLM to the LLM by supplying the URL of the server you just started:
from onprem import LLM
llm = LLM(model_url='http://localhost:8000/v1', api_key='token-abc123')
# Note: The API key can either be supplied directly or stored in the OPENAI_API_KEY environment variable.
# If the server does not require an API key, `api_key` should still be supplied with a dummy value like 'na'.
That's it! Solve problems with OnPrem.LLM as you normally would (e.g., RAG question-answering, summarization, few-shot prompting, code generation, etc.).
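For example, prompting the served model looks exactly like prompting a local one (the prompt text is only an illustration):
# the connected LLM instance is used like any other
saved_output = llm.prompt('List three advantages of running LLMs on-premises.')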
Using OpenAI Models with OnPrem.LLM
Even when using on-premises language models, it can sometimes be useful to have easy access to non-local, cloud-based models (e.g., OpenAI) for testing, producing baselines for comparison, and generating synthetic examples for fine-tuning. For these reasons, in spite of the name, OnPrem.LLM now includes support for OpenAI chat models:
from onprem import LLM
llm = LLM(model_url='openai://gpt-4o', temperature=0)
/home/amaiya/projects/ghub/onprem/onprem/core.py:196: UserWarning: The model you supplied is gpt-4o, an external service (i.e., not on-premises). Use with caution, as your data and prompts will be sent externally.
warnings.warn(f'The model you supplied is {self.model_name}, an external service (i.e., not on-premises). '+\
This OpenAI LLM instance can now be used as the engine for most features in OnPrem.LLM (e.g., RAG, information extraction, summarization, etc.). Here we simply use it for general prompting:
saved_result = llm.prompt('List three cute names for a cat and explain why each is cute.')
Certainly! Here are three cute names for a cat, along with explanations for why each is adorable:
1. **Whiskers**: This name is cute because it highlights one of the most distinctive and charming features of a cat: their whiskers. It's playful and endearing, evoking the image of a curious cat twitching its whiskers as it explores its surroundings.
2. **Mittens**: This name is cute because it conjures up the image of a cat with little white paws that look like they are wearing mittens. It's a cozy and affectionate name that suggests warmth and cuddliness, much like a pair of soft mittens.
3. **Pumpkin**: This name is cute because it brings to mind the warm, orange hues of a pumpkin, which can be reminiscent of certain cat fur colors. It's also associated with the fall season, which is often linked to comfort and coziness. Plus, the name "Pumpkin" has a sweet and affectionate ring to it, making it perfect for a beloved pet.
Using Vision Capabilities in GPT-4o
= "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
image_url = llm.prompt('Describe the weather in this image.', image_path_or_url=image_url) saved_result
The weather in the image appears to be clear and sunny. The sky is mostly blue with some scattered clouds, suggesting a pleasant day with good visibility. The sunlight is bright, illuminating the green grass and landscape.
Using OpenAI-Style Message Dictionaries
messages = [
    {'content': [{'text': 'describe the weather in this image',
                  'type': 'text'},
                 {'image_url': {'url': image_url},
                  'type': 'image_url'}],
     'role': 'user'}]
saved_result = llm.prompt(messages)
The weather in the image appears to be clear and sunny. The sky is mostly blue with some scattered clouds, suggesting a pleasant day with good visibility. The sunlight is bright, casting clear shadows and illuminating the green landscape.
Azure OpenAI
For Azure OpenAI models, use the following URL format:
llm = LLM(model_url='azure://<deployment_name>', ...)
# <deployment_name> is the Azure deployment name; additional Azure-specific parameters
# can be supplied as extra arguments to LLM (or set as environment variables)
Structured and Guided Outputs
The LLM.pydantic_prompt method allows you to specify the desired structure of the LLM's output as a Pydantic model.
from pydantic import BaseModel, Field
class Joke(BaseModel):
    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")

from onprem import LLM
llm = LLM(default_model='llama', verbose=False)
structured_output = llm.pydantic_prompt('Tell me a joke.', pydantic_model=Joke)
llama_new_context_with_model: n_ctx_per_seq (3904) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
{
"setup": "Why couldn't the bicycle stand alone?",
"punchline": "Because it was two-tired!"
}
The output is a Pydantic object instead of a string:
structured_output
Joke(setup="Why couldn't the bicycle stand alone?", punchline='Because it was two-tired!')
print(structured_output.setup)
print()
print(structured_output.punchline)
Why couldn't the bicycle stand alone?
Because it was two-tired!
You can also use OnPrem.LLM with the Guidance package to guide the LLM to generate outputs based on your conditions and constraints. We'll show a couple of examples here, but see our documentation on guided prompts for more information.
from onprem import LLM
llm = LLM(n_gpu_layers=-1, verbose=False)

from onprem.pipelines.guider import Guider
guider = Guider(llm)
With the Guider, you can use regular expressions to control LLM generation:
= f"""Question: Luke has ten balls. He gives three to his brother. How many balls does he have left?
prompt Answer: """ + gen(name='answer', regex='\d+')
=False) guider.prompt(prompt, echo
{'answer': '7'}
prompt = '19, 18,' + gen(name='output', max_tokens=50, stop_regex='[^\d]7[^\d]')
guider.prompt(prompt)
19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8,
{'output': ' 17, 16, 15, 14, 13, 12, 11, 10, 9, 8,'}
See the documentation for more examples of how to use Guidance with OnPrem.LLM.
Built-In Web App
OnPrem.LLM includes a built-in Web app to access the LLM. To start it, run the following command after installation:
onprem --port 8000
Then, enter localhost:8000 (or <domain_name>:8000 if running on a remote server) in a Web browser to access the application.
For more information, see the corresponding documentation.
FAQ
How do I use other models with OnPrem.LLM?
You can supply the URL to other models to the LLM constructor, as we did above in the code generation example.
As of v0.0.20, we support models in GGUF format, which supersedes the older GGML format. You can find llama.cpp-supported models with GGUF in the file name on huggingface.co. Make sure you are pointing to the URL of the actual GGUF model file, which is the "download" link on the model's page (for Mistral-7B, for example, this is the download link for the GGUF file on its model page).
Note that some models have specific prompt formats. For instance, the prompt template required for Zephyr-7B, as described on the model's page, is:
<|system|>\n</s>\n<|user|>\n{prompt}</s>\n<|assistant|>
So, to use the Zephyr-7B model, you must supply the prompt_template argument to the LLM constructor (or specify it in the webapp.yml configuration for the Web app):
# how to use Zephyr-7B with OnPrem.LLM
llm = LLM(model_url='https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q4_K_M.gguf',
          prompt_template="<|system|>\n</s>\n<|user|>\n{prompt}</s>\n<|assistant|>",
          n_gpu_layers=33)
llm.prompt("List three cute names for a cat.")
When installing onprem, I'm getting "build" errors related to llama-cpp-python (or chroma-hnswlib) on Windows/Mac/Linux?
See this LangChain documentation on Llama.cpp for help on installing the llama-cpp-python package for your system. Additional tips for different operating systems are shown below:
For Linux systems like Ubuntu, try this: sudo apt-get install build-essential g++ clang. Other tips are here.
For Windows systems, please try following these instructions. We recommend you use Windows Subsystem for Linux (WSL) instead of using Microsoft Windows directly. If you do need to use Microsoft Windows directly, be sure to install the Microsoft C++ Build Tools and make sure the Desktop development with C++ option is selected.
For Macs, try following these tips.
There are also various other tips for each of the above OSes in this privateGPT repo thread. Of course, you can also easily use OnPrem.LLM on Google Colab.
Finally, if you still can't overcome issues with building llama-cpp-python, you can try installing the pre-built wheel file for your system. Example:
pip install llama-cpp-python==0.2.90 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
Tip: There are pre-built wheel files for chroma-hnswlib, as well. If running pip install onprem fails on building chroma-hnswlib, it may be because a pre-built wheel doesn't yet exist for the version of Python you're using (in which case you can try downgrading Python).
I'm behind a corporate firewall and am receiving an SSL error when trying to download the model?
Try this:
from onprem import LLM
LLM.download_model(url, ssl_verify=False)
You can download the embedding model (used by LLM.ingest and LLM.ask) as follows:
wget --no-check-certificate https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/all-MiniLM-L6-v2.zip
Supply the unzipped folder name as the embedding_model_name argument to LLM.
If you're getting SSL errors even when running pip install, try this:
pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org pip_system_certs
How do I use this on a machine with no internet access?
Use the LLM.download_model method to download the model files to <your_home_directory>/onprem_data and transfer them to the same location on the air-gapped machine.
For the ingest and ask methods, you will also need to download and transfer the embedding model files:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
model.save('/some/folder')
Copy the some/folder folder to the air-gapped machine and supply the path to LLM via the embedding_model_name parameter.
My model is not loading when I call llm = LLM(...)?
This can happen if the model file is corrupt (in which case you should delete it from <home directory>/onprem_data and re-download). It can also happen if the version of llama-cpp-python needs to be upgraded to the latest.
I'm getting an "Illegal instruction (core dumped)" error when instantiating a langchain.llms.Llamacpp or onprem.LLM object?
Your CPU may not support instructions that cmake is using for one reason or another (e.g., due to Hyper-V in VirtualBox settings). You can try turning them off when building and installing llama-cpp-python:
# example
CMAKE_ARGS="-DGGML_CUDA=ON -DGGML_AVX2=OFF -DGGML_AVX=OFF -DGGML_F16C=OFF -DGGML_FMA=OFF" FORCE_CMAKE=1 pip install --force-reinstall llama-cpp-python --no-cache-dir
How can I speed up LLM.ingest using my GPU?
Try using the embedding_model_kwargs argument:
from onprem import LLM
llm = LLM(embedding_model_kwargs={'device':'cuda'})