core

Core functionality for onprem

source

LLM

 LLM (model_url:Optional[str]=None, model_id:Optional[str]=None,
      default_model:str='mistral', default_engine:str='llama.cpp',
      n_gpu_layers:Optional[int]=None, prompt_template:Optional[str]=None,
      model_download_path:Optional[str]=None,
      vectordb_path:Optional[str]=None, max_tokens:int=512,
      n_ctx:int=3900, n_batch:int=1024, stop:list=[],
      mute_stream:bool=False, callbacks=[],
      embedding_model_name:str='sentence-transformers/all-MiniLM-L6-v2',
      embedding_model_kwargs:dict={'device': 'cpu'},
      embedding_encode_kwargs:dict={'normalize_embeddings': False},
      rag_num_source_docs:int=4, rag_score_threshold:float=0.0,
      check_model_download:bool=True, confirm:bool=True,
      verbose:bool=True, **kwargs)

*LLM Constructor. Extra kwargs (e.g., temperature) are fed directly to langchain.llms.LlamaCpp or langchain_huggingface.HuggingFacePipeline.

Args:

  • model_url: URL to .GGUF model (or the filename if it has already been downloaded to model_download_path). To use an OpenAI-compatible REST API (e.g., vLLM, OpenLLM, Ollama), supply the URL (e.g., http://localhost:8080/v1). To use a cloud-based OpenAI model, replace the URL with: openai://<name_of_model> (e.g., openai://gpt-3.5-turbo). To use Azure OpenAI, replace the URL with: azure://<deployment_name>. If None, use the model indicated by default_model.
  • model_id: Name of or path to Hugging Face model (e.g., in SafeTensor format). Hugging Face Transformers is used for LLM generation instead of llama-cpp-python. Mutually-exclusive with model_url and default_model. The n_gpu_layers and model_download_path parameters are ignored if model_id is supplied.
  • default_model: One of {‘mistral’, ‘zephyr’, ‘llama’}, where mistral is Mistral-Instruct-7B-v0.2, zephyr is Zephyr-7B-beta, and llama is Llama-3.1-8B.
  • default_engine: The engine used to run the default_model. One of {‘llama.cpp’, ‘transformers’}.
  • n_gpu_layers: Number of layers to be loaded into gpu memory. Default is None.
  • prompt_template: Optional prompt template (must have a variable named “prompt”). Prompt templates are not typically needed when using the model_id parameter, as transformers sets it automatically.
  • model_download_path: Path to download model. Default is onprem_data in user’s home directory.
  • vectordb_path: Path to vector database (created if it doesn’t exist). Default is onprem_data/vectordb in user’s home directory.
  • max_tokens: The maximum number of tokens to generate.
  • n_ctx: Token context window. (Llama2 models have max of 4096.)
  • n_batch: Number of tokens to process in parallel.
  • stop: a list of strings to stop generation when encountered (applied to all calls to LLM.prompt)
  • mute_stream: Mute ChatGPT-like token stream output during generation
  • callbacks: Callbacks to supply model
  • embedding_model_name: name of sentence-transformers model. Used for LLM.ingest and LLM.ask.
  • embedding_model_kwargs: arguments to embedding model (e.g., {'device': 'cpu'}).
  • embedding_encode_kwargs: arguments to encode method of embedding model (e.g., {'normalize_embeddings': False}).
  • rag_num_source_docs: The maximum number of documents retrieved and fed to LLM.ask and LLM.chat to generate answers
  • rag_score_threshold: Minimum similarity score for source to be considered by LLM.ask and LLM.chat
  • confirm: whether or not to confirm with user before downloading a model
  • verbose: Verbosity*
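For illustration, here is a minimal construction sketch (not taken from the library docs; the endpoint URL and Hugging Face model id are placeholders), assuming the package exposes LLM at the top level as in the Example Usage section below:

from onprem import LLM

# use one of the built-in default models (downloaded on first use)
llm = LLM(default_model='zephyr', n_gpu_layers=35, mute_stream=True)

# or target an OpenAI-compatible server (placeholder URL)
# llm = LLM(model_url='http://localhost:8080/v1')

# or run a Hugging Face model with the transformers backend (placeholder model id)
# llm = LLM(model_id='mistralai/Mistral-7B-Instruct-v0.2')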

source

AnswerConversationBufferMemory

 AnswerConversationBufferMemory (*args:Any,
                                 chat_memory:langchain_core.chat_history.BaseChatMessageHistory=None,
                                 output_key:Optional[str]=None,
                                 input_key:Optional[str]=None,
                                 return_messages:bool=False,
                                 human_prefix:str='Human',
                                 ai_prefix:str='AI',
                                 memory_key:str='history')

*Deprecated since 0.3.1: Please see the migration guide at: https://python.langchain.com/docs/versions/migrating_memory/

A basic memory implementation that simply stores the conversation history.

This stores the entire conversation history in memory without any additional processing.

Note that additional processing may be required in some situations when the conversation history is too large to fit in the context window of the model.*


source

LLM.download_model

 LLM.download_model (model_url:Optional[str]=None,
                     default_model:str='mistral',
                     model_download_path:Optional[str]=None,
                     confirm:bool=True, ssl_verify:bool=True)

*Download an LLM in GGML format supported by llama.cpp.

Args:

  • model_url: URL of model. If None, then use default_model.
  • default_model: One of {‘mistral’, ‘zephyr’, ‘llama’}, where mistral is Mistral-Instruct-7B-v0.2, zephyr is Zephyr-7B-beta, and llama is Llama-3.1-8B.
  • model_download_path: Path to download model. Default is onprem_data in user’s home directory.
  • confirm: whether or not to confirm with user before downloading
  • ssl_verify: If True, SSL certificates are verified. You can set to False if corporate firewall gives you problems.*
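A hedged sketch of calling this directly, e.g., behind a corporate proxy where certificate verification fails (normally the LLM constructor handles the download for you; invoking the method on an existing instance is an assumption here):

# sketch: fetch the default model up front, skipping SSL verification,
# assuming `llm` is an existing LLM instance
llm.download_model(confirm=False, ssl_verify=False)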

source

LLM.load_llm

 LLM.load_llm ()

Loads the LLM from the model path.


source

LLM.load_ingester

 LLM.load_ingester ()

Get Ingester instance. You can access the langchain_chroma.Chroma instance with load_ingester().get_db().
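For example, the underlying Chroma store can be pulled out and queried directly (a sketch; similarity_search is standard langchain_chroma vector-store API, and the query string is illustrative):

ingester = llm.load_ingester()
db = ingester.get_db()                      # langchain_chroma.Chroma instance
docs = db.similarity_search("ktrain", k=4)  # query the raw vector store directly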


source

LLM.load_qa

 LLM.load_qa (prompt_template:str='Use the following pieces of context
              delimited by three backticks to answer the question at the
              end. If you don\'t know the answer, just say that you don\'t
              know, don\'t try to make up an
              answer.\n\n```{context}```\n\nQuestion: {question}\nHelpful
              Answer:')

*Prepares and loads the langchain.chains.RetrievalQA object

Args:

  • prompt_template: A string representing the prompt with variables “context” and “question”*
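A sketch of supplying a custom QA prompt (the template text is illustrative; it must contain both {context} and {question} variables):

custom_qa = (
    "Answer the question using only the context below. "
    "If the answer is not in the context, say you don't know.\n\n"
    "{context}\n\nQuestion: {question}\nAnswer:"
)
llm.load_qa(prompt_template=custom_qa)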

source

LLM.load_chatqa

 LLM.load_chatqa ()

Prepares and loads a langchain.chains.ConversationalRetrievalChain instance


source

LLM.prompt

 LLM.prompt (prompt:Union[str,List[Dict]],
             image_path_or_url:Optional[str]=None,
             prompt_template:Optional[str]=None, stop:list=[], **kwargs)

*Send prompt to LLM to generate a response. Extra keyword arguments are sent directly to the model invocation.

Args:

  • prompt: The prompt to supply to the model. Either a string or OpenAI-style list of dictionaries representing messages (e.g., “human”, “system”).
  • image_path_or_url: Path or URL to an image file
  • prompt_template: Optional prompt template (must have a variable named “prompt”). This value will override any prompt_template value supplied to LLM constructor.
  • stop: a list of strings to stop generation when encountered. This value will override the stop parameter supplied to LLM constructor.*
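A couple of illustrative calls (the stop string and prompt template shown are placeholders, not library defaults):

# per-call stop sequence overrides the one given to the constructor
text = llm.prompt("List three open-source licenses:", stop=["\n\n"])

# per-call prompt template (must contain a {prompt} variable)
text = llm.prompt("Summarize GGUF in one sentence.",
                  prompt_template="<s>[INST] {prompt} [/INST]")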

source

LLM.ingest

 LLM.ingest (source_directory:str, chunk_size:int=500,
             chunk_overlap:int=50, ignore_fn:Optional[Callable]=None,
             pdf_use_unstructured:bool=False, **kwargs)

Ingests all documents in source_directory into the vector database. Previously-ingested documents are ignored. Extra kwargs are fed directly to langchain_community.document_loaders.pdf.UnstructuredPDFLoader when pdf_use_unstructured is True. A usage sketch follows the parameter list.

Args:

  • source_directory (str): path to folder containing documents
  • chunk_size (int, default 500): text is split into chunks of this many characters by langchain.text_splitter.RecursiveCharacterTextSplitter
  • chunk_overlap (int, default 50): character overlap between chunks in langchain.text_splitter.RecursiveCharacterTextSplitter
  • ignore_fn (Optional[Callable], default None): callable that accepts the file path and returns True for files that should be ignored
  • pdf_use_unstructured (bool, default False): If True, use unstructured for PDF extraction
  • kwargs: fed directly to UnstructuredPDFLoader when pdf_use_unstructured is True
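A sketch of a typical call; the folder path is a placeholder, and the ignore_fn here simply skips anything that is not a PDF:

llm.ingest(
    "./my_documents",                 # placeholder folder of source documents
    chunk_size=500,
    chunk_overlap=50,
    ignore_fn=lambda path: not path.lower().endswith(".pdf"),
)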

source

LLM.ask

 LLM.ask (question:str, qa_template='Use the following pieces of context
          delimited by three backticks to answer the question at the end.
          If you don\'t know the answer, just say that you don\'t know,
          don\'t try to make up an answer.\n\n```{context}```\n\nQuestion:
          {question}\nHelpful Answer:', prompt_template=None, **kwargs)

*Answer a question based on source documents fed to the ingest method. Extra keyword arguments are sent directly to the model invocation.

Args:

  • question: a question you want to ask
  • qa_template: A string representing the prompt with variables “context” and “question”
  • prompt_template: the model-specific template in which everything (including QA template) should be wrapped. Should have a single variable “{prompt}”. Overrides the prompt_template parameter supplied to LLM constructor.

Returns:

  • A dictionary with keys: answer, source_documents, question*
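A sketch of inspecting the returned dictionary (keys as documented above):

result = llm.ask("What is ktrain?")
print(result["answer"])
for doc in result["source_documents"]:
    # each source document carries its originating file path in metadata
    print(doc.metadata["source"])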

source

LLM.chat

 LLM.chat (question:str, **kwargs)

*Chat with documents fed to the ingest method. Unlike LLM.ask, LLM.chat includes conversational memory. Extra keyword arguments are sent directly to the model invocation.

Args:

  • question: a question you want to ask

Returns:

  • A dictionary with keys: answer, source_documents, question, chat_history*
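A sketch of a short multi-turn exchange; conversational memory carries across calls:

result = llm.chat("What is ktrain?")
result = llm.chat("Can it be used for image classification?")
print(result["answer"])
print(result["chat_history"])   # accumulated conversation turns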

Example Usage

We’ll use a small 3B-parameter model here for testing purposes. The vector database is stored under ~/onprem_data by default, but in this example we will store it in a temporary folder.

import os
import tempfile

# LLM and the utils module (U) are assumed to be importable from the onprem package
from onprem import LLM
from onprem import utils as U

vectordb_path = tempfile.mkdtemp()
url = "https://huggingface.co/TheBloke/orca_mini_3B-GGML/resolve/main/orca-mini-3b.ggmlv3.q4_1.bin"
llm = LLM(model_url=url, vectordb_path=vectordb_path, confirm=False)
assert os.path.isfile(
    os.path.join(U.get_datadir(), os.path.basename(url))
), "missing model"
prompt = """Extract the names of people in the supplied sentences. Here is an example:
Sentence: James Gandolfini and Paul Newman were great actors.
People:
James Gandolfini, Paul Newman
Sentence:
I like Cillian Murphy's acting. Florence Pugh is great, too.
People:"""
saved_output = llm.prompt(prompt)
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA TITAN V, compute capability 7.0
  Device 1: NVIDIA TITAN V, compute capability 7.0
llama.cpp: loading model from /home/amaiya/onprem_data/orca-mini-3b.ggmlv3.q4_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 3200
llama_model_load_internal: n_mult     = 240
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 26
llama_model_load_internal: n_rot      = 100
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 8640
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size =    0.06 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA TITAN V) as main device
llama_model_load_internal: mem required  = 3066.94 MB (+  682.00 MB per state)
llama_model_load_internal: offloading 0 repeating layers to GPU
llama_model_load_internal: offloaded 0/29 layers to GPU
llama_model_load_internal: total VRAM used: 384 MB
llama_new_context_with_model: kv self size  =  650.00 MB
 Cillian Murphy, Florence Pugh
assert saved_output.strip() == "Cillian Murphy, Florence Pugh", "bad response"
llm.ingest("./sample_data", chunk_size=500, chunk_overlap=50)
Creating new vectorstore at /tmp/tmpl6ww9w5p
Loading documents from ./sample_data
Loading new documents: 100%|██████████████████████| 3/3 [00:00<00:00, 24.82it/s]
Loaded 12 new documents from ./sample_data
Split into 153 chunks of text (max. 500 chars each)
Creating embeddings. May take some minutes...
Ingestion complete! You can now query your documents using the LLM.ask method
question = """What is ktrain?"""
result = llm.ask(question)
print("\n\nReferences:\n\n")
for i, document in enumerate(result["source_documents"]):
    print(f"\n{i+1}.> " + document.metadata["source"] + ":")
    print(document.page_content)
 ktrain is a low-code library for augmented machine learning that enables beginners and domain experts with minimal programming or data science expertise to further democratize machine learning by facilitating the full machine learning workflow from curating and preprocessing inputs (i.e., ground-truth-labeled training data) to training, tuning, troubleshooting, and applying models.

References:



1.> ./sample_data/1/ktrain_paper.pdf:
lection (He et al., 2019). By contrast, ktrain places less emphasis on this aspect of au-
tomation and instead focuses on either partially or fully automating other aspects of the
machine learning (ML) workflow. For these reasons, ktrain is less of a traditional Au-
2

2.> ./sample_data/1/ktrain_paper.pdf:
possible, ktrain automates (either algorithmically or through setting well-performing de-
faults), but also allows users to make choices that best fit their unique application require-
ments. In this way, ktrain uses automation to augment and complement human engineers
rather than attempting to entirely replace them. In doing so, the strengths of both are
better exploited. Following inspiration from a blog post1 by Rachel Thomas of fast.ai

3.> ./sample_data/1/ktrain_paper.pdf:
with custom models and data formats, as well.
Inspired by other low-code (and no-
code) open-source ML libraries such as fastai (Howard and Gugger, 2020) and ludwig
(Molino et al., 2019), ktrain is intended to help further democratize machine learning by
enabling beginners and domain experts with minimal programming or data science experi-
4. http://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups
6

4.> ./sample_data/1/ktrain_paper.pdf:
ktrain: A Low-Code Library for Augmented Machine Learning
toML platform and more of what might be called a “low-code” ML platform. Through
automation or semi-automation, ktrain facilitates the full machine learning workflow from
curating and preprocessing inputs (i.e., ground-truth-labeled training data) to training,
tuning, troubleshooting, and applying models. In this way, ktrain is well-suited for domain
experts who may have less experience with machine learning and software coding. Where

Pro-Tip: Smaller models like this one tend to hallucinate more easily than larger ones. If you see the model hallucinating answers, you can select one of the larger default models via the default_model parameter (or supply the URL of a different model of your choosing to LLM), which can provide better performance.

The LLM.chat method answers questions with consideration of conversational memory. Note that LLM.chat is better suited to larger/better models than the one used below, as the models are required to do more work (e.g., condensing the question and chat history into a standalone question).

question = "What is ktrain?"
result = llm.chat(question)
 ktrain is a low-code library for augmented machine learning that allows users to automate or semi-automate various aspects of the machine learning workflow, such as curating and preprocessing inputs, training, tuning, troubleshooting, and applying models. It is designed to improve the strengths of both humans and machines by enabling beginners and domain experts with minimal programming or data science expertise to use machine learning in their applications.
question = "Can it be used for image classification?"
result = llm.chat(question)
 What is ktrain and how can it be used for image classification? ktrain is a low-code library for augmented machine learning that enables the full machine learning workflow, including automation or semi-automation of tasks such as data curation, preprocessing, model training, tuning, troubleshooting, and application. It can be used with any machine learning model implemented in TensorFlow Keras (tf.keras). ktrain includes out-of-the-box support for various data types and tasks, including image classification using custom models and data formats, as well.