llm helpers

Helper utilities for using LLMs


truncate_prompt


def truncate_prompt(
    model_or_pipeline, prompt, max_gen_tokens:int=512, truncate_from:str='start', prompt_template:str=None
):

Truncate only the user prompt (not the full formatted prompt) so that the total prompt (template + user prompt) fits in the model's context window.

Args:
    model_or_pipeline: One of:
        - llama_cpp.Llama
        - HuggingFace pipeline (with .tokenizer)
        - HuggingFace tokenizer
    prompt (str): The user-supplied prompt (to be truncated if needed).
    max_gen_tokens (int): Tokens to reserve for generation.
    truncate_from (str): 'start' or 'end'.
    prompt_template (str, optional): Template with {prompt}. Used to reserve space for non-user text.

Returns: str: Truncated user prompt only (template not applied).
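
The budgeting described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `truncate_user_prompt`, the explicit `context_size` argument, and the toy whitespace tokenizer are all hypothetical, and it assumes that truncating from 'start' drops the earliest tokens and keeps the tail.

```python
from typing import Callable, List, Optional


def truncate_user_prompt(
    prompt: str,
    tokenize: Callable[[str], List[str]],
    detokenize: Callable[[List[str]], str],
    context_size: int,
    max_gen_tokens: int = 512,
    truncate_from: str = "start",
    prompt_template: Optional[str] = None,
) -> str:
    """Hypothetical sketch: trim only the user prompt so that
    template + prompt + generation budget fits in context_size tokens."""
    if truncate_from not in ("start", "end"):
        raise ValueError("truncate_from must be 'start' or 'end'")

    # Tokens used by the template text alone (with an empty user prompt).
    template_tokens = 0
    if prompt_template is not None:
        template_tokens = len(tokenize(prompt_template.format(prompt="")))

    # Whatever remains after reserving the template and the generation budget.
    budget = max(context_size - max_gen_tokens - template_tokens, 0)

    tokens = tokenize(prompt)
    if len(tokens) <= budget:
        return prompt  # already fits, so return it unchanged
    if budget == 0:
        return ""  # no room left for any user text

    # Assumption: 'start' drops the earliest tokens, 'end' drops the latest.
    kept = tokens[-budget:] if truncate_from == "start" else tokens[:budget]
    return detokenize(kept)


# Toy whitespace tokenizer, used only to keep the example self-contained.
def ws_tokenize(s: str) -> List[str]:
    return s.split()


def ws_detokenize(toks: List[str]) -> str:
    return " ".join(toks)


words = " ".join(f"w{i}" for i in range(20))
short = truncate_user_prompt(
    words, ws_tokenize, ws_detokenize,
    context_size=16, max_gen_tokens=4,
    prompt_template="Answer briefly: {prompt}",
)
print(short)
```

With a 16-token context, 4 tokens reserved for generation, and 2 tokens taken by the template, 10 tokens remain for the user prompt, so only the last ten words survive.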



parse_code_markdown


def parse_code_markdown(
    text:str, only_last:bool
)->List:

Parse code blocks embedded in a markdown string.
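
One common way to pull fenced code out of a markdown reply is a regular expression over the fences. The sketch below is illustrative only: `extract_code_blocks` is a hypothetical stand-in, and treating `only_last` as "return a list holding just the final block" is an assumption, not documented behavior.

```python
import re
from typing import List

FENCE = "`" * 3  # a markdown code fence, built here to avoid nesting fences

# An opening fence with an optional language tag, the body, then a closing fence.
CODE_BLOCK = re.compile(FENCE + r"[\w+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code_blocks(text: str, only_last: bool = False) -> List[str]:
    """Hypothetical helper: return the bodies of fenced code blocks in text."""
    blocks = [body.strip() for body in CODE_BLOCK.findall(text)]
    if only_last:
        return blocks[-1:]  # empty list if there were no blocks
    return blocks


reply = (
    "First attempt:\n"
    + FENCE + "python\nprint('draft')\n" + FENCE + "\n"
    + "Final version:\n"
    + FENCE + "python\nprint('final')\n" + FENCE + "\n"
)
print(extract_code_blocks(reply))
print(extract_code_blocks(reply, only_last=True))
```

Keeping only the last block is useful when a model revises its answer within a single reply.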



parse_json_markdown


def parse_json_markdown(
    text:str
)->Any:

Parse JSON embedded in markdown into a dictionary.



extract_json


def extract_json(
    text:str
)->str:

Attempts to extract JSON from a markdown string. If no JSON exists, returns an empty string.
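
The two JSON helpers above follow an extract-then-parse pattern, which can be sketched with the standard library alone. The function names `find_json_text` and `load_json_from_markdown` are hypothetical, and scanning for the first decodable object or array is an assumed strategy rather than the library's actual logic.

```python
import json
from typing import Any

FENCE = "`" * 3  # a markdown code fence, built here to avoid nesting fences


def find_json_text(text: str) -> str:
    """Hypothetical helper: return the first decodable JSON object or array
    found in text, or an empty string if there is none."""
    decoder = json.JSONDecoder()
    for start, ch in enumerate(text):
        if ch not in "{[":
            continue
        try:
            # raw_decode parses one JSON value beginning at `start` and
            # reports where it ended, ignoring any trailing text.
            _, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue
        return text[start:end]
    return ""


def load_json_from_markdown(text: str) -> Any:
    """Hypothetical helper: parse the JSON found in text, or return None."""
    found = find_json_text(text)
    return json.loads(found) if found else None


reply = "Sure, here it is:\n" + FENCE + 'json\n{"title": "Example", "pages": 3}\n' + FENCE
print(find_json_text(reply))
print(load_json_from_markdown(reply))
print(repr(find_json_text("no json here")))
```

Returning an empty string instead of raising lets callers test the result for truthiness before parsing.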



extract_title


def extract_title(
    docs_or_text:Union, llm, max_words:int=1024, retries:int=1, **kwargs
):

Extract or infer the title for the given text

Args:
    - docs_or_text: Either a list of LangChain Document objects or a single text string
    - llm: An onprem.LLM instance
    - max_words: Maximum words to consider
    - retries: Number of tries to correctly extract title
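
The interplay of max_words and retries can be illustrated with a small retry loop. This is a sketch under stated assumptions, not how onprem implements it: `title_with_retries`, the `ask` callable standing in for the LLM, the prompt wording, and the reading of retries as the total number of attempts are all hypothetical.

```python
import json
from typing import Callable, Optional


def title_with_retries(
    text: str,
    ask: Callable[[str], str],
    max_words: int = 1024,
    retries: int = 1,
) -> Optional[str]:
    """Hypothetical sketch: ask a model for a title as JSON, retrying on bad output."""
    # Only the first max_words words are sent to the model.
    excerpt = " ".join(text.split()[:max_words])
    prompt = 'Return JSON like {"title": "..."} for this text:\n' + excerpt

    for _ in range(max(retries, 1)):
        reply = ask(prompt)
        try:
            data = json.loads(reply)
        except json.JSONDecodeError:
            continue  # malformed reply, so try again
        title = data.get("title") if isinstance(data, dict) else None
        if isinstance(title, str) and title.strip():
            return title.strip()
    return None  # every attempt failed validation


# Stub model: fails once, then answers correctly.
replies = iter(["not json", '{"title": "A Sample Title"}'])


def stub(prompt: str) -> str:
    return next(replies)


print(title_with_retries("some long document text", stub, retries=2))
```

Validating the reply before accepting it is what makes retries worthwhile: a second attempt only happens when the first reply cannot be parsed into a usable title.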



Title


def Title(
    data:Any
)->None:

A base class for creating Pydantic models.

Attributes:
__class_vars__: The names of the class variables defined on the model.
__private_attributes__: Metadata about the private attributes of the model.
__signature__: The synthesized `__init__` [`Signature`][inspect.Signature] of the model.

__pydantic_complete__: Whether model building is completed, or if there are still undefined fields.
__pydantic_core_schema__: The core schema of the model.
__pydantic_custom_init__: Whether the model has a custom `__init__` function.
__pydantic_decorators__: Metadata containing the decorators defined on the model.
    This replaces `Model.__validators__` and `Model.__root_validators__` from Pydantic V1.
__pydantic_generic_metadata__: Metadata for generic models; contains data used for a similar purpose to
    __args__, __origin__, __parameters__ in typing-module generics. May eventually be replaced by these.
__pydantic_parent_namespace__: Parent namespace of the model, used for automatic rebuilding of models.
__pydantic_post_init__: The name of the post-init method for the model, if defined.
__pydantic_root_model__: Whether the model is a [`RootModel`][pydantic.root_model.RootModel].
__pydantic_serializer__: The `pydantic-core` `SchemaSerializer` used to dump instances of the model.
__pydantic_validator__: The `pydantic-core` `SchemaValidator` used to validate instances of the model.

__pydantic_fields__: A dictionary of field names and their corresponding [`FieldInfo`][pydantic.fields.FieldInfo] objects.
__pydantic_computed_fields__: A dictionary of computed field names and their corresponding [`ComputedFieldInfo`][pydantic.fields.ComputedFieldInfo] objects.

__pydantic_extra__: A dictionary containing extra values, if [`extra`][pydantic.config.ConfigDict.extra]
    is set to `'allow'`.
__pydantic_fields_set__: The names of fields explicitly set during instantiation.
__pydantic_private__: Values of private attributes set on the model instance.


needs_followup


def needs_followup(
    question:str, llm, parse:bool=True, **kwargs
):

Decide if follow-up questions are needed



summarize_tables


def summarize_tables(
    docs:List, llm, max_chars:int=4096, max_tables:int=3, retries:int=1, attempt_exact:bool=False,
    only_caption_missing:bool=False, **kwargs
):

Given a list of Documents, auto-caption or summarize any tables within list.

Args:
    - docs: A list of LangChain Document objects
    - llm: An onprem.LLM instance
    - max_chars: Maximum characters to consider
    - retries: Number of tries to correctly auto-caption table
    - attempt_exact: Try to extract the existing caption if it exists.
    - only_caption_missing: Only caption tables without a caption



caption_table_text


def caption_table_text(
    table_text:str, llm, max_chars:int=4096, retries:int=1, attempt_exact:bool=False, **kwargs
):

Caption table text



TableSummary


def TableSummary(
    data:Any
)->None:

A base class for creating Pydantic models.

Attributes:
__class_vars__: The names of the class variables defined on the model.
__private_attributes__: Metadata about the private attributes of the model.
__signature__: The synthesized `__init__` [`Signature`][inspect.Signature] of the model.

__pydantic_complete__: Whether model building is completed, or if there are still undefined fields.
__pydantic_core_schema__: The core schema of the model.
__pydantic_custom_init__: Whether the model has a custom `__init__` function.
__pydantic_decorators__: Metadata containing the decorators defined on the model.
    This replaces `Model.__validators__` and `Model.__root_validators__` from Pydantic V1.
__pydantic_generic_metadata__: Metadata for generic models; contains data used for a similar purpose to
    __args__, __origin__, __parameters__ in typing-module generics. May eventually be replaced by these.
__pydantic_parent_namespace__: Parent namespace of the model, used for automatic rebuilding of models.
__pydantic_post_init__: The name of the post-init method for the model, if defined.
__pydantic_root_model__: Whether the model is a [`RootModel`][pydantic.root_model.RootModel].
__pydantic_serializer__: The `pydantic-core` `SchemaSerializer` used to dump instances of the model.
__pydantic_validator__: The `pydantic-core` `SchemaValidator` used to validate instances of the model.

__pydantic_fields__: A dictionary of field names and their corresponding [`FieldInfo`][pydantic.fields.FieldInfo] objects.
__pydantic_computed_fields__: A dictionary of computed field names and their corresponding [`ComputedFieldInfo`][pydantic.fields.ComputedFieldInfo] objects.

__pydantic_extra__: A dictionary containing extra values, if [`extra`][pydantic.config.ConfigDict.extra]
    is set to `'allow'`.
__pydantic_fields_set__: The names of fields explicitly set during instantiation.
__pydantic_private__: Values of private attributes set on the model instance.

Examples

from onprem import LLM
llm = LLM(default_model='llama', n_gpu_layers=-1, verbose=False, mute_stream=True)

Extract Titles

from onprem.ingest import load_single_document
docs = load_single_document('tests/sample_data/ktrain_paper/ktrain_paper.pdf')
title = extract_title(docs, llm=llm)
print(title)
A Low-Code Library for Augmented Machine Learning

Auto-Caption Tables

docs = load_single_document('tests/sample_data/ktrain_paper/ktrain_paper.pdf', infer_table_structure=True)
table_doc = [d for d in docs if d.metadata['table']][0]
summarize_tables([table_doc], llm, only_caption_missing=False)
print(table_doc.page_content)
Comparison of ML Tasks Supported in Popular Libraries

Table 1: A comparison of ML tasks supported out-of-the-box in popular low-code and AutoML libraries for tabular, image, audio, text and graph data.

The following table in markdown format has the caption: Table 1: A comparison of ML tasks supported out-of-the-box in popular low-code and AutoML libraries for tabular, image, audio, text and graph data. The following table in markdown format includes this list of columns:
- Task
- ktrain
- fastai
- Ludwig
- AutoKeras
- AutoGluon

|Task|ktrain|fastai|Ludwig|AutoKeras|AutoGluon|
|---|---|---|---|---|---|
|Tabular: Classification/Regression|✓|✓|✓|✓|✓|
|Tabular: Causal Machine Learning|✓|None|None|None|None|
|Tabular: Time Series Forecasting|None|None|✓|✓|None|
|Tabular: Collaborative Filtering|None|✓|None|None|None|
|Image: Classification/Regression|✓|✓|✓|✓|✓|
|Image: Object Detection|prefitted*|✓|None|None|✓|
|Image: Image Captioning|prefitted*|None|✓|None|None|
|Image: Segmentation|None|✓|None|None|None|
|Image: GANs|None|✓|None|None|None|
|Image: Keypoint/Pose Estimation|None|✓|None|None|None|
|Audio: Classification/Regression|None|None|✓|None|None|
|Audio: Speech Transcription|prefitted*|None|✓|None|None|
|Text: Classification/Regression|✓|✓|✓|✓|✓|
|Text: Sequence-Tagging|✓|None|✓|None|None|
|Text: Unsupervised Topic Modeling|✓|None|None|None|None|
|Text: Semantic Search|✓|None|None|None|None|
|Text: End-to-End Question-Answering|✓*|None|None|None|None|
|Text: Zero-Shot Learning|✓|None|None|None|None|
|Text: Language Translation|prefitted*|None|✓|None|None|
|Text: Summarization|prefitted*|None|✓|None|None|
|Text: Text Extraction|✓|None|None|None|None|
|Text: QA-Based Information Extraction|✓*|None|None|None|None|
|Text: Keyphrase Extraction|✓|None|None|None|None|
|Graph: Node Classification|✓|None|None|None|None|
|Graph: Link Prediction|✓|None|None|None|None|

In this example, the summarize_tables function prepended an alternative caption to the table text. To skip tables that already have captions, supply only_caption_missing=True.