llm helpers

Helper utilities for using LLMs

source

parse_code_markdown

 parse_code_markdown (text:str, only_last:bool)

Parse code embedded in a markdown string

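Conceptually, pulling fenced code out of markdown can be sketched with a regular expression. This is an illustration of the idea, not the actual onprem implementation (the helper function name and return shape here are assumptions):

```python
import re

def extract_code_blocks(text, only_last=False):
    """Return the contents of triple-backtick fences, ignoring language tags."""
    blocks = re.findall(r"```[\w+-]*\n(.*?)```", text, re.DOTALL)
    if only_last:
        return blocks[-1] if blocks else None
    return blocks

md = "Intro\n```python\nprint('hi')\n```\nOutro\n```\nx = 1\n```\n"
print(extract_code_blocks(md))                   # every fenced block
print(extract_code_blocks(md, only_last=True))   # just the final block
```

Returning only the last block is useful when an LLM emits explanatory text followed by a final, corrected code answer.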

source

parse_json_markdown

 parse_json_markdown (text:str)

Parse JSON embedded in markdown into a dictionary

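The idea can be sketched as follows (again an illustration, not onprem's actual code): locate a fenced JSON block and hand its contents to `json.loads`, falling back to the raw text when no fence is present.

```python
import json
import re

def json_from_markdown(text):
    """Parse the first fenced JSON block in a markdown string into a dict."""
    match = re.search(r"```(?:json)?\n(.*?)```", text, re.DOTALL)
    payload = match.group(1) if match else text  # fall back to the raw text
    return json.loads(payload)

md = 'The model replied:\n```json\n{"title": "ktrain paper"}\n```\n'
print(json_from_markdown(md))  # {'title': 'ktrain paper'}
```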

source

decompose_question

 decompose_question (question:str, llm, parse=True, **kwargs)

Decompose a question into subquestions


source

needs_followup

 needs_followup (question:str, llm, parse=True, **kwargs)

Decide if follow-up questions are needed


source

extract_title

 extract_title (docs_or_text:Union[List[langchain_core.documents.base.Document],str],
                llm, max_words=1024, retries=1, **kwargs)

*Extract or infer the title for the given text.

Args:
- docs_or_text: Either a list of LangChain Document objects or a single text string
- llm: An onprem.LLM instance
- max_words: Maximum number of words to consider
- retries: Number of attempts to correctly extract the title*


source

Title

 Title (title:str)

*A Pydantic model holding a single extracted document title.*

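Title is simply a one-field schema used to coerce the LLM's reply into structured form. A rough stand-in for how such a reply might be validated (using a dataclass instead of Pydantic purely for illustration; the JSON shape is an assumption):

```python
import json
from dataclasses import dataclass

@dataclass
class Title:
    title: str

def parse_title(raw):
    """Validate a JSON reply like '{"title": "..."}' into a Title instance."""
    data = json.loads(raw)
    if not isinstance(data.get("title"), str):
        raise ValueError("reply did not contain a string 'title' field")
    return Title(title=data["title"])

print(parse_title('{"title": "ktrain: A Low-Code Library"}').title)
```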
source

caption_tables

 caption_tables (docs:List[langchain_core.documents.base.Document], llm,
                 max_chars=4096, max_tables=3, retries=1,
                 attempt_exact=False, only_caption_missing=False,
                 **kwargs)

*Given a list of Documents, auto-caption or summarize any tables within the list.

Args:
- docs: A list of LangChain Document objects
- llm: An onprem.LLM instance
- max_chars: Maximum number of characters to consider
- retries: Number of attempts to correctly auto-caption a table
- attempt_exact: Try to extract the exact existing caption if one exists
- only_caption_missing: Only caption tables without an existing caption*


source

caption_table_text

 caption_table_text (table_text:str, llm, max_chars=4096, retries=1,
                     attempt_exact=False, **kwargs)

Caption table text


source

TableSummary

 TableSummary (summary:str)

*A Pydantic model holding a single generated table summary (caption).*

Examples

from onprem import LLM
llm = LLM(default_model='llama', n_gpu_layers=-1, verbose=False, mute_stream=True)

Deciding on Follow-Up Questions

needs_followup('What is ktrain?', llm=llm)
False
needs_followup('What is the capital of France?', llm=llm)
False
needs_followup("How was Paul Graham's life different before, during, and after YC?", llm)
True
needs_followup("Compare and contrast the customer segments and geographies of Lyft and Uber that grew the fastest.", llm)
True
needs_followup("Compare and contrast Uber and Lyft.", llm)
True

Generating Follow-Up Questions

question = "Compare and contrast the customer segments and geographies of Lyft and Uber that grew the fastest."
subquestions = decompose_question(question, llm=llm, parse=False)
print()
print(subquestions)

['What are the customer segments that drove growth for Uber', 'What are the geographies where Uber grew the fastest', 'What are the customer segments that drove growth for Lyft', 'What are the geographies where Lyft grew the fastest']

Extract Titles

from onprem.ingest import load_single_document
docs = load_single_document('tests/sample_data/ktrain_paper/ktrain_paper.pdf')
title = extract_title(docs, llm=llm)
print(title)
ktrain: A Low-Code Library for Augmented Machine Learning

Auto-Caption Tables

docs = load_single_document('tests/sample_data/ktrain_paper/ktrain_paper.pdf', infer_table_structure=True)
table_doc = [d for d in docs if 'table' in d.metadata][0]
caption_tables([table_doc], llm, only_caption_missing=False)
print(table_doc.page_content)
Comparison of AutoML Libraries for Different Data Types

The following table in markdown format has the caption: Table 1: A comparison of ML tasks supported out-of-the-box in popular low-code and AutoML libraries for tabular, image, audio, text and graph data..
|Task|ktrain|fastai|Ludwig|AutoKeras|AutoGluon|
|---|---|---|---|---|---|
|Tabular: Classification/Regression|✓|✓|✓|✓|✓|
|Tabular: Causal Machine Learning|✓|None|None|None|None|
|Tabular: Time Series Forecasting|None|None|✓|✓|None|
|Tabular: Collaborative Filtering|None|✓|None|None|None|
|Image: Classification/Regression|✓|✓|✓|✓|✓|
|Image: Object Detection|prefitted*|✓|None|None|✓|
|Image: Image Captioning|prefitted*|None|✓|None|None|
|Image: Segmentation|None|✓|None|None|None|
|Image: GANs|None|✓|None|None|None|
|Image: Keypoint/Pose Estimation|None|✓|None|None|None|
|Audio: Classification/Regression|None|None|✓|None|None|
|Audio: Speech Transcription|prefitted*|None|✓|None|None|
|Text: Classification/Regression|✓|✓|✓|✓|✓|
|Text: Sequence-Tagging|✓|None|✓|None|None|
|Text: Unsupervised Topic Modeling|✓|None|None|None|None|
|Text: Semantic Search|✓|None|None|None|None|
|Text: End-to-End Question-Answering|✓*|None|None|None|None|
|Text: Zero-Shot Learning|✓|None|None|None|None|
|Text: Language Translation|prefitted*|None|✓|None|None|
|Text: Summarization|prefitted*|None|✓|None|None|
|Text: Text Extraction|✓|None|None|None|None|
|Text: QA-Based Information Extraction|✓*|None|None|None|None|
|Text: Keyphrase Extraction|✓|None|None|None|None|
|Graph: Node Classification|✓|None|None|None|None|
|Graph: Link Prediction|✓|None|None|None|None|

In this example, the caption_tables function prepended an alternative caption to the table text. You can skip tables that already have captions by supplying only_caption_missing=True.