llm helpers

Helper utilities for using LLMs

source

parse_code_markdown

 parse_code_markdown (text:str, only_last:bool)

Parse code embedded in a markdown string

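Conceptually, pulling fenced code out of markdown can be sketched with a regular expression. This is an illustration of the idea, not the actual onprem implementation (the helper function name and return shape here are assumptions):

```python
import re

def extract_code_blocks(text, only_last=False):
    """Return the contents of triple-backtick fences, ignoring language tags."""
    blocks = re.findall(r"```[\w+-]*\n(.*?)```", text, re.DOTALL)
    if only_last:
        return blocks[-1] if blocks else None
    return blocks

md = "Intro\n```python\nprint('hi')\n```\nOutro\n```\nx = 1\n```\n"
print(extract_code_blocks(md))                   # every fenced block
print(extract_code_blocks(md, only_last=True))   # just the final block
```

Returning only the last block is useful when an LLM emits explanatory text followed by a final, corrected code answer.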

source

parse_json_markdown

 parse_json_markdown (text:str)

Parse JSON embedded in markdown into a dictionary

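The idea can be sketched as follows (again an illustration, not onprem's actual code): locate a fenced JSON block and hand its contents to `json.loads`, falling back to the raw text when no fence is present.

```python
import json
import re

def json_from_markdown(text):
    """Parse the first fenced JSON block in a markdown string into a dict."""
    match = re.search(r"```(?:json)?\n(.*?)```", text, re.DOTALL)
    payload = match.group(1) if match else text  # fall back to the raw text
    return json.loads(payload)

md = 'The model replied:\n```json\n{"title": "ktrain paper"}\n```\n'
print(json_from_markdown(md))  # {'title': 'ktrain paper'}
```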

source

decompose_question

 decompose_question (question:str, llm, parse=True, **kwargs)

Decompose a question into subquestions


source

needs_followup

 needs_followup (question:str, llm, parse=True, **kwargs)

Decide if follow-up questions are needed


source

extract_title

 extract_title (docs_or_text:Union[List[langchain_core.documents.base.Document],str],
                llm, max_words=1024, retries=1, **kwargs)

*Extract or infer the title for the given text.

Args:
- docs_or_text: Either a list of LangChain Document objects or a single text string
- llm: An onprem.LLM instance
- max_words: Maximum number of words to consider
- retries: Number of attempts to correctly extract the title*


source

Title

 Title (title:str)

*A Pydantic model holding a single extracted document title.*

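Title is simply a one-field schema used to coerce the LLM's reply into structured form. A rough stand-in for how such a reply might be validated (using a dataclass instead of Pydantic purely for illustration; the JSON shape is an assumption):

```python
import json
from dataclasses import dataclass

@dataclass
class Title:
    title: str

def parse_title(raw):
    """Validate a JSON reply like '{"title": "..."}' into a Title instance."""
    data = json.loads(raw)
    if not isinstance(data.get("title"), str):
        raise ValueError("reply did not contain a string 'title' field")
    return Title(title=data["title"])

print(parse_title('{"title": "ktrain: A Low-Code Library"}').title)
```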
source

caption_tables

 caption_tables (docs:List[langchain_core.documents.base.Document], llm,
                 max_chars=4096, max_tables=3, retries=1,
                 attempt_exact=False, only_caption_missing=False,
                 **kwargs)

*Given a list of Documents, auto-caption or summarize any tables within the list.

Args:
- docs: A list of LangChain Document objects
- llm: An onprem.LLM instance
- max_chars: Maximum number of characters to consider
- retries: Number of attempts to correctly auto-caption a table
- attempt_exact: Try to extract the exact existing caption if one exists
- only_caption_missing: Only caption tables without an existing caption*


source

caption_table_text

 caption_table_text (table_text:str, llm, max_chars=4096, retries=1,
                     attempt_exact=False, **kwargs)

Caption table text


source

TableSummary

 TableSummary (summary:str)

*A Pydantic model holding a single generated table summary (caption).*

Examples

from onprem import LLM
llm = LLM(default_model='llama', n_gpu_layers=-1, verbose=False, mute_stream=True)

Deciding on Follow-Up Questions

needs_followup('What is ktrain?', llm=llm)
False
needs_followup('What is the capital of France?', llm=llm)
False
needs_followup("How was Paul Graham's life different before, during, and after YC?", llm)
True
needs_followup("Compare and contrast the customer segments and geographies of Lyft and Uber that grew the fastest.", llm)
True
needs_followup("Compare and contrast Uber and Lyft.", llm)
True

Generating Follow-Up Questions

question = "Compare and contrast the customer segments and geographies of Lyft and Uber that grew the fastest."
subquestions = decompose_question(question, llm=llm, parse=False)
print()
print(subquestions)

['What are the customer segments that drove growth for Uber', 'What are the geographies where Uber grew the fastest', 'What are the customer segments that drove growth for Lyft', 'What are the geographies where Lyft grew the fastest']

Extract Titles

from onprem.ingest import load_single_document
docs = load_single_document('tests/sample_data/ktrain_paper/ktrain_paper.pdf')
title = extract_title(docs, llm=llm)
print(title)
ktrain: A Low-Code Library for Augmented Machine Learning

Auto-Caption Tables

docs = load_single_document('tests/sample_data/ktrain_paper/ktrain_paper.pdf', infer_table_structure=True)
table_doc = [d for d in docs if 'table' in d.metadata][0]
caption_tables([table_doc], llm, only_caption_missing=False)
print(table_doc.page_content)
Comparison of AutoML Libraries for Different Data Types

The following table in markdown format has the caption: Table 1: A comparison of ML tasks supported out-of-the-box in popular low-code and AutoML libraries for tabular, image, audio, text and graph data..
|Task|ktrain|fastai|Ludwig|AutoKeras|AutoGluon|
|---|---|---|---|---|---|
|Tabular: Classification/Regression|✓|✓|✓|✓|✓|
|Tabular: Causal Machine Learning|✓|None|None|None|None|
|Tabular: Time Series Forecasting|None|None|✓|✓|None|
|Tabular: Collaborative Filtering|None|✓|None|None|None|
|Image: Classification/Regression|✓|✓|✓|✓|✓|
|Image: Object Detection|prefitted*|✓|None|None|✓|
|Image: Image Captioning|prefitted*|None|✓|None|None|
|Image: Segmentation|None|✓|None|None|None|
|Image: GANs|None|✓|None|None|None|
|Image: Keypoint/Pose Estimation|None|✓|None|None|None|
|Audio: Classification/Regression|None|None|✓|None|None|
|Audio: Speech Transcription|prefitted*|None|✓|None|None|
|Text: Classification/Regression|✓|✓|✓|✓|✓|
|Text: Sequence-Tagging|✓|None|✓|None|None|
|Text: Unsupervised Topic Modeling|✓|None|None|None|None|
|Text: Semantic Search|✓|None|None|None|None|
|Text: End-to-End Question-Answering|✓*|None|None|None|None|
|Text: Zero-Shot Learning|✓|None|None|None|None|
|Text: Language Translation|prefitted*|None|✓|None|None|
|Text: Summarization|prefitted*|None|✓|None|None|
|Text: Text Extraction|✓|None|None|None|None|
|Text: QA-Based Information Extraction|✓*|None|None|None|None|
|Text: Keyphrase Extraction|✓|None|None|None|None|
|Graph: Node Classification|✓|None|None|None|None|
|Graph: Link Prediction|✓|None|None|None|None|

In this example, the caption_tables function prepended an alternative caption to the table text. You can skip tables that already have captions by supplying only_caption_missing=True.