pipelines.summarizer

Pipelines for specific tasks like summarization

source

Summarizer


def Summarizer(
    llm, prompt_template:Optional=None, map_prompt:Optional=None, reduce_prompt:Optional=None,
    refine_prompt:Optional=None, kwargs:VAR_KEYWORD
):

Summarizer summarizes one or more documents

Args:

  • llm: An onprem.LLM object
  • prompt_template: A model specific prompt_template with a single placeholder named “{prompt}”. All prompts (e.g., Map-Reduce prompts) are wrapped within this prompt. If supplied, overrides the prompt_template supplied to the LLM constructor.
  • map_prompt: Map prompt for Map-Reduce summarization. If None, default is used.
  • reduce_prompt: Reduce prompt for Map-Reduce summarization. If None, default is used.
  • refine_prompt: Refine prompt for Refine-based summarization. If None, default is used.

source

Summarizer.summarize


def summarize(
    fpath:str=None, # path to either a folder of documents or a single file
    strategy:str='map_reduce', # One of {'map_reduce', 'refine'}
    chunk_size:int=1000, # Number of characters of each chunk to summarize
    chunk_overlap:int=0, # Number of characters that overlap between chunks
    token_max:int=2000, # Maximum number of tokens to group documents into
    max_chunks_to_use:Optional=None, # Maximum number of chunks (starting from beginning) to use
    raw_text:str=None, # Optional: raw text to process (skips file loading)
):

Summarize one or more documents (e.g., PDFs, MS Word, MS Powerpoint, plain text) using either Langchain’s Map-Reduce strategy or Refine strategy. The max_chunks parameter may be useful for documents that have abstracts or informative introductions. If max_chunks=None, all chunks are considered for summarizer.

Args: fpath: Path to file or directory strategy: Summarization strategy (‘map_reduce’ or ‘refine’) chunk_size: Characters per chunk chunk_overlap: Character overlap between chunks
token_max: Maximum tokens to group documents into max_chunks_to_use: Maximum chunks to process raw_text: Raw text to process (skips file loading)

Note: Provide exactly one of: fpath or raw_text


source

Summarizer.summarize_by_concept


def summarize_by_concept(
    fpath:NoneType=None, # path to file, raw text, or list of pre-chunked text
    concept_description:str=None, # Summaries are generated with respect to the described concept.
    similarity_threshold:float=0.0, # Minimum similarity for consideration. Tip: Increase this when using similarity_method="senttransform" to mitigate hallucination. A value of 0.0 is sufficient for TF-IDF or should be kept near-zero.
    max_chunks:int=4, # Only this many snippets above similarity_threshold are considered.
    similarity_method:str='tfidf', # One of "senttransform" (sentence-transformer embeddings) or "tfidf" (TF-IDF)
    summary_prompt:str='What does the following context say with respect "{concept_description}"? \n\nCONTEXT:\n{text}', # The prompt used for summarization. Should have exactly two variables, {concept_description} and {text}.
    raw_text:str=None, # Optional: raw text to process (skips file loading)
    chunks:list=None, # Optional: pre-chunked text as list (skips both file loading and chunking)
):

Summarize document with respect to concept described by concept_description. Returns a tuple of the form (summary, sources).

Args: fpath: Path to file concept_description: The concept to focus summarization on similarity_threshold: Minimum similarity score for chunk consideration max_chunks: Maximum number of chunks to use for summarization similarity_method: “tfidf” or “senttransform” summary_prompt: Template for summarization prompt raw_text: Raw text to process (skips file loading) chunks: Pre-chunked text as list (skips file loading and chunking)

Note: Provide exactly one of: fpath, raw_text, or chunks