Skip to content

Context Management API

PatchPal's context management system handles token estimation, context window limits, and automatic compaction.

TokenEstimator

patchpal.context.TokenEstimator(model_id)

Estimate tokens in messages for context management.

Uses character-based estimation (~3 chars per token) as a fallback when actual token counts from API responses are not available. This works reliably for all models without requiring network access or external dependencies.

Source code in patchpal/context.py
def __init__(self, model_id: str):
    self.model_id = model_id
    # Character-based estimation is used as fallback (primary: actual API token counts)
    self._encoder = None

estimate_tokens(text)

Estimate tokens in text using character-based heuristic.

Uses ~3 chars per token which is accurate for code-heavy content and works reliably without requiring network access for tokenizer data.

Parameters:

Name Type Description Default
text str

Text to estimate tokens for

required

Returns:

Type Description
int

Estimated token count

Source code in patchpal/context.py
def estimate_tokens(self, text: str) -> int:
    """Estimate tokens in text using character-based heuristic.

    Uses ~3 chars per token which is accurate for code-heavy content
    and works reliably without requiring network access for tokenizer data.

    Args:
        text: Text to estimate tokens for

    Returns:
        Estimated token count
    """
    if not text:
        return 0

    # Character-based estimation: ~3 chars per token
    # This is more accurate than 4 chars/token for technical content
    # and works reliably for all models without network dependencies
    return len(str(text)) // 3

estimate_message_tokens(message)

Estimate tokens in a single message.

Parameters:

Name Type Description Default
message Dict[str, Any]

Message dict with role, content, tool_calls, etc.

required

Returns:

Type Description
int

Estimated token count

Source code in patchpal/context.py
def estimate_message_tokens(self, message: Dict[str, Any]) -> int:
    """Estimate tokens in a single message.

    Args:
        message: Message dict with role, content, tool_calls, etc.

    Returns:
        Estimated token count
    """
    tokens = 0

    # Role and content
    if "role" in message:
        tokens += 4  # Role overhead

    if "content" in message and message["content"]:
        content = message["content"]

        # Handle multimodal content (list of content blocks)
        if isinstance(content, list):
            for block in content:
                if isinstance(block, dict):
                    if block.get("type") == "text":
                        # Text content
                        tokens += self.estimate_tokens(str(block.get("text", "")))
                    elif block.get("type") == "image_url":
                        # Image content - vision models charge varying amounts depending on:
                        # - Image dimensions (larger = more tokens)
                        # - Detail level (low/high/auto)
                        # - Provider (OpenAI: 765-2,298, Anthropic/others: similar)
                        # Use 1200 tokens as conservative cross-provider estimate
                        # Reference: oh-my-pi uses same value for reliable context management
                        tokens += 1200
                else:
                    # Fallback for unexpected structure
                    tokens += self.estimate_tokens(str(block))
        else:
            # Regular text content
            tokens += self.estimate_tokens(str(content))

    # Tool calls
    if message.get("tool_calls"):
        for tool_call in message["tool_calls"]:
            tokens += 10  # Tool call overhead
            if hasattr(tool_call, "function"):
                tokens += self.estimate_tokens(tool_call.function.name)
                tokens += self.estimate_tokens(tool_call.function.arguments)

    # Tool call ID
    if message.get("tool_call_id"):
        tokens += 5

    # Name field
    if message.get("name"):
        tokens += self.estimate_tokens(message["name"])

    # Reasoning fields (for reasoning models like gpt-oss)
    # These are passed back to the API by LiteLLM and count as input tokens
    reasoning_fields = ["reasoning_content", "reasoning", "reasoning_text"]
    for field in reasoning_fields:
        if message.get(field):
            tokens += self.estimate_tokens(str(message[field]))

    # Thinking blocks (for Anthropic extended thinking)
    # These are also passed back to the API and count as input tokens
    if message.get("thinking_blocks"):
        for block in message["thinking_blocks"]:
            if isinstance(block, dict) and block.get("thinking"):
                tokens += self.estimate_tokens(str(block["thinking"]))

    return tokens

estimate_messages_tokens(messages)

Estimate tokens in a list of messages.

Parameters:

Name Type Description Default
messages List[Dict[str, Any]]

List of message dicts

required

Returns:

Type Description
int

Total estimated token count

Source code in patchpal/context.py
def estimate_messages_tokens(self, messages: List[Dict[str, Any]]) -> int:
    """Estimate tokens in a list of messages.

    Args:
        messages: List of message dicts

    Returns:
        Total estimated token count
    """
    return sum(self.estimate_message_tokens(msg) for msg in messages)

ContextManager

patchpal.context.ContextManager(model_id, system_prompt)

Manage context window with auto-compaction and pruning.

Initialize context manager.

Parameters:

Name Type Description Default
model_id str

LiteLLM model identifier

required
system_prompt str

System prompt text

required
Source code in patchpal/context.py
def __init__(self, model_id: str, system_prompt: str):
    """Initialize context manager.

    Args:
        model_id: LiteLLM model identifier
        system_prompt: System prompt text
    """
    self.model_id = model_id
    self.system_prompt = system_prompt
    self.estimator = TokenEstimator(model_id)
    self.context_limit = self._get_context_limit()
    # Reserve 16% of context for output (min 4K, max 32K)
    # This ensures older models like GPT-4 (8K) get 1.28K reserve
    # while modern models get full 32K reserve
    self.output_reserve = min(32_000, max(4_000, int(self.context_limit * 0.16)))

needs_compaction(messages, actual_prompt_tokens=None)

Check if context window needs compaction.

ALWAYS estimates current messages to avoid staleness issues when predicting whether the NEXT API call will overflow. Using actual_prompt_tokens from a previous call can cause false negatives when large messages are added between the last API call and the compaction check.

Example of staleness bug (fixed): - Previous API call: 120K tokens (60% usage) - User pastes huge changelog: +90K tokens - Total: 210K tokens (exceeds 200K limit) - Bug: If we used actual_prompt_tokens=120K, we'd think we're at 60% - Fix: Always re-estimate to see the 210K total

The actual_prompt_tokens parameter is kept for API compatibility but ignored for compaction decisions. Use get_usage_stats() for display purposes where actual tokens are appropriate (staleness OK for showing recent stats).

Parameters:

Name Type Description Default
messages List[Dict[str, Any]]

Current message history

required
actual_prompt_tokens int

IGNORED - kept for API compatibility only

None

Returns:

Type Description
bool

True if compaction is needed

Source code in patchpal/context.py
def needs_compaction(
    self, messages: List[Dict[str, Any]], actual_prompt_tokens: int = None
) -> bool:
    """Check if context window needs compaction.

    ALWAYS estimates current messages to avoid staleness issues when predicting
    whether the NEXT API call will overflow. Using actual_prompt_tokens from a
    previous call can cause false negatives when large messages are added between
    the last API call and the compaction check.

    Example of staleness bug (fixed):
    - Previous API call: 120K tokens (60% usage)
    - User pastes huge changelog: +90K tokens
    - Total: 210K tokens (exceeds 200K limit)
    - Bug: If we used actual_prompt_tokens=120K, we'd think we're at 60%
    - Fix: Always re-estimate to see the 210K total

    The actual_prompt_tokens parameter is kept for API compatibility but ignored
    for compaction decisions. Use get_usage_stats() for display purposes where
    actual tokens are appropriate (staleness OK for showing recent stats).

    Args:
        messages: Current message history
        actual_prompt_tokens: IGNORED - kept for API compatibility only

    Returns:
        True if compaction is needed
    """
    # ALWAYS estimate current messages - never use stale actual_prompt_tokens
    # This ensures we detect large message additions that happen between API calls
    # Note: Dynamic date/time message adds ~30 tokens on each LLM call
    system_tokens = self.estimator.estimate_tokens(self.system_prompt)
    datetime_tokens = 30  # Approximate size of dynamic date/time message
    message_tokens = self.estimator.estimate_messages_tokens(messages)
    total_tokens = system_tokens + datetime_tokens + message_tokens + self.output_reserve

    # Check threshold
    usage_ratio = total_tokens / self.context_limit
    return usage_ratio >= self.COMPACT_THRESHOLD

get_usage_stats(messages, actual_prompt_tokens=None)

Get current context usage statistics.

Parameters:

Name Type Description Default
messages List[Dict[str, Any]]

Current message history

required
actual_prompt_tokens int

Optional actual prompt tokens from latest API response (includes cache operations)

None

Returns:

Type Description
Dict[str, Any]

Dict with usage statistics

Source code in patchpal/context.py
def get_usage_stats(
    self, messages: List[Dict[str, Any]], actual_prompt_tokens: int = None
) -> Dict[str, Any]:
    """Get current context usage statistics.

    Args:
        messages: Current message history
        actual_prompt_tokens: Optional actual prompt tokens from latest API response (includes cache operations)

    Returns:
        Dict with usage statistics
    """
    # If we have actual prompt tokens from API (includes cache writes/reads), use those
    if actual_prompt_tokens is not None:
        total_tokens = actual_prompt_tokens + self.output_reserve
        # For display purposes, estimate system vs message breakdown
        system_tokens = self.estimator.estimate_tokens(self.system_prompt)
        datetime_tokens = 30
        message_tokens = actual_prompt_tokens - system_tokens - datetime_tokens

        return {
            "system_tokens": system_tokens + datetime_tokens,
            "message_tokens": max(0, message_tokens),  # Ensure non-negative
            "output_reserve": self.output_reserve,
            "total_tokens": total_tokens,
            "context_limit": self.context_limit,
            "usage_ratio": total_tokens / self.context_limit,
            "usage_percent": int((total_tokens / self.context_limit) * 100),
        }

    # Fallback to estimation when actual tokens not available
    system_tokens = self.estimator.estimate_tokens(self.system_prompt)
    datetime_tokens = 30  # Approximate size of dynamic date/time message
    message_tokens = self.estimator.estimate_messages_tokens(messages)
    total_tokens = system_tokens + datetime_tokens + message_tokens + self.output_reserve

    return {
        "system_tokens": system_tokens + datetime_tokens,  # Include datetime in system count
        "message_tokens": message_tokens,
        "output_reserve": self.output_reserve,
        "total_tokens": total_tokens,
        "context_limit": self.context_limit,
        "usage_ratio": total_tokens / self.context_limit,
        "usage_percent": int((total_tokens / self.context_limit) * 100),
    }

Usage Example

from patchpal.agent import create_agent

agent = create_agent()

# Check context usage
stats = agent.context_manager.get_usage_stats(agent.messages)
print(f"Token usage: {stats['total_tokens']:,} / {stats['context_limit']:,}")
print(f"Usage: {stats['usage_percent']}%")
print(f"Output budget remaining: {stats['output_budget_remaining']:,} tokens")

# Check if compaction is needed
if agent.context_manager.needs_compaction(agent.messages):
    print("Context window getting full - compaction will trigger soon")

# Manually trigger compaction (usually automatic)
agent._perform_auto_compaction()

How Context Management Works

  1. Token Estimation: Uses tiktoken (or fallback character estimation) to estimate message tokens
  2. Context Limits: Tracks model-specific context window sizes (e.g., 200K for Claude Sonnet)
  3. Automatic Compaction: When context reaches 70% full, summarizes old messages to free space
  4. Output Budget: Reserves tokens for model output based on context window size

Context Limits by Model Family

The context manager automatically detects limits for common models:

  • Claude 3.5 Sonnet: 200,000 tokens
  • Claude 3 Opus: 200,000 tokens
  • GPT-4 Turbo: 128,000 tokens
  • GPT-4: 8,192 tokens
  • GPT-3.5: 16,385 tokens

For unknown models, falls back to 128,000 tokens.