Cost-Aware LLM Pipeline

Overview

Composable patterns for controlling LLM API costs while maintaining quality. Combines model routing, budget tracking, retry logic, and prompt caching into a single pipeline.

Core Techniques

1. Model Routing by Task Complexity

Route simple tasks to cheap models, reserve expensive models for complex ones.

MODEL_SONNET = "claude-sonnet-4-6"
MODEL_HAIKU = "claude-haiku-4-5-20251001"

_SONNET_TEXT_THRESHOLD = 10_000  # chars
_SONNET_ITEM_THRESHOLD = 30     # items

def select_model(text_length: int, item_count: int, force_model: str | None = None) -> str:
    if force_model is not None:
        return force_model
    if text_length >= _SONNET_TEXT_THRESHOLD or item_count >= _SONNET_ITEM_THRESHOLD:
        return MODEL_SONNET
    return MODEL_HAIKU  # 3-4x cheaper

2. Immutable Cost Tracking

Track cumulative spend with frozen dataclasses. Each API call returns a new tracker — never mutates state.

@dataclass(frozen=True, slots=True)
class CostTracker:
    budget_limit: float = 1.00
    records: tuple[CostRecord, ...] = ()

    def add(self, record: CostRecord) -> "CostTracker":
        return CostTracker(budget_limit=self.budget_limit, records=(*self.records, record))

    @property
    def over_budget(self) -> bool:
        return sum(r.cost_usd for r in self.records) > self.budget_limit

3. Narrow Retry Logic

Retry only on transient errors. Fail fast on authentication or bad request errors.

_RETRYABLE_ERRORS = (APIConnectionError, RateLimitError, InternalServerError)

def call_with_retry(func, *, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        except _RETRYABLE_ERRORS:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff

4. Prompt Caching

Cache long system prompts to avoid resending on every request:

{"type": "text", "text": system_prompt, "cache_control": {"type": "ephemeral"}}

Composed Pipeline

def process(text, config, tracker):
    model = select_model(len(text), estimated_items, config.force_model)
    if tracker.over_budget:
        raise BudgetExceededError(...)
    response = call_with_retry(lambda: client.messages.create(
        model=model, messages=build_cached_messages(system_prompt, text)))
    tracker = tracker.add(CostRecord(...))
    return parse_result(response), tracker

Pricing Reference (2025-2026)

Model	Input ($/1M tokens)	Output ($/1M tokens)	Relative Cost
Haiku 4.5	$0.80	$4.00	1x
Sonnet 4.6	$3.00	$15.00	~4x
Opus 4.5	$15.00	$75.00	~19x

Best Practices

Start with the cheapest model; upgrade only when complexity thresholds are met
Set explicit budget limits before batch processing
Log model selection decisions for threshold tuning
Use prompt caching for system prompts over 1024 tokens
Never retry on authentication or validation errors

Anti-Patterns

Using the most expensive model for all requests regardless of complexity
Retrying on all errors (wastes budget on permanent failures)
Mutating cost tracking state
Hardcoding model names throughout the codebase
Ignoring prompt caching for repetitive system prompts

Cost-Aware LLM Pipeline

Cost-Aware LLM Pipeline

Overview

Core Techniques

1. Model Routing by Task Complexity

2. Immutable Cost Tracking

3. Narrow Retry Logic

4. Prompt Caching

Composed Pipeline

Pricing Reference (2025-2026)

Best Practices

Anti-Patterns

相关技能 Related Skills

Continuous Learning System

Iterative Retrieval

Agent Orchestration