Regex vs LLM Decision Framework for Structured Text

Overview

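A practical decision framework for parsing structured text. The key insight: when the text follows a consistent, repeating format, regex handles 95-98% of items cheaply and deterministically (98.0% in the 410-item production run reported under Production Metrics). Reserve the more expensive LLM calls for the remaining edge cases.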

Decision Tree

Is the text format consistent and repeating?
├── Yes (>90% follows a pattern) → Start with Regex
│   ├── Regex handles 95%+ → Done, no LLM needed
│   └── Regex handles <95% → Add LLM for edge cases only
└── No (free-form, highly variable) → Use LLM directly
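
The tree above can be read as a small function. This is only a sketch: the function name, the parameter names, and the string labels it returns are assumptions made for illustration, while the 0.90 and 0.95 cut-offs are taken from the tree itself.

```python
from typing import Optional


def choose_strategy(pattern_share: float, regex_success: Optional[float] = None) -> str:
    """Walk the decision tree.

    pattern_share: fraction of the text that follows a repeating pattern (0.0-1.0).
    regex_success: fraction of items the regex parsed cleanly, once measured.
    """
    if pattern_share <= 0.90:
        return "llm"        # free-form, highly variable: use the LLM directly
    if regex_success is None:
        return "regex"      # consistent format: start with regex, then measure
    if regex_success >= 0.95:
        return "regex"      # regex handles 95%+: done, no LLM needed
    return "regex+llm"      # regex handles <95%: add the LLM for edge cases only
```

In practice `regex_success` would come from a first regex pass over a sample of the data, which is why the function treats it as optional.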

Architecture

Source Text → [Regex Parser] → [Confidence Scorer]
                                    │
                                    ├── High confidence (≥0.95) → Direct output
                                    └── Low confidence (<0.95) → [LLM Validator] → Output

Implementation Pattern

1. Regex Parser (handles 95-98%)

def parse_structured_text(content: str) -> list[ParsedItem]:
    pattern = re.compile(
        r"(?P<id>\d+)\.\s*(?P<text>.+?)\n"
        r"(?P<choices>(?:[A-D]\..+?\n)+)"
        r"Answer:\s*(?P<answer>[A-D])",
        re.MULTILINE | re.DOTALL,
    )
    # Extract matches into frozen dataclasses
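
The excerpt above stops at the extraction comment. A runnable completion might look like the following sketch. The ParsedItem fields are an assumption (one field per named group in the regex), and the sample text is made up for illustration; the pattern itself is unchanged.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedItem:
    # Assumed fields: one per named group in the regex.
    id: str
    text: str
    choices: tuple[str, ...]
    answer: str


# Same pattern as in the excerpt, compiled once at import time.
ITEM_PATTERN = re.compile(
    r"(?P<id>\d+)\.\s*(?P<text>.+?)\n"
    r"(?P<choices>(?:[A-D]\..+?\n)+)"
    r"Answer:\s*(?P<answer>[A-D])",
    re.MULTILINE | re.DOTALL,
)


def parse_structured_text(content: str) -> list[ParsedItem]:
    """Return one frozen ParsedItem per regex match, in document order."""
    return [
        ParsedItem(
            id=m["id"],
            text=m["text"].strip(),
            choices=tuple(line.strip() for line in m["choices"].splitlines()),
            answer=m["answer"],
        )
        for m in ITEM_PATTERN.finditer(content)
    ]


# Illustrative input, not taken from the production data.
SAMPLE = (
    "1. What is 2 + 2?\n"
    "A. 3\n"
    "B. 4\n"
    "C. 5\n"
    "D. 6\n"
    "Answer: B\n"
    "2. Capital of France?\n"
    "A. Paris\n"
    "B. Rome\n"
    "Answer: A\n"
)
items = parse_structured_text(SAMPLE)
```

One thing to watch: because of re.DOTALL, the lazy text group can span lines, so an item with no choice lines may be absorbed into the text of the item that follows it. Malformed input of that kind is part of what the confidence scoring step is meant to catch.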

2. Confidence Scoring

Flag items that may need LLM review based on heuristics (few choices, missing answers, short text).
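
As a rough sketch of those heuristics: the score starts at 1.0 and loses a fixed penalty for each check that fails. The penalty sizes, the four-choice and ten-character cut-offs, the score_confidence name, and the ParsedItem fields are all assumptions for illustration; only the three heuristics and the identify_low_confidence name come from this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParsedItem:
    # Assumed fields; answer is Optional so a missing answer can be represented.
    id: str
    text: str
    choices: tuple[str, ...]
    answer: Optional[str]


def score_confidence(item: ParsedItem) -> float:
    """Start at 1.0 and subtract an illustrative penalty per failed check."""
    score = 1.0
    if len(item.choices) < 4:          # few choices
        score -= 0.3
    if item.answer is None:            # missing answer
        score -= 0.5
    if len(item.text.strip()) < 10:    # short text
        score -= 0.2
    return max(score, 0.0)


def identify_low_confidence(items, threshold: float = 0.95) -> list[ParsedItem]:
    """Return the items whose score falls below the threshold."""
    return [item for item in items if score_confidence(item) < threshold]


# Illustrative items: one clean, one that trips all three checks.
good = ParsedItem("1", "What is the capital of France?",
                  ("A. Paris", "B. Rome", "C. Oslo", "D. Bern"), "A")
bad = ParsedItem("2", "Hm?", ("A. x",), None)
flagged = identify_low_confidence([good, bad])
```

With a 0.95 threshold, any single failed check is enough to flag an item, which matches the intent of sending anything doubtful to review.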

3. LLM Validator (edge cases only)

Use the cheapest available model (Haiku-class) to fix low-confidence extractions.
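
A minimal validator sketch follows. It deliberately avoids any real SDK: LLMClient is a hypothetical interface with a single complete(prompt) method, and validate_with_llm, the prompt wording, and the JSON reply shape are likewise assumptions. Adapt the client call to whatever library is actually in use.

```python
import json
from dataclasses import dataclass, replace
from typing import Optional, Protocol


@dataclass(frozen=True)
class ParsedItem:
    # Assumed fields, mirroring the regex groups.
    id: str
    text: str
    choices: tuple[str, ...]
    answer: Optional[str]


class LLMClient(Protocol):
    """Hypothetical client interface; not the API of any specific SDK."""

    def complete(self, prompt: str) -> str: ...


def validate_with_llm(item: ParsedItem, llm_client: LLMClient) -> ParsedItem:
    """Ask the model to repair one item; return a new instance, never mutate."""
    prompt = (
        "Fix this extracted multiple-choice item. "
        'Reply with JSON only, using the keys "text", "choices", "answer".\n'
        + json.dumps({"text": item.text, "choices": list(item.choices),
                      "answer": item.answer})
    )
    try:
        fixed = json.loads(llm_client.complete(prompt))
        return replace(item, text=str(fixed["text"]),
                       choices=tuple(fixed["choices"]), answer=fixed["answer"])
    except (ValueError, KeyError, TypeError):
        return item  # unusable reply: keep the regex result unchanged


class CannedClient:
    """Stand-in for a real model, used only to exercise the function."""

    def complete(self, prompt: str) -> str:
        return '{"text": "What is 2 + 2?", "choices": ["A. 3", "B. 4"], "answer": "B"}'


broken = ParsedItem("1", "What is", ("A. 3",), None)
fixed = validate_with_llm(broken, CannedClient())
```

Falling back to the original item on a malformed reply keeps the pipeline deterministic: the LLM can improve a low-confidence extraction but cannot make the output worse than the regex baseline.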

4. Hybrid Pipeline

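def process_document(content, *, llm_client=None, confidence_threshold=0.95):
    items = parse_structured_text(content)                                  # Step 1: Regex
    low_confidence = identify_low_confidence(items, confidence_threshold)  # Step 2: Score
    if llm_client is None or not low_confidence:
        return items
    # Step 3: LLM only for flagged items (validate_with_llm is a hypothetical helper)
    flagged = {item.id for item in low_confidence}
    return [validate_with_llm(i, llm_client) if i.id in flagged else i for i in items]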

Production Metrics (410 items)

Metric	Value
Regex success rate	98.0%
Low confidence items	8 (2.0%)
LLM calls needed	~5
Cost savings vs all-LLM	~95%
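
The rates in the table follow directly from the counts. The helper below is an illustrative sketch (its name and dictionary keys are assumptions) that recomputes them from the 410 items, 8 low-confidence items, and roughly 5 LLM calls reported above. It does not derive the cost-savings figure, which depends on token counts and pricing not given here.

```python
def pipeline_metrics(total_items: int, low_confidence: int, llm_calls: int) -> dict[str, float]:
    """Compute pipeline health rates from raw counts."""
    if total_items <= 0:
        raise ValueError("total_items must be positive")
    return {
        "regex_success_rate": (total_items - low_confidence) / total_items,
        "low_confidence_rate": low_confidence / total_items,
        "llm_call_rate": llm_calls / total_items,
    }


# Counts from the table; the LLM call count is approximate in the source.
metrics = pipeline_metrics(total_items=410, low_confidence=8, llm_calls=5)
print(f"regex success:  {metrics['regex_success_rate']:.1%}")   # 98.0%
print(f"low confidence: {metrics['low_confidence_rate']:.1%}")  # 2.0%
print(f"LLM call rate:  {metrics['llm_call_rate']:.1%}")        # 1.2%
```

Logging these three numbers on every run makes drift visible: a falling regex success rate usually means the input format has changed and the pattern needs attention.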

Best Practices

Start with regex even if imperfect — it provides a baseline to improve
Use confidence scoring to programmatically identify what needs LLM help
Use the cheapest LLM for validation (Haiku-class models suffice)
Never mutate parsed items — return new instances
TDD works well for parsers — test known patterns first, then edge cases
Log metrics (regex success rate, LLM call count) for pipeline health

Anti-Patterns

Sending all text to an LLM when regex handles 95%+ (expensive and slow)
Using regex for free-form, highly variable text (LLM is better)
Skipping confidence scoring (nothing then flags which items need LLM review)
Mutating parsed objects during validation
