
Agents That Actually Finish: Five Patterns From Production

Five LLM-agent patterns I keep reaching for when I need an agent to reliably complete real work — not just demo well.

4 min read · LLM · Agents · Automation · Python

Most agent demos are built to look impressive in a tweet. Most agents I build have to run unattended on a Tuesday afternoon and not page me. Different game.

Here are five patterns I keep reaching for. None are novel — that's the point. They're load-bearing precisely because they're boring.

1. A planner that emits structured tasks, not free-form prose

The single biggest reliability win is making the planner emit JSON, not narration. Free-form plans drift; structured ones can be validated, deduped, and replayed.

planner.py
from pydantic import BaseModel, Field
from typing import Literal
 
class Task(BaseModel):
    id: str
    kind: Literal["search", "fetch", "extract", "summarize"]
    args: dict
    depends_on: list[str] = Field(default_factory=list)
 
class Plan(BaseModel):
    tasks: list[Task]
 
PLANNER_PROMPT = """You are planning a research task.
Emit a JSON object matching the Plan schema. Do not narrate.
Each task should be doable in one tool call.
"""

The validator does double duty as a repair prompt: if Plan.model_validate_json() fails, feed the validation error straight back to the model and ask it to fix the JSON specifically. This recovers ~80% of malformed outputs in my pipelines.
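
A minimal sketch of that repair loop. Plan comes from planner.py above; complete stands in for whatever model call you use and isn't from the post.

repair.py
from typing import Callable
from pydantic import ValidationError

def plan_with_repair(
    complete: Callable[[str], str], prompt: str, max_repairs: int = 2
) -> Plan:
    raw = complete(prompt)
    for _ in range(max_repairs):
        try:
            return Plan.model_validate_json(raw)
        except ValidationError as err:
            # Feed the exact validation error back; ask for corrected JSON only.
            raw = complete(
                f"{prompt}\n\nYour JSON failed validation:\n{err}\n"
                "Return only the corrected JSON."
            )
    return Plan.model_validate_json(raw)  # final attempt; raises if still broken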

2. Idempotent tools, hashed by input

Every tool should return the same result for the same input — and the agent should know it. I wrap every tool in a thin cache keyed by a hash of the call args.

cached_tool.py
import hashlib, json
from functools import wraps
 
def idempotent(fn):
    cache: dict[str, object] = {}
    @wraps(fn)
    def wrapped(**kwargs):
        # Keyword-only by design: sorted kwargs give a stable, order-independent key.
        key = hashlib.sha256(
            json.dumps(kwargs, sort_keys=True).encode()
        ).hexdigest()
        if key not in cache:
            cache[key] = fn(**kwargs)
        return cache[key]
    return wrapped

This means a retry never doubles the cost, and parallel sub-agents don't trample each other when they happen to ask the same question.

Don't cache now(), randomness, or anything paid-per-call without a TTL. The cache is there to make tools behave deterministically, not to let you skip invalidation.
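
When a result does legitimately need to expire, the TTL variant is a small change. A sketch; idempotent_ttl is an illustrative name, not from the original.

cached_tool_ttl.py
import hashlib, json, time
from functools import wraps

def idempotent_ttl(ttl_seconds: float):
    def decorator(fn):
        cache: dict[str, tuple[float, object]] = {}
        @wraps(fn)
        def wrapped(**kwargs):
            key = hashlib.sha256(
                json.dumps(kwargs, sort_keys=True).encode()
            ).hexdigest()
            hit = cache.get(key)
            # Recompute when the entry is missing or older than the TTL.
            if hit is None or time.monotonic() - hit[0] > ttl_seconds:
                hit = (time.monotonic(), fn(**kwargs))
                cache[key] = hit
            return hit[1]
        return wrapped
    return decorator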

3. Sub-agents with explicit budgets

Whenever I split work across sub-agents, each one gets a budget and a hard deadline. The orchestrator reads the budgets, not the agent's optimism.

orchestrator.py
from dataclasses import dataclass
 
@dataclass
class Budget:
    max_tokens: int
    max_tool_calls: int
    deadline_seconds: float
 
async def run_subagent(task: Task, budget: Budget) -> Result:
    async with deadline(budget.deadline_seconds):
        return await agent_loop(task, budget)
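
The deadline() context manager in that sketch isn't stdlib. Here's one minimal way to fill it in, assuming Python 3.11+ for asyncio.timeout:

deadline.py
import asyncio
from contextlib import asynccontextmanager

@asynccontextmanager
async def deadline(seconds: float):
    # asyncio.timeout cancels the wrapped work and raises TimeoutError
    # once the wall-clock budget elapses.
    async with asyncio.timeout(seconds):
        yield

With this version the orchestrator catches TimeoutError and falls back to whatever partial output the sub-agent checkpointed; alternatively, agent_loop can watch the clock itself and return early.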

A sub-agent that hits its budget returns whatever it has. The orchestrator then decides whether to spawn a follow-up. This is the single reliability pattern I'd keep if I had to drop the rest.

4. Verifier as a separate model call

Don't ask the agent that produced the output to also grade it. Confirmation bias is real and shows up immediately as inflated self-scores. Use a second, smaller, cheaper model — and give it a different prompt.

verify.py
VERIFIER_PROMPT = """You are reviewing the output of a research agent.
Score on three axes (0-3 each):
- Coverage: Did it answer all parts of the question?
- Sourcing: Are claims tied to retrieved evidence?
- Hallucination: Are there assertions not in the evidence?
Return JSON: {coverage, sourcing, hallucination, notes}.
"""

The verifier doesn't need to be smart — it needs to be independent.
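
Wiring it up is one extra call. A sketch continuing verify.py; complete is a placeholder for your client for the smaller model, and Verdict just mirrors the prompt's schema.

verify.py
from typing import Callable
from pydantic import BaseModel, Field

class Verdict(BaseModel):
    coverage: int = Field(ge=0, le=3)
    sourcing: int = Field(ge=0, le=3)
    hallucination: int = Field(ge=0, le=3)
    notes: str

def verify(complete: Callable[[str], str], output: str, evidence: str) -> Verdict:
    # Independent model, independent prompt: the producer never grades itself.
    raw = complete(f"{VERIFIER_PROMPT}\nEVIDENCE:\n{evidence}\n\nOUTPUT:\n{output}")
    return Verdict.model_validate_json(raw)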

5. Logs you'd actually read

The last pattern is unglamorous: log every tool call, every model output, every plan revision, with timestamps and a trace ID. Make them easy to grep.

trace.py
import structlog
 
log = structlog.get_logger().bind(trace_id="run-2026-04-22-3a")
 
log.info("tool.call", tool="search", args={"q": "verse pure functions"})
log.info("tool.result", tool="search", n_results=12, ms=438)

Without this you cannot tell, three weeks from now, why the agent did the thing it did. With it, every weird run becomes a learning artifact.

What I left out

I deliberately didn't talk about retrievers, memory layers, or multi-modal inputs. Those are the parts everyone writes about — and they matter — but in my experience they're not where production agents actually fail.

Production agents fail because the loop drifts, the tools aren't deterministic, and nobody set a budget. Fix those three first. The rest is decoration.
