Agents That Actually Finish: Five Patterns From Production
Five LLM-agent patterns I keep reaching for when I need an agent to reliably complete real work — not just demo well.
Most agent demos are built to look impressive in a tweet. Most agents I build have to run unattended on a Tuesday afternoon and not page me. Different game.
Here are five patterns I keep reaching for. None are novel — that's the point. They're load-bearing precisely because they're boring.
1. A planner that emits structured tasks, not free-form prose
The single biggest reliability win is making the planner emit JSON, not narration. Free-form plans drift; structured ones can be validated, deduped, and replayed.
from pydantic import BaseModel, Field
from typing import Literal
class Task(BaseModel):
    id: str
    kind: Literal["search", "fetch", "extract", "summarize"]
    args: dict
    depends_on: list[str] = Field(default_factory=list)

class Plan(BaseModel):
    tasks: list[Task]
PLANNER_PROMPT = """You are planning a research task.
Emit a JSON object matching the Plan schema. Do not narrate.
Each task should be doable in one tool call.
"""The validator does double duty as a fallback prompt: if Plan.model_validate_json() fails, you can feed the error back to the model and ask it to fix the JSON specifically. This recovers ~80% of malformed outputs in my pipelines.
2. Idempotent tools, hashed by input
Every tool should return the same result for the same input — and the agent should know it. I wrap every tool in a thin cache keyed by a hash of the call args.
import hashlib, json
from functools import wraps
def idempotent(fn):
    cache: dict[str, object] = {}

    @wraps(fn)
    def wrapped(**kwargs):
        key = hashlib.sha256(
            json.dumps(kwargs, sort_keys=True).encode()
        ).hexdigest()
        if key not in cache:
            cache[key] = fn(**kwargs)
        return cache[key]

    return wrapped

This means a retry never doubles the cost, and parallel sub-agents don't trample each other when they happen to ask the same question.
Don't cache now(), randomness, or anything paid-per-call without a TTL. The cache is for shape — not for skipping invalidation.
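For the paid-per-call case, a TTL variant is a small change: store a timestamp next to each entry and recompute once it goes stale. A minimal sketch (the `ttl_seconds` knob and `time.monotonic` expiry are my choices, not from the article):

```python
import hashlib
import json
import time
from functools import wraps

def idempotent_ttl(ttl_seconds: float):
    """Like the plain cache, but entries expire after ttl_seconds."""
    def decorator(fn):
        cache: dict[str, tuple[float, object]] = {}

        @wraps(fn)
        def wrapped(**kwargs):
            key = hashlib.sha256(
                json.dumps(kwargs, sort_keys=True).encode()
            ).hexdigest()
            hit = cache.get(key)
            if hit is not None and time.monotonic() - hit[0] < ttl_seconds:
                return hit[1]          # fresh enough: reuse
            result = fn(**kwargs)      # stale or missing: recompute
            cache[key] = (time.monotonic(), result)
            return result

        return wrapped
    return decorator
```

`time.monotonic()` rather than `time.time()` so the expiry can't be confused by wall-clock adjustments.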
3. Sub-agents with explicit budgets
Whenever I split work across sub-agents, each one gets a budget and a hard deadline. The orchestrator reads the budgets, not the agent's optimism.
from dataclasses import dataclass
@dataclass
class Budget:
    max_tokens: int
    max_tool_calls: int
    deadline_seconds: float

async def run_subagent(task: Task, budget: Budget) -> Result:
    async with deadline(budget.deadline_seconds):
        return await agent_loop(task, budget)

A sub-agent that hits its budget returns whatever it has. The orchestrator then decides whether to spawn a follow-up. This is the single reliability pattern I'd keep if I had to drop the rest.
4. Verifier as a separate model call
Don't ask the agent that produced the output to also grade it. Confirmation bias is real and shows up immediately as inflated self-scores. Use a second, smaller, cheaper model — and give it a different prompt.
VERIFIER_PROMPT = """You are reviewing the output of a research agent.
Score on three axes (0-3 each):
- Coverage: Did it answer all parts of the question?
- Sourcing: Are claims tied to retrieved evidence?
- Hallucination: Are there assertions not in the evidence?
Return JSON: {coverage, sourcing, hallucination, notes}.
"""The verifier doesn't need to be smart — it needs to be independent.
5. Logs you'd actually read
The last pattern is unglamorous: log every tool call, every model output, every plan revision, with timestamps and a trace ID. Make them easy to grep.
import structlog
log = structlog.get_logger().bind(trace_id="run-2026-04-22-3a")
log.info("tool.call", tool="search", args={"q": "verse pure functions"})
log.info("tool.result", tool="search", n_results=12, ms=438)Without this you cannot tell, three weeks from now, why the agent did the thing it did. With it, every weird run becomes a learning artifact.
What I left out
I deliberately didn't talk about retrievers, memory layers, or multi-modal inputs. Those are the parts everyone writes about — and they matter — but in my experience they're not where production agents actually fail.
Production agents fail because the loop drifts, the tools aren't deterministic, and nobody set a budget. Fix those three first. The rest is decoration.