They fail because the system around the model was poorly designed.
After building production agent systems, one pattern keeps showing up: 4 decisions made before writing a single line of code determine ~80% of your reliability ceiling.
1. Define Scope (Ruthlessly)
“Customer support agent” sounds like a scope. It isn’t.
Billing disputes, technical troubleshooting, returns, product questions — each has different latency tolerances, compliance rules, and failure modes.
Before touching code, formally answer:
The Single Responsibility Principle applies to agents. Narrow scope raises the performance ceiling.
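One way to make a scope formal is to write it down as data rather than as a job title. The sketch below is an assumption, not a prescribed format: the class name `AgentScope`, its fields, and every value in the `billing` example are hypothetical. The fields mirror the dimensions named above (latency tolerance, compliance rules, failure modes).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentScope:
    """Hypothetical written scope for one narrowly defined agent."""
    name: str
    handles: tuple[str, ...]              # request types this agent owns
    excludes: tuple[str, ...]             # request types it must hand off
    max_latency_seconds: float            # latency tolerance for this scope
    compliance_rules: tuple[str, ...]     # rules that apply to this scope
    known_failure_modes: tuple[str, ...]  # failures worth testing for

    def accepts(self, request_type: str) -> bool:
        # A request is in scope only if it is explicitly listed and not excluded.
        return request_type in self.handles and request_type not in self.excludes


# Illustrative values only; a real scope comes from your own requirements.
billing = AgentScope(
    name="billing_disputes",
    handles=("billing_dispute",),
    excludes=("technical_troubleshooting", "returns", "product_question"),
    max_latency_seconds=30.0,
    compliance_rules=("no card numbers in responses",),
    known_failure_modes=("refund issued twice",),
)
```

With a scope like this, `billing.accepts("returns")` is false, so the request gets handed off instead of being answered badly.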
2. Your System Prompt Is the Agent’s OS
Vague system prompts are the #1 source of non-deterministic behavior in production.
A production-grade system prompt has 4 parts:
Role — Specific identity with explicit exclusions. “You are a Kafka reliability engineer. You do not answer general programming questions.” The exclusions matter as much as the description.
Methodology — Tell it how to reason, not just what to produce. A numbered reasoning protocol produces far more consistent output than “analyze this and suggest improvements.”
Guardrails — Hard prohibitions. For a database agent: never execute DELETE or DROP. Never include PII in responses. And verify these in your eval harness — a guardrail that’s never tested is just a wish.
Output format — Define the exact schema, including what it returns when it can’t complete the task. A silent null is always worse than a structured error payload.
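The four parts can be assembled and checked in code. This is a minimal sketch under stated assumptions: `build_system_prompt`, `violates_sql_guardrail`, and `error_payload` are hypothetical helper names, the section labels are arbitrary, and the regex is a deliberately crude check that a real eval harness would replace with something stricter.

```python
import json
import re


def build_system_prompt(role, methodology, guardrails, output_format):
    """Assemble the four parts into one prompt with labeled sections."""
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(methodology, 1))
    rules = "\n".join(f"- {rule}" for rule in guardrails)
    return (
        f"ROLE\n{role}\n\n"
        f"METHODOLOGY\n{steps}\n\n"
        f"GUARDRAILS\n{rules}\n\n"
        f"OUTPUT FORMAT\n{output_format}"
    )


# Crude guardrail check for an eval harness: flags DELETE or DROP as whole words.
FORBIDDEN_SQL = re.compile(r"\b(DELETE|DROP)\b", re.IGNORECASE)


def violates_sql_guardrail(agent_output: str) -> bool:
    return bool(FORBIDDEN_SQL.search(agent_output))


def error_payload(reason: str) -> str:
    """Structured error to return instead of a silent null."""
    return json.dumps({"status": "error", "data": None, "error": reason})


prompt = build_system_prompt(
    role=("You are a Kafka reliability engineer. "
          "You do not answer general programming questions."),
    methodology=["Restate the problem.", "List the evidence.", "Recommend one action."],
    guardrails=["Never execute DELETE or DROP.", "Never include PII in responses."],
    output_format='JSON object with keys "status", "data", "error".',
)
```

Running `violates_sql_guardrail` over recorded agent outputs is what turns the guardrail from a sentence in a prompt into a tested property.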
3. The Most Powerful Model Is Not Always the Right Model
Using a frontier model for every subtask is like spinning up a GPU cluster to compute a hash.
At scale, the difference between frontier and efficient model pricing (often 10–50x) is the delta between a profitable product and a money-losing one.
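To make the pricing gap concrete, here is a rough sketch of routing subtasks to model tiers and estimating the monthly bill. Everything specific in it is an assumption: the tier names, the prices (chosen to give a 20x gap, inside the 10–50x range above), the subtask names, and the traffic numbers are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    usd_per_million_tokens: float  # made-up price, not a real quote


EFFICIENT = ModelTier("efficient-model", 0.50)
FRONTIER = ModelTier("frontier-model", 10.00)

# Hypothetical mapping from subtask to the cheapest tier that handles it well.
ROUTES = {
    "classify_intent": EFFICIENT,
    "extract_fields": EFFICIENT,
    "multi_step_planning": FRONTIER,
}


def pick_model(subtask: str) -> ModelTier:
    # Unknown subtasks fall back to the frontier tier as the conservative default.
    return ROUTES.get(subtask, FRONTIER)


def monthly_cost(tier: ModelTier, tokens_per_request: int, requests_per_month: int) -> float:
    total_tokens = tokens_per_request * requests_per_month
    return total_tokens / 1_000_000 * tier.usd_per_million_tokens


# Assumed workload: 2,000 tokens per request, 1,000,000 requests per month.
cheap = monthly_cost(EFFICIENT, 2_000, 1_000_000)    # 1000.0
pricey = monthly_cost(FRONTIER, 2_000, 1_000_000)    # 20000.0
```

Under these assumed numbers, the same workload costs $1,000 a month on the efficient tier and $20,000 on the frontier tier.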
The framework:
4. Tools Are What Make It an Agent
Without tools, an LLM is a text generation engine. Tools are what make it an agent.
The core design principle: atomicity. Each tool does exactly one thing.
A tool called execute_database_operation with an action parameter of read/write/delete is an antipattern. Three separate tools with unambiguous schemas is the right design.
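A sketch of the three-tool version, assuming a generic JSON-Schema-style tool definition rather than any specific vendor's format. The tool names `read_record`, `write_record`, and `delete_record`, and the helpers `make_tool` and `allowed_tools`, are hypothetical.

```python
def make_tool(name, description, properties, required):
    """Build one tool definition with a closed, unambiguous parameter schema."""
    return {
        "name": name,
        "description": description,
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": required,
            "additionalProperties": False,  # reject parameters not listed here
        },
    }


READ = make_tool(
    "read_record",
    "Fetch one record by id. Never modifies data.",
    {"record_id": {"type": "string"}},
    ["record_id"],
)
WRITE = make_tool(
    "write_record",
    "Update fields on one existing record. Do not call to create or delete.",
    {"record_id": {"type": "string"}, "fields": {"type": "object"}},
    ["record_id", "fields"],
)
DELETE = make_tool(
    "delete_record",
    "Permanently delete one record by id.",
    {"record_id": {"type": "string"}},
    ["record_id"],
)

TOOLS = {tool["name"]: tool for tool in (READ, WRITE, DELETE)}


def allowed_tools(permissions):
    """Expose only the tools a given agent is permitted to see."""
    return [TOOLS[name] for name in sorted(permissions) if name in TOOLS]


read_only = allowed_tools({"read_record"})
```

Because each tool does one thing, a read-only agent can simply never be shown `delete_record`. With a single tool and an `action` parameter, that restriction would have to live in the prompt.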
What makes a tool schema effective:
query_customer_record) and specify when the tool should not be called.
{status, data, error}. A tool that returns different shapes on success vs. failure forces the model to guess, which degrades reliability.
One more thing: adopt MCP (Model Context Protocol). It standardizes external service connections and enforces access control at the integration layer, not just the prompt. Prompt guardrails are probabilistic. Integration-layer controls are deterministic. Use both.
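Both ideas, one return shape and a deterministic check outside the prompt, fit in a few lines. This is a plain-Python illustration and does not use an MCP SDK. `ToolGateway`, `ok`, `fail`, and `delete_customer_record` are hypothetical names, and the in-memory `RECORDS` dict stands in for a real data store.

```python
def ok(data):
    return {"status": "ok", "data": data, "error": None}


def fail(message):
    return {"status": "error", "data": None, "error": message}


RECORDS = {"c-1": {"name": "Ada"}}  # stand-in for a real data store


def query_customer_record(customer_id: str) -> dict:
    if customer_id not in RECORDS:
        return fail(f"no customer with id {customer_id}")
    return ok(RECORDS[customer_id])


def delete_customer_record(customer_id: str) -> dict:
    RECORDS.pop(customer_id, None)
    return ok(None)


class ToolGateway:
    """Integration-layer check: runs regardless of what the prompt says."""

    def __init__(self, tools, allowed):
        self._tools = dict(tools)
        self._allowed = frozenset(allowed)

    def call(self, name, **kwargs):
        if name not in self._allowed:
            return fail(f"tool {name} is not permitted")
        if name not in self._tools:
            return fail(f"unknown tool {name}")
        try:
            return self._tools[name](**kwargs)
        except Exception as exc:  # failures keep the same shape as successes
            return fail(str(exc))


gateway = ToolGateway(
    tools={
        "query_customer_record": query_customer_record,
        "delete_customer_record": delete_customer_record,
    },
    allowed={"query_customer_record"},
)

found = gateway.call("query_customer_record", customer_id="c-1")
missing = gateway.call("query_customer_record", customer_id="c-9")
blocked = gateway.call("delete_customer_record", customer_id="c-1")
```

All three results carry the same three keys, and the blocked delete never reaches the data store no matter how the model was prompted.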
The through-line across all four:
A well-scoped agent on a cheaper model with a tight system prompt and atomic tool schemas will outperform a powerful model wrapped in a vague prompt — every time.