The goal of any well-designed agent is simple: supply the LLM with instructions, context, and tools that are sufficient for a reasonable intelligence to understand what's being asked, reason about it, and respond correctly. When an agent fails, the first question isn't "what did the LLM do wrong?" It's "where did the system break down?"

But before we can diagnose failure, we need to be clear about what success requires.

The Three Essentials

Every agent call — every single interaction where an LLM is expected to do useful work — depends on three inputs:

Instructions. What you're asking the LLM to do. The task definition, the constraints, the expected output format. Instructions tell the model what role it's playing and what a good result looks like.

Context. The information the LLM needs to reason about. Documents, data, conversation history, prior results — whatever the model needs to have in front of it to make informed decisions. An LLM can only work with what's in its context window. Everything else might as well not exist.

Tools. The capabilities the LLM can act through. Search functions, APIs, databases, calculation engines — the mechanisms by which the agent interacts with the world beyond generating text. Tools extend what the LLM can do from "reason and write" to "reason, write, and act."
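As a concrete sketch, the three essentials can be modeled as the inputs to a single agent call. Everything here — the `AgentCall` name, the prompt layout — is illustrative, not a real framework API:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class AgentCall:
    """The three inputs every agent call depends on (illustrative sketch)."""
    instructions: str                                         # the task, constraints, output format
    context: list[str] = field(default_factory=list)          # documents, data, prior results
    tools: dict[str, Callable] = field(default_factory=dict)  # capabilities the model can act through

    def to_prompt(self) -> str:
        # Only what lands in this string (plus the tool schemas) exists for the model.
        tool_names = ", ".join(self.tools) or "none"
        return (
            f"Instructions:\n{self.instructions}\n\n"
            "Context:\n" + "\n---\n".join(self.context)
            + f"\n\nAvailable tools: {tool_names}"
        )

call = AgentCall(
    instructions="Summarize the key risks in the attached contract.",
    context=["<contract text>"],
    tools={"search_clauses": lambda q: []},  # placeholder tool, not a real API
)
prompt = call.to_prompt()
```

The sketch makes the context-window point mechanical: anything not assembled into that string, and any capability not registered in `tools`, might as well not exist.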

The Cogency Requirement

Having all three isn't enough. They need to be cogent — and cogency is more demanding than it first appears.

Cogent inputs have five properties:

  1. Coherent. The inputs form a clear, unified picture: unambiguous, internally consistent, plausible against world knowledge.
  2. Correct. The instructions, context, and tools are right for the actual goal.
  3. Complete. The specification covers the full task.
  4. Sufficient. There is enough information to get the right answer in this specific case.
  5. Dense. The signal-to-noise ratio is high enough for the model to find what matters.

These form a progressive chain. Each property assumes the previous ones hold. Coherence comes first because if the inputs are ambiguous or contradict each other, "correct" doesn't even mean anything — correct relative to which interpretation? Which part? You have to know the inputs form a clear, unified picture before you can ask whether that picture is right. From there: Is it correct? Is it complete? Is it sufficient? Is it dense enough? Each is a finer filter.

When all five properties hold, you've created the conditions for success. The LLM has a clear task, the information to reason about it, and the capabilities to act on its reasoning. Most of the time, that's enough.

Why This Matters More for Agents Than Humans

A human analyst given the same messy inputs would notice. They'd push back: "What exactly do you want here?" "I don't have access to X." "These two requirements contradict each other." A human drives toward cogency before doing the work.

An LLM doesn't. It takes whatever you gave it and produces output. It doesn't have a threshold for "this doesn't make sense, I should stop." It satisfices on whatever cogency it can find, even when a human would recognize the inputs as inadequate.

This is worth understanding mechanistically. An LLM isn't reasoning about truth. It's finding the strongest pattern in whatever you gave it. If you gave it a cogent picture, it finds the right pattern. If you gave it a mess, it finds the best pattern available in the mess — which looks fluent, sounds confident, and is wrong. The output isn't random garbage. It's a perfectly reasonable response to the wrong inputs. That's what makes these failures so hard to spot: they look like the model made a mistake, when actually the model did exactly what it does — you just conditioned it on the wrong signal.

The cogency requirement exists because the LLM won't create it for itself. With a human, you can be sloppy and they'll compensate. With an agent, whatever you put in comes out the other side — transformed into fluent, confident output regardless of whether the inputs warranted that confidence.

Specification Failures

When the inputs aren't cogent, the agent will fail. This is garbage in, garbage out — but it's more nuanced than that phrase suggests, because the garbage comes in five distinct flavors, each with different symptoms and different detectability. The five properties of cogency give us five ways the specification can break down.

Coherence Failure

The first thing to check, because if the inputs don't form a clear, unified picture, nothing else matters. Coherence fails in three ways: the inputs are ambiguous, they contradict each other, or they don't hold up against world knowledge.

Ambiguity is the most common and least noticed coherence failure. The inputs aren't contradictory — they're just unclear enough that the model has to choose an interpretation. Vague instructions that could mean two different things. Terms that are used loosely. Implicit assumptions that are obvious to the person writing the prompt but invisible to the model. The model doesn't flag the ambiguity. It picks an interpretation — usually the one that matches the strongest pattern in its training — and proceeds with full confidence. You get fluent, assured output that answered the wrong question, and nothing in the output signals that a choice was made.

Internal consistency is about whether the inputs align with each other. The instructions contradict themselves. The context doesn't match the task. The tools don't fit what the instructions describe. Step 3 depends on something step 2 doesn't produce. These show up in recognizable ways: the output is plausible but solving the wrong problem, behavior varies wildly with small prompt tweaks, tool usage is inconsistent or irrelevant. The inputs don't tell a consistent story, so the model latched onto whatever pattern was strongest.

Common flavors of internal inconsistency:

  - Within the instructions: "be concise" and "be exhaustive" in the same prompt, or a step that depends on output an earlier step doesn't produce.
  - Between instructions and context: "summarize X" when X isn't in the context.
  - Between instructions and tools: the agent is told to fetch data, but no retrieval tool is available.

External consistency (plausibility) is about whether the inputs align with what should be plausible given world knowledge. The data shows revenue tripling year over year in a declining market. A tool returns results that don't match known patterns. The context describes a regulatory framework that was replaced two years ago. A human domain expert would raise an eyebrow. The LLM takes it at face value.

External consistency failures are harder to catch because LLMs are much better at spotting internal contradictions ("these two instructions conflict") than external implausibility ("this revenue data doesn't make sense for this industry"). The model lacks the skepticism reflex that domain experience provides — it doesn't cross-check inputs against what should be plausible. It just finds the strongest pattern in whatever it was given, plausible or not.
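Because the model won't supply this skepticism, it has to be engineered in. A minimal plausibility gate, run on inputs before they reach the model, might look like the following. The field names and thresholds are illustrative assumptions, not real rules:

```python
# A minimal plausibility gate run on inputs *before* they reach the model.
# The bounds below (max plausible YoY revenue growth, a regulatory cutoff
# year) are hypothetical examples, not real domain thresholds.

def check_plausibility(record: dict) -> list[str]:
    """Return human-readable warnings for values a domain expert would question."""
    warnings = []
    growth = record.get("yoy_revenue_growth")
    if growth is not None and growth > 1.0:  # >100% growth: possible, but worth a human look
        warnings.append(f"Revenue growth of {growth:.0%} is unusually high; verify the source data.")
    year = record.get("regulation_year")
    if year is not None and year < 2023:  # hypothetical cutoff for a superseded framework
        warnings.append(f"Regulatory reference from {year} may be outdated.")
    return warnings

flags = check_plausibility({"yoy_revenue_growth": 2.0, "regulation_year": 2019})
# Surface the flags to a human instead of silently passing the data through.
```

The gate doesn't decide anything; it routes implausible inputs to a person with the domain experience the model lacks.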

Correctness Failure

Once you know the inputs are coherent — unambiguous, internally consistent, externally plausible — you can ask: are they right? The inputs might form a perfectly clear picture and still be wrong — clear and wrong.

The instructions might encode the wrong approach, the wrong criteria, or the wrong objective. The context might contain outdated or inaccurate information. The tools might connect to the wrong data source.

You spot correctness failures when the output is technically sound but answers the wrong question. The model did exactly what it was told. It was just told the wrong thing.

Completeness Failure

The inputs are coherent and correct as far as they go, but the specification itself has gaps — pieces of the task that aren't covered. This is distinct from sufficiency, which we'll get to. Completeness is about whether the specification is whole as a system. Does it cover the full task? Are all the necessary tools available? Do the instructions address every phase of the work?

Sometimes completeness failures are visible in execution. Think of furniture assembly instructions that stop before all the pieces are together. Each step made sense. The agent followed them all. But the output trails off, covers only part of what was needed, or produces something that's visibly thin in places. You can see the seams.

A knowledge work example: "Summarize the key findings from these three research papers and compare their methodologies." You give the agent three papers and clear instructions. But one of the papers is a response to a fourth paper that isn't included. The agent summarizes all three faithfully. The comparison looks thorough. But the methodology critique in paper two is incomprehensible without the paper it's responding to — and the output reads as shallow or confused on that section.

But completeness failures aren't always visible. Consider instructions that work reliably every day — same task, same specification, consistent results. Then one day the output is wrong. The instructions didn't fail. They just didn't account for a special condition that arose: an unusual clause in this particular contract, a regulatory change that shifted the criteria, an edge case in the data. The specification was always incomplete in the sense that it never covered this scenario — but you couldn't tell, because the scenario hadn't come up yet. The gap was latent until reality exposed it.

The distinction from sufficiency: completeness is about whether the specification covers the task. Sufficiency is about whether the information and context provided are enough to execute it correctly in this specific case. You can have a complete specification — one that covers every phase of the work — that still isn't sufficient because the context is missing a critical piece of information the specification didn't know to ask for.

Sufficiency Failure

Where completeness asks "does the specification cover the task?", sufficiency asks "given this specific situation, is there enough to get the right answer?" A specification can be complete — covering every phase of the work — and still not be sufficient for the case at hand.

This is the most dangerous specification failure because it's truly invisible. The inputs are coherent, correct, and complete. Everything presents as cogent. The agent executes the entire task cleanly. Nothing signals a gap — not in the inputs, and not in the output.

Consider a contract risk analysis. The instructions are clear: evaluate financial exposure, regulatory compliance, and termination clauses. The full contract is in context. The tools are appropriate. The specification is complete — it covers the whole task. The agent produces a polished, thorough risk analysis. But this contract has an unusual indemnification structure that's the actual landmine, and indemnification wasn't in the criteria. The output is well-structured and misses the most important risk in the document. The specification was complete. It just wasn't sufficient for this contract.

Or: three research papers, clear comparison instructions, complete execution — but the person who asked for the analysis needed it because they're evaluating whether to fund a follow-up study, and the instructions said nothing about assessing feasibility or funding implications. The output is a perfect literature comparison that doesn't address the actual decision at hand. Complete specification, insufficient for the purpose.

Sufficiency failures are insidious because the LLM has no way to know anything is missing. The inputs look whole. The model proceeds confidently. You only catch these if someone with domain knowledge reviews the output and asks the right question — "why didn't you look at indemnification?" — or if the output fails to serve its actual purpose downstream.

This is why context engineering and domain expertise in prompt design are among the highest-leverage activities in agent development. The model can only work with what it's given. Ensuring it's given enough requires understanding the domain well enough to know what "enough" means — and that understanding can't be delegated to the model itself.

Density Failure

The final specification property, and the most subtle. The inputs are coherent, correct, complete, and sufficient — the right information is actually in there. But it's buried in so much noise that the model can't find the signal.

This happens more than people realize. You stuff the context window with every potentially relevant document. The critical paragraph is on page 47 of a 200-page dump. The model has what it needs — technically — but the signal-to-noise ratio is so low that the model latches onto more prominent patterns instead. The answer was in the context. The model just couldn't find it.

Or consider instructions that are correct and complete but padded with so many caveats, edge-case handlers, and meta-instructions that the core task gets lost. The model spends its attention budget on the noise and gives shallow treatment to the actual objective.

Density failures are distinct from sufficiency failures. In a sufficiency failure, the information isn't there. In a density failure, it is — but it's drowned. The fix isn't adding more information. It's removing noise. Curating context so that everything present is relevant and the important things are prominent. High density means a clean context where the model's pattern-matching works in your favor, not against it.
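Curation can be sketched mechanically: score candidate chunks against the task, keep only the top few, most relevant first. The keyword-overlap scorer below is a deliberately crude stand-in; a real system would use embeddings or a reranker:

```python
# A minimal context-curation sketch: score chunks against the task and keep
# only the most relevant, most relevant first. Keyword overlap is a stand-in
# for a real relevance model (embeddings, reranker).

def score(task: str, chunk: str) -> float:
    task_words = set(task.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(task_words & chunk_words) / max(len(task_words), 1)

def curate(task: str, chunks: list[str], keep: int = 3) -> list[str]:
    ranked = sorted(chunks, key=lambda c: score(task, c), reverse=True)
    return ranked[:keep]  # everything that survives is relevant and prominent

context = curate(
    "compare the termination clauses",
    ["boilerplate header text",
     "termination clause: either party may exit with 30 days notice",
     "payment schedule details"],
    keep=2,
)
```

The design point is the `[:keep]` truncation: high density comes from what you leave out, not what you add.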

Runtime Failures

Even with perfectly cogent inputs, things can go wrong at execution time. These failures live outside the specification.

Tool Failure

The LLM selected the right tool and invoked it correctly, but the tool didn't return what it was supposed to. Wrong data. An error. A timeout. A malformed response.

This isn't an LLM failure at all. It's an infrastructure failure that happens to manifest through the agent. But the consequences can be severe, because of how the LLM responds to tool failure.

The telltale sign: a sudden reasoning derailment immediately following a tool call. The model was tracking fine, called a tool, and then went off the rails. What happened was garbage in, coherent garbage out — the LLM incorporated the bad tool result and kept reasoning fluently from a corrupted foundation.

Tool failures come in three flavors: hard failures (timeouts, exceptions — at least these are visible), soft failures (the tool returns a schema-compliant response with wrong or partial data — it looks valid at the structural level but the content is wrong), and schema failures (malformed output the LLM can't parse correctly). Soft failures are the most dangerous because the response passes every structural check — the agent has no reason to question it and keeps going as if everything worked.

A robust agent should detect and recover from tool failures: retry with backoff, try an alternative approach, or surface the failure to the user. But that recovery behavior has to be explicitly designed. It won't happen by default.
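That explicit design can be as small as a wrapper around every tool call: retry hard failures with backoff, run a caller-supplied validation to catch soft failures, and raise loudly on anything unrecoverable. A sketch, with all names hypothetical:

```python
import time

# A sketch of explicit tool-failure recovery: retry hard failures with
# exponential backoff, validate responses to catch soft failures
# (schema-valid but wrong content), and surface anything unrecoverable.
# `validate` is a caller-supplied content check.

class ToolError(Exception):
    pass

def call_tool(tool, args, validate, retries=3, base_delay=1.0):
    last_error = None
    for attempt in range(retries):
        try:
            result = tool(**args)                  # hard failures raise here
        except Exception as exc:
            last_error = exc
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
            continue
        if not validate(result):                   # soft failure: looks valid, content is wrong
            raise ToolError(f"tool returned implausible result: {result!r}")
        return result
    raise ToolError(f"tool failed after {retries} attempts") from last_error
```

The `validate` hook is the important part: it is the only defense against soft failures, because nothing structural distinguishes a wrong-but-well-formed response from a right one.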

Genuine LLM Failure

Everything was set up correctly — cogent inputs, working tools — and the LLM still went off the rails. Wrong conclusion. Wrong tool selection. Non-sequitur output.

This happens. It's the residual failure rate after the system does everything right. It's rarely truly random — it tends to show up in edge-case reasoning, long-chain dependency breakdowns, or attention misallocation on complex inputs. But it's non-deterministic and hard to reproduce reliably, which makes it the least actionable category.

There's a strong tendency to land here first — to assume the model is the problem. It's actually the least common category. You can't prevent it entirely. Better models help at the margins, but the failure rate is never zero. The right response isn't model-blaming or model-swapping — it's quality gates. Verify outputs at critical junctions. Catch the failures before they propagate. Accept that some percentage of individual LLM calls will produce bad output, and design the system to be resilient to that.
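A quality gate can be very simple: a check at a critical junction that refuses to pass bad output downstream. The section-presence check below is illustrative; real gates would be task-specific (schema validation, citation checks, a second-model review):

```python
# A quality-gate sketch: verify an LLM step's output at a critical junction
# before it propagates. The required-sections check is an illustrative
# example of a gate, not a general-purpose one.

def gate(output: str, required_sections: list[str]) -> str:
    missing = [s for s in required_sections if s.lower() not in output.lower()]
    if missing:
        # Don't pass bad output downstream: fail loudly so the step can be re-run.
        raise ValueError(f"quality gate failed; missing sections: {missing}")
    return output

draft = "Financial exposure: low. Termination: standard 30-day clause."
checked = gate(draft, ["financial exposure", "termination"])
```

Failing loudly is the point: a gate that logs and continues still lets the bad output propagate.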

The Diagnostic Sequence

When something goes wrong with an agent, work through the specification failures first, then the runtime failures:

  1. Are the inputs coherent? Are they unambiguous, or is the model forced to choose an interpretation? Are they internally consistent: do instructions, context, and tools align with each other? Are they externally plausible given world knowledge? If the inputs are vague, contradictory, or don't hold up to scrutiny, nothing else matters.
  2. Are the inputs correct? Are the instructions, context, and tools right for the actual goal? Or is the agent executing perfectly against the wrong target?
  3. Are the inputs complete? Can the agent make a coherent attempt from start to finish? Or does execution reveal gaps — output that trails off, feels thin, or fails to cover the task?
  4. Are the inputs sufficient? Even with coherent, correct, complete inputs — is there enough information for the right outcome in this specific case?
  5. Are the inputs dense enough? Is the signal-to-noise ratio high enough for the model to find what matters? Or is the right information buried in noise?
  6. Did the tools work? Did every tool call return the expected result? Or did a failure upstream corrupt the reasoning downstream?
  7. If all six were fine — this was a genuine LLM failure. Note it, add a quality gate at that step, and move on.

Most debugging resolves at steps 1 through 5. That's where most of the leverage is.
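The sequence above can be written down as data: an ordered checklist, specification checks first, runtime checks after, where diagnosis means answering each question in order and stopping at the first "no". A sketch (the names and the yes/no interface are assumptions for illustration):

```python
# The diagnostic sequence as an ordered checklist: specification checks
# first, runtime checks after. Diagnosis = answer each question in order,
# stop at the first "no"; if all six pass, it's a genuine LLM failure.

DIAGNOSTIC_SEQUENCE = [
    ("coherence",    "Are the inputs unambiguous, internally consistent, and plausible?"),
    ("correctness",  "Are instructions, context, and tools right for the actual goal?"),
    ("completeness", "Does the specification cover the full task?"),
    ("sufficiency",  "Is there enough information for this specific case?"),
    ("density",      "Is the signal-to-noise ratio high enough to find what matters?"),
    ("tools",        "Did every tool call return the expected result?"),
]

def diagnose(answers: dict[str, bool]) -> str:
    """Return the first failing category, or the residual if all six pass."""
    for name, _question in DIAGNOSTIC_SEQUENCE:
        if not answers.get(name, False):
            return name
    return "genuine LLM failure"

verdict = diagnose({name: True for name, _ in DIAGNOSTIC_SEQUENCE})
```

Encoding the order matters more than the code: it prevents the reflex of jumping straight to "the model failed" before the five specification checks have been exhausted.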

The Design Implication

This framework implies a key architectural principle: agent reliability is dominated by system design, not model capability.

In practice, upgrading to a better model might reduce failures by a modest percentage. Fixing specification issues — coherence, correctness, completeness, sufficiency, density — routinely cuts failure rates in half or more. The gap between a well-designed agent on a good model and a poorly designed agent on a great model is enormous — and it favors the designer every time.

The instinct to blame the model is natural but counterproductive. The diagnostic sequence forces you to exhaust the fixable, high-leverage causes before concluding you've hit the residual. And when you do hit the residual, the answer is always the same: don't trust, verify.

Appendix: Failure Mode Reference

Specification Failures

Coherence (ambiguity)
  Applies to: instructions / context
  Description: The inputs aren't contradictory but are unclear enough that the model must choose an interpretation. It picks one silently and proceeds with full confidence.
  Example: "Analyze the results" — which results? By what criteria? The model picks the most pattern-matching interpretation and never flags the ambiguity.

Coherence (internal consistency)
  Applies to: instructions
  Description: Instructions contradict each other or the logic doesn't hold step to step. Check first — if the inputs contradict, "correct" doesn't mean anything.
  Example: "Be concise" and "be exhaustive" in the same prompt. Step 3 depends on output that step 2 doesn't produce.

  Applies to: across essentials
  Description: Instructions, context, and tools don't align with each other.
  Example: "Summarize X" but X isn't in the context. Told to fetch data but no retrieval tool available.

Coherence (external / plausibility)
  Applies to: context / inputs
  Description: The inputs don't align with what's plausible given world knowledge. A human expert would be skeptical. The LLM takes it at face value.
  Example: Financial data shows revenue tripling year over year in a declining market. The agent produces a glowing analysis without questioning the data.

Correctness
  Applies to: instructions
  Description: Wrong task, wrong criteria, wrong objective. The agent executes perfectly against the wrong target.
  Example: Instructions say "evaluate vendor pricing" when the actual goal is evaluating vendor reliability. Thorough pricing analysis, useless for the decision.

  Applies to: context
  Description: Wrong data, outdated information, incorrect reference material.
  Example: Agent analyzes last quarter's financials when this quarter's are available. Output is accurate to the wrong data.

  Applies to: tools
  Description: Wrong tool for the job — a tool that technically works but produces unreliable results for this use case.
  Example: Using a generic web search tool to find business reviews instead of the Google Places API. Results are noisy, unstructured, and unreliable.

Completeness — does the specification cover the task? Distinct from sufficiency: completeness is about whether the spec is whole as a system, not whether it's enough for a specific case.
  Applies to: instructions
  Description: Instructions that don't cover the full task. May be visible (output trails off) or latent (works fine until a special condition arises the spec never accounted for).
  Example: Instructions that work reliably every day, then fail when an unusual contract clause appears that the spec never addressed. The gap was latent until reality exposed it.

  Applies to: context
  Description: Missing documents, data, or reference material needed for a coherent attempt.
  Example: Three research papers provided for comparison, but one is a response to a fourth paper not included. The methodology critique is incomprehensible without it.

  Applies to: tools
  Description: A necessary capability is missing entirely. The agent can't execute a phase of the task.
  Example: Instructions ask to retrieve and analyze current market data, but no data retrieval tool is provided. The agent improvises from its training data instead.

Sufficiency — given this specific case, is there enough? Distinct from completeness: a complete spec can be insufficient when reality presents something the spec didn't anticipate.
  Applies to: instructions
  Description: Instructions are complete but don't capture what this specific case requires.
  Example: Literature comparison instructions are thorough, but the requester needs a funding feasibility assessment. Perfect comparison, wrong deliverable for the purpose.

  Applies to: context
  Description: All requested context is present, but critical information for this case was never requested.
  Example: Contract risk analysis covers financial exposure, compliance, and termination. This contract's landmine is an unusual indemnification structure — not in the criteria.

  Applies to: tools
  Description: A tool is provided but isn't capable enough for this specific case.
  Example: A basic search tool is available for finding reviews, but this case requires structured data from a specialized API. The tool returns results, just not the right ones.

Density — is the signal clean enough to find? Distinct from sufficiency: the information is there, but buried in noise.
  Applies to: context
  Description: The right information is present but drowned in irrelevant material. The model latches onto more prominent patterns instead.
  Example: The critical paragraph is on page 47 of a 200-page context dump. The model has what it needs — technically — but finds hay instead of the needle.

  Applies to: instructions
  Description: The core task is buried in caveats, edge-case handlers, and meta-instructions. The model gives shallow treatment to the actual objective.
  Example: A two-sentence task wrapped in three pages of formatting rules, exception handling, and style guidelines. The model spends its attention budget on the noise.

Runtime Failures

Tool Failure
  Description: A tool returned bad data, errored, or timed out. The model incorporated the corrupted result and kept reasoning fluently from a broken foundation. Comes in three flavors: hard (timeouts, exceptions), soft (schema-compliant response with wrong or partial data — looks valid structurally, wrong at the content level), and schema (malformed output). Soft failures are most dangerous — the response passes every structural check, so the agent has no reason to question it.
  Example: Search tool returns partial results due to a timeout. The agent synthesizes a confident answer from incomplete data. Sudden reasoning derailment immediately after a tool call.

Genuine LLM Failure
  Description: Everything was set up correctly — cogent inputs, working tools — and the model still got it wrong. The residual. Non-deterministic, hard to reproduce. Often shows up in edge-case reasoning, long-chain dependency breakdowns, or attention misallocation. Least common, least actionable. The right response is quality gates, not model-blaming.
  Example: The model produces a non-sequitur or logically inconsistent conclusion despite having everything it needed. The same inputs might produce a correct result on the next run.