The Opportunity

Three realizations frame the opportunity in front of us. Each is simple. Together, they are transformative.

Knowledge work is atomic. All knowledge work — legal research, financial analysis, scientific literature review, claims processing, any of it — reduces to sequences of discrete cognitive operations. A knowledge worker formulates queries. Searches for information. Retrieves documents. Reads and extracts what matters. Evaluates sources. Synthesizes findings. Summarizes, compares, drafts, revises. The specific sequence differs by domain — a pharmaceutical researcher follows a different path than a marketing analyst — but the underlying operations are the same. When the sequence completes successfully, a deliverable emerges: a report, a recommendation, a decision. Until recently, computers managed the data infrastructure. Humans supplied all the cognition.

LLMs changed that. For the first time, the atomic operations involving natural language can be delegated to machines. An LLM can read a document and extract key points. It can evaluate relevance, synthesize across sources, summarize, expand, draft, translate, and assess sentiment. Workflows that previously required a human at every language-dependent step can now proceed through those steps automatically. The human specifies the goal, reviews the output, and intervenes at key decision points. The mechanical work of processing and generating language gets delegated. Processes that took hours take minutes. Bottlenecks at human-dependent steps dissolve.

Including the planning. Beyond executing individual operations, LLMs can participate in — or even drive — the orchestration of those operations. Given a goal and a set of capabilities, an LLM can determine the path: formulate a research plan, execute it, evaluate progress, adjust course, and continue until the goal is met. Traditionally, orchestration decisions were either made by humans in real time or hardcoded by system designers who anticipated the workflow. LLMs introduce a third option: the orchestration decisions themselves can be delegated to the model. This is what makes agentic systems possible — and it is where the enormous power lives.

The Problem

But LLMs have three core architectural shortcomings. These are not bugs the next release will fix. They are structural properties of how these models work. Better models help at the margins, but the fundamental dynamics persist. Understanding them is the prerequisite for building systems that work reliably.

The Memento Effect

Like the protagonist of Memento, an LLM is intelligent and capable but has no persistent memory: every interaction starts fresh. The context window is finite. The model has no access to information unless you explicitly provide it.

Since you cannot fit everything into the context window, you are forced to select a subset. And now you face the real challenge: how do you ensure the model has exactly the context it needs? Not the wrong context. Not missing context. Not polluted context.

When the LLM is executing a task, wrong context gives you a bad output — catchable if you have quality gates. But when the LLM is planning, wrong context means it skips steps it did not know it needed. The model does not know what it does not know. It confidently skips the search it did not realize was necessary. That error is invisible — the right step was never executed, so there is nothing to catch.

There is a related problem I call the Dirty Test Tube. By your third or fourth message in a conversation, the context contains tangents, dead ends, clarifications, and earlier failed attempts. All of that history dilutes the signal for the current step. For any focused task, accumulated conversation history is noise.

Satisficing

Humans have two modes of thinking: fast and slow. Fast thinking handles routine tasks automatically. Slow thinking engages when something is complex — we pause, break things into parts, allocate more effort. Crucially, we know when to shift gears.

Standard LLMs do not have this switch — they are always in fast-thinking mode. Reasoning models (like OpenAI's o1 family or Claude with extended thinking) represent a partial correction: they allocate more internal compute to harder problems, producing measurably better results on complex tasks. But this is a dial, not a categorical fix. Reasoning models improve the quality of individual steps, yet they still cannot recognize when a task should be decomposed differently, when an implicit sub-decision deserves its own dedicated pass, or when the overall approach needs rethinking. The improvement is real but bounded.

You see this in the "do better" phenomenon: ask an LLM to write something, then simply say "improve that." It produces a noticeably better response — often significantly better. The capability was there all along. Nothing triggered it. The first response satisficed rather than maximized.

You also see it in what I call hidden intentions. A request like "make this email more professional and concise" contains multiple implicit sub-decisions. What counts as professional? What information is essential versus removable? What tone is appropriate for this audience? The model makes quick judgments about all of these without showing its work. Errors in these implicit decisions go undetected because the decisions themselves are invisible.

The result: complex tasks get shallow treatment. Critical sub-decisions happen in passing rather than getting the dedicated attention they require.

No Ontological Grounding

Everyone knows LLMs "hallucinate." But the reality is more fundamental than occasional factual errors. LLMs have no relationship to truth — only to plausibility. The model generates statistically likely text, not verified claims.

This goes well beyond making up facts. The model produces what things sound like, not what they are: it generates the linguistic surface of reasoning without the underlying machinery.

The Failure Modes

These three shortcomings produce the pattern everyone who works with LLMs recognizes: one moment the model seems brilliant, the next you are wondering what it was thinking.

The common response is to blame the model and wait for the next version. That misses the point. These are architectural properties, not capability gaps. The remedy is not a better model. It is a better system around the model.

The Remedy: Six Principles of Orchestration

Orchestration coordinates multiple prompts, models, and tools to achieve what single interactions cannot. It is the practice of designing systems that account for what LLMs can and cannot do, and that encode what humans know about how work should be done. Six principles guide its design.

Principle 1: Decompose into Explicit Steps

Do not let critical decisions happen in passing. Force slow thinking by making each decision a dedicated step with focused context and a defined output.

When you ask an LLM to "fact-check this document," you are hiding dozens of decisions: what constitutes a claim worth checking, how to formulate verification queries, what sources to trust, what confidence threshold to apply, when to stop looking. The model makes all of these decisions implicitly, in a single pass, with no visibility into any of them.

Orchestration makes each decision point visible in the workflow. Extract claims. Prioritize by importance. For each claim, formulate search queries. Evaluate sources. Assess evidence. Report findings. Single instructions collapse implicit steps into one black-box operation. Decomposition makes each step visible, verifiable, and improvable on its own.
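The decomposition above can be sketched as a pipeline in which each decision is its own function. This is a minimal sketch, not a real implementation: `call_llm`, the sentence-based claim splitter, and the length-based priority heuristic are all illustrative stand-ins.

```python
def call_llm(instruction: str, payload: str) -> str:
    # Hypothetical LLM call; replace with your provider's API.
    return f"[{instruction}] {payload[:40]}"

def extract_claims(document: str) -> list[str]:
    # Step 1: a dedicated pass whose only job is to list the claims.
    return [s.strip() for s in document.split(".") if s.strip()]

def prioritize(claims: list[str]) -> list[str]:
    # Step 2: rank by importance; longest-first is a placeholder heuristic.
    return sorted(claims, key=len, reverse=True)

def check_claim(claim: str) -> dict:
    # Steps 3-5, one claim at a time: each decision is its own visible call.
    query = call_llm("formulate search query", claim)
    verdict = call_llm("assess evidence for", query)
    return {"claim": claim, "query": query, "verdict": verdict}

def fact_check(document: str) -> list[dict]:
    # Step 6: report findings. The code, not the model, owns the sequence.
    return [check_claim(c) for c in prioritize(extract_claims(document))]

report = fact_check("The Eiffel Tower is in Paris. It opened in 1889.")
```

Each function can now be inspected, tested, and improved independently — which is exactly what the single black-box instruction prevents.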

Principle 2: Curate Sterile Context

Each step gets exactly what it needs — not accumulated conversation history, not everything that might be relevant, but precisely what this operation requires to do its job well.

The Dirty Test Tube problem means that by the time you are several turns into a process, the context is polluted with tangents, failed attempts, and conversational debris. For any given step, start with a sterile conversation containing exactly what is needed.

The key technique is what I call Compress and Carry Forward: at each transition between steps, define what constitutes signal versus noise for the next step, compress accordingly, and carry only the distilled context forward. The previous step's reasoning and false starts stay behind. The next step sees a clean workspace.
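Compress and Carry Forward can be sketched as follows, with a stubbed `summarize` standing in for a real model-driven compression call:

```python
def summarize(text: str, focus: str) -> str:
    # Hypothetical compression call: a real system would ask a model to keep
    # only what the next step needs, guided by `focus`. Truncation stands in.
    return f"[{focus}] {text[-70:]}"

def run_step(step_fn, carried: str) -> tuple[str, str]:
    raw = step_fn(carried)                        # messy working output
    distilled = summarize(raw, "signal for next step")
    return raw, distilled                         # only `distilled` travels on

def step_a(ctx: str) -> str:
    return ctx + " ...exploration, dead ends, final finding A"

def step_b(ctx: str) -> str:
    return ctx + " ...builds on finding A only"

_, carry = run_step(step_a, "initial brief")
raw_b, _ = run_step(step_b, carry)  # step B starts from the distilled carry
```

The point is structural: the raw output of each step is returned but never forwarded, so every step begins in a sterile context.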

Principle 3: Externalize State and Control Flow

Loops, counters, progress tracking, and conditional logic live outside the LLM. The system tracks reality. The LLM reasons about language and content.

If you ask an LLM to "process all 50 items in this list," it will lose count, skip items, or process some twice. Not because it is incompetent, but because counting and tracking are not what statistical text generation does well. The workflow — the loops, conditionals, counters, and state — lives in code. The LLM executes individual steps within that structure. The orchestration system manages the flow.
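A minimal sketch of that division of labor, with a stubbed `process_item` in place of a real per-item model call:

```python
def process_item(item: str) -> str:
    # Hypothetical per-item LLM call (e.g. classify or summarize one item).
    return item.upper()

items = [f"item-{i}" for i in range(50)]
state = {"done": {}, "failed": []}     # state lives here, not in the context

for item in items:                     # the loop is code, not model narration
    try:
        state["done"][item] = process_item(item)
    except Exception:
        state["failed"].append(item)   # nothing is silently skipped or lost

# The system, not the model, can prove every item was handled exactly once.
assert len(state["done"]) + len(state["failed"]) == len(items)
```

The model never needs to remember where it is in the list; the orchestration layer guarantees coverage.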

Principle 4: Bound Before Delegating

When you use an agentic loop — an LLM with a goal, instructions, and tools, deciding iteratively what to do next — the degrees of freedom should already be appropriate by the time you enter that loop. The boundary is defined by the instructions you supply and the tools you provide.

If the agent has too broad a mandate, that is a design failure, not an agent failure. An agent asked to "research everything about this company" will wander. An agent asked to "determine Q3 revenue from these SEC filings using these extraction tools" will focus. The constraint is not a limitation — it is what makes the agent effective.
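The boundary can be made literal: data the system fixes before the agentic loop ever starts. A sketch with illustrative names (`sec_filing_search` and the other tool names are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class AgentSpec:
    goal: str
    tools: list = field(default_factory=list)
    max_iterations: int = 10           # a hard stop, set by the system

# Too broad: an open mandate plus every tool invites wandering.
unbounded = AgentSpec(
    goal="research everything about this company",
    tools=["web_search", "browser", "code_interpreter", "email"])

# Bounded: one question, two tools, a fixed iteration budget.
bounded = AgentSpec(
    goal="determine Q3 revenue from these SEC filings",
    tools=["sec_filing_search", "table_extractor"],
    max_iterations=5)
```

The spec is decided before delegation; inside the loop, the agent chooses among its tools but never renegotiates the boundary.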

Principle 5: Encode Expertise in Tool Abstraction

Higher-level tools encode "the right way to do this." A research tool that internally handles query formulation, result evaluation, and gap analysis reduces the LLM's decision surface and makes the reliable path the default path.

Every decision point for an agent is a potential failure point. Agents reasoning through ten micro-steps have more chances to go wrong than agents selecting among three well-designed macro-capabilities. The internal complexity of those capabilities is handled by deterministic orchestration that has been tested and optimized.

Tools should be organized in layers of increasing abstraction. At the base, atomic operations: search a corpus, extract entities, aggregate by field. Above that, composed tools that bundle operations with error handling: an email search tool, a map-reduce rollup, a summarizer. Higher still, reusable workflow patterns: a research pattern, a monitoring pattern, a processing pattern. At the top, specialized domain agents with deep knowledge baked in.
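A sketch of the layering: two atomic operations at the base and one composed tool that bundles them with error handling. All functions here are illustrative placeholders, not a real search API.

```python
def search_corpus(query: str) -> list[str]:
    # Atomic operation: raw search, no error handling, no retries.
    return [f"doc about {query}"] if query else []

def extract_entities(doc: str) -> list[str]:
    # Atomic operation: naive placeholder extraction (title-cased words).
    return [w for w in doc.split() if w.istitle()]

def research_tool(question: str) -> dict:
    # Composed tool: query formulation, retrieval, and extraction bundled,
    # so the agent selects one capability instead of reasoning through three.
    query = question.strip().rstrip("?")
    docs = search_corpus(query)
    if not docs:                       # error handling lives in the tool
        return {"answer": None, "entities": [], "note": "no results"}
    entities = [e for d in docs for e in extract_entities(d)]
    return {"answer": docs[0], "entities": entities, "note": "ok"}

result = research_tool("Where is Acme Corp headquartered?")
```

The agent sees one macro-capability with one failure mode it must handle; the micro-steps are deterministic and tested.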

The tool inventory itself becomes a cognitive scaffold — it shapes how problems get decomposed and which solution paths are available. When you hand an agent three well-designed tools instead of thirty primitives, you are encoding your expertise about how work should be done.

Principle 6: Quality Gates at Critical Junctions

Verify outputs and strategic decisions before proceeding. Do not trust the model's self-assessment that it has enough information or made the right choice.

Garbage in, garbage out. If step 3 produces flawed output, every subsequent step is compromised. Gates prevent error propagation. They are the checkpoints where the system asks: is this good enough to build on?

Evaluation methods span a spectrum. Human review handles subjective judgment and high-stakes decisions. LLM evaluation — a separate model assessing the output — catches structural and completeness issues. Programmatic checks handle the mechanical: Does the JSON parse? Is the format valid? Is the value in range? Are all required fields populated?
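The programmatic end of the spectrum is the easiest to show. A sketch of a mechanical gate, assuming a hypothetical output schema with `claim`, `verdict`, and `confidence` fields:

```python
import json

def passes_gate(raw_output: str) -> tuple[bool, str]:
    try:
        data = json.loads(raw_output)            # does the JSON parse?
    except json.JSONDecodeError:
        return False, "invalid JSON"
    for fld in ("claim", "verdict", "confidence"):
        if fld not in data:                      # required fields populated?
            return False, f"missing field: {fld}"
    if not 0.0 <= data["confidence"] <= 1.0:     # value in range?
        return False, "confidence out of range"
    return True, "ok"

ok, reason = passes_gate(
    '{"claim": "X", "verdict": "supported", "confidence": 0.9}')
bad, why = passes_gate('{"claim": "X"}')
```

The gate returns a reason along with its verdict, so a failed check can route to a retry or escalation rather than a silent drop.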

The rule is simple: do not proceed past a gate until required conditions are met. If step 3 fails its gate, fix it before step 4 begins.

An Example: Research Done Right

Consider a concrete application: answering a complex question that requires information from multiple sources. The naive approach is to give an LLM a search tool and a goal. It will produce something. But it will be unreliable in ways that are difficult to detect.

Here is how principled orchestration handles it.

Phase 1: Clarification

Before any research begins, clarify what is actually being asked. The system generates a disambiguated version of the question: "I understand you want to compare X and Y across these dimensions. Should I focus on any specific aspects?" This is a quality gate with human validation. It catches misunderstandings before they propagate through the entire process. The LLM works as a focused worker here — its only job is to surface ambiguity and confirm scope.

Phase 2: Requirements Analysis

Generate an explicit checklist of what a complete answer requires. For a comparison question: performance benchmarks, pricing data, availability information, version details, known limitations. This step makes hidden intentions visible. Instead of the model implicitly deciding what matters — and potentially skipping something critical — the requirements are explicit and inspectable. The question "what makes a good answer?" gets dedicated attention as its own step.

Phase 3: Iterative Retrieval

Research happens in a structured loop, not freestyle exploration. The system maintains a Local Transient Knowledge Base — an explicit data structure outside the LLM. This is actual state, not narrative about state.

Each iteration follows a defined cycle. First, gap analysis: compare the current knowledge base against the requirements checklist and identify what is missing. Second, query generation: for each gap, generate targeted search queries. Third, retrieval and evaluation: execute searches, evaluate results against the specific need that motivated them. Fourth, integration: add verified information to the knowledge base, flagging conflicts rather than silently resolving them. Fifth, completeness check: does the knowledge base satisfy the requirements? If not, iterate again.

The loop continues until requirements are met or a maximum iteration count is reached. The exit condition is explicit and verifiable — not a vibes-based judgment by the model that it "probably has enough."
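The cycle above can be sketched with the knowledge base as a plain dict and the exit condition checked in code. `search` is a stubbed stand-in for real retrieval, and the requirements list is illustrative:

```python
def search(gap: str):
    # Hypothetical retrieval: a real system would query sources per gap.
    stubbed = {"pricing": "$10/mo", "benchmarks": "2x faster"}
    return stubbed.get(gap)

requirements = ["pricing", "benchmarks", "availability"]
knowledge_base = {}                    # actual state, outside the LLM
MAX_ITERATIONS = 3

for iteration in range(MAX_ITERATIONS):
    # Gap analysis: compare state against the requirements checklist.
    gaps = [r for r in requirements if r not in knowledge_base]
    if not gaps:                       # explicit, verifiable exit condition
        break
    for gap in gaps:
        finding = search(gap)          # retrieval and evaluation per gap
        if finding is not None:
            knowledge_base[gap] = finding   # integration into the KB

unmet = [r for r in requirements if r not in knowledge_base]
# `unmet` feeds synthesis as acknowledged uncertainty, not silent omission.
```

Whether the loop exits by completeness or by budget, the system knows which requirements remain unmet, and that list travels forward explicitly.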

Phase 4: Synthesis

Only after the knowledge base is sufficiently complete does synthesis happen. The synthesizer receives three things: the clarified question, the requirements checklist, and the populated knowledge base. Its job is focused and bounded: produce a coherent answer from verified materials, with citations and acknowledged uncertainty where gaps remain.

What This Demonstrates

Every principle is at work in this example. The phases are fixed, encoding expertise from above about how good research works. The individual operations leverage LLM capability in the worker role. The tools encode best practices from below. The knowledge base is externalized state, addressing the grounding problem. The requirements checklist forces thorough treatment, addressing satisficing. Curated context per step addresses the Memento Effect. And quality gates at each transition prevent silent error propagation.

The result is research you can actually trust — not because the model is infallible, but because the system is designed to catch and correct the places where it is not.

The Bottom Line

The capability is already in the models. Orchestration is how you extract it reliably.

The productivity multiplier of AI is not a simple coefficient applied to your existing workflow. It is a function of how well you decompose problems, how strategically you encode domain expertise into system design, and how rigorously you verify outputs at critical junctions.

The technology is powerful. The orchestration architecture determines whether that power translates to reliable value. The model is not the bottleneck — the usage pattern is. And that is good news, because the usage pattern is something we can design.