Why Your AI Agents Are Inconsistent (And How Context Graphs Fix It)
The Agent Problem
AI agents are everywhere now. They're handling customer support tickets, qualifying leads, processing documents, making recommendations, executing workflows. The promise is compelling: autonomous systems that handle routine decisions so humans can focus on what matters.
But teams deploying agents keep hitting the same problems:
Inconsistent performance. The agent handles 80% of cases well, but the other 20% are unpredictable. Same type of situation, different responses. Nobody's sure why.
Black box decisions. When something goes wrong, you can't trace why the agent did what it did. The logs show inputs and outputs, but not the reasoning. Debugging is archaeology.
No learning from experience. The agent makes the same mistakes repeatedly. It doesn't learn that "this type of customer needs to be escalated" or "this phrasing causes confusion." Every interaction starts from zero.
Audit nightmares. Regulators, compliance teams, or just your own leadership ask: "Why did the system make this decision?" You don't have a good answer. The model did... something.
These aren't edge cases. They're the core challenges of putting agents into production. And they all stem from the same root cause:
Agents have reasoning capability but no structured decision memory.
Why RAG Isn't Enough
The standard fix is retrieval. Connect the agent to your knowledge base. When it needs to make a decision, retrieve relevant documents, policies, past examples.
This helps with knowledge gaps—the agent can look up your refund policy instead of hallucinating one. But it doesn't solve the deeper problems:
Retrieval doesn't ensure consistency. Different queries surface different documents. The agent reasons from whatever it retrieves, which varies. Same situation, different retrieved context, different decision.
Documents aren't decision logic. Your policy doc says "use judgment for complex cases." That's not actionable for an agent. It needs to know how to judge, what factors matter, what tradeoffs to make.
Retrieval isn't auditable reasoning. You can log what documents were retrieved. But "the agent saw these 5 documents and then decided X" isn't an explanation. What in those documents led to X? Why X instead of Y?
Retrieval doesn't learn. The knowledge base is static. The agent doesn't update it based on what works. Patterns that emerge from thousands of interactions stay invisible.
RAG gives agents access to information. It doesn't give them structured decision-making capability.
What Agents Actually Need
Think about how a well-functioning team handles decisions consistently:
- Shared mental models: Everyone knows the types of situations they encounter and how to recognize them.
- Explicit heuristics: "If X and Y, then usually Z" — patterns that guide decisions without requiring deep analysis every time.
- Clear escalation criteria: When to handle autonomously vs. when to involve someone else.
- Feedback loops: When decisions go wrong, the team learns and adjusts.
- Audit trails: You can explain why a decision was made by pointing to the reasoning, not just the outcome.
This is what agents need. Not just access to documents—structured decision infrastructure that makes their reasoning consistent, traceable, and improvable.
That's what a Context Graph provides.
Context Graphs: Decision Infrastructure for Agents
A Context Graph is a structured layer that sits between your agent and its decisions. It contains:
Decision contexts: Classified types of situations the agent encounters, represented as semantic clusters. "This is a billing dispute from an enterprise customer with high tenure" rather than unstructured text.
Heuristics: Explicit rules that map contexts to decisions. "Billing disputes + enterprise + high tenure → apply retention-focused resolution, offer goodwill credit in $X-Y range, success rate 78%."
Confidence bounds: How well-matched is this situation to known patterns? When should the agent decide vs. escalate?
Outcome tracking: What happened when each heuristic was applied? Did it work?
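To make these components concrete, here is a minimal sketch of how they might be represented as data. The class and field names (DecisionContext, Heuristic, Outcome, success_rate, and so on) are illustrative assumptions, not a fixed schema.

```python
from dataclasses import dataclass, field

@dataclass
class DecisionContext:
    """A classified type of situation the agent encounters."""
    context_id: str                   # e.g. "billing_dispute_enterprise_high_tenure"
    features: dict[str, str]          # e.g. {"issue": "billing_dispute", "segment": "enterprise"}
    embedding: list[float] = field(default_factory=list)  # centroid of the semantic cluster

@dataclass
class Heuristic:
    """An explicit rule that maps a context to a recommended decision."""
    heuristic_id: str                 # e.g. "H-142"
    context_id: str                   # which context it applies to
    recommendation: str               # e.g. "retention-focused resolution, goodwill credit"
    success_rate: float = 0.0         # fraction of tracked outcomes judged successful
    sample_size: int = 0              # how many outcomes back that rate

@dataclass
class Outcome:
    """What happened when a heuristic was applied, feeding the learning loop."""
    decision_id: str
    heuristic_id: str
    success: bool
    notes: str = ""
```

Confidence bounds fall out of the same data: a situation whose embedding sits far from every stored context, or a heuristic backed by a thin sample, is a signal to escalate rather than decide.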
The agent doesn't reason from scratch or from raw retrieved documents. It:
- Classifies the current situation against known contexts
- Retrieves the relevant heuristics for that context
- Applies the heuristics with the LLM's reasoning capability
- Logs which heuristics were applied and why
- Captures the outcome for future learning
This changes everything about how agents perform and how you can audit them.
Performance: From Chaos to Consistency
The Problem
Without structured decision memory, agent performance is inherently variable. The LLM reasons from whatever context it has, which means:
- Different prompt formulations → different decisions
- Different retrieved documents → different decisions
- Model temperature and sampling → different decisions
- Subtle context variations → unpredictable responses
You're not running a decision system. You're running a reasoning engine and hoping it decides consistently.
The Context Graph Solution
With a Context Graph, decisions flow through explicit structure:
Situation → Context Classification → Heuristic Selection → Guided Decision
Context classification reduces variance. Instead of reasoning about raw inputs, the agent first maps the situation to a known type. "This is a [Tier 2 escalation] for a [high-value customer] with [technical issue] and [executive visibility]." The classification is deterministic given the input features.
Heuristics constrain the decision space. The agent isn't choosing from infinite possibilities. It's applying known patterns: "For this context type, the standard approach is X, with variations for conditions A, B, C."
Confidence bounds trigger escalation. If the situation doesn't match known patterns well, the agent knows to escalate rather than guess. "Confidence below threshold—routing to human review."
The same inputs produce the same classification, retrieve the same heuristics, and generate consistent decisions.
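A toy sketch of why this reduces variance: once classification is a deterministic function of extracted features, identical inputs always select identical heuristics. The classify function and heuristics index below are illustrative stand-ins, not a real API.

```python
# Hypothetical index: context key -> applicable heuristic ids.
HEURISTICS_BY_CONTEXT = {
    ("billing_dispute", "enterprise", "high_tenure"): ["H-142", "H-089"],
    ("billing_dispute", "smb", "low_tenure"): ["H-201"],
}

def classify(features: dict) -> tuple:
    """Deterministic mapping from extracted features to a context key."""
    tenure = "high_tenure" if features["tenure_years"] > 2 else "low_tenure"
    return (features["issue_type"], features["segment"], tenure)

def heuristics_for(features: dict) -> list:
    return HEURISTICS_BY_CONTEXT.get(classify(features), [])

# The same situation always maps to the same context key and the same heuristics.
case = {"issue_type": "billing_dispute", "segment": "enterprise", "tenure_years": 4.2}
assert heuristics_for(case) == heuristics_for(dict(case)) == ["H-142", "H-089"]
```

The LLM still generates the final response, but it does so inside a constrained, repeatable frame.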
Measurable Impact
Teams implementing Context Graphs for agent systems typically see:
- Decision consistency: 40-60% reduction in variance for same-type situations
- Escalation accuracy: Agents learn which situations they handle well vs. poorly
- Edge case handling: Novel situations are flagged rather than handled badly
- Performance over time: As heuristics are refined, decision quality improves without model changes
Auditability: From Black Box to Glass Box
The Problem
When an agent makes a decision, what's in the audit log?
```
Timestamp: 2024-11-15 14:32:07
Input: [customer message]
Retrieved: [doc1, doc2, doc3]
Output: [agent response]
Model: gpt-4-turbo
```
This tells you nothing about why. Why this response? What reasoning led here? If you need to explain this decision to a regulator, a customer, or your own leadership, you're stuck reverse-engineering from outputs.
The Context Graph Solution
With a Context Graph, the audit trail captures the decision logic:
```
Timestamp: 2024-11-15 14:32:07
Input: [customer message]

Context Classification:
  - Type: billing_dispute
  - Customer segment: enterprise
  - Tenure: 4.2 years (high)
  - Issue severity: medium
  - Sentiment: frustrated
  - Classification confidence: 0.91

Heuristics Applied:
  - H-142: "Enterprise billing disputes, high tenure"
    - Recommendation: Retention-focused resolution
    - Success rate: 78% (n=234)
    - Conditions met: ✓ enterprise, ✓ tenure >2yr, ✓ billing issue
  - H-089: "Frustrated sentiment modifier"
    - Recommendation: Lead with acknowledgment, then solution
    - Evidence: Reduces escalation by 34%

Decision Generated:
  - Action: Apply $150 credit, waive disputed charge
  - Reasoning: Standard resolution for profile, within authority limits
  - Confidence: High (0.87)

Output: [agent response]
Outcome: [pending capture]
```
Now you can answer:
- Why this decision? Because the situation was classified as X, which matched heuristics Y and Z.
- Was it appropriate? The heuristics have a 78% success rate for this context.
- Should it have escalated? Confidence was 0.87, above the 0.75 threshold.
- What would change the decision? Different classification or different heuristic match.
This is auditable reasoning, not just logged I/O.
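A record like the one above is easy to emit as structured data rather than free text, which makes it queryable later. A minimal sketch; the helper and field names are hypothetical, and the schema is yours to define:

```python
import json
from datetime import datetime, timezone

def build_audit_record(situation_id, classification, heuristics, decision):
    """Assemble one auditable decision record (illustrative field names)."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "situation_id": situation_id,
        "classification": classification,   # type, segment, confidence, ...
        "heuristics_applied": heuristics,   # ids, recommendations, success rates
        "decision": decision,               # action, reasoning, confidence
        "outcome": None,                    # filled in later by outcome capture
    }

record = build_audit_record(
    situation_id="case-20241115-0042",
    classification={"type": "billing_dispute", "segment": "enterprise", "confidence": 0.91},
    heuristics=[{"id": "H-142", "success_rate": 0.78}, {"id": "H-089"}],
    decision={"action": "apply_credit_and_waive_charge", "confidence": 0.87},
)
print(json.dumps(record, indent=2))  # in practice, append to an audit store
```

Because every field traces back to a context or a heuristic, questions like "why this decision?" and "should it have escalated?" become queries over the log instead of forensic exercises.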
Compliance and Governance
For regulated industries, this structure provides:
Explainability: Decisions trace to explicit rules with documented evidence.
Consistency documentation: You can show that same-type situations receive same-type treatment.
Override tracking: When humans override agent decisions, you capture why—and can learn from it.
Policy alignment: Heuristics can encode compliance requirements, making adherence systematic (see the sketch below).
Audit readiness: When regulators ask "how does your AI make decisions," you have a concrete answer.
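For example, a heuristic can carry a hard policy limit alongside its recommendation, so the constraint is checked mechanically instead of remembered. A small sketch with made-up names and limits:

```python
# Illustrative only: a heuristic that carries its compliance constraint with it.
H_142 = {
    "id": "H-142",
    "recommendation": "retention-focused resolution with goodwill credit",
    "credit_range": (100, 200),
    "policy": {"max_credit": 250, "requires_manager_above": 200},  # assumed limits
}

def is_compliant(heuristic: dict, proposed_credit: float) -> bool:
    """Reject any decision that would exceed the encoded policy limit."""
    return proposed_credit <= heuristic["policy"]["max_credit"]

assert is_compliant(H_142, 150)       # within the standard resolution range
assert not is_compliant(H_142, 400)   # would violate the limit, so escalate instead
```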
The Query Optimizer Parallel
To understand why this architecture works, look at how databases solved a similar problem.
The Database Challenge
Early databases executed queries naively. Parse the SQL, scan the tables, return results. This worked until it didn't—complex queries on large tables took forever.
The solution wasn't just faster hardware. It was query optimizers—systems that figure out the best way to execute a query before running it.
How Query Optimizers Work
Query optimizers don't store execution plans for every possible query. Instead, they maintain:
| Component | Purpose |
|---|---|
| Statistics | Compressed summaries of data: cardinality, distributions, histograms |
| Cost models | Rules for estimating operation costs: "index scan = X, table scan = Y" |
| Heuristics | Patterns: "small result set → nested loop; large → hash join" |
| Plan generator | Combines the above to produce optimal plans at runtime |
| Feedback loop | Compares estimated vs. actual costs; refines the model |
The optimizer generates fresh plans using accumulated knowledge about how to plan well.
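The pattern fits in a few lines: statistics plus a cost model generate a plan at runtime instead of looking one up. The costs and thresholds below are made up purely for illustration; real optimizers are far more sophisticated.

```python
# Toy cost model: choose a join strategy from table statistics,
# the way an optimizer chooses a plan. All numbers are invented.

def estimate_cost(strategy: str, outer_rows: int, inner_rows: int) -> float:
    if strategy == "nested_loop":
        return outer_rows * inner_rows * 0.001        # cheap only for small inputs
    if strategy == "hash_join":
        return (outer_rows + inner_rows) * 0.01 + 50  # build cost plus linear probe
    raise ValueError(strategy)

def choose_join(outer_rows: int, inner_rows: int) -> str:
    strategies = ("nested_loop", "hash_join")
    return min(strategies, key=lambda s: estimate_cost(s, outer_rows, inner_rows))

print(choose_join(10, 100))         # small inputs  -> nested_loop
print(choose_join(10_000, 50_000))  # large inputs  -> hash_join
```

No plan is stored for any specific query; the statistics and cost model are enough to generate a good one on demand, and the feedback loop keeps the estimates honest.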
The Agent Parallel
| Query Optimizer | Context Graph |
|---|---|
| Statistics layer: data distributions, cardinality | Decision space embeddings: situation clusters, context similarities |
| Cost model: operation cost estimates ("index scan: 10ms") | Heuristics library: decision patterns with success rates ("retention offer: 78% success") |
| Plan generator: query → optimal execution plan | Decision generator: situation → optimal decision |
| Feedback loop: estimated vs. actual cost | Learning loop: predicted vs. actual outcome |
The insight:
Query optimizers don't retrieve stored plans—they generate optimal plans from statistics and heuristics.
Agent systems shouldn't retrieve raw examples—they should generate optimal decisions from context patterns and heuristics.
This is the architecture that makes performance consistent and reasoning auditable.
Building a Context Graph for Your Agents
Step 1: Map Your Decision Types
What decisions does your agent make? List them:
- Ticket routing
- Refund approvals
- Escalation triggers
- Response tone selection
- Information requests
For each, identify the key context dimensions (a sketch of this mapping follows the list):
- Customer attributes (segment, tenure, value)
- Situation attributes (issue type, severity, history)
- Constraints (policy limits, authority levels)
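One lightweight way to record the result of this step, with purely illustrative names and dimensions:

```python
# Step 1 output: decision types mapped to the context dimensions that matter.
# Everything here is an example, not a required schema.
DECISION_TYPES = {
    "refund_approval": {
        "customer": ["segment", "tenure", "annual_value"],
        "situation": ["issue_type", "severity", "prior_refunds"],
        "constraints": ["policy_limit_usd", "authority_level"],
    },
    "escalation_trigger": {
        "customer": ["segment", "annual_value"],
        "situation": ["issue_type", "duration_hours", "executive_visibility"],
        "constraints": ["sla_tier"],
    },
}
```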
Step 2: Extract Current Heuristics
Your experienced team already has implicit heuristics. Extract them:
"When would you escalate this?"
"How do you decide on a refund amount?"
"What makes a situation 'complex' vs. 'routine'?"
Document as explicit rules:
```yaml
heuristic:
  id: escalation_criteria_001
  context:
    customer_value: ">$50K ARR"
    issue_type: "service_degradation"
    duration: ">24 hours"
  decision: escalate_to_tier2
  rationale: "High-value customers with extended outages need senior attention"
  confidence: high
```
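Once rules are written down like this, they can be loaded and indexed so the agent can look them up at runtime. A minimal sketch, assuming the rule above is saved in a file such as heuristics.yml and that PyYAML is installed:

```python
import yaml  # PyYAML

def load_heuristics(paths):
    """Load documented heuristics into an id-keyed index for runtime matching."""
    index = {}
    for path in paths:
        with open(path) as f:
            for doc in yaml.safe_load_all(f):  # one or more heuristic docs per file
                rule = doc["heuristic"]
                index[rule["id"]] = rule
    return index

heuristics = load_heuristics(["heuristics.yml"])
print(heuristics["escalation_criteria_001"]["decision"])  # -> escalate_to_tier2
```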
Step 3: Build the Classification Layer
Create embeddings for your context types. When a new situation arrives (see the sketch after these steps):
- Extract context features (customer info, issue details, history)
- Embed the situation
- Match against known context clusters
- Return classification with confidence
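A minimal sketch of that matching step. It assumes you already have an embed() function (any sentence-embedding model will do) and a CONTEXT_CENTROIDS map built offline from labeled examples; both are assumptions, not part of any particular library.

```python
import numpy as np

# Assumed: embed(text) returns a 1-D numpy vector; CONTEXT_CENTROIDS maps each
# known context type to the mean embedding of its labeled examples.
CONTEXT_CENTROIDS: dict[str, np.ndarray] = {}  # populate offline from labeled cases

def classify_context(situation_text: str, embed) -> tuple[str, float]:
    """Return the best-matching context type and a cosine-similarity confidence."""
    v = embed(situation_text)
    best_type, best_sim = None, -1.0
    for ctx_type, centroid in CONTEXT_CENTROIDS.items():
        sim = float(np.dot(v, centroid) / (np.linalg.norm(v) * np.linalg.norm(centroid)))
        if sim > best_sim:
            best_type, best_sim = ctx_type, sim
    return best_type, best_sim  # the caller escalates if best_sim is below threshold
```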
Step 4: Wire Up the Decision Flow
Agent decision process becomes:
```python
def agent_decision(situation):
    # 1. Classify context
    context = classify_context(situation)

    # 2. Check confidence
    if context.confidence < THRESHOLD:
        return escalate(situation, reason="low_confidence")

    # 3. Retrieve applicable heuristics
    heuristics = get_heuristics(context.type)

    # 4. Generate decision with LLM
    decision = llm.generate(
        situation=situation,
        context=context,
        heuristics=heuristics,
        instruction="Apply the relevant heuristics to this situation. Explain your reasoning."
    )

    # 5. Log for audit
    log_decision(
        situation=situation,
        context=context,
        heuristics_applied=heuristics,
        decision=decision
    )

    # 6. Capture outcome (async)
    schedule_outcome_capture(decision.id)

    return decision
```
Step 5: Close the Learning Loop
Track outcomes. Periodically analyze:
- Heuristic performance: Which rules have high/low success rates?
- Context gaps: Which situations don't match known patterns?
- Override patterns: When do humans override the agent? Why?
- Emerging patterns: Are there new heuristics in the outcome data?
Update the Context Graph (a minimal review-pass sketch follows this list):
- Increase confidence for validated heuristics
- Refine or deprecate underperforming ones
- Add new heuristics from discovered patterns
- Expand context coverage for gap areas
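A minimal sketch of that review pass, assuming outcomes are logged as records that carry a heuristic id and a success flag; the thresholds are arbitrary placeholders:

```python
from collections import defaultdict

def review_heuristics(outcomes: list, min_samples: int = 30) -> dict:
    """Tag each heuristic for promotion, review, or more data, based on logged outcomes."""
    stats = defaultdict(lambda: {"n": 0, "wins": 0})
    for o in outcomes:  # e.g. {"heuristic_id": "H-142", "success": True}
        s = stats[o["heuristic_id"]]
        s["n"] += 1
        s["wins"] += int(o["success"])

    verdicts = {}
    for hid, s in stats.items():
        if s["n"] < min_samples:
            verdicts[hid] = "insufficient_data"   # keep collecting outcomes
        elif s["wins"] / s["n"] >= 0.75:
            verdicts[hid] = "promote"             # raise confidence
        else:
            verdicts[hid] = "review"              # refine or deprecate
    return verdicts
```

Situations that never matched a known context with decent confidence are the other half of the review: each cluster of them is a candidate for a new context type and a new heuristic.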
The Compound Effect
Week 1: Agent uses initial heuristics extracted from team knowledge. Decisions are more consistent than before, audit trail is clear.
Month 1: Outcome data reveals that Heuristic H-012 underperforms for a specific customer segment. Refine it. Performance improves.
Month 3: Pattern analysis surfaces a new heuristic nobody had articulated: "Customers who contact support within 48 hours of a billing change are usually confused, not upset—informational response outperforms apologetic." Add it.
Month 6: The Context Graph has 3x the heuristics it started with, each validated against real outcomes. Edge case handling improves because the system knows what it doesn't know. Escalation accuracy is high.
Year 1: The agent handles 90% of cases with consistent, auditable decisions. The 10% it escalates are genuinely novel or complex. New team members inherit the accumulated decision intelligence. Compliance reviews are straightforward because reasoning is documented.
This is what happens when agent decision-making becomes infrastructure rather than ad-hoc reasoning.
Architecture Summary
```
┌─────────────────────────────────────────────────────────────┐
│                     INCOMING SITUATION                      │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                     CONTEXT CLASSIFIER                      │
│ • Extract features from situation                           │
│ • Embed against known context types                         │
│ • Return classification + confidence                        │
└─────────────────────────────────────────────────────────────┘
                               │
            ┌──────────────────┴─────────────┐
            │                                │
 confidence < threshold           confidence ≥ threshold
            │                                │
            ▼                                ▼
┌───────────────────────┐   ┌─────────────────────────────────┐
│       ESCALATE        │   │       HEURISTICS SELECTOR       │
│ Route to human with   │   │ • Match context to heuristics   │
│ context + reasoning   │   │ • Resolve conflicts             │
└───────────────────────┘   │ • Package for decision          │
                            └─────────────────────────────────┘
                                             │
                                             ▼
                            ┌─────────────────────────────────┐
                            │       DECISION GENERATOR        │
                            │ • LLM applies heuristics        │
                            │ • Generates decision + reason   │
                            │ • Estimates confidence          │
                            └─────────────────────────────────┘
                                             │
                                             ▼
                            ┌─────────────────────────────────┐
                            │          AUDIT LOGGER           │
                            │ • Context classification        │
                            │ • Heuristics applied            │
                            │ • Decision + reasoning          │
                            │ • Confidence scores             │
                            └─────────────────────────────────┘
                                             │
                                             ▼
                            ┌─────────────────────────────────┐
                            │         OUTCOME CAPTURE         │
                            │ • Track what happened           │
                            │ • Compare to prediction         │
                            │ • Feed learning loop            │
                            └─────────────────────────────────┘
                                             │
                                             ▼
                            ┌─────────────────────────────────┐
                            │          LEARNING LOOP          │
                            │ • Validate/refine heuristics    │
                            │ • Discover new patterns         │
                            │ • Update confidence scores      │
                            │ • Expand context coverage       │
                            └─────────────────────────────────┘
```
Conclusion
AI agents have a structural problem: powerful reasoning with no decision memory. They make inconsistent choices, can't explain their reasoning, and don't learn from experience.
RAG doesn't solve this. Retrieval gives agents access to documents, not decision logic. You get information, not consistency or auditability.
Context Graphs provide the missing infrastructure. They capture decision contexts, encode heuristics, guide agent reasoning through explicit patterns, and create audit trails that trace decisions to logic rather than black-box outputs.
The query optimizer parallel shows why this works. Databases don't store every execution plan—they maintain statistics and heuristics to generate optimal plans at runtime. Agents shouldn't reason from scratch every time—they should apply accumulated decision patterns, validated by outcomes, improving over time.
The result: agents that perform consistently, make auditable decisions, and get better as they operate.
That's not just better AI. That's AI you can actually deploy.
Context Graphs represent an emerging architecture pattern for production agent systems—structured decision memory that enables consistent performance and traceable reasoning. The query optimizer parallel draws on decades of database engineering to illuminate how decision infrastructure should work.