April 14, 2026

Why Can’t I Just Vibe Code My Own Harness?

Alec Kloss

Principal Engineer, Crogl

You're going to hear this question a lot. I already am. It came up on a cold call this week with an enterprise IR leader. It came up at a conference last month. It comes up every time someone sees what a purpose-built AI harness can do for security operations and thinks, "I've got Claude. I've got a weekend. How hard can it be?"

The answer is: harder than you think. And there's actual research to back that up now.

What Even Is a Harness?

Jamin Ball over at Clouded Judgement put it well in his recent Substack post. He's a VC who writes some of the best data-driven SaaS analysis out there, and he's been digging into where the real value sits in the AI stack. His take:

A harness is really the code that determines what information a model sees at each step, what to store, what to retrieve, and what context to present. It's the scaffolding around the model.

He goes on to say that there's an enormous amount of value in what sits around the model, not just the model itself. And he's right. The harness is the thing that turns a smart but directionless LLM into something that can actually do a job.
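To make that definition concrete, here is a minimal sketch of the idea in Python. Everything in it is illustrative: the class name, the keyword-overlap retrieval, and the crude item budget are assumptions I'm making for the example, not how any particular product (ours included) implements it. The point is only that "store, retrieve, present" are decisions the scaffolding makes, not the model.

```python
from dataclasses import dataclass, field

@dataclass
class Harness:
    """Toy harness: decides what the model sees at each step.
    All names here are illustrative, not a real API."""
    memory: list = field(default_factory=list)   # what to store
    max_items: int = 5                           # crude context budget

    def store(self, observation: str) -> None:
        self.memory.append(observation)

    def retrieve(self, query: str) -> list:
        # What to retrieve: naive keyword overlap, newest first.
        terms = set(query.lower().split())
        hits = [m for m in reversed(self.memory)
                if terms & set(m.lower().split())]
        return hits[: self.max_items]

    def build_context(self, task: str) -> str:
        # What context to present at this reasoning step.
        relevant = self.retrieve(task)
        return f"Task: {task}\nRelevant notes:\n" + \
            "\n".join(f"- {m}" for m in relevant)

h = Harness()
h.store("host-17: anomalous powershell execution at 02:14")
h.store("host-22: normal backup job")
ctx = h.build_context("powershell activity on host-17")
```

Even at this toy scale, the model never sees the irrelevant backup-job note: the harness filtered it out before the prompt was built.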

The 6x Number

Here's where it gets interesting. A recent paper out of Stanford and MIT called the Meta-Harness studied exactly this question. They took the same underlying model and wrapped it in different harnesses. Same weights, same training, same everything under the hood. The result? Different harnesses produced up to a 6x performance gap on the same benchmark.

Let that sink in for a second. You could have the most capable model in the world, and if the scaffolding around it is mediocre, you're leaving 80%+ of its potential on the table.

The researchers also found that their optimized harness beat Claude Code on TerminalBench-2 and improved over state-of-the-art context management by 7.7 points while using 4x fewer tokens. Less context, better results. That's a fundamentally different outcome from the same model.

So Why Can't You Just Build One?

You can. Technically. The same way you can build your own SIEM or your own EDR. But would you really take on a project of that magnitude, knowing you risk missing critical security events, logs, or detections? AI SOC tools operate in exactly that realm. Still, nobody's stopping you.

Here's what happens in practice. You start with a basic prompt chain. It works okay for simple stuff. Then you hit a real investigation with six log sources, lateral movement across three hosts, and an attacker who's been living off the land for two weeks.

Your weekend harness starts hallucinating connections that aren't there because it lost track of what it already analyzed. It starts re-reading the same logs because it has no memory of what it already processed. It makes a tool call to your SIEM that returns 10,000 results and tries to stuff all of them into context because it doesn't know which ones matter yet.

The model isn't the problem. The model is plenty smart. The problem is that nobody told it what to pay attention to, what to remember, and what to ignore. That's the harness's job.
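Those three jobs, pay attention, remember, ignore, can be sketched in a few lines. This is a hypothetical structure I'm inventing for illustration (the fingerprinting, the substring-style host filter, the hard context limit), not anyone's production design, but it shows how cheaply the weekend-harness failure modes above can be prevented in principle:

```python
import hashlib

class InvestigationState:
    """Sketch of two harness duties: don't re-read logs, and don't
    dump a 10,000-row SIEM result straight into context.
    Hypothetical structure, not a real product API."""

    def __init__(self, context_limit: int = 50):
        self.analyzed: set = set()        # fingerprints of processed records
        self.context_limit = context_limit

    def _fingerprint(self, record: str) -> str:
        return hashlib.sha256(record.encode()).hexdigest()

    def select(self, siem_results: list, suspect_hosts: set) -> list:
        selected = []
        for record in siem_results:
            fp = self._fingerprint(record)
            if fp in self.analyzed:
                continue                  # already processed: don't re-read
            host = record.split()[1]      # assumes "verb host detail" records
            if host not in suspect_hosts:
                continue                  # not relevant to current leads: ignore
            self.analyzed.add(fp)
            selected.append(record)
            if len(selected) >= self.context_limit:
                break                     # respect the context budget
        return selected

state = InvestigationState(context_limit=2)
results = [f"login host-{n} ok" for n in range(10_000)]
batch = state.select(results, suspect_hosts={"host-42", "host-7"})
again = state.select(results, suspect_hosts={"host-42", "host-7"})
```

The 10,000-row query collapses to the two records tied to suspect hosts, and re-running the same query returns nothing new, because the state tracks what was already analyzed. The hard part isn't writing this loop; it's knowing what "relevant" means at each stage of a real investigation.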

Context Management Is the Whole Game

Having spent years analyzing SOC incidents, I know firsthand how many systems you end up bouncing between during a single investigation. The context management problem is brutal for humans. It's even more brutal for AI models that don't have a structured way to track what they know.

A real harness needs to do several things well:

  • Maintain a working knowledge graph of entities and relationships that updates as the investigation progresses
  • Decide what context to surface at each reasoning step instead of dumping everything in and hoping for the best
  • Track what the model has already analyzed so it doesn't retread ground
  • Do all of this while staying within token budgets, because throwing more context at the problem actually makes things worse past a certain point
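The first and last bullets, an entity graph plus a token budget, might look something like this sketch. The relation names, the word-count stand-in for a tokenizer, and the greedy budget cutoff are all assumptions for illustration:

```python
from collections import defaultdict

class KnowledgeGraph:
    """Toy entity-relationship graph with a token budget for context
    assembly. Word count stands in for a real tokenizer."""

    def __init__(self):
        self.edges = defaultdict(list)    # entity -> [(relation, entity)]

    def relate(self, src: str, relation: str, dst: str) -> None:
        self.edges[src].append((relation, dst))

    def context_for(self, entity: str, token_budget: int) -> list:
        """Surface facts about one entity, stopping at the budget."""
        facts, used = [], 0
        for relation, dst in self.edges[entity]:
            fact = f"{entity} {relation} {dst}"
            cost = len(fact.split())      # crude token estimate
            if used + cost > token_budget:
                break                     # budget hit: stop surfacing facts
            facts.append(fact)
            used += cost
        return facts

kg = KnowledgeGraph()
kg.relate("host-17", "logged_in_from", "host-9")
kg.relate("host-17", "spawned", "powershell.exe")
kg.relate("host-17", "connected_to", "203.0.113.5")
ctx = kg.context_for("host-17", token_budget=6)
```

With a budget of six "tokens," only the first two facts surface, and the graph updates as new relationships are discovered. A real version also has to decide which facts matter most, which is exactly where hundreds of investigations' worth of judgment comes in.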

The Stanford/MIT paper demonstrated that last point: 4x fewer tokens, better results. You don't accidentally stumble into that architecture on a weekend.

It takes hundreds of investigations' worth of pattern recognition baked into the orchestration logic. It takes understanding which log sources matter at which stage of an investigation. It takes knowing that when you see a certain type of lateral movement, you need to pull a specific set of context about the target host before the model can reason about what happened next.

The Build vs. Buy Trap

The "I'll just build it" instinct is strong in security teams, and for good reason. We've been burned by vendors who overpromise. We like to control our own tooling. I get it — I'm the same way. I've lived that life at one of the nation's largest retailers, one famous for its homegrown solutions.

But the harness isn't a script you write once and forget about. It's a living system that needs to evolve as attack patterns change, as your environment changes, as the models themselves get updated.

Every model update can shift how the harness needs to manage context. Every new log source needs to be integrated into the knowledge graph with the right entity relationships. Every new attack technique needs to be reflected in how the orchestration layer prioritizes evidence.

That's not a side project. That's a product.

What I Think Actually Matters

The Stanford/MIT research validates what a lot of us have been saying from the trenches: the model is table stakes. Everyone has access to the same frontier models. The differentiation is in the harness — the context management, the knowledge graph, the orchestration logic that turns a general-purpose reasoning engine into a specialist that can actually do the job.

When that enterprise IR leader asked me why they couldn't just vibe code their own, I told them the truth. You could build a harness that works for a demo. But the gap between a demo harness and a production harness that handles real investigations reliably is the same gap that Stanford measured. It's 6x. And in security, that gap is the difference between catching the intrusion and writing the breach report.

We are all on the same team here. The attackers aren't waiting for us to figure out the optimal scaffolding. Build or buy, just make sure you're not bringing a weekend project to a real fight.


Sources

  1. Jamin Ball, Clouded Judgement 4.10.26: Long Live the Harness
  2. Stanford/MIT, Meta-Harness: Optimizing LLM Scaffolding for Agentic Tasks
  3. Epsilla, The Self-Assembling Agent: Why Stanford's Meta-Harness Changes Enterprise Orchestration
