Vitoria Lima

Everything should be a closed feedback loop

context is the new frontier, and it evolves.

somewhere in SF (back in San Francisco, writing this between long walks and longer rabbit holes; if you build context systems or think about feedback loops, please reach out), April 16 2026

The new frontier of software products is not a UI (the visual surface: buttons, menus, forms, the thing a human clicks through to get something done) plus a BE (the backend: server-side logic, databases, and APIs, traditionally the other half of the FE/BE stack). It is agent orchestration. But your agents are only as good as the context you give them. So the actual new top products have context orchestration.


Context is not static

Context evolves. I might see a friend four times over a year. Every time we meet I say "let me give you some context," and that context is not static: it is dynamic. It should evolve over time. And in software, that update should not be manual. Context is only as good as the feedback loop that feeds back into it and updates it. Everything is a feedback loop.


Evals are alpha

I have been working recently on contexere, a system that treats every agent failure as a context extraction event. The SME (subject matter expert: the nurse, the paralegal, the CPA, the senior engineer, the person whose tacit knowledge is the difference between an agent that is technically correct and one that is actually right) who reviews an output does not just click "wrong"; they explain why, and that explanation flows back into the system prompt and the evals (evaluations: systematic tests that measure whether an AI system produces the right outputs, historically treated as pass/fail labels, here treated as the highest-signal event for extracting tacit domain knowledge) automatically. I deeply believe that prompt and context engineering should not be done manually by an engineer. It should be self-healing based on the evals. Evals are alpha, and need to be treated as such. It is not only about things failing, but why.
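The shape of that loop is simple to sketch. contexere itself is not public, so everything below is hypothetical: the `ContextStore` class, its method names, and the clinical example are all invented for illustration. The point is only the contract: a failure plus an SME explanation becomes both a durable system-prompt rule and a regression eval, with no engineer in the middle.

```python
from dataclasses import dataclass, field

@dataclass
class ContextStore:
    """Hypothetical sketch: rules and eval cases that grow from SME reviews."""
    rules: list[str] = field(default_factory=list)
    eval_cases: list[dict] = field(default_factory=list)

    def record_failure(self, task: str, bad_output: str, sme_explanation: str) -> None:
        # The SME's "why it was wrong" becomes a durable rule in the system prompt...
        self.rules.append(sme_explanation)
        # ...and the failing case becomes a regression eval, so the same
        # mistake is caught automatically next time.
        self.eval_cases.append({"input": task, "rejected": bad_output,
                                "reason": sme_explanation})

    def system_prompt(self, base: str) -> str:
        # Every review so far is compiled back into the live prompt.
        return base + "\n\nDomain rules learned from review:\n" + \
            "\n".join(f"- {r}" for r in self.rules)

store = ContextStore()
store.record_failure(
    task="Summarize this discharge note",
    bad_output="Patient may resume all activity.",
    sme_explanation="Never drop post-op mobility restrictions from a discharge summary.",
)
print(store.system_prompt("You are a clinical summarizer."))
```

The explanation, not the thumbs-down, is the payload: a scalar "wrong" label would leave the rule list empty.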

See the GEPA paper (Genetic-Pareto prompt optimization, arxiv.org/abs/2507.19457), which evolves prompts by reflecting on natural-language traces of what worked and what failed rather than by hand-tuning, and Yoonho Lee's meta-harness paper (yoonholee.com/meta-harness), an agent that optimizes its own scaffolding from traces: the harness writes the harness, and the loop closes on itself. Both follow the same train of thought: this should all become a feedback loop.

GEPA: prompts that rewrite themselves

GEPA (Genetic-Pareto) treats prompt optimization as evolution driven by reflection. Instead of gradient descent on a reward, it runs a prompt on a few cases, reads the full execution trace in natural language (what tools were called, what failed, what the answer looked like), and reflects on why. That reflection becomes the mutation: an LLM rewrites the prompt to patch the specific weakness the trace revealed. Keep the winners, cull the losers, repeat. In the paper, this beats RL-style approaches with an order of magnitude fewer rollouts, because each rollout carries high-signal prose, not just a scalar.
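The run, reflect, mutate, cull loop can be sketched in a few lines. This is not GEPA's implementation: the executor and the reflection step are stubs (in GEPA the trace is full natural-language execution detail and an LLM does the rewriting), and `CASES`, `run`, and `reflect_and_mutate` are invented names. It only shows the control flow the paper describes.

```python
CASES = [
    {"q": "refund policy?", "needs": ["cite", "policy"]},
    {"q": "cancel order",   "needs": ["cite", "apologize"]},
]

def run(prompt: str, case: dict) -> tuple[int, list[str]]:
    """Stub executor. The 'trace' here is just the list of behaviors the
    prompt failed to cover; in GEPA it is the full execution log in prose."""
    missing = [k for k in case["needs"] if k not in prompt.lower()]
    return (0 if missing else 1), missing

def reflect_and_mutate(prompt: str, traces: list[list[str]]) -> str:
    """Stub for the reflection step: in GEPA an LLM reads the traces and
    rewrites the prompt to patch exactly the weakness they reveal."""
    missing = sorted({m for t in traces for m in t})
    if not missing:
        return prompt
    return prompt + " You must always: " + ", ".join(missing) + "."

def gepa_sketch(seed_prompt: str, generations: int = 3) -> str:
    score = lambda p: sum(run(p, c)[0] for c in CASES)
    pool = [seed_prompt]                          # candidate prompts
    for _ in range(generations):
        parent = max(pool, key=score)
        traces = [run(parent, c)[1] for c in CASES]
        child = reflect_and_mutate(parent, traces)  # reflection IS the mutation
        pool.append(child)
        pool.sort(key=score, reverse=True)        # keep the winners...
        pool = pool[:4]                           # ...cull the losers
    return pool[0]

best = gepa_sketch("Answer customer questions.")
```

One reflective mutation is enough here: the seed prompt scores 0/2, the child patches the missing behaviors, and the pool keeps it.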

The shift: a trace is a piece of writing, not a label. And writing is how humans already debug prompts. GEPA just automates the loop. The authors put it bluntly: GEPA learns why something worked, not just what. That is the entire difference between an RL method (reinforcement learning: in this context, methods like GRPO or PPO that run the prompt thousands of times, collect scalar rewards, and nudge toward higher reward, throwing away everything except the reward number) and one that reads the trace.

[interactive figure: a GEPA prompt evolving from its own traces, generation by generation, each step showing trace + reflection]
A few numbers from the paper that made me stop and re-read. GEPA matches or beats GRPO (Group Relative Policy Optimization, a reinforcement learning variant for fine-tuning LLMs on verifiable rewards, and the main RL baseline in the paper) with up to 35× fewer rollouts, and produces prompts up to 9× shorter than those from MIPROv2 (the prior state-of-the-art prompt optimizer, which uses Bayesian search over instruction and few-shot demonstration variants). Reflection tends to remove noise, not add it. The loop learns what can be deleted.

And crucially, GEPA does not collapse to one winning prompt. It maintains a Pareto frontier: a live pool of candidate prompts trading off on different dimensions (accuracy on easy cases, accuracy on hard cases, length, cost). New mutations are sampled from across the frontier, so the search keeps multiple strategies alive instead of committing early to a local optimum. Real-world ripple: the authors applied GEPA to CUDA kernel generation and drove vector utilization from 4% to over 30%, sevenfold, just by letting the LLM reflect on compiler errors and rewrite its own prompting. No retraining. No fine-tuning.
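Pareto-frontier maintenance is mechanical enough to show directly. This is a generic sketch of the idea, not GEPA's code: the function names and the score vectors (accuracy on easy cases, accuracy on hard cases, negated length so that higher is uniformly better) are illustrative.

```python
def dominated(a: tuple, b: tuple) -> bool:
    """True if score vector a is dominated by b: b is at least as good on
    every dimension and strictly better on at least one. Higher is better."""
    return all(bi >= ai for ai, bi in zip(a, b)) and \
           any(bi > ai for ai, bi in zip(a, b))

def pareto_frontier(candidates: dict[str, tuple]) -> list[str]:
    """Keep every prompt whose score vector no other prompt dominates."""
    return [p for p, s in candidates.items()
            if not any(dominated(s, t) for q, t in candidates.items() if q != p)]

# score vectors: (accuracy on easy cases, accuracy on hard cases, -length)
candidates = {
    "terse":    (0.80, 0.40, -120),
    "detailed": (0.85, 0.70, -900),
    "balanced": (0.84, 0.65, -300),
    "bloated":  (0.80, 0.60, -1500),  # worse than "detailed" on every axis
}
print(pareto_frontier(candidates))    # "bloated" is culled; the rest survive
```

Three different strategies survive because each wins on some axis; only the strictly worse candidate dies. Sampling new mutations from across this surviving pool is what keeps the search from committing early to one local optimum.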

benchmark accuracy
GEPA vs RL (GRPO) vs prior prompt optimizer (MIPROv2). higher is better.

benchmark                               GEPA   MIPROv2   GRPO (RL)   baseline
HotpotQA (multi-hop QA)                 62.3   55.3      43.3        42.3
HoVer (fact verification)               52.3   47.3      38.6        35.3
IFBench (instruction following)         38.6   36.2      35.8        36.9
PUPA (privacy-preserving delegation)    91.8   81.6      86.7        80.8

meta-harness: the harness writes the harness

The meta-harness goes one level higher. Instead of only optimizing the prompt, it optimizes the entire scaffolding the agent runs inside: system prompts, tool definitions, completion-checking logic, and context management. An observer agent watches execution traces of the real agent, spots where the harness itself failed (wrong tool offered, no guardrail, bad control flow), and proposes edits. Over cycles, the harness that was handwritten by an engineer on day one becomes one written by the system, informed by production. This is the same idea as contexere, one level up: not just "what is the system prompt missing," but "what is the whole operating environment missing."

The proposer isn't a bespoke algorithm. It is Claude Code (Anthropic's agentic coding CLI) with grep and cat, browsing a filesystem of past runs the same way a human engineer would: search the logs, read the source, spot the pattern. No special retrieval layer, no embeddings, no vector DB. The loop is three steps: propose (agent reads traces + prior harnesses, drafts a new one), evaluate (run on held-out tasks, score it), archive (append source + traces + score to the filesystem). The agent that writes harnesses is itself an agent running inside a harness. Turtles all the way down.
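The propose, evaluate, archive cycle can be sketched as plain filesystem plumbing. Everything here is a stub: `propose` stands in for Claude Code grepping prior attempts, `evaluate` stands in for real held-out task runs, and the directory layout is invented. What the sketch preserves is the contract that matters: each new proposal can see everything archived before it.

```python
import json
import tempfile
from pathlib import Path

def propose(archive: Path) -> str:
    """Step 1 (stub): in meta-harness this is Claude Code reading prior
    harness source and execution traces off disk and drafting a candidate."""
    n = len(sorted(archive.glob("attempt_*")))
    return f"SYSTEM_PROMPT = 'harness revision {n}'"

def evaluate(harness_src: str) -> tuple[float, str]:
    """Step 2 (stub): run the candidate on held-out tasks and score it.
    A real run emits command logs, errors, timeouts: the raw trace."""
    revision = int(harness_src.rsplit(" ", 1)[-1].strip("'"))
    return min(1.0, 0.2 * revision), f"trace: revision {revision} ran held-out tasks"

def archive_result(archive: Path, src: str, score: float, trace: str) -> None:
    """Step 3: append source + trace + score to the filesystem so the next
    propose() call can read everything that came before."""
    n = len(sorted(archive.glob("attempt_*")))
    d = archive / f"attempt_{n:03d}"
    d.mkdir(parents=True)
    (d / "harness.py").write_text(src)
    (d / "result.json").write_text(json.dumps({"score": score, "trace": trace}))

runs = Path(tempfile.mkdtemp())
for _ in range(3):                     # propose -> evaluate -> archive, repeated
    src = propose(runs)
    score, trace = evaluate(src)
    archive_result(runs, src, score, trace)
print(sorted(p.name for p in runs.glob("attempt_*")))
```

Nothing is ever summarized away: the archive only grows, which is exactly what makes the 10-million-token diagnostic context in the next section possible.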

[interactive figure: the meta-harness cycle, trace to reflect to update to run, iterating on the harness config]

The leverage here is diagnostic context. Most prior optimizers (GEPA included) compress history into a short summary, a scalar score, or a sliding window of the last few candidates. That works for small problems, but harness engineering produces failures that are hard to diagnose without seeing the raw execution trace. Meta-harness goes the other way: each proposal step gets access to up to 10 million tokens of raw evidence: command logs, error messages, timeouts, prior scores, the full source of every harness that was tried. That is roughly 1,000× more evidence per iteration than any prior method, and the optimizer matches the next-best method's final accuracy in just 4 iterations.

diagnostic context per iteration
how much raw evidence the optimizer sees before each proposal (log scale; a gap of roughly 1,000×).

Self-Refine      0.001 Mtok
OPRO             0.002 Mtok
MIPRO            0.003 Mtok
GEPA             0.008 Mtok
TextGrad         0.015 Mtok
AlphaEvolve      0.022 Mtok
TTT-Discover     0.026 Mtok
Meta-Harness    10.0   Mtok

When an agent fails, it is not because the model is not smart enough. It is because it is missing something: a rule, a constraint, a piece of domain knowledge that was never provided.


But feedback loops are bigger than system prompts

I think the feedback loop is more than just a cool concept for your system prompts. It should be a new product philosophy. Features are now a commodity. Anything can be built. But what do people actually want, and how can you be tapped in enough to know that, source it, and feed it back into your work?


Anthropic is hiring a meta-PM

Anthropic is hiring right now for a role whose job is not to come up with features. The job is to create a system that generates the features to be built automatically. A meta-PM. Because features are a commodity now; knowing which ones to build is not.

The key line, for me: "You treat feedback loops as a product. You are obsessed with making it effortless for the field to share what they're hearing and for product teams to know what matters most. You build AI-enabled systems that do the first pass so humans can focus on judgment, not triage. You think like a product manager, not a process administrator."

A signal for the market

If Anthropic is treating their product as a feedback loop, it is a signal for the market, and for other companies to do the same. Decision making shouldn't rest on a single PM, a single point of failure, but on all of the alpha across the org: sales, customer success, customer support, online reviews, X posts, LinkedIn posts on the product, and so on. A single PM prioritizing a roadmap from their own head is the same failure mode as a single engineer hand-editing a system prompt. The knowledge that actually determines what should be built lives in all the places where the product meets the world, and most of that knowledge stays tacit, because there is no system to catch it.


Context is the new limit

All of this to say: context orchestration, and feedback loops for it, are the new frontier for software.

The sky is not the limit. Context is, or the lack of it.


If you are thinking about context orchestration, self-healing prompts, or how to turn every expert in your org into a context source, I would love to hear from you.

vitoria@vitorialima.com