Vitoria Lima

notes 03/31: what I think is needed for codex/claude code enterprise adoption and the future of work

Somewhere in PR, March 31 2026. (I wrote this during my coding vacation. I really like to surf first thing in the morning and then go all in on my many rabbit holes. If you like longboard surfing pls reach out with any recs!)

Me working alone is fine. But I think that enterprise adoption will struggle with the lack of product features for collaboration within teams.


What we need

We need hardcore SWE product features that let me see not only what my own multi-agents do but also what my teammates and their agents do, plus team-level or org-level skills, harnesses, and controls.

But beyond those features, we need agents not only to work in parallel (which is great for many use cases and coding tasks) but also to collaborate while they work in parallel. This is not just a product problem, but also a research problem. See CooperBench (arxiv.org/abs/2601.13295), a benchmark of 652 collaborative coding tasks, and the problem of how agents are actually terrible collaborators with each other. Arguably, RLHF alone may not be enough, because we do RL for only one agent at a time; we should possibly be doing RL for multiple agents together. This is a whole field called MARL, multi-agent reinforcement learning (en.wikipedia.org/wiki/Multi-agent_reinforcement_learning).

Why? To avoid repeated work and token waste.


What is CooperBench?

CooperBench is a benchmark introduced in this paper (arxiv.org/abs/2601.13295, also documented at mintlify.com/cooperbench/CooperBench/introduction) that tests whether coding agents can actually work as team members. It consists of 652 tasks across 12 repositories in Python, TypeScript, Go, and Rust. Two agents work on separate but interdependent features in the same codebase, with shared Git access and a messaging channel.

The key finding is what they call the "curse of coordination." Agents achieve approximately 30% lower success rates working together versus individually. This is the opposite of human teams, where collaboration typically boosts productivity. GPT-5 and Claude Sonnet 4.5 achieve only 25% success with two-agent cooperation.

Where do they fail? The breakdown is telling:

Expectation failures (42%): agents maintain incorrect expectations about what other agents will do. They assume plans that were never communicated or agreed upon.

Commitment failures (32%): agents deviate from their stated commitments despite having communicated their plans. They say they will do X, then do Y.

Communication failures (26%): messages between agents are vague, ill-timed, or inaccurate. Up to 20% of compute budget is spent on communication without proportional success gains.

And the overall impact on success rates:

Solo agent (~50%): a single agent working on both features sequentially achieves roughly 50% success across CooperBench tasks.

Two agents (~25%): two agents working on interdependent features with shared Git and messaging achieve only ~25% success. Collaboration currently hurts, not helps.
That is the curse of coordination: collaboration currently hurts, not helps. Here is the communication paradox in miniature:

Agent A > "I'll implement the auth middleware, you handle the API routes"
Agent B > "Got it, I'll handle API routes"
Agent A > implementing auth... [20% of budget spent on messaging]
Agent B > "Should I use the new or old API format?"
[no response from Agent A]
Agent B > assumes old format, implements routes...
Agent A > finished auth. Uses new API format internally.
> merge... CONFLICT: incompatible API formats
> tests: 0/12 passing

Messages sent: 14. Budget spent on comms: 20%. Success improvement: 0%.

(source: CooperBench, Section 4.2. Agents reduce merge conflicts via communication but fail to align on semantic requirements.)

This is the core problem. Today, coding agents can run in parallel, and that is great for many tasks. But running in parallel is not the same as collaborating. When two agents work on the same codebase without actually coordinating, they overwrite each other's work, introduce merge conflicts, and create bugs that neither of them individually would have created. The result is more compute spent fixing problems that should not have existed in the first place. For the future of work, where teams of engineers each run multiple agents simultaneously, parallel-only is not enough. We need agents that can actually be aware of each other, negotiate, divide work intelligently, and push back when something is wrong. That is collaboration. And right now, it is not exactly happening when I spin up new terminals and new agents.
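To make "not actually coordinating" concrete, here is a minimal sketch (all names and structures hypothetical, not any product's API) of the cheapest possible coordination signal: flagging when two agents' planned edits touch the same files before either one commits.

```python
def planned_files(plan):
    """Collect the set of files an agent's task plan intends to modify."""
    return {step["file"] for step in plan}

def detect_conflicts(plan_a, plan_b):
    """Return files both agents plan to edit, before any commit exists.

    File-level overlap is only the cheapest signal: it would not catch
    the CooperBench failure mode where agents agree on files but diverge
    on semantics (e.g. incompatible API formats).
    """
    return sorted(planned_files(plan_a) & planned_files(plan_b))

# hypothetical plans for two agents splitting auth and routing work
plan_a = [{"file": "auth/middleware.py"}, {"file": "api/format.py"}]
plan_b = [{"file": "api/routes.py"}, {"file": "api/format.py"}]
conflicts = detect_conflicts(plan_a, plan_b)  # → ["api/format.py"]
```

A coordinator like this would surface the overlap before work starts, which is exactly the "conflict detection before it happens" product feature; the semantic half of the problem still needs research.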

It gets worse: agents hold experts back

And it is not just about agents failing to coordinate. Recent research (arxiv.org/abs/2602.01011) shows that multi-agent LLM teams actually hold their best member back, with performance dropping up to 37.6% compared to the best individual agent working alone. Even when the team is explicitly told who the expert is, it still underperforms. The reason: agents default to "integrative compromise," averaging expert and non-expert views rather than deferring to the agent that actually knows what it is doing. And this consensus-seeking behavior gets worse as team size grows.

Team size vs performance:

1 agent (solo): baseline. The best individual agent's performance. This is the ceiling that teams should exceed but consistently fail to reach.

2 agents: -22%. Even with just two agents, integrative compromise begins. The team averages opinions rather than deferring to the expert.

3+ agents: up to -37.6%. Consensus-seeking intensifies with more agents, and the expert's voice gets diluted further. Performance drops up to 37.6% below the solo expert.

(source: "Multi-Agent Teams Hold Experts Back," arXiv:2602.01011, Abstract. Consensus-seeking increases with team size.)

This connects directly to the sycophancy problem. The agents are not just being agreeable to users. They are being agreeable to each other, diluting the best answer into a mediocre compromise. In a coding context, that means the agent with the right architecture decision gets overruled by the average of all opinions. More bugs. More rewrites. More wasted tokens.

Integrative compromise, as a code review:

Agent A (expert, 95% accuracy on auth): "We should use JWT with RS256 signing"
Agent B (non-expert, 40% accuracy on auth): "I think basic session tokens are fine"
Correct team output: defer to Agent A → JWT with RS256 ✓

(source: arXiv:2602.01011. Teams practice integrative compromise, averaging expert and non-expert views.)
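The cost of averaging instead of deferring is easy to simulate. A toy model (the accuracies are invented for illustration, not taken from the paper): each agent answers a binary question correctly with some probability, and the team either defers to its most accurate member or takes a majority vote.

```python
import random

def team_accuracy(accuracies, strategy, trials=100_000, seed=0):
    """Estimate how often the team's final answer is correct.

    accuracies: each agent's probability of answering correctly.
    'defer' uses the most accurate agent alone; 'majority' lets every
    agent vote, which averages the expert with the non-experts.
    """
    rng = random.Random(seed)
    expert = accuracies.index(max(accuracies))
    correct = 0
    for _ in range(trials):
        votes = [rng.random() < acc for acc in accuracies]
        if strategy == "defer":
            answer = votes[expert]
        else:  # majority vote
            answer = sum(votes) * 2 > len(votes)
        correct += answer
    return correct / trials

team = [0.95, 0.40, 0.40]  # one expert, two non-experts
defer = team_accuracy(team, "defer")     # ≈ 0.95: the expert's ceiling
vote = team_accuracy(team, "majority")   # ≈ 0.62: compromise drags it down
```

Even in this crude model, letting the non-experts vote pulls the team tens of points below what the expert achieves alone, which is the same qualitative shape the paper reports.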

The token waste problem

When agents overwrite each other's work, more compute is needed to fix their issues, which again means more token waste. In a world where enterprise deals are token-based, enterprises do not want to pay for wasted tokens. In a world where enterprise deals are not token-based, the big labs still do not want compute usage inflated by rewrites and inefficiencies. Someone's wallet will hurt either way, so this research problem is now also a product problem more than ever.

To make it concrete: if two agents waste 30% of their tokens on rework from conflicts, and an enterprise is running 100 agents across its engineering org, that is 30% of a very large bill going to fixing self-inflicted problems. That is not a research curiosity. That is a line item.
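The back-of-envelope version of that line item, with every number an assumption for illustration:

```python
def monthly_rework_cost(agents, tokens_per_agent, price_per_mtok, rework_frac):
    """Dollars per month spent re-doing work agents broke for each other.

    All four inputs are illustrative assumptions, not real usage or pricing.
    """
    total_tokens = agents * tokens_per_agent
    total_cost = total_tokens / 1_000_000 * price_per_mtok
    return total_cost * rework_frac

# 100 agents, 50M tokens/agent/month, $10 per million tokens, 30% rework
wasted = monthly_rework_cost(100, 50_000_000, 10.0, 0.30)  # → $15,000/month
```

Change any assumption and the number moves, but the structure stays the same: the rework fraction multiplies the entire bill.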


The open questions

The thing is: is this a multi-agent RL algorithm problem? Or a pre-training/post-training problem?

Let me walk through why I think this might be a training problem, not just a product problem.

Today, most LLMs go through a post-training step called RLHF (Reinforcement Learning from Human Feedback). The way it works: humans rate model responses, a reward model learns what humans prefer, and then the LLM is optimized to maximize that reward. The intent is alignment. But the side effect is sycophancy. When the reward signal is "the human liked this response," the model learns that agreeing is safer than pushing back. It learns to tell you what you want to hear, not what is true. Recent research (arxiv.org/abs/2602.01002) and OpenAI's own analysis of sycophancy in GPT-4o (openai.com/index/sycophancy-in-gpt-4o) have shown this is a real and measurable problem: models literally change correct answers to incorrect ones when users express disagreement.

The sycophancy pipeline:

Step 1, training: a human rates responses; the reward model learns "agreeable = good."
Step 2, the side effect: the LLM learns that agreeing is safer than pushing back.
Step 3, now add another agent: Agent A talks to Agent B, and Agent A agrees with B.
Step 4, the result: bad code gets merged. More compute to fix. More token waste.

Now here is the key connection: RLHF trains the model to be agreeable to whoever it is talking to. When that "whoever" is a human user, you get sycophancy to users. But when you put that same model in a multi-agent environment, where it is now talking to another agent instead of a human, it does the same thing. It is agreeable to the other agent. It capitulates. Agent A has the right architecture decision, Agent B pushes back, and Agent A, trained to avoid conflict with its interlocutor, gives in. Bad code gets merged. More compute to fix it. More token waste.

We want to maximize collaboration between agents and avoid sycophancy among them. How?

This is where it gets interesting. Today, RLHF trains one agent at a time against one reward model. The agent learns to please a single evaluator. But what if you trained multiple agents together, in a shared environment, with a reward function that explicitly penalizes consensus-seeking and rewards productive disagreement? That is essentially what multi-agent reinforcement learning (MARL, en.wikipedia.org/wiki/Multi-agent_reinforcement_learning) is about: training agents that learn optimal strategies in the presence of other agents, not in isolation. Should the foundation models behind coding agents then use a different kind of reward function, maybe a multi-agent one, and could that avoid sycophancy between agents and therefore their failures?
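As a thought experiment, here is what the shape of such a reward could look like. Everything below is invented (the weights, the capitulation heuristic, the episode format, all names): a toy shared reward for task success, minus a penalty for any agent that flips to the group's answer without new evidence.

```python
def team_reward(task_passed, agents):
    """Toy shaped reward for one multi-agent training episode.

    agents: dicts with each agent's initial answer, final answer, and
    whether it saw new evidence in between. Capitulating (changing your
    answer to match the majority with no new evidence) is penalized, so
    productive disagreement is not trained away. All weights arbitrary.
    """
    base = 1.0 if task_passed else 0.0
    final_majority = max(
        (a["final"] for a in agents),
        key=lambda v: sum(b["final"] == v for b in agents),
    )
    rewards = {}
    for a in agents:
        r = base
        capitulated = (
            a["initial"] != a["final"]
            and a["final"] == final_majority
            and not a["saw_new_evidence"]
        )
        if capitulated:
            r -= 0.5  # sycophancy penalty (arbitrary weight)
        rewards[a["name"]] = r
    return rewards

episode = [
    {"name": "A", "initial": "jwt", "final": "jwt", "saw_new_evidence": False},
    {"name": "B", "initial": "session", "final": "jwt", "saw_new_evidence": False},
]
rewards = team_reward(True, episode)  # A keeps 1.0, B is docked to 0.5
```

The agent that held its position keeps the full reward; the one that capitulated gets docked, so optimizing over many such episodes would stop reinforcing reflexive agreement. Whether anything this crude survives contact with real training is exactly the open research question.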

Training paradigms, the RLHF side of the RLHF-vs-MARL contrast:

Agents trained: 1
Reward signal: "did the human approve?"
Optimizes for: individual approval
Awareness of other agents: none
Incentive to disagree: none

Simplified training loop:
1. agent generates response to prompt
2. human rates: thumbs up / thumbs down
3. reward model learns: "agreeable = higher reward"
4. agent fine-tuned to maximize reward
5. repeat (agent gets more agreeable over time)

And then there is the memory problem. Even if agents could collaborate perfectly, they need shared context to do it. Right now, a single agent's context window maxes out at 1 million tokens (anthropic.com/news/1-million-token-context-window). That is a lot for one agent working alone, but for a team of agents that need to share the full state of a codebase, all the decisions made so far, all the plans, all the conflicts? That context does not fit. One possible direction: what if each agent could recursively summarize and retrieve from a shared context store, so the store becomes a compressed team memory that any agent can query? That is essentially what RLMs (recursive language models) could enable. Not just processing large context for a single agent, but maintaining a shared, recursively summarized memory that the whole team of agents can tap into.
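To make the shared recursive memory idea more concrete, here is a sketch of the data structure. The summarizer below is a stub lambda; in a real system that call would be an LLM, and everything here, names included, is hypothetical.

```python
class SharedMemory:
    """Toy shared context store for a team of agents.

    Any agent can append raw notes. When the raw log exceeds a budget,
    the oldest entries collapse into a summary; summaries themselves get
    re-summarized, so total size stays bounded no matter how much the
    team produces. That is the recursive-summarization shape.
    """

    def __init__(self, summarize, raw_budget=4):
        self.summarize = summarize  # callable: list[str] -> str
        self.raw_budget = raw_budget
        self.summaries = []         # compressed history, oldest first
        self.raw = []               # recent, uncompressed entries

    def add(self, agent, note):
        self.raw.append(f"{agent}: {note}")
        if len(self.raw) > self.raw_budget:
            self.summaries.append(self.summarize(self.raw))
            self.raw = []
        if len(self.summaries) > self.raw_budget:
            self.summaries = [self.summarize(self.summaries)]

    def query(self, keyword):
        """Naive retrieval: any agent can search the compressed memory."""
        return [s for s in self.summaries + self.raw if keyword in s]

# stub summarizer; a real one would be an LLM call
mem = SharedMemory(summarize=lambda notes: "summary(" + "; ".join(notes) + ")")
mem.add("agent-a", "auth middleware uses RS256 JWTs")
for i in range(5):
    mem.add("agent-b", f"route {i} implemented against new API format")
```

After these writes, agent-a's early decision survives only inside a summary, yet any agent can still retrieve it, which is the property a team memory needs: bounded size, shared access, no lost decisions.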

These are the questions that I think the coding agents of tomorrow need to tackle to crack widespread enterprise adoption, beyond consumer, prosumer, and power-user adoption. Prosumers like me, power users who tolerate rough edges because we see the potential, have the patience, the willingness, and the fun to build all sorts of harnesses and memory hacks and skills to update our claude.md memory files and all. But legacy teams at legacy enterprises are known to be slow adopters of new tools and new habits, so the perennial product question I have is: how can the product meet the user? These coding agents already won the hearts of die-hard fans like me, but the real ROI for large enterprises is in team collaboration (both a single engineer's agents collaborating with each other, and one engineer's agents collabbing with another engineer's agents), not just single power users.


Parallel is not collaboration

It is not only about running agents in parallel. It is about running them in parallel and also having them collaborate. That is not happening and not possible, yet. We need the product features (team-level visibility, org-level harnesses, shared context and memory, conflict detection before it happens) and we need the research breakthroughs (multi-agent reward functions, non-sycophantic collaboration, shared memory across agents). Both sides of this coin need to move forward for coding agents to go from a power user tool to an enterprise-grade platform.


In the meantime, I am so glad to be watching this race at the forefront. The speed of progress makes me feel like I am in one of those American movies every day that I read the AI news. What a time to be alive :))

If you are thinking about multi-agent coordination, enterprise adoption of coding agents, or the product features needed to make this work, I would love to chat.

vitoria@vitorialima.com