Benchmarks for the 300,000 BCE Agents

Humans: 100%. Every frontier AI model: below 1%. 

That’s the viral stat from ARC-AGI-3, the benchmark François Chollet and his team released a few weeks ago, and it’s been floating around the Internet ever since. The gap is crazy. And yes, I played some of the games; you should too — it was fun. What I want to dig into is what the gap is actually measuring, and why the results are more honest than a lot of the current AGI discussion.

Quick Context

ARC-AGI versions 1 and 2 were static grid puzzles — you looked at an input, you produced an output, and somewhere in that mapping your intelligence was revealed. They were useful, but eventually gameable. Frontier labs figured out they could auto-generate millions of ARC-style puzzles, train on the reasoning traces, and saturate those benchmarks without developing the underlying capability. In fact, François’ own paper talks about this — Gemini 3 was spotted using the ARC-AGI integer-to-color mapping inside its reasoning chain on an evaluation prompt that never mentioned it. The benchmark had leaked into the training data!

ARC-AGI-3 doesn’t feel like a test. It feels like being dropped into a small, ambiguous world. There are no instructions waiting for you. No friendly prompt explaining what the goal is. You just act, observe the outcomes, try again, and figure it out. Every action you take is a small bet, and has a cost. Intelligence, in this setting, is about arriving at an answer by reducing ambiguity at each step. 

Humans, it turns out, are really good at figuring this out — solving 100% of the games in a median of about 7 minutes. Our most capable foundation models, with all their reasoning chains and test-time compute, are stuck below 1%.

Me working through one of the games. © ARC-AGI-3.

Scoring and Implications

The scoring is where it gets really interesting. The metric is RHAE (pronounced “ray”) — Relative Human Action Efficiency. For each level, your score is the ratio of human actions to your actions, squared. If a human solves a level in 10 moves and your agent takes 20, you don’t get a RHAE of 50%; you get 25%. If the agent takes 100 moves, you get a RHAE of 1%.

And the key thing: only environment-affecting actions count; tool calls and reasoning steps don’t. You can think as long as you want between moves; the benchmark doesn’t care. What it cares about is whether your thinking translates into efficient action when you touch the world.
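As a sanity check on the arithmetic, here’s a minimal sketch of the scoring rule as described above (the function name and trace labels are mine, not from the ARC-AGI-3 spec):

```python
def rhae(human_actions: int, agent_actions: int) -> float:
    """Relative Human Action Efficiency for one level:
    the ratio of human actions to agent actions, squared."""
    return (human_actions / agent_actions) ** 2

# The worked examples from the text, against a 10-move human baseline:
print(rhae(10, 20))   # 0.25 -> 25%, not 50%
print(rhae(10, 100))  # 0.01 -> 1%

# Only environment-affecting actions count toward the denominator;
# reasoning steps and tool calls are free. (Hypothetical trace labels.)
trace = ["think", "move", "think", "tool_call", "move", "think", "move"]
env_actions = sum(1 for step in trace if step == "move")
print(env_actions)  # 3 -- only these moves count against your score
```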

Now for the uncomfortable part, at least for a particular strain of AI enthusiasm. 

We have become very good at scaling vertically. Larger models, longer contexts, more parameters. These systems absorb vast amounts of structure from the world and can reproduce it with extraordinary fidelity. Ask them a question that resembles something they’ve seen, and they respond like magic. If you remove the resemblance, the magic fades.

So we started scaling horizontally too. Multi-agent frameworks, agent-to-agent protocols like A2A, tool-calling architectures, MCP servers, AGNTCY — the whole distributed systems reinvention happening on top of foundation models. 

ARC-AGI-3 exposes something uncomfortable about this reinvention. Spawning parallel agents to explore a novel environment just inflates your action count, and the RHAE squared penalty eats you alive. Simply leaning on an ensemble of frontier models won’t save you either — the benchmark blocks external API calls during scoring.
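A quick back-of-the-envelope, assuming (as the text implies) that every spawned agent’s environment actions count toward one shared total: if each of k parallel explorers burns roughly the human budget, the score collapses quadratically in k.

```python
def rhae(human_actions: int, agent_actions: int) -> float:
    return (human_actions / agent_actions) ** 2

human_budget = 10
for k in (1, 2, 4, 8):                 # number of parallel explorer agents
    total = k * human_budget           # each burns ~ the human budget
    print(f"{k} agents -> RHAE {rhae(human_budget, total):.2%}")
# 1 agents -> RHAE 100.00%
# 2 agents -> RHAE 25.00%
# 4 agents -> RHAE 6.25%
# 8 agents -> RHAE 1.56%
```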

Connectivity alone is naive horizontal scaling, and it’s actually worse than doing nothing. Without shared structure and disciplined coordination, multi-agent systems degrade without ever gaining clarity.

What the benchmark rewards is coherence — action that compounds towards progress, whether that coherence comes from one agent or an emergent system.

Area Under the Intelligence Curve

Measured against the historical evolution of human intelligence, we are at Shift 1 capability: individual intelligence, scaled vertically.

Human history had a long Shift 1 phase — roughly 300,000 years where intelligence was essentially a solitary scaling phenomenon. Individuals got smarter. Tools improved, but innovation didn’t accumulate or compound. Breakthroughs disappeared with the breakthrough-er, or stayed siloed in the small groups who witnessed them firsthand. Then something changed around 70,000 years ago — recursive language, the cognitive revolution — and suddenly humans could coordinate intent, share context across time, and compound innovation across generations. The ratchet effect (Tomasello et al.). That’s what gets us from fire to spacecraft.

Human intelligence didn’t wait for individuals to reach some theoretical capability threshold before going collective. While vertical scaling continued, language enabled humans to share intent, context, and innovation across the group, and the group became smarter than any individual could be on their own.

The vertical curve and the horizontal curve compounded. That’s what produced the ratchet effect.

We don’t have to wait either in our journey towards AGI. ARC-AGI-3 tells us where individual silicon intelligence sits on the vertical axis, and the answer is: not yet where we had hoped. Fine, more work needed. That makes it even more urgent to build out the horizontal axis in parallel — the shared memory fabrics, the coordination protocols for meaning and intent, and the reasoning engines with guardrails built in.

We need to grow the area under the intelligence curve from both directions simultaneously.

Multi-agent systems need to scale horizontally, but in a structured way: a network of interacting agents, each specialized with different priors, working with shared memory and state, coordinating efficiently, and reasoning across individuals.
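To make that concrete, here’s one minimal shape such a system could take: a blackboard-style sketch where agents with different priors share a single world model, so nobody repeats an action and every observation compounds. All names here are illustrative, not from any ARC-AGI-3 tooling.

```python
from dataclasses import dataclass, field

@dataclass
class SharedWorldModel:
    """Shared memory and state: everything the collective has learned."""
    observations: dict = field(default_factory=dict)  # action -> outcome

    def knows(self, action) -> bool:
        return action in self.observations

    def record(self, action, outcome) -> None:
        self.observations[action] = outcome

class Agent:
    """One specialist: a name plus a prior (its preferred action ordering)."""
    def __init__(self, name: str, prior: list):
        self.name, self.prior = name, prior

    def propose(self, world: SharedWorldModel):
        # Propose the first action from this agent's prior that the
        # *collective* hasn't tried yet. Shared state is what prevents
        # the duplicated exploration the squared penalty punishes.
        for action in self.prior:
            if not world.knows(action):
                return action
        return None

def explore_step(world: SharedWorldModel, agents: list, environment) -> bool:
    """One environment-affecting action per step, chosen collectively."""
    for agent in agents:
        action = agent.propose(world)
        if action is not None:
            world.record(action, environment(action))
            return True
    return False  # all priors exhausted; nothing new left to try
```

With different priors, the agents partition the search space instead of duplicating it; with shared memory, each environment action only ever gets paid for once.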

Multi-Agent-Human Societies

ARC-AGI-3 hints at the other thing the horizontal axis needs, which is humans. The benchmark isolates a capability (building a novel world model from sparse observation, then acting on it efficiently) where humans still dominate by 100X. That’s a feature in the collective intelligence story.

The multi-agent-human society I believe we are all headed towards assumes humans are part of the loop: not a legacy component being phased out, but one being repurposed for higher-order decisions and thought processes. ARC-AGI-3 is a quantitative argument for why — there are cognitive capabilities where humans are genuinely, measurably better right now, and any collective intelligence architecture that ignores that is weaker for it. Horizontal scaling isn’t just agent-to-agent. It’s agent-to-agent-to-human, with the human doing exactly the kind of work the benchmark says we are good at.
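As a sketch of what agent-to-agent-to-human routing could look like (the novelty score, threshold, and names here are hypothetical, purely illustrative): escalate to the human exactly when the task demands building a world model from sparse, novel observations, and leave routine work with the agents.

```python
def route(task, agent_solve, human_solve, novelty_threshold=0.8):
    """Send novel-world tasks to the human; familiar ones to the agents.

    'novelty' is a hypothetical 0-1 score; in practice you might estimate
    it from how poorly the agents' priors fit the first few observations.
    """
    if task["novelty"] > novelty_threshold:
        return human_solve(task)   # the capability humans still dominate
    return agent_solve(task)       # resemblance territory: agents shine

result = route({"novelty": 0.9, "payload": "unfamiliar grid world"},
               agent_solve=lambda t: "agent answer",
               human_solve=lambda t: "human judgment")
print(result)  # human judgment
```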

We’re still early humans. That’s okay. That’s where you start. The interesting thing about 70,000 years ago is that nobody waited for permission.
