Tiny AI Startup Shocks Google's Gemini 3 in Key Reasoning Test — What You Need to Know

The Rise of a New Contender in AI Reasoning

Since its release, Gemini 3 has consistently ranked at the top of the LMArena leaderboard, a crowdsourced ranking in which thousands of real users compare AI models across various tasks and vote on which response is better. On one of the most challenging reasoning benchmarks, however, a new player claims to have already surpassed Google, and it did so without training its own model.

A six-person startup called Poetiq claims to have taken the top spot on the ARC-AGI-2 semi-private test set, a notoriously difficult reasoning challenge created by AI researcher François Chollet. Their system achieved a score of 54 percent, outperforming what Google previously reported for Gemini 3 Deep Think, which was around 45 percent.

To put this into perspective, most AI models were scoring under 5 percent on this benchmark just six months ago. Cracking 50 percent was widely expected to take years.

The most surprising aspect of this breakthrough is that it wasn’t powered by a new frontier model but by a smarter way of orchestrating existing ones.

How Poetiq Achieved This Breakthrough

Instead of building a massive transformer from scratch, Poetiq developed what it calls a meta-system: essentially an AI controller that supervises, critiques, and improves the outputs of any model plugged into it. For its work on ARC-AGI-2, the team used Gemini 3 Pro as the base model.

Poetiq describes the system as a tight optimization loop:

  • Generate
  • Critique
  • Refine
  • Verify
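
A loop with these four stages can be sketched in a few lines of Python. This is only an illustration of the general pattern, not Poetiq's actual solver: the function names (run_loop, toy_generate, toy_critique, toy_refine, toy_verify) and the toy task (finding an integer square root) are assumptions made up for this example. In a real system, each stage would presumably be a call to an LLM such as Gemini 3 Pro.

```python
from typing import Callable, List, Optional, Tuple


def run_loop(
    task: int,
    generate: Callable[[int], int],
    critique: Callable[[int, int], str],
    refine: Callable[[int, int, str], int],
    verify: Callable[[int, int], bool],
    max_rounds: int = 5,
) -> Tuple[Optional[int], List[str]]:
    """Generate an answer, then critique and refine it until it verifies.

    Returns (answer, log); answer is None if nothing verified in time.
    """
    log: List[str] = []
    answer = generate(task)
    for round_no in range(max_rounds):
        if verify(task, answer):
            return answer, log
        feedback = critique(task, answer)
        log.append(f"round {round_no}: {answer} rejected ({feedback})")
        answer = refine(task, answer, feedback)
    # Self-audit: only return the last candidate if it passes verification.
    return (answer if verify(task, answer) else None), log


# Toy stand-ins for model calls (hypothetical). Task: find x with x * x == target.
def toy_generate(target: int) -> int:
    return max(1, target // 2)  # deliberately rough first guess


def toy_critique(target: int, guess: int) -> str:
    return "too high" if guess * guess > target else "too low"


def toy_refine(target: int, guess: int, feedback: str) -> int:
    if feedback == "too high":
        return (guess + target // guess) // 2  # integer Newton step downward
    return guess + 1  # nudge upward when the guess is too low


def toy_verify(target: int, guess: int) -> bool:
    return guess * guess == target


answer, log = run_loop(49, toy_generate, toy_critique, toy_refine, toy_verify)
print(answer)  # 7
for line in log:
    print(line)
```

Because every stage is passed in as a plain function, swapping the underlying model only means swapping those callables, which is one way to read the "no retraining required" claim. The max_rounds cap is likewise an assumption: some budget is needed so that a task with no verifiable answer returns None instead of looping forever.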

Here’s what makes it stand out:

  • No retraining required:
    The system adapts to new models within hours

  • Built entirely on off-the-shelf LLMs:
    No custom fine-tuning

  • Lower cost:
    Google’s Deep Think reportedly costs ~$77 per task; Poetiq’s system ran closer to $30

  • Open source:
    The solver is public and inspectable

  • Self-auditing:
    The system evaluates its own answers before returning a final result

On the company website, Poetiq’s team says the approach works by squeezing more reasoning power out of existing LLMs — not by scaling brute-force compute.
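
For a rough sense of scale, the figures quoted in this article can be compared directly. The snippet below is a back-of-the-envelope sketch that takes the reported numbers at face value; the variable names are illustrative, and since the costs and the Deep Think score are approximate, the outputs are approximate too.

```python
# Figures as reported in this article (approximate, not independently verified).
deep_think_cost_usd = 77.0   # Gemini 3 Deep Think, per task
poetiq_cost_usd = 30.0       # Poetiq's system, per task
deep_think_score_pct = 45.0  # ARC-AGI-2 score reported by Google
poetiq_score_pct = 54.0      # ARC-AGI-2 semi-private score claimed by Poetiq

# Relative cost reduction per task, in percent.
cost_saving_pct = (1 - poetiq_cost_usd / deep_think_cost_usd) * 100

# Absolute score difference, in percentage points.
score_gap_points = poetiq_score_pct - deep_think_score_pct

print(f"Cost per task is about {cost_saving_pct:.0f}% lower")
print(f"Score is {score_gap_points:.0f} percentage points higher")
```

On these numbers, the reported result is about 9 percentage points higher at roughly 61 percent lower cost per task.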

Why ARC-AGI-2 Matters

While most benchmarks measure narrow skills like coding or math, ARC-AGI-2 is designed to test something deeper: pattern recognition, analogy, abstract reasoning, and the kind of generalization humans learn in early childhood.

It’s intentionally hard and famously unfriendly to today’s LLMs. Even many frontier models fail spectacularly.

That’s why the leap from single-digit scores to 54 percent in half a year has turned heads. It suggests progress in reasoning methods, not just raw model scale.

However, Poetiq’s result applies specifically to the semi-private test set, which is not fully open to the public. The company site says the result has been verified by the benchmark’s organizers — but independent third-party replication is still pending, which is important for a benchmark this influential.

Perhaps the next breakthrough won't come from bigger models. Poetiq's work highlights a growing trend in AI: progress doesn't always require billion-dollar infrastructure or a huge research lab.

If systems like this generalize beyond benchmarks to planning, coding, research, or real-world decision-making, they could reshape how AI is developed. Instead of waiting for the next breakthrough model, companies might build layered intelligence that makes today's models smarter, cheaper, and more consistent.

The Bottom Line

Poetiq has open-sourced its ARC-AGI solver so researchers can test, extend, or challenge the results. The benchmark has a hidden test set, and history shows results can shift once more people run independent evaluations.

If Poetiq's numbers hold, this could mark a turning point in AI reasoning research. A six-person team may have just shown that orchestrating models can rival, or even beat, training bigger ones. At the very least, the result suggests you don't need a giant lab to win a round.
