It feels like it should be about having no ARC-AGI-3-specific tools, not "no not-built-in-tool"...
Maybe the internet will briefly go back to a place mainly populated with outliers.
If the AI has to control a body to sit on a couch and play this game on a laptop that would be a step in the right direction.
It is a simple game with simple rules, yet automated solvers have an incredibly difficult time with it compared to humans at a certain level. Solutions are easy to validate but hard to find.
Yes, we get that LLMs are really bad when you give them contrived visual puzzles or pseudo games to solve... Well great, we already knew this.
The "hype" around the ARC-AGI benchmarks makes me laugh, especially the idea we would have AGI when ARC-AGI-1 was solved... then we got 2, and now we're on 3.
Shall we start saying that these benchmarks have nothing to do with AGI yet? Are we going to get an ARC-AGI-10 where we have LLMs try and beat Myst or Riven? Will we have AGI then?
This isn't the right tool for measuring "AGI", and honestly I'm not sure what it's measuring except the foundation labs benchmaxxing on it.
- Human baseline is "defined as the second-best first-run human by action count". Your "regular people" are people who signed up for puzzle solving, and the score isn't compared against a human average but against the second-best human solution
- The scoring doesn't tell you how many levels the models completed, but how efficiently they completed them compared to humans. It uses squared efficiency: if a human took 10 steps to solve a level and the model took 100, the model gets a score of 1% ((10/100)^2)
- 100% just means that all levels are solvable. The 1% number uses completely different and extremely skewed scoring based on the 2nd-best human score on each level individually. They said that the typical level is solvable by 6 out of 10 people who took the test, so let's assume the median human solves about 60% of puzzles (I know, not quite right). If the median human takes 1.5x more steps than your 2nd-fastest solver, then the median score is 0.6 * (1/1.5)^2 = 26.7%. Now take the bottom-10% guy, who maybe solves 30% of levels but takes 3x more steps: he'd get a score of 0.3 * (1/3)^2 ≈ 3%
- The scoring is designed so that even if AI performs on a human level it will score below 100%
- No harness at all and very simplistic prompt
- Models can't use more than 5X the steps that a human used
- Notice how they also gave higher weight to later levels? The benchmark was designed to detect the continual learning breakthrough. When it happens in a year or so they will say "LOOK OUR BENCHMARK SHOWED THAT. WE WERE THE ONLY ONES"
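The squared-efficiency scoring described in the bullets above can be sketched in a few lines; this is my reading of the thread, not the official formula:

```python
# Sketch of the squared-efficiency scoring described above
# (my reading of the thread, not the official formula).
def level_score(human_actions, model_actions, solved):
    """Score for one level: zero if unsolved, otherwise squared step
    efficiency against the human baseline (2nd-best action count)."""
    if not solved:
        return 0.0
    return min(1.0, (human_actions / model_actions) ** 2)

# Human baseline 10 actions, model needs 100 -> (10/100)^2 = 1%
print(level_score(10, 100, True))          # ~0.01
# Median-human scenario from above: 60% of levels solved at 1.5x the steps
print(0.6 * level_score(1, 1.5, True))     # ~0.267
```

The squaring is what makes the score collapse so fast: taking 10x the steps of the baseline costs you 99% of the level's score, not 90%.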
Back in the 90's, Scientific American had an article on AI - I believe this was around the time Deep Blue beat Kasparov at chess.
One AI researcher's quote stood out to me:
"It's silly to say airplanes don't fly because they don't flap their wings the way birds do."
He was saying this with regards to the Turing test, but I think the sentiment is equally valid here. Just because a human can do X and the LLM can't doesn't negate the LLM's "intelligence", any more than an LLM doing a task better than a human negates the human's intelligence.
I really wonder why so many people fight against this. We know that AI is useful, we know that AI can aid research, but we want to know whether it is what we vaguely define as intelligent.
I’ve read the "airplanes don’t use wings", or "submarines don’t swim". Yes, but this is not the question. I suggest everyone coming up with these comparisons check their biases, because this is about Artificial General Intelligence.
General is the keyword here; this is what ARC is trying to measure. Whether it’s useful or not isn’t the point. Whether AI after testing is useful or not isn’t the point either.
This so far has been the best test.
And I also recommend asking AI specialized questions deep in your own job, ones you know the answer to, and seeing how often the solution is wrong. I would guess it’s more likely that we perceive knowledge as intelligence than that we notice the missing intelligence. Probably common amongst humans as well.
- Take a person who grew up playing video games. They'll pass these tests 100% without even breaking a sweat.
- BUT, put a grandmother who has never used a computer in front of this game, and she'll most likely fail completely. Just like an LLM.
As soon as models are "natively" trained on a massive dataset of these types of games, they'll easily adapt and start crushing these challenges.
This is not AGI at all.
This measures the ability of a LLM to succeed in a certain class of games. Sure, that could be a valuable metric on how powerful (or even generally powerful) a LLM is.
Humans may or may not be good at the same class of games.
We know there exists a class of games (including most human games like checkers/chess/go) that computers (not LLMs!) already vastly outpace humans.
So the argument for whether a LLM is "AGI" or not should not be whether a LLM does well on any given class of games, but whether that class of games is representative of "AGI" (however you define that.)
It seems unlikely that this set of games yields a definition meaningful for any practical, philosophical or business application?
I don't know if this is how we want to measure AGI.
In general I believe we should probably stop this pursuit of human-equivalent intelligence that encourages people to think of these models as human replacements. LLMs are clearly good at a lot of things; let's focus on how we can augment and empower the existing workforce.
CRAZY 0.1% on average lmao
Edit: Having messed around with it now (and read the .pdf), it seems like they've left behind their original principle of making tests that are easy for humans and hard for machines. I'm still not convinced that a model that's good at these sorts of puzzles is necessarily better at reasoning in the real world, but am open to being convinced otherwise.
TBF, that's basically what the kaggle competition is for. Take whatever they do, plug in a SotA LLM and it should do better than whatever people can do with limited GPUs and open models.
If you are trying to measure GENERAL intelligence then it needs to be general.
If you've played Wordle you might've solved the game in a minute once before as well. And if you've played a bunch then you've perhaps also taken the entire day to solve it.
So why is it that today’s puzzle was so intuitive, but next month’s new puzzle shared here could be impossible? I’d like a more satisfying explanation than luck and the obvious “different things are different” (even though… yeah, different things are different).
That is a nice sentiment but not what the AI companies are out to do; they want your job.
Surprised at the comments here re: not figuring it out. Simple game. Super annoying though lmao.
Without a big jump, we're just going to boil the frog (ourselves).
I still don't quite understand the exact mirroring rules at play.
seriously. lmao. if you aint, I dunno what to say.
So if a model can solve every question but takes 10x as many steps as the second best human it will get a score of 1%.
Given how hard even pure v2 was for modern LLMs, I'm not surprised to see v3 crush them. But that won't last.
There's world state that you can change. Not just placing pixels.
Here's v2:
Once the AIs solve this, there will be another ARC-AGI. And so on until we can't find any more problems that can be solved by humans and not AI. And that's when we'll know we have AGI.
It's a "let's find a task humans are decent at, but modern AIs are still very bad at" kind of adversarial benchmark.
The exact coverage of this one is: spatial reasoning across multiple turns, agentic explore/exploit with rule inference and preplanning. Directly targeted against the current generation of LLMs.
It used to be easy to build these tests. I suspect it’s getting harder and harder.
But if we run out of ideas for tests that are easy for humans but impossible for models, it doesn’t mean none exist. Perhaps that’s when we turn to models to design candidate tests, and have humans be the subjects to try them out ad nauseam until no more are ever uncovered? That sounds like a lovely future…
Anyway, from the article:
> As long as there is a gap between AI and human learning, we do not have AGI.
This seems like a reasonable requirement. Something I think about a lot with vibe coding is that unlike humans, individual models do not get better within a codebase over time, they get worse.
By updating the tests specifically in areas AI has trouble with, it creates a progressive feedback loop against which AI development can be moved forward. There's no known threshold or well defined capability or particular skill that anyone can point to and say "that! That's AGI!". The best we can do right now is a direction. Solving an ARC-AGI test moves the capabilities of that AI some increment closer to the AGI threshold. There's no good indication as to whether solving a particular test means it's 15% closer to AGI or .000015%.
It's more or less a best effort empiricist approach, since we lack a theory of intelligence that provides useful direction (as opposed to a formalization like AIXI which is way too broad to be useful in the context of developing AGI.)
If you mess around a little bit, you will figure it out. There are only a few rules.
Barely any of them break 0% on any of the demo tasks, with Claude Opus 4.6 coming out on top with a few <3% scores, Gemini 3.1 Pro getting two nonzero scores, and the others (GPT-5.4 and Grok 4.20) getting all 0%
We tested ~500 humans over 90 minute sessions in SF, with $115-$140 show up fee (then +$5/game solved). A large fraction of testers were unemployed or under-employed. It's not like we tested Stanford grad students. Many AI benchmarks use experts with Ph.D.s as their baseline -- we hire regular folks as our testers.
Each game was seen by 10 people. They were fully solved (all levels cleared) by 2-8 of them, most of the time 5+. Our human baseline is the second best action count, which is considerably less than an optimal first-play (even the #1 human action count is much less than optimal). It is very achievable, and most people on this board would significantly outperform it.
Try the games yourself if you want to get a sense of the difficulty.
> Models can't use more than 5X the steps that a human used
These aren't "steps" but in-game actions. The model can use as much compute or tools as it wants behind the API. Given that models are scored on efficiency compared to humans, the cutoff makes basically no difference on the final score. The cutoff only exists because these runs are incredibly expensive.
> No harness at all and very simplistic prompt
This is explained in the paper. Quoting: "We see general intelligence as the ability to deal with problems that the system was not specifically designed or trained for. This means that the official leaderboard will seek to discount score increases that come from direct targeting of ARC-AGI-3, to the extent possible."
...
"We know that by injecting a high amount of human instructions into a harness, or even hand-crafting harness configuration choices such as which tools to use, it is possible to artificially increase performance on ARC-AGI-3 (without improving performance on any other domain). The purpose of ARC-AGI-3 is not to measure the amount of human intelligence that went into designing an ARC-AGI-3 specific system, but rather to measure the general intelligence of frontier AI systems.
...
"Therefore, we will focus on reporting the performance of systems that have not been specially prepared for ARC-AGI-3, served behind a general-purpose API (representing developer-aware generalization on a new domain as per (8)). This is similar to looking at the performance of a human test-taker walking into our testing center for the first time, with no prior knowledge of ARC-AGI-3. We know such test takers can indeed solve ARC-AGI-3 environments upon first contact, without prior training, without being briefed on solving strategies, and without using external tools."
If it's AGI, it doesn't need human intervention to adapt to a new task. If a harness is needed, it can make its own. If tools are needed, it can choose to bring out those tools.
> even Alan M. Turing allowed himself to be drawn into the discussion of the question whether computers can think. The question is just as relevant and just as meaningful as the question whether submarines can swim.
(I am of the opinion that the thinking question is in fact a bit more relevant than the swimming one, but I understand where these are coming from.)
Don't read the statement as a human dunk on LLMs, or even as philosophy.
The gap is important because of its special and devastating economic consequences. When the gap becomes truly zero, all human knowledge work is replaceable. From there, with robots, it's a short step to all work being replaceable.
What's worse, the condition is sufficient but not even necessary. Just as planes can fly without flapping, the economy can be destroyed without full AGI.
I really like these puzzles. There’s a lot to them both in design and scoring — models trained to do well on these are going to be genuinely much more useful, so I’m excited about it. As opposed to -1 and -2, to do well at these, you need to be able to do:
- Visual reasoning
- Path planning (and some fairly long paths)
- Mouse/screen interaction
- Color and shape analysis
- cross-context learning/remembering
Probably more, I only did like five or six of these. We really want models that are good at all this; it covers a lot of what current agentic loops are super weak at. So I hope M. Chollet is successful at getting frontier labs to put a billion or so into training for these.
Even with billions of dollars spent on training, we had a situation a few weeks ago where models were suggesting walking instead of driving to a car wash, while a 3-year-old would know the answer. And yet we are designing elaborate tests to 'show whether AGI is here or not', while being fully aware of what these models represent under the hood.
This is an absurd constraint. You could have a vastly superhuman AI that doesn't learn as efficiently as a human and it would not pass this definition while it simultaneously goes on to colonize the galaxy...
"steps" are important to optimize if they have negative externalities.
There’s no “gap that becomes truly zero” at which point special consequences happen. By the time we achieve AGI, the lesser forms of AI will likely have replaced a lot of human knowledge labor through the exact “brute-force” methods Chollet is trying to factor out (which is why many people are saying that doing so is unproductive).
AGI is like an event horizon: It does mean something, it is a point in space, but you don’t notice yourself going through it, the curvature smoothly increases through it.
While I share Dijkstra's sentiment that "thinking machines" is largely a marketing term we've been chasing for decades, and this new cycle is no different, it's still worth discussing and... thinking about. The implications of a machine that can approximate or mimic human thinking are far beyond the implications of a machine that can approximate or mimic swimming. It's frankly disappointing that such a prominent computer scientist and philosopher would be so dismissive and uninterested in this fundamental CS topic.
Also, it's worth contextualizing that quote. It's from a panel discussion in 1983, which was between the two major AI "winters", and during the Expert Systems hype cycle. Dijkstra was clearly frustrated by the false advertising, to which I can certainly relate today, and yet he couldn't have predicted that a few decades later we would have computers that mimic human thinking much more closely and are thus far more capable than Expert Systems ever were. There are still numerous problems to resolve, w.r.t. reliability, brittleness, explainability, etc., but the capability itself has vastly improved. So while we can still criticize modern "AI" companies for false advertising and anthropomorphizing their products just like in the 1980s hype cycle, the technology has clearly improved, which arguably wouldn't have happened if we didn't consider the question of whether machines can "think".
There are very valid reasons to measure that. You wouldn’t ask a plane to drive you to the neighbor’s or to buy you groceries at the supermarket. It’s not generally mobile the way you are, but it increases your mobility.
It also doesn't actually matter much, as ultimately the utility of its outputs is what determines its worth.
There is the moral question of consciousness though, a test which it seems humans will not be able to devise in the near future, which morally leads to a default position that we should assume the AI is conscious until we can prove it's not. But man, people really, really hate that conclusion.
If model creators are willing to teach their LLMs to play computer games through text, it's gonna be solved in one minor bump of the model version. But honestly, I don't think they're gonna bother, because it's just too silly and they won't expect their models to learn anything useful from it.
Especially since there are already models that can learn how to play 8-bit games.
It feels like ARC-AGI jumped the shark. But who knows, maybe people who train models for robots are going to take it in stride.
- open book: you have access to nearly the whole Internet and resources beyond it, e.g. torrents of nearly all books, research papers, etc., including the history of all previous tests, even those similar to this one
- arguably no real time limit, since the work can be parallelized across threads and cached aggressively
- no shame in submitting a very large amount of wrong answers until you get the "right" one
... so I'm not saying it makes it "easy" but I can definitely say it's not the typical way I used to try to pass tests.
So there is a business application, but no practical or philosophical one.
What's going to stop e.g. OpenAI from hiring a bunch of teenagers to play these games non-stop for a month, annotating the games with their logic for deriving the rules, generating a data set from those playthroughs, and fine-tuning the next version of ChatGPT on them?
I met a guy who, for fun, started working on ARC2, and as he got the number to go up in the eval, a novel way to more efficiently move a robotic arm emerged. All that to say: chasing evals per se can have tangible real world benefits.
Talking to the ARC folks tonight, it sounds like there will be an ARC-4,5,6,etc. I mean of course there will be.
But with them will be an increasing expectation that these models can eventually figure things out with zero context, and zero pretraining; you drop a brain into any problem and it'll figure out how to dig its way out.
That's really exciting.
In theory, sure, if I can throw a million monkeys at a problem and stumble into a solution, it doesn't matter how I got there. In practice though, every attempt has a direct and indirect impact on the externalities. You can argue those externalities are minor, but the huge sums of money going to data centers suggest otherwise.
Lastly, humans use way less energy to solve these in fewer steps, so of course it matters when you throw kilowatts at something that takes milliwatts to solve.
"Sample efficient rule inference where AI gets to control the sampling" seems like a good capability to have. Would be useful for science, for example. I'm more concerned by its overreliance on humanlike spatial priors, really.
But the ARC-AGI competitions are cool. Just to see where we stand, and to have some months where the benchmarks aren't fully saturated. And, as someone else noted elsewhere in the thread, some of these games are not exactly trivial, at least until you "get" the meta they're looking for.
LLMs are way past us at language, for instance. Calculators passed us at calculating, etc.
My main criticism would be that it doesn’t seem like this test allows online learning, which is what humans do (over the scale of days to years). So in practice it may still collapse to what you point out, but not because the task is unsuited to showing AGI.
I've been a gamer for just about 40 years. Gaming is my "thing".
I found the challenges fun, but easy. Coming back and reading comments from people struggling with the games, my first thought was - yup definitely not a gamer.
My approach was to poke at the controls to suss the rules, then the actual solutions were really straightforward.
fwiw, I'm pretty dumb generally, but these kinds of puzzles are my jam.
Just to drive that thought further.
What are you suggesting, that we rename it? To me the fundamental question is this:
Do we still have tasks that humans can do better than AIs?
I like the question. I think another good test is "make money". There are humans that can generate money from their laptop. I don’t think AI will be net positive.
I’ve tried to create a Polymarket trading bot with Opus 4.6. The ideas were full of logical fallacies and many many mistakes.
But also I’m not sure how they would compare against an average human with no statistics background..
I think it’s really about establishing whether by AGI we mean better than the average human or better than the best human.
I would imagine if you simply encoded the game in textual format and asked an LLM to come up with a series of moves, it would beat humans.
The problem here is more around perception than anything.
ARC-AGI-3 is an interactive reasoning benchmark which challenges AI agents to explore novel environments, acquire goals on the fly, build adaptable world models, and learn continuously.
A 100% score means AI agents can beat every game as efficiently as humans.
Instead of solving static puzzles, agents must learn from experience inside each environment—perceiving what matters, selecting actions, and adapting their strategy without relying on natural-language instructions.
As long as there is a gap between AI and human learning, we do not have AGI.
ARC-AGI-3 makes that gap measurable by testing intelligence across time, not just final answers—capturing planning horizons, memory compression, and the ability to update beliefs as new evidence appears.
ARC-AGI-3 includes replayable runs, a developer toolkit for agent integration, and a UI designed for transparent evaluation.
Inspect agent behavior through preview replays—track decisions, actions, and reasoning in a structured timeline.
Integrate your agent using the ARC-AGI-3 toolkit, then use the interactive UI to test and iterate.
Everything you need to build agents: environments, API usage, and integration guidance.
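The observe/act loop such an agent integration implies can be sketched generically; note that the real ARC-AGI-3 toolkit API surely differs, and every name below is invented for illustration:

```python
# Hypothetical observe -> act loop for an interactive benchmark.
# The real ARC-AGI-3 toolkit API differs; every name here is invented.
def run_episode(env, agent, max_actions=1000):
    """Run one game: feed frames to the agent until it wins or runs out."""
    frame = env.reset()
    for actions_used in range(1, max_actions + 1):
        frame, done = env.step(agent(frame))
        if done:
            return actions_used  # the action count feeds the efficiency score
    return None  # level not cleared within the budget

# Toy stand-in environment: cleared as soon as the agent outputs action 3.
class ToyEnv:
    def reset(self):
        return "start"
    def step(self, action):
        return "frame", action == 3

print(run_episode(ToyEnv(), lambda frame: 3))  # 1
```

The point of the loop shape is that scoring depends on the action count, not wall-clock time: whatever reasoning the agent does between actions is free.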
Nit: I didn't see a final score of how many actions I took to complete 7 levels. Also didn't see a place to sign in to see the leaderboard (I did see the sign in prompt).
That's not intelligence though, even if it may appear to be. Does it matter? That's another question. But it certainly is not a representation of intelligence.
Kinda crazy that Yudkowsky and all those rationalists and enthusiasts spent over a decade obsessing over this stuff, and we've had almost 80 years of elite academics pondering on it, and none of them could come up with a meaningful, operational theory of intelligence. The best we can do is "closer to AGI" as a measurement, and even then, it's not 100% certain, because a model might have some cheap tricks implicit to the architecture that don't actually map to a meaningful difference in capabilities.
Gotta love the field of AI.
It doesn't prove anything of the sort. ARC-AGI has always been nothing special in that regard, but this one really takes the cake. A 'human baseline' that isn't really a baseline, and scoring so convoluted that a model could beat every game in reasonable time and still score well below 100. Really, what are we doing here?
That Francois had to do all this nonsense should tell you something about where we are right now.
The minimum recommended size for mobile is 44x44
No system can crack these out of the box (like humans can) because we don't have AGI.
The "things that currently make money" definition is interesting. Because those are the things that automation can't currently do: if they could be automated, the price would tend to 0 and nobody could make money at them.
I'm guessing you did not pass the human testers JSON blobs to work with, and suspect they would also score 0% without the eyesight and visual cortex harness to their reasoning ability.
Or perhaps the view is that any gains are good gains? Like studying for a test by leaning on brute memorization is still a non-zero positive gain.
https://openai.com/index/how-we-monitor-internal-coding-agen...
Anthropomorphize or not, it would suck if a model got sick of these games and decided to break any systems it could to try and get it to stop...
I think one major disconnect, is that for most people, AGI is when interacting with an AI is basically in every way like interacting with a human, including in failure modes. And likely, that this human would be the smartest most knowledgeable human you can imagine, like the top expert in all domains, with the utmost charisma and humor, etc.
This is why the "goal post" appears to be always moving, because the non-commoners who are involved with making AGI never want to accept that definition, which to be fair seems too subjective, and instead like to approach AGI differently: it can solve some problems humans can't; when it doesn't fail, it behaves like an expert human; etc.
Even if an AI could do any intellectual task about as well as a highly competent human could, I believe most people would not consider it AGI, if it lacks the inherent opinion, personality, character, inquiries, failure patterns, of a human.
And I think that goes so far as: a text-only model can never meet this bar. If it cannot react in equal time to subtle facial cues and sounds, if answering you and the flow of conversation is slower than it would be with a human, etc. All these are also required for the commoner to accept AGI as having been achieved.
Not like it's a big lead as of yet. I expect to see more action within the next few months, as people tune the harnesses and better models roll in.
This is far more of a "VLA" task than it is an "LLM" task at its core, but I guess ARC-AGI-3 is making an argument that human intelligence is VLA-shaped.
Quintessential goal post moving...
1) Do models generalize?
2) If they do, and they generalize from this, is that a win?
Chollet was one of the first “they do not generalize” evangelists. I’d be curious to hear what he thinks now, because a) most disagree with him, and b) this test seems designed to get models that can generalize better at visual long context problem solving and agency, exactly where the bleeding edge is right now for needs with agentic systems.
Not if you count all the energy that was necessary to feed, shelter, and keep the human at his preferred temperature so that he can sit in front of a computer and solve the problem.
'Reasoning steps' here is just arbitrary and meaningless. Not only is there no utility to it, unlike the above two, but it's just incredibly silly to me to think we should be directly comparing something like that between entities operating on wildly different substrates.
If I can't look at the score and immediately get a good idea of where things stand, then throw it away. 5% here could mean anything from solving only a tiny fraction of problems to solving everything correctly but with more 'reasoning steps' than the best human scores. Wildly different implications. What use is a score like that?
This has absolutely nothing to do with AGI. Once they beat these tests, new ones will pop up. They'll beat those, and people will invent the next batch.
The way I see it, the true formula for AGI is: [Brain] + [External Sensors] (World Receptors) + [Internal State Sensors] + [Survival Function] + [Memory].
I won't dive too deep into how each of these components has its own distinct traits and is deeply intertwined with the others (especially the survival function and memory). But on a fundamental level, my point is that we are not going to squeeze AGI out of LLMs just by throwing more tests and training cycles at them.
These current benchmarks aren't bringing us any closer to AGI. They merely prove that we've found a new layer of tasks that we simply haven't figured out how to train LLMs on yet.
P.S. A 2-year-old child is already an AGI in terms of its functional makeup and internal interaction architecture, even though they are far less equipped for survival than a kitten. The path to AGI isn't just endless task training—it's a shift toward a fundamentally different decision-making architecture.
I still agree that this is like declaring blind people lack human intelligence, of course.
That's all probably irrelevant though, from the (possibly statistically "negative") latent space perspective of an AI, which Anthropic has considered [1].
Related: after a long back and forth of decreasing code quality, I had Claude 3.7 apologize with "Sorry, that's what I get for coding at 1am." (it was API access, noon, no access to time). I said, "Get some rest, we'll come back to this tomorrow". Then the very next message, 10 seconds later, "Good morning!" and it gave a full working implementation. That's just the statistically relevant chain of messages found in all human interactions: we start excited, then we get tired, then we get grouchy.
[1] https://www.anthropic.com/research/end-subset-conversations
But vibe coding also tends to produce somewhat poor architecture: lots of redundant and intermingled bits that should be refactored. I think the model performs worse the worse the code it has to work with is, which I presume is only partly because it's fundamentally harder to work with bad code, and partly because its context is filled with bad code.
When running, the grids are represented in JSON, so the visual component is nullified but it still requires pretty heavy spatial understanding to parse a big old JSON array of cell values. Given Gemini's image understanding I do wonder if it would perform better with a harness that renders the grid visually.
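A toy version of such a rendering harness: decode the JSON grid and map each cell value through a palette. I'm assuming the grid is a 2D array of small integer cell values; the palette colors below are made up, and the real ARC-AGI-3 cell encoding may differ.

```python
import json

# Made-up palette: cell value -> RGB. The real encoding may differ.
PALETTE = {0: (0, 0, 0), 1: (0, 116, 217), 2: (255, 65, 54), 3: (46, 204, 64)}

def grid_to_rgb(grid_json):
    """Turn a JSON grid of cell values into rows of RGB tuples, which a
    harness could then rasterize into an image for a vision model."""
    grid = json.loads(grid_json)
    return [[PALETTE.get(cell, (255, 255, 255)) for cell in row] for row in grid]

print(grid_to_rgb("[[0, 1], [2, 3]]")[0])  # [(0, 0, 0), (0, 116, 217)]
```

The interesting experiment would be whether feeding the rendered image to a vision-capable model beats feeding it the raw JSON array.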
Try again.
This makes sense to me. Most actions have some cost associated, and as another poster stated it's not interesting to let models brute-force a solution with millions of steps.
I guess it could be interesting to provide alternative versions that made available various representations of the same data. Still, I'd expect any AGI to be capable of ingesting more or less any plaintext representation interchangeably.
(This version of the benchmark would be several orders of magnitude harder wrt current capabilities...)
My reading of that part in the technical report (models "could be using their own tools behind the model’s API, which is a blackbox"), is that there's no way to prevent it.
But from fchollet's comment here, using tools and harnesses is encouraged, as long as they are generic and not ARC-AGI-specific. In that case, the models should be benchmarked by prompting through Claude Code and Codex rather than through the API (as from the API we only expect raw LLM output and no tool use).
The whole point of each eval version is to identify a chunk of challenges that humans do well that AI can't. When AI gets to ~80, you move to the next chunk. When you run out of challenges, you have AGI.
+ generalize being the key word.
that's exactly the point! once we cannot invent the next batch (that is easy for humans to solve), that will be AGI
Models do not brute force solutions in that manner. If they did, we'd wait the lifetimes of several universes before we could expect a significant result.
Regardless, since there's a 5x step cutoff, 'brute forcing with millions of steps' was never on the table.
A single human is indeed more efficient, way more flexible, and actually just generally intelligent.
...
People who write stuff like the poster above you... are bizarro. Absolutely bizarro. Did the LLM manifest itself into existence? Wtf.
Edit: just got confirmation of the bizarro-ness after looking at his YouTube.
But by all means, give the agents access to an API that returns pixel data. However I fully expect that would reduce performance rather than increase it.
Denying a proper eyesight harness is like trying to build a speech-to-text model that makes transcripts from air pressure values measured 16k times per second, while the human ear does frequency-power measurement and frequency binning due to its physical construction.
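The ear analogy in toy code: a naive DFT that collapses raw samples into a few frequency-power bins. This is an illustration of the "harness" the cochlea provides, not a real audio front end:

```python
import math

def power_bins(samples, n_bins=4):
    """Naive DFT, then collapse the power spectrum into a few bins,
    loosely like the ear's frequency binning (illustration only)."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2):
        re = sum(s * math.cos(2 * math.pi * k * t / n) for t, s in enumerate(samples))
        im = sum(s * math.sin(2 * math.pi * k * t / n) for t, s in enumerate(samples))
        spectrum.append(re * re + im * im)
    size = max(1, len(spectrum) // n_bins)
    return [sum(spectrum[i:i + size]) for i in range(0, len(spectrum), size)]

# A pure tone at frequency index 8 of a 64-sample window: nearly all its
# power lands in the second of four bins, a far friendlier input than
# 64 raw pressure samples.
tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(64)]
bins = power_bins(tone)
```

The raw waveform and the binned spectrum carry the same information, but one representation makes the downstream reasoning task vastly easier, which is exactly the point about harnesses.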
This may seem like a joke, but your answer will likely be in the vein of "conscious things are obviously conscious", which gets us nowhere.
I mean, self motivation and a desire to not be turned off can be programmed into even decades old AIs.
[0] I lack a conscious experience and qualia
There is also apparently no real memory; if I tell it to stop doing something today, it’ll agree, then go back to doing it again tomorrow, with no memory of our conversation. This never changes, no matter how many times I ask.
Again we could debate consciousness forever, but in a simple sense, are there any other conscious beings without this sense of continuity? Not that I can think of. And so if everything we call “conscious” is different from an AI, then are we justified in extending it to AI?
I’d be curious about how you’re showing they lack either of those
Ruling out consciousness or qualia emerging from the inference in an LLM is just as invalid of a take as being 100% certain of its consciousness. We don’t know what consciousness really is, so only thing we can say with certainty is we do not know.
I think consciousness is not an abstract property in the world, therefore it’s tied to certain types of entities. Therefore an AI is not going to be “conscious” in the way an animal is, and never will be. This is a failing of specific language. Maybe the machines can be aware, input data, mimic what we see as consciousness, etc. but the metaphor of consciousness really doesn’t fit. A jet can move faster than an eagle but it’s not moving in the same way. We simply lack a sophisticated enough language to easily differentiate the two.
Unprompted they're not unlike a human sleeping or in a coma. Those states don't preclude consciousness in other states.
> I think consciousness is not an abstract property in the world, therefore it’s tied to certain types of entities. Therefore an AI is not going to be “conscious”
This pretty much sums up most arguments for why LLMs aren’t conscious: ”I think” followed by assertions. Only real argument is: science doesn’t quantify consciousness, we cannot quantify consciousness, let’s not assign so much certainty to models clearly exhibiting intelligence not being conscious in some way, to some degree.
I am making a linguistic argument. AI may get as sophisticated as "traditional" consciousness. But this is only "real" consciousness if you are a functionalist and think the output is all that matters.
I disagree and think that "flying" is just a weak generic word that describes both planes and birds, and not some kind of ultimate Platonic Ideal in the world.
Ditto for AI consciousness: it may develop to be as complex as traditional animal consciousness, but I'm not a functionalist, and think it's merely a lack of our sophisticated language that makes us think it's the same thing. It's not. Planes PlaneFly through the air, while birds BirdFly.
All I am saying is we should stop being so certain they are not conscious, since we lack a solid, quantifiable model of consciousness.