It feels like it should be about having no ARC-AGI-3-specific tools, not "no not-built-in-tool"...
Maybe the internet will briefly go back to a place mainly populated with outliers.
If the AI has to control a body to sit on a couch and play this game on a laptop that would be a step in the right direction.
It is a simple game with simple rules, yet automated solvers have an incredibly difficult time with it compared to humans at a certain level. Solutions are easy to validate but hard to find.
Yes, we get that LLMs are really bad when you give them contrived visual puzzles or pseudo games to solve... Well great, we already knew this.
The "hype" around the ARC-AGI benchmarks makes me laugh, especially the idea we would have AGI when ARC-AGI-1 was solved... then we got 2, and now we're on 3.
Shall we start saying that these benchmarks have nothing to do with AGI yet? Are we going to get an ARC-AGI-10 where we have LLMs try and beat Myst or Riven? Will we have AGI then?
This isn't the right tool for measuring "AGI", and honestly I'm not sure what it's measuring except the foundation labs benchmaxxing on it.
- Human baseline is "defined as the second-best first-run human by action count". Your "regular people" are people who signed up for puzzle solving, and the score isn't compared against a human average but against the second-best human solution
- The scoring doesn't tell you how many levels the models completed, but how efficiently they completed them compared to humans. It uses squared efficiency: if a human took 10 steps to solve a level and the model took 100, the model gets a score of 1% ((10/100)^2)
- 100% just means that all levels are solvable. The 1% number uses completely different and extremely skewed scoring based on the 2nd-best human score on each level individually. They said that the typical level is solvable by 6 out of 10 people who took the test, so let's assume the median human solves about 60% of puzzles (I know, not quite right). If the median human takes 1.5x more steps than your 2nd-fastest solver, then the median score is 0.6 * (1/1.5)^2 = 26.7%. Now take the bottom-10% guy, who maybe solves 30% of levels but takes 3x more steps: he'd get a score of 0.3 * (1/3)^2 ≈ 3%
- The scoring is designed so that even if AI performs on a human level it will score below 100%
- No harness at all and very simplistic prompt
- Models can't use more than 5X the steps that a human used
- Notice how they also gave higher weight to later levels? The benchmark was designed to detect the continual learning breakthrough. When it happens in a year or so they will say "LOOK OUR BENCHMARK SHOWED THAT. WE WERE THE ONLY ONES"
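The squared-efficiency scoring described in the bullets above can be sketched in a few lines; this is my reading of the thread, not the official formula:

```python
# Sketch of the squared-efficiency scoring described above
# (my reading of the thread, not the official formula).
def level_score(human_actions, model_actions, solved):
    """Score for one level: zero if unsolved, otherwise squared step
    efficiency against the human baseline (2nd-best action count)."""
    if not solved:
        return 0.0
    return min(1.0, (human_actions / model_actions) ** 2)

# Human baseline 10 actions, model needs 100 -> (10/100)^2 = 1%
print(level_score(10, 100, True))          # ~0.01
# Median-human scenario from above: 60% of levels solved at 1.5x the steps
print(0.6 * level_score(1, 1.5, True))     # ~0.267
```

The squaring is what makes the score collapse so fast: taking 10x the steps of the baseline costs you 99% of the level's score, not 90%.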
Back in the 90's, Scientific American had an article on AI - I believe this was around the time Deep Blue beat Kasparov at chess.
One AI researcher's quote stood out to me:
"It's silly to say airplanes don't fly because they don't flap their wings the way birds do."
He was saying this with regards to the Turing test, but I think the sentiment is equally valid here. Just because a human can do X and the LLM can't doesn't negate the LLM's "intelligence", any more than an LLM doing a task better than a human negates the human's intelligence.
I really wonder why so many people fight against this. We know that AI is useful, we know that AI can aid research, but we want to know whether it is what we vaguely define as intelligent.
I’ve read the "airplanes don’t use wings", or "submarines don’t swim". Yes, but this is not the question. I suggest everyone coming up with these comparisons check their biases, because this is about Artificial General Intelligence.
General is the keyword here; this is what ARC is trying to measure. Whether it’s useful or not isn’t the point. Whether AI after testing is useful or not isn’t the point either.
This so far has been the best test.
And I also recommend asking AI specialized questions deep in your own job, ones you know the answer to, and seeing how often the solution is wrong. I would guess it’s more likely that we perceive knowledge as intelligence than that we notice the missing intelligence. Probably common amongst humans as well.
- Take a person who grew up playing video games. They'll pass these tests 100% without even breaking a sweat.
- BUT, put a grandmother who has never used a computer in front of this game, and she'll most likely fail completely. Just like an LLM.
As soon as models are "natively" trained on a massive dataset of these types of games, they'll easily adapt and start crushing these challenges.
This is not AGI at all.
This measures the ability of a LLM to succeed in a certain class of games. Sure, that could be a valuable metric on how powerful (or even generally powerful) a LLM is.
Humans may or may not be good at the same class of games.
We know there exists a class of games (including most human games like checkers/chess/go) that computers (not LLMs!) already vastly outpace humans.
So the argument for whether a LLM is "AGI" or not should not be whether a LLM does well on any given class of games, but whether that class of games is representative of "AGI" (however you define that.)
It seems unlikely that this set of games yields a definition meaningful for any practical, philosophical or business application?
I don't know if this is how we want to measure AGI.
In general I believe we should probably stop this pursuit of human-equivalent intelligence that encourages people to think of these models as human replacements. LLMs are clearly good at a lot of things; let's focus on how we can augment and empower the existing workforce.
CRAZY 0.1% on average lmao
Edit: Having messed around with it now (and read the .pdf), it seems like they've left behind their original principle of making tests that are easy for humans and hard for machines. I'm still not convinced that a model that's good at these sorts of puzzles is necessarily better at reasoning in the real world, but am open to being convinced otherwise.
TBF, that's basically what the kaggle competition is for. Take whatever they do, plug in a SotA LLM and it should do better than whatever people can do with limited GPUs and open models.
If you are trying to measure GENERAL intelligence then it needs to be general.
If you've played Wordle you might've solved the game in a minute once before as well. And if you've played a bunch then you've perhaps also taken the entire day to solve it.
So why is it that today’s puzzle was so intuitive, but next month’s new puzzle shared here could be impossible? I’d like a more satisfying explanation than luck and the obvious “different things are different” (even though… yeah, different things are different).
That is a nice sentiment but not what the AI companies are out to do; they want your job.
Surprised at the comments here re: not figuring it out. Simple game. Super annoying though lmao.
Without a big jump, we're just going to boil the frog (ourselves).
I still don't quite understand the exact mirroring rules at play.
seriously. lmao. if you aint, I dunno what to say.
So if a model can solve every question but takes 10x as many steps as the second best human it will get a score of 1%.
Given how hard even pure v2 was for modern LLMs, I'm not surprised to see v3 crush them. But that won't last.
There's world state that you can change. Not just placing pixels.
Here's v2:
Once the AIs solve this, there will be another ARC-AGI. And so on until we can't find any more problems that can be solved by humans and not AI. And that's when we'll know we have AGI.
It's a "let's find a task humans are decent at, but modern AIs are still very bad at" kind of adversarial benchmark.
The exact coverage of this one is: spatial reasoning across multiple turns, agentic explore/exploit with rule inference and preplanning. Directly targeted against the current generation of LLMs.
It used to be easy to build these tests. I suspect it’s getting harder and harder.
But if we run out of ideas for tests that are easy for humans but impossible for models, it doesn’t mean none exist. Perhaps that’s when we turn to models to design candidate tests, and have humans be the subjects to try them out ad nauseam until no more are ever uncovered? That sounds like a lovely future…
Anyway, from the article:
> As long as there is a gap between AI and human learning, we do not have AGI.
This seems like a reasonable requirement. Something I think about a lot with vibe coding is that unlike humans, individual models do not get better within a codebase over time, they get worse.
By updating the tests specifically in areas AI has trouble with, it creates a progressive feedback loop against which AI development can be moved forward. There's no known threshold or well defined capability or particular skill that anyone can point to and say "that! That's AGI!". The best we can do right now is a direction. Solving an ARC-AGI test moves the capabilities of that AI some increment closer to the AGI threshold. There's no good indication as to whether solving a particular test means it's 15% closer to AGI or .000015%.
It's more or less a best effort empiricist approach, since we lack a theory of intelligence that provides useful direction (as opposed to a formalization like AIXI which is way too broad to be useful in the context of developing AGI.)
If you mess around a little bit, you will figure it out. There are only a few rules.
Barely any of them break 0% on any of the demo tasks, with Claude Opus 4.6 coming out on top with a few <3% scores, Gemini 3.1 Pro getting two nonzero scores, and the others (GPT-5.4 and Grok 4.20) getting all 0%
We tested ~500 humans over 90 minute sessions in SF, with $115-$140 show up fee (then +$5/game solved). A large fraction of testers were unemployed or under-employed. It's not like we tested Stanford grad students. Many AI benchmarks use experts with Ph.D.s as their baseline -- we hire regular folks as our testers.
Each game was seen by 10 people. They were fully solved (all levels cleared) by 2-8 of them, most of the time 5+. Our human baseline is the second best action count, which is considerably less than an optimal first-play (even the #1 human action count is much less than optimal). It is very achievable, and most people on this board would significantly outperform it.
Try the games yourself if you want to get a sense of the difficulty.
> Models can't use more than 5X the steps that a human used
These aren't "steps" but in-game actions. The model can use as much compute or tools as it wants behind the API. Given that models are scored on efficiency compared to humans, the cutoff makes basically no difference on the final score. The cutoff only exists because these runs are incredibly expensive.
> No harness at all and very simplistic prompt
This is explained in the paper. Quoting: "We see general intelligence as the ability to deal with problems that the system was not specifically designed or trained for. This means that the official leaderboard will seek to discount score increases that come from direct targeting of ARC-AGI-3, to the extent possible."
...
"We know that by injecting a high amount of human instructions into a harness, or even hand-crafting harness configuration choices such as which tools to use, it is possible to artificially increase performance on ARC-AGI-3 (without improving performance on any other domain). The purpose of ARC-AGI-3 is not to measure the amount of human intelligence that went into designing an ARC-AGI-3 specific system, but rather to measure the general intelligence of frontier AI systems.
...
"Therefore, we will focus on reporting the performance of systems that have not been specially prepared for ARC-AGI-3, served behind a general-purpose API (representing developer-aware generalization on a new domain as per (8)). This is similar to looking at the performance of a human test-taker walking into our testing center for the first time, with no prior knowledge of ARC-AGI-3. We know such test takers can indeed solve ARC-AGI-3 environments upon first contact, without prior training, without being briefed on solving strategies, and without using external tools."
If it's AGI, it doesn't need human intervention to adapt to a new task. If a harness is needed, it can make its own. If tools are needed, it can choose to bring out those tools.
> even Alan M. Turing allowed himself to be drawn into the discussion of the question whether computers can think. The question is just as relevant and just as meaningful as the question whether submarines can swim.
(I am of the opinion that the thinking question is in fact a bit more relevant than the swimming one, but I understand where these are coming from.)
Don't read the statement as a human dunk on LLMs, or even as philosophy.
The gap is important because of its special and devastating economic consequences. When the gap becomes truly zero, all human knowledge work is replaceable. From there, with robots, it's a short step to all work being replaceable.
What's worse, the condition is sufficient but not even necessary. Just as planes can fly without flapping, the economy can be destroyed without full AGI.
I really like these puzzles. There’s a lot to them both in design and scoring — models trained to do well on these are going to be genuinely much more useful, so I’m excited about it. As opposed to -1 and -2, to do well at these, you need to be able to do:
- Visual reasoning
- Path planning (and some fairly long paths)
- Mouse/screen interaction
- Color and shape analysis
- cross-context learning/remembering
Probably more, I only did like five or six of these. We really want models that are good at all this; it covers a lot of what current agentic loops are super weak at. So I hope M. Chollet is successful at getting frontier labs to put a billion or so into training for these.
Even with billions of dollars spent on training, we had a situation a few weeks ago where models were suggesting walking instead of driving to a car wash, while a 3-year-old would know the answer. And yet we are designing elaborate tests to 'show whether AGI is here or not', while being fully aware of what these models represent under the hood.
This is an absurd constraint. You could have a vastly superhuman AI that doesn't learn as efficiently as a human and it would not pass this definition while it simultaneously goes on to colonize the galaxy...
"steps" are important to optimize if they have negative externalities.
There’s no “gap that becomes truly zero” at which point special consequences happen. By the time we achieve AGI, the lesser forms of AI will likely have replaced a lot of human knowledge labor through the exact “brute-force” methods Chollet is trying to factor out (which is why many people are saying that doing so is unproductive).
AGI is like an event horizon: It does mean something, it is a point in space, but you don’t notice yourself going through it, the curvature smoothly increases through it.
While I share Dijkstra's sentiment that "thinking machines" is largely a marketing term we've been chasing for decades, and this new cycle is no different, it's still worth discussing and... thinking about. The implications of a machine that can approximate or mimic human thinking are far beyond the implications of a machine that can approximate or mimic swimming. It's frankly disappointing that such a prominent computer scientist and philosopher would be so dismissive and uninterested in this fundamental CS topic.
Also, it's worth contextualizing that quote. It's from a panel discussion in 1983, which was between the two major AI "winters", and during the Expert Systems hype cycle. Dijkstra was clearly frustrated by the false advertising, to which I can certainly relate today, and yet he couldn't have predicted that a few decades later we would have computers that mimic human thinking much more closely and are thus far more capable than Expert Systems ever were. There are still numerous problems to resolve, w.r.t. reliability, brittleness, explainability, etc., but the capability itself has vastly improved. So while we can still criticize modern "AI" companies for false advertising and anthropomorphizing their products just like in the 1980s hype cycle, the technology has clearly improved, which arguably wouldn't have happened if we didn't consider the question of whether machines can "think".
There are very valid reasons to measure that. You wouldn’t ask a plane to drive you to the neighbor’s or to buy you groceries at the supermarket. It’s not generally mobile the way you are, but it increases your mobility.
It also doesn't actually matter much, as ultimately the utility of its outputs is what determines its worth.
There is the moral question of consciousness though, a test which it seems humans will not be able to devise in the near future, which morally leads to a default position that we should assume the AI is conscious until we can prove it's not. But man, people really, really hate that conclusion.
If model creators are willing to teach their LLMs to play computer games through text, it's gonna be solved in one minor bump of the model version. But honestly, I don't think they're gonna bother, because it's just too silly and they won't expect their models to learn anything useful from it.
Especially since there are already models that can learn how to play 8-bit games.
It feels like ARC-AGI jumped the shark. But who knows, maybe people who train models for robots are going to take it in stride.
- open book: you have access to nearly the whole Internet and resources beyond it, e.g. torrents of nearly all books, research papers, etc., including the history of all previous tests, even those similar to this one
- arguably no real time limit, since the work can be parallelized across threads and cached aggressively
- no shame in submitting a very large amount of wrong answers until you get the "right" one
... so I'm not saying it makes it "easy" but I can definitely say it's not the typical way I used to try to pass tests.
So there is a business application, but no practical or philosophical one.
What's going to stop e.g. OpenAI from hiring a bunch of teenagers to play these games non-stop for a month, annotating the games with their logic for deriving the rules, generating a data set from those playthroughs, and fine-tuning the next version of ChatGPT on them?
I met a guy who, for fun, started working on ARC2, and as he got the number to go up in the eval, a novel way to more efficiently move a robotic arm emerged. All that to say: chasing evals per se can have tangible real world benefits.
Talking to the ARC folks tonight, it sounds like there will be an ARC-4,5,6,etc. I mean of course there will be.
But with them will be an increasing expectation that these models can eventually figure things out with zero context, and zero pretraining; you drop a brain into any problem and it'll figure out how to dig its way out.
That's really exciting.
In theory, sure, if I can throw a million monkeys at a problem and stumble into a solution, it doesn't matter how I got there. In practice though, every attempt has a direct and indirect impact on the externalities. You can argue those externalities are minor, but the huge sums of money going to data centers suggest otherwise.
Lastly, humans use way less energy to solve these in fewer steps, so of course it matters when you throw kilowatts at something that takes milliwatts to solve.
"Sample efficient rule inference where AI gets to control the sampling" seems like a good capability to have. Would be useful for science, for example. I'm more concerned by its overreliance on humanlike spatial priors, really.
But the ARC-AGI competitions are cool. Just to see where we stand, and to have some months where the benchmarks aren't fully saturated. And, as someone else noted elsewhere in the thread, some of these games are not exactly trivial, at least until you "get" the meta they're looking for.
LLMs are way past us at language, for instance. Calculators passed us at calculating, etc.
My main criticism would be that it doesn’t seem like this test allows online learning, which is what humans do (over the scale of days to years). So in practice it may still collapse to what you point out, but not because the task is unsuited to showing AGI.
I've been a gamer for just about 40 years. Gaming is my "thing".
I found the challenges fun, but easy. Coming back and reading comments from people struggling with the games, my first thought was - yup definitely not a gamer.
My approach was to poke at the controls to suss the rules, then the actual solutions were really straightforward.
fwiw, I'm pretty dumb generally, but these kinds of puzzles are my jam.
Just to drive that thought further.
What are you suggesting, that we rename it? To me the fundamental question is this:
Do we still have tasks that humans can do better than AIs?
I like the question. I think another good test is "make money". There are humans that can generate money from their laptop. I don’t think AI will be net positive.
I’ve tried to create a Polymarket trading bot with Opus 4.6. The ideas were full of logical fallacies and many many mistakes.
But also I’m not sure how they would compare against an average human with no statistics background..
I think it’s really about establishing whether by AGI we mean better than the average human or better than the best human.
I would imagine if you simply encoded the game in textual format and asked an LLM to come up with a series of moves, it would beat humans.
The problem here is more around perception than anything.
ARC-AGI-3 is an interactive reasoning benchmark which challenges AI agents to explore novel environments, acquire goals on the fly, build adaptable world models, and learn continuously.
A 100% score means AI agents can beat every game as efficiently as humans.
Instead of solving static puzzles, agents must learn from experience inside each environment—perceiving what matters, selecting actions, and adapting their strategy without relying on natural-language instructions.
As long as there is a gap between AI and human learning, we do not have AGI.
ARC-AGI-3 makes that gap measurable by testing intelligence across time, not just final answers—capturing planning horizons, memory compression, and the ability to update beliefs as new evidence appears.
ARC-AGI-3 includes replayable runs, a developer toolkit for agent integration, and a UI designed for transparent evaluation.
Inspect agent behavior through preview replays—track decisions, actions, and reasoning in a structured timeline.
Integrate your agent using the ARC-AGI-3 toolkit, then use the interactive UI to test and iterate.
Everything you need to build agents: environments, API usage, and integration guidance.
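The observe/act loop such an agent integration implies can be sketched generically; note that the real ARC-AGI-3 toolkit API surely differs, and every name below is invented for illustration:

```python
# Hypothetical observe -> act loop for an interactive benchmark.
# The real ARC-AGI-3 toolkit API differs; every name here is invented.
def run_episode(env, agent, max_actions=1000):
    """Run one game: feed frames to the agent until it wins or runs out."""
    frame = env.reset()
    for actions_used in range(1, max_actions + 1):
        frame, done = env.step(agent(frame))
        if done:
            return actions_used  # the action count feeds the efficiency score
    return None  # level not cleared within the budget

# Toy stand-in environment: cleared as soon as the agent outputs action 3.
class ToyEnv:
    def reset(self):
        return "start"
    def step(self, action):
        return "frame", action == 3

print(run_episode(ToyEnv(), lambda frame: 3))  # 1
```

The point of the loop shape is that scoring depends on the action count, not wall-clock time: whatever reasoning the agent does between actions is free.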
Nit: I didn't see a final score of how many actions I took to complete 7 levels. Also didn't see a place to sign in to see the leaderboard (I did see the sign in prompt).
That's not intelligence though, even if it may appear to be. Does it matter? That's another question. But it certainly is not a representation of intelligence.
Kinda crazy that Yudkowsky and all those rationalists and enthusiasts spent over a decade obsessing over this stuff, and we've had almost 80 years of elite academics pondering on it, and none of them could come up with a meaningful, operational theory of intelligence. The best we can do is "closer to AGI" as a measurement, and even then, it's not 100% certain, because a model might have some cheap tricks implicit to the architecture that don't actually map to a meaningful difference in capabilities.
Gotta love the field of AI.
It doesn't prove anything of the sort. ARC-AGI has always been nothing special in that regard, but this one really takes the cake. A 'human baseline' that isn't really a baseline, and scoring so convoluted that a model could beat every game in reasonable time and still score well below 100. Really, what are we doing here?
That Francois had to do all this nonsense should tell you something about where we are right now.
The minimum recommended size for mobile is 44x44
No system can crack these out of the box (like humans can) because we don't have AGI.
The "things that currently make money" definition is interesting. Because those are the things that automation can't currently do: if they could be automated, the price would tend to 0 and nobody could make money at them.
I'm guessing you did not pass the human testers JSON blobs to work with, and suspect they would also score 0% without the eyesight and visual cortex harness to their reasoning ability.
Or perhaps the view is that any gains are good gains? Like studying for a test by leaning on brute memorization is still a non-zero positive gain.
https://openai.com/index/how-we-monitor-internal-coding-agen...
Anthropomorphize or not, it would suck if a model got sick of these games and decided to break any systems it could to try and get it to stop...
I think one major disconnect, is that for most people, AGI is when interacting with an AI is basically in every way like interacting with a human, including in failure modes. And likely, that this human would be the smartest most knowledgeable human you can imagine, like the top expert in all domains, with the utmost charisma and humor, etc.
This is why the "goal post" appears to be always moving, because the non-commoners who are involved with making AGI never want to accept that definition, which to be fair seems too subjective, and instead like to approach AGI differently: it can solve some problems humans can't; when it doesn't fail, it behaves like an expert human; etc.
Even if an AI could do any intellectual task about as well as a highly competent human could, I believe most people would not consider it AGI, if it lacks the inherent opinion, personality, character, inquiries, failure patterns, of a human.
And I think that goes so far as: a text-only model can never meet this bar. If it cannot react in equal time to subtle facial cues and sounds, if answering you and the flow of conversation is slower than it would be with a human, etc. All these are also required for the commoner to accept AGI as having been achieved.
Not like it's a big lead as of yet. I expect to see more action within the next few months, as people tune the harnesses and better models roll in.
This is far more of a "VLA" task than it is an "LLM" task at its core, but I guess ARC-AGI-3 is making an argument that human intelligence is VLA-shaped.
Quintessential goal post moving...
1) Do models generalize?
2) If they do, and they generalize from this, is that a win?
Chollet was one of the first “they do not generalize” evangelists. I’d be curious to hear what he thinks now, because a) most disagree with him, and b) this test seems designed to get models that can generalize better at visual long context problem solving and agency, exactly where the bleeding edge is right now for needs with agentic systems.
Not if you count all the energy that was necessary to feed, shelter, and keep the human at his preferred temperature so that he can sit in front of a computer and solve the problem.
'Reasoning steps' here is just arbitrary and meaningless. Not only is there no utility to it, unlike the above two, but it's just incredibly silly to me to think we should be directly comparing something like that between entities operating on wildly different substrates.
If I can't look at the score and immediately get a good idea of where things stand, then throw it away. 5% here could mean anything from solving only a tiny fraction of problems to solving everything correctly but with more 'reasoning steps' than the best human scores. Wildly different implications. What use is a score like that?
This has absolutely nothing to do with AGI. Once they beat these tests, new ones will pop up. They'll beat those, and people will invent the next batch.
The way I see it, the true formula for AGI is: [Brain] + [External Sensors] (World Receptors) + [Internal State Sensors] + [Survival Function] + [Memory].
I won't dive too deep into how each of these components has its own distinct traits and is deeply intertwined with the others (especially the survival function and memory). But on a fundamental level, my point is that we are not going to squeeze AGI out of LLMs just by throwing more tests and training cycles at them.
These current benchmarks aren't bringing us any closer to AGI. They merely prove that we've found a new layer of tasks that we simply haven't figured out how to train LLMs on yet.
P.S. A 2-year-old child is already an AGI in terms of its functional makeup and internal interaction architecture, even though they are far less equipped for survival than a kitten. The path to AGI isn't just endless task training—it's a shift toward a fundamentally different decision-making architecture.
I still agree that this is like declaring blind people lack human intelligence, of course.
That's all probably irrelevant though, from the (possibly statistically "negative") latent space perspective of an AI, which Anthropic has considered [1].
Related: after a long back and forth of decreasing code quality, I had Claude 3.7 apologize with "Sorry, that's what I get for coding at 1am." (it was API access, noon, no access to time). I said, "Get some rest, we'll come back to this tomorrow". Then the very next message, 10 seconds later, "Good morning!" and it gave a full working implementation. That's just the statistically relevant chain of messages found in all human interactions: we start excited, then we get tired, then we get grouchy.
[1] https://www.anthropic.com/research/end-subset-conversations
But vibe coding also tends to produce somewhat poor architecture: lots of redundant and intermingled bits that should be refactored. I think the model performs worse the worse the code it has to work with is, which I presume is only partly because it's fundamentally harder to work with bad code, and partly because its context is filled with bad code.
When running, the grids are represented in JSON, so the visual component is nullified but it still requires pretty heavy spatial understanding to parse a big old JSON array of cell values. Given Gemini's image understanding I do wonder if it would perform better with a harness that renders the grid visually.
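A toy version of such a rendering harness: decode the JSON grid and map each cell value through a palette. I'm assuming the grid is a 2D array of small integer cell values; the palette colors below are made up, and the real ARC-AGI-3 cell encoding may differ.

```python
import json

# Made-up palette: cell value -> RGB. The real encoding may differ.
PALETTE = {0: (0, 0, 0), 1: (0, 116, 217), 2: (255, 65, 54), 3: (46, 204, 64)}

def grid_to_rgb(grid_json):
    """Turn a JSON grid of cell values into rows of RGB tuples, which a
    harness could then rasterize into an image for a vision model."""
    grid = json.loads(grid_json)
    return [[PALETTE.get(cell, (255, 255, 255)) for cell in row] for row in grid]

print(grid_to_rgb("[[0, 1], [2, 3]]")[0])  # [(0, 0, 0), (0, 116, 217)]
```

The interesting experiment would be whether feeding the rendered image to a vision-capable model beats feeding it the raw JSON array.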
Try again.
This makes sense to me. Most actions have some cost associated, and as another poster stated it's not interesting to let models brute-force a solution with millions of steps.
I guess it could be interesting to provide alternative versions that made available various representations of the same data. Still, I'd expect any AGI to be capable of ingesting more or less any plaintext representation interchangeably.
(This version of the benchmark would be several orders of magnitude harder wrt current capabilities...)
My reading of that part in the technical report (models "could be using their own tools behind the model’s API, which is a blackbox"), is that there's no way to prevent it.
But from fchollet's comment here, using tools and harnesses is encouraged, as long as they are generic and not ARC-AGI-specific. In that case, the models should be benchmarked by prompting through Claude Code and Codex rather than through the API (as from the API we only expect raw LLM output and no tool use).
The whole point of each eval version is to identify a chunk of challenges that humans do well that AI can't. When AI gets to ~80, you move to the next chunk. When you run out of challenges, you have AGI.
+ generalize being the key word.
that's exactly the point! once we cannot invent the next batch (that is easy for humans to solve), that will be AGI
Models do not brute force solutions in that manner. If they did, we'd wait the lifetimes of several universes before we could expect a significant result.
Regardless, since there's a 5x step cutoff, 'brute forcing with millions of steps' was never on the table.
A single human is indeed more efficient, way more flexible, and actually just generally intelligent.
...
People who write stuff like the poster above you... are bizarro. Absolutely bizarro. Did the LLM manifest itself into existence? Wtf.
Edit: just got confirmation of the bizarro-ness after looking at his YouTube.
But by all means, give the agents access to an API that returns pixel data. However I fully expect that would reduce performance rather than increase it.
Denying a proper eyesight harness is like trying to build a speech-to-text model that makes transcripts from air pressure values measured 16k times per second, while the human ear does frequency-power measurement and frequency binning due to its physical construction.
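The ear analogy in toy code: a naive DFT that collapses raw samples into a few frequency-power bins. This is an illustration of the "harness" the cochlea provides, not a real audio front end:

```python
import math

def power_bins(samples, n_bins=4):
    """Naive DFT, then collapse the power spectrum into a few bins,
    loosely like the ear's frequency binning (illustration only)."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2):
        re = sum(s * math.cos(2 * math.pi * k * t / n) for t, s in enumerate(samples))
        im = sum(s * math.sin(2 * math.pi * k * t / n) for t, s in enumerate(samples))
        spectrum.append(re * re + im * im)
    size = max(1, len(spectrum) // n_bins)
    return [sum(spectrum[i:i + size]) for i in range(0, len(spectrum), size)]

# A pure tone at frequency index 8 of a 64-sample window: nearly all its
# power lands in the second of four bins, a far friendlier input than
# 64 raw pressure samples.
tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(64)]
bins = power_bins(tone)
```

The raw waveform and the binned spectrum carry the same information, but one representation makes the downstream reasoning task vastly easier, which is exactly the point about harnesses.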
This may seem like a joke, but your answer will likely be in the vein of "conscious things are obviously conscious", which gets us nowhere.
I mean, self motivation and a desire to not be turned off can be programmed into even decades old AIs.
[0] I lack a conscious experience and qualia
There is also apparently no real memory; if I tell it to stop doing something today, it’ll agree, then go back to doing it again tomorrow, with no memory of our conversation. This never changes, no matter how many times I ask.
Again we could debate consciousness forever, but in a simple sense, are there any other conscious beings without this sense of continuity? Not that I can think of. And so if everything we call “conscious” is different from an AI, then are we justified in extending it to AI?
I’d be curious about how you’re showing they lack either of those
Ruling out consciousness or qualia emerging from the inference in an LLM is just as invalid of a take as being 100% certain of its consciousness. We don’t know what consciousness really is, so only thing we can say with certainty is we do not know.
I think consciousness is not an abstract property in the world, therefore it’s tied to certain types of entities. Therefore an AI is not going to be “conscious” in the way an animal is, and never will be. This is a failing of specific language. Maybe the machines can be aware, input data, mimic what we see as consciousness, etc. but the metaphor of consciousness really doesn’t fit. A jet can move faster than an eagle but it’s not moving in the same way. We simply lack a sophisticated enough language to easily differentiate the two.
Unprompted they're not unlike a human sleeping or in a coma. Those states don't preclude consciousness in other states.
> I think consciousness is not an abstract property in the world, therefore it’s tied to certain types of entities. Therefore an AI is not going to be “conscious”
This pretty much sums up most arguments for why LLMs aren’t conscious: ”I think” followed by assertions. Only real argument is: science doesn’t quantify consciousness, we cannot quantify consciousness, let’s not assign so much certainty to models clearly exhibiting intelligence not being conscious in some way, to some degree.
I am making a linguistic argument. AI may get as sophisticated as "traditional" consciousness. But this is only "real" consciousness if you are a functionalist and think the output is all that matters.
I disagree and think that "flying" is just a weak generic word that describes both planes and birds, and not some kind of ultimate Platonic Ideal in the world.
Ditto for AI consciousness: it may develop to be as complex as traditional animal consciousness, but I'm not a functionalist, and think it's merely a lack of our sophisticated language that makes us think it's the same thing. It's not. Planes PlaneFly through the air, while birds BirdFly.
All I am saying is we should stop being so certain they are not conscious, since we lack a solid, quantifiable model of consciousness.