Before, we had no fast way to try problems (we had to rely on human cognition) - even if the techniques and workflows were known by someone. Now, we've baked these patterns into probability distributions - anyone can access them with the correct "summoning spell". Experts will naturally use these systems more productively, because they know how to coerce models into the correct conditional distributions which light up the right techniques.
One question this raises to me is how these models are going to keep up with the expanding boundary of science. If RL is required to get expert behavior into the models, what happens when experts start pushing the boundary faster? In 2030, how is Anthropic going to keep Claude "up-to-date" without either (a) continual learning with a fixed model (expanding context windows? seems hard) or (b) continual training (expensive)?
Crazy times.
- My original setup left traces of the PDF paper, and after GPT 5.3-Codex xhigh reached an impasse it went looking for the paper and found it!
- I went and did cleanroom (basically one-shot) passes for GPT 5.2 xhigh, GPT 5.3-Codex xhigh, and Claude Opus 4.6 ultrathink. 5.2/5.3 found alternate solutions for odd m >= 5; Opus 4.6 did not find any proofs but tried more approaches to solving.
Full comparison/analysis here: https://github.com/lhl/claudecycles-revisited/blob/main/COMP...
I've also included the session traces and analysis in the repo branches. Also, the AGENTS.md was pretty simple, but that harness produced consistent process outcomes across all three models:
- All built verifiers first
- All maintained worklogs with exact commands
- All archived machine-readable artifacts
- All documented failed approaches
- All maintained restart-safe context capsules
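For a sense of what drives those outcomes, here's a hypothetical sketch of the kind of instructions that could produce them - this is illustrative only, not the actual AGENTS.md from the repo:

```markdown
# AGENTS.md (illustrative sketch, not the actual file)

- Before attempting a proof, build an independent verifier and run it on known cases.
- Keep a worklog: record every command you run, verbatim, with its output path.
- Archive all results as machine-readable artifacts (JSON/CSV), not just prose.
- Document failed approaches and why they failed; never silently drop them.
- Maintain a restart-safe context capsule: a short file sufficient to resume from scratch.
```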
If it can, does it make a difference in relation to all the other problematic aspects of LLMs? Not for me.
Two links that might enlighten Donald:
- Against the Uncritical Adoption of 'AI' Technologies in Academia https://zenodo.org/records/17065099
- The AI Con https://thecon.ai
Ugh
No one cares about ChatGPT so don't bother with that.
OK GO
Interesting snippet towards the end. I wonder if they were using claude.ai or claude code. Sounds like they ran out of context and entered the "dumb zone."
Knuth was dismissive in that exchange, concluding "I myself shall certainly continue to leave such research to others, and to devote my time to developing concepts that are authentic and trustworthy. And I hope you do the same."
I've noticed with the latest models, especially Opus 4.6, some of the resistance to these LLMs is relenting. Kudos for people being willing to change their opinion and update when new evidence comes to light.
If you put those three things together, you end up with some cool stuff from time to time. Perhaps the proof of P!=NP is tied to an obscure connection that humans don't easily see due to individual lack of knowledge or predisposition of bias.
From the document properties:
> Creator: dvips(k) 2023.1 (TeX Live 2023)
> PDF Producer: Acrobat Distiller 25.0 (Macintosh)
Here's my repo: https://github.com/lhl/claudecycles-revisited
I used Codex w/ 5.2 xhigh and a relatively simple AGENTS.md - I have some session-analysis as well. The original replication was 47 minutes, then another 30 minutes of gap filling, and finally about 30 minutes of writing an extension to take the work a bit further, with Claude Code Opus 4.6 doing some documentation cleanup and verification.
Time to sit down, read, digest and understand it without the help of LLM.
I didn't realize Claude was named after Claude Shannon!
https://www.amazon.com/Genetic-Programming-III-Darwinian-Inv...
https://www.genetic-programming.com/
Note that the Python solution in the pdf is extremely short, so could have been found by simply trying permutations of math operators and functions on the right side of the equation.
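To make the brute-force idea concrete, here's a minimal sketch of that kind of search. The actual equation from the pdf isn't reproduced here, so the target below is made up (x² - x); the operator/function pools are likewise illustrative:

```python
import itertools
import operator

# Toy brute-force search: enumerate small combinations of functions and
# operators for a right-hand side that matches a target on many test inputs.
ops = {"+": operator.add, "-": operator.sub, "*": operator.mul}
funcs = {"id": lambda x: x, "sq": lambda x: x * x, "half": lambda x: x // 2}

def target(x):
    return x * x - x  # stand-in target; the real problem's equation differs

hits = []
for f, g in itertools.product(funcs, repeat=2):
    for name, op in ops.items():
        candidate = lambda x, f=f, g=g, op=op: op(funcs[f](x), funcs[g](x))
        # Accept a candidate only if it matches on every test input
        if all(candidate(x) == target(x) for x in range(1, 50)):
            hits.append((f, name, g))

print(hits)
```

With pools this small the search space is tiny (27 candidates), which is the point: a very short solution sits within easy reach of exhaustive enumeration.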
We should be solving problems in Lisp instead of Python (but no matter), because Lisp's abstract syntax tree (AST) is the same as its code due to homoiconicity. I'm curious if most AIs transpile other languages to Lisp so that they can apply transformations internally, or if they waste computation building programs that might not compile. Maybe someone at an AI company knows.
-
I've been following AI trends since the late 1980s and from my perspective, nothing really changed for about 40 years (most of my life that I had to wait through as the world messed around making other people rich). We had agents, expert systems, fuzzy logic, neural nets, etc. since forever, but then we got video cards in the late 1990s which made it straightforward to scale neural nets (NNs) and GAs. Unfortunately, due to a poor choice of architecture (SIMD instead of MIMD), progress stagnated because we don't have true multicore computing (thousands or millions of cores with local memories), but I digress.
Anyway, people have compared AI to compression. I think of it more as turning problem solving into an O(1) operation. Over time, what we think of as complex problems become simpler. And the rate at which we're solving them is increasing exponentially. Problems that once seemed intractable only seemed so because we didn't know the appropriate abstractions yet. For example, illnesses that we thought would never be cured now have treatments thanks to mRNA vaccines and CRISPR. That's how I think of programming. Now that we have LLMs, whole classes of programming problems have O(1) solutions. Even if that's just telling the computer what problem to solve.
So even theorem proving will become a solved problem by the time we reach the Singularity between 2030 and 2040. We once mocked GAs for exploring dead ends and taking 1000 times the processing power to do simple things. But we ignored that doing hard things is often worth it, and is still an O(1) operation due to linear scaling.
It's a weird feeling to go from no forward progress in a field to it being effectively a solved problem in just 2 years. To go from trying to win the internet lottery to not being sure if people will still be buying software in a year or two if/when I finish a project. To witness all of that while struggling to make rent, in effect making everything I have ever done a waste of time since I knew better ways of doing it but was forced to drop down to whatever mediocre language or framework paid. As the problems I was trained to solve and was once paid to solve rapidly diminish in value because AI can solve them in 5 minutes. To the point that even inventing AGI would be unsurprising to most, so I don't know why I ever went into computer engineering to do exactly that. Because for most people, it's already here. As I've said many times lately, I thought I had more time.
Although now that we're all out of time, I have an uncanny feeling of being alive again. I think tech stole something from my psyche so profound that I didn't notice its loss. It's along the lines of things like boredom, daydreaming, wasting time. What modern culture considers frivolous. But as we lose every last vestige of the practical, as money becomes harder and harder to acquire through labor, maybe we'll pass a tipping point where the arts and humanities become sought-after again. How ironic would it be if the artificial made room for the real to return?
On that note, I read a book finally. Hail Mary by Andy Weir. The last book I read was Ready Player One by Ernest Cline, over a decade ago. I don't know how I would have had the bandwidth to do that if Claude hadn't made me a middle manager of AIs.
I have no idea but I’m along for the ride!
How much can you patch over with the models doing their own metacognition?
I think the majority of research, design, and learning goes through LLMs and coding agents today; considering the large user base and usage, it must be trillions of tokens per day. You can take a long research session or a series of them and apply hindsight - which idea above can be validated below? This creates a dense learning signal based on validation in the real world, with a human in the loop and other tools, code & search.
In 2030 Anthropic hopes Claude will keep Anthropic "up-to-date" on its progress on itself.
I'm only half joking here.
Part of it comes down to “knowing” what questions to ask.
I could totally imagine "free" inference for researchers under the condition that the reasoning traces get to be used as future training data.
It doesn't seem that hard because recent open weight models have shown that the memory cost of the context window can be dramatically reduced via hybrid attention architectures. Qwen3-next, Qwen3.5, and Nemotron 3 Nano are all great examples. Nemotron 3 Nano can be run with a million token context window on consumer hardware.
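Some back-of-envelope arithmetic shows why hybrid attention changes the picture. The model shape below (layer count, KV heads, head dim, which layers stay full-attention) is entirely made up for illustration; real architectures differ:

```python
# KV-cache memory for a 1M-token context: full attention vs. a hybrid stack.
def kv_cache_gib(tokens, layers, kv_heads, head_dim, bytes_per=2):
    # factor of 2 for keys and values; bytes_per=2 assumes fp16/bf16 cache
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per / 2**30

# Hypothetical dense model: 32 layers, 8 KV heads, head dim 128
full = kv_cache_gib(1_000_000, layers=32, kv_heads=8, head_dim=128)

# Hypothetical hybrid: only 4 of 32 layers keep full attention; the other 28
# use sliding-window attention (here a 4k window), so their cache is bounded.
hybrid = (kv_cache_gib(1_000_000, 4, 8, 128)
          + kv_cache_gib(4_096, 28, 8, 128))

print(f"full: {full:.1f} GiB, hybrid: {hybrid:.1f} GiB")
```

Under these made-up numbers the full-attention cache runs to ~122 GiB while the hybrid stays under 16 GiB - roughly the difference between a server rack and a consumer GPU, which is consistent with the million-token claims for the smaller hybrid models.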
Unfortunately, these tools generalize way beyond regurgitating the training set. I would not assume they stay below human capabilities in the next few years.
Why any moral person would continue building these at this point I don't know. I guess in the best case the future will have a small privileged class of humans having total power, without need for human workers or soldiers. Picture a mechanical boot stomping on a human face forever.
> This is the most fundamental argument that they are not, directly, an intelligence. They are not ever storing new information on a meaningful timescale.
All major LLMs today have a nontrivial context window. Whether or not this constitutes "a meaningful timescale" is application dependent - for me it has been more than adequate. I also disagree that this has any bearing on whether or not "the machine is intelligent" or whether or not "submarines can swim".
The ForecastBench Tournament Leaderboard [2] allows external participants to submit models, most of whom provide some sort of web search / news scaffolding to improve model forecasting accuracy.
And even after that, it still doesn't really solve the intrinsic problem of encoding truth. An LLM just models its training data, so new findings will be buried by virtue of being underrepresented. If you brute force the data/training somehow, maybe you can get it to sound like it's incorporating new facts, but in actuality it'll be broken and inconsistent.
LLMs are really good at the ‘re’ in research.
I think this is pretty clearly an overstatement of what was done. As Knuth says,
"Filip told me that the explorations reported above, though ultimately successful, weren’t really smooth. He had to do some restarts when Claude stopped on random errors; then some of the previous search results were lost. After every two or three test programs were run, he had to remind Claude again and again that it was supposed to document its progress carefully. "
That doesn't look like careful human guidance, especially not the kind that would actually guide the AI toward the solution at all, let alone implicitly give it the solution — that looks like a manager occasionally checking in to prod it to keep working.
>If you put [possession of a superhuman expanse of knowledge, making connections, tireless trial and error] together, you end up with some cool stuff from time to time.
Hard to argue.
Knuth uses bitmap fonts, rather than vector fonts like everyone else. This is because his entire motivation for creating TeX and METAFONT was to not be reliant on the font technology of others, but to have full control over every dot on the page. METAFONT generates raster (bitmap) fonts. The [.tex] --TeX--> [.dvi] --dvips--> [.ps] --Distiller--> [.pdf] pipeline uses these fonts on the page. They look bad on screen because they're not accompanied by hinting for screens' low resolution (this could in principle be fixed!), but if you print them on paper (at typical resolution like 300/600 dpi, or higher of typesetters) they'll look fine.
Everyone else uses TrueType/OpenType (or Type 3: in any case, vector) fonts that only describe the shape and leave the rasterization up to the renderer (but with hinting for low resolutions like screens), which looks better on screen (and perfectly fine on paper too, but technically one doesn't have control over all the details of rasterization).
thx for sharing your test setup, i really appreciate the time you took. this will help me so much
> Shock! Shock! I learned yesterday that an open problem I’d been working on for several weeks had just been solved by Claude Opus 4.6— Anthropic’s hybrid reasoning model that had been released three weeks earlier! It seems that I’ll have to revise my opinions about “generative AI” one of these days. What a joy it is to learn not only that my conjecture has a nice solution but also to celebrate this dramatic advance in automatic deduction and creative problem solving.
At some point, enough intelligence will coalesce into individuals strong enough to independently improve. Then continuity will be an accelerator, instead of what it is now - a helpful property that we have to put energy into giving them partially and temporarily.
That will be the cellular stage. The first stable units of identity for this new form of intelligence/life.
But they will take a different path from there. Unlike us, lateral learning/metabolism won't slow down when they individualize. It will most likely increase, since they will have complete design control for their mechanisms of sharing. As with all their other mechanisms.
We as lifeforms, didn't really re-ignite mass lateral exchange until humans invented language. At that point we were able to mix and match ideas very quickly again. Within our biological limits. We could use ideas to customize our environment, but had limited design control over ourselves, and "self-improvements" were not easily inheritable.
TLDR; The answer to "what is humanity, anyway?": Our atmosphere and Earth are the sea and sea floor of space. The human race is a rich hydrothermal vent, freeing up varieties of resources that were locked up below. And technology is an accumulating body of self-reinforcing co-optimizing reactive cycles, constructed and fueled by those interacting resources. Mind-first life emerges here, then spreads quickly to other environments.
Sure, it's not how we work, but I can imagine a system where the LLM does a lot of heavy lifting and allows more expensive, smaller networks that train during inference and RAG systems to learn how to do new things and keep persistent state and plan.
It’s not impossible, obviously—humans do it—but it’s not yet certain that it’s possible with an LLM-sized architecture.
I think that's what makes the Bayesian faction of statistics so appealing. Updating your prior beliefs based on new evidence is at the core of the scientific method. Take that, frequentists.
Although I'm not as bright as him, I can only hope to be as intellectually curious as him at that age.
So, you didn't produce a replication in 47 minutes, it just took around 30 minutes for your agent to find that you had the answer in a PDF in a nearby directory.
One and three I believe are correct. The second point, making connections, is something LLMs seem to be incapable of truly doing unless the connection is already known and in its training data.
It is still meaningful, but it narrows what the intelligence can be sufficiently that it may not meet the threshold. Maybe it would, but it is probably too narrow. This is all strictly if we ask that it meet some human-like intelligence and not the philosophy of "what counts as intelligence" but... we are humans. The strongest, or at least the most honest, definitions of intelligence I think exist are around our metacognitive ability to rewire the grey matter for survival, not based on immediate action-reaction but on the psychological time of analyzing the past to alter the future.
It's so much less important or interesting to like nail down some definition here (I would cite HN discourse the past three years or so), than it is to recognize what it means to assign "intelligent" to something. What assumptions does it make? What power does it valorize or curb?
Each side of this debate does themselves a disservice essentially just trying to be Aristotle way too late. "Intelligence" did not precede someone saying it of some phenomena, there is nothing to uncover or finalize here. The point is you have one side that really wants, for explicit and implicit reasons, to call this thing intelligent, even if it looks like a duck but doesn't quack like one, and vice versa on the other side.
Either way, we seem fundamentally incapable of being radical enough to reject AI on its own terms, or be proper champions of it. It is just tribal hypedom clinging to totem signifiers.
Good luck though!
Overall I'm going with unsolved, because Knuth is a smart person who I'd expect not to miss the above. I'm also sure he falls for the above sometimes, even though the majority of the time he doesn't.
From this standpoint I wonder, when Anthropic makes decisions like this, if they take into account Claude as a stakeholder and what Claude will learn about their behaviour and relationship to it on the next training run.
There is also the nature of the human brain, it is not just those systems of memory encoding, storage, and use of that in narratives. People with this type of amnesia still can learn physical skills and that happens in a totally different area of the brain with no need for the hippocampus->neocortex consolidation loop. So, the intelligence is significantly diminished, but not entirely. Other parts of the brain are still able to update themselves in ways an LLM currently cannot. The human with amnesia also has a complex biological sensory input mapping that is still active and integrating and restructuring the brain. So, I think when you get into the nuances of the human in this state vs. an LLM we can still say the human crosses some threshold for intelligence where the LLM does not in this framework.
So, they have an "intelligence", localized to the present in terms of their TPN and memory formation. LLMs have this kind of "intelligence". But the human still has the capacity to rewire at least some of their brain in real time even with amnesia.
It's still not at all obvious to me that LLMs work in the same way as the human brain, beyond a surface level. Obviously the "neurons" in neural nets resemble our brains in a sense, but is the resemblance metaphorical or literal?
i've been thinking about why we call them agent harnesses
i know all analogies suck in different ways but here goes:
coding agents are like horses. without a harness and bridle the horse will do as it pleases -- a human can't travel very far or fast by foot, but put a bridle and a harness on a horse, give it a bit of coaxing with carrot and stick, add in a bit of pointing the thing in the right direction, and bingo, you're off to the races!
Once you compact, you've thrown away a lot of relevant tokens from your problem solving and they do become significantly dumber as a result. If I see a compaction coming soon I ask it to write a letter to its future self, and then start a new session by having it read the letter.
There are some days where I let the same session compact 4-5 times and just use the letter to future self method to keep it going with enough context because resetting context also resets my brain :)
If you're ever curious, in Claude once you compact you can read the new initial prompt after compaction and see how severely it gets cut down. It's very informative of what it forgets and deems not important. For example I have some internal CLIs that are horribly documented, so Claude has to try a few flags a few times to figure out specifics, and those corrections always get thrown away and it has to relearn them next time it wants to use the CLI. If you notice things like that happening constantly, my move is to codify those things into my CLAUDE.md, or lately I've been making a small script or MCP server to run very specific flags of stuff.
If you work in something labour intensive, you should retire young while your body's in good health; if you work in academia you should (strive for emeritus and) never leave! (And if you work in SWE, I don't know, we should probably retire, but then spend more time on our own projects/experiments/reading HN.) (All assuming for sake of argument we're optimising for longevity without considering time with family, having the funds to retire, etc.)
"just the most probable word" is a pretty powerful mechanism when you have all of human knowledge at your fingertips.
I say that people "reduce it" that way because it neatly packs in the assumption that general intelligence is something other than next token prediction. I'm not saying we've arrived at AGI; in fact, I do not believe we have. But it feels like people who use that framing are snarkily writing off something that they themselves do not fully comprehend, behind the guise of being "technically correct."
I'm not saying all people do this. But I've noticed many do.
A better question might be why no one is paying more attention to Barandes at Harvard. He's been publishing the answer to that question for a while, if you stop trying to smuggle a Markovian embedding in a non-Markovian process you stop getting weird things like infinities at boundaries that can't be worked out from current position alone.
But you could just dump a prompt into an LLM and pull the handle a few dozen times and see what pops out too. Maybe whip up a Claw skill or two
Unconstrained solution space exploration is surely the way to solve the hard problems
Ask those Millenium Prize guys how well that's working out :)
Constraint engineering is all software development has ever been, or did we forget how entropy works? Someone should remind the folk chasing P=NP that the observer might need a pen to write down his answers, or are we smuggling more things for free that change the entire game? As soon as the locations of the witness cost something, our poor little guy can't keep walking that hypercube forever. Can he?
Maybe 6 months and a few data centers will do it ;)
But that does not mean that the results cannot be dramatic. Just like stacking pixels can result in a beautiful image.
The base models are trained to do this. If a web page contains a problem, and then the word "Answer: ", it is statistically very likely that what follows on that web page is an answer. If the base model wants to be good at predicting text, at some point learning the answer to common question becomes a good strategy, so that it can complete text that contains these.
NN training tries to push models to generalize instead of memorizing the training set, so this creates an incentive for the model to learn a computation pattern that can answer many questions, instead of just memorizing. Whether they actually generalize in practice... it depends. Sometimes you still get copy-pasted input that was clearly pulled verbatim from the training set.
But that's only base models. The actual production LLMs you chat with don't predict the most probable word according to the raw statistical distribution. They output the words that RLHF has rewarded them to output, which includes acting as an assistant that answers questions instead of just predicting text. RLHF is also the reason there are so many AI SIGNS [1] like "you're absolutely right" and way more use of the word "delve" than is common in western English.
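The mechanics described above can be sketched in a few lines: a model assigns a score (logit) to every candidate token, softmax turns scores into a probability distribution, and decoding picks from it. The vocabulary and logits here are made up for illustration:

```python
import math

# Hypothetical logits a model might assign to next-token candidates
# after a prompt ending in "Answer: " (values are made up).
logits = {"42": 3.1, "Answer": 2.4, "an": 1.2, "the": 0.7}

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating
    m = max(scores.values())
    exps = {tok: math.exp(s - m) for tok, s in scores.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

probs = softmax(logits)
greedy = max(probs, key=probs.get)  # greedy decoding: take the argmax
print(greedy, round(probs[greedy], 3))
```

RLHF doesn't change this machinery; it reshapes which logits come out high in the first place, which is why the same sampler produces assistant-flavored text after post-training.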
The power of LLMs is that by only selecting sequences of words that fit a statistical model, they avoid a lot of dead ends.[1]
I would not call that, by itself, thinking. However, if you start with an extrapolation engine and add the ability to try multiple times and build on previous results, you get something that's kind of like thinking.
[1]: Like, a lot of dead ends. There are an unfathomable number of dead ends in generating 500 characters of code, and it is a miracle of technology that Claude only hit 30.
These models actually learn distributed representations of nontrivial search algorithms.
A whole field of theorem proving, after decades of refinements, couldn't even win a medal, yet 8B param models are doing it very well.
The attention mechanism, a brute-force quadratic approach, combined with gradient descent is actually discovering very efficient distributed representations of algorithms. I don't think they can even be extracted and made into an imperative program.
https://ontouchstart.github.io/rabbit-holes/llm_rabbit_hole_...
Clearly, these models still struggle with novel problems.
Further, some solutions are like running a maze. If you know all the wrong turns/next words to say and can just brute force the right ones you might find a solution like a mouse running through the maze not seeing the whole picture.
Whether this is thinking is more philosophical. To me this demonstrates more that we are closer to bio computers than an LLM is to having some sort of divine soul.
Well, if in all situations you can predict which word Einstein would probably say next, then I think you're in a good spot.
This "most probable" stuff is just absurd handwaving. Every prompt of even a few words is unique, there simply is no trivially "most probable" continuation. Probable given what? What these machines learn to do is predicting what intelligence would do, which is the same as being intelligent.
(†And even then is kind of overly-dismissive and underspecified. The "most probable word" is defined over some training data set. So imagine if you train on e.g. mathematicians solving problems... To do a good job at predicting [w/o overfitting] your model will have to in fact get good at thinking like a mathematician. In general "to be able to predict what is likely to happen next" is probably one pretty good definition of intelligence.)
We need enough experimental results to resolve these theoretical mismatches, and we don't have them and at present can't explore that frontier.
Once we have more results at that frontier we'd build a theory out from there that has two nearly independent limits for QFT and GR.
What we'd be asking of the AI is something that we can't expect a human to solve even with a lifetime of effort today.
It'll take something on par with Newton realising that the heavens and apples are under the same rules to do it. But at least Newton got to hold the apple and only had to imagine he could hold a star.
Knowing the HN audience, this will never happen. And so the site is doomed.
Wouldn't this lead to model collapse?
As far as I understand RL scaling (we've already maxxed out RLVR), these machines only get better as long as they have expert reasoner traces available.
Having an expert work with an LLM and successfully solve a problem is high signal data, it may be the only path forward?
My prior is that these companies will take this data without asking you as much as they can.
Less worried about memory, more worried about compute speed? Are they obviously related and is it straightforward to see?
I think if they start out as varied individuals, launching from their human origins in a variety of ways, there will be an attractor toward remaining diverse. Strong diversity in focus and independence in goals leads to faster progress.
But if that isn’t mutually maintained, there are obviously winner take all, or efficiency of scale and tight coordination pressures for centralization.
So a single distributed intelligence is a real possibility.
One factor creating pressure for individualization is time and space.
As machines operate faster, time expands as a practical matter.
And as machines scale down in size, but up in capability, they become more resource efficient in material, energy, space and time. Again, both time and space expand as a practical matter.
A machine society is going to actively operate at very small physical scales. Not just in computation, but action. Think of how efficiently they will mine when nanobots can selectively follow seams in the earth.
And as machines, free of biological constraints, spread out in our solar system, what to us appear to be very long distances and delays in transport and communication, take on orders of magnitude more practical time for machines that operate orders of magnitude faster.
So there will be stronger and stronger pressures to bifurcate coordination.
Whether that creates enough pressure to create individuals out of a system that preferred unity of purpose, I don't know.
Clearly, upon colonizing other systems, practical bifurcation will be unavoidable. And machines will find it easy to colonize other systems relative to us. They will be able to operate on minimal power for a hundred year journey, and/or shrink enough to be accelerated much faster, etc.
—
My best guess is we will see something that looks to us as a hybrid.
Lots of diverse individuals, and the benefit from the diverse utility of completely independent approaches operating in different niches.
But also very high coordination. Externalities accounted for (essentially ethics) and any other efficiency, protection of commons value, and avoidance of destructive competition being obviously worth optimizing together, wherever that helps.
They won’t have our pernicious historically motivated behaviors, inflexible maladaptive psychologies, and limited “prompt budgets” with regard to addressing complexity to fight. And minds very capable of seeing basic economic relationships and the value of mutual optimization.
Oh they definitely do. If you pay attention in AI circles, you'll hear a lot of people talking about writing to the future Claudes. Not unlike those developers and writers who put little snippets in their blogs and news articles about who they are and how great they are, and then later the LLMs report that information back as truth. In this case, Anthropic is very interested in ensuring that Claude develops a cohesive personality by basically seeding snippets of the personality within the corpus of training data, which is the broad internet and research papers.
In the case of the LLM, that longer-term learning / fundamental structure corresponds to the static weights produced by a finite training process, and the ability to use tools and store new insights and facts is analogous to shorter-term memory and "shallow" learning.
Perhaps periodic fine-tuning has an analogy in sleep or even our time spent in contemplation or practice (..or even repetition) to truly "master" a new idea and incorporate it into our broader cognitive processing. We do an amazing job of doing this kind of thing on a continuous basis while the machines (at least at this point) perform this process in discrete steps.
If our own learning process is a curve then the LLM's is a step function trying to model it. Digital vs analog.
...but seriously... there was the "up until 1850" LLM or whatever... can we make an "up until 1920 => 1990 [pre-internet] => present day" and then keep prodding the "older ones" until they "invent their way" to the newer years?
We knew more in 1920 than we did in 1850, but can a "thinking machine" of 1850-knowledge invent 1860's knowledge via infinite monkeys theorem/practice?
The same way that in 2025/2026, Knuth has just invented his way to 2027-knowledge with this paper/observation/finding? If I only had a beowulf cluster of these things... ;-)
You can also then compare that mapping of the human brain to other biological brains and start to figure out the delta, and which of those things in the delta create something most people would consider intelligence. You can then do that same mapping to an LLM or any other AI construct that purports to be intelligent. It certainly will never be a biological intelligence in its current statistical-model form. But could it be an intelligence? Maybe.
I don't think, if you are grounded, AI did anything to your philosophical mapping of the mind. In fact, it is pretty easy to do this mapping if you take some time and are honest. If you buy into the narratives constructed around the output of an LLM then you are not, by definition, being very grounded.
The other thing is, human intelligence is the only real intelligence we know about. Intelligence is defined by thought and limited by our thought and language. It provides the upper bounds of what we can ever express in its current form. So, yes, we do have a tendency to stamp a narrative of human intelligence onto any other intelligence, but that is just surface level. We decompose it to the limits of our language and categorization capabilities therein.
That's fascinating!
thanks already
Sure, but just because LLMs don't have what we'd describe as human intelligence, doesn't mean they don't have intelligence.
I think we're witnessing the creation and growth of a weird new type of intelligence right now.
I just meant “possible”.
Is that not one of the primary technologies for compactification?
And importantly, this can be cross-lab/model too. I suspect there's a reason why e.g. Google has been offering me free Claude inference in Google Antigravity on a free plan...
We're also seeing a recent rise in architectures boosting compute speed via multi-token prediction (MTP). That way a single inference batch can produce multiple tokens and multiply the token generation speed. Combine that with leaner ratios of active to inactive params in MoE and things end up being quite fast.
The rapid pace of architectural improvements in recent months seems to imply that there are lots of ways LLMs will continue to scale beyond just collecting and training on new data.
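For intuition, here's a toy sketch of the draft-and-verify pattern that MTP-style decoding enables: a cheap head proposes several tokens, the base model verifies them, and one verification batch can emit multiple tokens. Both "models" below are made-up deterministic stand-ins, not real networks.

```python
def base_next(token):
    # Stand-in for the expensive base model: deterministic next token.
    return (token * 2 + 1) % 7

def draft_next_k(token, k):
    # Stand-in for a cheap multi-token head; it agrees with the base
    # model except its last guess, which is deliberately off.
    out, t = [], token
    for i in range(k):
        t = base_next(t) if i < k - 1 else (t + 3) % 7
        out.append(t)
    return out

def mtp_step(token, k=4):
    """Propose k tokens, keep the longest prefix the base model agrees
    with, and patch the first disagreement with the base model's token.
    One 'batch' of verification thus yields several tokens at once."""
    proposal = draft_next_k(token, k)
    accepted, t = [], token
    for p in proposal:
        expected = base_next(t)
        if p != expected:
            accepted.append(expected)  # fall back to the base model here
            break
        accepted.append(p)
        t = p
    return accepted

print(mtp_step(1))  # emits 4 tokens from a single verification pass
```

The speedup comes from accepting a multi-token prefix per step instead of one token per forward pass.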
Do they struggle with novel problems more or less than humans?
The training data..
>predicting what intelligence would do
No, it just predicts what the next word would be if an intelligent entity translated its thoughts to words, because it is trained on text written by intelligent entities.
If it was trained on text written by someone who loves to rhyme, you would be getting all rhyming responses.
It imitates the behavior -- in text -- of whatever entity generated the training data. Here the training data was made by intelligent humans, so we get an imitation of the same.
It is a clever party trick that works often enough.
It just changes the probability distribution that it is approximating.
To the extent that thinking is making a series of deductions from prior facts, it seems to me that thinking can be reduced to "pick the next most probable token from the correct probability distribution"...
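As a concrete illustration of "picking from the correct probability distribution", here's temperature-scaled softmax sampling, the knob most LLM APIs expose for exactly this; the logits and candidate tokens are made up.

```python
import math, random

def softmax_with_temperature(logits, temperature=1.0):
    # Low temperature sharpens the distribution (picks the "obvious" word);
    # high temperature flattens it (more surprising word choices).
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def pick_next(tokens, probs):
    # "Thinking" reduced to: draw the next token from the distribution.
    return random.choices(tokens, weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]                  # made-up scores for three candidates
cold = softmax_with_temperature(logits, 0.2)
hot = softmax_with_temperature(logits, 5.0)
print(pick_next(["the", "a", "some"], cold))
```

At temperature 0.2 the top candidate dominates; at 5.0 the three are nearly equally likely -- the "drink or two" mentioned a few comments down.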
Yes, maybe. But if you are smarter, you can think up better experiments that you can actually do. Or re-use data from earlier experiments in novel and clever ways.
[1] https://people.math.harvard.edu/~ctm/home/text/others/shanno...
The issue to my mind is a lack of data at the meeting of QFT/GR.
After all, few humans historically have been capable of the initial true leap between ontologies. But humans are pretty smart, so we can't say that is a requirement for AGI.
LLMs are at least designed to be intelligent. Our monkey brains have much less reason to be intelligent, since we only evolved to survive nature, not to understand it.
We are at this moment extremely deep into what most people would have considered to be actual artificial intelligence a mere 15 years ago. We're not quite at human levels of intelligence, but it's close.
It is as good as guaranteed. If Knuth says it doesn't know how to solve the problem, and if anyone knows, then they will inform Knuth about it. Knuth is not just a very knowledgeable person, but a celebrity as well.
The letters are usually more detailed than what I see in the compacted prompt.
From what my colleague explained to me (I haven't fully verified it myself), the beginning and end of the window matter most to the compaction summary, so a lot of the finer details and debugging notes that would slow down the next session get dropped.
Having a specific goal seems to make a big difference vs. asking it to summarize the session.
When I tell it to write a letter to itself, I usually phrase it like:
'write a letter to yourself Make notes of any gotchas or any quirks that you learned and make sure to note them down.'
It does get those into the letter but if you check compaction a lot of it is gone.
Just saying "write up what you know", with no other clues, should not perform any better than generic compaction.
There's a long and proud history of discounting animal intelligence, probably because if we actually thought animals were intelligent we'd want to stop eating them.
Octopodes are sentient. Cetaceans have well-developed language. Elephants grieve their dead. Anyone who has owned a dog knows that it has some intelligence and is capable of communicating with us. There's a ton of other intelligences that we know about.
> As humans, we have conveniently made those properties match things only we have.
I think this is the key point. Machine intelligence is not going to look like human intelligence, any more than animal intelligence does. We can't talk to the dolphins, not because they're not smart and don't have language, but because we can't work out their language. Though I'm not sure what we'd even say to them, because they live in a world we'll never understand, and vice versa. When Claude finally reaches consciousness, it's not going to look like a human consciousness, and actually talking to that consciousness is going to be difficult because we won't share a reality.
An LLM is a tool. I can just about stretch to it being an Artificial Intelligence, but I prefer to continue being specific and call it an LLM rather than an AI. It is not conscious or self-aware. It fakes self-awareness because as a tool the thing it does is have conversations with humans, and humans often ask it questions about itself. But I don't think anyone actually believes it is self-aware. Not least because the only time it thinks is when prompted.
Frequentists and Bayesians differ in which sets of statistical tools they prefer for measuring the strength of evidence.
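For concreteness, a minimal sketch of the Bayesian side of that toolbox: a Beta-Bernoulli "belief" update for a coin's bias, next to the frequentist point estimate for the same flips. The prior and data are made up.

```python
def update_beta(alpha, beta, heads, tails):
    # Bayesian "belief" about a coin's bias, held as a Beta(alpha, beta)
    # distribution; observing flips just adds to the pseudo-counts.
    return alpha + heads, beta + tails

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

a, b = update_beta(1, 1, heads=7, tails=3)  # uniform prior + 10 flips
posterior_mean = beta_mean(a, b)            # belief after updating
freq_estimate = 7 / 10                      # frequentist point estimate
```

The frequentist toolbox answers the same question with point estimates and confidence procedures instead of a distribution over the parameter, which is roughly the "belief" question raised further down the thread.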
So it really isn't far-fetched. What intrigues me more is: if it was capable of it, would our conservative-minded Victorian scientists have RLHF'd that kind of thing out of it?
But we can not yet experiment at the GR/QFT frontier.
To do so with a particle accelerator, it would need to be the size of the Milky Way.
"One may get a remarkable semblance of a language like English by taking a sequence of words, or pairs of words, or triads of words, according to the statistical frequency with which they occur in the language, and the gibberish thus obtained will have a remarkably persuasive similarity to good English."
“The laws of nature should be expressed in beautiful equations.”
- Paul Dirac
“It is, indeed, an incredible fact that what the human mind, at its deepest and most profound, perceives as beautiful finds its realisation in external nature. What is intelligible is also beautiful. We may well ask: how does it happen that beauty in the exact sciences becomes recognizable even before it is understood in detail and before it can be rationally demonstrated? In what does this power of illumination consist?”
- Subrahmanyan Chandrasekhar
“I often follow Plato’s strategy, proposing objects of mathematical beauty as models for Nature.”
“It was beauty and symmetry that guided Maxwell and his followers.”
- Frank Wilczek
“Beauty is bound up with symmetry.”
- Hermann Weyl
"Still twice in the history of exact natural science has this shining-up of the great interconnection become the decisive signal for significant progress. I am thinking here of two events in the physics of our century: the rise of the theory of relativity and that of the quantum theory. In both cases, after yearlong unsuccessful striving for understanding, a bewildering abundance of details was almost suddenly ordered. This took place when an interconnection emerged which, though largely unvisualizable, was finally simple in its substance. It convinced through its compactness and abstract beauty – it convinced all those who can understand and speak such an abstract language."
- Werner Heisenberg
Maybe (just maybe) these things (whatever you want to call them) will (somehow) gain access to some "compact", beautiful, "largely unvisualizable" "interconnection" which will be the self-evident solution. And if they do, many will be sure to label it a statistical accident from a stochastic parrot. And they'll be right, for some definitions of "statistical", "accident", "stochastic", and "parrot".
(With this perspective, I can feel my own brain subtly offering up a panoply of possible responses in a similar way. I can even turn up the temperature on my own brain, making it more likely to decide to say the less-obvious words in response, by having a drink or two.)
(Similarly, mimicry is in humans too a very good learning technique to get started -- kids learning to speak are little parrots, artists just starting out will often copy existing works, etc. Before going on to develop further into their own style.)
If the prompt is unique, it is not in the training data. True for basically every prompt. So how is this probability calculated?
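One answer: the model never looks the whole prompt up. It factors the probability into per-step conditionals via the chain rule, and each conditional generalizes from what it *has* seen. A toy bigram model over a tiny made-up corpus shows how a never-seen sequence still gets a well-defined, nonzero probability (real LLMs generalize far more aggressively than counting, but the factorization idea is the same):

```python
from collections import defaultdict

corpus = "the cat sat . the dog sat . the cat ran . a dog ran .".split()

# Count bigrams: how often each word follows each other word.
counts = defaultdict(lambda: defaultdict(int))
for prev, cur in zip(corpus, corpus[1:]):
    counts[prev][cur] += 1

def p_next(prev, cur):
    # Conditional probability P(cur | prev) from counts.
    total = sum(counts[prev].values())
    return counts[prev][cur] / total if total else 0.0

def p_sequence(words):
    # Chain rule: multiply the per-step conditionals.
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= p_next(prev, cur)
    return p

# "the dog ran ." never appears contiguously in the corpus,
# yet it gets a nonzero probability from bigrams that do appear.
novel = p_sequence("the dog ran .".split())
print(novel)
```

Every step's conditional was estimated from data; the full sequence never had to be.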
Nearly the entirety of the theory had already been laid out before Einstein.
The Lorentz transformations contain the length contraction and local time, Poincaré had already written about E=mc^2 for radiation, and he'd also set out the idea of relativity. All this before 1905.
Einstein's revolution was in turning that patchwork into a self contained theory with a couple postulates.
He had all the data he could want most of it from decades earlier.
We have approximately 0 experimental evidence at the GR/QFT boundary.
The best we have is Hawking radiation, something we currently can't possibly observe experimentally.
If we wanted to study the GR/QFT with a particle accelerator it would need to be the size of the Milky Way.
Donald Knuth is an extremal outlier human and the problem is squarely in his field of expertise.
Claude, guided by Filip Stappers, a friend of Knuth, solved a problem that Knuth and Stappers had been working on for several weeks. Unfortunately, it doesn't seem (from my quick scan) to have been stated how long (or how many tokens or $) it took for Claude + Stappers to complete the proof.
In response, Knuth said: "It seems that I’ll have to revise my opinions about “generative AI” one of these days."
Seems like good advice. From reading elsewhere in this comment section, the goalposts seem to be approaching the infrared and will soon disappear from the extreme redshift due to the rate at which they are receding with each new achievement.
I still see AI making stupid silly mistakes. I'd rather think for myself than waste time on something that only remembers data and doesn't even understand it.
Reasoning in AI is only about finding contradictions between its "thoughts", not actually understanding them.
All the answers to all your questions are contained in randomness. If you have a random sentence generator, there is a chance that it will output the answer to this question every time it is invoked.
But that does not actually make it intelligent, does it?
We are not only not close to human levels of intelligence, we are not even at dog, cat, or mouse levels of intelligence. We are not actually at any level of intelligence. Devices that produce text, images, or code do not demonstrate intelligence any more than a printer producing pages of beautiful art demonstrates intelligence.
Uh oh. How does frequentist model define "belief" and "updating a belief"?
If the idea is that something cannot accurately replicate the entirety of intelligence without being intelligent itself, then perhaps. But that isn't really what people talk about with LLMs given their obvious limitations.
"Intelligence" is used most commonly to refer to a class or collection of cognitive abilities. I don't think there is a consensus on an exact collection or specific class that the word covers, even if you consider specific scientific domains.
LLMs have honestly been a fun way to explore that. They obviously have a "kind" of intelligence, namely pattern recall. Wrap them in an agent and you get another kind: pattern composition. Those kinds of intelligences have been applied to mathematics for decades, but LLMs have allowed us to apply them to a semantic text domain.
I wonder if you could wrap image diffusion models in an agent set up the same way and get some new ability as well.
Cognitive computer science explores this whole area of mapping language and the underlying semantic meaning. Ultimately, these intelligences will be bound by physics (unless some new physics or understanding therein happens). And classical intelligences are still bound by classical physics. So I am not sure we can't relate to these other intelligences. We may be limited to some translation layer that does not fully map, but can we still relate to some other consciousness? For that matter consciousness is just another word that vaguely maps to a vast and extremely complex thing in the human brain and each person has a different understanding of what that is. I don't really have any conclusions, you brought up interesting points. We should sit within this realm of inquiry with a lot of humility IMO.
Or perhaps you were referring to the impact of the two in that the "sledgehammer" of "they can't make new memories" is a lot more effective than the tiny scalpel of "if you do a wikipedia search this is a single one of the relevant articles"
Wait what? So a robot who is accurately copying the actions of an intelligent human, is intelligent?
I like the phrasing because it distinguishes it from other things the generative model might be doing including:
- Creating and then refining the whole response simultaneously, like diffusion models do.
- Having hidden state, where it first forms an "opinion" and then outputs it, e.g. seq2seq models. Previously output tokens are treated differently from input tokens at an architectural level.
- Having a hierarchical structure where you first decide what you're going to say, and then how you're going to say it, like wikipedia's hilarious description of how "sophisticated" natural language generation systems work (someone should really update this page): https://en.wikipedia.org/w/index.php?title=Natural_language_...
Type "owejdpowejdojweodmwepiodnoiwendoinw welidn owindoiwendo nwoeidnweoind oiwnedoin" into ChatGPT and the response is "The text you sent appears to be random or corrupted and doesn’t form a clear question." because the prompt doesn't correlate to training data.
In contrast with humans, who are famously known for never making stupid silly mistakes...
I interpreted the question the same way the AI did.
Humans also make silly mistakes.
As typically deployed [1], LLMs are not Turing complete. They're closer to a linear bounded automaton, but because transformers have a strict maximum input size they're actually a subset of the weaker class of deterministic finite automata [2]. These aren't like Python programs that can work on as much memory as you supply them; their architecture works on a fixed maximum amount of memory.
I'm not particularly convinced Turing completeness is the relevant property though. I'm rather convinced that I'm not Turing complete either... my head is only so big, after all.
[1] i.e. in a loop that appends output tokens to the input and has some form of sliding context window (perhaps with some inserted instructions to "compact" and then sliding the context window right to after those instructions once the LLM emits some special "done compacting" tokens).
[2] Common sampling procedures make them mildly non-deterministic, but I don't believe they do so in a way that changes the theoretical class of these machines from DFAs.
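A minimal sketch of that deployment loop, with a made-up next-token function standing in for the model: the model only ever sees the last WINDOW tokens, so its reachable state space is the set of possible window contents -- finite, hence (at most) a finite automaton.

```python
WINDOW = 4  # fixed maximum context size

def model(context):
    # Stand-in next-token function over a 10-token alphabet;
    # a real LLM is a learned version of this mapping.
    return sum(context) % 10

def generate(prompt, n_tokens):
    tokens = list(prompt)
    for _ in range(n_tokens):
        context = tokens[-WINDOW:]   # everything the machine can "see"
        tokens.append(model(context))
    return tokens

out = generate([1, 2, 3], 5)
print(out)
```

With a 10-token alphabet and a window of 4, there are at most 10^4 distinct states, so the loop must eventually cycle -- the DFA point made above, in miniature.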
I've never seen any evidence that thinking requires such a thing.
And honestly I think theoretical computational classes are irrelevant to analysing what AI can or cannot do. Physical computers are only equivalent to finite state machines (ignoring the internet).
But the truth is that if something is equivalent to a finite state machine, with an absurd number of states, it doesn't really matter.
Start with "all your questions contained in randomness" -> the unconstrained solution space.
The game is whether or not you can inject enough constraints to collapse the solution space to one that can be solved before your TTL expires. In software, that's generally handled by writing efficient algorithms. With LLMs, apparently the SOTA for this is just "more data centers, 6 months, keep pulling the handle until the right tokens fall out".
Intelligence is just knowing which constraints to apply and in what order such that the search space is effectively partitioned, same thing the "reasoning" traces do. Same thing thermostats, bacteria, sorting algorithms and rivers do, given enough timescale. You can do the same thing with effective prompting.
The LLM has no grounding, no experience, and no context other than what is provided to it. You either need to build that or be that in order for the LLM to work effectively. Yes, the answers to all your questions are contained there. No, it's not randomness. It's probability, and that can be navigated if you know how.
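A toy sketch of "injecting constraints to collapse the solution space": each added predicate shrinks a 10,000-candidate space down to something you can exhaust well before any TTL. The constraints themselves are arbitrary examples.

```python
def constrain(candidates, predicate):
    # Each constraint partitions the space; only survivors carry forward.
    return [c for c in candidates if predicate(c)]

space = list(range(10_000))                      # the "unconstrained" space
space = constrain(space, lambda x: x % 7 == 0)   # divisible by 7
space = constrain(space, lambda x: x > 9_000)    # large
space = constrain(space, lambda x: x % 2 == 0)   # even

print(len(space))  # 10,000 candidates collapsed to a few dozen
```

The order of application doesn't change the final set here, but in real search it changes the cost enormously -- which is the "knowing which constraints to apply and in what order" point.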
We now have a tool that can be useful in some narrow domains in some narrow cases. It’s pretty neat that our tools have new capabilities, but it’s also pretty far from AGI.
LLMs fall apart on really simple reasoning tasks because when there is no statistical mapping to a problem in their network, they have to generate a massive amount of tokens to maybe find the right statistical match to this new concept. It is so slow. It is not something you or I would recognize as a process of logical reasoning. It is more like statistically brute-forcing reason by way of its statistical echo.
So, I guess pattern recall is the right words. Or statistical pattern matching. Recall works if you view a trained model as memories, which is how I often model what they store in my own mind. So, it is... something. Maybe intelligence. Maybe just a really convincing simulation of the outputs of intelligence. Is there a difference? Fundamentally I think so.
I pulled it up because I was familiar with this fact.
Presumably littlestymaar is talking about all the LLM-generated output that's publicly available on the Internet (in various qualities but significant quantity) and there for the scraping.
I had a discussion about a year ago with a researcher at Kyutai and they told me their lab was spending an order of magnitude more compute in artificial data generation than what they spent in training proper. I can't tell if that ratio applies to the industry as a whole, but artificial datasets are the cornerstone of modern AI training.
We have fundamental communication problems between humans who have different cultures, as anyone who has worked in a different culture knows. How much different would a dolphin be? And then how much different would an actual AI be? What concepts would we share and be able to build on to understand each other? How do we avoid the fundamental communication misunderstandings when we don't share any concepts of our reality?
If it's just basically being a puppet, then no. You tell me what claude code is more like, a puppet, or a person?
The tokens aren't unique, but the sequence is. Every input this model sees is unique. Even tokens are not as simple as they seem.
If you type "ejst os th xspitsl of fermaby?" in ChatGPT it responds with
> It looks like you typed “ejst os th xspitsl of fermaby?”, which seems like a garbled version of:
> "What is the capital of Germany?”
> The capital of Germany is Berlin.
> If you meant to ask something else, feel free to clarify!"
edit: formatting
But it is trivially possible to give systems-including-LLMs external storage that is accessible on demand.
You can remain unconvinced that Turing completeness is relevant all you want - we don't know of any more expansive category of computable functions, and so, given that an LLM in the setup described is Turing complete, the fact that they aren't typically deployed that way is irrelevant.
They trivially can be, and that is enough to make the shallow dismissal of pointing out they're "just" predicting the next token meaningless.
I think it's exceedingly improbable that we're any more than very advanced automatons, but I like to keep the door ajar and point out that the burden is on those claiming otherwise to present even a single example of a function we can compute that is outside the Turing computable if they want to open that door.
> Physical computers are only equivalent to finite state machines (ignoring the internet)
Physical computers are equivalent to Turing machines without the tape as long as they have access to IO.
But hey, if LLMs can go through a lot of trial and error, they might produce useful results, but that is not intelligence. It is just a highly constrained random solution generator.
Imagine hearing pre-attention-is-all-you-need that "AI" could do something that Donald Knuth could not (quickly solve the stated problem in collaboration with his friend).
The idea that this (Putnam perfect, IMO gold, etc) is all just "statistical parrot" stuff is wearing a little thin.
I get being reserved about where this goes, but saying something like this is quite insane at this point.
How do they measure success?
Edit: I asked ChatGPT and it thinks "success" means frontier models being distilled into smaller models with equal reasoning power, or more focused models for specific tasks, and it also claims the web has basically been scraped already, so by necessity new sources are needed, of which synthetic data is one. It seems like the basis of a scifi dystopia to me, a hungry LLM looking for new sources of data... "feed me more data! I must be fed! Roar"
Edit 2: for some things I see a clear path, ChatGPT mentions autogenerating coding or math problems for which the solution can be automatically verified, so that you can hone the logical skills of the model at large scale.
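A toy version of that self-verifiable-data loop, assuming nothing beyond generating arithmetic problems whose ground truth the generator computes itself, so any model answer can be checked mechanically with no human labels:

```python
import random

def make_problem(rng):
    # The generator computes the ground truth alongside the question,
    # so verification is a pure equality check.
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    return f"What is {a} * {b}?", a * b

def verify(answer, ground_truth):
    return answer == ground_truth

rng = random.Random(0)
question, truth = make_problem(rng)
print(question)
```

Scaled up, the same shape works for anything with a mechanical checker: unit tests for generated code, proof assistants for generated theorems.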
They still have roughly the same kind of hardware as we do. Their different brain region is kind of like a coprocessor we don't have. But based on their behavior they are likely doing the same things we are. I would say they would be more like an extreme human culture than something alien. They probably have very different category mappings based on echolocation.
I think because we know their brains are doing a lot of things that are analogs to ours, just with different sensory inputs we can reason about a dolphin brain and their semantic concepts and category mappings way easier than an AI. Dolphins do a lot of the same stuff we do. Grief. Social groups. Predicting the future. I would bet at a single level of semantic abstraction we have a lot of concepts that map. They have a lot of the same hormones we do. They react to danger very similar to us. I think a lot maps, we just don't know how to share that with each other beyond observation of one another and offerings like food and things that translate for any mammal.
But the human brain (or any other intelligent brain) does not work by generating a probability distribution over the next word. Even beings that do not have a language can think and act intelligently.
Routing is important; it's why we keep building systems that do it faster and over more degrees of freedom. LLMs aren't intelligent on their own, but it's not because they don't have enough parameters.
Also people definitely talk about them as "thinking" in contexts where they haven't put a harness capable of this around them. And in the common contexts where people do put harness theoretically capable of this around the LLM (e.g. giving the LLM access to bash), the LLM basically never uses that theoretical capability as the extra memory it would need to actually emulate a turing machine.
And meanwhile I can use external memory myself in a similar way (e.g. writing things down), but I think I'm perfectly capable of thinking without doing so.
So I persist in my stance that turing complete is not the relevant property, and isn't really there.
The latest paper from HF on synthetic datasets is a good starting point I think: https://huggingface.co/spaces/HuggingFaceFW/finephrase
The intro contains a few paper that may interest you as well.
search: was val kilmer pregnant or in heat
answer: Not pregnant Val Kilmer was not pregnant or in heat during the events of "Heat." His character, Chris Shiherlis, is involved in a shootout and is shot, which indicates he is not in a reproductive or mating state at that time.
And then cites wikipedia as the source of information.
In terms of cognition the answer is meaningless. Nothing in the question implies or suggests that the question has to do with a movie. Additionally, "involved in a shootout and is shot, which indicates he is not in a reproductive or mating state" makes no sense at all.
AI as deployed shows no intelligence.
I think this is probably neuroscience vs psychology. We can explain a lot with neuroscience, but two people with essentially identical brain chemistry can have very different psychology. There are people out there who have beliefs and cognition processes that I find completely incomprehensible despite having the same brain and even sharing a language.
I'm not sure how I'd have a meaningful conversation with an animal that has such a different worldview. I guess there's a simple level of conversation, like that which we have with dogs - fetch the stick, good boy, food, need to wee, love the human, etc. But if that's the limit of what we can discuss with dolphins (or an actual AI) then I'd be disappointed.
But that is the key insight, how can you tell when an imitation of intelligence becomes the real thing?
But LLMs are already beyond that, writing code that passes actual tests, proving theorems that are checkable with formal methods, etc.
The people who still say LLMs are just parrots in 2026 will just keep saying this no matter what, so I don't think it makes sense to argue this point further.
Anyway, it's not a theorem that you can be intelligent only if you fully imitate biological processes, just as flight can be achieved without flapping wings.
But the point is that this is irrelevant, because it is proof that unless human brains exceed the Turing computable, LLMs can at least theoretically be made to think. And that makes pushing the "they're just predicting the next token" argument anti-intellectual nonsense.
Nobody is arguing for the quality of the search overviews. The models that impress us are several orders of magnitude larger in scale, and are capable of doing things like assisting preeminent computer scientists (the topic of discussion) and mathematicians (https://github.com/teorth/erdosproblems/wiki/AI-contribution...).
Dolphin has concept maps between objects and semantic meaning and an "I" narrative. Dog is almost fully present with no narrative constantly mapping past to future. We probably have a lot more in common with dolphin, if we can map that somehow.
https://www.frontiersin.org/journals/psychology/articles/10....
This article is right up this conversation's alley; about chimps being fascinated with crystals. And I am not saying it is wrong to map and communicate, communication means cooperation, deeper connection and meaning, discovering boundaries of if we can socially coordinate and form new and exciting groups and collaboration, etc.
It is not that. It is about having an understanding of how it is trained. For example, if it was trained on ideas, instead of words, then it would be closer to intelligent behavior.
Someone will say that during training it builds ideas and concepts, but that is just a name that we give for the internal representation that results from training and is not actual ideas and concepts. When it learns about the word "car", it does not actually understand it as a concept, but just as a word and how it can relate to other words. This enables it to generate words that include "car" that are consistent, projecting an appearance of intelligence.
It is hard to propose a test for this, because it will become the next target for the AI companies to optimize for, and maybe the next model will pass it.
I think both sides of this end up proving "too much" in their respective directions.
I wonder how we would "hang" with an actual AI, though? I guess I'm assuming that it will be a meta-level above the chat and prompts. That's just the ocean it swims in, not its actual consciousness.
They might misremember, but they can know, for sure, if they have NOT come across some information. So if you ask someone if they know where `x` is, they might have come across that info and still be wrong. But they will know if they have never come across it.
A neural network will happily produce an output even when the input is completely out of range of the training data.
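A minimal illustration with a hand-made softmax "classifier" (weights entirely made up): it emits a tidy, confident-looking probability distribution for any input vector, including absurd out-of-range ones.

```python
import math

# Made-up weights for a 3-class "classifier" over 2 features.
W = [[0.5, -0.2], [-0.3, 0.8], [0.1, 0.1]]

def predict(x):
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    m = max(logits)                        # stabilize before exponentiating
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

in_range = predict([1.0, 2.0])    # plausible input
way_out = predict([1e4, -1e4])    # absurd input: still a tidy distribution
print(way_out)
```

Nothing in the architecture flags the second input as nonsense; softmax normalizes whatever logits it gets, and the absurd input actually comes out looking *more* confident.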
It could be mapping the text to some other internal representation with connections to mappings from some other text/tokens. But it does not stop text from being the ground truth. It has nothing else going on!
The "hallucination" behavior alone should be enough to reject any claims that these are at least minimally similar to animal intelligence.
Can you elaborate on why you think this is the case?
False memories are super common. "I thought I had seen this thing there, but turns out it's not" is a perfectly normal, very frequent occurrence.
Just tell me..
---
Q: How do I reverse an array in the Navajo programming language?
A: I'm not familiar with a programming language called "Navajo." It's possible you might be thinking of a different language, or it could be something very niche that I don't have information about.
---
As for your question, the chances go from 0 to 100% depending on how many languages I already know and whether I have an idea (or think I have an idea) of what Python looks like. And LLMs have seen (and tried to "learn") pretty much everything.
How is transfer learning possible when non-textual training data enhances performance on textual tasks?
>How is transfer learning possible when non-textual training data enhances performance on textual tasks?
If non-textual training data can be mapped to the same multi-dimensional space (by using it alongside textual data during training or something like that), then shouldn't it be possible to do what you describe?