> This reminds me of Antirez's "Don't fall into the anti-AI hype". In a sentence: These foundation models are really good at optimizing these extremely high level, extremely well defined problem spaces (ie multiply matrices faster). In Antirez's case, it's "make Redis faster".
5.5 Pro is amazing, but this implication might not be true, and it is the core argument of this piece.
AI will prove all sorts of things - interesting, boring and incorrect.
To sort them out will be the task of the PhD.
At the time I thought the key missing tool was a natural language search that acted like mathoverflow, where you could explain your problem or ideas as you understood them and get references to relevant literature (possibly outside your experience or vocabulary).
> It seems to me that training beginning PhD students to do research [...] has just got harder, since one obvious way to help somebody get started is to give them a problem that looks as though it might be a relatively gentle one. If LLMs are at the point where they can solve “gentle problems”, then that is no longer an option. The lower bound for contributing to mathematics will now be to prove something that LLMs can’t prove, rather than simply to prove something that nobody has proved up to now and that at least somebody finds interesting.
Training must start from the basics though. Of course everybody's training in math starts with summing small integers, which calculators have been able to do flawlessly for a long time.
The point is perhaps confirmed by another comment further down in the post
> by solving hard problems you get an insight into the problem-solving process itself, at least in your area of expertise, in a way that you simply don’t if all you do is read other people’s solutions. One consequence of this is that people who have themselves solved difficult problems are likely to be significantly better at solving problems with the help of AI, just as very good coders are better at vibe coding than not such good coders
People pay coders to build stuff that they will use to make money and I can happily use an AI to deliver faster and keep being hired. I'm not sure if there is a similar point with math. Again from the post
> suppose that a mathematician solved a major problem by having a long exchange with an LLM in which the mathematician played a useful guiding role but the LLM did all the technical work and had the main ideas. Would we regard that as a major achievement of the mathematician? I don’t think we would.
This is a cultural choice. It makes sense that in the mathematics culture we currently have, this is alien. But already, other fields, and many individuals, would disagree and say that the human did have a major achievement here. As long as human-AI collaborations are producing the best results, there is meaningful contribution by the humans, and people that are deeper experts and skilled LLM whisperers should be able to make outsized contributions. The real shoe drops when pure AI beats humans and human-AI collaboration.
This comment about time is very interesting to me. I know it's "just" doing mathematical proofs but the possibilities of speeding up planning, proposals and decision making in the physical world should excite people.
> Conversely, for problems where one’s initial reaction is to be impressed that an LLM has come up with a clever argument, it often turns out on closer inspection that there are precedents for those arguments, so it is still just about possible to comfort oneself that LLMs are merely putting together existing knowledge rather than having truly original ideas. How much of a comfort that is I will not discuss here, other than to note that quite a lot of perfectly good human mathematics consists in putting together existing knowledge and proof techniques.
This is exactly what leads me to believe that the real impact of LLMs on human history is yet to come. My work as a researcher was mostly spent on two classes of workloads: reading recently published papers to gather ideas and keep up with the state of the art, and working on a selection of ideas gathered from those papers to build my research upon. It turns out that LLMs excel at the most critical component of both workloads: parsing existing content and using it, when prompting the model, to generate additional content based on specific goals and constraints. I mean, papers are already a way to store and distribute context.
Does the author know about CAISc 2026 [0]?
This is of enormous importance but is still being actively ignored by many professionals or dismissed as a minor issue.
Our emotional human brains are very enthusiastic about this new kind of "intelligent" product ("partner"), and we want so badly to believe that they are finally "there" that we tend to ignore how big a problem it is that LLMs carry a fundamental design flaw that will make them produce errors even when we use a grotesque amount of resources to build "bigger" versions of them. The potential for errors will never go away with the current AI architecture.
This is a fundamental paradigm shift in computing. Instead of putting a lot of energy into building an architecture that will produce reliable results, we are now maximizing on a system / idea that will never give us 100% reliable results.
Basically it is just a marketing stunt. Probably the computer science guy building it knew very well that he would still need some fundamental breakthroughs to get to a real product, but the marketing guy saw that there was still potential to make a lot of money by selling a product that produces correct results only 80% of the time.
The marketing guy was right and marketing is now dominating science, but humanity will pay a big price for that.
Putting enormous amounts of money into a fundamentally flawed system that we cannot optimize to produce reliably error-free results is just stupid.
The big achievement of "classical" computing is that the results are reliably error-free. We still have some known issues, e.g. with floating-point math and bad blocks on disk / bit flips, but these are observable and we can handle or avoid them. Generally, "non-AI computing" was made so reliable that we can depend on it for many very important things. This did not come about by accident but was created by a lot of people who put a lot of resources into research to achieve that result.
LLMs introduce a level of uncertainty and unreliability into computing that makes them practically useless.
Because if you have enough knowledge to verify the result and AI is only quicker at producing the result, what is the point of putting so many resources into it (besides making money by re-centralizing computing, of course)? Verifying a lot of results that were produced more quickly is still slow, so the people who are now just AI verifiers should simply produce the results themselves; that makes the whole process quicker.
AI is only of value if it can produce results about things that you or your organization do not know anything about. But those results you cannot verify, and therefore potentially wrong results can be fatal for you, your organization and all the people affected by actions based on those wrong results.
Many people have already been killed because decision makers are not able to follow that very simple logic.
So we can still create "interesting and enjoyable results", but ultimately it is a gigantic misallocation of resources of historic idiocy. It fits, of course, very well in a timeline where grifters are on top of societies around the world.
It is a fundamentally wrong path that should not be followed and scientists around the world should articulate exactly that instead of producing marketing blog posts for a system with such fatal inherent issues.
However, it often makes conceptual errors that I can spot only because I have good knowledge of the topic I am discussing. For instance, in 3D Clifford algebras it repeatedly confuses exponentials of bivectors with exponentials of pseudoscalars.
Good to know that ChatGPT 5.5 Pro can produce a publishable paper, but from what I have seen so far with Gemini, it seems to me that it is better to consider LLMs as very efficient students who can read papers and books in no time but still need a lot of mentoring.
Paying for Pro from any of my current academic budgets is completely out of the realm of possibility here -- all budgets tend to have restricted uses, and software payments fit into very few categories. Effectively, I'd have to ask for a brand new grant and hope the grant rules allow for large software payments and that I won't encounter an anti-AI reviewer; such a thing would take at least a year.
As a nail to the coffin, I was "denied" all Claude Opus recently as part of Microsoft's clampdown on individual (and academic) use of Copilot.
(ChatGPT 5.5 Plus does not seem sufficient for any deeper investigations into new research topics; I've tried.)
Apologies for the rant.
This made me a little sad
The question that keeps bothering me is: can an LLM generate an idea that is truly novel? How would or could that actually happen? But then that leads to the question - what are we actually doing when we think?
Perhaps it's as simple as the ability to make mistakes that matters, the same thing that powers evolution. As long as the LLM can make mistakes, it's capable of generating something genuinely novel. And it can make many more mistakes, much faster than we can.
And certainly not to send it to a fellow colleague to ask its opinion first.
LLMs are certainly becoming capable of writing code, finding vulnerabilities and solving mathematical problems, but we need to avoid putting their work into production, or in front of other humans, without assessing it by every means possible.
Otherwise tech leads, maintainers, experts get overwhelmed and this is how the « AI slop » fatigue begins.
To be clear I’m talking about this step:
> That preprint would have been hard for me to read, as that would have meant carefully reading Rajagopal’s paper first, but I sent it to Nathanson, who forwarded it to Rajagopal, who said he thought it looked correct.
The "non-trivial" is for human abilities. The weights lifted by a crane are also "non-trivial". People keep getting amazed at machine's abilities. Just like a radio telescope can see things humans can't, microscope can see the detail humans can't, we need not be amazed. The sensory perception of patterns is at different level for AI. It's a machine.
Maybe if you find AI to be doing stuff you find impressive, the stuff you were doing wasn't that impressive? Worth ruminating on your priors at least.
This is as AGI as it needs to be to get my vote. And it's scary.
Creativity is connecting ideas from different domains and seeing if something from one field applies to another. I do think AI is overhyped generally, but a major benefit from AI could be that, after ingesting all existing human knowledge (something no single human can ever hope to achieve), it would "mix and connect" it and come up with novel insights.
Most published research sits ignored and unread; AI can uncover and use everything.
Anyone spotting the issue here? What did that really cost?
I am not against compute being used for scientific or other important problems. We did that before LLMs. However, the major LLM gatekeepers want to make all industries and companies dependent on their models. And, at some point, they need to charge them the actual, unsubsidized costs for the compute. In the meantime, companies restructure in the hopes that the compute costs remain cheap.
> Training must start from the basics though.
Sure, but the point is that at some point (e.g. when starting a PhD) one needs to do research, not learn the basics. And LLMs make that harder, because they solve the "easy research" part.
Take a young lion "fighting/playing" with another young lion as a way to learn how to fight, and later hunt. And suddenly they get TikTok and are not interested in playing anymore. Their first encounter with hunting will be a lot harder, won't it?
> People pay coders to build stuff that they will use to make money and I can happily use an AI to deliver faster and keep being hired.
Again, that's true but missing the point: if you never get to be a "good coder", you will always be a "bad vibe coder". Maybe you can make money out of it, but the point was about becoming good.
Yes but it's not just that if you solved a problem yourself, you're better at solving other problems; it's also that you actually understand the problem that you solved, much better than if you simply read a proof made by somebody (or something) else.
I see this happening in the enterprise. People delegate work to some LLM; work isn't always bad, sometimes it's even acceptable. But it's not their work, and as a result, the author doesn't know or understand it better than anyone else! They don't own it, they can't explain it. They literally have no value whatsoever; they're a passthrough; they're invisible.
What I do to mitigate this is have fact-checking agents, configured to be extremely critical and unbiased, on Opus, Gemini and GPT, which are then handed the entire conversation to review. Then it's handed off to an Opus agent which is set up to assume everything is wrong. After this, and if I'm convinced something is correct, I'll hand the entire thing off to a Sonnet agent, which is set up to go through the source material and give me a compiled list of exactly what I'll need to verify.
It's ridiculously effective, but I do wonder how it would work for someone who couldn't challenge the analytic agent on domain knowledge it gets wrong. Because despite knowing our architecture and needs, it'll often make conceptual errors in the "science" (I'm not sure what the English word for this is) of data architecture. Each iteration gets better though, and with the image generation tools, "drawing" the architecture for presentations, from C-level to nerds, is ridiculously easy.
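A minimal sketch of that kind of pipeline, with placeholder model-calling functions (the prompts, function names and wiring below are assumptions for illustration, not any vendor's actual API):

```python
# Minimal sketch of the review pipeline described above.  The model-calling
# functions are placeholders (hypothetical wrappers around whatever APIs you
# use); the structure is the point: independent critical reviews, then an
# "assume it's wrong" pass, then a final pass that lists what to verify.
from typing import Callable, List

CRITIC_PROMPT = "You are an extremely critical, unbiased fact checker. Review:\n"
ADVERSARY_PROMPT = "Assume everything in this conversation is wrong. Find the flaws:\n"
CHECKLIST_PROMPT = "Go through the source material and list exactly what must be verified:\n"

def review_pipeline(conversation: str,
                    critics: List[Callable[[str], str]],
                    adversary: Callable[[str], str],
                    checker: Callable[[str], str]) -> dict:
    critiques = [c(CRITIC_PROMPT + conversation) for c in critics]
    adversarial = adversary(ADVERSARY_PROMPT + conversation + "\n\n" + "\n".join(critiques))
    checklist = checker(CHECKLIST_PROMPT + conversation)
    return {"critiques": critiques, "adversarial": adversarial, "checklist": checklist}

# Wire in real model calls here; echo stubs keep the sketch runnable.
echo = lambda prompt: f"[model output for: {prompt[:40]}...]"
result = review_pipeline("<full conversation here>", [echo, echo, echo], echo, echo)
print(result["checklist"])
```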
Just in case if you don't want to disclose your name my email is northzen@gmail.com
You are worthy of doing this work because you are able to do it. Do the work because you love it and because you love the mystery. Enjoy every moment that you get to do it. Find joy in the great fortune you have to do this work while others toil away on tasks that bring them no satisfaction. Sometimes it's tedious, but sometimes it's incredibly rewarding in its own right.
Don't work for the possibility of eternal glory though, it just doesn't exist anymore.
Any statement preceded by the word 'believe' is a coping mechanism.
> This notion of immortality was just a small intangible bonus I hoped for when I jumped into grad school
Any statement preceded by the word 'hope' is a coping mechanism.
> AI is making me feel less worthy
Worth comes from understanding, not achievement.
We praise car drivers even though most of the performance in their sport comes from the car. The driver makes the difference when two cars are close in performance. Brilliance or mistakes. Horse riders too.
In the case of math, the human can lead the LLM on the right track, point it to a problem or to another one. So it deserves some praise.
Then again, the team that built the car, cared for the horse, or built the AI might deserve even more praise, but we tend to care more about the single most visible human.
Moreover, there's no reason to believe the progress of LLMs, which couldn't reliably solve high-school math problems just 3–4 years ago, will stop anytime soon.
You might want to track the progress of these models on the CritPt benchmark, which is built on *unpublished, research-level* physics problems:
Frontier models are still nowhere near solving it, but progress has been rapid.
* o3 (high), less than 1.5 years ago: 1.4%
* GPT-5.4 (xhigh): 23.4%
* GPT-5.5 (xhigh): 27.1%
* GPT-5.5 Pro (xhigh): 30.6%
There is a 50/50 chance that it turns out to be right or that it lets you jump off a cliff.
Either way, the trip stays the same beautiful five-star-plus travel.
Also, spotting an error and telling the LLM in most cases makes things worse, because the LLM wants to please you and goes on to apologize and change course.
The moment I find myself in such a situation I save or cancel the session and start from scratch in most cases or pivot with drastic measures.
Gemini to me is the most unpredictable LLM while GPT works best overall for me.
Gemini lately gave me two different answers to the same question. This was an intentional test because I was bored and wanted to see what happens if you simply open a new chat and paste the same prompt everything else being the same.
Reasoning doesn't help much in the coding domain for me, because what the LLM comes up with as an explanation is very high level and formally correct.
I google more because of LLMs than before, because essentially what I'm witnessing is someone producing something that I have to check before I press the button it comes with. However, you only find out shortly afterwards whether the polished button starts working or gives you a warm welcome to hell.
It’s also because it is so annoying to have to manage the memory of the LLM with custom prompts/instructions manually.
I have not yet played with the long term memory feature, but I fear it will be even less reliable than prompts, simply because in one year or two years so much will have changed again that this “memory” will have to be redone multiple times by then.
Whatever the joules... (convert to $ using your preferred benchmark price), it is a fraction of what it might take to feed and sustain a human PhD for the weeks they would spend working on the same problem. The economics of LLMs is just unbeatable (sadly) when compared to us humans.
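A rough back-of-the-envelope version of that comparison (every number below is an illustrative assumption, not a measured figure):

```python
# Back-of-the-envelope comparison; all inputs are illustrative assumptions.
joules_per_query = 5e6          # assumed energy for one long reasoning run (~1.4 kWh)
price_per_kwh = 0.15            # assumed electricity benchmark price in $
weeks_of_human_work = 2
weekly_living_cost = 1000       # assumed cost to feed/sustain a PhD student, $

kwh = joules_per_query / 3.6e6  # 1 kWh = 3.6 MJ
llm_cost = kwh * price_per_kwh
human_cost = weeks_of_human_work * weekly_living_cost
print(f"LLM energy cost ~${llm_cost:.2f} vs human cost ~${human_cost}")
```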
you deserve opinions shaped by interactions with the best tools that are out there.
Mathematicians have engaged, vigorously, on this very philosophical question for centuries - is math discovered truth, or is it more akin to building an edifice where you first define the materials, then the structure, and see where it leads? Lots of strong feelings on both sides.
Some people like to parrot "next token prediction", "LLMs can only interpolate", and other nonsense, but it is obviously not true for many reasons, in particular since we introduced RL.
Humans do not have a monopoly on generating novel ideas; modern AI models, using post-training, RL, etc., can arrive at them the same way we do: exploration.
See also verifier's law [0]: "The ease of training AI to solve a task is proportional to how verifiable the task is. All tasks that are possible to solve and easy to verify will be solved by AI."
This applied to chess, go, strategy games, and we can now see it applying to mathematics, algorithmic problems, etc.
It is incredibly humbling to see AI outperform humans at creative cognitive tasks, and realise that the bitter lesson [1] applies so generally, but here we are.
[0] https://www.jasonwei.net/blog/asymmetry-of-verification-and-...
[1] http://www.incompleteideas.net/IncIdeas/BitterLesson.html
This works really well.
Now, it's clear that I have no idea how much of this is something we would consider new and original, and how much is a kind of systematic, but not novel, way of thinking.
What I couldn't do so far is get an LLM to generate a truly new maths theory, with new abstract concepts and dimensions and points of view. The kind that is not just a combination of existing theories and logic.
I think this is good advice in general, maybe with an emphasis on public vs. private, friendly contact. Having 0 thought AI slop thrown at you out of the blue is rude. "could have been a prompt" indeed. But having a friend/colleague ask for a quick glance at something they know you handle well is another story for me.
If I've worked on a subject for a few years, and know the particulars in and out, I'd have no trouble skimming something that a friend or a colleague sent me. I am sparing those 5-10 minutes for the friend, not for what they sent. And for an expert in a particular domain, often 5 minutes is all it takes for a "lgtm" or "lol no".
To me, it's rearranging the information you had in a way that hasn't been applied or published before.
That's literally what LLMs are built for.
It usually takes dissolving that, often through difficult experiences, before they can see it as a machine, something that could be separated from them.
https://github.com/vjeranc/fixed-rtrt
The M3 module was formalized fully, purely from experimental data and a nudge from earlier versions of Codex, in 15-30 minutes in a simple write/compile/fix-first-error loop. I was a bit surprised how fast it picked up the pattern, but given that there was a paper from the '70s, it became clear why later.
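A minimal sketch of such a write/compile/fix-first-error loop, assuming a `make`-based build and a caller-supplied LLM function (both are assumptions; the linked repo may do this differently):

```python
# Minimal sketch of a write/compile/fix-first-error loop.  COMPILE_CMD and the
# LLM call are assumptions: wire in whatever build command and model API you use.
import subprocess

COMPILE_CMD = ["make"]  # assumed build command

def first_error(output: str) -> str:
    lines = [l for l in output.splitlines() if "error" in l.lower()]
    return lines[0] if lines else ""

def fix_loop(path: str, ask_model, max_iters: int = 20) -> bool:
    """ask_model(source, error) -> new source; returns True once the build passes."""
    source = open(path).read()
    for _ in range(max_iters):
        result = subprocess.run(COMPILE_CMD, capture_output=True, text=True)
        err = first_error(result.stdout + result.stderr)
        if result.returncode == 0 and not err:
            return True
        source = ask_model(source, err)   # feed back only the first error
        with open(path, "w") as f:
            f.write(source)
    return False
```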
For those that don't know, this is Timothy Gowers. He is one of the most accomplished mathematicians in the world. Like Terence Tao, he is considered one of the world leaders in mathematics and tends to have good judgement in where the field is going.
Even without that knowledge, no, this article is certainly not AI generated. It has none of the tells.
There's the example of a poor person and a rich person buying boots. The poor person's boots wear out and have to be replaced, while the rich person's boots last for many years due to higher-quality craftsmanship. Over the years, the poor person ends up paying more for boots.
I'm not trying to shame here, just curious whether this is completely unattainable for most researchers in your area.
If it's "invented", then it requires ingenuity.
If it's "discovered", then it was always already there, just waiting for the right connections to be made for it to be uncovered and represented in a way we can understand.
Invention requires ingenuity, but discovery does not. So if LLMs can generate truly novel mathematics, for me that settles it that mathematics is indeed discovered, as LLMs are quite capable of discovery yet I don't consider them capable of invention.
Graduate? Yes.
In one case, it made a thoroughly convincing argument that an approach was justified. The second time it made exactly the opposite argument, which was equally compelling.
I now see LLMs as persuasion machines.
I was using Copilot and asked it a question about a PDF file (a concept search). It turned out the file was images of text. I was anticipating that and had the text ready to paste in.
Instead, it started writing an OCR program in python.
I stopped it after several minutes.
Often Copilot says it can't do something (sometimes it's even correct); that's preferable to the try-hard behaviour here.
This nails an important thing IMHO. I've absolutely noticed this, for better or worse. Gemini can produce surprisingly excellent things, but its unpredictability makes me go for GPT when I only want to ask once.
A scientific approach here is to try to falsify the statement. You start asking questions, running tests, experiments, etc. to try to prove the notion that it is done wrong. And at some point you run out of such tests, and it's probably done, for some useful notion of done-ness.
I've built some larger components and things with AI. It's never a one shot kind of deal. But the good news is that you can use more AI to do a lot of the evaluation work. And if you align your agents right, the process kind of runs itself, almost. Mostly I just nudge it along. "Did you think about X? What about Y? Let's test Z"
Wrong. Every advancement has followed an S-curve. Where we are on that curve is anyone's guess. Or maybe "this time it's different".
However, I think it's important to remember that LLMs are embedded in larger systems, and those larger systems do learn.
Claude has been utterly useless with most math problems in my experience because, much like less capable students, it tends to get overly bogged down in tedious details before it gets to the big picture. That's great for programming, not so much for frontier math. If you're giving it little lemmas, then sure it's great, but otherwise you're just burning tokens.
Many mathematicians work because they love the breakthrough (a certain quote of Villani comes to mind). They love finding new results, uncovering new mysteries. From that point of view, having an AI that can build on your basic ideas and refine them into more powerful arguments is awesome, regardless of who gets the credit. There are those that treat it more like solving puzzles so the result is not of interest. From that point of view, I can see the dissatisfaction. But I have found those with that viewpoint don't tend to make it as far in academia as those with the other viewpoint.
Of course if you are really poor, then you have to take expensive shortcuts, but for most people that shouldn’t be the case. Learning to do more with less money isn’t as bad as many people think. It’s also good for the brain to be a bit more creative.
But regular reminder - All LLMs can be wrong all the time. I only work with LLMs in domains I'm expert in OR I have other sources to verify their output with utmost certainty.
Anthropomorphizing these systems is dangerous, whether coming from the bullish or bearish perspective. The output is statistically generated by a machine lacking the capability to be smug.
I probably will erase the contents in a few days.
Even if you just drop an email and it doesn't work out, I appreciate this gesture so much. Thank you.
Also if he did send me complete junk, I would still parse it for multiple days to see what is there.
It still sounds to me like remarkable automation rather than something that's expanding the frontier of human knowledge, for now at least.
jagged AGI
So if instead of text we come up with a different representation for mathematical or physical problems, that could both improve the quality of the output while reducing the amount of transformers needed for decoding and encoding IO and for internal reasoning.
There are also different inference methods, like autoregressive and diffusion, and maybe others we haven't discovered yet.
You combine those variables, along with the internal disposition of layers, parameter size and the actual dataset, and you have such a large search space for different models that no one can reliably tell if LLM performance is going to flatline or continue to improve exponentially.
I think a better question for AI is “is it more like a network effect, liquidity effect, or a biological/physical effect”?
That's true. The question is whether the produced pattern has any value. LLMs are incapable of determining this, and will still often hallucinate, and make random baseless claims that can convince anyone except human domain experts. And that's still a difficult challenge: a domain expert is still needed to verify the output, which in some fields is very labor intensive, especially if the subject is at the edge of human knowledge.
The second related issue is the lack of reproducibility. The same LLM given the same prompt and context can produce different results. This probability increases with more input and output tokens, and with more obscure subjects.
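For anyone who wants to see the nondeterminism for themselves, here is a minimal sketch using the OpenAI Python SDK (the model name is just an example): it sends the identical prompt several times at temperature 0, which reduces but does not eliminate variation.

```python
# Minimal sketch: send the identical prompt several times and compare the
# outputs.  Uses the openai Python SDK; the model name is just an example.
# temperature=0 reduces randomness but does not guarantee identical outputs,
# since serving-side nondeterminism can remain.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def sample(prompt, n=3, model="gpt-4o-mini"):
    outputs = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        outputs.append(resp.choices[0].message.content)
    return outputs

outs = sample("State the Cauchy-Schwarz inequality in one sentence.")
print(len(set(outs)), "distinct answers out of", len(outs))
```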
The tools are certainly improving, but these two issues are still major hurdles that don't get nearly as much attention as "agents", "skills", and whatever adjacent trend influencers are pushing today.
And can we please stop calling pattern matching and generation "intelligence"? This farce has gone on long enough.
Exactly - you need to constantly have your sceptic's glasses on, and you need to be exacting about the structure you want things to follow. Having and enforcing "taste" is important, and you need to be willing to spend time on that phase, because the quality of the payoff depends entirely on it.
I recently planned for a major refactor. The discussion with Claude went on for almost two days. The actual implementation was done in 10 minutes. It has probably made some mistakes that I will have to check for during the review, but given the level of detail that plan document had, it is certainly 90-95% there. After pouring in that much opinion, it is a fairly good representation of what I would have written, while still being faster than doing everything by hand.
I think this is a bit pedantic. Obviously the parent you’re replying to is referring to the concept of “in-context learning”, which is the actual industry / academic term for this. So you feed it a paper, and then it can use that info, and it needs steering / “mentoring” to be guided into the right direction.
Heck the whole name of “machine learning” suggests these things can actually learn. “reasoning” suggests that these things can reason, instead of being fancy, directed autocomplete. Etc.
In other news: data hydration doesn’t actually make your data wet. People use / misuse words all the time, and that causes their meaning to evolve.
we do also have training on synthetic data. it might compound.
When I'm cooking meatballs with sauce and the recipe calls for frying them, I'll have an LLM guestimate how long and which program to use in an air fryer to mimic the frying pan, based on a picture of balls in a Pyrex. So I can just move on with the sauce, instead of spending time browsing websites and stressing about getting it perfect.
I used to hate these non-deterministic instructions, now I treat it as their own game. When I will publish my first recipe, I'll have an LLM randomize the ingredient amounts, round them up to some imprecise units and also randomize the times. Psychologists say we artists need to participate and I WILL participate.
Nobody looks at this species and goes hm, rational and reasonable :)
An aside: It was a very nice gesture and completely unexpected by me, so even if it doesn't work out, it made my day. I personally believe that kind gestures have a lot of power.
Back on topic: There is a real danger of the gap between rich and poor universities significantly widening in all fields if the rich can afford Pro level models, or even hardware that can run their own comparable models, and this being fiscally inaccessible to the rest.
One can sweep this under the rug by blaming the educational funding but this just shoots down all discussion. Even if GDP of a country goes up by a lot -- such as Poland -- it takes time before any budget benefit trickles to the education budget, and with some governments it might never do.
I believe Microsoft et al do have the most power here to boost affordable access to AI for researchers on a large scale; the fact that they cut some too expensive models (Opus, 5.5) from their academic benefits package is a grim omen. I do realize they would like universities to pay them also, and ultimately the universities should do that -- but then we are back at the institutional level of the problem.
This. It should become a general rule for any non-trivial use of LLMs in a professional setting.
Thank you.
At present, the tools are available to whoever wants to buy them. It's not OpenAI's fault that the parent comment's government and/or institutional policies haven't been updated to allow for their purchase and use.
I'd argue that the OpenAI dude/dudettes level of generosity is appropriate given the circumstances.
But if you ask questions occasionally, (and don't resend, for example, your whole codebase with each request), then the API feels really cheap, even for the frontier models.
That ship has sailed. Humans will anthropomorphize a rock if you put googly eyes on it.
We care about sports with humans.
For publications and theses, as long as the final results hold and can be replicated and validated, I don't see why we shouldn't allow the wholesale use of LLMs.
That's literally what an IQ test tests - abstract pattern matching. But I guess you don't like IQ tests either.
And that can be very hard to do, given that the UI we mostly interact with them through is a chat session.
You seem to have a good estimate in your head; I definitely do not.
From personal experience, ChatGPT 5.5 (the Plus tier) is excellent for programming tasks and also for various teaching related tasks but I have not observed the research benefits that Tim Gowers has when I asked it questions in my area of expertise. So the costs are definitely higher than a few dozen $ a month per PhD/professor.
You might be right that universities should immediately spring into action and demand funding for research level AI resources and hardware. One thing you might be mistaken in is that public universities are unfortunately very inflexible institutions; one reason for this is that they have a large internal leadership structure AND they are funded by the state, so even if the entire university agrees on something, the funding is at the whim of the ministry of education and thus the current political leadership.
This is really just a glorified undergraduate education; the real point of graduate school is to learn to do real-world relevant research. For the latter, I think LLM use will be accepted, but there will be a heavy expectation on the author to make the result very easily digestible for human mathematicians and to link it thoroughly with the existing literature - something that LLMs are very much not successful at, but that a student might be able to do quite well with a mixture of expert guidance and personal effort.
I think the GP meant that *if the tools provide substantial benefit* to staff, their costs can be compared to salaries and other large expenses of the university. The $100/month subscription costs less than your office space.
"Reasoning" and now "Agentic" AI systems are not some fundamental improvement on LLMs, they're just running roughly the same prior-gen LLMS, multiple times.
Hence the conclusion that LLM improvement has slowed down, if not stagnated entirely, and that we should not expect the improvements of switching to these "reasoning" systems to keep happening.
We are all having to keep revising upwards our assessments of the mathematical capabilities of large language models. I have just made a fairly large revision as a result of ChatGPT 5.5 Pro, to which I am fortunate to have been given access, producing a piece of PhD-level research in an hour or so, with no serious mathematical input from me.
The background is that, as has been widely reported, LLMs are now capable of solving research-level problems, and have managed to solve several of the Erdős problems listed on Thomas Bloom’s wonderful website. Initially it was possible to laugh this off: many of the “solutions” consisted in the LLM noticing that the problem had an answer sitting there in the literature already, or could be very easily deduced from known results. But little by little the laughter has become quieter. The message I am getting from what other mathematicians more involved in this enterprise have been saying is that LLMs have got to the point where if a problem has an easy argument that for one reason or another human mathematicians have missed (that reason sometimes, but not always, being that the problem has not received all that much attention), then there is a good chance that the LLMs will spot it. Conversely, for problems where one’s initial reaction is to be impressed that an LLM has come up with a clever argument, it often turns out on closer inspection that there are precedents for those arguments, so it is still just about possible to comfort oneself that LLMs are merely putting together existing knowledge rather than having truly original ideas. How much of a comfort that is I will not discuss here, other than to note that quite a lot of perfectly good human mathematics consists in putting together existing knowledge and proof techniques.
I decided to try something a little bit different. At least in combinatorics, there are quite a lot of papers that investigate some relatively new combinatorial parameter that leads naturally to several questions. Because of the sheer number of questions one can ask, the authors of such papers will not necessarily have the time to spend a week or two thinking about each one, so there is a decent probability that at least some of them will not be all that hard. This makes such papers very valuable as sources of problems for mathematicians who are doing research for the first time and who will be hugely encouraged by solving a problem that was officially open. Or rather, it used to make them valuable in that way, but it looks as though the bar has just been raised. It is no longer enough that somebody asks a problem: it needs to be hard enough for an LLM not to be able to solve it.
In any case, a little over a week ago I decided to see how ChatGPT 5.5 Pro would fare with a selection of problems asked by Mel Nathanson in a paper entitled Diversity, Equity and Inclusion for Problems in Additive Number Theory. Nathanson has a remarkable record of being interested in problems and theorems that have later become extremely fashionable, which has led him to write a series of extremely well timed and therefore highly influential textbooks. In this paper, he argues for the interest of several other problems, some of which I will now briefly describe.
If $A$ is a set of integers, then its sumset $A+A$ is defined to be $\{a+b : a, b \in A\}$. For a positive integer $h$, the $h$-fold sumset, denoted $hA$, is defined to be $\{a_1 + a_2 + \cdots + a_h : a_1, \dots, a_h \in A\}$. Nathanson is interested in the possible sizes of $hA$ given the size of $A$. To that end one can define a set $\mathcal{S}_h(n)$ to be the set of all $m$ such that there exists a set $A$ with $|A| = n$ and $|hA| = m$.
An obvious first question to ask is simply “What is $\mathcal{S}_h(n)$?” When $h = 2$, the answer is the set of all integers between $2n - 1$ and $\binom{n+1}{2}$. It is an easy exercise to show that if $|A| = n$, then $2n - 1 \leq |A+A| \leq \binom{n+1}{2}$, so this result is saying that all sizes in between can be realized. However, it is not true in general that $|hA|$ can take every size between its minimum and maximum possibilities, and we do not currently have a complete description of $\mathcal{S}_h(n)$.
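As a quick sanity check on the $h = 2$ statement, here is a minimal brute-force sketch: it enumerates $n$-element subsets of $\{0, \dots, N\}$, collects the achievable values of $|A+A|$, and compares them with the full range from $2n-1$ to $\binom{n+1}{2}$ (taking $N = n^2$ is an assumption that happens to be large enough for small $n$).

```python
# Brute-force check (small n only) that every size between 2n-1 and
# n(n+1)/2 is achieved by |A+A| for some n-element set A of integers.
# The interval {0,...,N} must be taken large enough; N = n**2 suffices
# here because a Sidon set of quadratic diameter realises the maximum.
from itertools import combinations
from math import comb

def achievable_doubling_sizes(n, N):
    sizes = set()
    for A in combinations(range(N + 1), n):
        sumset = {a + b for a in A for b in A}
        sizes.add(len(sumset))
    return sizes

n = 5
sizes = achievable_doubling_sizes(n, n * n)
expected = set(range(2 * n - 1, comb(n + 1, 2) + 1))
print(sorted(sizes))
print(sizes == expected)   # True for n = 5
```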
Another natural question one can ask, and this is where ChatGPT came in, is how large a diameter you need if you want a set with $|A|$ and $|A+A|$ having prescribed sizes. (Of course, the size of $A+A$ must belong to $\mathcal{S}_2(n)$.) Nathanson showed that for every $m \in \mathcal{S}_2(n)$ there is a subset $A$ of an interval $\{0, 1, \dots, N\}$, with $N$ exponential in $n$, such that $|A| = n$ and $|A+A| = m$, and asked whether that bound on the diameter could be improved. ChatGPT 5.5 Pro thought for 17 minutes and 5 seconds before providing a construction that yielded a quadratic upper bound, which is clearly best possible. It wrote up its argument in a slightly rambling LLM-ish style, so I asked if it could write the argument up as a LaTeX file in the style of a typical mathematical preprint. After two minutes and 23 seconds it gave me that, after which I spent some time convincing myself that the argument was correct.
The basic idea behind both Nathanson’s argument and ChatGPT’s was that in order to obtain a set of a given size with a sumset of a given size, it is useful to build it out of a Sidon set, which means a set with sumset of maximal size (that is not quite the usual definition but it is the simplest to use in this discussion), and an arithmetic progression. Also, for a bit of fine tuning one can take an additional point near the arithmetic progression. Then if one plays around with the various parameters, one finds that one can obtain sets of all the sizes one wants. Nathanson doesn’t express his argument this way (it is Theorem 5 of this paper), instead giving an inductive argument, but I think, without having checked too carefully, that if one unravels his argument, one finds that effectively that is what he ends up with, and the Sidon set in question consists of powers of 2. ChatGPT obtained its improvement by simply using a more efficient Sidon set — it is well known that one can find Sidon sets of quadratic diameter. (One might ask why Nathanson didn’t do that in the first place: I think it is because the obvious idea of using a more efficient Sidon set becomes obvious only after one has redescribed his inductive construction. Is that what ChatGPT did? It is very hard to say.)
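For readers who want to see a Sidon set of quadratic diameter concretely, here is a small sketch of the classical Erdős–Turán construction (my choice of illustration; the post does not say which Sidon set ChatGPT used): for a prime $p$ it produces a $p$-element Sidon set inside $\{0, \dots, 2p^2\}$, and the script brute-force verifies the Sidon property.

```python
# Erdős–Turán-style Sidon set: for a prime p, the set
# {2*p*i + (i*i % p) : i = 0..p-1} has p elements, lives in {0,...,2p^2},
# and all pairwise sums a+b (a <= b) are distinct.
def erdos_turan_sidon(p):
    return [2 * p * i + (i * i) % p for i in range(p)]

def is_sidon(A):
    sums = [A[i] + A[j] for i in range(len(A)) for j in range(i, len(A))]
    return len(sums) == len(set(sums))

p = 13
A = erdos_turan_sidon(p)
print(A)
print(max(A) <= 2 * p * p, is_sidon(A))  # True True
```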
Next, I asked ChatGPT to see whether it could do the same for a closely related question, where instead of looking at the size of the sumset, one looks at the size of the restricted sumset, which is defined to be $\{a + b : a, b \in A,\ a \neq b\}$. Unsurprisingly, it was able to do that with no trouble at all. I got it to write both results up in a single note, to avoid a certain amount of duplication. If you are curious, you can see the note here.
I then asked what it could do for general $h$. I was much less optimistic that it would manage to do anything interesting, because the proof for $h = 2$ makes fundamental use of the fact (due to Erdős and Szemerédi) that we know exactly which sizes we need to create. If we don’t know what the set $\mathcal{S}_h(n)$ is, then it seems that we are forced to start with a hypothetical set $A$ with $|A| = n$ and $|hA| = m$ and build out of it a set of small diameter with the same property. As it happens, I still don’t know how to get round that difficulty (I’m mentioning that just to demonstrate that my mathematical input was zero, and I didn’t even do anything clever with the prompts), but Nathanson mentioned in his paper a remarkable paper of Isaac Rajagopal, a student at MIT, who must have got round the difficulty somehow, because he had managed to prove an exponential dependence of the diameter on $n$ for each fixed $h$.
I’ll leave the previous paragraph there, but Isaac has subsequently explained to me that that isn’t really the difficulty. His argument gives a complete description of $\mathcal{S}_h(n)$ when $n$ is sufficiently large, and if one wants to prove a polynomial dependence for fixed $h$, then assuming that $n$ is sufficiently large is clearly permitted. The real difficulty is that constructing the sets with given sumset sizes was significantly more complicated, and necessarily so because the degree of the polynomial grows with $h$, and one therefore needs more and more parameters to define the sets.
In any case, the task faced by ChatGPT was not to solve the problem from scratch, but to see whether it was possible to tighten up Isaac Rajagopal’s argument. Here’s what happened.
Isaac made some very interesting remarks about the nature of what the additional ideas were that ChatGPT contributed. Since, as I have already said, my mathematical input was zero, I invited him to write a guest section to this post. Just before we get to that, I want to raise a question (that will undoubtedly have been raised by others as well), which is simple: what should we do with this kind of content? Had the result been produced by a human mathematician, it would definitely have been publishable, so I think it would be wrong to describe it as AI slop. On the other hand, it seems pointless even to think about putting it in a journal, since it can be made freely available, and nobody needs “credit” for it (except that Isaac deserves plenty of credit for creating the framework on which ChatGPT could build). I understand that arXiv has a policy against accepting AI-written content, which makes good sense to me. So maybe there should be a different repository where AI-produced results can live. But various decisions would need to be made about how it was organized. I myself think that one would probably want to have some kind of moderation process, so that results would be included only if a human mathematician was prepared to certify that they were correct — or, better still, that they had been formalized by a proof assistant — and perhaps also that they answered a question that had been asked in a human-written paper. On the other hand, I wouldn’t want a moderation process that created vast amounts of work (unless the work was itself done by AI, but there are obvious dangers in going down that route). Anyway, until these questions are answered, this result is available from the link above, and perhaps, now that LLMs are so good at literature search, that will be enough to make it findable by anyone who wants to know whether Nathanson’s problem has been solved.
With just a few prompts, ChatGPT was able to improve the upper bound on $D_h(n)$ (which I will define very soon) from exponential in $n$ to polynomial in $n$. While its first improvement of the bound, from one exponential bound to a smaller one, was a routine modification of my work, the improvement to polynomial in $n$ is quite impressive. To do this, ChatGPT came up with an idea which is original and clever. It is the sort of idea I would be very proud to come up with after a week or two of pondering, and it took ChatGPT less than an hour to find and prove, using similar methods to those in my own proof. My goal is to explain that idea, in a manner that will be digestible to my friends who are computer science majors as well as my math major friends.
The problem of bounding $D_h(n)$ is closely related to a problem I worked on at the Duluth REU (Research Experience for Undergrads) program, of determining $\mathcal{S}_h(n)$. In particular, $\mathcal{S}_h(n)$ is the set of possible $h$-fold sumset sizes $|hA|$, where $A$ can be chosen to be any set of $n$ integers. $D_h(n)$ is the minimal $N$ such that we can achieve all of the values of $\mathcal{S}_h(n)$ using $n$-element sets $A \subseteq \{0, 1, \dots, N\}$. I spent last summer explicitly characterizing the set $\mathcal{S}_h(n)$ for large $n$, by constructing sets $A$ such that $|hA|$ achieves all sizes which I could not rule out as impossible. So, $D_h(n)$ can be upper-bounded by optimizing my constructions.
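To make the two quantities concrete, here is a tiny brute-force sketch (feasible only for very small $n$ and $h$): it computes the achievable $h$-fold sumset sizes and then the smallest $N$ for which $n$-element subsets of $\{0, \dots, N\}$ already realize all of them; the cap on $N$ is an assumption that must be taken large enough.

```python
# Tiny brute-force illustration of S_h(n) (achievable |hA| over n-element
# sets) and D_h(n) (smallest N such that subsets of {0,...,N} already
# achieve every value in S_h(n)).  Exponential-time; toy sizes only.
from itertools import combinations, combinations_with_replacement

def h_fold_sumset_size(A, h):
    return len({sum(c) for c in combinations_with_replacement(A, h)})

def sizes_up_to(n, h, N):
    return {h_fold_sumset_size(A, h) for A in combinations(range(N + 1), n)}

def S_and_D(n, h, N_cap):
    S = sizes_up_to(n, h, N_cap)          # treat N_cap as "large enough"
    for N in range(n - 1, N_cap + 1):     # need at least n points, so N >= n-1
        if sizes_up_to(n, h, N) == S:
            return S, N
    return S, None

S, D = S_and_D(n=4, h=2, N_cap=12)
print(sorted(S), D)
```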
I constructed these sets by combining smaller component sets which are simpler to analyze. Some of these components are geometric series $\{1, q, q^2, \dots, q^{k-1}\}$ for various values of $q$ and $k$; call the two that matter here $G_1$ and $G_2$. Unfortunately, the elements of $G_1$ and $G_2$ are exponentially large in terms of $k$. So, I asked ChatGPT (through Tim) whether there exist sets of $k$ elements which have similar sumset sizes to these geometric series, but contain only numbers of polynomial size in $k$: I had no idea if this was possible, or how to begin constructing such sets. ChatGPT came back with an answer, constructing sets $C_1$ and $C_2$ which behave like “half a geometric series squeezed into a polynomial interval,” which is counterintuitive. Before I discuss the construction of $C_1$ and $C_2$, I will explain the important properties of the sumset sizes of $G_1$ and $G_2$ which they recreate.
For $h \geq 2$, a set $B$ is called a $B_h$ set if the only solutions to $x_1 + \cdots + x_h = y_1 + \cdots + y_h$ with all of the $x$’s and $y$’s in $B$ are the “trivial” solutions, by which I mean that one side of the equation is a reordering of the other side. If $B$ is a $B_h$ set of size $k$, then elements of $hB$ correspond exactly to choices of $h$ elements of $B$, with repetition allowed. Using “stars and bars,” one can see that $|hB| = \binom{k+h-1}{h}$, and this is the maximum possible value of $|hB|$ among sets of size $k$. So, another definition is that $B$ is a $B_h$ set if $|hB| = \binom{k+h-1}{h}$. Sidon sets, which Tim discussed, are exactly $B_2$ sets.
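As a small numerical illustration of the stars-and-bars count, a geometric series with ratio $h+1$ is an easy example of a $B_h$ set (my choice of example), and its $h$-fold sumset has size exactly $\binom{k+h-1}{h}$:

```python
# A geometric series with ratio h+1 is a B_h set: all h-term sums (with
# repetition) are distinct, so |hB| equals the stars-and-bars count
# C(k+h-1, h).
from itertools import combinations_with_replacement
from math import comb

def h_fold_sumset_size(B, h):
    return len({sum(c) for c in combinations_with_replacement(B, h)})

h, k = 3, 6
B = [(h + 1) ** i for i in range(k)]          # {1, 4, 16, 64, 256, 1024}
print(h_fold_sumset_size(B, h), comb(k + h - 1, h))  # both 56
```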
To make things more concrete, let us fix the parameters of the geometric series above. Then $G_1$ is a $B_{h-1}$ set, but it is not a $B_h$ set, because of a one-parameter family of relations, one for each index $i$ of the series, in which a higher power is traded for copies of lower powers. In particular, $|hG_1|$ falls short of $\binom{k+h-1}{h}$ by roughly $k$, as these relations are the only ones preventing $G_1$ from being a $B_h$ set.

$G_2$ lacks those relations, because the element that makes them possible is not in $G_2$. So $G_2$ is also a $B_{h-1}$ set, but it is not a $B_h$ set, because of a two-parameter family of relations, one for each pair of indices $i, j$. This gives roughly $k^2$ relations, and one can check that $|hG_2|$ falls short of $\binom{k+h-1}{h}$ by a quadratic amount. To summarize, we have seen that

(a) $G_1$ is a $B_{h-1}$ set.
(b) $\binom{k+h-1}{h} - |hG_1|$ is a linear function of $k$.
(c) $G_2$ is a $B_{h-1}$ set.
(d) $\binom{k+h-1}{h} - |hG_2|$ is a quadratic function of $k$.
ChatGPT was able to find sets $C_1$ and $C_2$ of $k$ elements which satisfy (a)-(d), but whose elements all have polynomial size in $k$. The construction of $C_1$ and $C_2$ uses $r$-dissociated sets, which are sets $E$ where the only solutions to $x_1 + \cdots + x_u = y_1 + \cdots + y_v$, with all of the $x$’s and $y$’s in $E$ and with $u, v \leq r$, are the “trivial” solutions, i.e. $u = v$ and one side of the equation is a reordering of the other side. For each fixed $r$, it is possible to construct an $r$-dissociated set $E \subseteq \{1, \dots, M\}$ of $k$ elements, where $M$ is polynomial in $k$. Constructions of such an $E$ using finite fields date back to Singer (1938) and Bose–Chowla (1963) and are described in Appendix 1. The sets $C_1$ and $C_2$ are then built directly out of $E$ (the precise definitions are in the preprint).

In hindsight, I have good intuition for the construction of $C_1$ and $C_2$. All of the relations in the two families described above are formed by combining one or two relations of a single basic form. There are approximately twice as many relations of that basic form in $G_1$ and $G_2$ as there are in $C_1$ and $C_2$. There are few other low-order relations in $G_1$ and $G_2$, and similarly in $C_1$ and $C_2$ because $E$ is $r$-dissociated. So, $C_1$ and $C_2$ manage to contain half as many basic relations as their geometric series counterparts, while also containing few low-order relations.
We now see why (a)-(d) hold with $G_1$ and $G_2$ replaced by $C_1$ and $C_2$, respectively. For concreteness, assume the parameters are chosen so that $E$ contains no nontrivial relations of the dissociated kind. Then $C_1$ is a $B_{h-1}$ set, but it is not a $B_h$ set, because of a one-parameter family of relations, one for each index. If we let $d_1 = \binom{k+h-1}{h} - |hC_1|$, we can check that $d_1$ is linear in $k$. In particular, (a) and (b) hold with $G_1$ replaced by $C_1$, and the linear function in (b) replaced by another linear function. We can also see that $C_2$ is a $B_{h-1}$ set, but it is not a $B_h$ set, because of a two-parameter family of relations, one for each pair of indices. If we let $d_2 = \binom{k+h-1}{h} - |hC_2|$, we can check that $d_2$ is quadratic in $k$. In a similar manner, (c) and (d) hold with $G_2$ replaced by $C_2$, and the quadratic function in (d) replaced by another quadratic function.
Even though I can motivate it in retrospect, ChatGPT’s idea to use $r$-dissociated sets to control relations of order at most $r$ feels quite ingenious. As far as I can tell, this idea is completely original.
ChatGPT’s proof that its construction produces the desired values of $|hA|$ is very similar to my proof that the sets which I construct achieve all possible values of $|hA|$, after replacing $G_1$ and $G_2$ by $C_1$ and $C_2$, respectively. Properties (a)-(d) capture many of the important properties of $G_1$ and $G_2$ (or $C_1$ and $C_2$) which are used in this proof. The final constructions involve combining the sets $C_1$ and $C_2$ (or $G_1$ and $G_2$ in my paper), for each value of a size parameter in a suitable range, with another set which is the union of an arithmetic progression and a point. Intuitively, $C_1$ and $C_2$ (or $G_1$ and $G_2$) have large sumsets, while arithmetic progressions have small sumsets, so it is plausible that one could get sets which achieve all the medium-sized sumsets by combining them. However, the proof of this is quite involved, and it occupies Section 4 of my paper and the entirety of the ChatGPT preprint. In Appendix 2, I work out the details of the ChatGPT construction to show that for $n$ sufficiently large, $D_h(n)$ is bounded above by a polynomial in $n$ whose degree depends on $h$.

For comparison, it is easy to see that $D_h(n)$ is at least on the order of $n^h$, and it is unknown what the real value is. In Appendix 3, I give details of the correspondence between my paper and the ChatGPT preprint, which will be helpful for those who want to read either.
Finally, I want to express my deep gratitude to Tim for allowing me to contribute to this blog. I am still stunned by the coincidence that the problem he chose to put into ChatGPT 5.5 Pro led him to my paper on the arXiv.
I would judge the level of the result that ChatGPT found in under two hours to be that of a perfectly reasonable chapter in a combinatorics PhD. It wouldn’t be considered an amazing result, since it leant very heavily on Isaac’s ideas, but it was definitely a non-trivial extension of those ideas, and for a PhD student to find that extension it would be necessary to invest quite a bit of time digesting Isaac’s paper, looking for places where it might not be optimal, familiarizing oneself with various algebraic techniques that he used, and so on.
It seems to me that training beginning PhD students to do research, which has always been hard (unless one is lucky enough, as I have often been, to have a student who just seems to get it and therefore doesn’t need in any sense to be trained), has just got harder, since one obvious way to help somebody get started is to give them a problem that looks as though it might be a relatively gentle one. If LLMs are at the point where they can solve “gentle problems”, then that is no longer an option. The lower bound for contributing to mathematics will now be to prove something that LLMs can’t prove, rather than simply to prove something that nobody has proved up to now and that at least somebody finds interesting.
I would qualify that statement in two ways though. First, there is the obvious point that a beginning PhD student has the option of using LLMs. So the task is potentially easier than proving something that LLMs can’t prove: it is proving something in collaboration with LLMs that LLMs cannot manage on their own. I have done quite a lot of such collaboration recently and found that LLMs have made useful contributions without (yet) having game-changing ideas.
A second point is that I don’t know how much of what I have said generalizes to other areas of mathematics. Combinatorics tends to be quite focused on problems: you start with a question and you reason back from the question or if you reason forwards you do so very much with the question in mind. In other areas there can be much more of an emphasis on forwards reasoning: you start with a circle of ideas and see where it leads. To do it successfully, you need to have some way of discriminating between interesting observations and uninteresting ones, and it isn’t obvious to me what LLMs would be like at that.
Of course, everything I am saying concerns LLMs as they are right now. But they are developing so fast that it seems almost certain that my comments will go out of date in a matter of months. It is also almost certain that these developments will have a profoundly disruptive effect on how we go about mathematical research, and especially on how we introduce newcomers to it. Somebody starting a PhD next academic year will be finishing it in 2029 at the earliest, and my guess is that by then what it means to undertake research in mathematics will have changed out of all recognition.
I sometimes get emails from people who are interested in doing mathematical research but are not sure whether that makes sense any more as an aspiration. I have a view on that question, but it may very well change in response to further developments. That view is that there is still a great deal of value in struggling with a mathematics problem, but that the era where you could enjoy the thrill of having your name forever associated with a particular theorem or definition may well be close to its end. So if your aim in doing mathematics is to achieve some kind of immortality, so to speak, then you should understand that that won’t necessarily be possible for much longer — not just for you, but for anybody. Here’s a thought experiment: suppose that a mathematician solved a major problem by having a long exchange with an LLM in which the mathematician played a useful guiding role but the LLM did all the technical work and had the main ideas. Would we regard that as a major achievement of the mathematician? I don’t think we would.
So what is the point of struggling with a difficult mathematics problem? One answer is that it can be very satisfying to solve a problem even if the answer is already known, but I don’t think that is a sufficient reason to spend several years of your life on this peculiar activity. A better answer is that by solving hard problems you get an insight into the problem-solving process itself, at least in your area of expertise, in a way that you simply don’t if all you do is read other people’s solutions. One consequence of this is that people who have themselves solved difficult problems are likely to be significantly better at solving problems with the help of AI, just as very good coders are better at vibe coding than not such good coders, or people who have a solid grasp of how to do basic arithmetic are likely to be more skilled at using calculators (and especially at noticing when an answer feels off). Mathematics is a highly transferable skill, and that applies to research-level mathematics as well. By doing research in mathematics, you may not get the same rewards as your equivalents a generation ago, but there is a good chance that you will be equipping yourself very well for the world we are about to experience.
We will construct an $r$-dissociated set $E \subseteq \{1, \dots, M\}$ of $k$ elements, where $M$ is polynomial in $k$. This construction is a very minor modification of Bose–Chowla (1963)’s construction of a $B_r$ set, which I learned about from this paper. For whatever reason, the GPT preprint (Lemma 3.1) uses a different, less efficient construction using moment curves.

Let $p$ be a prime, let $m = r + 1$, let $\mathbb{F}_{p^m}$ be the finite field with $p^m$ elements, and fix a generator $\theta$ of $\mathbb{F}_{p^m}^{\times}$, so that $\mathbb{F}_{p^m}$ is equal to $\mathbb{F}_p(\theta)$. Define a set of $p$ elements
$$E = \{\, a \in \{1, 2, \dots, p^m - 1\} : \theta^a - \theta \in \mathbb{F}_p \,\}.$$
Then each element $a$ of $E$ corresponds to a unique value of $c \in \mathbb{F}_p$, by taking $\theta^a = \theta + c$. Now an additive relation $x_1 + \cdots + x_u = y_1 + \cdots + y_v$ with all terms in $E$ and with $u, v \leq r$ can be reframed, by taking powers of $\theta$, as
$$\prod_{i=1}^{u} (\theta + c_{x_i}) = \prod_{j=1}^{v} (\theta + c_{y_j}).$$
As $\mathbb{F}_{p^m}$ is a degree-$m$ extension of $\mathbb{F}_p$ and $\theta$ generates $\mathbb{F}_{p^m}$ over $\mathbb{F}_p$, $\theta$ does not satisfy any nonzero polynomial over $\mathbb{F}_p$ of degree less than $m = r + 1$. So both sides of the displayed equation are identical as polynomials in $\theta$, and thus the additive relation is trivial. So $E$ is $r$-dissociated, and of course one can prune a few elements to reduce $E$ to size $k$; since the elements of $E$ are at most $p^m - 1$ and $p$ can be taken comparable to $k$, the diameter is indeed polynomial in $k$.
Fix constants satisfying the inequalities required for the construction (in my paper I made an essentially arbitrary choice). Recall the two sets $C_1$ and $C_2$ built from the dissociated set $E$ above, and let $P$ denote a set of consecutive integers in a suitable range, which plays the role of the arithmetic progression. Similarly to my paper, the constructions of sets $A$ such that $|hA|$ achieves the desired sizes will combine component sets of the following four types: copies of $C_1$, copies of $C_2$, the progression $P$, and a single extra point.

One reason that this construction needs to be complicated is that we need to create at least $|\mathcal{S}_h(n)|$ many sets, a quantity that grows like a polynomial in $n$ whose degree grows with $h$. To do this, we vary parameters governing the sizes of the $C_1$-type components, the sizes of the $C_2$-type components, the length of $P$ and the position of the extra point, each within a suitable domain. Choosing one of these domains slightly larger than strictly necessary, the above construction gives us sufficiently many different sets, losing only a factor $n^{\varepsilon}$ where $\varepsilon$ can be made arbitrarily small. So, if we were to remove any of the above parameters from the construction, and not change the others, this construction would no longer create enough sets. In comparison, Nathanson’s construction when $h = 2$ only needs to create on the order of $n^2$ sets. He does this by combining a Sidon set, an arithmetic progression, and one extra value, and varying the size of the arithmetic progression and the extra value over suitable ranges.
We want to combine sets $A_1, \dots, A_q$: copies of $C_1$ for some of the indices $i$, copies of $C_2$ for the remaining indices, and one further set built from the progression $P$ and the extra point. By Appendix 1, for each fixed $r$ there exists an $r$-dissociated set of the required size whose diameter is polynomial in that size, and by the constructions of $C_1$ and $C_2$ we can take each $A_i$ to consist of integers of polynomial size. Let the ambient lattice have basis vectors $e_1, \dots, e_q$. To combine $A_1, \dots, A_q$, we can define $A$ by placing a copy of $A_i$ along the direction of $e_i$ for each $i$. Similarly to my Lemma 4.9, this construction ensures that the generating function product $\mathcal{F}_A(z) = \prod_{i=1}^{q} \mathcal{F}_{A_i}(z)$ holds, which is the identity that both my paper and the GPT preprint use (see either paper for a definition of these generating functions). By (the standard) Lemma 2.3 of the GPT preprint, this lattice set is Freiman-isomorphic of order $h$ to a set of integers of polynomial size. Therefore, for $n$ sufficiently large (the whole construction relies on this for the same reasons as in my paper), we obtain the polynomial upper bound on $D_h(n)$ stated earlier.

In Section 4.2 of my paper, I use a different, simpler construction to construct sets achieving the values in $\mathcal{S}_h(n)$ below a small power of $n$. These sets are subsets of an interval of polynomial length, meaning that all elements have polynomial size in $n$. This is observed in Section 5 of the GPT preprint.

Section 4.3 of my paper carries out the construction which combines many components, including $G_1$ and $G_2$. This corresponds to Sections 2, 3, 4, and 6 of the GPT preprint. This section has a lot of moving parts; I give an outline in Section 4.3.1.

In Section 4.3.2, I describe how the different components will be combined, using a construction which I call the disjoint union, and introduce generating functions as a bookkeeping tool to keep track of the sumset sizes of a set $A$. This corresponds to Section 2 and Section 4 of the GPT preprint.

In Section 4.3.3, I compute the generating function of each of the component sets, including $G_1$ (Lemma 4.15) and $G_2$ (Lemma 4.17). This corresponds to Section 3 and Section 6.1 of the GPT preprint. In particular, the generating function of $C_1$ is computed in Lemma 3.3 and that of $C_2$ is computed in Lemma 3.4. Once these generating functions have been computed, the remainder of the proof is almost identical in my paper and in the GPT preprint.

In Section 4.3.4, I put all the pieces together to show that as we range over the sets which I have constructed, the values of $|hA|$ will assume all of the elements of $\mathcal{S}_h(n)$. The key idea is to show that the set of all achieved values of $|hA|$ forms an interval, and contains numbers both smaller than the smallest value we need and equal to the largest.
Tags: ai, mathematics