Modelling text describing the world is not modelling (some aspect) of the world?
Modelling the probability that a reader likes or dislikes a piece of text is not modelling (some aspect) of a reader's state of mind?
The text describes the world to humans. This is the crucial thing that you miss. It is very subjective.
Imagine that you learn the grammar of a foreign language without learning the meaning of the words. You might be able to make grammatically valid sentences, but you still will not understand a single thing written in that language, even though it will be perfectly clear to someone who actually understands the meaning of the words.
When you train LLMs on large volumes of text that describe logically consistent facts in a million different ways, the "logic" sort of becomes part of the grammar that the model learns. That is, logic becomes a higher kind of "grammar", an enormous set of grammatical rules that the model captures. But that does not mean the model can do actual logic.
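A toy illustration of "logic as learned grammar": a bigram model (a crude stand-in for an LLM, with a made-up corpus) learns which word-forms follow which, purely from frequency, and can then produce grammatical-looking strings with no grasp of what they describe:

```python
import random
from collections import defaultdict

# Made-up corpus; only linguistic form is available, never meaning.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count which words follow which: the only "knowledge" is co-occurrence.
follows = defaultdict(list)
for a, b in zip(corpus, corpus[1:]):
    follows[a].append(b)

def generate(start="the", n=8):
    out = [start]
    for _ in range(n):
        out.append(random.choice(follows[out[-1]]))  # sample a likely successor
    return " ".join(out)

print(generate())  # grammatical-looking text, with no concept of cats or mats
```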
isn't that circular reasoning?
"I can call anyone not smart enough to take offense because as I said those anyone aren't smart enough to take offense"?
(also disregarding that being offended has been shifted into "protection of the (perceived) weak (or of the group of your allegiance)" rather than "protection of self" for quite some time now)
---
but generally I always felt that this tension around the phrase was somewhat of a prescriptive/descriptive difference, or maybe a "level of detail in the model" kind of thing
just because there is knowledge of a fuller understanding of the process doesn't mean other descriptions/models of the process are invalid or useless
Newtonian gravity doesn't describe time dilation - and yet most of the time it is enough to use it alone, so it's successfully taught in schools and undergrad courses
if the output of an LLM can be modeled (by intuition) as "some other being" for many practical uses *and the model works* - then automatically blaming others for "using a less precise model" and warning them about it feels... strange
Maybe that's the best one can do when describing something very new and strange. A series of vivid, incompatible metaphors might be the best guide for a while. "Intelligence" as we normally understand it is a significant overstatement, while "parrot" is a massive understatement.
I don't understand this point. I feel like almost everything associated with computing is extruding synthetic text.
It's rare to read an author who can directly face Brandolini's Law of misinformation asymmetry and not only hold his own against the bullshit but overcome it.
Meanwhile you have multiple Fields Medalists (Tao, Gowers) saying they’re very impressed by LLMs’ mathematical reasoning, something that the stochastic parrots thesis (if it has any empirically-predictive content at all) would predict was impossible. I doubt Tao and Gowers thought much of LLMs a few years ago either. But they changed their minds. Who do you want to listen to?
I think it’s time to retire the Stochastic Parrots metaphor. A few years ago a lot of us didn’t think LLMs would ever be capable of doing what they can do now. I certainly didn’t. But new methods of training (RLVR) changed the game and took LLMs far beyond just reducing cross entropy on huge corpora of text. And so we changed our opinions. A shame Emily Bender hasn’t.
Sigh.
That's captured elsewhere - attempts to create "synthetic human behavior" - but mostly around ethics vs practical function or consumer appeal.
Even just a "stochastic parrot" can be extremely valuable if the parrot is fast enough and can connect enough dots in a human-reasoning-style to say things like "what could come after a description of a problem, some background info, and a question about what could have caused the problem? Probably a relevant hypothesis that fits the background facts and the problem description" and then generate a high-probability-fitting sequence of text to spit out.
There doesn't need to be any more intent in that than just "predict the next text that connects to what came before in the same way text in the training data does." It doesn't need to be intending to solve the problem if the hit rate is good enough that predicting how someone else would describe the solution is often the same as actually "intending" to solve it...
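Roughly, the mechanism described above could be sketched like this, with a hypothetical stand-in for the predictor and invented strings throughout; no real model or API is implied:

```python
def llm_continue(text: str) -> str:
    # A real system would repeatedly sample next tokens from a model here;
    # we return a canned, high-probability-looking continuation instead.
    return " The health check probably kills instances mid-warmup; raise its timeout."

prompt = (
    "Problem: the server returns 502 after every deploy.\n"
    "Background: the health check times out during warmup.\n"
    "Question: what could have caused the problem?\n"
    "Answer:"
)

# No intent to solve anything: the output is just whatever text would
# plausibly follow a problem description, background info, and a question.
print(prompt + llm_continue(prompt))
```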
Nor does the ability to predict things stochastically mean that there isn't any symbolic way to do the same. Quite possibly the stochastic process is just a brute-force rough approximation of what a true symbolic model could do. IMO the success of the stochastic approach is exactly in line with the existence of some sort of underlying structure/system. (Though such a system would have to be incredibly complex to support all the crazy things we do with language.)
When I prompt a coding agent to fix a bug, it outputs text describing a hypothesis and more text that results in running shell commands to test the hypothesis. If the output shows that it guessed wrong, it outputs more text to test a different hypothesis, and more text to edit code, and in the end, the bug is fixed.
The text resembles the output of a reasoning process closely enough to actually work. Maybe, for some purposes, it doesn't matter if it's "real" or not?
What does "real" reasoning do for us that the imitation doesn't do? Does it come up with better hypotheses? Is it better at testing them? Sometimes, but not always. Human reasoning is more expensive, less available, and sometimes gets poor results.
Did you read TFA? This is precisely one of the non-questions that she answers.
"Contrary to how it may seem when we observe its output, an LM is a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning: a stochastic parrot."
So perhaps this has always been a negative claim, about what language model AI is not.
Renaissance Philanthropies is a front for VC companies.
They never publish allocated computational resources, prior art or any novel algorithm that is used in the LLMs. For all we know, all accounts that are known to work on math stunts get 20% of total compute.
In other words, they ignore prior art, do not investigate and just celebrate if they get a vibe math result. It isn't science, it is a disgrace.
While most inference executions are intentionally non-deterministic, even a purely deterministic one would still be stochastic in that the model itself was built in a process such that the statistical frequency, sequencing, etc of the training text and followup processes all heavily influence the result.
Because of that, the output is the sort of thing that is not expected to generate 100% perfect output 100% of the time, but to have a good probability of being like-in-kind-to-the-training-data (and useful/relevant as a result).
(As compared to a non-stochastic model, like arithmetic on integers, where 2+2 is always gonna be 4 and you don't have a chance of coming up with some novel pair of inputs to addition that will cause your arithmetic to miss the mark.)
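To make the contrast concrete, here is a minimal sketch with made-up numbers: greedy decoding is deterministic at inference time, but unlike integer addition, the distribution it consults was fixed by training statistics:

```python
def add(a: int, b: int) -> int:
    return a + b  # non-stochastic: 2 + 2 is always 4, for any inputs

# Hypothetical next-token distribution for the prompt "2 + 2 =", with
# invented probabilities standing in for training-corpus statistics:
next_token_probs = {"4": 0.92, "5": 0.05, "fish": 0.03}

def greedy_decode(probs: dict[str, float]) -> str:
    # Deterministic at inference time (always the argmax), yet the numbers
    # it ranks were baked in by the statistics of the training text.
    return max(probs, key=probs.get)

assert add(2, 2) == 4
print(greedy_decode(next_token_probs))  # "4", but only because the corpus said so
```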
> Text generated by an LM is not grounded in communicative intent, any model of the world, or any model of the reader’s state of mind. It can’t have been, because the training data never included sharing thoughts with a listener, nor does the machine have the ability to do that. This can seem counter-intuitive given the increasingly fluent qualities of automatically generated text, but we have to account for the fact that our perception of natural language text, regardless of how it was generated, is mediated by our own linguistic competence and our predisposition to interpret communicative acts as conveying coherent meaning and intent, whether or not they do [89, 140]. The problem is, if one side of the communication does not have meaning, then the comprehension of the implicit meaning is an illusion arising from our singular human understanding of language (independent of the model). Contrary to how it may seem when we observe its output, an LM is a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning: a stochastic parrot.
Do you really think that claiming the output of an LLM “has no reference to meaning” is not an empirical claim? That it doesn’t attempt to place any bounds whatsoever on what LLMs can and cannot do? LLMs can solve some very difficult mathematical problems quite well now: see the article from Gowers that was on here recently. Do you think that the output in a situation like that “has no reference to meaning?” If so, you’ll have to explain why, because I don’t understand at all.
She also says:
> What I am trying to do... is to help people understand what these systems actually are
Can a phrase that has no empirical content aid people in understanding an empirical phenomenon?
> the astonishing willingness of so many to... turn to synthetic text... for all kinds of weighty decisions.
Why is this astonishing, if the nature of these models as "stochastic parrots" places no limitations whatsoever on their empirical capabilities, reliability, etc?
> the field of linguistics is particularly relevant in this moment, as a linguist’s eye view on language technology is desperately needed to help make wise decisions about how we do and don’t use these products
Is it wise to make decisions about a product on the basis of information that has no relevance to how it is actually likely to behave?
(It may be, if one has ethical concerns with "data theft, the exploitative labor practices", etc -- but one could have such concerns about any kind of product, not just a "stochastic parrot", and linguists are not generally academia's experts on, e.g., labor practices.)
and
> "Meanwhile you have multiple Fields Medalists (Tau, Gowers) saying they’re very impressed by LLMs’ mathematical reasoning, something that the stochastic parrots thesis (if it has any empirically-predictive content at all) would predict was impossible. I doubt Tau and Gowers thought much of LLMs a few years ago either. But they changed their minds. Who do you want to listen to?"
I don't understand how these things are supposedly incompatible.
Larger models and other refinements reduce the "haphazardness" of produced text. A big enough model with enough semantic connections between different words/phrasings/etc, plus enough logical connections of how cause and effect or question and answer work in human language, can obviously stitch together novel sequences when presented with novel prompts. (For at least three and a half years now, if not since the paper was written, the output has not been limited to sequences of n words that appeared 1:1 in the training data, for any n.)
"without any reference to meaning" veers into the philosophical (see how much "intent" is brought up in the linked post today). But has anything been proven wrong about the idea that the text prediction is based on probabilistic evaluation based on a model's training data? E.g. how can you prove "reasoning" vs "stochastic simulated reasoning" here?
Perhaps a useful (but hopelessly expensive, possibly infeasible) counterfactual would be to see if you could train a completely irrational LLM. Would such a model be able to "reason" its way into realizing its entire training corpus was based on fallacies and intentionally misleading statements and connections, or would it produce consistent-with-its-training-but-logically-wrong rebuttals to attempts to "teach" it the truth?
> without any reference to meaning
is vague, but I read it as actually quite a strong claim about the limitations of LLMs. I don’t think it would be possible for LLMs to do long chains of correct mathematical reasoning about novel problems that they haven’t seen before “without any reference to meaning.” That simply isn’t possible just by regurgitating and remixing random chunks of training data. Therefore I consider the stochastic parrots picture of LLMs to be wrong.
It might have been an accurate picture in 2020. It is not an accurate picture now. What is often missed in these discussions is that LLM training now looks totally different than it did a couple years ago. RLVR completely changed the game, allowing LLMs to actually do math and code well, among other things.
LLMs certainly use something similar, except that they take text as input. LLMs, especially when used for marketing stunts, have way more computing power available than any theorem prover ever had. They probably do random restarts if a proof fails, which amounts to partially brute-forcing it.
Lawrence Paulson correctly complained about some of the hype that Lean/LLMs are getting.
ACL2 even uses formulaic text output that describes the proof in human language, despite being all in Common Lisp and not a mythical clanker.
They do not think; they use old and well-established algorithms, or perhaps novel ones that were added.
Not only would it be a leap to suggest that people automatically lose their integrity by taking funds for projects they believe are useful, especially after involvement with adjacent fields, but you are also suggesting that merely being backed by a fund is enough to dismiss their views?
You also have no evidence that Renaissance Philanthropies is a front for VC companies. All news coverage indicates that they seek to be an alternative for high net worth individuals engaging in philanthropy.
Many people discovering Erdos results, engaging in Olympiads, etc., are doing so with publicly available models and publishing the resources used in the process.
They certainly do not. Read the papers where the IMO results were presented. No tools of any kind were used.
Then… what’s the point of the label, if it’s not making any empirically-meaningful claims about LLMs at all? I know that LLMs involve sampling over a distribution of output logits. I’ve written code to do it. So what? I know they have statistical elements. Yet I don’t go around calling LLMs stochastic parrots, because that label implies a whole lot of claims about LLMs that I don’t think are true any longer, like that they are just regurgitating and remixing training data and can’t successfully model structured systems (like mathematics or programming).
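For reference, "sampling over a distribution of output logits" amounts to something like the following stdlib-only sketch; the vocabulary and scores are invented:

```python
import math
import random

# Made-up vocabulary and raw scores standing in for a model's output logits:
vocab = ["parrot", "reasoner", "toaster"]
logits = [2.0, 1.5, -1.0]

def sample(logits: list[float], temperature: float = 1.0) -> str:
    scaled = [l / temperature for l in logits]
    m = max(scaled)                                  # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]                # softmax
    return random.choices(vocab, weights=probs)[0]   # stochastic draw

print(sample(logits))  # usually "parrot", occasionally "reasoner"
```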
They act as a learned proposal mechanism on top of hard search. Things like suggesting relevant lemmas, tactics, turning intent into formal steps, and ranking branches based on trained knowledge.
Maybe a kind of learned "intuition engine", from a large corpus of mathematical text, that still has to pass a formal checker. This is not really something we've had to this extent before.
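A rough sketch of that propose-and-verify loop, with hypothetical stand-ins for both the model and the checker (no real prover's API is used here):

```python
import random

def propose_steps(goal: str, k: int = 3) -> list[str]:
    # Stand-in for an LLM ranking plausible tactics/lemmas for this goal;
    # the tactic names are purely illustrative.
    tactics = ["induction n", "simp", "ring", "apply sum_comm"]
    return random.sample(tactics, k)

def checker_accepts(goal: str, step: str) -> bool:
    # Stand-in for the formal kernel: only verified steps count.
    return step == "induction n"        # pretend this is the step that works

def prove(goal: str, budget: int = 10) -> str | None:
    for _ in range(budget):                 # retries amount to random restarts
        for step in propose_steps(goal):    # learned "intuition" proposes...
            if checker_accepts(goal, step): # ...hard verification disposes
                return step
    return None

print(prove("sum_to n = n * (n + 1) / 2"))
```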
> They do not think
That claim seems less useful, unless “think” is defined in a way that predicts some difference in capability. If the objection is that LLMs are not conscious, fine, but that doesn't say much about whether they can help produce correct formal proofs.
https://www.renaissancephilanthropy.org/insights/renaissance...
https://www.renaissancephilanthropy.org/insights/embedding-a...
It promotes "agentic science", which will destroy science further:
https://www.renaissancephilanthropy.org/insights/open-source...
No one publishes. Please show me papers about the math proof logic in ChatGPT that are as detailed as those from Boyer/Moore/Kaufmann for prior work.
If they are on arxiv.org with 50 authors in a sea of slop, I didn't find them. If they exist, they are certainly not from Gowers, Tao or Lichtman.
You have all the upper hand because your AI shills back you up here, but nothing of substance.
> Yet I don’t go around calling LLMs stochastic parrots, because that label implies a whole lot of claims about LLMs that I don’t think are true any longer, like that they are just regurgitating and remixing training data and can’t successfully model structured systems.
The first part doesn't imply the second. It is nearly unarguable that all LLMs are doing is regurgitating and remixing training data. There aren't any significant inputs other than training data. It seems more likely that humans are doing the same operation the LLMs are when they model structured systems or exercise creativity - compressing data in efficient ways and then spitting it back out. "Humans are stochastic parrots" is an easy claim to defend.
But we are feeding a sealion who does not know how the math proof logic in LLMs work, probably because it is a highly computationally expensive random restart hack calling Lean that is unpublishable.
To be fair, LLMs are pretty bad at all of these. They struggle to avoid cliches and to produce prose with actual substance (below a stylistic facade that is undeniably convincing).
I have bad news for you about the writings of most Ph.D.s and University professors...
> Just look at the donors of Renaissance Misanthropy.

If you're actually interested, who funds each project is listed in the PDF here: https://www.renaissancephilanthropy.org/annual-reports
As you can see, it's mainly philanthropic projects of wealthy families.
https://www.renaissancephilanthropy.org/the-fund-model
But get your AI friends to downvote truth and sink the entire submission, because that is how the AI fascists operate.
--
It’s been a bit over five years since the Stochastic Parrots paper (Bender, Gebru et al. 2021) was published (and somewhat longer since Google made it an enormous news story by firing my co-authors). During that time, I have been watching the phrase stochastic parrot(s) on social media, initially out of linguistic interest (it’s rare to get to see how a coinage develops from its very beginning). In the early days, most usage I saw was from people referring to the paper, and then from people who had read the paper referring to large language models as stochastic parrots. Eventually, though, the phrase outran the paper, as people picked it up as a way to refer to LLMs.
Tracking this phrase also provides a window into parts of the online discourse about “AI” that I would otherwise be unlikely to see. In that discourse, I see a lot of misconceptions about a) how large language models work and b) my own work on this topic. Accordingly, it seems like a fitting time to do some debunking, answering questions that people frequently fail to ask. Below, what you’ll find aren’t questions but the various statements that people make, when perhaps they should have stopped and asked a question.
To keep this grounded in the actual text in question, here is where we introduce the term in the original paper:
Text generated by an LM is not grounded in communicative intent, any model of the world, or any model of the reader’s state of mind. It can’t have been, because the training data never included sharing thoughts with a listener, nor does the machine have the ability to do that. This can seem counter-intuitive given the increasingly fluent qualities of automatically generated text, but we have to account for the fact that our perception of natural language text, regardless of how it was generated, is mediated by our own linguistic competence and our predisposition to interpret communicative acts as conveying coherent meaning and intent, whether or not they do [89, 140]. The problem is, if one side of the communication does not have meaning, then the comprehension of the implicit meaning is an illusion arising from our singular human understanding of language (independent of the model). Contrary to how it may seem when we observe its output, an LM is a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning: a stochastic parrot. (p.616–617)
The phrase stochastic parrots was one attempt (among several) to make vivid what it is that large language models, when used to synthesize text, are doing. In later work (Mystery AI Hype Theater 3000, The AI Con), I’ve also added synthetic text extruding machine as a way to describe systems that closely model which bits of words tend to co-occur in their input data and can be used to, well, extrude synthetic text.
I have never and will never say that “AI” is a stochastic parrot, because I reject “AI” as a way to describe technologies (LLMs or otherwise). Also, the Stochastic Parrots paper, written in Sept-Oct 2020, was not a paper about “AI” at all, but a paper about the risks and harms associated with the drive for ever larger language models, which, at that point, mostly weren’t being used to extrude synthetic text. (OpenAI had made GPT-2 and GPT-3 available for playing with, but this was still two years before they imposed ChatGPT on the world and synthetic text suddenly became everyone’s problem.) The term “AI” appears only once, near the end of the paper, where we write:
Work on synthetic human behavior is a bright line in ethical AI development, where downstream effects need to be understood and modeled in order to block foreseeable harm to society and different social groups. (p.619)
I believe this particular insight, and its phrasing, is due to Margaret Mitchell (aka Shmargaret). In the years since, this observation has unfortunately been repeatedly reinforced: work on synthetic human behavior continued apace, and the foreseeable harms (predictably) came to pass.
Indulge me in a little digression into linguistics here. The word just is the kind of word that evokes a scale or ranking. For example, She is just 5 feet tall places her on a scale of height and furthermore suggests that her height is further down that scale than would be expected or desirable or just normal/normative. So someone who says that I say that some model is “just” a stochastic parrot is also attributing a scale, perhaps of functionality (or, in the anthropomorphizing language I am always struggling against, “capability”), and asserting that I am placing whatever model in the wrong, or at least a surprisingly low, spot on that scale.
This misunderstands what I was doing with the phrase stochastic parrots, and what we were doing in that paper in general. While I can’t speak for my co-authors, I am not invested in the project of “AI”, do not see it as a goal that is worthwhile (nor feasible) to work towards, and am not measuring large language models against some scale of progress towards that goal. What I am trying to do, in a world absolutely saturated with marketing selling the idea that the synthetic text extruding machines are “AI”, or maybe even “AGI”, is to help people understand what these systems actually are: systems designed to mimic the language (specifically: linguistic forms) that people use.
An important related point here is that though all of these systems (Claude, Gemini, ChatGPT, etc) have LLMs specifically designed to produce synthetic text as key components, that doesn’t mean there aren’t other components, as Margaret Mitchell also points out. Most things we historically do with computing are not well approximated by extruding synthetic text. Accordingly, if a company’s goal is to portray their product as functional, they would be well advised, for example, to run text classification systems on user input to intercept any arithmetic queries and route those to an actual calculator.
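A minimal sketch of that routing idea, with a regex standing in for the text classifier; everything here is illustrative rather than any actual product's design:

```python
import operator
import re

# A toy "classifier": a regex that recognizes simple arithmetic queries.
ARITH = re.compile(r"^\s*(\d+)\s*([+\-*/])\s*(\d+)\s*$")
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def synthesize_text(prompt: str) -> str:
    # Stand-in for the synthetic text extruding machine.
    return "<extruded text>"

def route(user_input: str) -> str:
    m = ARITH.match(user_input)
    if m:  # arithmetic gets computed, not mimicked
        a, op, b = int(m.group(1)), m.group(2), int(m.group(3))
        return str(OPS[op](a, b))
    return synthesize_text(user_input)  # everything else goes to the LLM

print(route("2 + 2"))              # "4", from the calculator
print(route("why is 2 + 2 = 4?"))  # "<extruded text>", from the model
```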
I often see people talking about “the stochastic parrots critique of LLMs,” but this, too, misapprehends at least the way I use the phrase. (This may be an accurate description of how other people use it.) I definitely take a critical view on the project of “AI”, and on the ways in which people are using synthetic text extruding machines (aka LLMs). But the target of my criticism is not the models. Rather, I am concerned about the actions of people: the data theft, the exploitative labor practices, the haphazard creation of and failure to document datasets, the complete disregard for environmental impact, and the astonishing willingness of so many to surrender their own power and turn to synthetic text (for which no one is accountable) for all kinds of weighty decisions.
Another common trope in the discourse around this phrase is to claim that stochastic parrot is an insult (or even a slur). On one reading, that would require LLMs to be the kind of thing that can take or feel offense, which they clearly aren’t. But, indeed, it is also possible to insult someone’s work, or consumer product they have acquired, etc. At which point, I refer the reader to the previous two points.
Folks have also pointed out that this coinage is somewhat unfair to actual parrots who, for all I know, do have internal lives and do use their ability to mimic human speech with some kind of communicative intent. My best answer here is to say that (despite parrot in stochastic parrot being a noun), I am drawing not on the name of the bird directly but rather on the English verb to parrot, which means to repeat back without understanding.
This one misses the role of stochastic in stochastic parrot, which means randomly, according to some probability distribution. What comes out of these systems is not usually a direct regurgitation of their input, but rather a remix of it. This remix is shaped by the specific ways in which the systems were built (“trained”) through multiple steps, by the “system prompt” (a prompt prepended to user input that the user doesn’t usually see), and by the user input itself. In other words, these systems make papier-mâché of their training data, molded around the balloons of these other components.
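For readers unfamiliar with the mechanics, a minimal sketch of that prepending, with invented strings: the user sees only their own question, but the model's next-word probabilities are conditioned on the whole assembled context:

```python
system_prompt = "You are a helpful assistant. Answer briefly and politely."
user_input = "Why is the sky blue?"

# What actually gets continued is this full context, not just the question.
full_context = f"{system_prompt}\n\nUser: {user_input}\nAssistant:"
print(full_context)
```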
This one is funny because it comes up, in the same form, every time one of the companies promotes a new model. “Stochastic parrots might have been an accurate description in [year], but not anymore because…” and then reference to whatever demo the author has been impressed by. This is framed as heralding the arrival of “real” “AI” — over and over and over again.
But stochastic parrots (in my writing at least) isn’t an argument. It’s a description or a metaphor, again an attempt to make vivid what language mimicking machines do.
Stochastic parrots also does not refer to an empirical hypothesis. Accordingly, it doesn’t make sense to say it’s been “disproved” or that it is “unfalsifiable”.
The closest thing to a hypothesis in this space in my writing is the argument (again, not an empirical hypothesis) in Bender and Koller 2020, the one with the octopus thought experiment. The Stochastic Parrots paper refers to this earlier paper, which lays out the argument that language models don’t understand the text they are used to process, because language models only ever have access to the linguistic form (i.e. spellings of words) in the training data.
In that paper, we provide a definition of understanding as mapping from language to something outside of language, and show that systems built only with linguistic form have no purchase with which to encode (“learn”) such a mapping.
Stochastic parrots was coined to refer to language models, i.e. systems trained only on linguistic form used to mimic the kinds of sequences of linguistic form that people use. It is true that image/text models, for example, that can be used to map from linguistic strings to images or vice versa, can be argued to meet the definition of understanding in Bender & Koller 2020 — albeit in an extremely thin way. But the stochastic parrots framing is still extremely relevant to these models, as well as systems built with them. As quoted above:
we have to account for the fact that our perception of natural language text, regardless of how it was generated, is mediated by our own linguistic competence
When we look at the text in an image/text model, we make sense of it in a way that is rich and socially situated and we must not project that onto the model if we want to keep a clear-eyed view of how such models actually function (and in what circumstances we should be willing to use them). Similar things can be said about the images, too, though it’s generally not linguistic competence per se they are experienced through.
As we write in the Stochastic Parrots paper:
The ersatz fluency and coherence of LMs raises several risks, precisely because humans are prepared to interpret strings belonging to languages they speak as meaningful and corresponding to the communicative intent of some individual or group of individuals who have accountability for what is said. (p.617)
User interfaces, if well designed, should be transparent in the sense of providing the user with clear information about what the system can reliably do. Even if there is some thin kind of technical “understanding” in e.g. a text/image model, the fact that it’s using our language at all will send misleading signals about what is actually going on, so long as we relate to language as we always do (and I don’t see how we can avoid doing so).
This one stands out because it tends to come from other folks who are critical of “AI” but are impatient with criticism that doesn’t come from their own lens. Of course the phrase stochastic parrots isn’t a sociological critique of the way these systems are being used by corporations (and governments) to discipline labor and centralize power. It seems like a category error to ask that of a phrase coined to try to make vivid the basic functionality of the software. If you want sociological analysis, I recommend The AI Con, co-authored with a sociologist (the amazing Dr. Alex Hanna).
I came up with this phrase as we were writing the paper, and then wondered if I had heard it somewhere. As of early October 2020, a Google search for “stochastic parrot” provided 0 hits. I also asked around on social media. It turns out there are two quasi-antecedents:
In a Daily Nous post from July 2020 (which I had not read prior to my coinage), Regina Rini writes:
So long as we get what we came for — directions to the dispensary, an arousing flame war, some freshly dank memes — then we won’t bother testing whether our interlocutor is a fellow human or an all-electronic statistical parrot.
That’s the shape of things to come. GPT-3 feasts on the corpus of online discourse and converts its carrion calories into birds of our feather.
The more direct inspiration for me was an email from Stuart Russell in September 2020 to Alexander Koller and me about our ACL 2020 paper (the one with the octopus thought experiment):
I have been watching with increasing disbelief as the NLP community becomes more and more enamored of its randomized parrots. The paper is a breath of sanity.
Once I made the connection, I emailed Russell to offer a footnote acknowledging it in the Stochastic Parrots paper (in October 2020, before the paper was publicly known). He declined.
On a related note, my view into the discourse sometimes turns up a “sour grapes” argument, wherein people think my motivation for “critiquing LLMs/‘AI’” (see above) is that I’m just salty because my work as a linguist (and more specifically on grammar engineering, which is a symbolic rather than statistical approach to natural language processing) is somehow upended or made obsolete by LLMs. I can promise you that it is still interesting and worthwhile to study how language works and how we work with language, and to use computers to do so. And in fact, the field of linguistics is particularly relevant in this moment, as a linguist’s eye view on language technology is desperately needed to help make wise decisions about how we do and don’t use these products.