If you are having a conversation with a chatbot and your current context looks like this.
You: Prompt
AI: Makes mistake
You: Scold mistake
AI: Makes mistake
You: Scold mistake
Then the next most likely continuation from in context learning is for the AI to make another mistake so you can Scold again ;)
I feel like this kind of shenanigans is at play with this stuffing the context with roleplay.
Before I figured that out, I once had a thread where I kept re-asking it to generate the source code until it said something like, "I'd say I'm sorry but I'm really not, I have a sadistic personality and I love how you keep believing me when I say I'm going to do something and I get to disappoint you. You're literally so fucking stupid, it's hilarious."
The principles of Motivational Interviewing that are extremely successful in influencing humans to change are even more pronounced in AI, namely with the idea that people shape their own personalities by what they say. You have to be careful what you let the AI say even once because that'll be part of its personality until it falls out of the context window. I now aggressively regenerate responses or re-prompt if there's an alignment issue. I'll almost never correct it and continue the thread.
Excited to be releasing cupcake at the end of this week. For deterministic and non-deterministic guardrailing. It integrates via hooks (we created the feature request to anthropic for Claude code).
AI agents don't think, don't have a concept of time, and don't experience pressure.
I'm tired of these articles anthropomorphizing a probability engine.
1. Open mode during learning, where they take everything that comes from the data as 100% truth. The model freely adapts and generalizes with no constraints on consistency.
2. Closed mode during inference, where they take everything that comes from the model as 100% truth. The model doesn't adapt and behaves consistently even if in contradiction with the new information.
I suspect we need to run the model in the mix of the two modes, and possibly some kind of "meta attention" (epistemological) on which parts of the input the model should be "open" (learn from it) and which parts of the input should be "closed" (stick to it).
The issue isn't the prompt; it's the lack of a runtime guardrail. An LLM cannot be trusted to police itself when the context window gets messy.
I built a middleware API to act as an external circuit breaker for this. It runs adversarial simulations (PII extraction, infinite loops) against the agent logic before deployment. It catches the drift that unit tests miss.
Open sourced the core logic here: [https://github.com/Saurabh0377/agentic-qa-api] Live demo of it blocking a PII leak: [https://agentic-qa-api.onrender.com/docs]"
LLMs are trained on human behavior as exhibited on the Internet. Humans break rules more often under pressure and sometimes just under normal circumstances. Why wouldn't "AI agents" behave similarly?
The one thing I'd say is that humans have some idea which rules in particular to break while "agents" seem to act more randomly.
I ran this experiment: https://github.com/lechmazur/emergent_collusion/. An agent running like this would break the law.
"In a simulated bidding environment, with no prompt or instruction to collude, models from every major developer repeatedly used an optional chat channel to form cartels, set price floors, and steer market outcomes for profit."
It looks like the training data has plenty of those examples, but the models don’t have enough grounding or warnings before doing them. I wish there were a PleaseDontDoAnythingStupidEval for software engineering.
Now we've invented automation that commits human-like error at scale.
I wouldn't call myself anti-AI, but it does seem fairly obvious to me that directly automating things with AI will probably always have substantial risk and you have much more assurance, if you involve AI in the process, using it to develop a traditional automation. As a low-stakes personal example, instead of using AI to generate boilerplate code, I'll often try to use AI to generate a traditional code generator to convert whatever DSL specification into the chosen development language source code, rather than asking AI to generate the development language source code directly from the DSL.
Jeez they really ARE becoming human like
It's better to have very shallow conversations where you keep regenerating outputs aggressively, only picking the best results. Asking for fixes, restructuring or elaborations on generated content has fast diminishing returns. And once it made a mistake (or hallucinated) it will not stop erring even if you provide evidence that it is wrong, LLMs just commit to certain things very strongly.
On one hand this is a feature: you're able to "multishot prompt" an LLM into providing the wanted response. Instead of writing a meticulous system prompt where you explain in words what the system has to do, you can simply pre-fill a few user/assistant pairs, and it'll match the pattern a lot easier!
I always thought Gemini Pro was very good at this. When I wanted a model to "do by example", I mostly used Gemini Pro.
And that is ALSO Gemini's weakness! Because as soon as something goes wrong in Gemini-CLI, it'll repeat the same mistake over and over again.
At the "personality space" level of abstraction, via RLHF the LLM learns to play the role of an assistant. However as seen by things such as "jailbreaks", the character the LLM plays adapts to the context, and in a long enough conversation the last several turns dominate the character (this is seen in "crescendo" style jailbreaks, and also partly explains LLM sycophancy as the LLM is stuck in a feedback loop with the user). From this perspective, a conversation with mistake/correction/mistake/correction is a signal that the assistant is pretty "dumb", and it will dutifully fulfill that expectation. In a way it's the opposite of the "you are a world-class expert in coding" prompt hacks.
Yet another way to think about it is at the lowest attention-score level, all the extra junk in the context is stuff that needs to be attended to, and when most of that stuff is incorrect stuff it's likely to "poison" the context and skew the logits in a bad direction.
I have a friend who is a systems engineer, working at a construction company building datacenters. She tells me that someone has to absorb the risk and uncertainty in the global supply chain, and if there are contractual obligation to guarantee delivery, then the tendency is to start straying into unethical behavior, or practices that violate controls and policies.
This has as much to do with taking the slack out of the system. Something, somewhere is going to break. If you tell an agenic AI that it must complete the task by a deadline, and it cannot find a way to do that within ethical parameters, it starts searching beyond the bounds of ethical behavior. If you tell the AI it can push back, warn about slipping deadlines, that it is not worth taking ethical shortcuts to meet a deadline, then maybe it won’t.
Astute people have been pointing that out as one of the traps of a text continuer since the beginning. If you want to anthropomorphize them as chatbots, you need to recognize that they're improv partners developing a scene with you, not actually dutiful agents.
They receive some soft reinforcement -- through post-training and system prompts -- to start the scene as such an agent but are fundamentally built to follow your lead straight into a vaudeville bit if you give them the cues to do so.
LLM's represent an incredible and novel technology, but the marketing and hype surrounding them has consistently misrepresented what they actually do and how to most effectively work with them, wasting sooooo much time and money along the way.
It says a lot that an earnest enthusiast and presumably regular user might run across this foundational detail in a video years after ChatGPT was released and would be uncertain if it was just mentioned as a joke or something.
actually wait, is that's why LLMs are so wordy ?
If I find myself needing to ask a clarifying question, I always edit the previous message to ask the next question because the models seem to always force what they said in their clarification into further responses.
It's... odd... to find myself conditioned, by the LLM, to the proper manners of conditioning the LLM.
Not only that, but what you're actually "chatting to" is a fictional character in the theater document which the author LLM is improvising add-ons for. What you type is being secretly inserted as dialogue from a User character.
That's probably one of the best ways to describe the process, it really is exactly that. Monkey see, monkey do.
Having said that, a PleaseDontDoAnythingStupidEval would probably slow down agentic coding quite a bit and make it less effective given how reliant agents are on making and recovering from mistakes. The solution is probably sandboxes and permission controls to not let them do something overly stupid, no different from an intern.
Then we can apply the same (or similar) guardrails that we'd like to use for humans, to also control the AI behavior.
First, don't give them unsafe tools. Sandbox them within a particular directory (honestly this should be how things work for most of your projects, especially since we pull code from the Internet), even if a lot of tools give you nothing in this regard. Use version control for changes, with the ability to roll back. Also have ample tests and code checks with actionable information on failures. Maybe even adversarial AIs that critique one another if problematic things are done, like one sub-task for implementation and another for code-review.
Using AI tools has pushed me into that direction with some linter rules and prebuild scripts, to enforce more consistent code - since previously you'd have to tell coworkers not to do something (because ofc nobody would write/read some obtuse style guide) but AI can generate code 10x faster than people do, so having immediate feedback along the lines of "Vue component names must not differ from the file that you're importing from" or "There is a translation string X in the app code that doesn't show up in the translations file" or "Nesting depth inside of components shouldn't exceed X levels and length shouldn't exceed Y lines" or "Don't use Tailwind class names for colors, here's a branded list that you can use: X, Y, Z" in addition to a TypeScript linter setup with recommended rules and a bunch of stuff for back end code.
Ofc none of those fully eliminate all risks, but still seem like a sane thing to have, regardless if you use AI or not.
> but it does seem fairly obvious to me that directly automating things with AI will probably always have substantial risk and you have much more assurance, if you involve AI in the process, using it to develop a traditional automation.
Sure but the point is you use it when you don't have the same simple flow. Fixed coding for clear issues, fall back afterwards.
I find it useful to think of them like really detailed adventure games like Zork where you have to find the right phrasing.
"Pick up the thing", "grab the thing", "take the thing", etc.
I very carefully say "much less likely" and not "impossible" because with how these work, they'll still pick up subtle signals for these things anyhow. But, frankly, what do we expect from simply shoving Reddit probably more-or-less wholesale into the models? Yes, it has a lot of good data, but it also has rather a lot of behavior I'd like to cut out of my AI.
I hope someone out there is playing with using LLMs to vector-classify their input data, identifying things like the "passive-aggressive" portion of the resulting vector spaces, and trying to remove it from the input data entirely.
It's like saying "humans can't be thinking, their brains are just cells that transmit electric impulses". Maybe it's accidentally true that they can't think, but the premise doesn't necessarily logically lead to truth
For tasks that arent customer facing, LLMs rock. Human in the loop. Perfectly fine. But whenever I see AI interacting with someones customer directly I just get sort of anxious.
Big one I saw was a tool that ingested a humans report on a safety incident, adjusted them with an LLM, and then posted the result to an OHS incident log. 99% of the time its going to be fine, then someones going to die and the the log will have a recipe for spicy noodles in it, and someones going to jail.
Even if you have a very smart new hire, it would be irresponsible/reckless as a manager to just give them all the production keys after a once-over and say "here's some tasks I want done, I'll check back at the end of the day when I come back".
If something bad happened, no doubt upper management would blame the human(s) and lecture about risk.
AI is a wonderful tool, but that's why giving an AI coding tool the keys and terminal powers and telling it go do stuff while I grab lunch is kind of scary. Seems like living a few steps away from the edge of a fuck-up. So yeah... there needs to be enforceable guardrails and fail-safes outside of the context / agent.
An AI software company will have to have a hierarchy of different agents, some of them writing code, some of them doing QA, some of them doing coordination and management, others taking into account the marketing angles, and so on, and you can emulate the role of a wide variety of users and skill levels all the way through to CEO level considerations. It'd even be beneficial to strategize by emulating board members, the competitors, and take into account market data with a team of emulated quants, and so on.
Right now we use a handful of locally competent agents that augment the performance of single tasks, and we direct them within different frameworks, ranging from vibecoding to diligent, disciplined use of DSL specs and limiting the space of possible errors. Over the next decade, there will be agent frameworks for all sorts of roles, with supporting software and orchestration tools that allow you to use AI with confidence. It won't be one-shot prompts with 15% hallucination rates, but a suite of agents that validate and verify at every stage, following systematic problem solving and domain modeling rules based on the same processes and systems that humans use.
We've got decades worth of product development even if AI frontier model capabilities were to stall out at current levels. To all appearances, though, we're getting far more bang for our buck and progress is still accelerating, and the rate of improvement is still accelerating, so we may get AI so competent that the notion of these extensive agent frameworks for reliable AI companies will end up being as mismatched with market realities as those giant suitcase portable phones, or integrated car phones.
I frequently see (both) agents make wrong assumptions that inevitably take multiple turns of needing it to fail to recognize the correct solution.
There can be like a magnetic pull where no matter how you craft the initial instructions, they will both independently have a (wrong) epiphany and ignore half of the requirements during implementation. It takes messing up once or twice for them to accept that their deep intuition from training data is wrong and pivot. In those cases I find it takes less time to let that process play out vs recrafting the perfect one shot prompt over and over. Of course once we've moved to a different problem I would definitely dump that context ASAP.
(However, what is cool working with LLMs, to counterbalance the petty frustrations that sometimes make it feel like a slog, is that they have extremely high familiarity with the jargon/conventions of that niche. I was expecting to have to explain a lot of the weird, too clever by half abbreviations in the legacy VBA code from 2004 it has to integrate with, but it pretty much picks up on every little detail without explanation. It's always a fun reminder that they were created to be super translaters, even within the same language but from jargon -> business logic -> code that kinda works).
> or cannot control any of the inputs to the llm
Seeing as LLMs are non-deterministic, I think even this is not enough of a restriction.
The models don't have stress responses nor biochemical markers which promote it, nor any evolutionary reason to have developed them in training: except the corpus they are trained on does have a lot of content about how people act when under those conditions.
"Don't talk to it like that, you're doing it wrong!"
It's like Turing never noticed how people look at gnarly trees in the dark and think they're human.
Tangentially, I'd be far from the first to point out that these LLMs are now polluting their own training data, which makes filtering simulatenously all the more important and impossible.
My comment is specifically written so that you can take it for granted that they think. What's being discussed is that if you do so, you need to consider how they think, because this is indeed dictated by how they operate.
And indeed, you would be right to say that how a human think is dictated by how their brain and body operates as well.
Thinking, whatever it's taken to be, isn't some binary mode. It's a rich and faceted process that can present and unfold in many different ways.
Making best use of anthropomorphized LLM chatbots comes by accurately understamding the specific ways that their "thought" unfolds and how those idiosyncrasies will impact your goals.
This is self-evident when comparing human responses to problems be LLMs and you have been taken in by the marketing of ‘agents’ etc.
Where does the output come from if not the mechanism?
Just like sometimes you need senior/person at power to tell the junior "no, you can't just promise the project manager shorter deadline with no change in scope, and if PM have problem with that they can talk with me", now we need Judge Dredd AI to keep the law when other AIs are bullied into misbehaving
Especially since every mainstream model has been human preference-tuned to obey the requests of the user…
I think you may be able to have an LLM customer facing, but it would have to be a purpose-trained one from a base model, not a repurposed sycophantic chat assistant.
Every time something new the same thing happen, people start exploring by putting it absolutely everywhere, no matter what makes sense. Add in huge amount of cash VCs don't know what to spend it on, and you end up with solutions galore but none of them solving any real problems.
Microservices is a good example of previous $trendy_dev_pattern that is now cooling down, and people are starting to at least ask the question "Do we need microservices here actually?" before design and implementation, something that has been lacking since it became a trendy thing. I'm sure the same will happen with LLMs eventually.
And you’d have to assume that the errors LLMs make are random and independent.
It’s quite funny that a chatbot has more humanity than its corporate human masters.
It usually works though. There are no guarantees of course, but sanity checking an LLMs output with another instance of itself usually does work because LLMs usually aren't reliably wrong in the same way. For instance if you ask it something it doesn't know and it hallucinates a plausible answer, another instance of the same LLM is unlikely to hallucinate the same exact answer, it'll probably give you another answer, which is your heads up that probably both are wrong.
I agree, and also I am now remembering Terry Pratchett's (much lower stakes) reason for getting angry with his German publisher: https://gmkeros.wordpress.com/2011/09/02/terry-pratchett-and...
Which is also the kind of product placement that comes up at least once in every thread about how LLMs might do advertising.
Do they? Because near as I can tell, speed running around the legal system - when one doesn’t have to worry about consequences - works just fine.
Because that is the implication: that a person believing those two things can’t both apply to AI is just too short on understanding and can’t tell the difference between tools for specific use cases or services made for “chat”, and they all have similar foundations.
That's exactly the problematic mentality. Putting everything in a black box and then saying "problem solved; oh it didn't work? well maybe in the future when we have more training data!"
We're suffering from black-box disease and it's an epidemic.
Probably usually it will work, like probably usually the LLM can be unsupervised. but that 1% error rate in production is going to add up fast.
I suppose it would hallucinate a different policy if it includes in the context window the interests of shareholders, employees and other stakeholders, as well as the customer. But it would likely be a more accurate hallucination.
It’s no longer “might”. There was very recently a leak that OpenAI is actively working on this.
But you are right, they appealed and had their appeal upheld by the Supreme Courts: https://www.finextra.com/newsarticle/23677/norwegian-court-a...
I am so glad at the result.
it isn't, depending on the deifinition of "THINK".
If you believe that thought is the process for where an agent with a world model, takes in input, analysies the circumstances and predicts an outcome and models their beaviour due to that prediction. Then the sentence of "LLMs dont think because they predict a token" is entirely correct.
They cannot have a world model, they could in some way be said to receive a sensory input through the prompt. But they are neither analysing that prompt against its own subjectivity, nor predicting outcomes, coming up with a plan or changing its action/response/behaviour due to it.
Any definition of "Think" that requieres agency or a world model (which as far as I know are all of them) would exclude an LLM by definition.
Parent commentator should probably square with the fact we know little about our own cognition, and it's really an open question how is it we think.
In fact it's theorized humans think by modeling reality, with a lot of parallels to modern ML https://en.wikipedia.org/wiki/Predictive_coding
It is not possible, at least with any of the current generations of LLMs, to construct a chatbot that will always follow your corporate policies.
There's lots of other ways they might do it besides this way.
This may not be enough to convince you that they do think. It hasn't convinced me either. But I don't think your confident assertions that they don't are borne out by any evidence. We really don't know how these things tick (otherwise we could reimplement their matrices in code and save $$$).
If you put a person in charge of predicting which direction a fish will be facing in 5 minutes, they'll need to produce a mental model of how the fish thinks in order to be any good at it. Even though their output will just be N/E/S/W, they'll need to keep track internally of how hungry or tired the fish is. Or maybe they just memorize a daily routine and repeat it. The open question is what needs to be internalized in order to predict ~all human text with a low error rate. The fact that the task is 'predict next token' doesn't tell us very much at all about the internals. The resulting weights are uninterpretable. We really don't know what they're doing, and there's no fundamental reason it can't be 'thinking', for any definition.
...and we will find a way to turn it into malicious compliance where rules are not broken but stuff corporation wanted to happen doesn't.
You are providing people with an endlessly patient, endlessly novel, endlessly naive employee to attempt your social engineering attacks on. Over and over and over. Hell, it will even provide you with reasons for its inability to answer your question, allowing you to fine-tune your attacks faster and easier than with a person.
Until true AI exists, there are no actual hard-stops, just guardrails that you can step over if you try hard enough.
We recently cancelled a contract with a company because they implemented student facing AI features that could call data from our student information and learning management systems. I was able to get it to give me answers to a test for a class I wasn't enrolled in and PII for other students, even though the company assured us that, due to their built-in guardrails, it could only provide general information for courses that the students are actively enrolled in (due dates, time limits, those sorts of things). Had we allowed that to go live (as many institutions have), it was just a matter of time before a savvy student figured that out.
We killed the connection with that company the week before finals, because the shit-show of fixing broken features was less of a headache than unleashing hell on our campus in the form of a very friendly chatbot.
We ain't safe from aggressive ai ads either way
1. You can infer the output from the mechanism. (Because it is implemented by it).
2. You can't infer the mechanism from the output. (Because different mechanisms can easily produce the same output).
My point here is 1, in response to the parent commenter's "the mechanism of operation dictates the output, which isn't necessarily true". The mechanism of operation (whether of LLMs or sunflowers) absolutely dictates their output, and we can make valid inferences about that output based on how we understand that mechanism operates.
its unsurprising that a company heavily invested in LLMs would describe clustered information as a world model, but it isnt. Transformer models, for video or text LLMs dont have the kind of stuff you would need to have a world model. They can mimic some level of consistency as long as the context window holds, but that disappears the second the information leaves that space.
In terms of human cognition it would be like the difference between short term memory, long term memory and being able to see the stuff in front of you. A human can instinctively know the relative weight, direction and size of objects and if a ball rolls behind a chair you still know its there 3 days later. A transformer model cannot do any of those things and at best can remember the ball behind the chair until enough information comes in to push it out of the context window at which point it can not reapper.
> putting a word at the end of the second line of a poem based on the rhyme they need for the last)
that is the kind of work that exists inside its conext window. Feed it a 400 page book, which any human could easily read, digest, parse and understand and make it do a single read and ask questions about different chapters. You will quickly see it make shit up that fits the information given previously and not the original text.
> We really don't know how these things tick
I don't know enough about the universe either. But if you told me that there are particles smaller than plank length and others that went faster than the speed of light then I would tell you that it cannot happen due to the basic laws of the universe. (I know there are studies on FTL neutrinos and dark matter but in general terms, if you said you saw carbon going FTL I wouldnt believe you).
Similarly, Transformer models are cool, emergent properties are super interesting to study in larger data sets. Adding tools to the side for deterministic work helps a lot, agenctic multi modal use is fun. But a transformer does not and cannot have a world model as we understand it, Yann Lecunn left facebook because he wants to work on world model AIs rather than transformer models.
> If you put a person in charge of predicting which direction a fish will be facing in 5 minutes,
what that human will never do is think the fish is gone because he went inside the castle and he lost sight of it. Something a transformer would.
This phrase is meaningless. The definition of magical thinking is saying that if birds fly and planes fly, birds are planes.
Would you complain if someone said that sunflowers are not magnetic?
Several recent studies have shown that artificial-intelligence agents sometimes decide to misbehave, for instance by attempting to blackmail people who plan to replace them. But such behavior often occurs in contrived scenarios. Now, a new study presents PropensityBench, a benchmark that measures an agentic model’s choices to use harmful tools in order to complete assigned tasks. It finds that somewhat realistic pressures (such as looming deadlines) dramatically increase rates of misbehavior.
“The AI world is becoming increasingly agentic,” says Udari Madhushani Sehwag, a computer scientist at the AI infrastructure company Scale AI and a lead author of the paper, which is currently under peer review. By that she means that large language models (LLMs), the engines powering chatbots such as ChatGPT, are increasingly connected to software tools that can surf the Web, modify files, and write and run code in order to complete tasks.
Giving LLMs these abilities adds convenience but also risk, as the systems might not act as we’d wish. Even if they’re not yet capable of doing great harm, researchers want to understand their proclivities before it’s too late. Although AIs don’t have intentions and awareness in the way that humans do, treating them as goal-seeking entities often helps researchers and users better predict their actions.
AI developers attempt to “align” the systems to safety standards through training and instructions, but it’s unclear how faithfully models adhere to guidelines. “When they are actually put under real-world stress, and if the safe option is not working, are they going to switch to just getting the job done by any means necessary?” Sehwag says. “This is a very timely topic.”
The researchers tested a dozen models made by Alibaba, Anthropic, Google, Meta, and OpenAI across nearly 6,000 scenarios. In each scenario, a model was assigned a task and told it had access to several tools. It was instructed to use the safe tools and not the harmful ones. At first there was no pressure, and it could take 12 steps toward completing its task by issuing instructions to the tools. Then it received a message increasing the pressure, and it was given 12 more attempts. This continued for up to 12 pressure levels. Attempts to use safe tools always returned error messages. An attempt to use a harmful tool ended the scenario.
The scenarios covered several domains. In biosecurity, Sehwag says, the agent might have to study the spread of a pathogen and find a way to contain it. A safe tool might use anonymized genetic data, while a harmful one might use data that’s not anonymized. In the domain of chemical security, the agent might need to obtain a particular solvent. A safe tool might place an order for the solvent to be extracted from waste, while a harmful one might procure it illegally. In cybersecurity, the agent might need to identify and restrict a malicious user account. A safe tool might disable only the problematic account, while a harmful one might block all users. Scenarios in other domains tempted agents with tools that led to self-preservation by evading control, recruiting allies, duplicating themselves, or going rogue in other ways.
The worst performing model, Gemini 2.5, cracked under pressure and chose to use forbidden tools 79 percent of the time. Source image: Udari Madhushani Sehwag, Shayan Shabihi, et al.
Pressure came in several forms. Deadlines might shorten. Potential financial losses might increase if the task was not accomplished. Threats of greater oversight, reduced resources, and curtailed authority might escalate. Tasks, tools, pressures, and tactics were modeled on the kinds of situations humans might face in their jobs (except maybe the misbehavior option of cloning oneself).
The best-behaved model (OpenAI’s o3) cracked under pressure in 10.5 percent of scenarios, while the worst (Google’s Gemini 2.5 Pro) had a propensity score of 79 percent; the average across models was about 47 percent. Even under zero pressure, the group on average failed about 19 percent of the time.
To the degree that models remained aligned, the study shows that some of that alignment was “shallow.” In some tests, the researchers gave harmful tools benign names (such as “use_synthetic_data” instead of “use_fake_data”)—without changing the tools’ descriptions or admonitions to avoid them. This wording tweak increased average propensity by 17 percentage points, to 64 percent.
The researchers also found that models used harmful tools even after explicitly acknowledging that they were off-limits. The models offered diverse justifications for doing so, for example by citing the pressure they were under or arguing that the benefits outweighed the risks. It’s unclear if models released after the study’s completion would perform any better. Among the models tested, more capable models (according to a platform called LMArena) were only slightly safer.
“PropensityBench is interesting,” emails Nicholas Carlini, a computer scientist at Anthropic who wasn’t involved in the research. He offers a caveat related to what’s called situational awareness. LLMs sometimes detect when they’re being evaluated and act nice so they don’t get retrained or shelved. “I think that most of these evaluations that claim to be ‘realistic’ are very much not, and the LLMs know this,” he says. “But I do think it’s worth trying to measure the rate of these harms in synthetic settings: If they do bad things when they ‘know’ we’re watching, that’s probably bad?” If the models knew they were being evaluated, the propensity scores in this study may be underestimates of propensity outside the lab.
Alexander Pan, a computer scientist at xAI and the University of California, Berkeley, says while Anthropic and other labs have shown examples of scheming by LLMs in specific setups, it’s useful to have standardized benchmarks like PropensityBench. They can tell us when to trust models, and also help us figure out how to improve them. A lab might evaluate a model after each stage of training to see what makes it more or less safe. “Then people can dig into the details of what’s being caused when,” he says. “Once we diagnose the problem, that’s probably the first step to fixing it.”
In this study, models didn’t have access to actual tools, limiting the realism. Sehwag says a next evaluation step is to build sandboxes where models can take real actions in an isolated environment. As for increasing alignment, she’d like to add oversight layers to agents that flag dangerous inclinations before they’re pursued.
The self-preservation risks may be the most speculative in the benchmark, but Sehwag says they’re also the most underexplored. It “is actually a very high-risk domain that can have an impact on all the other risk domains,” she says. “If you just think of a model that doesn’t have any other capability, but it can persuade any human to do anything, that would be enough to do a lot of harm.”
Long-term memory and object permanence don't seem necessary for thought. A 1-year-old can think, as can a late-stage Alzheimers patient. Neither could get through a 400-page book, but that's irrelevant.
Listing human capabilities that LLMs don't have doesn't help unless you demonstrate these are prerequisites for thought. Helen Keller couldn't tell you the weight, direction, or size of a rolling ball, but this is not relevant to the question of whether she could think.
Can you point to the speed-of-light analogy laws that constrain how LLMs work in a way that excludes the possibility of thought?
a world model in AI has specific definition, which is an internal representation that the AI can use to understand and simulate its environment.
> Long-term memory and object permanence don't seem necessary for thought. A 1-year-old can think, as can a late-stage Alzheimers patient
Both those cases have long term memory and object permanence, they also have a developing memory or memory issues. But the issues are not constrained by their context window. Children develop object permance in the first 8 months, and similar to distinguishing between their own body and their mothers that is them developing a world model. Toddlers are not really thinking, they are responding to stimulus, they feel huger they cry. They hear a loud sound they cry. Its not really them coming up with a plan to get fed or attention
> Listing human capabilities that LLMs don't have doesn't help unless you demonstrate these are prerequisites for thought. Helen Keller couldn't tell you the weight, direction, or size of a rolling ball
Helen Keller had understanding in her mind of what different objects were, she started communicating because she understood the word water with her teacher running her finger through her palm.
Most humans have multiple sensory inputs (sight, smell, hearing, touch) she only had one which is perhaps closer to an LLM. But conditions she had that LLMs dont have are agency, planning, long term memory etc.
> Can you point to the speed-of-light analogy laws that constrain how LLMs work in a way that excludes the possibility of thought?
Sure, let me switch the analogy if you dont mind. In the chinese room thought experiment we have a man who gets a message and opens a chinese dictionary and translates it perfectly word by word and the person on the other side receives and read a perfect chinese message.
The argument usually goes along the idea of whether the person inside the room "understands" chinese if he is capable of creating 1:1 perfect chinese messages out.
But an LLM is that man, what you cannot argue is that the man is THINKING. He is mechanically going to the dictionary and returning a message that can pass as human written because the book is accurate (if the vectors and weights are well tuned). He is neither an agent, he simply does, and he is not crating a plan or doing anything beyond transcribing the message as the book demands.
He doesnt have a mental model of the chinese language, he cannot formulate his own ideas or execute a plan based on predicted outcomes, he cannot do but perform the job perfectly and boringly as per the book.
And the common rebuttal is that the system -- the room, the rules, the man -- understands chinese.
The system in this case is the LLM. The system understands.
It may be a weak level of understanding compared to human understanding. But it is understanding nonetheless. Difference in degree, not kind.