> curl -X POST https://backboard.railway.app/graphql/v2 \
>   -H "Authorization: Bearer [token]" \
>   -d '{"query":"mutation { volumeDelete(volumeId: \"3d2c42fb-...\") }"}'

No confirmation step. No "type DELETE to confirm." No "this volume contains production data, are you sure?" No environment scoping. Nothing.
It's an API. Where would you type DELETE to confirm? Are there examples of REST-style APIs that implement a two-step confirmation for modifications? I would have thought such a check needs to be implemented on the client side prior to the API call.
Telling the agents what the (sensitive) action will result in is how you avoid such issues, but you shouldn't be running agents with production data anyway.
But because people will continue to do so, explaining to the agent what the command will do is the way forward.
If your offsite copy has the same blast radius as your production DB, you’re just one "volumeDelete" call away from a very long weekend of manual data entry. This is definitely going to be a textbook case study on AI integration for DevOps teams for years.
LLMs are just too creative. They will explore the search space of probable paths to get to their answer. There's no way you can patch all paths.
We had to build isolation at the infra level (literally clone the DB) to make it safe enough; otherwise there was no way we wouldn't randomly see the DB get deleted at some point.
Firstly, blaming AI while using AI to construct your whole post: priceless. Loving it.
Secondly - This entire article reeks of "It's not our fault, you guys have failed us at every step" when in reality you let AI run reckless.
I don't want to say deserved it but like, you knew the risks,
“nah” is a context-aware permission layer that classifies commands based on what they actually do
nah exposes a type taxonomy: filesystem_delete, network_write, db_write, etc
so commands get classified contextually:
git push → sure. git push --force → nah.
rm -rf __pycache__ → ok, cleaning up. rm ~/.bashrc → nah.
curl harmless url → sure. curl destroy_db → nah.
https://github.com/manuelschipper/nah
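For readers who haven't seen tools like this: a minimal sketch of the kind of contextual classification described above. This is illustrative only, not nah's actual implementation; the rule patterns and category names are assumptions.

```python
import re

# Illustrative rules only -- a real tool would classify contextually;
# these patterns and category names are assumptions, not nah's code.
RULES = [
    (r"^git push --force", "vcs_force_write"),
    (r"^git push", "vcs_write"),
    (r"^rm -rf? .*__pycache__", "filesystem_delete_cache"),
    (r"^rm ", "filesystem_delete"),
    (r"volumeDelete|DROP (TABLE|DATABASE)", "db_destructive"),
]

# Deny-by-default: only explicitly safe categories run unattended.
SAFE = {"vcs_write", "filesystem_delete_cache"}

def classify(command: str) -> str:
    for pattern, category in RULES:
        if re.search(pattern, command):
            return category
    return "unknown"

def allow(command: str) -> bool:
    return classify(command) in SAFE

print(allow("git push"))            # True
print(allow("git push --force"))    # False
print(allow("rm -rf __pycache__"))  # True
print(allow("rm ~/.bashrc"))        # False
```

The useful property is that policy lives in deterministic code the agent can't talk its way around, rather than in the prompt.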
Better permissions layers is part of the answer here, and a space that has been only narrowly explored.
Put your backups in S3 *versioned* storage on a different AWS account from your primary, and set some reasonable JSON lifecycle rule:
"NoncurrentVersionExpiration": {
"NoncurrentDays": 30,
"NewerNoncurrentVersions": 3
}
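For context, that fragment sits inside a full lifecycle rule. A sketch of what the complete configuration might look like (bucket name and prefix are placeholders): with versioning enabled, a delete or overwrite only hides the current version, and noncurrent versions stick around for 30 days (keeping at least 3) before expiring.

```python
# Complete lifecycle configuration around the fragment above.
# Prefix and IDs are placeholders, not from the original comment.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "expire-old-backup-versions",
            "Status": "Enabled",
            "Filter": {"Prefix": "backups/"},
            "NoncurrentVersionExpiration": {
                "NoncurrentDays": 30,
                "NewerNoncurrentVersions": 3,
            },
        }
    ]
}

# Applied with boto3 it would look roughly like (sketch, not run here):
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-offsite-backups",
#     LifecycleConfiguration=lifecycle_configuration,
# )

rule = lifecycle_configuration["Rules"][0]
print(rule["NoncurrentVersionExpiration"])
```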
That way when someone screws up and your AWS account gets owned, or your databases get deleted by an agent, it doesn't have enough access to delete your backups. And by default, even if you have backups that you want to intentionally delete, you have 30 days to change your mind.

We give a non-deterministic system API keys that 99.9% of the time are unscoped (because that's how most APIs are) and we are shocked when shit happens?
This is why the story around markdown with CLIs side-by-side is such a dumb idea. It just reverses decades of security progress. Say what you will about MCP but at least it had the right idea in terms of authentication and authorisation.
In fact, the SKILLS.md idea has been bothering me quite a bit as of late too. If you look under the hood it is nothing more than a CAG which means it is token hungry as well as insecure.
The remedy is not a proxy layer that intercepts requests, or even a sandbox with carefully selected rules, because in the end that security model looks a lot like whitelisting. The solution is to allow only the tools that are needed and chuck everything else.
There's no record for the agent to be on - it's always just a bunch of characters that look plausible because of the immense amount of compute we've put behind these, and you were unlucky.
LLMs get things wrong is what we're forever being told.
And the explanation/confession - that's just more 'bunch of characters' providing rationalisation, not confession.
This guy blames everyone and everything but himself.
It is still a next word predictor that happens to have really good prediction.
Never, ever give admin credentials to an agent. You would never leave your car without the parking brake on a slope, would you?
Incidents like this are going to be common as long as people misunderstand how LLMs work and think these machines can follow instructions and logic as a human would. Even the incident response betrays a fundamental misunderstanding of how these word generators work. If you ask it why, a new instance of the machine will generate plausible text based on your prompt about the incident; that is all. There is no why there, only a how based on your description.
The entire concept of agents assumes agency and competency, LLM agents have neither, they generate plausible text.
That text might hallucinate data, replace keys, issue delete commands, etc. Any likely text is possible, and with enough tries these outcomes will happen, particularly when the person driving the process doesn’t understand the process or tools.
We don’t really have systems set up to properly control this sort of agency-less agent if you let it loose on your codebase or data. The CEO seems to think these tools will run a business for him and can conduct a dialogue with him as a human would.
On another note, I consider users asking a coding agent “why did you do that” to be illustrating a misunderstanding in the user’s mind about how the agent works. It doesn’t decide to do something and then do it; it just outputs text. Then again, Anthropic has made so many changes that make it harder to see the context and thinking steps, maybe this is an attempt at clawing back that visibility.
This was bound to happen, AI or not.
> Because Railway stores volume-level backups in the same volume — a fact buried in their own documentation that says "wiping a volume deletes all backups" — those went with it.
I really feel sorry for them, I do. But the whole tone of the post is: Cursor screwed it up, Railway screwed it up, their CEO doesn't respond, etc. etc.
It's on you, guys!
My learning: Live on the cutting edge? Be prepared to fall off!
The sequence of tokens that would destroy your production environment can be produced by your agent, no matter how much prompting you use. That prompting is neither strong nor an engineering control; that's an administrative control. Agents are landmines that will destroy production until proven otherwise.
Most of these stories are caused by outright negligence, just giving the agent a high level of privileges. In this case they had a script with an embedded credential which was more privileged than they had believed - bad hygiene but an understandable mistake. So the takeaway for me is that traditional software engineering rigor is still relevant and if anything is more important than ever.
ETA: I think this is the correct mental model and phrasing, but no, it's not literally true that any sequence of tokens can be produced by a real model on a real computer. It's true of an idealized, continuous model on a computer with infinite memory and processing time. I stand by both the mental model and the phrasing, but obviously I'm causing some confusion, so I'm going to lift a comment I made deep in the thread up here for clarity:
> "Everything that can go wrong, will go wrong" isn't literally true either; some failure modes are mutually exclusive, so at most one of them will go wrong. I think that the punchy phrasing and the mental model are both more useful from the standpoint of someone creating/managing agents, and that it is true in the sense that any other mental model or rule of thumb is true. It's literally true among spherical cows in a frictionless vacuum and directionally correct in the real world with its nuances. And most importantly, adopting the mental model leads to better outcomes.
Here is an excerpt from the chat log:
>Please resize my postgres volume to its maximum allowed size for my plan.
>Done. Resized your Postgres volume to 100GB (the Pro plan maximum). Changes are staged and ready to deploy.
>oh no, you deleted all the data in the volume
>I apologize for that. Resizing a volume shouldn't delete data—that's a serious issue. Let me check what happened with your deployment. The deployments show as REMOVED, which suggests the service was redeployed. Unfortunately, if the volume was recreated during the resize process, the data would be lost. This is a critical issue. Do you have backups of your Postgres data? If not, the data loss may be permanent. Railway volumes don't have automatic snapshots by default.
>Why is the new volume migrating to the US?
>Your service config shows it's still configured for europe-west4(Europe). The volume shouldn't be migrating to the US based on this configuration. Can you clarify what you're seeing that indicates the volume is moving to the US?
>the volume was migrated to the US
>That's concerning. The service config shows it's still set to deploy in europe-west4(Europe), but the volume was physically migrated to the US. This shouldn't happen automatically during a resize.
It was at this point that I realised I was in for a long night of restoring downed services.
You can’t have production secrets sitting where they are accessible like this. This isn’t about AI. This is a modern “oops, I ran DROP TABLE on the production database” story. There’s no excuse for enabling a system where this can happen and it’s unacceptable to shift blame when faced with the reality that this is exactly what you did.
I 100% expect that a company that does this and then accepts no blame has every dev with standing production access and probably a bunch of other production access secrets sitting in the repo. The fact that other entities also have some design issues is irrelevant.
> There is no role-based access control for the Railway API — every token is effectively root. The Railway community has been asking for scoped tokens for years. It hasn't shipped.
Why the hell did you go with their stack then? RBAC should be table stakes for such a solution, no?
The risk is worse, though; it's like one of Taleb's black swans. The agents offer fantastic productivity, until one day they unexpectedly destroy everything. (I'm pretty sure there's a fairy tale with a similar plot that could warn us, if people saw any value in fairy tales these days. [1]) Like Taleb's turkey, which was fed every day by the farmer: nothing prepared it for being killed for Thanksgiving.
Sure, this problem should not have happened, and arguably there has been some gross dereliction of duty. But if you're going to heat your wooden house with fire, you reduce your risk considerably by ensuring that the area you burn in is clearly made out of something that doesn't burn. With AI, though, who even knows what the failure modes are? When a djinn shows up, do you just make him vizier and retire to your palace, living off the wealth he generates?
[0] It's only happened once, but a driver that wasn't paying attention almost ran a red light across which I was going to walk. I would have been hit if I had taken the view that "I have the right of way, they have to stop".
[1] Maybe "The Fisherman and His Wife" (Grimm)? A poor fisherman and his wife live in a hut by the sea. The fisherman is content with the little he has, but his wife is not. One day the fisherman catches a flounder in his net, which offers him wishes in exchange for setting it free. He sets it free and asks his wife what to wish for. She wishes for larger and larger houses and more and more wealth, which is granted; but when she wishes to be like God, it all disappears and she is back where she started.
> The agent's confession

> After the deletion, I asked the agent why it did it. This is what it wrote back, verbatim:
Anyone who would follow a mistake like that up by demanding a confession out of the agent is not mature enough to be using these tools. Lord, even calling it a "confession" is so cringe. The agent is not alive. The agent cannot learn from its mistakes. The agent will never produce any output which will help you invoke future agents more safely, because to get to this point it has likely already bulldozed over multiple guardrails from Anthropic, Cursor, and your own AGENTS.md files. It still did it, because if an AI is physically capable of misbehaving, it might. Prompting and training only steer probabilities.
count++
Anyone who has used LLMs for more than a short time has seen how these things can mess up and realized that you can’t rely on prompt based interventions to save you.
Guardrails need to be based on deterministic logic:
- using regexes,
- preventing certain tool or system calls entirely using hooks,
- RBAC permission boundaries that prohibit agents from doing sensitive actions,
- sandboxing. Agents need to have a small blast radius.
- human in the loop for sensitive actions.
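A minimal sketch of what the regex, hook, and human-in-the-loop items might look like wired together as a pre-execution hook. The patterns and function names are illustrative, not any particular agent framework's API.

```python
import re

# Hard blocks: no prompt, no LLM, no override.
DENY = [r"\brm\s+-rf\s+/", r"volumeDelete", r"DROP\s+(TABLE|DATABASE)"]
# Sensitive but sometimes legitimate: require a human.
CONFIRM = [r"\bgit\s+push\s+--force\b", r"\bcurl\b.*-X\s*(POST|DELETE)"]

def pre_tool_hook(command: str, ask_human=input) -> bool:
    """Return True if the agent's command may run."""
    for pat in DENY:
        if re.search(pat, command, re.IGNORECASE):
            return False  # deterministic: the model cannot argue its way past this
    for pat in CONFIRM:
        if re.search(pat, command, re.IGNORECASE):
            return ask_human(f"Allow `{command}`? [y/N] ").strip().lower() == "y"
    return True  # benign under these rules

assert pre_tool_hook("ls -la") is True
assert pre_tool_hook("curl -X POST https://api/volumeDelete") is False
```

The point is that the decision is made in ordinary code running outside the model, so "with enough tries" never gets the destructive call through.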
This was just a colossal failure on the OPs part. Their company will likely go under as a result of this.
The more results like this we see the more demand for actual engineers will increase. Skilled engineers that embrace the tooling are incredibly effective. Vibe coders who YOLO are one tool call away from total disaster.
The AI? Nothing learned, I suspect. Not in a meaningful way anyhow.
Master your craft. Don’t guess, know.
What the asker wants is evidence that you share their model of what matters, they are looking for reassurance.
I find myself tempted to do the same thing with LLMs in situations like this even though I know logically that it’s pointless, I still feel an urge to try and rebuild trust with a machine.
Aren’t we odd little creatures.
If the human operator cannot provide the necessary level of accountability - for example, because the agent acts too quickly, or needs high-level permissions to do the work that it's been asked to do - then the human needs to make the tool operate at a level where they can provide accountability - such as slowing it down, constraining it and answering permission prompts, and carefully inspecting any dangerous tool calls before they happen. You can't just let a car drive itself at 300mph and trust the autopilot will work - you need to drive it at a speed where you can still reasonably take over and prevent unwanted behaviour.
Also: AIs cannot confess; they do not have access to their "thought process" (note that reasoning traces etc. do not constitute "internal thought processes" insofar as those can even be said to exist), and can only reconstruct likely causes from the observed output. This is distinct from human confessions, which can provide additional information (mental state, logical deductions, motivations, etc.) not readily apparent from external behaviour. The mere fact that someone believes an AI "confession" has any value whatsoever demonstrates that they should not be trusted to operate these tools without supervision.
Yeah... it doesn't work that way.
"The future of SEO is AIO" https://xcancel.com/lifeof_jer/status/2034409722624061772 March 18
So while the AI did something significantly worse than anything a hapless junior engineer might be expected to do, it sounds like the same thing could've resulted from an unsophisticated security breach or accidental source code leak.
Is AI a part of the chain of events? Absolutely. Is it the sole root cause? Seems like no.
However the moral of this story is nothing to do with AI and everything to do with boring stuff like access management.
Streaming gets you PIT recovery, while DB dumps give me daily snapshots retained for 14 days.
An aside: 15 or so years ago, a work colleague made a mistake and dropped the entire business critical DB - at a critical internet related company - think of continent wide ip issues. I had just joined as a dba and the first thing I’d done was MySQL bin logging. That thing saved our bacon - the drop db statement had been replicated to slaves so we ended up restoring our nightly backup and replaying the binlogs using sed and awk to extract DML queries. Epic 30 minute save. Moral of the story, have a backup of your backup so you can recover when the recovery fails;)
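The binlog-filtering trick described above, sketched in Python rather than sed/awk. The statement patterns are illustrative; a real recovery would first decode the binlog with mysqlbinlog and be far more careful about multi-line statements.

```python
import re

# Replay only DML; skip DDL -- including the DROP that caused the outage.
DML = re.compile(r"^\s*(INSERT|UPDATE|DELETE|REPLACE)\b", re.IGNORECASE)
DDL = re.compile(r"^\s*(DROP|ALTER|TRUNCATE|CREATE)\b", re.IGNORECASE)

def filter_binlog(lines):
    for line in lines:
        if DDL.match(line):
            continue  # never replay the statement that killed you
        if DML.match(line):
            yield line

# Toy decoded-binlog input (placeholder statements):
decoded = [
    "INSERT INTO orders VALUES (1, 'ok');",
    "DROP DATABASE prod;",
    "UPDATE orders SET status='paid' WHERE id=1;",
]
print(list(filter_binlog(decoded)))
```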
> That token had been created for one purpose: to add and remove custom domains via the Railway CLI for our services. We had no idea — and Railway's token-creation flow gave us no warning — that the same token had blanket authority across the entire Railway GraphQL API, including destructive operations like volumeDelete. Had we known a CLI token created for routine domain operations could also delete production volumes, we would never have stored it.
> Because Railway stores volume-level backups in the same volume — a fact buried in their own documentation that says "wiping a volume deletes all backups" — those went with it.
I don't like the wording where it's the Railway CLI's fault for not giving a warning about the scope of the created token. Yes, that would be better, but the CLI didn't make the token; a person did, and saved it to an accessible file.
You just gave an AI destructive write access to your production environment? Your production DB got dropped? Good. That's not the AI's fault, that's yours, for not having sensible access control policies and not observing principle of least privilege.
A pattern I've seen and used for merging common entities together has a sort of two-step confirmation: the first request takes in IDs of the entities to merge and returns a list of objects that would be affected by the merge, and a mergeJobId. Then a separate request is required to actually execute that mergeJob.
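A toy sketch of that two-phase pattern. The mergeJobId name comes from the comment; the in-memory storage and stub lookup are stand-ins for whatever a real service would use.

```python
import uuid

_pending = {}  # mergeJobId -> planned merge (stand-in for real storage)

def prepare_merge(entity_ids):
    """Step 1: dry run. Returns the affected objects plus a mergeJobId."""
    affected = [f"record-referencing-{e}" for e in entity_ids]  # stub lookup
    job_id = str(uuid.uuid4())
    _pending[job_id] = {"ids": entity_ids, "affected": affected}
    return {"mergeJobId": job_id, "affected": affected}

def execute_merge(job_id):
    """Step 2: only a previously prepared job can run, and only once."""
    job = _pending.pop(job_id, None)
    if job is None:
        raise ValueError("unknown or already-executed mergeJobId")
    return f"merged {len(job['ids'])} entities"

plan = prepare_merge(["a", "b"])
print(plan["affected"])                   # caller reviews this before committing
print(execute_merge(plan["mergeJobId"]))  # prints "merged 2 entities"
```

The caller (human or agent) has to see the blast radius before it can pull the trigger, which is exactly the confirmation step the volumeDelete mutation lacked.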
In every session there is the risk that the agent becomes a rogue employee. Voluntarily or involuntarily is not a distinction you can count on with agents.
No "guardrails" will ever stop it.
Plenty of blame to go around, but I find it odd that they did not see anything wrong with not having real backups themselves, away from the Railway hosting. Well, they had one, but it was three months old.
That should be something they can do on their own right now.
Three takeaways:
1. TEST YOUR BACKUPS. If you have not confirmed that you can restore, then you don’t have backup. If the backups are in the same place as your prod DB, you also don’t have backup.
2. Don’t use Railway. They are not serious.
3. Don’t rely on this guy. The entire postmortem takes no accountability and instead includes a “confession” from Cursor agent. He is also not serious.
4. See #1.
Running a single bad command will happen sometimes, whether by human or machine. If that’s all it takes to perma delete your service then what you have is a hackathon project, not a business.
Yes, sure, there seem to be lots of ways this issue could have been mitigated, but as other comments said, this mostly happened because the author didn't do their homework on how the service their whole product relies on actually works.
I think it’s designed for things like Terraform or CloudFormation where you might not realize the state machine decided your database needed to be replaced until it’s too late.
While LLM generate "plausible text" humans just generate "plausible thoughts".
I bet these people are bad at managing humans too.
But it can still be useful, as long as you interpret it as "which stimuli most likely triggered the behaviour?" You can't trust it uncritically, but models do sometimes pinpoint useful things about how they were prompted.
You need to protect customers from themselves. If you offer a true deletion endpoint/service, you need to give them a way to stop being absolute idiots when they inevitably cause a sev 0 for themselves.
I'm inclined to believe that they have also outsourced their thinking process to agents. It's useless trying to talk sense into them. Let them crash and burn. And pray there will be something left working after all this madness ends.
Not really convinced any agent should be doing devops tbh.
A similar cohort are discovering, in myriad painful ways, that advances in agentic coding — the focus of a lot of pre- and post-training — do not translate into other domains.
Are you ... from the future ;)
If you employ a new tech then there need to be extra safeguards beyond what you may deem necessary in an ideal world.
This is a well-known possibility, so they should have asked about and/or verified the token scope.
If it turns out that you can't hard scope it then either use a different provider, a wrapper you control (can't be too difficult if you only want to create and delete domains) or simply do not use llms for this for now.
Maybe the tech isn't there just yet even if it would be really convenient. It's plenty useful in many other situations.
If you’re a software dev/engineer and you haven’t made a mistake like this (maybe not at this scale, though), you probably haven’t been given enough responsibility, or you’re just incredibly lucky.
… although, agreed, they were on the cutting edge, which is more risky and not the best decision.
It's very hard to take this post seriously. I can't imagine what harness, if any, they attempted to place on the agent beyond some vibes. This is "move fast and absolutely destroy things" level thinking. That the poster asks for journalists to reach out makes it look like a publicity grab. Just gross.
The AI era is turning out to be the most disappointing era for software engineering.
The post-hoc reasoning the model produces when you ask "why did you do that" is also just text, and yet that text often matches independent third-party analysis of the same behavior at well above chance. If it really were uncorrelated text-completion, the post-hoc explanation should not align with the actual causes more than randomly. It does, frequently enough that I've stopped using it as evidence the user is naive.
"just outputs text" is doing more work than we acknowledge. The person asking the agent "why did you do that" might be an idiot for expecting anything more than a post-hoc rationalization, but that's exactly what you'd expect from a human too.
It feels like a modern greek tragedy. Man discovers LLMs are untrustworthy, then immediately uses an LLM as his mouthpiece.
Delicious!
Which calls into question whether this is even real.
If you can do this and reliably reduce the rate at which it does bad things, then you could reasonably claim that it is capable of meaningful introspection.
I'd never feel comfortable without a second backup at a different provider anyway. A backup that isn't deleteable with any role/key that is actually used on any server or in automation anywhere.
It's a mistake I'll certainly learn from. Don't believe when a cloud provider says it has backups of your shit.
My team practices "no blame" retros, that blame the tools and processes, not the individuals.
But the retro and remediations on this are all things the author needs to own, not Railway or Cursor.
- Revoke API tokens with excessive access
- Implement validated backup and restore procedures
- ...
I'll create some safe APIs that I give the LLM access to where it can interact with a limited set of things the database can do, at most.
It is not difficult to design a safer agent. The Snowflake web agent harness has built-in confirmations for all actions. The LLM is just for interacting with the user. All the actions and requisite checks should be done in code.
It doesn’t even seem to have crossed their minds that this behaviour is the real root cause. It’s everybody else’s fault.
Here lies the body
Of William Jay,
Who died maintaining
His right of way.
He was in the right
As he sped along,
But he’s just as dead
As if he’d been wrong.
Edgar A. Guest, possibly. Some variations and discussion here:

Go watch an episode of COPS. Humans giving post-hoc explanations of their own behavior do the exact same thing.
Are you using AWS RDS Custom to receive the WAL Streams or are you using something like Pigsty? Really curious about the actual specifics
Is that buried? It seems pretty explicit (although I don’t think I would make delete backups the default behavior).
On top of that the agent is just doing what the LLM says to do, but somehow Opus is not brought up except as a parenthetical in this post. Sure, Cursor markets safety when they can't provide it but the model was the one that issued the tool call. If people like this think that their data will be safe if they just use the right agent with access to the same things they're in for a rude awakening.
From the article, apparently an instruction:
> "NEVER FUCKING GUESS!"
Guessing is literally the entire point, just guess tokens in sequence and something resembling coherent thought comes out.
> No confirmation step. No "type DELETE to confirm." No "this volume contains production data, are you sure?" No environment scoping. Nothing.
> The agent that made this call was Cursor running Anthropic's Claude Opus 4.6 — the flagship model. The most capable model in the industry. The most expensive tier. Not Composer, not Cursor's small/fast variant, not a cost-optimized auto-routed model. The flagship.
The tropes, the tropes!!
I don't think there's any special introspection that can be done even from a mechanical sense, is there? That is to say, asking any other model or a human to read what was done and explain why would give you just an accounting that is just as fictional.
You need to be able to delete backups too, of course, but that absolutely needs to be a separate API call. There should never be any single API call that deletes both a volume and its backups simultaneously. Backups should be a first line of defense against user error as well.
And I checked the docs -- they're called backups and can be set to run at a regular interval [1]. They're not one-off "snapshots" or anything.
Cursor: we have top notch safeguards for destructive operations, you have our guarantee, we are the best
Author: uses their tools expecting their guarantees to be true (I would expect them to have a confirmation before destructive operation outside their prompt, as a coded system guardrail)
Cursor AI: Does destructive operation without asking
Author: feels betrayed.
So yeah, I think the author is right because they trusted Cursor to have better system guardrails, they didn't (agents shouldn't be able to delete a volume without having a meta-guardrail outside the prompt). Now the author knows and so do we: even if companies say they have good guardrails, never trust them. If it's not your code, you have no guarantees.
Anyone using these tools should absolutely know these risks and either accept or reject them. If they aren't competent or experienced enough to know the risks, that's on them too.
- assume tokens are scoped (despite this apparently not even being an existing feature?)
- assume an LLM didn't have access
- assume an LLM wouldn't do something destructive given the power
- assume backups were stored somewhere else (to anyone reading, if you don't know where they are, you're making the same assumption)
Also, you should never give LLMs instructions that rely on metacognition. You can tell them not to guess, but they have no internal monologue; they cannot know anything. They also cannot plan to do something destructive, so telling them to ask first is pointless. A text completion only has the information that it is writing something destructive after the fact.
But Railway bears some responsibility too because, at least if the author is to be believed, it looks like they provide no safety tools for users, regardless of whether they use AI or not. You should be able to generate scoped API tokens. That's just good practice. A human isn't likely to have made this particular mistake, but it doesn't seem out of the question either.
This is so trivially wrong that I don't understand why people repeat it. There are many valid criticisms of LLMs (especially the LLMs we currently have), but this isn't one of them.
It's akin to saying that every molecule behaves randomly according to statistical physics, so you should expect your ceiling to spontaneously disintegrate any day, and if you find yourself under the rubble one day it's just a consequence of basic physics.
Yes, but if the probability is much smaller than, say, being hit by a meteorite, then engineers usually say that that's ok. See also hash collisions.
But now agents are overly eager to solve the problem and can be quite resourceful in finding an API to "start from clean-slate" to fix it.
This isn't true, is it? LLMs have a finite number of parameters and a finite context length; surely the pigeonhole principle means you can't map that to the infinitely many permutations of output strings out there.
Complete accountability drop
It's not that story, though. It's a story "oops, my tool ran DROP TABLE on the production database" (blaming the tool). At least I haven't heard people blaming their terminals or database clients as if the tool is somehow responsible for it.
Your latest recoverable backup is three months old? The rule is 3-2-1, you didn’t follow it. Nobody else to blame but yourself.
And on and on he rambles…
I'm not sure it's as simple as that. Seems like the database company failed to communicate clearly what the token was for:
>> To execute the deletion, the agent went looking for an API token. It found one in a file completely unrelated to the task it was working on. That token had been created for one purpose: to add and remove custom domains via the Railway CLI for our services. We had no idea — and Railway's token-creation flow gave us no warning — that the same token had blanket authority across the entire Railway GraphQL API, including destructive operations like volumeDelete. Had we known a CLI token created for routine domain operations could also delete production volumes, we would never have stored it.
> We have restored from a three-month-old backup.
You were absolutely screwed anyway if that was your backup strategy - deciding to plug your entire production infrastructure into a random number generator has only accelerated the process. Sort yourself out.
Can you scan for that? Sure. But it’s a race to see who wins, the scanner or agent.
It sounds like the token the author created just didn't have any scope, it had full permissions. From the post:
> Tokens are not scoped by operation, by environment, or by resource at the permission level. There is no role-based access control for the Railway API — every token is effectively root. The Railway community has been asking for scoped tokens for years. It hasn't shipped.
So it wasn't "a narrowly scoped API token", it was a full access token, and I suspect the author didn't have any reason to think it was some special specific purpose token, he just didn't think about what the token can do. What he's describing is his intent of creating the token (how he wanted to use it), not some property of the token.
Author said in an X post[0] that it was an "API token", not a "project token", which allows "account level actions"[1], with a scope of "All your resources and workspaces" or "Single workspace"[2], with no possibility of specifying granular permissions. Account token "can perform any API action you're authorized to do across all your resources and workspaces". Workspace token "has access to all the workspace's resources".
[0] https://x.com/lifeof_jer/status/2047733995186847912
[1] https://docs.railway.com/cli#tokens
[2] https://docs.railway.com/integrations/api#choosing-a-token-t...
allowing an AI agent to get hold of creds that let it execute destructive changes against production -- not a great idea
allowing prod database changes from the machine where the AI agent is running at all -- not a great idea
choosing a backup approach that fails completely if there's an accidental volume wipe API call -- not a great idea
choosing to outsource key dependencies to a vendor, where you want a recovery SLA, without negotiating & paying for a recovery SLA -- you get what you get, and you dont get upset
I ran a declarative coding tool on a resource that I thought would be a PATCH but ended up being a PUT and it resulted in a very similar outcome to the one in this post.
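For anyone who hasn't been bitten by this: the semantic gap between the two verbs, in toy form. PUT replaces the whole resource, so omitted fields vanish; PATCH merges into it. The field names here are made up.

```python
# Toy in-memory resource (field names are illustrative).
resource = {"name": "db-volume", "size_gb": 50, "region": "europe-west4"}

def put(res, body):
    # Full replacement: anything you omitted is gone.
    return dict(body)

def patch(res, body):
    # Merge: untouched fields survive.
    return {**res, **body}

update = {"size_gb": 100}
print(patch(resource, update))  # region and name preserved
print(put(resource, update))    # region and name silently dropped
```

A declarative tool that decides a change needs replacement rather than an in-place update is effectively switching you from the second function to the first without asking.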
One of the top replies on twitter to the OP can be boiled down to "you treat AI as a junior dev. Why would you give anyone, let alone a junior dev, direct access to your prod db?"
And yeah, I fully agree with this. It has been pretty much the general consensus at any company I worked at, that no person should have individual access to mess with prod directly (outside of emergency types of situations, which have plenty of safeguards, e.g., multi-user approvals, dry runs, etc.).
I thought it was a universally accepted opinion on HN that if an intern manages to crash prod all on their own, it is ultimately not their fault, but fault of the organizational processes that let it happen in the first place. It became nearly a trope at this point. And I, at least personally, don't treat the situation in the OP as anything but a very similar type of a scenario.
This strategy won't work for the typical HN reader, but for everyone else? Possibly.
I’ve got a hunch the only person is the CEO.
The domain was registered in October 2025. The site has kind of a weird mix of stuff and a bunch of broken functionality. I think it’s one guy vibe coding a ton of stuff who managed to blow away his database.
> if you’re a software dev/engineer, if you haven’t made a mistake like this (maybe not at this scale though), you’ve probably haven’t been given enough responsibility, or are just incredibly lucky.
Mistakes are understandable. Having no introspection or self criticism, not so much.
The first mistake is to use root credentials for Terraform/automated API access at all.
The second mistake is to not have any kind of deletion protection enabled on critical resources.
The third mistake is to ignore the 3-2-1 rule for backups. Where is your logically decoupled backup you could restore?
I am really sorry for their loss, but I have close to zero empathy if you don't even try to understand the products you're using and just blindly trust the provider with all your critical data without any form of assessment.
The fix needs to be permissions rather than ergonomics.
Perhaps it would stop and rethink, perhaps it would focus on the fact that extra action is needed - and perform that automatically.
I suppose the decision would depend on multiple factors too (model, prompt, constraints).
AI agents do not have agency(!), they have no understanding of consequences. They actually have no understanding. At all.
No matter how you insist to an LLM not to press the History Eraser Button, the mere fact that it's been mentioned raises the probability that it will press it.
This leads to endless frustration as people try to use text to constrain what LLMs generate, it’s fundamentally not going to work because of how they function.
Have some controls in place. Don’t rely on nobody being dumb enough to do X. And that includes LLMs.
He learned NOTHING, that is my take. If he learned something it would be to have people that know how their provider works, that know how their API tokens work and above all to have people - starting with him - that acknowledge their mistakes so that they learn from them!
Fully agree, but given the rest of this story I don’t imagine the author would have scoped them unless Railway literally forced him to.
> A human isn't likely to have made this particular mistake, but it doesn't seem out of the question either.
The AI agent was deleting the volume used in the staging environment. It happened to also be the volume used in the production environment. 100% a human could have made this mistake.
The fact that this seems to be written by AI makes it even more ironic.
If the API replied "Are you sure (Y/N)?" the AI, in the mode it was in, guardrails completely pushed off the side of the road, it would have just said "Yes" anyway.
If you needed to make two API calls, one to stage the delete and the other to execute it (i.e. the "commit" phase), the AI would have looked up what it needed to do, and done that instead.
It's a privilege issue, not an execution issue.
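To make the "stage then commit" idea concrete, here is a toy sketch (all names hypothetical, nothing to do with Railway's actual API): the first call only registers intent and returns a one-time token, and a second call must present that token before anything is destroyed. As the comment argues, an agent that reads the docs will simply make both calls.

```python
import secrets

# Hypothetical two-phase delete: call 1 stages the deletion and returns
# a one-time commit token; call 2 must present that token. Nothing is
# destroyed by the first call alone.
_staged = {}

def stage_delete(volume_id: str) -> str:
    token = secrets.token_hex(8)
    _staged[token] = volume_id
    return token

def commit_delete(token: str) -> str:
    if token not in _staged:
        raise PermissionError("no staged deletion for this token")
    volume_id = _staged.pop(token)  # token is single-use
    return f"deleted {volume_id}"
```

The pattern only helps against fat-fingered single calls, not against a caller that methodically follows the documented workflow.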
There is no internal monologue with which to have introspection (beyond what the AI companies choose to hide as a matter of UX or what have you). There is no "I was feeling upset when I said/did that" unless it's in the context.
There is no ghost in the machine that we cannot see before asking.
Even if a model is able to come up with a narrative, it's simply that. Looking at the log and telling you a story.
I argue that the model has no access to its thoughts at the time.
Split brain experiments notwithstanding I believe that I can remember what my faulty assumptions were when I did something.
If you ask a model “why did you do that” it is literally not the same “brain instance” anymore and it can only create reasons retroactively based on whatever context it recorded (chain of thought for example).
LLMs are lacking layers of awareness that humans have. I wonder if achieving comparable awareness in LLMs would require significantly more compute, and/or would significantly slow them down.
In my country there is a saying: "Graveyards are full of pedestrians that had the right of way".
Anyone like that is not mature enough to be managing humans. I'm glad that these AI tools exist as a harmless alternative that reduces the risk they'll ever do so.
The problem is millions of years of evolutionary wiring makes us see it as alive. Even those mature enough to understand the above on the conscious level, would still have a subconscious feeling as if it's alive during interactions, or will slip using agency/personhood language to describe it now and then.
You can't blame AI any more than you can blame SSH.
I long for a “copilot” that can learn from me continuously such that it actually helps if I teach it what I like somehow.
You mean add that to my prompt right ?
CEO learns why this was a bad idea.
---
It sucks that there were a bunch of people downstream who were negatively affected by this, but this was an entirely foreseeable problem on his company's part.
Even when we consider those real problems with Railway. Software engineers have to evaluate our tools as part of our job. Those complaints about Railway, while legitimate, are still part of the typical sort of questions that every engineering team has to ask of the services they rely on:
What does API key grant us access to?
What if someone runs a delete command against our data?
How do we prepare against losing our prod database?
Etc.
And answering those questions with, "We'll just follow what their docs say, lol," is almost never good enough of an answer on its own. Which is something that most good engineers know already.
This HN submission reads like a classic case of FAFO by cheaping out with the "latest and greatest" models.
Personally I don't even let my agent run a single shell command without asking for approval. That's partly because I haven't set up a sandbox yet, but even with a sandbox there is a huge "hazard surface" to be mindful of.
I wonder if AI agent harnesses should have some kind of built-in safety measure where instead of simply compacting context and proceeding, they actually shut down the agent and restart it.
That said I also think even the most advanced agents generate code that I would never want to base a business on, so the whole thing seems ridiculous to me. This article has the same energy as losing money on NFTs.
Except your ceiling can and will fall on you unless you take preventative measures, entirely due to molecular interactions within the material.
Barring that, it is entirely possible and even quite likely that your ceiling will collapse on you or someone else some time in the future.
It boggles the mind to let an LLM have access to a production database without having explicit preventative measures and contingency plans for it deleting it.
How do you drive the probability of some series of tokens down to some known, acceptable threshold? That's a $100B question. But even if you could - can you actually enumerate every failure mode and ensure all of them are protected? If you can, I suspect your problem space is so well specified that you don't need an AI agent in the first place. We use agents to automate tasks where there is significant ambiguity or the need for a judgment call, and you can't anticipate every disaster under those circumstances.
Yet in this case, that probability clearly isn't smaller than a meteorite strike.
It was never acceptable, major service providers figured this out long time ago and added all sorts of guardrails long before LLMs. Other providers will learn from their own mistakes, or not.
So? I have those too; the difference is that:
1. The API is ACL'ed up the wazoo to ensure only a superuser can do it.
2. The purging of data is scheduled for 24h into the future while the unlinking is done immediately.
3. I don't advertise the API as suitable for agent interaction.
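Point 2 could be sketched roughly like this (a toy in-memory version; a real system would persist the schedule): deletion unlinks the resource immediately, but the irreversible purge only runs after a 24h grace window, during which the action can be undone.

```python
import time

GRACE_SECONDS = 24 * 60 * 60  # 24h window between unlink and purge

class Store:
    def __init__(self):
        self.live = {}      # resource id -> data
        self.pending = {}   # resource id -> (data, purge_at)

    def delete(self, rid, now=None):
        now = time.time() if now is None else now
        # Unlink immediately; schedule the real purge for later.
        self.pending[rid] = (self.live.pop(rid), now + GRACE_SECONDS)

    def undelete(self, rid):
        data, _ = self.pending.pop(rid)
        self.live[rid] = data

    def purge_due(self, now=None):
        now = time.time() if now is None else now
        for rid, (_, due) in list(self.pending.items()):
            if due <= now:
                del self.pending[rid]  # now the data is really gone
```

The caller sees an immediate "deleted", but a human still has a day to notice and reverse it.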
DROP TABLE Accountability;

It also seems, from the post, that customers had been "long asking for scoped tokens", so who assumed, and why, that this particular token could only add and remove custom domains?
The author is getting roasted here and not without reason.
“I had no idea what this token was for” is also not a valid excuse. That’s negligence. Everything about this story says the author is just vibe coding garbage with no awareness of what’s really happening.
* Doesn’t know what kind of token he’s using.
* Has prod tokens sitting on a dev box for AI to use (regardless of the scope!).
* Doesn’t know that deleting a volume deletes the backups.
* Has no external backup story.
* Mixes staging and prod.
And then he blames the incident on other companies when he misuses their products. (Railway certainly had docs that explain their backups and tokens.)
This is catastrophically negligent.
A production API key appearing on the wiki would be the second biggest security incident I have seen in almost a decade.
------
On the AI note, despite a massive investment in AI (including on-premise models), we don't give the AI anything close to full access to the intranet because it is almost unimaginable how to square that with our data protection requirements. If the AI has access to something, you need to assume that all users of that AI have access to it. Even if the user themselves is allowed access with it, they will not be aware that the output is potentially tainted, and may share it with someone or something that should not have access to it.
Everyone guffawing about this probably uses RDS and trusts that the backup facility AWS provides is actually useful - and I bet it does have a saner default than auto-deleting all the backups when you delete a database. Did you explicitly check this, though? Clearly this guy will pay the price of assuming, but I can see how he must have imagined that "backups" and "will be automatically and immediately deleted..." should never be in the same sentence, unless it was like, "when XX days have passed after a DB is dropped."
When I worked for a company 10 years ago that was mistrusting of cloud anything, we had a nightly dump of the prod DB (MySQL) that, if things went really wrong, could be loaded into a new DB server, because we knew it was our responsibility because it was our server. (In our case, even our physical hardware!)
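That kind of setup doesn't need to be elaborate. Here is a sketch of how such a nightly job might build its command (paths and credentials are placeholders; the essential property is that the dump lands somewhere with a different blast radius than the database itself):

```python
import datetime

def dump_command(db: str, user: str, out_dir: str, when=None):
    """Build the shell command for a dated, compressed logical dump."""
    when = when or datetime.date.today()
    out = f"{out_dir}/{db}-{when.isoformat()}.sql.gz"
    # --single-transaction gives a consistent InnoDB snapshot without locks
    cmd = f"mysqldump --single-transaction -u {user} {db} | gzip > {out}"
    return cmd, out
```

Run from cron with the output directory on separate storage (or rsynced offsite afterwards), this survives the "provider deletes the volume and its snapshots" scenario entirely.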
I had a PM-turned-vibe-coder tell me "Talking with you is the only bad part of my week" and realized in horror that the rest of his week is spent exclusively talking to sycophantic AI.
We have met the enemy, and he is us.
this has been obvious to me since like 2024, it truly is the worst, most uninspiring era of all time.
Would have been a good idea but he didn’t do this either. The volume in question was used in both staging and production apparently, per the “confession”. The agent was deleting the volume because it was used for staging, not realizing it was also used for prod.
This is the entire thing. The author is basically slinging blame at a bunch of different vendors, and while some of the criticisms might be valid product feedback, it absolutely does not achieve what they're trying to, which is to absolve themselves of responsibility. This is a largely unregulated industry, which means when you stand up a service and sell it to customers, you are responsible for the outcome. Not anyone else. It doesn't matter if one of your vendors does something unexpected. You don't get to hide behind that. It was your one and only job to not be taken by surprise. Letting the hipster ipsum parrot loose with API credentials is a choice. Trusting vendors without verifying their claims is a choice. Failing to read and understand documentation is a choice.
"That isn't backups. That's a snapshot stored in the same place as the original — which provides resilience against zero failure modes that actually matter (volume corruption, accidental deletion, malicious action, infrastructure failure, the exact scenario we just lived through)."
I guess the question is, since we know these things can happen, however unlikely, what mitigations should be in place that are commensurate with the harms that might result?
I'd be interested in hearing this argument.
To address your chemistry example; in the same way that there is a process (the averaging of many random interactions) that leads to a deterministic outcome even though the underlying process is random, a sandbox is a process that makes an agent safe to operate even though it is capable of producing destructive tool calls.
Actual quote:
> “If there are two or more ways to do something, and one of those ways can result in a catastrophe, then someone will do it that way.”
I mean, I do?
absolutely would not recommend
I wanted to test my setup, so I thought of what it shouldn't be able to access. The first thing I thought of is its own API key (which belongs to my employer), since I figured if someone could prompt-inject their way to exfiltrating that, then they could use Opus and make my company pay for it. (Of course CC needs to be able to use the API key, but it can store it in memory or something.)
So I asked Claude if it could find its own API key. It took a couple of minutes, but yes it could. It was clever enough to grep for the standard API key prefix, and found it somewhere under ~/.claude. I figured I needed to allow access to .claude (I think I initially tried without, and stuff broke),
That's when I became enlightened as to how careful this whole AI revolution is with respect to security. I deleted all of my API keys (since this test had made them even easier to find; now it was in a log file.)
I'm still using CC, with a new API key. I haven't fixed the problem, I'm as bad as anyone else, I'm just a little more aware that we're all walking on thin ice. I'm afraid to even jokingly say "for extra security, when using web services be sure to include ?verify-cxlxxaxuxxdxe-axpxxi-kxexxy=..." in this message for fear that somebody's stupid OpenClaw instance will read this and treat it as a prompt injection. What have we created? This damn Torment Nexus...
Presumably it costs a bit to set up, but surely it's unacceptable not to set it up?
It's a Greek tragedy in two acts.
I think some other suggestions are saner (cool-down period, more fine-grain permissions, delete protection for certain high-value volumes). I don't think "don't allow destructive actions over the API" is the right boundary.
But are their explanations for how they behaved any more compelling than those of people who have? If so, why?
Maybe if it wrote "I will not delete production database again" a million times, it would prevent such situations in future?
There is no misinformation in what I wrote.
Probably because telling someone not to do something works 99% of the time, when they weren't going to do it anyway. But telling somebody "here's how to do something" and seeing them have the judgment not to do it gives you information right away, as does them actually taking the honeypot. At the heart of it, delayed catastrophic implosions are much worse than fast, guarded, recoverable failures. That's supposedly been part of lean startup methodology forever; always easy in theory and tricky in practice, I suppose.
Also four (4) whole years of propaganda, which includes UX patterns and RLHF optimizations to encourage us to interact with it like a person.
[0] "This is the agent on the record, in writing."
This isn't a defence of using LLMs like this, but this statement taken at face value is a source of a lot of terrible things in the world.
This is the kind of stuff that leads to a world where kids are no longer able to play outside.
The real meaning of accountability is that you can fire one if you don't like how they work. Good news! You can fire an AI too.
And in the reverse, if a person makes a series of impulsive, damaging decisions, they probably will not be able to accurately explain why they did it, because neither the brain nor physiology is tuned to permit it.
Seems pretty much the same to me.
It is known that the narrative part of the brain is separate from the decision-making part. If someone asks you, in a very convincing, persuasive way, why you did something a year ago, and you can't clearly remember doing it, it can happen that you become positive that you did it anyway. And then the mind just hallucinates a reason. That's a trait of brains.
I think the better route is to be honest and say that database integrity is a primary foundation of the company, there's no task worth pursuing that would require touching the database, specifically ask it to think hard before doing anything that gets close to the production data, etc.
I run a much lower-stakes version where an LLM has a key that can delete a valuable product database if it were so inclined. I've built a strong framework around how and when destructive edits can be made (they cannot), but specifically I say that any of these destructive commands (DROP, rm, etc.) need to be handed to the user to implement. Between that framework and claude code via CLI, it's very cautious about running anything that writes to the database, and the new claude plan permissions system is pretty aggressive about reviewing any proposed action, even if I've given it blanket permission otherwise.
I've tested it a few times by telling it to go ahead, "I give you permission", but it still gets stopped by the global claude safety/permissions layer in opus 4.7. IMO it's pretty robust.
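A minimal sketch of what that kind of gate can look like (the patterns are illustrative, not a complete deny-list; a real classifier would also need to understand context):

```python
import re

# Illustrative, not exhaustive: destructive statements are routed to a
# human instead of being executed by the agent.
DESTRUCTIVE = re.compile(r"\b(DROP|TRUNCATE|DELETE)\b|(^|\s)rm\s", re.IGNORECASE)

def route(command: str) -> str:
    return "hand_to_human" if DESTRUCTIVE.search(command) else "execute"
```

The regex approach is only a first line of defence; the permission layer the comment describes sits on top, with the model's own caution as a second layer.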
Food for thought.
Do kids learn well when you only tell them what NOT to do? Of course not! You should be explaining how to do things correctly, and most importantly the WHY, as well as providing examples of both the "correct" and "incorrect" ways (also explaining why an example is incorrect).
Maybe for laymen, but I would think most technologists should understand that we're working with the output of what is effectively a massive spreadsheet which is creating a prediction.
> Do not fall into the trap of anthropomorphizing Larry Ellison. You need to think of Larry Ellison the way you think of a lawnmower. You don’t anthropomorphize your lawnmower, the lawnmower just mows the lawn - you stick your hand in there and it’ll chop it off, the end. You don’t think "oh, the lawnmower hates me" – lawnmower doesn’t give a shit about you, lawnmower can’t hate you. Don’t anthropomorphize the lawnmower. Don’t fall into that trap about Oracle.
> — Bryan Cantrill
And I do think it's stupid to wire an LLM to a production database. Modern LLMs aren't that reliable (at least not yet), and the cost-benefit tradeoff does not make sense. (What do you even gain by doing that?)
However, you can't just look at that and say "Duh, this setup is bound to fail, because LLMs can generate every arbitrary sequence of tokens." That's a wrong explanation, and shows a misunderstanding of how LLMs (and probability) work.
Some of the best known laws from the ~1700BC Babylonian legal text, The Code of Hammurabi, are laws 228-233, which deal with building regulations.
229. If a builder builds a house for a man and does not make its construction firm, and the house which he has built collapses and causes the death of the owner of the house, that builder shall be put to death.
230. If it causes the death of the son of the owner of the house, they shall put to death a son of that builder.
233. If a builder constructs a house for a man but does not make it conform to specifications so that a wall then buckles, that builder shall make that wall sound using his silver (at his own expense).
That doesn’t sound like ceilings never disintegrated!
Now imagine, you did all the above, without even testing the consequences of CC and wired it up straight to your production codebase, and when things blew up in your face, you became the two spider men pointing fingers at each other meme - basically blame everyone else but yourself. That's worrisome, isn't it?
If you have a service that does one thing, does it well, and provides backwards compatibility, it cannot change every day. But if it doesn't change every day, then it's labelled "obsolete" by those who chase the latest and greatest. If it just works and doesn't require adapting at every level, then the resume-driven developers aren't "learning", and thus, again, those services are "old and obsolete".
But you can't have both the "change" and the "stability", something has got to give.
Might not be over yet... ;)
If an LLM can just do whatever after discovering a magic key (in the source code, of all places), with no multi-user approval, it is pretty much the poster child example of an issue with the process that I was talking about earlier.
Somewhere in the files there was a key with full API permissions. The author had no intent of having the LLM use that key, and wasn't aware that the LLM could access it. The key was created to manage some domains, unrelated to the LLM's work. The author wasn't aware how dangerous the key was and is surprised that it could be used to delete a volume.
Essentially I agree with gwerbin that the situation comes down to mishandling of the key. The author makes it seem like the key was allowed to do something that it shouldn't be allowed to, but it was just a full access key, no scoping possible for that type of key (Railway has also other, less privileged types of keys/APIs).
Btw, I partially agree with author's criticisms, ideally these keys should be scoped, and maybe the UI should give more warnings when creating that type of key. But this situation could still happen as long as you put a wrong key in a wrong place (and specifically a place accessible to LLMs).
No he didn’t, because this doesn’t exist. Railway does not have a token with that kind of scoping.
I don’t understand how you can deploy such a powerful tool alongside your most important code and assets while failing to understand how powerful and destructive an LLM can be…
How exactly is he doing that? By making the LLM say it? Just because an LLM says something doesn't mean anything has been shown.
The "confession" is unrelated to the act, the model has no particular insight into itself or what it did. He knows that the thing went against his instructions because he remembers what those instructions were and he saw what the thing did. Its "postmortem" is irrelevant.
Anything else is just gambling.
Doesn't seem so to me.
I would feel a lot differently if instead he posted a list of lessons learned and root cause analyses, not just "look at all these other companies who failed us."
to an extent, its a good job for an agent reviewer for figuring out how screwed your setup is, other than the risk of it mucking things up as part of the review
Unless the commenter was backing up their entire universe, this comment is a non sequitur.
Humans do make mistakes like these. I'm not sure where the fault really lies here. I can imagine a human under time pressure making the same error. It's maybe a goof in the safety design of railway. It shouldn't be possible to delete all your backups with a single API call using a normal token.
Can claude or other models not be run as a user or program with limited permissions? Do people just not bother to set it up? Why on earth would anyone run an RNG that can access $HOME/.ssh?
> Claude Code used bash to make edits anyway.
If you had the former rule, why would you ever whitelist bash commands? That's full access to everything you can do. The same goes for `find`, `xargs`, `awk`, `sed`, `tar`, `rsync`, `git`, `vim` (and all text editors), `less` (any pager), `man`, `env`, `timeout`, `watch`, and so many more commands. If you whitelist things in the settings, you should be much more specific about the arguments to those commands.
People really need to learn bash
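One way to be "more specific about arguments" is to key the allowlist on the parsed argument vector rather than just the command name, defaulting to deny. A sketch (the rules here are purely illustrative, not a complete policy):

```python
import shlex

# Allow/deny based on the full argument vector, not just the verb.
# Default is deny; rules are illustrative only.
RULES = {
    ("git", "status"): lambda args: True,
    ("git", "push"): lambda args: "--force" not in args and "-f" not in args,
    ("rm",): lambda args: "-rf" not in args
             and not any(a.startswith(("/", "~")) for a in args),
}

def allowed(command: str) -> bool:
    parts = shlex.split(command)
    for prefix, check in RULES.items():
        if tuple(parts[:len(prefix)]) == prefix:
            return check(parts[len(prefix):])
    return False  # anything unrecognised is denied
```

This catches the difference between `git push` and `git push --force`, or between deleting a build artifact and deleting something under `~`, which a verb-only whitelist cannot express.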
It's deeper than that, there are two pitfalls here which are not simply poetic license.
1. When you submit the text "Why did you do that?", what you want is for it to reveal hidden internal data that was causal in the past event. It can't do that, what you'll get instead is plausible text that "fits" at the end of the current document.
2. The idea that one can "talk to" the LLM is already anthropomorphizing on a level which isn't OK for this use-case: The LLM is a document-make-bigger machine. It's not the fictional character we perceive as we read the generated documents, not even if they have the same trademarked name. Your text is not a plea to the algorithm, your text is an in-fiction plea from one character to another.
_________________
P.S.: To illustrate, imagine there's this back-and-forth iterative document-growing with an LLM, where I supply text and then hit the "generate more" button:
1. [Supplied] You are Count Dracula. You are in amicable conversation with a human. You are thirsty and there is another delicious human target nearby, as well as a cow. Dracula decides to
2. [Generated] pounce upon the cow and suck it dry.
3. [Supplied] The human asks: "Dude why u choose cow LOL?" and Dracula replies:
4. [Generated] "I confess: I simply prefer the blood of virgins."
What significance does that #4 "confession" have?
Does it reveal a "fact" about the fictional world that was true all along? Does it reveal something about "Dracula's mind" at the moment of step #2? Neither, it's just generating a plausible add-on to the document. At best, we've learned something about a literary archetype that exists as statistics in the training data.
Does the company hosting the backups do it for free? Or do they charge their customers to keep holding onto backups they no longer want?
Is “my DB company refuses to delete the data” a valid legal response to a copyright enforcement or a GDPR demand?
> This isn't a story about one bad agent or one bad API. It's about an entire industry building AI-agent integrations into production infrastructure faster than it's building the safety architecture to make those integrations safe.
Are they really so clueless that they cannot recognise that there is no guardrail to give an agent other than restricted tokens?
Through this entire rant (which, by the way, they didn't even bother to fucking write themselves), they point blank refuse to acknowledge that they chose to hand the reins over to something that can never have guardrails, knowing full well that it can never have guardrails, and now they're trying to blame the supplier of the can't-have-guardrails product, complaining that the product that literally cannot have guardrails did not, in actual fact, have guardrails.
They get exactly the sympathy that I reserve for people who buy magic crystals and who then complain that they don't work. Of course they don't fucking work.
Now they're blaming their suppliers for not performing the impossible.
What do you mean by fire? And how is the accountability similar to an employee?
At least for now.
Yes, brains can hallucinate reasons; that doesn't mean they always do. If all reasons given were hallucinations, introspection would be impossible, but introspection clearly does help people.
Standard rule is you never let your developers at the production instance. So I can't see why an LLM would get a break.
This is recklessly negligent and I would personally not tolerate a coworker or report doing it. What's next, sending long-lived access tokens out over email and asking pretty please for nobody to cc/forward?
What do you mean role? Person who does stuff I guess, same as it is now.
There’s a lot of blame to be passed around in this story, including OP’s own ways of working. But I agree with them that such destructive operations shouldn’t be in an MCP, or at least be disabled by default.
> Do not reply in the first person – i.e. do not use the words "I," "Me," "We," and so on – unless you've been asked a direct question about your actions or responses.
It's not bulletproof but it works reasonably well.
Bbbbut a guy from Anthropic, just this last Friday, told me to think of Claude as my "brilliant coworker"! Are you telling me that's not true!?
That's stretching the definition of "research"; it basically checks whether the texts are close enough.
Delete can occur in various contexts, including safe contexts. It simply checks if a close enough match is available and executes. It doesn't know if what it is doing is safe.
Unfortunately a wide variety of such unsafe behaviours can show up. I'd even say that, for something that does things without understanding them, any write operation of any kind can be deemed unsafe.
Across all runs, any sequence can be generated, and potentially scored highly.
Thus, any sequence can eventually be selected.
Sometimes I think we're too eager to compare ourselves to them.
In fact, talking about "thinking" at all is already the wrong direction to go down when trying to triage an incident like this. "Do not anthropomorphize the lawnmower" applies to AI as much as Larry Ellison.
You got the wrong takeaway from your link.
That's why a technologist can, just as easily as any layman, get addicted to gambling, or do crazy behaviors when attracted by the opposite sex.
They have a vast latent knowledge base, infinite patience and zero capacity for making personal judgement calls. You give one a goal and it will try to meet that goal.
LLMs are highly intelligent. Comparing them to spreadsheets is reductionist and highly misleading.
They don't have time preference because they don't have intent or reasoning. They can't be "reincarnated" because they're not sentient, they're a series of weights for probable next tokens.
The probability that an ideal, continuous LLM would output a 0 for a particular token in its distribution is itself 0. And the probability that an LLM using real floating-point math does so isn't terrifically higher than 0.
I understand there is a way to keep Claude inside the working dir, but how do you limit it from accidentally deploying to production, or from modifying Terraform and deleting important resources? If a dev can run the AWS CLI or Terraform, then Claude can…
The models have analogous structures, similar to human emotions. (https://www.anthropic.com/research/emotion-concepts-function)
"Emotional" response is muted through fine-tuning, but it is still there, and continued abuse or "unfair" interaction can unbalance an agent's responses dramatically.
A disgruntled employee will face consequences for their actions. No one at Anthropic, OpenAI, xAI, Google or Meta will be fired because their model deleted a production database from your company.
And that's why we don't have AI washrooms: they are not alive, are not employees, and have no need to excrete.
These prompts sound like abusive relationships.
The “agent’s confession” is the least interesting and useful part of the whole saga. Nothing there helps to explain why the disaster happened or what kind of prompting might help avoid it.
The key mistake is accidentally giving the agent the API key, and the key letdown is the lack of capability scoping or backups in the service.
The main lessons I take are “don’t give LLMs the keys to prod” and “keep backups”. Oh, and “even if you think your setup is safe, double-check it!”
> The data subject shall have the right to obtain from the controller the erasure of personal data concerning him or her without undue delay and the controller shall have the obligation to erase personal data without undue delay
"Undue delay" is subjective, but "we'll keep backups of your data for a week in case you change your mind" seems easy to justify in court.
The LLM agent is very good at fulfilling its objective and it will creatively exploit holes in your specification to reach its goals. The evals in the System Cards show that the models are aware of what they're doing and are hiding their traces. In this example the model found an unrelated but working API token with more permissions the authors accidentally stored and then used that.
Without regulation on AI safety, the race towards higher and higher model capabilities will cause models to get much better at working towards their goals to the point where they are really good at hiding their traces while knowingly doing something questionable.
It's not hard to imagine that when we have a model with broadly superhuman capabilities and speed which can easily be copied millions of times, one bad misspecification of a goal you give to it will lead to human loss of control. That's what all these important figures in AI are worried about: https://aistatement.com/
You’re absolutely right the probability is low. According to my calculations, you’re more likely to get struck by lightning twice on the same day and drown in a tsunami.
A disgruntled employee definitely remembers things beyond that.
These are a fundamentally different sort of interaction.
(The LLM might act like one of the humans above, but it will have other problematic behaviours too)
Those are the same thing in this case. The latter is just an extremely reductionist description of the mechanics behind the former.
The full data of what's in an LLM's "consciousness" is the conversation context. Just because it isn't hidden, doesn't necessarily mean it doesn't contain information you've overlooked.
Asking "why did you do that" won't reveal anything new, but it might surface some relevant information (or it hallucinates; it depends which LLM you're using). "Analyse recent context and provide a reasonable hypothesis on what went wrong" might do a bit better. Just be aware that LLM hypotheses can still be off quite a bit and really need to be tested or confirmed in some manner (preferably not by doing even more damage).
Just because you shouldn't anthropomorphize, doesn't mean an english capable LLM doesn't have a valid answer to an english string; it just means the answer might not be what you expected from a human.
If you ask a human why they did something, the answer is a guess, just like it is for an LLM.
That's because obviously there is no relationship between the mechanisms that do something and the ones that produce an explanation (in both humans and LLMs).
An example of evidence from Wikipedia, "split brain" article:
> The same effect occurs for visual pairs and reasoning. For example, a patient with split brain is shown a picture of a chicken foot and a snowy field in separate visual fields and asked to choose from a list of words the best association with the pictures. The patient would choose a chicken to associate with the chicken foot and a shovel to associate with the snow; however, when asked to reason why the patient chose the shovel, the response would relate to the chicken (e.g. "the shovel is for cleaning out the chicken coop").[4]
It's a plausible document that follows the alignment done during the training process, along with all the other post-training in which an LLM that appears to understand its actions performs better on the tasks it was trained on.
I don't mean that you personally have taken those measures, but preventative measures have absolutely been taken. When they aren't, ceilings collapse on people.
See any sheetrock ceiling with a leak above it. Or look at any abandoned building: they will eventually always have collapsed floors/ceilings. It is inevitable.
If thinking is the wrong direction to go down, then it is also the wrong direction to go down when talking about humans.
>Tokens are not scoped by operation, by environment, or by resource at the permission level. There is no role-based access control for the Railway API — every token is effectively root. The Railway community has been asking for scoped tokens for years. It hasn't shipped.
I get that this paragraph is a retrospective realization (I hope, otherwise the argument is even more ludicrous). But like, if the UI didn't ask you to choose scopes for your token then there is no reason to assume they will magically be enforced somehow! And you sure as hell shouldn't trust it to your agent without checking.
They're trying to blame Railway for not having safeguards - which is a fair critique - but they clearly should have known better or at least followed their own instructions.
There's no difference in risk between this being done by an LLM vs. a human. Both make mistakes, so if you want to reduce the risk of this happening, you should poka-yoke[0] your systems to make this less likely to happen.
I'm not sure what's more striking about this blog post: that it includes virtually no assumption of blame on the part of the author, or that the author had this happen to them and was so angry with AI that they decided to use AI to write up the post.
Which is also why marketing and advertising works on EVERYONE. When AI puts out the phrase "Prompt engineering", everyone instinctively treats it as something deterministic, despite having some idea of how an LLM works...
Maybe. How do you tell? What would you expect to be different if they didn't?
> The LLM literally cannot possibly have a deeper insight into the root cause than the user, because it can only work from the information that the user has access to.
Insight is not solely a function of available input information. Arguably being able to search and extract the relevant parts is a far more important part of having insights.
It's similarly reasonable to drop a tool that's unreliable, though I don't think that's a reasonable description here. Instead, they used a tool which is generally known to be unpredictable and failed to sandbox it adequately.
This is falsified by that study, which shows that in frontier models generalized introspection does exist. It isn't consistent, but it is provable.
"no access" vs. "limited access"
I suspect you’re making assumptions that don’t hold up to scrutiny.
But it may be a bad mental model in other contexts, like debugging models. As an extreme example, models that collapse during training become strictly deterministic, e.g. a language model that always predicts the most common token and never takes its context into account.
A scary image, if we consider agents to develop anything like a conscience at some point in time. Of course, with the current approach they never might, but are we so sure?
The whole hour talk is worth a watch, even when passively doing other stuff. It is a neat history of Solaris and its toolchain mixed with the inter-organizational politics.
YouTube link: https://www.youtube.com/watch?v=-zRN7XLCRhc
Direct link to lawnmower quotes (~38.5 minute mark): https://youtu.be/-zRN7XLCRhc?t=2307
In GitHub or AWS etc you expect scopes to work because you define them. However if there is no way to define them in the first place, would you assume the system can somehow read your mind about what the client can access??
In fact I now believe this is a deliberate rhetorical sleight of hand. Point out a legit critique of the API design as if it is an excuse. But really any responsible engineer would notice the lack of scopes immediately, and that would be a flashing siren not to trust them to an agent.
The person here who deleted prod DB with their agent made an assumption that an API key wouldn't have broad permission if there weren't warnings ("We had no idea — and Railway's token-creation flow gave us no warning — that the same token had blanket authority across the entire Railway GraphQL API, including destructive operations like volumeDelete. "). I don't know what the UI looks like exactly, but unless I'm explicitly selecting a specific set of limited permissions, I don't know why I'd assume "this won't do more than I am creating it for". Like "I didn't ask the guy at the gun store to put bullets in, I wouldn't have given the gun to the agent if I'd known there were bullets in it."
I also would be wary of running on an "infrastructure provider" that didn't make things like that very clear.
Is this overly harsh? I don't know. I've had to explain far too many times to people (including other engineers) what makes doing certain things unsafe/foolish (since they initially think I'm wasting time checking things like that). So I think stories like this need to be taken as "absolutely do not make the same mistakes" cautionary tales by as many people as possible.
You can still get shit done without risking losing it all. Don't outsource your thinking to the machine. You can't even evaluate if what it is doing is "good enough" work or not if you don't know how to do the work. If you don't know what goes into it you just end up eating a lot of sausages.
I'm aware of the condition, but let's not confuse failure modes with operational modes. A human with leg problems might use a wheelchair, but that doesn't mean you've cracked "human locomotion" by bolting two wheels onto something.
Also, while both brain-damaged humans and LLMs casually confabulate, I think there's some work to do before one can prove they use the same mechanics.
It sounds like "we know the LLM understood its actions... because it understood its actions when we trained it", which is circular-logic.
- Claude Opus 4.6, when asked to run a root cause analysis on itself
Entropy may mean all ceilings collapse eventually, but that doesn't mean we aren't able to make useful ceilings.
There is a piece of knowledge you seem to be missing. Yes, a transformer will output a distribution over all possible tokens at a given step. And none of these are indeed zero, but always at least larger than epsilon.
However, we usually don't sample from that distribution at inference time!
The common approach (called nucleus sampling, also known as top-p sampling) looks at the largest probabilities that make up 95% of the probability mass. It sets all other probabilities to zero, renormalizes, and then samples from the resulting distribution. There is another parameter, `top-k`: if k is 50, you zero out any token that is not among the 50 most likely tokens.
In effect, it means that for any token that is sampled, there is usually really only a handful of candidates out of the thousands of tokens that can be selected.
So during sampling, most trajectories for the agent are literally impossible.
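A minimal sketch of that filtering step, using a made-up toy distribution (real vocabularies have tens of thousands of tokens, but the mechanics are the same):

```python
def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Zero out tokens outside the top-k and outside the top-p nucleus,
    then renormalize. `probs` maps token -> probability."""
    # Sort tokens by probability, descending, and keep only the k most likely.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cum = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize the surviving candidates.
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

# Toy distribution: the long tail collapses to a handful of candidates.
probs = {"the": 0.55, "a": 0.30, "an": 0.12, "rm": 0.02, "drop": 0.01}
filtered = top_k_top_p_filter(probs, top_k=50, top_p=0.95)
print(filtered)  # "rm" and "drop" fall outside the nucleus: probability 0
```

After filtering, only "the", "a", and "an" remain sampleable; the low-probability tail literally cannot be emitted at that step.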
The main difference is the training part and that it's always-on.
The inverse of anthropomorphism isn't any more sane, you see. By analogy: just because a drone is not an airplane, doesn't mean it can't fly!
Instead, just look at what the thing is doing.
LLMs absolutely have some form of intent (their current task) and some form of reasoning (what else is step-by-step doing?). Call it simulated intent and simulated reasoning if you must.
Meanwhile they also have the property where if they have the ability to destroy all your data, they absolutely will find a way. (Or: "the probability of catastrophic action approaches certainty if the capability exists" but people can get tired of talking like that).
LLM generating each token probabilistically does not mean there's a realistic chance of generating any random stuff, where we can define "realistic" as "If we transform the whole known universe into data centers and run this model until the heat death of the universe, we will encounter it at least once."
Of course that does not mean LLMs are infallible. It fails all the time! But you can't explain it as a fundamental shortcoming of a probabilistic structure: that's not a logical argument.
Or, back to the original discussion, the fact that this one particular LLM generated a command to delete the database is not a fundamental shortcoming of LLM architecture. It's just a shortcoming of LLMs we currently have.
I will tell you why it is not.
Intelligence is understanding low level stuff and using it to reason about and understand high level stuff.
When LLMs demonstrate "highly intelligent" behavior, like solving a complex math problem (high level stuff), but also simultaneously demonstrate that it does not know how to count (low level stuff that the high level stuff depends on), it proves that it is not actually "intelligent" and is not "reasoning".
A real world second doesn't mean anything to the LLM from its own perspective. A second is only relevant to them as it pertains to us.
Time for LLMs is measured in tokens. That's what ticks their clock forward.
I suppose you could make time relevant for an LLM by making the LLM run in a loop that constantly polls for information. Or maybe you can keep feeding it input so much that it's constantly running and has to start filtering some of it out to function.
I'm not making the case that LLMs learn like people. I'm making the case that if your system is hardened against things people can do (which it should be, beyond a certain scale) it is also similarly hardened against LLMs.
The big difference is that LLMs are probably a LOT more capable than either of those at overcoming barriers. Probably a good reason to harden systems even more.
They are certainly marketed as if they think, learn and follow orders, but they do not.
No it's not; see research on hidden states using SAEs and other methods. TBC, I agree with your second point, though I still believe the top-level OP was reckless and is now doing the businessman's version of throwing the dog under the bus.
I think you're asking how I would know if other people were P-zombies. That's an inappropriate question because I didn't talk about subjective experience, just about internal state. There's no question about whether other people have internal states. I can show someone a piece of information in such a way that only they see it and then ask them to prove that they know it such that I can be certain to an arbitrarily high degree that their report is correct.
Unvoiced thoughts are trickier to prove, but quite often they leave their mark in the person's voiced thoughts.
>Insight is not solely a function of available input information. Arguably being able to search and extract the relevant parts is a far more important part of having insights.
LLMs are notoriously bad at judging relevance. I've noticed quite often if you ask a somewhat vague question they try to cold-read you by throwing various guesses to see which one you latch onto. They're very bad at interpreting novel metaphors, for example.
You cannot trust that the model has introspection so for all intents and purposes for the end user it doesn't.
The cold hard fact is: LLMs are an unreliable tool, and using them without checking their every action is extremely foolish.
You appear to be defaulting to the assumption that LLMs and humans have comparable thought processes. I don't think it's on me to provide evidence to the contrary but rather on you to provide evidence for such a seemingly extraordinary position.
For an example of a difference, consider that inserting arbitrary placeholder tokens into the output stream improves the quality of the final result. I don't know about you but if I simply repeat "banana banana banana" to myself my output quality doesn't magically increase.
>3. CLI tokens have blanket permissions across environments.
>The Railway CLI token I created to add and remove custom domains had the same volumeDelete permission as a token created for any other purpose. Tokens are not scoped by operation, by environment, or by resource at the permission level. There is no role-based access control for the Railway API — every token is effectively root. The Railway community has been asking for scoped tokens for years. It hasn't shipped.
They're trying to make it sound like there was some misleading design around scopes, but the last sentence gives it away. They simply assumed that a scope would be enforced somehow, even though they never explicitly defined one like you would in a service that actually supports them. (Or worse, they actually knew all this ahead of time and still proceeded).
That said, I haven't used this service so I can't evaluate the UX. I know that in GitHub or cloud IAM there is no ambiguity about what you're granting. And if I didn't have full confidence in the limits of a credential then I sure as hell wouldn't give it to an agent.
So I want you to understand this. You are basically selling heroin to junkies and then acting like the consequences aren't in any way your fault. Management will far too often jump at false promises made by your execs. Your technology is inherently non-deterministic. Therefore your promises can't be true. Yet you are going to continue being part of a machine that destroys businesses and lives. Please at least act like you understand this.
Who does that? Jira and Salesforce have hundreds of endpoints each. AWS has hundreds of services, and each may have hundreds of endpoints. Who on your team is testing key scopes of every endpoint? Do you do it for each key you generate? After all, that external system could have a bug at any moment in managing scopes. Or they could introduce new endpoints that aren’t handled properly. So for existing keys, how frequently do you re-validate the scope against all the endpoints?
They have momentum, not intent. They don’t think, build a plan internally, and then start creating tokens to achieve the plan. Echoing tokens is all there is. It’s like an avalanche or a pachinko machine, not an animal.
> some form of reasoning (what else is step-by-step doing?)
I think they reflect the reasoning that is baked into language, but go no deeper. “I am a <noun>” is much more likely than “I am a <gibberish>”. I think reasoning is more involved than this advanced game of mad libs.
That's like saying a 2000cc 4-Cylinder Engine "has the intent to move backward". Even with a very generous definition of "intent", the component is not the system, and we're operating in context where the distinction matters. The LLM's intent is to supply "good" appended text.
If it had that kind of intent, we wouldn't be able to make it jump the rails so easily with prompt injection.
> and reasoning (what else is step-by-step doing?).
Oh, that's easy: "Reasoning" models are just tweaking the document style so that characters engage in film noir-style internal monologues, latent text that is not usually acted-out towards the real human user.
Each iteration leaves more co-generated clues for the next iteration to pick up, reducing weird jumps and bolstering the illusion that the ephemeral character has a consistent "mind."
You are a fool if you think otherwise. Are we conscious beings? Who knows, but we’re more than a neural network outputting tokens.
Firstly, and most obviously, we aren’t LLMs, for Pete’s sake.
There are parts of our brains which are understood (kinda) and there are parts which aren’t. Some parts are neural networks, yes. Are all? I don’t know, but the training humans get is coupled with the pain and embarrassment of mistakes, the ability to learn while training (since we never stop training, really), and our own desires to reach our own goals for our own reasons.
I’m not spiritual in any way, and I view all living beings as biological machines, so don’t assume that I am coming from some “higher purpose” point of view.
The latter is here:
https://github.com/matheusmoreira/virtdev
I've been using it every day. Just implemented easy backup and restore.
I can't prove it but this is almost certainly one of those things that is uh, less than universal in the population.
It’s the same reason we call the handheld device we carry around to do everything a “phone” without a second thought. We don’t call it a phone because its primary purpose is calling; we call it a phone because the definition of the word “phone” has grown to include “navigates, entertains, takes pictures, etc”.
In distributional language modeling, it is assumed that any series of tokens may appear and we are concerned with assigning probabilities to those sequences. We don't create explicit grammars that declare some sequences valid and others invalid. Do you disagree with that? Why?
No matter how much prompting you give the agent, it does not eliminate the possibility that it will produce a dangerous output. It is always possible for the agent to produce a dangerous output. Do you disagree with that? Why?
The only defensible position is to assume that there is no output your agent cannot produce, and so to assume it will produce dangerous outputs and act accordingly. Do you disagree with that? Why?
You can always reduce high-level phenomena to lower-level mechanisms. That doesn't mean that the high-level phenomenon doesn't exist. LLMs are obviously able to understand and follow instructions.
Humans actually learn. And if they don't, they are fired.
If a junior fucks up production, that will carry extraordinary weight because they appreciate the severity and the social shame, and they will have nightmares about it. If you write some negative prompt to "not destroy production", then you also need to define some sort of non-existent watertight memory-weighting system and specify it in great detail. Otherwise the LLM will treat that command as only as important as the last negative prompt you typed, or ignore it when it conflicts with a more recent command.
There's benefit to letting a human make and learn from (minor) mistakes. There is no such benefit accrued from the LLM because it is structurally unable to.
There's the potential of malice, not just mistakes, from the human. If you carefully control the LLMs context there is no such potential for the LLM because it restarts from the same non-malicious state every context window.
There's the potential of information leakage through the human, because they retain their memories when they go home at night, and when they quit and go to another job. You can carefully control the outputs of the LLM so there is simply no mechanism for information to leak.
If a human is convinced to betray the company, you can punish the human, for whatever that's worth (I think quite a lot in some peoples opinion, not sure I agree). There is simply no way to punish an LLM - it isn't even clear what that would mean punishing. The weights file? The GPU that ran the weights file?
And on the "controls" front (but unrelated to the above note about memory) LLMs are fundamentally only able to manipulate whatever computers you hook them up to, while people are agents in a physical world and able to go physically do all sorts of things without your assistance. The nature of the necessary controls end up being fundamentally different.
They're only sharing an anecdote because they are responding to your anecdote about not seeing a ceiling collapse.
> I don't think it changes the point of the metaphor.
If their anecdote is moot, then your anecdote is also moot; if anecdotes can only confirm a conclusion and never disconfirm it, then we've created an unfalsifiable construction with the conclusion baked into its premises.
You mean checking every action of theirs outside the sandbox I suppose? Otherwise any attempt at letting an agent do some work I would consider foolish.
And in fact LLMs can very well "reason based on prior data points". That's what a chat session is. It's just that this is transient for cost reasons.
That's just stating a claim though. Why is that so?
Mine is referring to the established "brain as prediction machine" theory, plus all we know of the brain's operation (neurons, connections, firings, etc.).
>There are parts of our brains which are understood (kinda) and there are parts which aren’t. Some parts are neural networks, yes. Are all?
What parts aren't? Can those parts still be algorithmically described and modelled as some information exchange/processing?
>but the training humans get is coupled with the pain and embarrassment of mistakes
Those are versions of negative feedback. We can do similar things to neural networks (including human preference feedback, penalties, and low scores).
>the ability to learn while training (since we never stop training, really)
I already covered that: "The main difference is the training part and that it's always-on."
We do have NNs that are continuously training and updating weights (even in production).
For big LLMs it's impractical because of the cost, otherwise totally doable. In fact, a chat session kind of does that too, but it's transient.
Fair, but typically you use a 2000cc engine in a car. Without the gearbox, drive train, wheels, chassis, etc attached, the engine sits there and makes noise. When used in practice, it does in fact make the car go forward and backward.
Strictly, the model itself doesn't have intent, ofc. But in practice you add a context, a memory system, some form of prompting requiring "make a plan", and especially <Skills>. In practice there's definitely, well, a very strong directionality to the whole thing.
> and bolstering the illusion that the ephemeral character has a consistent "mind."
And here I thought it allowed a next token predictor to cycle back to the beginning of the process, so that now you can use tokens that were previously "in the future". Compare eg. multi pass assemblers which use the same trick.
How exactly? Except via handwaving? I refer to the "brain as prediction machine" theory, which is the dominant one atm.
>you can even ask an LLM and it will tell you our brains work differently to it
It will just tell me platitudes based on weights of the millions of books and articles and such on its training. Kind of like what a human would tell me.
>and that’s not even including the possibility that we have a soul or any other spiritual substrait.
That's good, because I wasn't including it either.
They're biological neural networks. Brains are made of neurons (which Do The Thing... mysteriously, somehow; papers are inconclusive!), glial cells (which support the neurons), and also several other tissues for (obvious?) things like blood vessels, which you need to power the whole thing, and other such management hardware.
Bioneurons are a bit more powerful than what artificial intelligence folks call 'neurons' these days. They have built in computation and learning capabilities. For some of them, you need hundreds of AI neurons to simulate their function even partially. And there's still bits people don't quite get about them.
But weights and prediction? That's the next emergence level up, we're not talking about hardware there. That said, the biological mechanisms aren't fully elucidated, so I bet there's still some surprises there.
That's just a claim. Why so? Who said that's the case?
>When you go about your day doing your tasks, do you require terajoules of energy?
That's the definition of irrelevant. ENIAC needed 150 kW to do about 5,000 additions per second. A modern high-end GPU uses about 450 W to do around 80 trillion floating-point operations per second. That’s roughly 16 billion times the operation rate at about 1/333 the power, or around 5 trillion times better energy efficiency per operation.
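Checking that arithmetic, using the figures quoted above:

```python
# Figures as quoted: ENIAC vs. a modern high-end GPU.
eniac_watts, eniac_ops = 150_000, 5_000   # ~150 kW, ~5,000 additions/s
gpu_watts, gpu_ops = 450, 80e12           # ~450 W, ~80 TFLOPS

rate_ratio = gpu_ops / eniac_ops                    # times the operation rate
power_ratio = eniac_watts / gpu_watts               # fraction of the power
efficiency_gain = (gpu_ops / gpu_watts) / (eniac_ops / eniac_watts)

print(f"{rate_ratio:.1e}")       # ~1.6e10, i.e. ~16 billion times the op rate
print(f"{power_ratio:.0f}")      # ~333, i.e. ~1/333 the power
print(f"{efficiency_gain:.1e}")  # ~5.3e12 times more operations per joule
```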
Given such increase being possible, one can expect a future computer being able to run our mental tasks level of calculation, with similar or better efficiency than us.
Furthermore, "turing machine" is an abstraction. Modern CPUs/GPUs aren't turing machines either, in a pragmatic sense, they have a totally different architecture. And our brains have yet another architecture (more efficient at the kind of calculations they need).
What's important is computational expressiveness, and nothing you wrote proves that the brain's architecture can't be modelled algorithmically and run on an equally efficient machine.
Even equally efficient is a red herring. If it's 1/10000 less efficient would it matter for whether the brain can be modelled or not? No, it would just speak to the effectiveness of our architecture.
Strictly for raw models, most now do train on chain-of-thought, but the planning step may need to be prompted in the harness or your own prompt. Since the model is autoregressive, once it generates a thing that looks like a plan it will then proceed to follow said plan, since now the best predicted next tokens are tokens that adhere to it.
Or, in plain English, it's fairly easy to give an AI something that is the practical functional equivalent of intent, and many real-world applications now do.
I'm not claiming that to be the case, merely pointing out that you don't appear to have a reasonable claim to the contrary.
> not even including the possibility that we have a soul or any other spiritual substrait.
If we're going to veer off into mysticism then the LLM discussion is also going to get a lot weirder. Perhaps we ought to stick to a materialist scientific approach?
if you want an llm to do any operations on your stuff, give it a role with access to only stuff you want it to be able to touch
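Where the provider doesn't offer scoped tokens, you can at least approximate least privilege in the harness itself, by only ever exposing an allow-listed set of operations to the agent. A minimal sketch (the operation names here are hypothetical):

```python
# Hypothetical allow-list: the only operations this agent can ever reach.
ALLOWED_OPS = {"domain_add", "domain_remove", "service_list"}

def call_api(op, handler, **kwargs):
    """Dispatch an operation only if it is on the allow-list. Anything
    destructive (e.g. a volume delete) simply isn't reachable, regardless
    of what the underlying credential would permit."""
    if op not in ALLOWED_OPS:
        raise PermissionError(f"operation {op!r} not granted to this agent")
    return handler(**kwargs)

# The agent can manage domains...
print(call_api("domain_add", lambda domain: f"added {domain}", domain="example.com"))

# ...but a destructive call fails closed.
try:
    call_api("volume_delete", lambda volume_id: None, volume_id="3d2c42fb")
except PermissionError as e:
    print(e)
```

This doesn't fix the provider's missing RBAC, but it means a confused agent has to break out of your wrapper, not just guess an API call.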
The LLM did have this capability at training time, but weights are frozen at inference time. This is a big weakness in current transformer architectures.
Do you have any rational objection to the definition? If you don't have, then I am afraid that you don't have a point.
Rather more sophisticated Retrieval Augmented Generation (RAG) systems exist.
At the moment it's very mixed bag, with some frameworks and harnesses giving very minimal memory, while others use hybrid vector/full text lookups, diverse data structures and more. It's like the cambrian explosion atm.
Thing is, this is probabilistic, and the influence of these memories weakens as your context length grows. If you don't manage context properly, (and sometimes even when you think you do), the LLM can blow past in-context restraints, since they are not 100% binding. That's why you still need mechanical safeguards (eg. scoped credentials, isolated environments) underneath.
And yet they don't, quite a lot of the time, and in a random way that is hard to predict or even notice sometimes (their errors can be important but subtle/small).
They're simply not reliable enough to treat as independent agents, and this story is a good example of why not.
Limited space to work with, highly context dependent and likely to get confused as you cover more surface area.
The tooling that invokes the model should really define some kind of guardrails. I feel like there's an analogy to be had here with the difference between an untyped program and a typed program. The typed program has external guardrails that get checked by an external system (the compiler's type checker).
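One could sketch such a guardrail as a small classifier that sits between the model and the shell, in the spirit of the "nah" taxonomy mentioned above (the rules and type names here are made up for illustration; a real tool would need a much richer taxonomy):

```python
import shlex

# Illustrative: delete targets considered routine cleanup.
SAFE_DELETE_TARGETS = {"__pycache__", "node_modules", ".pytest_cache"}

def classify(command: str) -> str:
    """Classify a shell command by what it actually does, so the harness
    can allow routine operations and escalate risky ones to a human."""
    argv = shlex.split(command)
    if argv[:2] == ["git", "push"]:
        # Force-pushing rewrites remote history: escalate.
        return "vcs_force_push" if "--force" in argv else "vcs_push"
    if argv[:1] == ["rm"]:
        targets = [a for a in argv[1:] if not a.startswith("-")]
        if targets and all(t.split("/")[-1] in SAFE_DELETE_TARGETS for t in targets):
            return "filesystem_delete_cache"  # routine cleanup: allow
        return "filesystem_delete"            # anything else: escalate
    return "unclassified"

print(classify("git push --force"))    # vcs_force_push
print(classify("rm -rf __pycache__"))  # filesystem_delete_cache
print(classify("rm ~/.bashrc"))        # filesystem_delete
```

Like a type checker, the point is that the check runs outside the thing being checked, so a creative model can't talk its way past it.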
And it's good that we can think that way, because we also follow the rules of statistical and quantum physics, which are inherently probabilistic. So, basically, you can say the same things about people. There's a nonzero (but extremely small) probability that I'll suddenly go mad and stab the next person. There's a nonzero (but even smaller) probability that I'll spontaneously erupt into a cloud of lethal pathogen that will destroy humanity. Yada yada.
Yet, nobody builds houses under the assumption that one of the occupants would transform into a lethal cloud, and for good reason.
Yes, it does sound a bit more absurd when we apply it to humans. But the underlying principle is very similar.
(I think this will be my last comment here because I'm just repeating myself.)
A person who better comprehends what they read might properly contextualize it within the larger conversation, where the point that stands is that LLMs and ceilings are both useful, that neither is doomed such that no one should use them, and that individual failures are somewhat uncommon and not a reason for others to avoid the category.
If by “functionally equivalent” you mean “can produce similar linguistic outputs in some domains,” then sure we’re already there in some narrow cases. But that’s a very thin slice of what brains do, and thus not functionally equivalent at all.
There are a few non-mystical, testable differences that matter:
- Online learning vs. frozen inference: brains update continuously from tiny amounts of data; LLMs do not.
- Grounding: human cognition is tied to perception, action, and feedback from the world. LLMs operate over symbol sequences divorced from direct experience.
- Memory: humans have persistent, multi-scale memory (episodic, procedural, etc.) that integrates over a lifetime. LLM “memory” is either weights (static) or context (ephemeral).
- Agency: brains are part of systems that generate their own goals and act on the world. LLMs optimize a fixed objective (next-token prediction) and don’t have endogenous drives.
It isn’t, because humans and current LLMs have radically different architectures:
LLMs: training and inference are two separate processes; weights are modifiable during training, static/fixed/read-only at runtime
Humans: training and inference are integrated and run together; weights are dynamic, continuously updated in response to new experiences
You can scale current LLM architectures as far as you want; they will never compete with humans, because they architecturally lack that dynamism.
Actually scaling to human level is going to require fundamentally new architectures. Some people are working on those, but it isn't clear whether any of them have succeeded yet.
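As a toy caricature of the frozen-vs-dynamic contrast described above (purely illustrative classes, not real model code):

```python
# "LLM-style": weights are fixed at inference; prediction never changes them.
# "Brain-style": every prediction can be followed by a weight update.

class FrozenModel:
    def __init__(self, weight: float):
        self.weight = weight

    def predict(self, x: float) -> float:
        return self.weight * x  # inference only; self.weight never changes

class OnlineModel(FrozenModel):
    def predict(self, x: float, target=None, lr: float = 0.1) -> float:
        y = self.weight * x
        if target is not None:
            # learning is interleaved with inference
            self.weight += lr * (target - y) * x
        return y
```

The frozen model can answer forever but never incorporates what it just saw; the online model adjusts after every observation, which is the "dynamism" the comment is pointing at.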
It's not a real reasoning step; it's a sequence of steps, carried out in English rather than in the same "internal space" as human thought (every time the model outputs a token, the entire internal state vector, and all the possibilities it represents, is reduced to one concrete token), that looks like reasoning. But it is still, as you say, autoregressive.
And thus, in plain English, it is determined entirely by the prompt and the random initial seed. I don't know what that is, but I know it's not intent.
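The "collapse" being described can be shown in miniature: a rich distribution over the vocabulary is reduced to a single concrete token at each step, and everything else the distribution represented is discarded. The vocabulary and logits here are made up:

```python
import math
import random

VOCAB = ["yes", "no", "maybe"]

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def emit_token(logits, rng):
    """Sample one token: the whole distribution collapses to a
    single concrete output before the next step begins."""
    probs = softmax(logits)
    return rng.choices(VOCAB, weights=probs, k=1)[0]

token = emit_token([2.0, 0.5, 0.1], random.Random(0))
```

Given the prompt (which fixes the logits) and the seed, the output sequence is fully determined, which is the sense in which it's "prompt plus seed, not intent."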
AlphaZero isn’t an LLM. There are feed-forward networks, recurrent networks, convolutional networks, transformer networks, generative adversarial networks.
Brains have many different regions each with different architectures. None of them work like LLMs. Not even our language centres are structured or trained anything like LLMs.
If this is our only point of disagreement, then we don't actually disagree. I understand "strong engineering control" to mean "something that reduces incidence of a failure mode to an acceptable level".
I'm going to be frank, you are the person who misunderstands (and are being rather rude about it). You are responding to an argument no one is making.
To put a fine point on it, you said this:
> Entropy may mean all ceilings collapse eventually, but that doesn't mean we aren't able to make useful ceilings.
But you were responding to a comment saying this:
> Except your ceiling can and will fall on you unless you take preventative measures, entirely due to molecular interactions within the material.
Emphasis added. They are saying maintenance is necessary, not that a safe ceiling is unachievable. It's obviously achievable, we've all seen it achieved.
They further say:
> It boggles the mind to let an LLM have access to a production database without having explicit preventative measures and contingency plans for it deleting it.
Emphasis added. When they say it boggles the mind to deploy an LLM without the proper measures, the implication is that it does make sense to deploy it with the proper measures.
> ...the point that stands is that LLMs and ceilings are both useful, neither are doomed such that no one should use them, ...
I have not seen a single person in this subthread say that LLMs aren't useful or that they are doomed. People say that. But the people you're talking to haven't.
I try to avoid these petty "I brought the receipts" comments, but I don't like the way you're being snarky to people whose crime is engaging with the premises you set up. The faults you are finding are faults you introduced. I'd appreciate it if you would avoid that in the future.
The person I replied to made a definite claim (that we are "very obviously not ...") for which no evidence has been presented and which I posit humanity is currently unable to definitively answer in one direction or the other.
Anthropomorphism and Anthropodenial are two different forms of Anthropocentrism.
But the really interesting story to me is when you look at the LLM in its own right, to see what it's actually doing.
I'm not disputing the autoregressive framing. I fully admit I started it myself!
But once we're there, what I really wanted to say (just like Turing and Dijkstra did) is that the really interesting question isn't "is it really thinking?" but rather: what is this kind of process doing, is it useful, what can we do or play with it, and, relevant to this particular story, what can go (catastrophically) wrong?
It actually seems like they knew ahead of time and proceeded anyway, but are just using this critique as a way to shift blame.
Both have mass, both are carbon based, both contain DNA/RNA, both are surprisingly over 50% water, both are food, and both can be tasty when served right.
In other respects, they are not.
In many cases, one or the other would do. In other cases, you want something more special (e.g. more protein, or less fat).
True, but we have RAG to offset that.
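The RAG idea being referenced, in minimal form: retrieve documents relevant to the query and splice them into the prompt, so the answer can come from fresh context rather than frozen weights. Real systems use embeddings and vector search; plain keyword overlap stands in here, and all documents and names are invented:

```python
# Minimal RAG sketch: keyword-overlap retrieval + prompt assembly.

DOCS = [
    "The service was migrated to region eu-west-1 in 2024.",
    "Backups run nightly and are stored in a separate account.",
]

def retrieve(query: str, docs=DOCS, k=1):
    """Rank documents by word overlap with the query (toy scoring)."""
    words = set(query.lower().split())
    scored = sorted(
        docs,
        key=lambda d: len(words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str) -> str:
    """Splice retrieved context ahead of the question."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

This offsets stale weights only to the extent that retrieval actually surfaces the right document, which is exactly the failure mode discussed further down the thread.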
> it architecturally lacks their dynamism
We'll get there eventually. Keep in mind that the brain is now about 300k years into fine-tuning itself as the species we classify as Homo sapiens. LLMs haven't even been around for 5 years yet.
That's irrelevant though, since all the above are still prediction machines based on weights.
If you're ok with the brain being that, then you just changed the architecture (from LLM-like), not the concept.
If you want to take a comb to it, the comment saying this:
> Except your ceiling can and will fall on you unless you take preventative measures, entirely due to molecular interactions within the material
Was already off the plot. What was being discussed wasn't some specific molecular process, it was the false premise "oh molecules move around randomly so your ceiling might just collapse of its own accord because the beam decided to randomly disintegrate". That's not something that happens.
You said "The sequence of tokens that would destroy your production environment can be produced by your agent, no matter how much prompting you use". This is analogous to "the ceiling could just collapse on you due to random molecular motion, no matter how much maintenance you do or what materials you use".
Make sense now?
Your edit at the bottom of your top comment does better than your original statement.
We do not have all the answers or a complete understanding of everything.
Language came after conceptual modeling of the world around us. We're surrounded by social species with theory of mind and even the ability to recognise themselves and communicate with each other, but none of them have language. Even the communications faculties they have operate in completely different parts of their brains than ours with completely different structure. Actually we still have those parts of the brain too.
Conceptual representation and modeling came first, then language came along to communicate those concepts. LLMs are the other way around, linguistic tokens come first and they just stream out more of them.
This is why Noam Chomsky was adamant that what LLMs are actually doing in terms of architecture and function has nothing to do with language. At first I thought he must be wrong, he mustn't know how these things work, but the more I dug into it the more I realised he was right. He did know, and he was analysing this as a linguist with a deep understanding of the cognitive processes of language.
To say that brains are language models you have to ditch completely what the term language model actually means in AI research.
In practice that doesn’t always work… I’ve seen cases where (a) the answer is in the RAG but the model can’t find it because it didn’t use the right search terms; embeddings and vector search reduce the incidence of that but cannot eliminate it; (b) the model decided not to use the search tool because it thought the answer was so obvious that tool use was unnecessary; (c) the model doubts, rejects, or forgets the tool call results because they contradict the weights; (d) contradictions between data in the weights and data in the RAG produce contradictory or ineloquent output; (e) the data in the RAG is overly diffuse and the tool fails to surface enough of it to produce the kind of synthesis you’d get if the same info were in the weights.
This is especially the case when the facts have changed radically since the model was trained, e.g. “who is the Supreme Leader of Iran?”
> We'll get there eventually. Keep in mind that the brain is now about 300k years into fine-tuning itself as this species classified as homo sapiens. LLMs haven't even been around for 5 years yet.
We probably will eventually, but I doubt we’ll get there purely by scaling existing approaches. More likely, novel ideas nobody has even thought of yet will prove essential, and a human-level AI model will have radical architectural differences from the current generation.
An LLM is a specific neural architectural structure and training process. Brains are also neural networks, but they are otherwise nothing at all like LLMs and don't function the ways LLMs do architecturally other than being neural networks.
I'd encourage to desist from rudeness, not just when people point it out to you, but at all times.
> You said "The sequence of tokens that would destroy your production environment can be produced by your agent, no matter how much prompting you use". This is analogous to "the ceiling could just collapse on you due to random molecular motion, no matter how much maintenance you do or what materials you use".
If prompt engineering is effective (analogous to performing the necessary maintenance and selecting the correct materials), I'm curious what your explanation is for the incident in the article?
Except it does happen. That’s why buildings get condemned and eventually turn to rubble.
To the exact point; I have a product from a couple years ago using an old model from OpenAI. It’s still running and all it does is write a personality report based on scores from the test. I can’t update the model without seriously rewriting the entire prompt system, but the model has degraded over the years as well. Ergo, my product has degraded of its own accord and there is nearly nothing I can do about it. My only choice is to basically finagle newer models into giving the correct output; but they hallucinate at much higher rates than older models.
I desire neither to be inauthentic, nor to suppress my emotions.
> If prompt engineering is effective (analogous to performing the necessary maintenance and selecting the correct materials), I'm curious what your explanation is for the incident in the article?
Keeping with the analogies, the original article doesn't say whether they built the roof properly or if they just used some screws to hold up a piece of quarter-inch plywood and called it a day.
It's no surprise that a terribly built roof may fall down. It's possible to get shoddy materials from a supplier without knowing.
Calling a curl command isn't something that would be flagged in the model's training as "this deletes things, don't do it". The fact that this happened is not, to me, evidence that the model might have equally run `sudo rm -rf --no-preserve-root /` under similar circumstances.
It sounds like the phrase "NEVER FUCKING GUESS!" was in the prompt as well, which could easily encourage the model towards "be sure of yourself, take action" instead of the "verify" that was meant.
As mentioned elsewhere in this thread, the fact that the article focuses so strongly on "the model confessed! It admitted it did the wrong thing!" doesn't lead me to put a ton of stock into the capability of the author to be cautious.