Is this solution based on what Apple describes in their 2023 paper 'LLM in a flash' [1]?
0.6 t/s, wait 30 seconds to see what these billions of calculations get us:
"That is a profound observation, and you are absolutely right ..."
This is why mixture of experts (MoE) models are favored for these demos: Only a portion of the weights are active for each token.
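The "only a portion of the weights are active" point can be sketched with a toy top-k router. Everything here (names, shapes, the softmax-over-top-k gating) is illustrative, not any real framework's API:

```python
import numpy as np

def moe_forward(x, router_w, experts, k=2):
    """Toy MoE layer: route one token to its top-k experts only.

    x: (d,) token activation; router_w: (n_experts, d) gating weights;
    experts: list of (d, d) expert weight matrices. All hypothetical.
    """
    logits = router_w @ x                # score every expert (cheap)
    top = np.argsort(logits)[-k:]        # keep only the k best
    gates = np.exp(logits[top])
    gates /= gates.sum()                 # softmax over the chosen k
    # Only k expert matrices are ever read for this token -- that's the
    # whole memory trick behind MoE demos on small devices.
    return sum(g * (experts[i] @ x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
d, n = 8, 4
y = moe_forward(rng.normal(size=d),
                rng.normal(size=(n, d)),
                [rng.normal(size=(d, d)) for _ in range(n)])
print(y.shape)  # (8,)
```

With k=2 of 4 experts, half the expert weights stay untouched for this token; at the scale of a 400B model with many experts, the active fraction is far smaller.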
Emphasis on slowly.
laughed when it slowly began to type that out
EDIT: found this in the replies: https://github.com/Anemll/flash-moe/tree/iOS-App
So this post is like saying that yes an iPhone is Turing complete. Or at least not locked down so far that you're unable to do it.
This exists[0], but the chip in question is physically large and won't fit on a phone.
Realistically you need 300+ GB/s of fast memory attached to the accelerator, with enough capacity to fully hold at least a 4-bit quant. That's at least 380GB of memory. You can gimmick a demo like this with an SSD, but the SSD is just not fast enough to meet the minimum specs for anything more than showing off a neat trick on Twitter.
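The bandwidth argument is just division: tokens/s is capped by how many bytes of weights you must stream per token over your link. The figures below are rough assumptions (an SSD around 4 GB/s, and treating the full 380 GB as read per token as a worst case), not measurements:

```python
# Upper bound on tokens/s = bandwidth / bytes-of-weights-read-per-token.
# All numbers are back-of-envelope assumptions for illustration.
weight_bytes = 380e9   # ~380 GB of quantized weights, per the comment
bw_ssd = 4e9           # ~4 GB/s, a fast NVMe SSD
bw_ram = 300e9         # the 300+ GB/s the comment asks for

for name, bw in [("SSD", bw_ssd), ("fast RAM", bw_ram)]:
    print(f"{name}: at most {bw / weight_bytes:.2f} tokens/s "
          f"if every weight byte is streamed per token")
```

An MoE model reads far less than the full 380 GB per token, which is exactly why demos like this are possible at all; but the SSD line shows why they land near fractions of a token per second.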
The only hope for a handheld execution of a practical, and capable AI model is both an algorithmic breakthrough that does way more with less, and custom silicon designed for running that type of model. The transformer architecture is neat, but it's just not up for that task, and I doubt anyone's really going to want to build silicon for it.
They didn't make special purpose hardware to run a model. They crafted a large model so that it could run on consumer hardware (a phone).
Remember when people were arguing about whether to use mmap? What a ridiculous argument.
At some point someone will figure out how to tile the weights and the memory requirements will drop again.
It's just so slow that nobody pursued it seriously. It's fun to see these tricks implemented, but even on this 2025 top spec iPhone Pro the output is 100X slower than output from hosted services.
It’s only paying Google $1 billion a year for access to Gemini for Siri
If they continue to increase.
You're absolutely right. Now, LLMs are too slow to be useful on handheld devices, and the future of LLMs is brighter than ever.
LLMs can be useful, but quite often the responses are about as painful as LinkedIn posts. Will they get better? Maybe. Will they get worse? Maybe.
The latest M5 MacBook Pros start at 307 GB/s memory bandwidth, the 32-core GPU M5 Max gets 460 GB/s, and the 40-core M5 Max gets 614 GB/s. The CPU, GPU, and Neural Engine all share the memory.
The A19/A19 Pro in the current iPhone 17 line is essentially the same processor (minus the laptop and desktop features that aren’t needed for a phone), so it would seem we're not that far off from being able to run sophisticated AI models on a phone.
With 64GB of RAM you should look into Qwen3.5-27B or Qwen3.5-35B-A3B. I suggest Q5 quantization at most from my experience. Q4 works on short responses but gets weird in longer conversations.
Which makes it even funnier.
It makes me a little sad that Douglas Adams didn't live to see it.
This is 100% correct!
"You are absolutely right to be confused"
That was the closest AI has been to calling me "dumb meatbag".
Moore's law will shrink it to 8mm soon. I think it'll be like a microSD card you plug in.
Or we develop a new silicon process that can mimic synaptic weights in biology. Synapses have plasticity.
We haven't had phones running laptop-grade CPUs/GPUs for that long, and that is a very real hardware feat. Likewise, nobody would've said running a 400b LLM on a low-end laptop was feasible, and that is very much a software triumph.
It’s been a lot of years, but all I can hear after reading that is … I’m making a note here, huge success
The joke revolves around the incongruity of "42" being precisely correct.
Getting bigger (foldable) phones, without losing battery life, and running useable models in the same form-factor is a pretty big ask.
Put another way, there is no demonstrated first mover advantage in LLM-based AI so far and all of the companies involved are money furnaces.
Apple’s bet is intelligent; the “presumed winners” are betting our economic stability on a miracle, like a shaking gambling addict at a horse race who just withdrew his rent money.
The financial math on actually buying over $40k worth of Mac for 1 to 2 youtube videos probably doesn't work that well, even for the really big players.
There are dynamic quants such as Unsloth which quantize only certain layers to Q4. Some layers are more sensitive to quantization than others. Smaller models are more sensitive to quantization than the larger ones. There are also different quantization algorithms, with different levels of degradation. So I think it's somewhat wrong to put "Q4" under one umbrella. It all depends.
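The idea behind "quantize only certain layers to Q4" can be shown with a tiny bit-width planner. The sensitivity scores, thresholds, and layer names below are all made up for illustration, not Unsloth's actual algorithm:

```python
# Sketch of "dynamic" quantization: sensitive layers keep more bits,
# the rest get the headline Q4. Thresholds and scores are invented.
def choose_bits(sensitivity, budget_bits=4):
    if sensitivity > 0.8:
        return 8    # touchiest layers stay near full precision
    if sensitivity > 0.5:
        return 6
    return budget_bits

# Hypothetical per-layer sensitivity (e.g. measured by quantization error).
layers = {"embed": 0.9, "attn.0": 0.6, "mlp.0": 0.2, "mlp.1": 0.1}
plan = {name: choose_bits(s) for name, s in layers.items()}
print(plan)  # {'embed': 8, 'attn.0': 6, 'mlp.0': 4, 'mlp.1': 4}
```

So two "Q4" quants of the same model can differ a lot depending on which layers actually got 4 bits, which is the commenter's point about "Q4" not being one thing.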
https://gwern.net/doc/fiction/science-fiction/1953-dahl-theg...
The $$$ would probably make my eyes bleed tho.
> Or we develop a new silicon process that can mimic synaptic weights in biology. Synapses have plasticity.
It's amazing to me that people consider this to be more realistic than FAANG collaborating on a CUDA-killer. I guess Nvidia really does deserve their valuation.
Agree to disagree, we've had laptop-grade smartphone hardware for longer than we've had LLMs.
Experts are predicted by layer and the individual layer reads are quite small, so this is not really feasible. There's just not enough information to guide a prefetch.
It is objectively slow at around 100X slower than what most people consider usable.
The quality is also degraded severely to get that speed.
> but the point of this is that you can run cheap inference in bulk on very low-end hardware.
You always could, if you didn't care about speed or efficiency.
iPhone 17 Pro outperforms AMD’s Ryzen 9 9950X per https://www.igorslab.de/en/iphone-17-pro-a19-pro-chip-uebert...
The iPhone 17 Pro only has 12GB of RAM. This is an MoE model with ~17B parameters per expert. Even quantized, you can only realistically fit one expert in RAM at a time. Maybe 2 with extreme quantization. It's just swapping them out constantly.
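The "one expert at a time" claim is easy to sanity-check: weight bytes = parameters × bits / 8. Taking the ~17B figure from the comment (and ignoring activations, KV cache, and the OS's own memory use):

```python
# Does one ~17B-parameter expert fit in 12 GB of phone RAM?
# Rough arithmetic only; real footprints include KV cache and runtime overhead.
params = 17e9
ram_gb = 12

for bits in (8, 4, 3):
    gb = params * bits / 8 / 1e9
    fits = "fits" if gb < ram_gb else "does not fit"
    print(f"{bits}-bit: {gb:.1f} GB of weights -> {fits} in {ram_gb} GB")
```

At 4 bits one expert is ~8.5 GB, leaving little headroom, and two experts clearly don't fit without going to ~3 bits or below, which matches the constant swapping the comment describes.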
If some of the experts were unused then you could distill them away. This has been tried! You can find reduced MoE models that strip away some of the experts, though it's only a small number. Their output is not good. You really need all of the experts to get the model's quality.
Also I wouldn’t trust 3-bit quantization for anything real. I run a 5-bit qwen3.5-35b-A3B MoE model on my studio for coding tasks and even the 4-bit quant was more flaky (hallucinations, and sometimes it would think about running tool calls and just not run them, lol).
If you decided to give it a go make sure to use the MLX over the GGUF version! You’ll get a bit more speed out of it.
We've had solid CPUs for a while, but GPUs have lagged behind (and they're the ones that matter for this particular application). iPhones still lead by a comfortable margin on this front, but have historically been pretty limited on the IO front (only supported USB2 speeds until recently).
Pretty sure the M5 Ultra will be out after WWDC, so my M3 Ultra is (while still completely capable of fulfilling my needs) looking a bit long in the tooth. If I can get a good price for it now, I might be able to offset most of the M5 post WWDC...
Nobody actually quantizes every layer to Q4 in a Q4 quant.
This approach also makes less sense for discrete GPUs where VRAM is quite fast but scarce, and the GPU's PCIe link is a key bottleneck. I suppose it starts to make sense again once you're running the expert layers with CPU+RAM.
(One) source: https://www.reddit.com/r/Fedora/comments/1mjudsm/comment/n7d...
When the individual expert sizes are similar to the entire size of the RAM on the device, that's your only option.
Don't get me wrong, it's an awesome achievement, but 0.6 tokens/s at presumably fairly heavy compute (and battery drain), on a mobile device? There aren't too many use cases for that :)
Practical LLMs on mobile devices are at least a few years away.
I understand this is for a demo but do we really need a 400B model in the mobile? A 10B model would do fine right? What do we miss with a pared down one?
Apple has always seen RAM as an economic advantage for their platform: Make the development effort to ensure that the OS and apps work well with minimal memory and save billions every year in hardware costs. In 2026, iPhones still come with 8GB of RAM, Pro/Max come with 12GB.
The problem is that AI (ML/LLM training and inference) are areas where you can't get around the need for copious amounts of fast working memory. (Thus the critical shortage of RAM at the moment as AI data centers consume as many memory chips as possible.)
Unless there's something I don't know (which is more than possible) Apple can't code their way around this problem, nor create specialized SoCs with ML cores that obviate the need for lots and lots of RAM.
So, it's going to be interesting whether they accept this reality and we start seeing the iPhones in the future with 16GB, 32GB or more as standard in order to make AI performant. And if they give up on adding AI to the billions of iPhones with minimal RAM already out there.
As a side note, 8GB of RAM hasn't been enough for a decade. It prevents basic tasks like keeping web tabs live in the background. My pet peeve is having just a few websites open, and having the page refresh when swapping between them because of aggressive memory management.
To me, Apple's obvious strength is pushing AI to the edge as much as possible. While other companies are investing in massive data centers which will have millions of chips that will be outdated within the next couple years, Apple will be able to incrementally improve their ML/AI features by running on the latest and greatest chips every year. Apple has a huge advantage in that they can design their chips with a mega high speed bus, which is just as important as the quantity of RAM.
But all that depends on Apple's willingness to accept that RAM isn't an area they can skimp on any more, and I'm not sure they will.
Sorry for the brain dump. I'd love to be educated on this in case I'm totally off base.
That said, it'd be a fun quote and I've jokingly said it as well, as I think of it more as part of 'popular' culture lol
Your time-averaged power budget for things that run on phones is about 0.5W (batteries are about 10Wh and should last at least a day). That's about three orders of magnitude lower than the GPUs running in datacenters.
Even if battery technology improves you can't have a phone running hot, so there are strong physical limits on the total power budget.
More or less the same applies to laptops, although there you get maybe an additional order of magnitude.
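The power-budget argument above is just two divisions; spelling it out with the comment's own figures (plus an assumed ~700W for a modern datacenter GPU, which is not from the comment):

```python
# Time-averaged phone power budget vs a datacenter GPU. Rough figures.
battery_wh = 10           # typical phone battery, per the comment
hours = 20                # "should last at least a day"
phone_budget_w = battery_wh / hours

datacenter_gpu_w = 700    # assumed: a current datacenter GPU under load

print(f"phone budget: {phone_budget_w:.2f} W")
print(f"ratio: {datacenter_gpu_w / phone_budget_w:.0f}x")  # ~1400x
```

~1400x is a bit over three orders of magnitude, matching the comment; a laptop's larger battery and active cooling claw back roughly one of those orders.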
Tl;dr a lot, model is much worse
(Source: maintaining llama.cpp / cloud based llm provider app for 2-3 years now)
https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile#a-...
Having a complete computer in my pocket was very new to me, coming from Nokia where I struggled (as a teenager) to get any software running besides some JS in a browser. I still don't know where they hid whatever you needed to make apps for this device. Android's power, for me, was being able to hack on it (in the HN sense of the word)
Putting the GPU and CPU together and having them both access the same physical memory is standard for phone design.
Mobile phones don't have separate GPUs and separate VRAM like some desktops.
This isn't a new thing and it's not unique to Apple
> I understand this is for a demo but do we really need a 400B model in the mobile? A 10B model would do fine right? What do we miss with a pared down one?
There is already a smaller model in this series that fits nicely into the iPhone (with some quantization): Qwen3.5 9B.
The smaller the model, the less accurate and capable it is. That's the tradeoff.
Why do you say they can't do this?
Pros will want higher intelligence or throughput. Less demanding or knowledgeable customers will get price-funneled to what Apple thinks is the market premium for their use case.
It'll probably be a little harder to keep their developers RAM disciplined (if that's even still true) for typical concerns. But model swap will be a big deal. The same exit vs voice issues will exist for apple customers but the margin logic seems to remain.
https://www.reddit.com/r/EmulationOnAndroid/comments/1m269k0...
You do have a lot of "MLEs" and "Data Scientists" who only know basic PyTorch and SKLearn, but that kind of fat is being trimmed industry wide now.
Domain experience remains gold, especially in a market like today's.
"The phone a few minutes after finishing the benchmark. It no longer booted because the battery was too cold!"
> Mobile phones don't have separate GPUs and separate VRAM like some desktops.
That's true. The difference is the iPhone has wider memory buses and uses faster LPDDR5 memory. Apple places the RAM dies directly on the same package as the SoC (PoP, Package on Package), minimizing latency. Some Android phones have started to do this, too.
iOS is tuned to this architecture which wouldn't be the case across many different Android hardware configurations.
If you're loading gigabytes of model weights into memory, you're also pushing gigabytes through the compute for inference. No matter how you slice it, no matter how dense you make the chips, that's going to cost a lot of energy. It's too energy intensive, simple as.
"On device" inference (for large LLM I mean) is a total red herring. You basically never want to do it unless you have unique privacy considerations and you've got a power cable attached to the wall. For a phone maybe you would want a very small model (like 3B something in that size) for Siri-like capabilities.
On a phone, each query/response is going to cost you 0.5% of your battery. That just isn't tenable for the way these models are being used.
Try this for yourself. Load a 7B model on your laptop and talk to it for 30 minutes. These things suck energy like a vacuum, even the shitty models. A network round trip gets you hundreds of tokens from a SOTA model and costs about 1 joule. By contrast, a single forward pass (one token) of a shitty 7B model costs 1 joule. It's just not tenable.
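Taking the comment's own per-token figure at face value, the battery math is quick to check. The 500-token reply length is an assumption for illustration:

```python
# Battery cost of one local LLM reply, using the comment's ~1 J/token claim.
battery_j = 10 * 3600        # a 10 Wh phone battery, in joules
local_j_per_token = 1.0      # claimed cost per forward pass of a 7B model
tokens_per_reply = 500       # assumed reply length

reply_j = tokens_per_reply * local_j_per_token
pct = 100 * reply_j / battery_j
print(f"one {tokens_per_reply}-token local reply: {reply_j:.0f} J "
      f"= {pct:.1f}% of the battery")  # ~1.4%
```

That's in the same ballpark as the ~0.5% per query figure upthread (reply lengths and models vary), and either way a few dozen queries makes a visible dent in a full charge, versus ~1 J total for the network round trip.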
Apple recently stated on an earnings call they signed contracts with RAM vendors before prices got out of control, so they should be good for a while. Nvidia also uses TSMC for their chips, which may affect A series and M series chip production.
Yes, TSMC has a plant in Arizona but my understanding is they can't make the cutting edge chips there; at least not yet.
That said, power consumption is one of the reasons I think pushing this stuff to the edge is the only real path for AI in terms of a business model. It basically spreads the load and passes the cost of power to the end user, rather than trying to figure out how to pay for it at the data center level.
Package-on-Package has been used in mobile SoCs for a long time. This wasn't an Apple invention. It's not new, either. It's been this way for 10+ years. Even cheap Raspberry Pi models have used package-on-package memory.
The memory bandwidth of flagship iPhone models is similar to the memory bandwidth of flagship Android phones.
There's nothing uniquely Apple in this. This is just how mobile SoCs have been designed for a long time.
It's not like Apple's GPU designs are world-class anyways, they're basically neck-and-neck with AMD for raster efficiency. Except unlike AMD, Apple has all the resources in the world to compete with Nvidia and simply chooses to sit on their ass.
More correct to say that the memory bandwidth of ALL iPhone models is similar to the memory bandwidth of flagship Android models. The A18 and A18 pro do not differ in memory bandwidth.