Is this solution based on what Apple describes in their 2023 paper 'LLM in a flash' [1]?
0.6 t/s, wait 30 seconds to see what these billions of calculations get us:
"That is a profound observation, and you are absolutely right ..."
This is why mixture of experts (MoE) models are favored for these demos: Only a portion of the weights are active for each token.
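The "only a portion of the weights are active" point can be sketched with a toy top-k router. Everything here (names, shapes, the softmax-over-top-k gating) is illustrative, not any real framework's API:

```python
import numpy as np

def moe_forward(x, router_w, experts, k=2):
    """Toy MoE layer: route one token to its top-k experts only.

    x: (d,) token activation; router_w: (n_experts, d) gating weights;
    experts: list of (d, d) expert weight matrices. All hypothetical.
    """
    logits = router_w @ x                # score every expert (cheap)
    top = np.argsort(logits)[-k:]        # keep only the k best
    gates = np.exp(logits[top])
    gates /= gates.sum()                 # softmax over the chosen k
    # Only k expert matrices are ever read for this token -- that's the
    # whole memory trick behind MoE demos on small devices.
    return sum(g * (experts[i] @ x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
d, n = 8, 4
y = moe_forward(rng.normal(size=d),
                rng.normal(size=(n, d)),
                [rng.normal(size=(d, d)) for _ in range(n)])
print(y.shape)  # (8,)
```

With k=2 of 4 experts, half the expert weights stay untouched for this token; at the scale of a 400B model with many experts, the active fraction is far smaller.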
Emphasis on slowly.
laughed when it slowly began to type that out
EDIT: found this in the replies: https://github.com/Anemll/flash-moe/tree/iOS-App
So this post is like saying that yes an iPhone is Turing complete. Or at least not locked down so far that you're unable to do it.
This exists[0], but the chip in question is physically large and won't fit on a phone.
Realistically you need 300+ GB/s of fast memory attached to the accelerator, with enough capacity to fully hold at least a 4-bit quant. That's at least 380GB of memory. You can gimmick a demo like this with an SSD, but the SSD is just not fast enough to meet the minimum specs for anything more than showing off a neat trick on Twitter.
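The bandwidth argument is just division: tokens/s is capped by how many bytes of weights you must stream per token over your link. The figures below are rough assumptions (an SSD around 4 GB/s, and treating the full 380 GB as read per token as a worst case), not measurements:

```python
# Upper bound on tokens/s = bandwidth / bytes-of-weights-read-per-token.
# All numbers are back-of-envelope assumptions for illustration.
weight_bytes = 380e9   # ~380 GB of quantized weights, per the comment
bw_ssd = 4e9           # ~4 GB/s, a fast NVMe SSD
bw_ram = 300e9         # the 300+ GB/s the comment asks for

for name, bw in [("SSD", bw_ssd), ("fast RAM", bw_ram)]:
    print(f"{name}: at most {bw / weight_bytes:.2f} tokens/s "
          f"if every weight byte is streamed per token")
```

An MoE model reads far less than the full 380 GB per token, which is exactly why demos like this are possible at all; but the SSD line shows why they land near fractions of a token per second.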
The only hope for a handheld execution of a practical, and capable AI model is both an algorithmic breakthrough that does way more with less, and custom silicon designed for running that type of model. The transformer architecture is neat, but it's just not up for that task, and I doubt anyone's really going to want to build silicon for it.
They didn't make special purpose hardware to run a model. They crafted a large model so that it could run on consumer hardware (a phone).
Remember when people were arguing about whether to use mmap? What a ridiculous argument.
At some point someone will figure out how to tile the weights and the memory requirements will drop again.
It's just so slow that nobody pursued it seriously. It's fun to see these tricks implemented, but even on this 2025 top spec iPhone Pro the output is 100X slower than output from hosted services.
It’s only paying Google $1 billion a year for access to Gemini for Siri
If they continue to increase.
You're absolutely right. Now, LLMs are too slow to be useful on handheld devices, and the future of LLMs is brighter than ever.
LLMs can be useful, but quite often the responses are about as painful as LinkedIn posts. Will they get better? Maybe. Will they get worse? Maybe.
The latest M5 MacBook Pros start at 307 GB/s memory bandwidth, the 32-core GPU M5 Max gets 460 GB/s, and the 40-core M5 Max gets 614 GB/s. The CPU, GPU, and Neural Engine all share the memory.
The A19/A19 Pro in the current iPhone 17 line is essentially the same processor (minus the laptop and desktop features that aren’t needed for a phone), so it would seem we're not that far off from being able to run sophisticated AI models on a phone.
With 64GB of RAM you should look into Qwen3.5-27B or Qwen3.5-35B-A3B. I suggest Q5 quantization at most from my experience. Q4 works on short responses but gets weird in longer conversations.
Which makes it even funnier.
It makes me a little sad that Douglas Adams didn't live to see it.
This is 100% correct!
"You are absolutely right to be confused"
That was the closest AI has been to calling me "dumb meatbag".
Moore's law will shrink it to 8mm soon. I think it'll be like a microSD card you plug in.
Or we develop a new silicon process that can mimic synaptic weights in biology. Synapses have plasticity.
We haven't had phones running laptop-grade CPUs/GPUs for that long, and that is a very real hardware feat. Likewise, nobody would've said running a 400b LLM on a low-end laptop was feasible, and that is very much a software triumph.
It’s been a lot of years, but all I can hear after reading that is … I’m making a note here, huge success
The joke revolves around the incongruity of "42" being precisely correct.
Getting bigger (foldable) phones, without losing battery life, and running useable models in the same form-factor is a pretty big ask.
Put another way, there is no demonstrated first mover advantage in LLM-based AI so far and all of the companies involved are money furnaces.
Apple’s bet is intelligent; the “presumed winners” are betting our economic stability on a miracle, like a shaking gambling addict at a horse race who just withdrew his rent money.
The financial math on actually buying over $40k worth of Mac for 1 to 2 youtube videos probably doesn't work that well, even for the really big players.
There are dynamic quants such as Unsloth which quantize only certain layers to Q4. Some layers are more sensitive to quantization than others. Smaller models are more sensitive to quantization than the larger ones. There are also different quantization algorithms, with different levels of degradation. So I think it's somewhat wrong to put "Q4" under one umbrella. It all depends.
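The idea behind "quantize only certain layers to Q4" can be shown with a tiny bit-width planner. The sensitivity scores, thresholds, and layer names below are all made up for illustration, not Unsloth's actual algorithm:

```python
# Sketch of "dynamic" quantization: sensitive layers keep more bits,
# the rest get the headline Q4. Thresholds and scores are invented.
def choose_bits(sensitivity, budget_bits=4):
    if sensitivity > 0.8:
        return 8    # touchiest layers stay near full precision
    if sensitivity > 0.5:
        return 6
    return budget_bits

# Hypothetical per-layer sensitivity (e.g. measured by quantization error).
layers = {"embed": 0.9, "attn.0": 0.6, "mlp.0": 0.2, "mlp.1": 0.1}
plan = {name: choose_bits(s) for name, s in layers.items()}
print(plan)  # {'embed': 8, 'attn.0': 6, 'mlp.0': 4, 'mlp.1': 4}
```

So two "Q4" quants of the same model can differ a lot depending on which layers actually got 4 bits, which is the commenter's point about "Q4" not being one thing.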
https://gwern.net/doc/fiction/science-fiction/1953-dahl-theg...
The $$$ would probably make my eyes bleed tho.
> Or we develop a new silicon process that can mimic synaptic weights in biology. Synapses have plasticity.
It's amazing to me that people consider this to be more realistic than FAANG collaborating on a CUDA-killer. I guess Nvidia really does deserve their valuation.
Agree to disagree, we've had laptop-grade smartphone hardware for longer than we've had LLMs.
Experts are predicted by layer and the individual layer reads are quite small, so this is not really feasible. There's just not enough information to guide a prefetch.
It is objectively slow at around 100X slower than what most people consider usable.
The quality is also degraded severely to get that speed.
> but the point of this is that you can run cheap inference in bulk on very low-end hardware.
You always could, if you didn't care about speed or efficiency.
iPhone 17 Pro outperforms AMD’s Ryzen 9 9950X per https://www.igorslab.de/en/iphone-17-pro-a19-pro-chip-uebert...
The iPhone 17 Pro only has 12GB of RAM. This is an MoE model with ~17B parameters per expert. Even quantized, you can only realistically fit one expert in RAM at a time. Maybe 2 with extreme quantization. It's just swapping them out constantly.
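The "one expert at a time" claim is easy to sanity-check: weight bytes = parameters × bits / 8. Taking the ~17B figure from the comment (and ignoring activations, KV cache, and the OS's own memory use):

```python
# Does one ~17B-parameter expert fit in 12 GB of phone RAM?
# Rough arithmetic only; real footprints include KV cache and runtime overhead.
params = 17e9
ram_gb = 12

for bits in (8, 4, 3):
    gb = params * bits / 8 / 1e9
    fits = "fits" if gb < ram_gb else "does not fit"
    print(f"{bits}-bit: {gb:.1f} GB of weights -> {fits} in {ram_gb} GB")
```

At 4 bits one expert is ~8.5 GB, leaving little headroom, and two experts clearly don't fit without going to ~3 bits or below, which matches the constant swapping the comment describes.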
If some of the experts were unused then you could distill them away. This has been tried! You can find reduced MoE models that strip away some of the experts, though it's only a small number. Their output is not good. You really need all of the experts to get the model's quality.
Also I wouldn’t trust 3-bit quantization for anything real. I run a 5-bit qwen3.5-35b-A3B MoE model on my studio for coding tasks and even the 4-bit quant was more flaky (hallucinations, and sometimes it would think about running tool calls and just not run them, lol).
If you decided to give it a go make sure to use the MLX over the GGUF version! You’ll get a bit more speed out of it.
We've had solid CPUs for a while, but GPUs have lagged behind (and they're the ones that matter for this particular application). iPhones still lead by a comfortable margin on this front, but have historically been pretty limited on the IO front (only supported USB2 speeds until recently).
Pretty sure the M5 Ultra will be out after WWDC, so my M3 Ultra is (while still completely capable of fulfilling my needs) looking a bit long in the tooth. If I can get a good price for it now, I might be able to offset most of the M5 post WWDC...
Nobody actually quantizes every layer to Q4 in a Q4 quant.
This approach also makes less sense for discrete GPUs where VRAM is quite fast but scarce, and the GPU's PCIe link is a key bottleneck. I suppose it starts to make sense again once you're running the expert layers with CPU+RAM.
(One) source: https://www.reddit.com/r/Fedora/comments/1mjudsm/comment/n7d...
When the individual expert sizes are similar to the entire size of the RAM on the device, that's your only option.
Don't get me wrong, it's an awesome achievement, but 0.6 tokens/s at presumably fairly heavy compute (and battery drain), on a mobile device? There aren't too many use cases for that :)
Practical LLMs on mobile devices are at least a few years away.
I understand this is for a demo but do we really need a 400B model in the mobile? A 10B model would do fine right? What do we miss with a pared down one?
Apple has always seen RAM as an economic advantage for their platform: Make the development effort to ensure that the OS and apps work well with minimal memory and save billions every year in hardware costs. In 2026, iPhones still come with 8GB of RAM, Pro/Max come with 12GB.
The problem is that AI (ML/LLM training and inference) are areas where you can't get around the need for copious amounts of fast working memory. (Thus the critical shortage of RAM at the moment as AI data centers consume as many memory chips as possible.)
Unless there's something I don't know (which is more than possible) Apple can't code their way around this problem, nor create specialized SoCs with ML cores that obviate the need for lots and lots of RAM.
So, it's going to be interesting whether they accept this reality and we start seeing the iPhones in the future with 16GB, 32GB or more as standard in order to make AI performant. And if they give up on adding AI to the billions of iPhones with minimal RAM already out there.
As a side note, 8GB of RAM hasn't been enough for a decade. It prevents basic tasks like keeping web tabs live in the background. My pet peeve is having just a few websites open, and having the page refresh when swapping between them because of aggressive memory management.
To me, Apple's obvious strength is pushing AI to the edge as much as possible. While other companies are investing in massive data centers which will have millions of chips that will be outdated within the next couple years, Apple will be able to incrementally improve their ML/AI features by running on the latest and greatest chips every year. Apple has a huge advantage in that they can design their chips with a mega high speed bus, which is just as important as the quantity of RAM.
But all that depends on Apple's willingness to accept that RAM isn't an area they can skimp on any more, and I'm not sure they will.
Sorry for the brain dump. I'd love to be educated on this in case I'm totally off base.
That said, it'd be a fun quote and I've jokingly said it as well, as I think of it more as part of 'popular' culture lol
Your time-averaged power budget for things that run on phones is about 0.5W (batteries are about 10Wh and should last at least a day). That's about three orders of magnitude lower than the GPUs running in datacenters.
Even if battery technology improves you can't have a phone running hot, so there are strong physical limits on the total power budget.
More or less the same applies to laptops, although there you get maybe an additional order of magnitude.
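The power-budget argument above is just two divisions; spelling it out with the comment's own figures (plus an assumed ~700W for a modern datacenter GPU, which is not from the comment):

```python
# Time-averaged phone power budget vs a datacenter GPU. Rough figures.
battery_wh = 10           # typical phone battery, per the comment
hours = 20                # "should last at least a day"
phone_budget_w = battery_wh / hours

datacenter_gpu_w = 700    # assumed: a current datacenter GPU under load

print(f"phone budget: {phone_budget_w:.2f} W")
print(f"ratio: {datacenter_gpu_w / phone_budget_w:.0f}x")  # ~1400x
```

~1400x is a bit over three orders of magnitude, matching the comment; a laptop's larger battery and active cooling claw back roughly one of those orders.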
Tl;dr a lot, model is much worse
(Source: maintaining llama.cpp / cloud based llm provider app for 2-3 years now)
https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile#a-...
Having a complete computer in my pocket was very new to me, coming from Nokia where I struggled (as a teenager) to get any software running besides some JS in a browser. I still don't know where they hid whatever you needed to make apps for this device. Android's power, for me, was being able to hack on it (in the HN sense of the word)
Putting the GPU and CPU together and having them both access the same physical memory is standard for phone design.
Mobile phones don't have separate GPUs and separate VRAM like some desktops.
This isn't a new thing and it's not unique to Apple
> I understand this is for a demo but do we really need a 400B model in the mobile? A 10B model would do fine right? What do we miss with a pared down one?
There is already a smaller model in this series that fits nicely into the iPhone (with some quantization): Qwen3.5 9B.
The smaller the model, the less accurate and capable it is. That's the tradeoff.
Why do you say they can't do this?
Pros will want higher intelligence or throughput. Less demanding or knowledgeable customers will get price-funneled to what Apple thinks is the market premium for their use case.
It'll probably be a little harder to keep their developers RAM disciplined (if that's even still true) for typical concerns. But model swap will be a big deal. The same exit vs voice issues will exist for apple customers but the margin logic seems to remain.
https://www.reddit.com/r/EmulationOnAndroid/comments/1m269k0...
You do have a lot of "MLEs" and "Data Scientists" who only know basic PyTorch and SKLearn, but that kind of fat is being trimmed industry wide now.
Domain experience remains gold, especially in a market like today's.
"The phone a few minutes after finishing the benchmark. It no longer booted because the battery was too cold!"
> Mobile phones don't have separate GPUs and separate VRAM like some desktops.
That's true. The difference is the iPhone has wider memory buses and uses faster LPDDR5 memory. Apple places the RAM dies directly on the same package as the SoC (PoP, Package on Package), minimizing latency. Some Android phones have started to do this, too.
iOS is tuned to this architecture which wouldn't be the case across many different Android hardware configurations.
If you're loading gigabytes of model weights into memory, you're also pushing gigabytes through the compute for inference. No matter how you slice it, no matter how dense you make the chips, that's going to cost a lot of energy. It's too energy intensive, simple as.
"On device" inference (for large LLM I mean) is a total red herring. You basically never want to do it unless you have unique privacy considerations and you've got a power cable attached to the wall. For a phone maybe you would want a very small model (like 3B something in that size) for Siri-like capabilities.
On a phone, each query/response is going to cost you 0.5% of your battery. That just isn't tenable for the way these models are being used.
Try this for yourself. Load a 7B model on your laptop and talk to it for 30 minutes. These things suck energy like a vacuum, even the shitty models. A network round trip gets you hundreds of tokens from a SOTA model and costs about 1 joule. By contrast, a single forward pass (one token) of a shitty 7B model costs 1 joule. It's just not tenable.
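Taking the comment's own per-token figure at face value, the battery math is quick to check. The 500-token reply length is an assumption for illustration:

```python
# Battery cost of one local LLM reply, using the comment's ~1 J/token claim.
battery_j = 10 * 3600        # a 10 Wh phone battery, in joules
local_j_per_token = 1.0      # claimed cost per forward pass of a 7B model
tokens_per_reply = 500       # assumed reply length

reply_j = tokens_per_reply * local_j_per_token
pct = 100 * reply_j / battery_j
print(f"one {tokens_per_reply}-token local reply: {reply_j:.0f} J "
      f"= {pct:.1f}% of the battery")  # ~1.4%
```

That's in the same ballpark as the ~0.5% per query figure upthread (reply lengths and models vary), and either way a few dozen queries makes a visible dent in a full charge, versus ~1 J total for the network round trip.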
Apple recently stated on an earnings call they signed contracts with RAM vendors before prices got out of control, so they should be good for a while. Nvidia also uses TSMC for their chips, which may affect A series and M series chip production.
Yes, TSMC has a plant in Arizona but my understanding is they can't make the cutting edge chips there; at least not yet.
That said, power consumption is one of the reasons I think pushing this stuff to the edge is the only real path for AI in terms of a business model. It basically spreads the load and passes the cost of power to the end user, rather than trying to figure out how to pay for it at the data center level.
Package-on-Package has been used in mobile SoCs for a long time. This wasn't an Apple invention. It's not new, either. It's been this way for 10+ years. Even cheap Raspberry Pi models have used package-on-package memory.
The memory bandwidth of flagship iPhone models is similar to the memory bandwidth of flagship Android phones.
There's nothing uniquely Apple in this. This is just how mobile SoCs have been designed for a long time.
It's not like Apple's GPU designs are world-class anyways, they're basically neck-and-neck with AMD for raster efficiency. Except unlike AMD, Apple has all the resources in the world to compete with Nvidia and simply chooses to sit on their ass.
More correct to say that the memory bandwidth of ALL iPhone models is similar to the memory bandwidth of flagship Android models. The A18 and A18 pro do not differ in memory bandwidth.