The big takeaway isn't reverse engineering the ANE per se, but what Manjeet could do with his software engineering skills when accelerated by AI.
This is a good example of the present state of software engineering. Not future state - present state.
> the company is also planning a few other software-based AI upgrades, including a new framework called Core AI. The idea is to replace the long-existing Core ML with something a bit more modern.
https://www.bloomberg.com/news/newsletters/2026-03-01/apple-...
I just wanted to say that you’ve done an excellent job and am looking forward to the 3rd installment.
- The key insight - [CoreML] doesn't XXX. It YYY.
With that being said, this is a highly informative article that I enjoyed thoroughly! :)
The article links to their own GitHub repo: https://github.com/maderix/ANE
6.6 FLOPS/W, plus the ability to completely turn off when not in use, so 0W at idle.
Sure, "collaboratively." Why would I ever trust a vibe coded analysis? How do I, a non expert in this niche, know that Opus isn't pulling a fast one on both of us? LLMs write convincing bullshit that even fools experts. Have you manually verified each fact in this piece? I doubt it. Thanks for the disclaimer, it saved me from having to read it.
Why did you guys remove the ability to detach the console and move it to another window?
Efficiency is the question.
Just some things that people will likely take for granted that IIRC Apple have said use the ANE or at least would likely benefit from it: object recognition, subject extraction from images and video, content analysis, ARKit, spam detection, audio transcription.
Humans also make mistakes and assumptions while reverse engineering, so engineers will always be needed to go through the results and test things.
> Apple’s “38 TOPS INT8” is computed as 19 TFLOPS FP16 × 2, following the industry convention of counting INT8 operations as 2× the FP16 rate. But the hardware doesn’t actually execute INT8 operations twice as fast.
Why Apple would follow that convention when the hardware explicitly doesn't support it seems like a more straight-faced lie than even Apple usually tells.
And while everyone else went to more powerful giant LLMs, Apple moved most of Siri from the cloud to your device. Though they do use both (which you can see when Siri corrects itself during transcription—you get the local Siri version corrected later by the cloud version).
> hollance/neural-engine — Matthijs Hollemans’ comprehensive community documentation of ANE behavior, performance characteristics, and supported operations. The single best existing resource on ANE.
> mdaiter/ane — Early reverse engineering with working Python and Objective-C samples, documenting the ANECompiler framework and IOKit dispatch.
> eiln/ane — A reverse-engineered Linux driver for ANE (Asahi Linux project), providing insight into the kernel-level interface.
> apple/ml-ane-transformers — Apple’s own reference implementation of transformers optimized for ANE, confirming design patterns like channel-first layout and 1×1 conv preference.
Here is why you are correct:
- I see what you did there.
- You are always right.
It's not my subject, but it reads as a list of things. There's little exposition.
People seem to be going around pointing out that people talk like parrots, when in reality it's that parrots talk like people.
Did you develop your own whole language at any point to describe the entire world? No, you, me, and society mimic what is around us.
Humans have the advantage, at least at this point, of being a continuous learning device so we adapt and change with the language use around us.
A note on “we”:
Throughout this series, “we” refers to maderix (human) and Claude Opus 4.6 (by Anthropic) working as a pair. The reverse engineering, benchmarking, and training code were developed collaboratively — human intuition driving the exploration, AI reasoning through the data and writing the analysis. We think this kind of human–AI collaboration is a new and natural way to do systems research: one partner as the architect with intuition, the other as the engineer writing the code and crafting experiments.
This whole thing started with a simple question: can you train a model on Apple’s Neural Engine?
Apple doesn’t want you to know the answer. They don’t publish the ANE’s ISA. They don’t document its internal architecture. They don’t even give you a way to program it directly — everything goes through CoreML, which adds layers of abstraction, optimization passes, and overhead that make it nearly impossible to understand what the hardware is actually doing.
So we reverse-engineered it.
Over several days, we mapped the entire software stack from CoreML down to the IOKit kernel driver, discovered how to compile and execute programs on the ANE without CoreML, cracked the binary format, measured the true peak performance (spoiler: Apple’s “38 TOPS” number is misleading), and ultimately got a neural network training on a chip designed exclusively for inference.
This is Part 1 of a three-part series. Here we cover the reverse engineering — how we peeled back the layers to understand what the M4 Neural Engine actually is and how to talk to it directly.
The ANE is not a GPU. It’s not a CPU. It’s a graph execution engine — a fixed-function accelerator that takes a compiled neural network graph and executes the entire thing as one atomic operation. You don’t issue individual multiply-accumulate instructions. You submit a compiled program describing an entire computation graph, and the hardware executes it end-to-end.
Apple introduced the Neural Engine in the A11 (2017) as a 2-core design. Each generation has scaled it up:
[Table: ANE core counts by generation, from the A11’s 2 cores to the M4’s 16]
The M4’s ANE (codename H16G) is what we’re working with. 16 cores, a queue depth of 127 evaluation requests, independent DVFS (dynamic voltage/frequency scaling), and hard power gating that drops it to exactly 0 milliwatts when idle.
We weren’t the first to poke at ANE internals.
hollance/neural-engine — Matthijs Hollemans’ comprehensive community documentation of ANE behavior, performance characteristics, and supported operations. The single best existing resource on ANE.
mdaiter/ane — Early reverse engineering with working Python and Objective-C samples, documenting the ANECompiler framework and IOKit dispatch.
eiln/ane — A reverse-engineered Linux driver for ANE (Asahi Linux project), providing insight into the kernel-level interface.
apple/ml-ane-transformers — Apple’s own reference implementation of transformers optimized for ANE, confirming design patterns like channel-first layout and 1×1 conv preference.
But to our knowledge, nobody had previously: (a) achieved direct _ANEClient API access without CoreML on M4, (b) cracked the in-memory MIL compilation path, (c) measured true peak throughput bypassing CoreML overhead, or (d) trained a model on ANE.
Our approach combined several techniques:
Class discovery via dyld_info -objc on AppleNeuralEngine.framework — this dumps every Objective-C class and method
Method swizzling to intercept CoreML’s calls to the private ANE frameworks
Binary analysis of compiled E5 bundles to understand the neural program format
Scaling analysis — varying matrix sizes, graph depths, and channel counts to infer hardware topology
We discovered 40+ private classes in AppleNeuralEngine.framework, including _ANEClient, _ANEModel, _ANERequest, _ANEIOSurfaceObject, _ANEInMemoryModel, and many more.
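The class-discovery step reduces to filtering the dumper’s output for the `_ANE` prefix. A minimal Python sketch — the dump lines below are illustrative stand-ins, not verbatim `dyld_info -objc` output:

```python
# Filter an Objective-C class dump for ANE private classes.
# The sample text is an illustrative stand-in for `dyld_info -objc` output.
sample_dump = """\
-class: _ANEClient
-class: NSObject
-class: _ANEModel
-class: _ANERequest
"""

ane_classes = sorted(
    line.split(":", 1)[1].strip()
    for line in sample_dump.splitlines()
    if line.startswith("-class:") and "_ANE" in line
)
print(ane_classes)  # ['_ANEClient', '_ANEModel', '_ANERequest']
```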
Here’s what the full ANE software stack looks like, from the public CoreML API down to hardware:
[Diagram: the ANE software stack, from the public CoreML API down through AppleNeuralEngine.framework, ANECompiler, and the IOKit kernel driver to hardware]
The key insight: CoreML is not the only way in. The _ANEClient class in AppleNeuralEngine.framework provides direct access to the compile → load → evaluate pipeline. CoreML is just a convenience layer on top.
Here’s the complete sequence to compile and run a program on ANE without CoreML:
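The actual sequence uses private Objective-C APIs; as a hedged sketch, the pipeline implied by the classes above looks roughly like this (pseudocode — method names are illustrative, not verified signatures):

```
// Pseudocode — class names from the discovered ObjC surface;
// method names are illustrative, not verified signatures.
client  = _ANEClient.sharedConnection()
model   = client.compile(mil_source, weights)   // ANECompiler -> E5 binary
client.load(model)                              // map program + weights for the device
input   = IOSurface(shape, fp16)
output  = IOSurface(shape, fp16)
request = _ANERequest(inputs=[input], outputs=[output])
client.evaluate(model, request)                 // enqueue; hardware runs the whole graph
```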
The I/O uses IOSurfaces — the same shared memory mechanism used for GPU textures. This means zero-copy transfers between GPU and ANE are theoretically possible if you share the same IOSurfaceRef.
Key finding: The ANE supports a queue depth of 127 — you can have up to 127 evaluation requests in-flight simultaneously. This is far deeper than most accelerator queues and suggests the hardware is designed for high-throughput streaming inference.
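A queue that deep behaves like a 127-slot admission semaphore: submissions only block once 127 evaluations are already outstanding. A host-side sketch of that backpressure model — the constant comes from the finding above; everything else is illustrative:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

ANE_QUEUE_DEPTH = 127          # measured queue depth from the article
in_flight = threading.Semaphore(ANE_QUEUE_DEPTH)

def submit(request_id: int) -> int:
    # Blocks only when 127 requests are already outstanding — the same
    # backpressure a driver-side queue of depth 127 would apply.
    with in_flight:
        return request_id      # stand-in for an actual ANE evaluation

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(submit, range(300)))
assert results == list(range(300))
```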
CoreML doesn’t send neural networks to ANE in ONNX or protobuf format. It uses MIL — Machine Learning Intermediate Language — a typed SSA (Static Single Assignment) representation that looks like this:
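A sketch of the textual form, with the op and tensor names made up and the syntax approximated from coremltools’ printed MIL:

```
main(%x: (1, 1024, 1, 1024, fp16)(Tensor)) {
  block0() {
    %y: (1, 1024, 1, 1024, fp16)(Tensor) = conv(x=%x, weight=%w, strides=[1, 1], pad_type="valid", name="y")
  } -> (%y)
}
```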
MIL is surprisingly readable. Every value is typed with both precision and shape. Operations are named and take keyword arguments. The function signature declares input tensors with explicit dimensions.
The tensor layout follows ANE’s native NCDHW + Interleave format: [Batch, Channels, Depth, Height, Width]. For a 1024×1024 matrix, with the depth dimension collapsed, this becomes the 4D shape [1, 1024, 1, 1024] (batch, channels, height, width).
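In plain Python terms, putting a matrix into this channel-first layout nests each row as a single-height plane of its own channel. A sketch with a small 4×4 matrix (indexing convention assumed from the shape above):

```python
# Sketch: a 4x4 matrix in channel-first [N, C, H, W] layout, rows as channels.
rows, cols = 4, 4
mat = [[r * cols + c for c in range(cols)] for r in range(rows)]

# [1, rows, 1, cols]: each row becomes one channel's single-height plane.
ane_layout = [[[row] for row in mat]]

assert len(ane_layout) == 1            # N
assert len(ane_layout[0]) == rows      # C
assert len(ane_layout[0][0]) == 1      # H
assert ane_layout[0][2][0][3] == mat[2][3]
```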
When ANECompiler processes a MIL program, it produces an E5 binary — a FlatBuffer-structured file with these sections:
[Table: sections of the E5 binary]
Here’s the fascinating part: a 1024×1024 matmul compiles to 2,688 bytes. A 128×128 matmul compiles to 2,680 bytes. Nearly identical. The E5 binary isn’t encoding the matrix multiplication algorithm — it’s encoding a parameterized program whose behavior is controlled by tensor descriptors at runtime. The “microcode” is more like a configuration than traditional machine code.
Implication: The ANE hardware likely has a small set of fixed compute primitives (convolution, matrix multiply, elementwise) that are parameterized by tensor shape descriptors. The E5 binary describes which primitives to chain and how to connect them, not the compute itself.
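One way to picture this: the binary fixes *which* primitives run, while shapes live in descriptors consumed at dispatch time. A toy model — the types below are invented for illustration, not the real E5 layout:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TensorDescriptor:   # invented for illustration, not the real E5 layout
    shape: tuple
    dtype: str

@dataclass(frozen=True)
class E5Program:
    primitives: tuple     # e.g. ("matmul",) — which fixed units to chain
    # Shape descriptors are supplied per dispatch, so they don't change
    # the program's size — consistent with the near-constant ~2.7 KB binaries.

prog = E5Program(primitives=("matmul",))
small = TensorDescriptor((1, 128, 1, 128), "fp16")
large = TensorDescriptor((1, 1024, 1, 1024), "fp16")
assert small != large and prog.primitives == ("matmul",)
```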
The file-based compilation path works but has a problem: it requires writing MIL text to disk, creating a directory structure, and pointing the compiler at it. For training — where we need to recompile with updated weights every few steps — this filesystem round-trip is unacceptable.
We discovered _ANEInMemoryModelDescriptor, which accepts MIL text directly in memory:
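Putting the pieces together, the in-memory path looks roughly like this (pseudocode — property and method names are illustrative, not verified signatures):

```
// Pseudocode — names illustrative, not verified signatures.
desc          = _ANEInMemoryModelDescriptor()
desc.milText  = NSData(utf8_bytes_of(mil_source))   // NSData, not NSString
desc.weights  = { "w0": NSData(w0_fp16_bytes) }     // name -> blob dictionary
ensure_writable(TMPDIR)                             // "in-memory" still touches disk
model         = client.compileModel(desc)
```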
Getting this to work required solving three gotchas that cost us days of debugging:
NSData, not NSString: The milText parameter wants an NSData* containing UTF-8 bytes, not an NSString*. Passing a string fails silently.
NSDictionary, not NSData: The weights parameter is a dictionary mapping weight names to NSData blobs, not a single data buffer.
Temp directory workaround: Even the “in-memory” path internally writes to a temp directory. If you don’t have write access to the default location, compilation fails with an opaque error. We had to ensure a writable temp path was available.
And one delightful discovery: Apple’s internal code references a Desctiptor (sic) in one of the class names. Even Apple engineers make typos in private APIs. :)
Through IOKit probing, scaling analysis, and power measurement, we’ve built this profile of the M4 ANE:
[Table: M4 ANE hardware profile]
IOKit’s IOReportLegend reveals the ANE has its own independent power management with adaptive clocking, dithering, and multiple hardware/software triggers:
This level of DVFS sophistication suggests the ANE can independently scale its frequency and voltage based on workload characteristics, separate from the CPU and GPU power domains.
From ANECompiler.framework exports, the ANE natively supports:
[Table: operations natively supported by the ANE]
Notably, Conv appears to be the ANE’s primary compute primitive. As we’ll show in Part 2, expressing matmul as 1×1 convolution unlocks significantly higher throughput.
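The equivalence is easy to see in plain Python: a 1×1 convolution over an input with K channels, using N filters whose weights are B’s columns, computes exactly A @ B. A tiny sketch:

```python
# A (M x K) @ B (K x N) as a 1x1 convolution:
# input: M spatial positions, K input channels; filter n has weights B[k][n].
M, K, N = 2, 3, 4
A = [[1, 2, 3], [4, 5, 6]]
B = [[1, 0, 2, 1], [0, 1, 1, 2], [3, 1, 0, 1]]

matmul = [[sum(A[m][k] * B[k][n] for k in range(K)) for n in range(N)]
          for m in range(M)]

# 1x1 conv at spatial position m, output channel n: a dot product over
# input channels — the identical reduction the matmul performs.
conv1x1 = [[sum(B[k][n] * A[m][k] for k in range(K)) for n in range(N)]
           for m in range(M)]

assert conv1x1 == matmul
```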
All data transfer to and from the ANE uses IOSurfaces. The protocol is straightforward:
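A hedged sketch of the I/O round-trip — the IOSurface lock/unlock calls are real IOSurface.framework APIs, but the helpers and `evaluate()` are illustrative:

```
// Pseudocode — IOSurfaceLock/IOSurfaceUnlock/IOSurfaceGetBaseAddress are
// real IOSurface.framework calls; the helpers and evaluate() are illustrative.
surf = IOSurfaceCreate({width, height, bytesPerElement: 2})   // fp16
IOSurfaceLock(surf, 0)
write_fp16(IOSurfaceGetBaseAddress(surf), input_tensor)
IOSurfaceUnlock(surf, 0)

client.evaluate(model, inputs=[surf], outputs=[out_surf])     // ANE reads/writes directly

IOSurfaceLock(out_surf, kIOSurfaceLockReadOnly)
result = read_fp16(IOSurfaceGetBaseAddress(out_surf))
IOSurfaceUnlock(out_surf, kIOSurfaceLockReadOnly)
```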
Since IOSurfaces are the same mechanism used for GPU texture sharing, this opens up the possibility of zero-copy GPU↔ANE pipelines where both accelerators operate on the same memory.
The ANE compiler caches E5 binaries on disk to avoid recompilation:
First compile takes ~20-40ms. Cache hits are effectively free. This matters for inference (compile once, run forever) but creates challenges for training, where weights change every step.
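The cache behavior, and why it breaks down for training, can be modeled in a few lines. `compile_cached` is a stand-in for ANECompiler, with the cache keyed by a hash of the program text — a guess at the keying, though the observed behavior is consistent with it:

```python
import hashlib

_cache: dict[str, bytes] = {}
compile_calls = 0

def compile_cached(mil_text: str) -> bytes:
    """Stand-in for ANECompiler with an on-disk-style cache (illustrative)."""
    global compile_calls
    key = hashlib.sha256(mil_text.encode()).hexdigest()
    if key not in _cache:
        compile_calls += 1              # slow path: ~20-40 ms in practice
        _cache[key] = b"E5" + bytes.fromhex(key[:8])
    return _cache[key]

compile_cached("matmul w=v1")
compile_cached("matmul w=v1")           # cache hit: effectively free
compile_cached("matmul w=v2")           # updated weights -> new key -> recompile
assert compile_calls == 2
```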
Several discovered classes remain unexplored and hint at capabilities we haven’t tested:
_ANEChainingRequest — may enable chaining multiple compiled models in a single dispatch
_ANESharedEvents / _ANESharedSignalEvent / _ANESharedWaitEvent — Metal-style fence/signal primitives for GPU↔ANE synchronization
_ANEPerformanceStats — possibly hardware performance counters
_ANEVirtualClient — virtualized ANE access, potentially for multi-process sharing
And some things we genuinely don’t know:
The exact ANE core microarchitecture and ISA
How cores are assigned to operations within a graph
The ANE clock frequency (DVFS makes this dynamic)
Whether hardware perf counters are accessible
The exact SRAM topology (banked? unified? per-core?)
Now that we have direct access to the ANE, we can actually measure what it can do. In Part 2, we’ll benchmark everything: matmul scaling, the SRAM performance cliff, why convolution is 3× faster than matmul, why Apple’s “38 TOPS” claim is misleading, and how bypassing CoreML gives you 2-4× more throughput.
In Part 3, we’ll do the thing Apple says you can’t: train a neural network on the Neural Engine.
All code is available at github.com/maderix/ANE in the ane/ directory. Tested on an M4 Mac Mini, macOS 15.x.