The big takeaway isn't reverse engineering the ANE per se, but what Manjeet could do with his software engineering skills when accelerated by AI.
This is a good example of the present state of software engineering. Not future state - present state.
> the company is also planning a few other software-based AI upgrades, including a new framework called Core AI. The idea is to replace the long-existing Core ML with something a bit more modern.
https://www.bloomberg.com/news/newsletters/2026-03-01/apple-...
I just wanted to say that you’ve done an excellent job and am looking forward to the 3rd installment.
- The key insight - [CoreML] doesn't XXX. It YYY.
With that being said, this is a highly informative article that I enjoyed thoroughly! :)
The article links to their own GitHub repo: https://github.com/maderix/ANE
6.6 FLOPS/W, plus the ability to completely turn off when not in use, so 0W at idle.
Sure, "collaboratively." Why would I ever trust a vibe coded analysis? How do I, a non expert in this niche, know that Opus isn't pulling a fast one on both of us? LLMs write convincing bullshit that even fools experts. Have you manually verified each fact in this piece? I doubt it. Thanks for the disclaimer, it saved me from having to read it.
Why did you guys remove the ability to detach the console and move it to another window?
Efficiency is the question.
Just some things that people will likely take for granted that IIRC Apple have said use the ANE or at least would likely benefit from it: object recognition, subject extraction from images and video, content analysis, ARKit, spam detection, audio transcription.
Humans also make mistakes and assumptions while reverse engineering, so engineers will always be needed to go through the results and test things.
> Apple’s “38 TOPS INT8” is computed as 19 TFLOPS FP16 × 2, following the industry convention of counting INT8 operations as 2× the FP16 rate. But the hardware doesn’t actually execute INT8 operations twice as fast.
Why Apple would follow that convention when the hardware explicitly doesn't support it seems like a more straight-faced lie than even Apple usually tells.
And while everyone else went to more powerful giant LLMs, Apple moved most of Siri from the cloud to your device. Though they do use both (which you can see when Siri corrects itself during transcription—you get the local Siri version corrected later by the cloud version).
> hollance/neural-engine — Matthijs Hollemans’ comprehensive community documentation of ANE behavior, performance characteristics, and supported operations. The single best existing resource on ANE.
> mdaiter/ane — Early reverse engineering with working Python and Objective-C samples, documenting the ANECompiler framework and IOKit dispatch.
> eiln/ane — A reverse-engineered Linux driver for ANE (Asahi Linux project), providing insight into the kernel-level interface.
> apple/ml-ane-transformers — Apple’s own reference implementation of transformers optimized for ANE, confirming design patterns like channel-first layout and 1×1 conv preference.
Here is why you are correct:
- I see what you did there.
- You are always right.
It's not my subject, but it reads as a list of things. There's little exposition.
People seem to be going around pointing out that people talk like parrots, when in reality it's that parrots talk like people.
Did you develop your own whole language at any point to describe the entire world? No, you, me, and society mimic what is around us.
Humans have the advantage, at least at this point, of being a continuous learning device so we adapt and change with the language use around us.
A note on “we”:
Throughout this series, “we” refers to maderix (human) and Claude Opus 4.6 (by Anthropic) working as a pair. The reverse engineering, benchmarking, and training code were developed collaboratively — human intuition driving the exploration, AI reasoning through the data and writing the analysis. We think this kind of human–AI collaboration is a new and natural way to do systems research: one partner as the architect with intuition, the other as the engineer writing the code and crafting experiments.
This whole thing started with a simple question: can you train a model on Apple’s Neural Engine?
Apple doesn’t want you to know the answer. They don’t publish the ANE’s ISA. They don’t document its internal architecture. They don’t even give you a way to program it directly — everything goes through CoreML, which adds layers of abstraction, optimization passes, and overhead that make it nearly impossible to understand what the hardware is actually doing.
So we reverse-engineered it.
Over several days, we mapped the entire software stack from CoreML down to the IOKit kernel driver, discovered how to compile and execute programs on the ANE without CoreML, cracked the binary format, measured the true peak performance (spoiler: Apple’s “38 TOPS” number is misleading), and ultimately got a neural network training on a chip designed exclusively for inference.
This is Part 1 of a three-part series. Here we cover the reverse engineering — how we peeled back the layers to understand what the M4 Neural Engine actually is and how to talk to it directly.
The ANE is not a GPU. It’s not a CPU. It’s a graph execution engine — a fixed-function accelerator that takes a compiled neural network graph and executes the entire thing as one atomic operation. You don’t issue individual multiply-accumulate instructions. You submit a compiled program describing an entire computation graph, and the hardware executes it end-to-end.
Apple introduced the Neural Engine in the A11 (2017) as a 2-core design. Each generation has scaled it up:
[Table: ANE core counts by generation, from the A11’s 2 cores to the M4’s 16]
The M4’s ANE (codename H16G) is what we’re working with. 16 cores, a queue depth of 127 evaluation requests, independent DVFS (dynamic voltage/frequency scaling), and hard power gating that drops it to exactly 0 milliwatts when idle.
We weren’t the first to poke at ANE internals.
hollance/neural-engine — Matthijs Hollemans’ comprehensive community documentation of ANE behavior, performance characteristics, and supported operations. The single best existing resource on ANE.
mdaiter/ane — Early reverse engineering with working Python and Objective-C samples, documenting the ANECompiler framework and IOKit dispatch.
eiln/ane — A reverse-engineered Linux driver for ANE (Asahi Linux project), providing insight into the kernel-level interface.
apple/ml-ane-transformers — Apple’s own reference implementation of transformers optimized for ANE, confirming design patterns like channel-first layout and 1×1 conv preference.
But to our knowledge, nobody had previously: (a) achieved direct _ANEClient API access without CoreML on M4, (b) cracked the in-memory MIL compilation path, (c) measured true peak throughput bypassing CoreML overhead, or (d) trained a model on ANE.
Our approach combined several techniques:
Class discovery via dyld_info -objc on AppleNeuralEngine.framework — this dumps every Objective-C class and method
Method swizzling to intercept CoreML’s calls to the private ANE frameworks
Binary analysis of compiled E5 bundles to understand the neural program format
Scaling analysis — varying matrix sizes, graph depths, and channel counts to infer hardware topology
We discovered 40+ private classes in AppleNeuralEngine.framework, including _ANEClient, _ANEModel, _ANERequest, _ANEIOSurfaceObject, _ANEInMemoryModel, and many more.
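The class-discovery step reduces to filtering the dumper’s output for the `_ANE` prefix. A minimal Python sketch — the dump lines below are illustrative stand-ins, not verbatim `dyld_info -objc` output:

```python
# Filter an Objective-C class dump for ANE private classes.
# The sample text is an illustrative stand-in for `dyld_info -objc` output.
sample_dump = """\
-class: _ANEClient
-class: NSObject
-class: _ANEModel
-class: _ANERequest
"""

ane_classes = sorted(
    line.split(":", 1)[1].strip()
    for line in sample_dump.splitlines()
    if line.startswith("-class:") and "_ANE" in line
)
print(ane_classes)  # ['_ANEClient', '_ANEModel', '_ANERequest']
```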
Here’s what the full ANE software stack looks like, from the public CoreML API down to hardware:
[Diagram: the ANE software stack, from the public CoreML API down through AppleNeuralEngine.framework, ANECompiler, and the IOKit kernel driver to hardware]
The key insight: CoreML is not the only way in. The _ANEClient class in AppleNeuralEngine.framework provides direct access to the compile → load → evaluate pipeline. CoreML is just a convenience layer on top.
Here’s the complete sequence to compile and run a program on ANE without CoreML:
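The actual sequence uses private Objective-C APIs; as a hedged sketch, the pipeline implied by the classes above looks roughly like this (pseudocode — method names are illustrative, not verified signatures):

```
// Pseudocode — class names from the discovered ObjC surface;
// method names are illustrative, not verified signatures.
client  = _ANEClient.sharedConnection()
model   = client.compile(mil_source, weights)   // ANECompiler -> E5 binary
client.load(model)                              // map program + weights for the device
input   = IOSurface(shape, fp16)
output  = IOSurface(shape, fp16)
request = _ANERequest(inputs=[input], outputs=[output])
client.evaluate(model, request)                 // enqueue; hardware runs the whole graph
```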
The I/O uses IOSurfaces — the same shared memory mechanism used for GPU textures. This means zero-copy transfers between GPU and ANE are theoretically possible if you share the same IOSurfaceRef.
Key finding: The ANE supports a queue depth of 127 — you can have up to 127 evaluation requests in-flight simultaneously. This is far deeper than most accelerator queues and suggests the hardware is designed for high-throughput streaming inference.
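A queue that deep behaves like a 127-slot admission semaphore: submissions only block once 127 evaluations are already outstanding. A host-side sketch of that backpressure model — the constant comes from the finding above; everything else is illustrative:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

ANE_QUEUE_DEPTH = 127          # measured queue depth from the article
in_flight = threading.Semaphore(ANE_QUEUE_DEPTH)

def submit(request_id: int) -> int:
    # Blocks only when 127 requests are already outstanding — the same
    # backpressure a driver-side queue of depth 127 would apply.
    with in_flight:
        return request_id      # stand-in for an actual ANE evaluation

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(submit, range(300)))
assert results == list(range(300))
```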
CoreML doesn’t send neural networks to ANE in ONNX or protobuf format. It uses MIL — Machine Learning Intermediate Language — a typed SSA (Static Single Assignment) representation that looks like this:
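A sketch of the textual form, with the op and tensor names made up and the syntax approximated from coremltools’ printed MIL:

```
main(%x: (1, 1024, 1, 1024, fp16)(Tensor)) {
  block0() {
    %y: (1, 1024, 1, 1024, fp16)(Tensor) = conv(x=%x, weight=%w, strides=[1, 1], pad_type="valid", name="y")
  } -> (%y)
}
```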
MIL is surprisingly readable. Every value is typed with both precision and shape. Operations are named and take keyword arguments. The function signature declares input tensors with explicit dimensions.
The tensor layout follows ANE’s native NCDHW + Interleave format: [Batch, Channels, Depth, Height, Width]. For a 1024×1024 matrix, with the depth dimension collapsed, this becomes the 4D shape [1, 1024, 1, 1024] (batch, channels, height, width).
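In plain Python terms, putting a matrix into this channel-first layout nests each row as a single-height plane of its own channel. A sketch with a small 4×4 matrix (indexing convention assumed from the shape above):

```python
# Sketch: a 4x4 matrix in channel-first [N, C, H, W] layout, rows as channels.
rows, cols = 4, 4
mat = [[r * cols + c for c in range(cols)] for r in range(rows)]

# [1, rows, 1, cols]: each row becomes one channel's single-height plane.
ane_layout = [[[row] for row in mat]]

assert len(ane_layout) == 1            # N
assert len(ane_layout[0]) == rows      # C
assert len(ane_layout[0][0]) == 1      # H
assert ane_layout[0][2][0][3] == mat[2][3]
```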
When ANECompiler processes a MIL program, it produces an E5 binary — a FlatBuffer-structured file with these sections:
[Table: sections of the E5 binary]
Here’s the fascinating part: a 1024×1024 matmul compiles to 2,688 bytes. A 128×128 matmul compiles to 2,680 bytes. Nearly identical. The E5 binary isn’t encoding the matrix multiplication algorithm — it’s encoding a parameterized program whose behavior is controlled by tensor descriptors at runtime. The “microcode” is more like a configuration than traditional machine code.
Implication: The ANE hardware likely has a small set of fixed compute primitives (convolution, matrix multiply, elementwise) that are parameterized by tensor shape descriptors. The E5 binary describes which primitives to chain and how to connect them, not the compute itself.
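One way to picture this: the binary fixes *which* primitives run, while shapes live in descriptors consumed at dispatch time. A toy model — the types below are invented for illustration, not the real E5 layout:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TensorDescriptor:   # invented for illustration, not the real E5 layout
    shape: tuple
    dtype: str

@dataclass(frozen=True)
class E5Program:
    primitives: tuple     # e.g. ("matmul",) — which fixed units to chain
    # Shape descriptors are supplied per dispatch, so they don't change
    # the program's size — consistent with the near-constant ~2.7 KB binaries.

prog = E5Program(primitives=("matmul",))
small = TensorDescriptor((1, 128, 1, 128), "fp16")
large = TensorDescriptor((1, 1024, 1, 1024), "fp16")
assert small != large and prog.primitives == ("matmul",)
```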
The file-based compilation path works but has a problem: it requires writing MIL text to disk, creating a directory structure, and pointing the compiler at it. For training — where we need to recompile with updated weights every few steps — this filesystem round-trip is unacceptable.
We discovered _ANEInMemoryModelDescriptor, which accepts MIL text directly in memory:
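Putting the pieces together, the in-memory path looks roughly like this (pseudocode — property and method names are illustrative, not verified signatures):

```
// Pseudocode — names illustrative, not verified signatures.
desc          = _ANEInMemoryModelDescriptor()
desc.milText  = NSData(utf8_bytes_of(mil_source))   // NSData, not NSString
desc.weights  = { "w0": NSData(w0_fp16_bytes) }     // name -> blob dictionary
ensure_writable(TMPDIR)                             // "in-memory" still touches disk
model         = client.compileModel(desc)
```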
Getting this to work required solving three gotchas that cost us days of debugging:
NSData, not NSString: The milText parameter wants an NSData* containing UTF-8 bytes, not an NSString*. Passing a string fails silently.
NSDictionary, not NSData: The weights parameter is a dictionary mapping weight names to NSData blobs, not a single data buffer.
Temp directory workaround: Even the “in-memory” path internally writes to a temp directory. If you don’t have write access to the default location, compilation fails with an opaque error. We had to ensure a writable temp path was available.
And one delightful discovery: Apple’s internal code references a Desctiptor (sic) in one of the class names. Even Apple engineers make typos in private APIs. :)
Through IOKit probing, scaling analysis, and power measurement, we’ve built this profile of the M4 ANE:
[Table: M4 ANE hardware profile]
IOKit’s IOReportLegend reveals the ANE has its own independent power management with adaptive clocking, dithering, and multiple hardware/software triggers:
This level of DVFS sophistication suggests the ANE can independently scale its frequency and voltage based on workload characteristics, separate from the CPU and GPU power domains.
From ANECompiler.framework exports, the ANE natively supports:
[Table: operations natively supported by the ANE]
Notably, Conv appears to be the ANE’s primary compute primitive. As we’ll show in Part 2, expressing matmul as 1×1 convolution unlocks significantly higher throughput.
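The equivalence is easy to see in plain Python: a 1×1 convolution over an input with K channels, using N filters whose weights are B’s columns, computes exactly A @ B. A tiny sketch:

```python
# A (M x K) @ B (K x N) as a 1x1 convolution:
# input: M spatial positions, K input channels; filter n has weights B[k][n].
M, K, N = 2, 3, 4
A = [[1, 2, 3], [4, 5, 6]]
B = [[1, 0, 2, 1], [0, 1, 1, 2], [3, 1, 0, 1]]

matmul = [[sum(A[m][k] * B[k][n] for k in range(K)) for n in range(N)]
          for m in range(M)]

# 1x1 conv at spatial position m, output channel n: a dot product over
# input channels — the identical reduction the matmul performs.
conv1x1 = [[sum(B[k][n] * A[m][k] for k in range(K)) for n in range(N)]
           for m in range(M)]

assert conv1x1 == matmul
```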
All data transfer to and from the ANE uses IOSurfaces. The protocol is straightforward:
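A hedged sketch of the I/O round-trip — the IOSurface lock/unlock calls are real IOSurface.framework APIs, but the helpers and `evaluate()` are illustrative:

```
// Pseudocode — IOSurfaceLock/IOSurfaceUnlock/IOSurfaceGetBaseAddress are
// real IOSurface.framework calls; the helpers and evaluate() are illustrative.
surf = IOSurfaceCreate({width, height, bytesPerElement: 2})   // fp16
IOSurfaceLock(surf, 0)
write_fp16(IOSurfaceGetBaseAddress(surf), input_tensor)
IOSurfaceUnlock(surf, 0)

client.evaluate(model, inputs=[surf], outputs=[out_surf])     // ANE reads/writes directly

IOSurfaceLock(out_surf, kIOSurfaceLockReadOnly)
result = read_fp16(IOSurfaceGetBaseAddress(out_surf))
IOSurfaceUnlock(out_surf, kIOSurfaceLockReadOnly)
```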
Since IOSurfaces are the same mechanism used for GPU texture sharing, this opens up the possibility of zero-copy GPU↔ANE pipelines where both accelerators operate on the same memory.
The ANE compiler caches E5 binaries on disk to avoid recompilation:
First compile takes ~20-40ms. Cache hits are effectively free. This matters for inference (compile once, run forever) but creates challenges for training, where weights change every step.
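The cache behavior, and why it breaks down for training, can be modeled in a few lines. `compile_cached` is a stand-in for ANECompiler, with the cache keyed by a hash of the program text — a guess at the keying, though the observed behavior is consistent with it:

```python
import hashlib

_cache: dict[str, bytes] = {}
compile_calls = 0

def compile_cached(mil_text: str) -> bytes:
    """Stand-in for ANECompiler with an on-disk-style cache (illustrative)."""
    global compile_calls
    key = hashlib.sha256(mil_text.encode()).hexdigest()
    if key not in _cache:
        compile_calls += 1              # slow path: ~20-40 ms in practice
        _cache[key] = b"E5" + bytes.fromhex(key[:8])
    return _cache[key]

compile_cached("matmul w=v1")
compile_cached("matmul w=v1")           # cache hit: effectively free
compile_cached("matmul w=v2")           # updated weights -> new key -> recompile
assert compile_calls == 2
```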
Several discovered classes remain unexplored and hint at capabilities we haven’t tested:
_ANEChainingRequest — may enable chaining multiple compiled models in a single dispatch
_ANESharedEvents / _ANESharedSignalEvent / _ANESharedWaitEvent — Metal-style fence/signal primitives for GPU↔ANE synchronization
_ANEPerformanceStats — possibly hardware performance counters
_ANEVirtualClient — virtualized ANE access, potentially for multi-process sharing
And some things we genuinely don’t know:
The exact ANE core microarchitecture and ISA
How cores are assigned to operations within a graph
The ANE clock frequency (DVFS makes this dynamic)
Whether hardware perf counters are accessible
The exact SRAM topology (banked? unified? per-core?)
Now that we have direct access to the ANE, we can actually measure what it can do. In Part 2, we’ll benchmark everything: matmul scaling, the SRAM performance cliff, why convolution is 3× faster than matmul, why Apple’s “38 TOPS” claim is misleading, and how bypassing CoreML gives you 2-4× more throughput.
In Part 3, we’ll do the thing Apple says you can’t: train a neural network on the Neural Engine.
All code is available at github.com/maderix/ANE in the ane/ directory. Tested on an M4 Mac Mini, macOS 15.x.