It has some interesting conclusions, such as that it covers certain AVX512 gaps:
"AVX512 plugs many of the holes that SSE had, whilst SVE2 adds more complex operations (such as histogramming and bit permutation), and even introduces new ‘gaps’ (such as 32/64-bit element only COMPACT, no general vector byte left-shift, non-universal predication etc)."
And also that rusty x86 developers might face skill issues:
"Depending on your application, writing code for SVE2 can bring about new challenges. In particular, tailoring fixed-width problems and swizzling data around vectors may become much more difficult when the length is unknown."
I feel, however, that these days any performance comparison needs to include the power envelope (I realise this is dependent on the final chip)
Only found this, which talks about performance-per-area (PPA) and performance-per-clock (I assume cycle) (PPC): https://www.reddit.com/r/hardware/comments/1gvo28c/latest_ar...
https://www.androidauthority.com/arm-c1-cpu-mali-g1-gpu-deep...
ARM is free to reorder stores and loads; this is called a weak memory model. So unless the ordering is stated explicitly to the compiler, e.g. with C++ memory_order::acquire and memory_order::release, you might get invalid behavior. Heisenbugs in the worst case.
That said, I've never actually run into one of these issues.
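For illustration, a minimal sketch of the acquire/release pairing mentioned above (a hypothetical producer/consumer, not from any particular codebase). On a weakly ordered core, demoting both memory orders to memory_order_relaxed could let the consumer see ready == true while still reading a stale payload:
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                  // plain store
    ready.store(true, std::memory_order_release);  // publishes payload
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {}  // pairs with the release
    assert(payload == 42);  // acquire/release forbids the harmful reorder
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}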
Better to favor RISC-V implementations as much as possible.
But I don't know if there are already good modern-desktop-grade RISC-V implementations (in the US, SiFive is moving fast as far as I know)... and the hard part: accessing the latest and greatest silicon process at TSMC, aka the ~5 GHz class.
Those markets are completely saturated, namely at best it will be very slow unless something big does happen: for instance, AMD adapting its best micro-architecture to RISC-V (mostly the ISA decoding), etc.
And if Valve starts to distribute a client with a strong RISC-V game compilation framework...
So far I've only found various M cores online. It would be fun to have something to experiment with on a cheapish FPGA like the Kintex-7 XC7K480T, which may have enough resources for some in-order A core and can be had for $50 or so.
While the parent article shows AMD Zen 5 having significantly better results in floating-point SPEC CPU2017, these benchmark results are still misleading, because in applications properly optimized for AVX-512 the difference between Zen 5 and Cortex-X925 would be much greater. I have no idea how SPEC has been compiled by the author of the article, but the floating-point results are not consistent with programs optimized for Zen 5.
One disadvantage of Cortex-X925 is having narrower vector instructions and registers, which requires more instructions for the same task; this is only partially compensated by the fact that Cortex-X925 can execute up to 6 128-bit vector instructions per clock cycle (vs. up to 4 vector instructions per clock cycle for Intel/AMD, but wider ones: 256-bit for Intel and up to 512-bit for Zen 5). This has been shown in the parent article.
The second disadvantage of Cortex-X925 is that it has an unbalanced microarchitecture for vector operations. For decades most CPUs with good vector performance had an equal throughput for fused multiply-add operations and for loads from the L1 cache memory. This is required to ensure that the execution units are fed all the time with operands in many applications.
However, Cortex-X925 can do at most 4 loads, while it can do 6 FMAs. Because of this lower load throughput, Cortex-X925 can reach the maximum FMA throughput much less frequently than the AMD or Intel CPUs. This is compounded by the fact that achieving better FMA-to-load ratios requires more storage space in the architectural vector registers, and Cortex-X925 is also disadvantaged there, having four times smaller vector registers than Zen 5.
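As a back-of-the-envelope illustration of that balance (my numbers, not the parent's): a streaming kernel like saxpy needs two loads per FMA, so the load ports rather than the FMA pipes set the ceiling.
// saxpy: one FMA per element, fed by two loads (x[i], y[i]) and one store.
// With 128-bit vectors, each vector FMA needs two 128-bit loads, so a core
// with 6 FMA pipes but only 4 load-capable AGUs sustains at most about
// 4 / 2 = 2 such FMAs per cycle; the other FMA pipes starve. Reaching
// 6 FMAs/cycle needs kernels that reuse data already held in registers,
// which is where having fewer, narrower architectural registers hurts.
void saxpy(float a, const float* x, float* y, int n) {
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}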
The arithmetic intensity of most SPECfp subtests is quite low. You see this wall because it ends up reaching bandwidth limitations long before running out of compute on cores with beefy SIMD.
A few years ago they were writing articles about Apple Silicon.
But I did some comparisons when I tested the same Dell GB10 hardware late last year: https://www.jeffgeerling.com/blog/2025/dells-version-dgx-spa...
And Qualcomm.
Yes, Asahi exists, and props to the developers, but I don't think I'm alone in being unwilling to buy hardware from a manufacturer who obviously is not interested in supporting open operating systems.
This is an industry blog, not a consumer oriented blog.
What most people want is interactivity and fast web pages which doesn't have much to do with wide vector instructions (except possibly for optimized video decoding).
First, I don't like making blind tradeoffs. If what I need (for whatever reason) is a really beefy ARM CPU, I'd like to know what the "Apple-less tax" costs me (if anything!)
Second, the status quo is that Apple Silicon is the undisputed king of ARM CPU performance, so it's the obvious benchmark to compare this thing against. Providing that context is just basic journalistic practice, even if just to say "but it's irrelevant because we can't use the hardware without the software".
I don't see how that's holding you back from using these tools for your work any more than using a Makita power tool with an LXT battery pack.
So they don’t actively help (or even make it easy by providing clear docs), but they do still do enough to enable really motivated people.
Anyway, here it is in GB10 form-
https://browser.geekbench.com/v6/cpu/14078585
And here is a comparable M5 in a laptop-
https://browser.geekbench.com/macs/macbook-pro-14-inch-2025
M5 has about a 32% per core advantage, though the DGX obviously has a much richer power budget so they tossed in 10 high performance cores and 10 efficiency cores (versus the 4 performance and 6 efficiency in the latter). Given the 10/10 vs 4/6 core layouts I would expect the former to massively trounce the latter on multicore, while it only marginally does.
Samsung used the same X925 core in their Exynos 2500 that they use on a flip phone. Mediatek put it in a couple of their chips as well.
"Reaching desktop" is always such a weird criteria though. It's kind of a meaningless bar.
It took ARM decades to get to where it is, and that involved a long stint in low-margin niche applications like embedded or appliances where x86 was poorly suited due to heat and power consumption.
*a = 1;
*b = 'p';
both the compiler and the CPU can freely pick the order in which those two happen (or even execute them in parallel, or do half of one first, then the other, then the other half of the first, but I think those are hypothetical cases).
x86-64 will never do such a swap, but x86-64 compilers might.
If you write
*a = 1;
*b = 2;
, things might be different for the C compiler because a and b can alias. The hardware is still free to change that order, though.
I am looking for a CPU.
I don't want to confront my users with "Please enter your Apple ID" or any other unexpected messages that I have no control over.
Is Apple M series an option for me?
Also, on "most software runs on ARM now": I don't think that has actually happened in practice.
"Daily driver" is probably a better term, but everyone's daily usage patterns will vary. I could do my day job with a VT100 emulator on a phone for example.
The successor of the X925 is the C1 Ultra, and even that was released 6 months ago, in September 2025, with the MediaTek 9500; GeekerWan even has a phone review they did with that chip last year.
Honestly, Apple is the strange one because they never discuss CPUs until they are available to buy in a product; they don't need to bother.
If there's space between the SIMD instructions, then double-pumping or even quad-pumping isn't very expensive (and with 6 SIMD ports, it might even be basically free).
I'd love to have a Xeon 6, a big EPYC, or an AmpereOne (or a loaded IBM LinuxOne Express) as my daily driver, but that's just not something I can justify. It'd not be easy to come up with something for all this compute capacity to do. A reasonable GPU is a much better match for most of my workloads, which aren't even about pushing pixels anymore - iGPUs are enough these days - but multiplying matrices with embarrassingly low precision, so it can pretend to understand programming tasks.
Running the SPEC benchmark integer and floating point suites takes all day, but it's hard to game a benchmark with that much depth.
It's a shame that nobody has been willing to offer that level of detail.
If your metric is single-thread performance, yes, but on just about anything else Graviton 4 wins.
We know how Apple's hardware performs on native workloads. We know how it performs emulating x86 workloads (and why). Surely "... and this is how this hardware measures up against the other guys trying to achieve the exact same thing" is a relevant comparison? I can't be the only person who reads "reaching desktop performance" and wonders "you mean comparable to the M1, or to the M3 Ultra?"
For better or worse if you make a (high end) consumer CPU it will be judged against the M-series, just like if you make a high end phone it will be judged against the iPhone.
This is an article testing shipping hardware you can buy today.
Worked with their cores in $pastJob. I'd say their main products are flowery promises and long errata sheets.
I think you'll see ever more accelerating RISC-V adoption in China if the United States continues on its "cold war" style mentality about relations with them.
That said we're a long long way from Actually Existing RISC-V being at performance parity with ARM64, let alone x86.
Developed on Intel i860, then MIPS, and only then on x86, alongside Alpha.
Colima is backed by qemu, not Rosetta, so if Rosetta disappeared tomorrow I don't think I'd notice. I'm sure it's "better" but when the competition is "good enough" it doesn't really matter.
To do this lookup in the first place, you pull a number of bits from the virtual/physical address you're looking up, which tells you what bucket to start at. The minimum page size determines how many bits you can use from these addresses to refer to unique buckets. If you don't have a lot of bits, then you can't count very high (6 bits = 2^6 = 64 buckets) -- so to increase the size of the cache, you need to instead increase the associativity, which makes latency worse. For L1 cache, you basically never want to make latency worse, so you are practically capped here.
Platforms like Apple Silicon instead set the minimum page size to 16k, so you get more bits to count buckets (8 bits = 256 buckets). Thus you can increase the size of the cache while keeping associativity low; L1 cache on Apple Silicon is something crazy like 192 KB, and L2 (for the same reasons) is 16 MB+. x86 machines and software, for legacy reasons, are very much tied to 4k page size, which puts something of a practical limit on the size of their downstream caches.
Look up "Virtually Indexed, Physically Tagged" (VIPT) caches for more info if you want it.
x = *a;
if (x) y = *b;
The compiler cannot reorder the load of b before the load of a, because it may not be a valid pointer if x is false. But the CPU is free to speculate long ahead, and if the pointer in b isn't valid, that's fine: the CPU can attempt a speculative load and fail.
It's not particularly common, and code that has this issue will probably crash only rarely, but it's not too hard to do.
At least in my house, ARM cores outnumber x86 cores by at least four to one. And I'm not even counting the 32-bit ARM cores in embedded devices.
There is a lot of space for memory ordering bugs to manifest in all those devices.
The real reason is probably because they are supported by patrons and can only get new equipment to review when people donate (either money or sometimes the hardware itself).
If you like what they do (as pretty much the last in-depth hardware reviewers), consider supporting them.
The other massive point: RISC-V distills a lot of what we now know about CPUs into a very elegant "sweet spot".
And it is not China-only; the best implementations are US-based, and RISC-V is a US/Berkeley initiative re-centered in Switzerland for "neutrality" reasons.
If good large RISC-V implementations do reach TSMC's leading silicon process (the ~5 GHz class), some markets won't even look at ARM or x86 anymore.
And there is the ultimate "standard ISA" point: assembly-written code then becomes very appropriate, hence strong de-coupling from those very few, backdoor-injecting compilers.
On many of my personal projects, I don't bother anymore: I write RISC-V assembly which I run with a small x86_64 interpreter, plus a very simple pre-processor and assembler, i.e. SDK toolchain complexity close to 0 compared to the other abominations.
And I think the main drawback is: big mistakes will be made, and you must account for them.
You're not. IMHO it's a fairly obvious, narrow and uncontroversial observation (and hence why it's the top comment). That said, I personally still enjoyed the back and forth, as, one could imagine, many others did. There can be value in the counterarguments from multiple other usernames, as this helps readers sharpen their reasoning toward the conclusion (even when the original premise stays intact).
The lack of others agreeing could be the result of many reasons. IMHO, a not insignificant one could be that the incentive structure skews heavily towards lurking, as HN rightfully disincentivizes "me too" type replies and not everyone always has something interesting to add.
2c not an epistemologist ymmv
As in, they don't sell you the parts, they only sell you the entire product. If you don't want the entire package, the processors alone are irrelevant.
All he is saying: we currently have products in a similar product category (ARM-based desktop computers) that are widely used and have known benchmark scores (and general reviews), and it would make sense, if I published a new CPU for the same product category ("Reaching Desktop Performance" implies that), to compare it to the known alternatives.
In the end you can just run Asahi on your macbook, the OS is not that relevant here. A comparison to macbooks running Asahi Linux would be fine.
Your “specifically ARM cores designed by and licensable from ARM Holdings” argument doesn’t hold any water.
Individual benchmarks tell the bigger picture. These two are optimized for different use cases, with Apple heavily leaning towards low latency single thread throughput with low sustained power usage.
https://browser.geekbench.com/v6/cpu/compare/16833358?baseli...
EDIT: The M4 Max compares much more closely https://browser.geekbench.com/v6/cpu/compare/16834801?baseli...
I learned to live with macOS, but I also like and use Gnome, which many Linux-only people hate. I tried most WMs on Linux, like Hyprland, Sway, i3, but none ever felt worth the config hassle when compared to the sane defaults of Gnome.
"Starting with computers using macOS 28, Rosetta functionality will be available only for certain older, unmaintained games that rely on Intel-based frameworks."
https://support.apple.com/en-us/102527
And
"Beyond this timeframe, we will keep a subset of Rosetta functionality aimed at supporting older unmaintained gaming titles, that rely on Intel-based frameworks."
https://developer.apple.com/documentation/apple-silicon/abou...
Desktop and laptop use cases demand high single threaded performance across a large variety of workloads. Creating CPU cores to meet those demands is no easy task. AMD and Intel traditionally dominated this high performance segment using high clocked, high throughput cores with large out-of-order engines to absorb latency. Arm traditionally optimized for low power and low area, and not necessarily maximum performance. Over the years though, Arm steadily built more complex cores and looked for opportunities to expand into higher performance segments. Matching the best from Intel and AMD must have been a distant dream in 2012, when Arm launched their first 64-bit core, the Cortex A57. Today, that dream is a reality.
Cortex X925 in Nvidia’s GB10 achieves performance parity with AMD’s Zen 5 and Intel’s Lion Cove in their fastest desktop implementations. That gives Arm a core fast enough to not just play in laptop segments, but potentially in the most performance sensitive desktop applications too. Nvidia’s GB10 uses ten X925 cores, split across two clusters. One of those X925 cores reaches 4 GHz, while the others are not far behind at 3.9 GHz. Dell uses the GB10 chip in their Pro Max series, and we’re grateful to Dell for letting us test that product.
Arm’s Cortex X925 is a massive 10-wide core with a lot of everything. It has more reordering capacity than AMD’s Zen 5, and L2 capacity comparable to that of Intel’s recent P-Cores. Unlike Arm’s 7-series cores, X925 makes few concessions to reduce power and area. It’s a core designed through and through to maximize performance.
[Figure: Rough block diagram of the Cortex X925’s microarchitecture]
In Arm tradition, X925 has a number of configuration options. However, X925 omits the shoestring budget options present for A725. X925’s caches are all either parity or ECC protected, dropping A725’s option to do without error detection or correction. L1 caches on X925 are fixed at 64 KB, removing the 32 KB options on A725. X925’s most significant configuration options happen at L2, where implementers can pick between 2 MB or 3 MB of capacity. They can also choose either a 128-bit or 256-bit ECC granule to make area and reliability tradeoffs.
X925 interfaces with the rest of the system via Arm’s DSU-120, which acts as a cluster-level interconnect and hosts a L3 cache with up to 32 MB of capacity. X925 and its DSU support 40-bit physical addresses, which is adequate for consumer systems. However, it’s clearly not designed for server applications, where larger 48-bit or even 52-bit physical address spaces are common.
Performance and power efficiency starts with good branch prediction. Arm knows this, and X925 doesn’t disappoint. Its branch predictor can recognize extremely long repeating patterns. In a test with branches that are taken or not-taken in random patterns of increasing lengths, X925 behaves a lot like AMD’s Zen 5. AMD’s cores have featured very strong branch predictors since Zen 2, so X925’s results are impressive.
Cortex X925’s branch target caching compares well too. Arm has a large first level BTB capable of handling two taken branches per cycle. Capacity for this first level BTB varies with branch spacing, but it seems capable of tracking up to 2048 branches. This large capacity brings X925’s branch target caching strategy closer to Zen 5’s, rather than prior Arm cores that used small micro-BTBs with 32 to 64 entries. For larger branch footprints, X925 has slower BTB levels that can track up to 16384 branches and deliver targets with 2-3 cycle latency. There may be a mid-level BTB with 4096 to 8192 entries, though it’s hard to tell.
Compared to AMD’s Zen 5, X925 has roughly comparable capacity in its fastest BTB level depending on branch spacing. Zen 5 has more maximum branch target caching capacity, especially when it can use a single BTB entry to track two branches. Still, X925 has more branch target storage than Arm cores from a few years ago. Cortex X2 for example topped out at about 10K branch targets.
A 29 entry return stack helps predict returns from function calls, or branch-with-link in Arm instruction terms. Like Intel’s Sunny Cove and later cores, the return stack doesn’t work if return sites aren’t spaced far enough apart. I spaced the test “function” by 128 bytes to get clear results.
In SPEC CPU2017, Cortex X925 achieves branch prediction accuracy roughly on par with AMD’s Zen 5 across most tests, and may even be slightly ahead. 505.mcf and 541.leela consistently challenge branch predictors, and X925 pulls ahead in both. Intel’s Lion Cove is a bit behind both Zen 5 and X925.
SPEC’s floating point workloads are gentler on the branch predictor, but X925 still shows its strength. It’s again on par or slightly better than Zen 5.
Cortex X925 ditches the MOP cache from prior Arm generations, much like its mid-core companion (A725). While Cortex X925 doesn’t have the same tight power and area restrictions as A725, many of the same justifications apply. Arm already tackles decode costs through a variety of measures like predecode and running at lower clock speeds. A MOP cache would be excessive.
On the predecode side, X925’s TRM suggests the L1I stores data at 76-bit granularity. Arm instructions are 32 bits, so 76 bits would store two instructions plus 12 bits of overhead. Unlike with A725, Arm doesn’t indicate that any subset of bits corresponds to an aarch64 opcode. They may have neglected to document it, or X925’s L1I may store instructions in an intermediate format that doesn’t preserve the original opcodes.
[Figure: Zen 5 can achieve 8 instructions per cycle with 8B NOPs. There seems to be a strange limitation with 4B NOPs, but I’m using the same NOP size for all cores for consistency]
X925’s frontend can sustain 10 instructions per cycle, but strangely has lower throughput when using 4 KB pages. Using 2 MB pages lets it achieve 10 instructions per cycle as long as the test fits within the 64 KB instruction cache. Cortex X925 can fuse NOP pairs into a single MOP, but that fusion doesn’t bring throughput above 10 instructions per cycle. Details aside, X925 has high per-cycle frontend throughput compared to its x86-64 peer, but slightly lower actual throughput when considering Zen 5 and Lion Cove’s much higher clock speed. With larger code footprints, Cortex X925 continues to perform well until test sizes exceed L2 capacity. Compared to X925, AMD’s Zen 5 relies on its op cache to deliver high throughput for a single thread.
MOPs from the frontend go through register renaming and have various bookkeeping resources allocated for them, letting the backend carry out out-of-order execution while ensuring results are consistent with in-order execution. While allocating resources, the core can carry out various optimizations to expose additional parallelism. X925 can do move elimination like prior Arm cores, and has special handling for moving an immediate value of zero into a register. Like on A725, the move elimination mechanism tends to fail if there are enough register-to-register MOVs close by. Neither optimization can be carried out at full renamer width, though that’s typical as cores get very wide.
Unlike A725, X925 does not have special handling for PTRUE, which sets a SVE predicate register to enable all lanes. A725 could eliminate PTRUE like a zeroing idiom, and process it without allocating a physical register. While this is a minor detail, it does show divergence between Arm’s mid-core and big-core lines.
A CPU’s out-of-order backend executes operations as their inputs become ready, letting the core keep its execution units fed while waiting for long latency instructions to complete. Different sources give conflicting information about Cortex X925’s reordering window. Android Authority claims 750 MOPs. Wikichip believes it’s 768 instructions, based off an Arm slide that states Cortex X925 doubled reordering capacity over Cortex X4. Testing shows X925 can keep 948 NOPs in flight, which doesn’t correspond with either figure unless NOP fusion only worked some of the time.
Because results with NOPs were inconclusive, I tried testing with combinations of various instructions designed to dodge other resource limits. Mixing instructions that write to the integer and floating point registers showed X925 could have a maximum of 448 renamed registers allocated across its register files. Recognized zeroing idioms like MOV r,0 do not allocate an integer register, but also run up against the 448 instruction limit. I tried mixing in predicate register writes, but those also share the 448 instruction limit. Adding in stores showed the core could have slightly more than 525 instructions in flight. Adding in not-taken branches did not increase reordering capacity further. Putting an exact number on X925’s reorder buffer capacity is therefore difficult, but it’s safe to say there’s a practical limitation of around 525 instructions in flight. That puts it in the same neighborhood as Intel’s Lion Cove (576) and ahead of AMD’s Zen 5 (448).
X925’s register files, memory ordering queues, and other resources have comparable capacity to those in Zen 5 and Lion Cove. The only weakness is 128-bit vector execution, with correspondingly wide register file entries. AMD and Intel’s big cores have wider vector registers, and more of them available for renaming.
Arm laid out Cortex X925’s integer side to deliver high throughput while controlling port count for both the integer register file and scheduling queues. Eight ALU ports and three branch units are distributed across four schedulers in a layout that maximizes symmetry for common ALU operations. All four schedulers have two ALU ports and 28 entries. Similarly, each scheduler has one multiply-capable ALU pipe. Branches and special integer operations see a split, with the first three schedulers getting a branch pipe and the fourth scheduler getting support for pointer authentication and SVE predicate operations.
The aarch64 instruction set has a madd instruction that performs integer multiply-adds. Cortex A725 and older Arm cores had dedicated integer multi-cycle pipes that could handle madd along with other complex integer instructions. Cortex X925 instead breaks madd into two micro-ops, and handles it with any of its four multiply-capable integer pipes. Likely, Arm wanted to increase throughput for that instruction without the cost of implementing three register file read ports for each multiply-capable pipe. Curiously, Arm’s optimization guide refers to the fourth scheduler’s pipes as “single/multi-cycle” pipes. “Multi-cycle” is now a misnomer though, because the core’s “single-cycle” integer pipes can handle multiplies, which have two cycle latency. On Cortex X925, “multi-cycle” pipes distinguish themselves by handling special operations and being able to access FP/vector related registers.
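For reference, a hedged sketch of what software sees (compilers commonly lower this to a single madd on aarch64; the two-micro-op split described above happens inside the core, invisibly to software):
#include <cstdint>

// Typically compiles to: madd x0, x0, x1, x2   (i.e. x0 = x2 + x0 * x1)
int64_t multiply_add(int64_t a, int64_t b, int64_t c) {
    return c + a * b;
}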
[Figure: The two multi-cycle pipes on the right can take three inputs each from predicate registers (two inputs, and a predicate as a mask), so treat the purple input arrow as three. Drawing more arrows is hard.]
Thanks to symmetry across the integer schedulers, X925’s renamer likely uses a simple round-robin allocation scheme for operations that can go to multiple schedulers. If I test scheduler capacity by interleaving dependent and independent integer adds, X925 can only keep half as many dependent adds in flight. Following dependent adds by independent ones only slightly reduces measured scheduling capacity. That suggests the renamer assigns a scheduler for each pending operation, and stalls if the targeted scheduling queue is full without scanning other eligible schedulers for free entries.
Cortex X925’s FPU has six pipes, all of which can handle vector floating point adds, multiplies, and multiply-adds. All six pipes also support vector integer adds and multiplies. Less common instructions like addv are still serviced by four pipes. X925’s FP schedulers are impressively large with approximately 53 entries each. For perspective, each of X925’s three FP schedulers has nearly as much capacity as Bulldozer’s 60 entry unified FP scheduler. Bulldozer used that scheduler to service two threads, while X925 uses its three FP schedulers to service a single thread.
High scheduler capacity and high pipe count should give X925 good performance in vectorized applications despite its 128-bit vector width.
Memory accesses are among the most complicated and performance critical operations on a modern CPU. For each memory access, the load/store unit has to translate program-visible virtual addresses into physical addresses. It also has to determine whether loads should get data from an older store, or from the cache hierarchy. Cortex X925 has four address generation units that calculate virtual addresses. Two of those can handle stores.
Address translations are cached in a standard two-level TLB setup. The L1 DTLB has 96 entries and is fully associative. A 2048 entry 8-way L2 TLB handles larger data footprints, and adds 6 cycles of latency. Zen 5 for comparison has the same L1 DTLB capacity and associativity, but a larger 4096 entry L2 DTLB that adds 7 cycles of latency. Another difference is that Zen 5 has a separate L2 ITLB for instruction-side translations, while Cortex X925 uses a unified L2 TLB for both instructions and data. AMD’s approach could further increase TLB reach, because data and instructions often reside on different pages.
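A quick reach calculation under a 4 KB base page assumption (my arithmetic; larger pages scale everything up proportionally):
#include <cstdio>

int main() {
    const long page = 4096;  // assumed base page size
    printf("X925 L1 DTLB:    96 entries -> %ld KB reach\n", 96 * page / 1024);
    printf("X925 L2 TLB:   2048 entries -> %ld MB reach\n", 2048 * page / (1024 * 1024));
    printf("Zen 5 L2 DTLB: 4096 entries -> %ld MB reach\n", 4096 * page / (1024 * 1024));
}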
[Figure: And adapting it to use vector registers with 128-bit stores and 64-bit loads]
Store forwarding on the integer side works for all loads contained within a prior store. It’s an improvement over prior Arm cores like the Cortex X2, which could only forward either half of a 64-bit store to a 32-bit load. Forwarding on the FP/vector side still works like older Arm cores, and only works for specific load alignments with respect to the store address. Unlike recent Intel and AMD cores, Cortex X925 can’t do zero latency forwarding when store and load addresses match exactly. To summarize store forwarding behavior:
[Figure: Store forwarding behavior summary]
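For instance (my construction, not the article’s test), the case X925 now handles on the integer side is any load fully contained within an older in-flight store; whether the compiler keeps these as one store and one load depends on optimization, but the containment relationship is the point:
#include <cstdint>
#include <cstring>

// A 16-bit load fully contained within an older 64-bit store. X925 forwards
// this for any contained alignment on the integer side; older cores like
// Cortex X2 could only forward either 32-bit half of a 64-bit store.
int main() {
    unsigned char buf[8];
    uint64_t v = 0x1122334455667788ull;
    std::memcpy(buf, &v, 8);      // the older store
    uint16_t x;
    std::memcpy(&x, buf + 3, 2);  // contained load, eligible for forwarding
    return x != 0;
}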
Memory dependencies can’t be definitively determined until address translation finishes. Some cores do an early check before address translation completes, using bits of the address that represent an offset into a page. Cortex X925 might be doing that because it takes a barely measurable penalty when loads and stores appear to overlap in the low 12 bits.
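A sketch of an access pattern that would trip such a check (my example, in the spirit of the paragraph above):
// b sits 4096 + 64 bytes past a, so the store to b[i] shares its low 12 bits
// (the within-page offset) with the load of a[i + 16] issued a few iterations
// later, well within the out-of-order window. A core comparing only those
// offset bits before translation completes may briefly treat the accesses as
// dependent, even though they never truly overlap.
void kernel(const float* a, float* b, int n) {  // call with b = a + 1024 + 16
    for (int i = 0; i < n; ++i)
        b[i] = a[i] + 1.0f;
}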
Cortex X925 has a 64 KB L1 data cache with 4 cycle latency, like its A725 companions in GB10, but takes advantage of its larger power and area budget to make that capacity go further. It uses a more sophisticated re-reference interval prediction (RRIP) replacement policy rather than the pseudo-LRU policy used on A725. Bandwidth is higher too. Arm’s technical reference manual says the L1D has “4x128-bit read paths and 4x128-bit write paths”. Sustaining more than two stores per cycle is impossible because the core only has two store-capable AGUs. Loads can use all four AGUs, and can achieve 64B/cycle from the L1 data cache. That’s competitive against many AVX2-capable x86-64 CPUs from a few generations ago. However, more recent Intel and AMD cores can use their wider vector width and faster clocks to achieve much higher L1D bandwidth, even if they also have four AGUs.
Arm offers 2 MB 8-way and 3 MB 12-way L2 cache options. Mediatek and Nvidia chose the 2 MB option, and testing shows it has 12 cycles of latency. This low cycle count latency lets Arm remain competitive against Intel and AMD’s L2 caches, despite running at lower clock speeds. L2 bandwidth comes in at 32 bytes per cycle for reads, and increases to approximately 45 bytes per cycle with a read-modify-write pattern.
Like AMD, Arm makes the L2 strictly inclusive of the L1 data cache, which lets the L2 act as a snoop filter. If an incoming snoop misses in the L2, the core can be sure it won’t hit in the L1D either.
Cortex X925 turns in an excellent performance in SPEC CPU2017’s integer suite. Its score in the integer suite is within margin of error compared to Intel and AMD’s highest performance cores, in their highest performance desktop configurations. AMD’s Zen 5 still slips ahead in SPEC’s floating point suite, but not by a large margin. Running Zen 5 with faster DDR5-6000 instead of DDR5-5600 memory can slightly increase its integer score, but only to 11.9 - not enough to pull away from X925. Creating a high performance core involves making the right tradeoff between frequency and performance per clock, while keeping everything else in balance. It’s safe to say Arm has found a combination of frequency and performance per clock that lets them compete with AMD and Intel’s best.
Diving deeper into individual workloads shows a complex picture. X925 trades blows with the higher clocked Intel and AMD cores in core-bound workloads. 548.exchange2 and 500.perlbench both show the advantages of clock speed scaling, with Intel and AMD’s higher clocking 8-wide cores easily outpacing Arm’s 4 GHz 10-wide one. But 525.x264 turns things around. Cortex X925 is able to finish that workload with fewer instructions than its x86-64 peers, while maintaining a large IPC advantage. X925 continues to do well in workloads that challenge the branch predictor, like 541.leela and 505.mcf. Finally, memory bound tests like 520.omnetpp are heavily influenced by factors outside the core.
Performance monitoring counter data shows how Cortex X925 uses high IPC to make up for a clock speed deficit. Whether it’s enough IPC to match Lion Cove and Zen 5 varies depending on the individual test, but overall Arm’s chosen IPC and clock speed targets are just as viable as Intel’s and AMD’s.
SPEC CPU2017’s floating point workloads throw a wrench into the works. Cortex X925 falls behind Zen 5 on enough tests to leave AMD’s latest and greatest core with a clear victory. In Arm’s favor, they are able to keep pace with Intel’s Lion Cove.
PMU data indicates X925 is able to maintain higher IPC relative to the competition. Unfortunately for X925, several tests require many more instructions to complete with the aarch64 instruction set compared to x86-64. Arm needs a large enough IPC advantage to overcome both a clock speed deficit and a less efficient representation of the work. Cortex X925 does achieve high IPC, but it’s not high enough.
Plotting instruction counts shows just how severe the situation can get for Cortex X925. 507.cactuBSSN, 521.wrf, 549.fotonik3d, and 554.roms all require more instructions on X925, and by no small margin. 554.roms is the worst offender, and makes X925 execute more than twice as many instructions compared to Zen 5. Average IPC in these four tests is nowhere near core width for any of these tested cores, but crunching through extra instructions isn’t the only issue. Higher instruction counts place more pressure on core out-of-order resources, impacting its ability to hide latency.
Arm now has a core with enough performance to take on not only laptop, but also desktop use cases. They’ve also shown it’s possible to deliver that performance at a modest 4 GHz clock speed. Arm achieved that by executing well on the fundamentals throughout the core pipeline. X925’s branch predictor is fast and state-of-the-art. Its out-of-order execution engine is truly gargantuan. Penalties are few, and tradeoffs appear well considered. There aren’t a lot of companies out there capable of building a core with this level of performance, so Arm has plenty to be proud of.
That said, getting a high performance core is only one piece of the puzzle. Gaming workloads are very important in the consumer space, and benefit more from a strong memory subsystem than high core throughput. A DSU variant with L3 capacity options greater than 32 MB could help in that area. X86-64’s strong software ecosystem is another challenge to tackle. And finally, Arm still relies on its partners to carry out its vision. I look forward to seeing Arm take on all of these challenges, while also iterating on their core line to keep pace as AMD and Intel improve their cores. Hopefully, extra competition will make better, more affordable CPUs for all of us.
The tested machine is an Nvidia GB10, which Nvidia makes and sells as a whole unit; various vendors stick it in different devices to try to differentiate (although in the end they're all basically identical).
And yes, it is extremely weird for it to never mention the Apple chip, which has a little something to do with who they thank for lending them the device. The arbitrary claims for why they ignored the enormous, class-leading ARM processor in the space are not convincing.
I mean, the other claim that this is an "industry blog" and not a "consumer blog" was equally silly. It's basically for curious hobbyists. Zero industry insiders follow this to see about the core in the GB10. It's basically Anandtech.
https://browser.geekbench.com/v6/cpu/compare/16839304?baseli...
The M3 Ultra sacrifices a bunch of single-thread performance for not that much of a multithreaded gain:
https://browser.geekbench.com/v6/cpu/compare/16839654?baseli...
I have to admit that when I read this, my eyebrows went up so far that my hat moved.
amelius, if anyone had specific requirements, it was you with your "systems for in-flight entertainment".
OP asked a very reasonable question for a very generic comparison to the 800-pound gorilla in the consumer CPU world in general, and ARM CPU world in particular.
If the article can reference AMD's Zen 5 cores and Intel's Lion/Sunny Cove, they could have made at least a brief reference to M-series CPUs. As a reader and potential buyer of any of them, I find it would have been a very useful comparison.
Otoh, L1 sizes haven't increased since my first processor; those CPU designers probably know more than I do.
This is not possible with Apple parts.
That's what my example was about. It was only specific because I wanted to have a concrete example.
Talk about specifics, eh? Didn't you just argue against an article addressing "_their_" specific usecase?
In a store people will ask "is this better than an Apple?".
And I'll tell you one more thing: when I was in the industry, picking computing parts to build products with, I did not form an opinion by reading internet reviews. I haven't met anyone who did.
(Of course, it only took a few years for this to be rectified with the byte-word extension, which became required by ~all "real software" that supported Alpha)
It's also one of the only architectures Windows NT supported that didn't have 4KB pages, along with Itanium. I've wondered how (or if?) it handled programs that expect 4KB pages, especially in the x86 translation subsystem.
Divide the cache into "meta-caches" indexed by the virtual bits and treat them as separate from the L2's point of view. Duplicate the data, and if somebody writes back, invalidate all the other copies. The hardware already exists for doing this on any multicore system. Sure, you will end up duplicating data sometimes, and it will actually be slower if you're writing to aliased locations. But is this happening often enough to be a problem compared to generally having a bigger cache?
It sounds to me like an engineering tradeoff that might or might not make sense, not a hard limit, which is at least what I think was being asserted. But as I also said, L1 sizes haven't increased in a while and smart people are working on it, so there is probably something I don't know.