15 years of FP64 segmentation, and why the Blackwell Ultra breaks the pattern

It's amazing to step back and look at how much of NVIDIA's success has come from unforeseen directions. For their original purpose of making graphics chips, the consumer vs pro divide was all about CAD support and optional OpenGL features that games didn't use. Programmable shaders were added for the sake of graphics rendering needs, but ended up spawning the whole GPGPU concept, which NVIDIA reacted to very well with the creation and promotion of CUDA. GPUs have FP64 capabilities in the first place because back when GPGPU first started happening, it was all about traditional HPC workloads like numerical solutions to PDEs.

Fast forward several years, and the cryptocurrency craze drove up GPU prices for many years without even touching the floating-point capabilities. Now, FP64 is out because of ML, a field that's almost unrecognizable compared to where it was during the first few years of CUDA's existence.

NVIDIA has been very lucky over the course of their history, but have also done a great job of reacting to new workloads and use cases. But those shifts have definitely created some awkward moments where their existing strategies and roadmaps have been upturned.

While implementing double-precision by double-single may be a solution in some cases, the article fails to mention the overflow/underflow problem, which is critical in scientific/technical computing (a.k.a. HPC).

With the method from the article, the exponent range remains the same as in single precision, instead of being increased to that of double precision.

There are a lot of applications for which such an exponent range would cause far too frequent overflows and underflows. This could be avoided by introducing a lot of carefully-chosen scaling factors in all formulae, but this tedious work would remove the main advantage of floating-point arithmetic, i.e. the reason why computations are not done in fixed-point.

The general solution of this problem is to emulate double-precision with 3 numbers, 2 FP32 for the significand and a third number for the exponent, either a FP number or an integer number, depending on which format is more convenient for a given GPU.

This is possible, but it lowers considerably the achievable ratio between emulated FP64 throughput and hardware FP32 throughput, but the ratio is still better than the vendor-enforced 1:64 ratio.

Nevertheless, for now any small business or individual user can achieve a much better performance per dollar for FP64 throughput by buying Intel Battlemage GPUs, which have a 1:8 FP64/FP32 throughput ratio. This is much better than you can achieve by emulating FP64 on NVIDIA or AMD GPUs.

Intel B580 is a small GPU, so it has only a FP64 throughput about equal to a Ryzen 9 9900X and smaller than a Ryzen 9 9950X. However it provides that throughput at a much lower price. Thus if you start with a PC with a 9900X/9950X, you can double or almost double the FP64 throughput for a low additional price with an Intel GPU. Multiple GPUs will proportionally multiply the throughput.

The sad part is that with the current Intel CEO and with NVIDIA being a shareholder of Intel, it is unclear whether Intel will continue to compete in the GPU market, or they will abandon it, leaving us at the mercy of NVIDIA and AMD, which both refuse to provide products with good FP64 support to small businesses and individual users.

No mention of the Radeon VII from 2019 where for some unfathomable reason AMD forgot about the segmentation scam and put real FP64 into a gaming GPU. From this 2023 list, it's still faster at FP64 than any other consumer GPU by a wide margin (enterprise GPU's aren't in the list). Scroll all the way to the end.

https://www.eatyourbytes.com/list-of-gpus-by-processing-powe...

I'm not sure why the article dismisses cost.

Let's say X=10% of the GPU area (~75mm^2) is dedicated to FP32 SIMD units. Assume FP64 units are ~2-4x bigger. That would be 150-300mm^2, a huge amount of area that would increase the price per GPU. You may not agree with these assumptions. Feel free to change them. It is an overhead that is replicated per core. Why would gamers want to pay for any features they don't use?

Not to say there isn't market segmentation going on, but FP64 cost is higher for massively parallel processors than it was in the days of high frequency single core CPUs.

FP64 performance is limited on consumer because the US government deems it important to nuclear weapons research.

Past a certain threshold of FP64 throughput, your chip goes in a separate category and is subject to more regulation about who you can sell to and know-your-customer. FP32 does not matter for this threshold.

https://en.wikipedia.org/wiki/Adjusted_Peak_Performance

It is not a market segmentation tactic and has been around since 2006. It's part of the mind-numbing annual export control training I get to take.

To me it is crazy that NVIDIA somehow got away with telling owners of consumer grade hardware.that they cannot be used in datacenters.

Table to compare Blackwell U300 to U200 (-97% FP64 performance): https://www.forum-3dcenter.org/vbulletin/showpost.php?p=1380...

I hope for their fall. I invest in their success

A question that has been bugging me for a while is what will NVIDIA do with its HPC business? By HPC I mean clusters intended for non-AI related workloads. Are they going to cater to them separetely, or are they going to tell them to just emulate FP64?

this article is so dumb. NVIDIA delivered what the market wanted - gamers dont need FP64, they dont waste silicon on it. now enterprise doesnt want FP64 anymore and they are reducing silicon for it too

weird way to frame delivering exactly what the consumer wants as a big market segmentation fuck the user conspiracy

https://www.eatyourbytes.com/list-of-gpus-by-processing-powe...

Most people don't appreciate how many dead end applications NVIDIA explored before finding deep learning. It took a very long time, and it wasn't luck.

They were also bailed out by Sega.

When they couldn't deliver the console GPU they promised for the Dreamcast (the NV2), Shoichiro Irimajiri, the Sega CEO at the time let them keep the cash in exchange for stock [0].

Without it Nvidia would have gone bankrupt months before Riva 128 changed things.

Sega console arm went bust not that it mattered. But they sold the stock for about $15mn (3x).

Had they held it, Jensen Huang ,estimated itd be worth a trillion[1]. Obviously Sega and especially it's console arm wasn't really into VC but...

My wet dream has always been what if Sega and Nvidia stuck together and we had a Sega tegra shield instead of a Nintendo switch? Or even what if Sega licensed itself to the Steam Deck? You can tell I'm a sega fan boy but I can't help that the Mega Drive was the first console I owned and loved!

[0] https://www.gamespot.com/articles/a-5-million-gift-from-sega...

[1] https://youtu.be/3hptKYix4X8?t=5483&si=h0sBmIiaduuJiem_

The whole GPU history is off and being driven by finance bros as well. Everyone believes Nvidia kicked off the GPU AI craze when Ilya Sutskever cleaned up on AlexNet with an Nvidia GPU back in 2012, or when Andrew Ng and team at Stanford published their "Large Scale Deep Unsupervised Learning using Graphics Processors" in 2009, but in 2004, a couple of Korean researches were the first to implement neural networks on a GPU, using ATI Radeons (now AMD): https://www.sciencedirect.com/science/article/abs/pii/S00313...

I remember ATI and Nvidia were neck-and-neck to launch the first GPUs around 2000. Just so much happening so fast.

I'm not sure why the article dismisses cost.

Not to say there isn't market segmentation going on, but FP64 cost is higher for massively parallel processors than it was in the days of high frequency single core CPUs.

> Assume FP64 units are ~2-4x bigger.

I'm not a hardware guy, but an explanation I've seen from someone who is says that it's not much extra hardware to add to a 2×f32 FMA unit the capability to do 1×f64. You already have all of the per-bit logic, you mostly just need to add an extra control line to make a few carries propagate. So the size overhead of adding FP64 to the SIMD units is more like 10-50%, not 100-300%.

> Assume FP64 units are ~2-4x bigger.

I'm pretty sure that's not a remotely fair assumption to make. We've seen architectures that can eg. do two FP32 operations or one FP64 operation with the same unit, with relatively low overhead compared to a pure FP32 architecture. That's pretty much how all integer math units work, and it's not hard to pull off for floating point. FP64 units don't have to be—and seldom have been—implemented as massive single-purpose blocks of otherwise-dark silicon.

When the real hardware design choice is between having a reasonable 2:1 or 4:1 FP32:FP64 ratio vs having no FP64 whatsoever and designing a completely different core layout for consumer vs pro, the small overhead of having some FP64 capability has clearly been deemed worthwhile by the GPU makers for many generations. It's only now that NVIDIA is so massive that we're seeing them do five different physical implementations of "Blackwell" architecture variants.

Why would gamers want to pay for any features they don't use?

Obviously they don't want to. Now flip it around and ask why HPC people would want to force gamers to pay for something that benefits the HPC people... Suddenly the blog post makes perfect sense.

FP64 performance is limited on consumer because the US government deems it important to nuclear weapons research.

https://en.wikipedia.org/wiki/Adjusted_Peak_Performance

It is not a market segmentation tactic and has been around since 2006. It's part of the mind-numbing annual export control training I get to take.

This is so interesting, especially given that it is in theory possible to emulate FP64 using FP32 operations.

I do think though that Nvidia generally didn't see much need for more FP64 in consumer GPUs since they wrote in the Ampere (RTX3090) white paper: "The small number of FP64 hardware units are included to ensure any programs with FP64 code operate correctly, including FP64 Tensor Core code."

I'll try adding an additional graph where I plot the APP values for all consumer GPUs up to 2023 (when the export control regime changed) to see if the argument of Adjusted Peak Performance for FP64 has merit.

Do you happen to know though if GPUs count as vector processors or not under these regulations since the weighing factor changes depending on the definition?

https://www.federalregister.gov/documents/2018/10/24/2018-22... What I found so far is that under Note 7 it says: "A ‘vector processor’ is defined as a processor with built-in instructions that perform multiple calculations on floating-point vectors (one-dimensional arrays of 64-bit or larger numbers) simultaneously, having at least 2 vector functional units and at least 8 vector registers of at least 64 elements each."

Nvidia GPUs have only 32 threads per warp, so I suppose they don't count as a vector processor (which seems a bit weird but who knows)?

Can't wait until they update this to also include export controls around FP8 and FP4 etc in order to combat deepfakes, and then all of a sudden not be able to buy increasingly powerful consumer GPUs.

They were also bailed out by Sega.

When they couldn't deliver the console GPU they promised for the Dreamcast (the NV2), Shoichiro Irimajiri, the Sega CEO at the time let them keep the cash in exchange for stock [0].

Without it Nvidia would have gone bankrupt months before Riva 128 changed things.

Sega console arm went bust not that it mattered. But they sold the stock for about $15mn (3x).

Had they held it, Jensen Huang ,estimated itd be worth a trillion[1]. Obviously Sega and especially it's console arm wasn't really into VC but...

[0] https://www.gamespot.com/articles/a-5-million-gift-from-sega...

[1] https://youtu.be/3hptKYix4X8?t=5483&si=h0sBmIiaduuJiem_

> Assume FP64 units are ~2-4x bigger.

I remember ATI and Nvidia were neck-and-neck to launch the first GPUs around 2000. Just so much happening so fast.

> Assume FP64 units are ~2-4x bigger.

This is so interesting, especially given that it is in theory possible to emulate FP64 using FP32 operations.

Do you happen to know though if GPUs count as vector processors or not under these regulations since the weighing factor changes depending on the definition?

Nvidia GPUs have only 32 threads per warp, so I suppose they don't count as a vector processor (which seems a bit weird but who knows)?

Can't wait until they update this to also include export controls around FP8 and FP4 etc in order to combat deepfakes, and then all of a sudden not be able to buy increasingly powerful consumer GPUs.

Most people don't appreciate how many dead end applications NVIDIA explored before finding deep learning. It took a very long time, and it wasn't luck.

It was luck that a viable non-graphics application like deep learning existed which was well-suited to the architecture NVIDIA already had on hand. I certainly don't mean to diminish the work NVIDIA did to build their CUDA ecosystem, but without the benefit of hindsight I think it would have been very plausible that GPU architectures would not have been amenable to any use cases that would end up dwarfing graphics itself. There are plenty of architectures in the history of computing which never found a killer application, let alone three or four.

It was luck, but that doesn't mean they didn't work very hard too.

Luck is when preparation meets opportunity.

Why would gamers want to pay for any features they don't use?

Obviously they don't want to. Now flip it around and ask why HPC people would want to force gamers to pay for something that benefits the HPC people... Suddenly the blog post makes perfect sense.

Similar to when Nvidia released LHR GPUs that nerfed performance for Ethereum mining.

NVIDIA GeForce RTX 3060 LHR which tried to hinder mining at the bios level.

The point wasn't to make the average person lose out by preventing them mining on their gaming GPU. But to make miners less inclined to buy gaming GPUs. They also released a series of crypto mining GPUs around the same time.

So fairly typical market segregation.

https://videocardz.com/newz/nvidia-geforce-rtx-3060-anti-min...

Similar to when Nvidia released LHR GPUs that nerfed performance for Ethereum mining.

NVIDIA GeForce RTX 3060 LHR which tried to hinder mining at the bios level.

So fairly typical market segregation.

https://videocardz.com/newz/nvidia-geforce-rtx-3060-anti-min...

It was luck, but that doesn't mean they didn't work very hard too.

Luck is when preparation meets opportunity.

Table to compare Blackwell U300 to U200 (-97% FP64 performance): https://www.forum-3dcenter.org/vbulletin/showpost.php?p=1380...

I hope for their fall. I invest in their success

With the method from the article, the exponent range remains the same as in single precision, instead of being increased to that of double precision.

This is possible, but it lowers considerably the achievable ratio between emulated FP64 throughput and hardware FP32 throughput, but the ratio is still better than the vendor-enforced 1:64 ratio.

Yeah fair enough. The exponent of an FP32 has only 8 bits instead of 11 bits. I'll make an edit to make this explicit.

It's also fairly interesting how Nvidia handles this for the Ozaki scheme: https://docs.nvidia.com/cuda/cublas/#floating-point-emulatio.... They generally need to align all numbers in a matrix row to the maximum exponent (of a number in the row) but depending on scale difference of two numbers this might not be feasible without extending the number of mantissa bits significantly. So they dynamically (Dynamic Mantissa Control) decide if they use Ozaki's scheme or execute on native FP64 hardware. Or they let the user decide on the number of mantissa bits (Fixed Mantissa Control) which is faster but has no longer the guarantees for FP64 precision.

Yeah, double-word floating-point loses many of the desirable properties of the usual floating-point.

To me it is crazy that NVIDIA somehow got away with telling owners of consumer grade hardware.that they cannot be used in datacenters.

My understanding is this was not enforceable in Europe, and maybe elsewhere

Hopper had 60 TF FP64, Blackwell has 45 TF, and Rubin has 33 TF.

It is pretty clear that Nvidia is sunsetting FP64 support, and they are selling a story that no serious computational scientist I know believes, namely that you can use low precision operations to emulate higher precision.

See for example, https://www.theregister.com/2026/01/18/nvidia_fp64_emulation...

It seems the emulation approach is slower, has more errors, and doesn't apply to FP64 vector, only matrix operations.

For a long time AMD has been offering much better FP64 performance than NVIDIA, in their CDNA GPUs (which continue the older AMD GCN ISA, instead of resembling the RDNA used in gaming GPUs).

Nevertheless, the AMD GPUs continue to have their old problems, weak software support, so-and-so documentation, software incompatibility with the cheap GPUs that could be used directly by a programmer for developing applications.

There is a promise that AMD will eventually unify the ISA of their "datacenter" and gaming GPUs, like NVIDIA has always done, but it in unclear when this will happen.

Thus they are a solution only for big companies or government agencies.

AMD MI430X is taking that market.

Yeah fair enough. The exponent of an FP32 has only 8 bits instead of 11 bits. I'll make an edit to make this explicit.

For a long time AMD has been offering much better FP64 performance than NVIDIA, in their CDNA GPUs (which continue the older AMD GCN ISA, instead of resembling the RDNA used in gaming GPUs).

There is a promise that AMD will eventually unify the ISA of their "datacenter" and gaming GPUs, like NVIDIA has always done, but it in unclear when this will happen.

Thus they are a solution only for big companies or government agencies.

Yeah, double-word floating-point loses many of the desirable properties of the usual floating-point.

My understanding is this was not enforceable in Europe, and maybe elsewhere

> My understanding is this was not enforceable in Europe, and maybe elsewhere

"Not enforceable" just means they can't sue you. It doesn't mean they can't say "We won't sell to you anymore".

Code 43 was a worldwide thing before we found a workaround.

They did it by limiting the supply of cards. Even if you are ready to pay 4x of MSRP, you can't buy 100 of the card at once. Many consumers bought 1 GPU at 2-4x of MSRP.

Hopper had 60 TF FP64, Blackwell has 45 TF, and Rubin has 33 TF.

See for example, https://www.theregister.com/2026/01/18/nvidia_fp64_emulation...

It seems the emulation approach is slower, has more errors, and doesn't apply to FP64 vector, only matrix operations.

This is kind of amazing - I still have a bunch of Titan V's (2017-2018) that do 7 TF FP64. 8 years old and managing 1/4 of what Rubin does, and the numbers are probably closer if you divide by the power draw.

(Needless to say, the FP32 / int8 / etc. numbers are rather different.)

weird way to frame delivering exactly what the consumer wants as a big market segmentation fuck the user conspiracy

Your framing is what's backwards. NVIDIA artificially nerfed FP64 for a long time before they started making multiple specialized variants of their architectures. It's not a conspiracy theory; it's historical fact that they shipped the same die with drastically different levels of FP64 capability. In a very real way, consumers were paying for transistors they couldn't use, subsidizing the pro parts.

AMD MI430X is taking that market.

> My understanding is this was not enforceable in Europe, and maybe elsewhere

"Not enforceable" just means they can't sue you. It doesn't mean they can't say "We won't sell to you anymore".

They did it by limiting the supply of cards. Even if you are ready to pay 4x of MSRP, you can't buy 100 of the card at once. Many consumers bought 1 GPU at 2-4x of MSRP.

(Needless to say, the FP32 / int8 / etc. numbers are rather different.)

Code 43 was a worldwide thing before we found a workaround.

Yep, I do GPU passthrough to virtual machines because I would not let Windows touch my bare metal. You have to patch your ROM headers and hide the fact that you're in a VM from the OS.

So even as an end-user, a single person, I cannot naturally use my card how I please without significant technical investment. Imagine buying a $1000 piece of equipment and then being told what you can and can't do with it.

> subsidizing the pro parts.

You got this wrong way around. It's the high margin (pro) products subsidizing low margin (consumer) products.

> consumers were paying for transistors they couldn’t use

This is Econ 101 these days. It’s cheaper to design and manufacture 1 product than 2. Many many products have features that are enabled for higher paying customers, from software to kitchen appliances to cars, and much much more.

The combined product design is also subsidizing some of the costs for everyone, so be careful what you wish for. If you could use all the transistors you have, you’d be paying more either way, either because design and production costs go up, or because you’re paying for the higher end model and being the one subsidizing the existence of the high end transistors other people don’t use.

Yep, I do GPU passthrough to virtual machines because I would not let Windows touch my bare metal. You have to patch your ROM headers and hide the fact that you're in a VM from the OS.

> consumers were paying for transistors they couldn’t use

> subsidizing the pro parts.

You got this wrong way around. It's the high margin (pro) products subsidizing low margin (consumer) products.

In general, yes, but when consumer parts are spending silicon area on features they can't use, it is happening in the other direction too.

This isn't really true, and it wouldn't be a big deal even if it was.

Die areas for consumer card chips are smaller than die areas for datacenter card chips, and this has held for a few generations now. They can't possibly be the same chips, because they are physically different sizes. The lowest-end consumer dies are less than 1/4 the area of datacenter dies, and even the highest-end consumer dies are only like 80% the area of datacenter dies. This implies there must be some nontrivial differentiation going on at the silicon level.

Secondly, you are not paying for the die area anyway. Whether a chip is obtained from being specially made for that exact model of GPU, or it is obtained from being binned after possibly defective areas get fused off, you are paying for the end-result product. If that product meets the expected performance, it is doing its job. This is not a subsidy (at least, not in that direction), the die is just one small part of what makes a usable GPU card, and excess die area left dark isn't even pure waste, as it helps with heat dissipation.

The fact that nVidia excludes decent FP64 from all of its prosumer offerings (*) can still be called "artificial" insofar as it was indeed done on purpose for market segmentation purposes, but it's not some trivial trick. They really are just not putting it into the silicon. This has been the case for longer than it wasn't by now, even.

* = The Quadro line of "professional" workstation cards nowadays are just consumer cards with ECC RAM and special drivers

This isn't really true, and it wouldn't be a big deal even if it was.

* = The Quadro line of "professional" workstation cards nowadays are just consumer cards with ECC RAM and special drivers

Buy an RTX 5090, the fastest consumer GPU money can buy, and you get 104.8 TFLOPS of FP32 compute. Ask it to do double-precision math and you get 1.64 TFLOPS. That 64:1 gap is not a technology limitation. For fifteen years, the FP64:FP32 ratio has been slowly getting wider on consumer GPUs, widening the divide between consumer and enterprise silicon. Now the AI boom is quietly dismantling that logic.

The Evolution of FP64 on Nvidia GPUs

The FP64:FP32 ratio on Nvidia consumer GPUs has degraded consistently since the Fermi architecture debuted in 2010. On Fermi, the GF100 die shipped to both GeForce and Tesla lines; the hardware supported 1:2 FP64:FP32, but GeForce cards were driver-capped to 1:8.1

Over time, Nvidia moved away from “artificially” lowering FP64 performance on consumer GPUs. Instead, the architectural split became structural; the hardware itself is fundamentally different across product tiers. While datacenter GPUs have consistently kept a 1:2 or 1:3 FP64:FP32 performance (until the recent AI boom, more on that later), the performance ratio on consumer GPUs has consistently gotten worse. From 1:8 on the Fermi architecture in 2010 to 1:24 on Kepler in 2012 to 1:32 in 2014 to our final 1:64 ratio on Ampere in 2020.

This effectively also means that over 15 years, from the GTX 480 in 2010 to the RTX 5090 in 2025 the FP64 performance on consumer GPUs only increased 9.65x from 0.17 TFLOPS to 1.64 TFLOPS, while in the same time range the FP32 performance improved a whopping 77.63x from 1.35 TFLOP to 104.8 TFLOP.

FP32 versus FP64 throughput scaling on Nvidia consumer GPUs over time

FP32 vs FP64 throughput scaling across Nvidia GPU generations.2

Nvidia's Move to Segment the Market

So why has FP64 performance on consumer GPUs progressively gotten weaker (in relation to FP32) while it stayed consistently strong on enterprise hardware?

If this were purely a technical or cost constraint, you would expect the gap to be smaller. But since historically, Nvidia has taken deliberate steps to limit double-precision (FP64) throughput on GeForce cards, it makes it hard to argue this is accidental. The much simpler explanation is market segmentation.

Most consumer workloads, such as gaming, 3d rendering, or video editing do not need FP64. High-performance computing on the other hand has long relied on double precision (FP64). Fields such as computational fluid dynamics, climate modeling, quantitative finance, and computational chemistry depend on numerical stability and precision that single precision (FP32) cannot always provide. So FP64 becomes a very convenient lever: weaken it on consumer GPUs, preserve it on enterprise versions, and you get a clean dividing line between markets. Nvidia has been fairly open about this. In the consumer Ampere GA102 whitepaper, they note "The small number of FP64 hardware units are included to ensure any programs with FP64 code operate correctly.".3

And the segmentation worked. Over time, the price gap between consumer GPUs and datacenter GPUs widened from roughly 5x around 2010 to over 20x by 2022. Enterprise cards commanded massive premiums, justified in part by their strong FP64 performance (among other features like ECC memory, NVLink, support contracts, and so on). From a business standpoint, the elegance is obvious: closely related silicon sold into two markets at vastly different margins, with FP64 throughput serving as a clear dividing line.

Modern AI training largely does not depend on FP64 though. FP32 works fine, and on the contrary lower precisions (FP16, BF16, FP8, even FP4) are often preferred. Suddenly, consumer GPUs looked surprisingly capable for serious compute workloads. Researchers, startups, and hobbyists could train meaningful models without the purchase of an expensive Tesla or A100. In response, Nvidia updated its GeForce End User License Agreement (EULA) in 2017 to prohibit use of consumer GPUs in datacenters, in a divisive move. In what was (to my knowledge) an unprecedented shift, implicit technical segmentation was replaced by explicit contractual restrictions.5

Enterprise vs consumer GPU price ratio over time

Enterprise vs consumer GPU price ratio (2010-2022). Official MSPR numbers for consumer GPU, best effort for enterprise GPUs.2

How FP64 Emulation and AI Is Changing the Game

What if you have an old RTX 4090 lying around at home and, for some reason, you need the precision of FP64 but the built-in FP64 capabilities are not sufficient? Aside from the obvious answer of purchasing enterprise GPU power, FP64 emulation using FP32 floats can be an answer. This concept dates back to 1971, when T. J. Dekker described double-float arithmetic.6

The simple idea is to split a 64-bit floating point number into two 32-bit floating point numbers: A = a_hi + a_lo. The a_hi term carries the most significant bits, while a_lo captures the rounding error. Andrew Thall proposed a bunch of common algorithms for emulated FP64s (summation, multiplication, etc.) back in 2007 when GPUs did not have FP64 capabilities.7 You lose 5 bits of precision as your effective mantissa is only 48 bits (twice the FP32 effective mantissa) and not the FP64 53 bits of precision. If a modest reduction in numerical precision is acceptable, you may be able to achieve substantially higher throughput by using emulated double-precision computation. This can be advantageous given the steep FP64-to-FP32 performance disparity, even after accounting for the overhead introduced by emulation.

Diagram of emulated double using high and low parts

Emulated double representation using high and low FP32 parts.

A newer scheme that preserves full 64-bit precision but only works for matrix multiplication is the Ozaki scheme.8 This scheme exploits the speedup of tensor cores (specialized hardware for matrix multiply-accumulate (MMA) operations) and the distributive property of matrix multiplication.9 The Ozaki scheme splits FP64 numbers into, for example, FP8 numbers:

A = A1 + A2 + A3 + ... + Ak

where A1 contains the most significant bits and A2 contains the next slice of bits and so on. We then calculate:

Ai Bi

for each Ai and Bi. All the results are summed back up in 64-bit precision:

AB = Σ Ai Bi

The Ozaki scheme is gaining increasing traction thanks to the abundance of extremely fast FP8 and FP4 tensor cores being deployed for AI workloads. NVIDIA added support for the Ozaki scheme in cuBLAS in October 2025 and plans to continue developing it.10

From a GPU manufacturer's perspective, this direction is logical. The majority of enterprise GPU revenue now comes from AI applications; market segmentation based on FP64 performance makes no more sense. Enhancing FP64 emulation through low-precision tensor cores allows a reduction in the relative allocation of dedicated FP64 units in enterprise GPUs while expanding FP8 and FP4 compute resources that directly benefit AI workloads.

The latest generation of NVIDIA enterprise GPUs, the B300 based on the Blackwell Ultra architecture, represents a decisive shift toward low precision. FP64 performance has been significantly reduced in favor of more NVFP4 tensor cores, with the FP64:FP32 ratio dropping from 1:2 to 1:64.11 In absolute terms, peak FP64 performance declines from 37 TFLOPS on the B200 to 1.2 TFLOPS on the B300. Paradoxically, instead of consumer hardware catching up to enterprise-class capabilities, enterprise hardware is now embracing constraints traditionally associated with consumer GPUs.

Does this signal a gradual replacement of physical FP64 units through emulation? Not necessarily. According to NVIDIA, the company is not abandoning 64-bit computing and plans future improvements to FP64 capabilities.11 Nonetheless, FP64 emulation is here to stay, exploiting the abundance of low-precision tensor cores to supplement hardware FP64 for HPC workloads.

But the segmentation logic hasn't disappeared; it may simply be migrating. The RTX 5090 delivers a 1:1 FP16:FP32 ratio, while the B200 sits at 16:1. For fifteen years, FP64 was the dividing line between consumer and enterprise silicon. The next divide may already be taking shape in low-precision floating point.

AnandTech: GTX 480/470 FP64 ratio discussion (archived). https://web.archive.org/web/20100402215300/http://www.anandtech.com/show/2977/nvidia-s-geforce-gtx-480-and-gtx-470-6-months-late-was-it-worth-the-wait-/6 ↩
Google Sheets: numbers used for the computations. https://docs.google.com/spreadsheets/d/1NHHlgVytLx43DGzP8HlPeOCs7HKPElgpSO__9j6oMFo/edit?usp=sharing ↩ ↩
NVIDIA Ampere GA102 GPU Architecture Whitepaper (PDF). https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf ↩
Alibaba Product Insights: A100 vs RTX 3090—Is the A100 Really Worth the Hype and Extra for Deep Learning? https://www.alibaba.com/product-insights/a100-vs-rtx-3090-is-the-a100-really-worth-the-hype-and-extra-for-deep-learning.html ↩
Wccftech on 2017 GeForce EULA datacenter restriction. https://wccftech.com/nvidia-geforce-eula-prohibits-datacenter-blockchain-allowed/ ↩
T. J. Dekker (1971), double-float arithmetic. https://csclub.uwaterloo.ca/~pbarfuss/dekker1971.pdf ↩
Andrew Thall (2007), Extended-Precision Floating-Point Numbers for GPU Computation. https://andrewthall.org/papers/df64_qf128.pdf ↩
Ozaki et al. (2011), Error-Free Transformations for Matrix Multiplication. https://link.springer.com/article/10.1007/s11075-011-9478-1 ↩
NVIDIA blog: Tensor Cores for Science (ISC 2025). https://developer.nvidia.com/blog/nvidia-top500-supercomputers-isc-2025/#tensor_cores_for_science%C2%A0 ↩
NVIDIA blog: Unlocking Tensor Core Performance with Floating-Point Emulation in cuBLAS. https://developer.nvidia.com/blog/unlocking-tensor-core-performance-with-floating-point-emulation-in-cublas ↩
HPCwire: NVIDIA says it's not abandoning 64-bit computing (Dec 9, 2025). https://www.hpcwire.com/2025/12/09/nvidia-says-its-not-abandoning-64-bit-computing/ ↩

Hacker Times

Hacker Times

15 years of FP64 segmentation, and why the Blackwell Ultra breaks the pattern

Discussion

Discussion

The Evolution of FP64 on Nvidia GPUs

Nvidia's Move to Segment the Market

How FP64 Emulation and AI Is Changing the Game