I also enjoyed the 2002 article by Jonathan Blow [1] that's linked at the bottom. The visualization from the first article helped a lot once this started to go more in-depth.
[1] https://web.archive.org/web/20240706043551/https://number-no...
The HTML/CSS is bad: it lets the content overflow the right edge of the page completely instead of wrapping.
I re-read this post three times in total confusion before I figured out the most important piece was off-screen entirely.
RGB values represent luminances against some adapted state, and a "zero" in a daylit scene is not "zero luminance" - it's just about 0.001x as bright as the brightest point - it's millions of photons, way more than zero. In a sense our eyes experience contrast on a sliding scale, and there is no absolute zero in the system. For example, broadcast systems historically used 16-235 as their luminance range for SDR. I think any argument that says "we must have zero" is going to have a bias, but I don't think zero is needed for most things.
You can see this confusion again in the histogram example. There are only 255 bins, not 256. If you fix that mistake and remove the 0.5 offset, then the histogram is distributed correctly at both ends.
- i = min(floor(f * 256), 255) (from float to uint8)
- f = i / 255 (from uint8 to float)
Basically a mix of the 2 approaches mentioned in the article.
For all integers between [0,255], if I do uint8 -> float -> uint8 conversion, I will get the same result.
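The round-trip claim is easy to check by running all 256 codes through both conversions. This is only a minimal sketch; the helper names `to_float` and `to_uint8` are made up for illustration.

```python
import math

def to_float(i: int) -> float:
    """uint8 -> float: divide by 255 so that 0 -> 0.0 and 255 -> 1.0."""
    return i / 255

def to_uint8(f: float) -> int:
    """float -> uint8: scale by 256, floor, and clamp the top end to 255."""
    return min(math.floor(f * 256), 255)

# Every 8-bit code should survive uint8 -> float -> uint8 unchanged.
round_trip_ok = all(to_uint8(to_float(i)) == i for i in range(256))
print(round_trip_ok)
```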
--
edit: I wondered what's the maximum jitter amount that I can introduce to the float and get the same uint8 value. And also these 0->0.0 and 255->1.0 should map properly.
With my approach at the top, the jitter margin that I can introduce is 1/65280.
But with the article's approach
- i = floor(f * 255 + 0.5)
- f = i / 255
maximum jitter margin is 1/510 (which is better).
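Both margins can be reproduced with exact rational arithmetic. The sketch below assumes the output is clamped to [0, 255], so the outer edges of bins 0 and 255 are ignored; `jitter_margin` is a hypothetical helper, not something from the article.

```python
from fractions import Fraction

def jitter_margin(decode, lower_edge, upper_edge):
    """Smallest distance from any decoded value to an edge of its own bin.

    The outer edges of bins 0 and 255 are skipped because clamping makes
    those bins open-ended.
    """
    gaps = []
    for i in range(256):
        value = decode(i)
        if i > 0:
            gaps.append(value - lower_edge(i))
        if i < 255:
            gaps.append(upper_edge(i) - value)
    return min(gaps)

# Mixed approach: decode as i/255, bin i covers [i/256, (i+1)/256).
mixed = jitter_margin(
    lambda i: Fraction(i, 255),
    lambda i: Fraction(i, 256),
    lambda i: Fraction(i + 1, 256),
)

# Article's approach: decode as i/255, bin i covers [(i-0.5)/255, (i+0.5)/255).
standard = jitter_margin(
    lambda i: Fraction(i, 255),
    lambda i: Fraction(2 * i - 1, 510),
    lambda i: Fraction(2 * i + 1, 510),
)

print(mixed, standard)
```

The tight spots in the mixed approach are codes 1 and 254, whose decoded values sit right next to a bin edge.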
Why not??? Fight me
> Finally, one should never mix the encode and decode steps of the two quantizers. That’s just broken code. It’s an easy mistake to make, though.
floor( nextafter( 256, 255 ) * value )
excuse to argue about the best way aside, if this is the goal you should not be rolling your own image file reading. you should use openimageio. idk what approach it takes in its internal conversion to float, but that library is more likely to have the right answer than you trying to roll it yourself given its the library used internally by tons of professional image manipulation software...
Also, a lot of workflows for image processing and compositing do assume that 0 means zero, whether correctly or not (often incorrectly). So there are often assumptions that for 8-bit, 0u maps to 0.0f and 255 maps to 1.0f for things like masking or alpha: as soon as you have 0 values which become just over 0.0, you then have artifacts because some code somewhere is using a hard threshold of 0.0 to mask some other operation, and vice-versa for 1.0 with alpha, where suddenly because the 255 values are no longer 1.0f, you have very slightly see-through objects (often only visible in certain situations or when pixel-peeping) after pre-multiplication.
(Same thing can happen when 254 becomes 1.0f after +0.5 with masking).
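To make the masking problem concrete, here is a small sketch of how hard-threshold tests behave under the two decodes. The alpha values are made-up sample data, and the `== 0.0` / `== 1.0` comparisons stand in for whatever threshold a real pipeline uses.

```python
import numpy as np

alpha_u8 = np.array([0, 128, 255], dtype=np.uint8)

# Divide-by-255 decode: the extremes land exactly on 0.0 and 1.0.
alpha_255 = alpha_u8.astype(np.float32) / 255.0
# Biased divide-by-256 decode: the extremes land half a step inside the range.
alpha_256 = (alpha_u8.astype(np.float32) + 0.5) / 256.0

# Hard-threshold tests of the kind found in masking code.
fully_transparent_255 = alpha_255 == 0.0
fully_opaque_255 = alpha_255 == 1.0
fully_transparent_256 = alpha_256 == 0.0
fully_opaque_256 = alpha_256 == 1.0

print(fully_opaque_255, fully_opaque_256)
```

With the biased decode neither test ever fires, which is exactly the situation where "opaque" pixels end up very slightly see-through.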
For 8-bit, 16 maps to 7.5IRE which is the well understood legal black. Mapping 235 means they mapped peak to 110IRE. This is based on a 0-120IRE scale. This gets weird as the broadcast limit for video was 100IRE allowing for the chroma to reach 110IRE. So if you're trying to limit your white values to 235, that'll be higher than is broadcast safe. Of course, nobody cares about NTSC broadcast limits any more. However, to this day, I still see out of spec tapes marked as "broadcast master" that have been ingested for streaming use. It drives me crazy to this day, and it's only getting worse as people don't even have scopes to adjust the VTR's TBC properly.
There's a whole visual center to check the amount of incoming light and adjust your pupils for you. It's intentionally reactive.
> and there is no absolute zero in the system.
There maybe is. I think we call that "blind."
> broadcast systems historically used 16-235 as their luminance range for SDR
Mostly because it was a fully analog system and these all translate down to signal voltage. Jokingly NTSC used to be referred to as "Never Twice the Same Color" due to being a compromise bolted onto the side of an already compromised system.
Generally no -- in an 8-bit NTSC-M Rec. 601 system, 16 maps to E'Y = 0 at 7.5 IRE, and 235 maps to E'Y = 1 at 100 IRE. See https://www.poynton.ca/pdf/Poynton-1996-TechIntrDigiVide.pdf
The "16" digital black level is independent of the "7.5 IRE" analog setup. E.g. in Japan with an 8-bit "NTSC-J" Rec. 601 system, my understanding is that 16 still maps to E'Y = 0 which is now at 0 IRE, and 235 is still E'Y = 1 at 100 IRE.
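As a rough sketch of that digital mapping, assuming the usual 219-step video range for 8-bit luma (16 is E'Y = 0, 235 is E'Y = 1): the function names are made up, and the reserved sync codes are ignored.

```python
def luma_code_to_normalized(code: int) -> float:
    """Map an 8-bit video-range luma code to E'Y, with 16 -> 0.0 and 235 -> 1.0."""
    return (code - 16) / 219

def normalized_to_luma_code(ey: float) -> int:
    """Inverse mapping, rounded to nearest and clamped to the 8-bit range.

    Simplification: real systems also reserve some codes for sync.
    """
    return max(0, min(255, round(16 + 219 * ey)))

print(luma_code_to_normalized(16), luma_code_to_normalized(235))
```

Codes below 16 and above 235 decode to values outside [0, 1], which is the footroom and headroom of the format. The analog setup level (7.5 IRE or 0 IRE) is a separate step and is not modeled here.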
> There maybe is. I think we call that "blind."
If you go looking into that, you'll see that the reality is far far more complex [0]
"The number of people with no light perception is unknown, but it is estimated to be less than 10 percent of totally blind individuals."
[0] https://chicagolighthouse.org/sandys-view/what-blind-people-...
But IIRC the MPEG-2 standard had luma==235 -> 100IRE for all of the analog formats (pal/ntsc-j/ntsc/secam) so I'm not sure why you say that would violate the broadcast limits?
You haven't grasped the fact that the choice isn't obvious, and has subtle trade-offs.
If you don't believe the author, check the other posts he references.
>There are only 255 bins, not 256
There are 256 bins because there are 256 values.
The questions are:
1. What are the boundaries of these bins?
2. Which sample represents a particular bin?
With 1-bit color, we have sample values {0, 1}. What bins do they represent?
Here's one choice:
[0, 1), [1, 2)
Two equally sized bins, spanning the interval [0, 2] of length 2, each defined by its sample at lower bound.
Alternatively, we could consider these bins:
[-0.5, 0.5), [0.5, 1.5)
These are also equally sized bins, spanning the interval [-0.5, 1.5] of length 2, defined by samples at the center.
We could also define bins like this:
[0, 0.5), [0.5, 1]
Two equally sized bins spanning the interval [0, 1] of length 1, where we sample the first bin at the lower bound, and the last bin at the upper bound.
This, in a nutshell is what the author is trying to explain.
Let's look at this again, with 2 bits.
With 2-bit color, we have sample values {0, 1, 2, 3}.
Which bins do they come from?
The three options above yield:
[0, 1), [1, 2), [2, 3), [3, 4)
[-.5, 0.5), [0.5, 1.5), [1.5, 2.5), [2.5, 3.5)
[0, 0.5), [0.5, 1.5), [1.5, 2.5), [2.5, 3]
The first two span an interval of length 4, the third spans an interval of length 3.
In the third case, the tail bins are short (have size ½), and the rest have size 1.
The last bin must be a closed interval in the third case, so that it includes the value we picked to represent it.
None of these choices is inherently invalid or better than the others; and none stems from "confusing bins with edges".
The third option does have the distinction that the first and last bins are smaller than the rest. But it's not necessarily a drawback. Especially when we're talking about color, hardware interpretation, and human perception.
When you remap these bins into the [0, 1] interval, you're "dividing by 4" in the first two cases, and by 3 in the third case.
The maps are:
x → x/4
x → (x + ½)/4
x → x/3
The inverse maps (that yield a sample in {0, 1, 2, 3} given a floating point value in interval [0, 1]) are:
x → trunc(4x)
x → round(4x - ½) = trunc(4x)
x → trunc(3x + ½)
In the first two options, the domain is [0, 1). It might be necessary to apply clipping because the exact value 1.0 falls outside the range of the forward transform.
The 2nd option is the most symmetric, of course, but the 3rd one is the most straightforward (and cheapest) to implement, so that's the default.
The choice amounts to making the highest and lowest bins slightly smaller to make the rest slightly larger.
That's to say, if you generate uniform noise between 0 and 1, you'll get the following samples from your function with equal probability:
0 or 3
1
2
As the author points out, this hardly matters when you are talking about having 256 bins.
That, and with color specifically, the "good" histograms aren't uniform anyway (and any photographer wants to avoid getting much at either extreme).
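The 2-bit case is small enough to check directly. The sketch below writes out the three forward and inverse maps listed above (the function names are made up) and uses an evenly spaced grid as a stand-in for uniform noise.

```python
import numpy as np

# Forward maps: sample k in {0, 1, 2, 3} -> value in [0, 1].
def decode_lower(k):
    return k / 4            # sample sits at the lower bound of its bin

def decode_center(k):
    return (k + 0.5) / 4    # sample sits at the center of its bin

def decode_ends(k):
    return k / 3            # first and last samples pinned to 0 and 1

# Inverse maps: value in [0, 1] -> sample, clipped to the valid codes.
def encode_trunc4(x):
    return np.clip(np.trunc(4 * x), 0, 3).astype(int)

def encode_round3(x):
    return np.clip(np.trunc(3 * x + 0.5), 0, 3).astype(int)

samples = np.arange(4)

# Evenly spaced points in (0, 1) standing in for uniform noise.
x = (np.arange(120_000) + 0.5) / 120_000
share_trunc4 = np.bincount(encode_trunc4(x), minlength=4) / x.size
share_round3 = np.bincount(encode_round3(x), minlength=4) / x.size

print(share_trunc4, share_round3)
```

The divide-by-4 layouts give each sample a quarter of the inputs, while the divide-by-3 layout gives samples 0 and 3 half the share of samples 1 and 2.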
TL;DR: The author is not confusing anything — but their diagram and explanation are, indeed, a bit confusing.
However OIIO is far from perfect in all situations (having had to debug and fix issues with its mip-map generation filtering code in the past), so don't always assume that just because there's a mature open source library out there doing something that it's always perfect.
1.0 lies on the right side of the bin 7. But 0.0 lies on the left of bin 0.
The standard approach assumes that we have centered samples: that zero is dead black, plus (and minus!) some uncertainty and so is bin 7.
If the sampling of the intensity is distortion-free (no clipping took place due to overexposure) then bin 7 represents a range of possible values centered around 1.0.
It is not a half-sized interval.
> This means that when converting floating-point values in the [0,1] range back to integers, the extreme bins have effectively half the width of other bins.
Under any interpretation whatsoever of the image samples, there is latitude for interpreting the maximum value 255 as being distortion: clipping from an arbitrarily higher value. Shifting things by 0.5 doesn't fix this issue of not knowing whether 255 means that an intensity close to 1.0 is being represented (no distortion), or an outlier intensity of 37.49 (severely clamped). That could go the other way too.
In other words, there is a possible bias in the extreme bin. The signal could be limited such that the bin's full sampling range is not in effect, or the signal could be overwhelming, so that values far outside of the range are clipped and included.
The only way around this is to make the highest value a canary which represents "clipped value". That is to say, 255 means "clipped datum", so that only values of 254 and below sample the unclipped signal. Machine-generated images (e.g. 3D renderings) then avoid the 255 value, and camera sensors are calibrated so that it doesn't occur when technical images are being shot.
ive just seen a lot of "ai researchers" who are getting into professional image processing and are both beginners and want things quickly and so could do much worse than just starting from what they get out of oiio. especially for a lot of the non-obvious stuff (more of that in color handling than just the io stuff though)
First, figure out what colorspace the processing needs to happen in. Usually this is linear RGB.
Then, figure out what OETF and EOTF your input/output format use. This will be something like PQ or HLG. This will exactly specify the meaning of each integer value.
This fixes the choice of representation and conversion.
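As an illustration of that second step, here is a sketch of decoding to linear light. It swaps in the sRGB transfer curve rather than PQ or HLG because the sRGB piecewise formula is short; that choice, the divide-by-255 normalization and the function name are all assumptions of this example.

```python
import numpy as np

def srgb_to_linear(encoded):
    """Decode sRGB-encoded values in [0, 1] to linear light (piecewise sRGB curve)."""
    encoded = np.asarray(encoded, dtype=np.float64)
    return np.where(
        encoded <= 0.04045,
        encoded / 12.92,                       # linear segment near black
        ((encoded + 0.055) / 1.055) ** 2.4,    # power segment elsewhere
    )

img = np.array([0, 128, 255], dtype=np.uint8)
linear = srgb_to_linear(img / 255.0)
print(linear)
```

Note how mid-gray code 128 comes out at roughly a fifth of full scale in linear light, which is why the processing colorspace matters more than the 255-versus-256 question.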
Case against 256: no 0 or 1 values :(
Considering how important having a 0 and 1 value is for arithmetic in general, I think 255 is better.
For real usage, today's CPUs are limited by memory bandwidth.
This is the same kind of confusion that happens with sampling positions in modern APIs, where the location is specified in coordinates and not in pixel centers.
It becomes a pain in the ass when you're generating a VGA signal with a microcontroller with 8 color output pins (3 red, 3 green, 2 blue). The meaning of a color value is very real in this setup: it corresponds to a voltage level you must send to the VGA monitor, 0V-0.7V.
So the blue channel will map (0->0V, 1->0.23V, 2->0.47V, 3->0.7V), and the red/green will map (0->0V, 1->0.1V, ..., 7->0.7V). Notice how none of the blue voltages match any of the red/green ones (other than the extremes)? That means you don't get to see any pure grays -- the closest ones will have a bit of blue or yellow tint, depending on the direction of the difference.
Not only that, any gradients at all (other than the ones not mixing blue with the other channels) will be noticeably off: for example, the closest colors in the line between pure red to pure white will all be slightly orange or purple.
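The mismatch can be checked with exact fractions. This sketch assumes evenly spaced levels from 0 V to 0.7 V per channel, as described above; `channel_levels` is a made-up helper and has nothing to do with the actual Pico code.

```python
from fractions import Fraction

FULL_SCALE = Fraction(7, 10)  # 0.7 V, the top of the VGA signal range mentioned above

def channel_levels(bits: int) -> list:
    """Evenly spaced output voltages for a channel with the given bit depth."""
    top = (1 << bits) - 1
    return [FULL_SCALE * Fraction(code, top) for code in range(top + 1)]

red_green = channel_levels(3)  # 8 levels, 0.1 V apart
blue = channel_levels(2)       # 4 levels, about 0.233 V apart

# Voltages that all three channels can hit exactly, i.e. the pure grays.
shared = sorted(set(red_green) & set(blue))
print([float(v) for v in shared])
```

Only black and white survive, because sevenths and thirds share no fractions other than 0 and 1.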
Code for VGA output in 8-bit color with double-buffered 320x240 framebuffer for the Raspberry Pi Pico 2 here, if anyone cares: https://github.com/moefh/pico-vga-8bit-demo
Why not scale to fill the available bins, though? i.e. trunc(result * 255.999)?
// color4_t result = {
// .r = (src.r * src.a + dst.r * inv_alpha) * INV_255,
// .g = (src.g * src.a + dst.g * inv_alpha) * INV_255,
// .b = (src.b * src.a + dst.b * inv_alpha) * INV_255,
// .a = src.a + (dst.a * inv_alpha) * INV_255
// };
// 1/256 but much faster
color4_t result = {
.r = (src.r * src.a + dst.r * inv_alpha) >> 8,
.g = (src.g * src.a + dst.g * inv_alpha) >> 8,
.b = (src.b * src.a + dst.b * inv_alpha) >> 8,
.a = src.a + ((dst.a * inv_alpha) >> 8)
};
The issue isn't in having a representation for 0 photons, but about maximizing information stored in a byte. Ideally you shouldn't be underutilizing the byte value 0, nor add bias to data that should have been assigned to the 0th bucket, regardless of what it represents (you could have a color space that goes from bright to super bright, and still want to ensure that every byte represents equal chunk of your brightness range).
https://uops.info/table.html?search=mulss&cb_lat=on&cb_tp=on...
https://uops.info/table.html?search=shr&cb_lat=on&cb_tp=on&c...
In throughput it's even less of a difference: 2 per cycle vs 3 per cycle.
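Speed aside, the shift in the blend code earlier in the thread divides by 256 rather than 255, so fully opaque white comes out one step too dark. The sketch below contrasts it with a widely used add-and-shift rounding trick, swapped in here for illustration and not taken from the thread; both function names are made up.

```python
def div255_shift(x: int) -> int:
    """Approximate x / 255 with a plain shift, as in the blend code above."""
    return x >> 8

def div255_rounded(x: int) -> int:
    """Rounded x / 255 using only adds and shifts; valid for 0 <= x <= 65025."""
    t = x + 128
    return (t + (t >> 8)) >> 8

opaque_white = 255 * 255  # a channel at 255 multiplied by an alpha of 255
print(div255_shift(opaque_white), div255_rounded(opaque_white))
```

The rounded version costs two extra integer operations per channel and keeps 255 at 255.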
2^2.2 = 4.595, 255^2.2 = 196,964.699
Changing at 30Hz I doubt a human can tell the difference between slightly blue and slightly yellow.
OMG I remember as a kid staring at static-y CRT displays, and seeing these faint blue and yellow lines at the borders of them. I’d always wondered why they appeared and why they were specifically blue and yellow. I finally know! (at least, assuming those specific artifacts are due to the same thing)
The reason is that year 0 never existed. The year 1 BCE was followed by the year 1 CE.
Culturally, anthropologically, and psychologically it might be a different matter. But 2000 years had not passed before the end of that year.
I assume this is why RGBI color was so common in the 80s.
That’s half of the mid-riser staircase quantizer discussed in the article. (The other half is coming up with the reverse.)
(I would implement it as min(floor(x * 256), 255).)
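The two spellings agree almost everywhere but not exactly. A small sketch, with made-up function names, shows where they part ways:

```python
import math

def quantize_255_999(x: float) -> int:
    """Scale by just under 256 and truncate; no clamp needed for x <= 1."""
    return math.trunc(x * 255.999)

def quantize_256(x: float) -> int:
    """Scale by 256, floor, and clamp the single overflowing input x = 1.0."""
    return min(math.floor(x * 256), 255)

for x in (0.0, 0.25, 0.5, 1.0):
    print(x, quantize_255_999(x), quantize_256(x))
```

Inputs that sit exactly on a bin edge, such as 0.5, land one code lower with the 255.999 scale, because the edges are all pulled slightly toward zero.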
Also, you should use SIMD.
Now I am imagining a weird alternate history where we treat audio like we treat color. OK take three bytes which encode how loud the sound is, one for lows, one for mids and one for highs where lows mids and high frequencies are picked to match human ear response.
(This may be more apparent when you frame gamma as being applied in the 0-1 range, so it doesn’t really turn 2 into 4.595 and 255 into ~200k; it turns (2/255)≈0.00784 into (2/255)^2.2 ≈ 0.0000233, and leaves (255/255)=1 as is.)
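In code, the normalize-then-power order looks like this. It is only a sketch of the arithmetic in the parenthetical above, with a plain 2.2 power curve and a made-up function name.

```python
GAMMA = 2.2

def apply_gamma(code: int, max_code: int = 255) -> float:
    """Normalize an integer code to [0, 1], then raise it to the gamma power."""
    return (code / max_code) ** GAMMA

dark = apply_gamma(2)      # roughly 0.0000233
white = apply_gamma(255)   # stays at 1.0
print(dark, white)
```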
Possibly my proposal doesn’t hold up to repeated transforms and operations. It might skew toward 255 in real operations.
Let’s say you’re writing an image processing program. The program takes in an image, converts it to floating point, does some processing and finally saves the modified pixels to disk as 8-bit colors. The question today concerns how exactly the integer-to-float conversion should be done. There are two approaches which, written in Python and NumPy, look like this:
Standard division by 255:

    pixels = img / 255.0
    result = process(pixels)
    output = np.trunc(result * 255 + 0.5)

Alternative division by 256:

    pixels = (img + 0.5) / 256.0
    result = process(pixels)
    output = np.trunc(result * 256)
I assume that in both cases the output values are clamped before the final typecast:
# Clamp and cast to 8 bits
output_8bit = output.clip(0, 255).astype(np.uint8)
The standard approach maps the integer 0 to 0.0 and 255 to 1.0. It works perfectly fine and is how GPUs do it. The alternative adds a 0.5 bias and divides by 256 instead, so the integer 0 gets mapped to 0.5/256=0.001953125. This is inconvenient because your image processing code can’t detect black pixels, for example, without knowing the above constant. As a consequence, you tie your logic to 8-bit inputs even if you compute in floating point. With the standard approach, you can always assume black is 0.0.
But some programmers still feel a pull towards the alternative. What is going on? What do they see in it?
The standard approach does look quite strange when plotted on the number line. Below you can see an exaggerated version with 3-bit integers in the range [0..7] being mapped to [0,1]:

On the X-axis we’ve got a number line and the locations of brown circles on it represent the decoded floating-point values. The numbers inside are the integer inputs. Each integer has arrows pointing to it; these show a range of floating-point values that round to it. I’ll call these ranges “bins” in the rest of this article.
The first issue really apparent in the diagram is how the standard formula’s extreme bins jut beyond the [0,1] range. Perhaps this visualization is unfair – both approaches clamp their output so the extreme bins could extend infinitely – but it clearly shows how “stretched” the standard range is. The stretched range is wider than the assumed operating range [0, 1] in image processing.
This means that when converting floating-point values in the [0, 1] range back to integers, the extreme bins have effectively half the width of other bins. As a consequence, it will be “harder” to output extreme values from your algorithm. For example, if you generate uniform [0,1] noise and round it using the standard formula, the values 0 and 255 will occur only half as frequently as other integers.
We can verify this claim empirically by generating a million uniform random numbers, plotting them as a histogram, and observing that both the 0 and 255 bins are indeed only half as tall as other bins:

The highlighted crop:

Histogram code
import numpy as np
import matplotlib.pyplot as plt
result = np.random.uniform(0, 1, 1000000)
final_values = np.trunc(result * 255 + 0.5).clip(0, 255).astype(np.uint8)
plt.hist(final_values, bins=256, range=(0, 255))
plt.show()
Still, I’m having a hard time coming up with an example situation where the bias away from the extremes would prove problematic. Sure, the standard approach’s floats are spread over a wider range, but the original image will still round-trip convert losslessly (uint8 → float → uint8).
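The lossless round trip is easy to confirm for every 8-bit value. This sketch assumes 32-bit floats as the working type and skips the processing step entirely.

```python
import numpy as np

img = np.arange(256, dtype=np.uint8)  # every possible 8-bit value

# Standard: divide by 255, round to nearest on the way back.
pixels_255 = img.astype(np.float32) / np.float32(255.0)
back_255 = np.trunc(pixels_255 * 255 + 0.5).clip(0, 255).astype(np.uint8)

# Alternative: add the half-step bias, divide by 256, truncate on the way back.
pixels_256 = (img.astype(np.float32) + 0.5) / np.float32(256.0)
back_256 = np.trunc(pixels_256 * 256).clip(0, 255).astype(np.uint8)

print(np.array_equal(back_255, img), np.array_equal(back_256, img))
```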
Also, any result value just beyond 0.0 or 1.0 will still round to the right bin, evening out the output distribution. Here is an example of what I mean. Assume your processing subtracts 0.001 from the floating-point colors. In the standard approach this pushes blacks below zero – outside the [0,1] range – but in the alternative the values stay positive. In the end both output the integer 0 anyway:
Standard:
trunc(255 * (-0.001) + 0.5) = 0
Alternative:
trunc(256 * (0.5 / 256 - 0.001)) = 0
It didn’t matter that in the standard approach the zero bin was only “half the size”.
The second issue is that the standard approach’s floating-point values aren’t exact. For example 128/255.0 \approx 0.501961 but 128/256.0 = 0.5. Due to this round-off error, the distances between floating-point values vary a tiny bit. But this isn’t a real problem since the error is truly tiny. A 32-bit floating-point number has a 23-bit fraction (“significand”). We are talking about round-off error in its least-significant bit; jitter with the magnitude less than 2^{-23}. Surely a relative error of 0.00001 % is immaterial even in the most sophisticated image processing task. In this case, inexactness is an aesthetic question, not a technical one.
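The size of that jitter can be measured directly. The sketch below decodes all 256 codes in 32-bit floats and compares the gaps between neighbours with the ideal step; the variable names are made up.

```python
import numpy as np

codes = np.arange(256, dtype=np.float32)

# Decode in float32, then widen to float64 so the comparison adds no extra error.
decoded_255 = (codes / np.float32(255.0)).astype(np.float64)
decoded_256 = ((codes + np.float32(0.5)) / np.float32(256.0)).astype(np.float64)

# Gaps between neighbouring decoded values, compared with the ideal step.
jitter_255 = np.abs(np.diff(decoded_255) - 1.0 / 255.0).max()
jitter_256 = np.abs(np.diff(decoded_256) - 1.0 / 256.0).max()

print(jitter_255, jitter_256)
```

Division by 256 is exact because 256 is a power of two, so its gaps show no jitter at all, while the divide-by-255 gaps wobble by less than 2^{-23}.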
The alternative approach always places each floating-point value exactly in the middle of two integers. See how the vertical bars align in the number line diagram above. The halfway position can be thought of as a compromise; we don’t know what the original quantized value was exactly, and thus the average point between two successive integers is a good guess.
I’m sure there are applications where this property is useful, even though I’m having a hard time coming up with examples myself. Well, at least dithering is more convenient, argues a 2015 blog post “Converting Color Depth” by Andrew Kensler (known for his business card raytracer). The reasoning goes that noise can be added without worrying about edge cases. In contrast, the standard formula’s awkward extremes require careful handling to keep the noise distribution consistent.
So far the standard “divide by 255” formula still looks solid, or at least firm enough to still be worth it. Another way to think about the question is to zoom out a bit and see the two approaches as two different uniform scalar quantizers. If we check the Wikipedia page on quantization, we’ll quickly learn that there are two main types of quantizers:
Most uniform quantizers for signed input data can be classified as being of one of two types: mid-riser and mid-tread. The terminology is based on what happens in the region around the value 0, and uses the analogy of viewing the input-output function of the quantizer as a stairway. Mid-tread quantizers have a zero-valued reconstruction level (corresponding to a tread of a stairway), while mid-riser quantizers have a zero-valued classification threshold (corresponding to a riser of a stairway).
As a source Wikipedia cites a 1977 paper that has such an incredible combined title and abstract layout that I must reproduce it here:

Anyway. When plotted on a graph, the mid-riser and mid-tread quantizers differ where they cross zero:

Mid-tread indeed maps zero to zero and mid-riser maps zero to the middle of two integers (sound familiar?). The notation chosen by Wikipedia represents an input real number with x, its encoded (“classified”) integer value with k, and reconstructed real number with y_k. The corresponding quantizer formulas look like this:
| Type | Classify (encode) | Reconstruct (decode) |
|---|---|---|
| Mid-tread staircase quantizer | k = \text{trunc}(x L + 0.5) | y_k=k/L |
| Mid-riser staircase quantizer | k = \text{trunc}(x L) | y_k=(k+0.5)/L |
L stands for the number of distinct output levels (for example 256).
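The table translates almost directly into NumPy. The sketch below uses the 3-bit case from the number line diagram (L=7 for mid-tread, L=8 for mid-riser); the function names are made up, and no clamping is applied.

```python
import numpy as np

def mid_tread_encode(x, L):
    """Classify: k = trunc(x * L + 0.5)."""
    return np.trunc(np.asarray(x, dtype=np.float64) * L + 0.5).astype(int)

def mid_tread_decode(k, L):
    """Reconstruct: y_k = k / L."""
    return np.asarray(k) / L

def mid_riser_encode(x, L):
    """Classify: k = trunc(x * L)."""
    return np.trunc(np.asarray(x, dtype=np.float64) * L).astype(int)

def mid_riser_decode(k, L):
    """Reconstruct: y_k = (k + 0.5) / L."""
    return (np.asarray(k) + 0.5) / L

codes = np.arange(8)  # 3-bit integers
tread_values = mid_tread_decode(codes, 7)  # 0/7, 1/7, ..., 7/7
riser_values = mid_riser_decode(codes, 8)  # 0.5/8, 1.5/8, ..., 7.5/8

print(tread_values)
print(riser_values)
```

Each encoder only inverts its own decoder by construction; the mid-tread values include 0.0 and 1.0 exactly, while the mid-riser values stay strictly inside the range.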
If we apply these definitions to our two competing approaches, we can call the standard formula a “mid-tread” with L=255 and the alternative a “mid-riser” with L=256. Actually, I’ll show their code again with the new labels to make the connection to the new formulas above clear. The code snippets themselves are the same as in the beginning.
Mid-tread quantizer (L=255):

    pixels = img / 255.0
    result = process(pixels)
    output = np.trunc(result * 255 + 0.5)

Mid-riser quantizer (L=256):

    pixels = (img + 0.5) / 256.0
    result = process(pixels)
    output = np.trunc(result * 256)
From this perspective we can say the standard approach is a strange combination of a mid-tread quantizer for unsigned inputs (the quote said “for signed input data”) and a choice of L=255 integer codes. Clearly this is not optimal for 8-bit inputs. Again, this is all for the programming convenience of having the extremes map to 0.0 and 1.0. This leads to the final criticism of the standard formula.
If we were designing a system that receives a uniformly distributed real number x \in [0,1], encodes it as an 8-bit integer k, and finally reconstructs it as another real number y_k, the standard formula would waste bandwidth. Remember how the 0 and 255 bins poked slightly beyond the [0,1] range’s edges? In the standard approach, the range of representable values is actually [-0.5/255, 255.5/255], meaning the bins are spaced further apart than strictly needed for [0, 1] inputs, leading to a higher reconstruction error. The increase in error is small, however. According to StackOverflow user Peter Mudrievskij’s calculation, the mean absolute errors are 1/1020 and 1/1024 for 255 and 256 divisors, respectively. Thus division by 256 is theoretically more precise.
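The two error figures can be approximated numerically. This sketch replaces the uniformly distributed x with a dense, evenly spaced grid so the result is deterministic; it is a sanity check under that assumption, not a reproduction of the cited calculation.

```python
import numpy as np

# Dense, evenly spaced stand-in for a uniformly distributed x in [0, 1].
n = 1_000_000
x = (np.arange(n) + 0.5) / n

# Divisor 255: round to nearest on encode, decode as k / 255.
k_255 = np.trunc(x * 255 + 0.5).clip(0, 255)
mae_255 = np.abs(x - k_255 / 255).mean()

# Divisor 256: truncate on encode, decode at the bin center.
k_256 = np.trunc(x * 256).clip(0, 255)
mae_256 = np.abs(x - (k_256 + 0.5) / 256).mean()

print(mae_255, 1 / 1020)
print(mae_256, 1 / 1024)
```

The gap between the two is about 0.4 % of an already tiny error, which matches the "small increase" described above.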
The subtle part is that this kind of reconstruction is not what we’re doing. The premise was that we are loading 8-bit RGB images, doing processing on them, and saving them again. We have no control over how they were quantized when saved; all information lost is gone forever. In other words, if an image’s color were multiplied by 255 and rounded, dividing them by 256 at load time does not bring back any precision. Only when we control both saving and loading does an appeal to lower reconstruction error make sense.
In fact, using the alternative formula to load other people’s images will introduce more error. Most likely the images were quantized via the standard formula, so decoding them with the wrong scale factor is incorrect, in theory. In practice, the colors aren’t absolute measurements (even if the sRGB spec claims so), and all that happens is that we’ll do our processing in a slightly smaller range with a small offset. End of the subtle part.
Finally, one should never mix the encode and decode steps of the two quantizers. That’s just broken code. It’s an easy mistake to make, though.
To answer the question posed in the title: if you’re processing images given to you by strangers, you should normalize RGB values by 255. Neither inexact floating-point values nor some abstract feeling of a higher reconstruction error is a good reason to go for the alternative. But if you control both image saving and loading, don’t need zero to map to zero, and feel OK about tying your processing code to the 8-bit dynamic range, then you can consider division by 256 to eke out a bit more precision. Just don’t blame me when your colleagues load your images with the standard formula anyway, ruining your master plan.
Jonathan Blow’s 2002 article talks about mid-riser and mid-tread quantizers without mentioning them by name. I got the diagram idea from there.
The already mentioned 2015 blog post by Andrew Kensler advocates for the alternate formula. Unfortunately the comparison is to the standard formula but without rounding, which invalidates most of the analysis.
If your conversion from high precision -> 8-bit is just multiplication by 256 and then truncation, then you’ve got the mid-riser quantizer. The +0.5 comes from interpreting a value of 0 as the bucket from 0-1, just like the value of 255 is the bucket from 255-256. It’s introduced in the conversion back from 8-bit to high precision.
But again, the likely reason no one does this is because it introduces a bias in the other direction, toward 255.