This matches my experience whenever I do unconventional or deep work like the article mentions. The engineers comfortable with this type of work will multiply their worth.
One thing that I am glad to have been taught early in my career when it comes to debugging, especially anything involving HW, is to "make no assumptions". Bugs can be anywhere and everywhere.
One thing I noticed: The last footnote is missing.
I don't believe there's anybody who can reason about them at code skimming speeds. It's probably the best place to hide underhanded code.
There is absolutely no "sign extension" in the C standard (go ahead, search it). "Sign extension" is a feature of some assembly instructions on some architectures, but C has nothing to do with it.
Citing integer promotion from the standard is justified, but it's just one part (perhaps even the smaller part) of the picture. The crucial bit is not quoted in the article: the specification of "Bitwise shift operators". Namely
> The integer promotions are performed on each of the operands. The type of the result is that of the promoted left operand. [...]
> The result of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are filled with zeros. If E1 has an unsigned type, the value of the result is E1×2^E2, reduced modulo one more than the maximum value representable in the result type. If E1 has a signed type and nonnegative value, and E1×2^E2 is representable in the result type, then that is the resulting value; otherwise, the behavior is undefined.
What happens here is that "base2" (of type uint8_t, which is "unsigned char" in this environment) gets promoted to "int", and then left-shifted by 24 bits. You get undefined behavior because, while "base2" (after promotion) has a signed type ("int") and nonnegative value, E1×2^E2 (i.e., base2 × 2^24) is NOT representable in the result type ("int").
What happens during the conversion to "uint64_t" afterwards is irrelevant; even the particulars of the sign bit of "int", and how you end up with a negative "int" from the shift, are irrelevant; you got your UB right inside the invalid left-shift. How said UB happens to materialize on this particular C implementation may perhaps be explained in terms of sign extension of the underlying ISA -- but do that separately; be absolutely clear about what is what.
The article fails to mention the root cause (violating the rules for the bitwise left-shift operator) and fails to name the key consequence (undefined behavior); instead, it leads with not-a-thing ("sign-extension bug in C"). I'm displeased.
BTW this bug (invalid left shift of a signed integer) is common, sadly.
C seems to be one of those languages where people think they know it based on prior and adjacent experience. But it is not a language which can be learned based on experience alone. The language is full of cases where things will go badly wrong in a way which is neither obvious nor immediately evident. The negative side effects of what you did often only become evident long after you "learn" it as something you "can" do.
If you want to write C for anything where any security, safety, or reliability requirement needs to be met, you should commit to this strategy: Do not write any code which you are not absolutely certain you could justify the behaviour of by referencing the standard or (in the case of reliance on a specific definition of implementation defined, unspecified, or even (e.g. -ftrapv) undefined behaviour) the implementation documentation.
If you cannot commit to such a (rightfully mentally arduous) policy, you have no business writing C.
The same can actually be applied to C++ and Bash.
> Since virtualization is hardware assisted these days
I was running Xen with full-hardware virtualization on consumer hardware in... 2006. I mean: some of us here were running hardware virt before some of the commenters were born. Just to put the "these days" into perspective in case some would be thinking it's a new thing.
It does not matter what the relationship between the sizes of such types is; there will always be values of the operand that cannot be represented in the result.
Saying that the behavior is sometimes undefined is not acceptable. Any implicit conversion of this kind must be an error. Whenever a conversion between signed and unsigned or unsigned and signed is desired, it must be explicit.
This may be the worst mistake that has ever been made in the design of the C language and it has not been corrected even after 50 years.
Making this an error would indeed produce a deluge of error messages in many carelessly written legacy programs, but the program conversion is trivial and it is extremely likely that many of these cases where the compilers do not signal errors can cause bugs in certain corner cases, like in the parent article.
In my opinion, unwritten rules are for gatekeeping. And if a new person follows all the unwritten rules, magically there's no one willing to review.
I think this is how large BDFL-style open source projects slowly become less and less relevant over the next few decades.
For these projects everything "tribal" has to be explicitly codified.
On a more general note: this is likely going to have a rather big impact on software in general - the "engineer the company cannot afford to lose" is likely losing their moat entirely.
However, this kind of implicit conversion must really be forbidden in the standard, because the correct program source differs from the one the standard currently permits.
When you activate most compiler options that detect undefined behaviors, the correct program source remains the same, even if the compiler now implements a better behavior for the translated program than the minimal behavior specified by the standard.
That happens because most undefined behaviors are detected at run time. On the other hand, incorrect implicit conversions are a property of the source code, which is always detected during compilation, so such programs must be rejected.
But it is easy enough to use modern tooling and coding styles to deal with signed overflow. Nowadays, silent unsigned wrap around causing logic errors is the more vexing issue, which indicates the undefined behavior actually helps rather than hurts when used with good tooling.
The hardware of modern CPUs actually implements 5 distinct data types that must be declared as "unsigned" in C: non-negative integers, integer residues a.k.a. modular integers, bit strings, binary polynomials and binary polynomial residues.
A modern programming language should better have these 5 distinct types, but it must have at least distinct types for non-negative integers and for integer residues. There are several programming languages that provide at least this distinction. The other data types would be more difficult to support in a high-level language, as they use certain machine instructions that compilers typically do not know how to use.
The change in the C standard that made "unsigned" mean integer residue has left the language without any means to specify a data type for non-negative integers, which is extremely wrong, because more programs use "unsigned" for non-negative integers than for integer residues.
The hardware of most CPUs implements non-negative integers very well, so non-negative integer overflow is easily detected, but the current standard makes it impossible to use the hardware.
I agree though that using "unsigned" for non-negative integers is problematic and that there should be a way to specify non-negative integers. I would be fine with an attribute.
The problem is also that the standard committee is not the ruling body of the C language. It is the place where people come together to negotiate some minimal requirements. If you want something, you need to first convince the compilers vendors to implement it as an extension.
Mar 19, 2026 · Updated on Mar 22, 2026
How a sign-extension bug in C made me pull my hair out for days but became my first patch to the Linux kernel!
A while ago, I started dipping my toe into virtualization. It's a topic that many people have heard of or use on a daily basis, but few know or think about how it works under the hood.
I like to learn by reinventing the wheel, and naturally, to learn virtualization I started by trying to build a Type-2 hypervisor. This approach is similar to how KVM (Linux) or bhyve (FreeBSD) are built.
My experimental hypervisor (and VMM) is still a work in progress and is available on my GitHub: pooladkhay/evmm.
Since virtualization is hardware assisted these days 1, the hypervisor needs to communicate directly with the CPU by running certain privileged instructions, which means a Type-2 hypervisor is essentially a kernel module that exposes an API 2 to user-space, where a Virtual Machine Monitor (VMM) 3 like QEMU or Firecracker is running and orchestrating VMs by utilizing that API.
In this post, I want to describe exactly how I found that bug. But to make it a bit more educational, I'm going to set the stage first and talk about a few core concepts so you can see exactly where the bug emerges.
The x86 architecture in protected mode (32-bit mode) envisions a task-switching mechanism facilitated by the hardware. The architecture defines a Task State Segment (TSS), a region in memory that holds information about a task (general-purpose registers, segment registers, etc.). The idea was that any given task or thread would have its own TSS, and when a switch happened, a specific register (the Task Register, or TR) would get updated to point to the new task 4.
This was abandoned in favor of software-defined task switching which gives more granular control and portability to the operating system kernel.
But the TSS was not entirely abandoned. On modern 64-bit systems, the kernel uses a TSS-per-core approach where the main job of the TSS is to hold a few stack pointers that are critical for the kernel's and CPU's normal operation. More specifically, it holds the kernel stack of the current thread, which is used when the system switches from user-space to kernel-space.
It also holds a few known good stacks for critical events like Non-Maskable Interrupts (NMIs) and Double Faults. These are events that if not handled correctly, can cause a triple fault and crash a CPU core or cause an immediate system reboot.
We know that memory access is generally considered expensive, and caching values somewhere on the CPU die is the preferred approach when possible. This is where the TR register comes into the picture. It has a visible part, a 16-bit offset that we have already discussed, as well as a hidden part that holds direct information about the TSS (base address, limit, and access rights). This saves the CPU the trouble of indexing into the GDT to find the TSS every time it's needed.
A hypervisor is essentially a task switcher where tasks are operating systems. In order for multiple operating systems to run on the same silicon chip, the hypervisor must swap the entire state of the CPU which includes updating the hidden part of the TR register as well.
In a previous blog post 1 I described how Intel implemented their virtualization extension (VT-x) and how each vCPU (vCore) is given its own VMCS (Virtual Machine Control Structure) block where its state is saved to or restored from by the hardware when switching between host and guest OSes.
I suggest reading that post if you're interested in the topic, but in short, the VMCS consists of four main areas:
Host-state area has two fields, which correspond to the visible part and one of the hidden parts (the base address) of the TR register:

- HOST_TR_SELECTOR (16 bits)
- HOST_TR_BASE (natural width 5)

The guest-state area has four (one visible plus all three hidden parts):

- GUEST_TR_SELECTOR (16 bits)
- GUEST_TR_BASE (natural width 5)
- GUEST_TR_LIMIT (32 bits)
- GUEST_TR_ACCESS_RIGHTS (32 bits)

The reason is that the hardware assumes the host OS is a modern 64-bit operating system where the TR limit and access rights are fixed known values (0x67 and 0x11 respectively). But the guest OS can be virtually any operating system with any constraints.
Naturally, it is the hypervisor's job to set these values on initial run and to update them when needed (e.g. when the kernel thread that is running a vCPU is migrated to another physical CPU core, the hypervisor must update the host state to match the new core).
To set these values, I "borrowed" some code from the Linux kernel tree (KVM selftests):
vmwrite(HOST_TR_BASE,
get_desc64_base((struct desc64 *)(get_gdt().address + get_tr())));
This piece of code does the following:

- Reads the base address of the current GDT (get_gdt().address) and the TR selector (get_tr()).
- Adds the selector offset to the GDT base to locate the TSS segment descriptor.
- Extracts the base address of the TSS from that descriptor (get_desc64_base).
- Writes the extracted address into the HOST_TR_BASE section of the VMCS using the special VMWRITE instruction 6.

So far, so good!
If for any reason this operation fails to extract and write the correct address, upon the next context switch from user-space to kernel-space (or next NMI or next Double fault), when the CPU hardware tries to read the kernel stack from the TSS to update the Stack Pointer register, it either receives garbage or an unmapped address. Either way, the CPU will eventually face a double fault (a fault that happens when trying to handle another fault like a page fault) and when trying to use one of the known good stacks for handling the double fault, it will fail again which will make it a triple fault and BOOM! The core dies or we get a sudden reboot.
Now let's talk about the issue that I was facing.
I started developing my hypervisor on a virtualized instance of Fedora, to avoid crashing my machine in case something went wrong. By the time I realized something was indeed wrong, I had already developed the ability to put the CPU in VMX operation and run a hardcoded loop in VMX non-root mode that would use the VMCALL instruction to trap into the hypervisor (VMX root) and ask it to print a message, then resume the loop (VMRESUME).
Additionally, the VMCS was programmed to trap external interrupts (e.g. timer ticks). Upon an exit, the hypervisor would check whether we (the current kernel thread) needed to be rescheduled, keeping the kernel scheduler happy.
I was using the preempt notifier API, which lets threads provide two custom functions (sched_in and sched_out) that are called by the scheduler when it's about to deschedule the thread and right before rescheduling it. These functions are then responsible for the required cleanup and initialization work.
In my case, sched_out would unload the VMCS from the current core, and sched_in would load it on the new core 7 while reinitializing it using a series of VMWRITEs 6 to match the new core's state.
On my virtualized dev environment with only three vCPUs, everything was working just fine. Until I decided to give it a try on my main machine 8 where the hypervisor would talk to an actual physical CPU.
And BOOM!
Seconds after running the loop, the system crashed in a very unpredictable way. I was logging the core switches and didn't find any meaningful correlation between the last core number and the crash. Additionally, sometimes it would last longer and sometimes the crash was immediate. After investigating kernel logs a few times, I saw a pattern in the sequence of events that caused the system to eventually hang:
So why no triple faults?!
The Kernel Oops killed the active task and halted operations on CPU 5. However, it left CPU 5 in a "zombie" state. Alive enough to keep the motherboard powered on, but with its interrupts disabled, making it entirely unresponsive to the rest of the system.
Soon I realized that the hypervisor worked absolutely fine 9 when pinned to one core (e.g. via the taskset command), so there must have been something happening while moving between cores. Additionally, I didn't dare question the code I had stolen from the Linux kernel source, and I was trying hard to find an issue in the code I had written myself. This eventually led to rewriting a portion of the hypervisor code with an alternative method that would achieve the same goal.
For example, from reading Intel's Software Developer Manual (SDM) 10, I knew that when moving from core A to core B, core A must run the VMCLEAR instruction to unload the VMCS, and only then can core B load the VMCS using VMPTRLD to be able to execute the guest code. For that, I had been using smp_call_function_single, which relies on IPIs to run a piece of code on another CPU, before replacing it with the preempt notifiers.
Eventually (while pulling my hair out), I realized I had eliminated all possible parts of the hypervisor that played a role in moving between cores.
Then there was another clue!
While running the hypervisor on my virtual dev environment (QEMU + Fedora), I observed that by increasing the number of vCores, I could reproduce the issue, and there was also a new behavior. Sometimes the VM rebooted immediately (instead of freezing), and after the reboot there was no trace of any logs related to the previous session. I concluded that a triple fault had happened.
This turned my attention to the TR and TSS. I started looking for alternative ways of setting HOST_TR_BASE and realized that KVM itself (not the KVM selftests) uses a different method:
/*
* Linux uses per-cpu TSS and GDT, so set these when switching
* processors. See 22.2.4.
*/
vmcs_writel(HOST_TR_BASE, (unsigned long)&get_cpu_entry_area(cpu)->tss.x86_tss);
And that was it! Using this method to set HOST_TR_BASE fixed my hypervisor and helped me keep whatever sanity I had left.
Remember that piece of code I took from the kernel source? It used the get_desc64_base function to extract the address of the TSS and write it into HOST_TR_BASE. This function has this definition:
static inline uint64_t get_desc64_base(const struct desc64 *desc)
{
return ((uint64_t)desc->base3 << 32) |
(desc->base0 | ((desc->base1) << 16) | ((desc->base2) << 24));
}
The TSS segment descriptor has four fields that must be stitched together to form the address of the TSS 11:

- base0 is uint16_t.
- base1 is uint8_t.
- base2 is uint8_t.
- base3 is uint32_t.

The C standard 12 dictates integer promotion: whenever a type smaller than an int is used in an expression, the compiler automatically promotes it to an int (a 32-bit signed integer on modern x86-64 architectures) before performing the operation.
> If an int can represent all values of the original type (as restricted by the width, for a bit-field), the value is converted to an int; otherwise, it is converted to an unsigned int. These are called the integer promotions. All other types are unchanged by the integer promotions.

— Section 6.3.1.1
This promotion has a consequence: the shift is now performed on a signed int. If the shifted value would set the most significant (sign) bit, the shift is, strictly speaking, undefined behavior; in practice, typical compilers produce a negative int, and when that negative int is converted to a larger type like a uint64_t, sign extension happens.
Let's see an example:
We have an 8-bit unsigned integer (uint8_t) with the bit pattern 11001100. After promotion to int, left-shifting it by 24 shifts a 1 into the sign bit, so the mathematical result is not representable in an int and the shift is undefined behavior per the standard. In practice, the compiler generates the value 11001100000000000000000000000000 and considers it an int, which is a signed type.
Now if we try to perform any operation on this value, it follows the rules for signed values. In our case, we are ORing it with a uint64_t, so the compiler converts our int (a 32-bit signed, now negative value) into uint64_t (a 64-bit unsigned value). This conversion is where the sign extension happens, turning our value into 11111111111111111111111111111111_11001100000000000000000000000000 before the OR happens.
See the problem?
Because the upper 32 bits are sign-extended to all 1s (Hex: 0xFFFFFFFF), the bitwise OR operation completely destroys base3 (In a bitwise OR, 1 | X equals 1). Therefore, whatever data was in base3 is permanently overwritten by the 1s from the sign extension.
Here is an actual example with "real" addresses:
base0 = 0x5000
base1 = 0xd6
base2 = 0xf8
base3 = 0xfffffe7c
Expected return: 0xfffffe7cf8d65000
Actual return: 0xfffffffff8d65000
This also explains when the problem would happen: if and only if base2 has a 1 as its most significant bit. Any other value would not corrupt the resulting address.

The fix is actually very simple. We must cast the values to unsigned types before the bit-shift operations:
static inline uint64_t get_desc64_base(const struct desc64 *desc)
{
return (uint64_t)desc->base3 << 32 |
(uint64_t)desc->base2 << 24 |
(uint64_t)desc->base1 << 16 |
(uint64_t)desc->base0;
}
This keeps every shift in 64-bit unsigned arithmetic, preventing both the undefined shift and the sign extension.
Finally, this is the patch I sent, which was approved and merged:
https://lore.kernel.org/kvm/20251222174207.107331-1-mj@pooladkhay.com/
I can't finish this post without talking about AI!
You may wonder whether I tried asking an LLM for help. Well, I did. In fact, it was very helpful in some tasks, like summarizing kernel logs 13 and extracting the gist of them. But when it came to debugging based on all the clues that were available, it concluded that my code didn't have any bugs and that the CPU hardware was faulty.
CASE CLOSED.
3
Hypervisor and Virtual Machine Monitor (VMM) are generally interchangeable terms, while some might differentiate them slightly (e.g. VMM as user-space part of a kernel-space hypervisor).
4
The TR register does not directly point to the TSS. It holds an offset that is used to index into a region of memory called the Global Descriptor Table (GDT). At this offset lives the TSS segment descriptor, which is the entity that actually holds the address of the TSS.
At this point I hope you're asking "WTF Intel?!"
Well, these design decisions were made back in the '80s, when memory was scarce, paging hadn't been fully adopted yet, and segmentation was "the way" of managing memory and privilege levels.
5
32 bits on 32-bit machines and 64 bits on 64-bit machines.
6
It's not possible to write to and read from the VMCS using usual memory read and write operations. There are special instructions to do so: VMREAD and VMWRITE.
7
Yes, this path must be optimized since this loading and unloading is relatively heavy. And hypervisors usually pin threads to cores to avoid paying this fee.
8
It's an Intel Core i7-12700H with 14 Cores (6 Performance, 8 Efficient) and a total of 20 threads.
9
Looking back, that was purely luck. Continue reading to know why...
10
Volume 3C of the SDM covers the virtual machine extension (VMX).
11
Another remnant of old hardware design that is kept for backward compatibility purposes, but "WTF Intel?!" indeed.