My first patch to the Linux kernel

> You may wonder whether I tried asking an LLM for help or not. Well, I did. In fact it was very helpful in some tasks like summarizing kernel logs [^13] and extracting the gist of them. But when it came to debugging based on all the clues that were available, it concluded that my code didn't have any bugs, and that the CPU hardware was faulty.

This matches my experience whenever I do an unconventional or deep work like the article mentions. The engineers comfortable with this type of work will multiply their worth.

Everybody seems to be missing the forest for the trees on this.

There is absolutely no "sign extension" in the C standard (go ahead, search it). "Sign extension" is a feature of some assembly instructions on some architectures, but C has nothing to do with it.

Citing integer promotion from the standard is justified, but it's just one part (perhaps even the smaller part) of the picture. The crucial bit is not quoted in the article: the specification of "Bitwise shift operators". Namely

> The integer promotions are performed on each of the operands. The type of the result is that of the promoted left operand. [...]

> The result of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are filled with zeros. If E1 has an unsigned type, the value of the result is E1×2^E2, reduced modulo one more than the maximum value representable in the result type. If E1 has a signed type and nonnegative value, and E1×2^E2 is representable in the result type, then that is the resulting value; otherwise, the behavior is undefined.

What happens here is that "base2" (of type uint8_t, which is "unsigned char" in this environment) gets promoted to "int", and then left-shifted by 24 bits. You get undefined behavior because, while "base2" (after promotion) has a signed type ("int") and nonnegative value, E1×2^E2 (i.e., base2 × 2^24) is NOT representable in the result type ("int").

What happens during the conversion to "uint64_t" afterwards is irrelevant; even the particulars of the sign bit of "int", and how you end up with a negative "int" from the shift, are irrelevant; you got your UB right inside the invalid left-shift. How said UB happens to materialize on this particular C implementation may perhaps be explained in terms of sign extension of the underlying ISA -- but do that separately; be absolutely clear about what is what.

The article fails to mention the root cause (violating the rules for the bitwise left-shift operator) and fails to name the key consequence (undefined behavior); instead, it leads with not-a-thing ("sign-extension bug in C"). I'm displeased.

BTW this bug (invalid left shift of a signed integer) is common, sadly.

The first one always takes way longer than the code itself deserves. Most of the work is figuring out the unwritten rules, not writing the patch.

Sign extension bugs are the worst. Silent for ages then suddenly everything is on fire. Spent a lot of time in C doing low-level firmware work and ran into the same class of issue more than once. Nice writeup, congrats on the patch.

Great blog post. Using _BitInt typedefs for integers is a good option for anyone starting a fresh c project. It has worked well for me so far. _BitInt integers don’t promote to signed automatically like regular integers in c

Lovely article with a happy ending!

One thing that I am glad to have been taught early on in my career when it comes to debugging, especially anything involving HW, is to `make no assumptions'. Bugs can be anywhere and everywhere.

Well done and great writeup! Any idea why the bug hadn’t shown up sooner, like when running self tests?

Nice blogpost. Was an really interesting read. Would be interesting to read about the experience of getting the patch accepted and merged.

One thing I noticed: The last footnote is missing.

A big thanks for making the Linux kernel better!

> Since virtualization is hardware assisted these days

I was running Xen with full-hardware virtualization on consumer hardware in... 2006. I mean: some of us here were running hardware virt before some of the commenters were born. Just to put the "these days" into perspective in case some would be thinking it's a new thing.

Integer promotion rules in C are so deceptive.

I don't believe there's anybody who can reason about them at code skimming speeds. It's probably the best place to hide underhanded code.

Huge congrats on tracking that down and getting your first Linux kernel patch merged!

Congrats and happy for you, you had a lot of fun and did something genuinely interesting

I love these kind of posts.

Welcome to the club my friend! Its very exited. Very soon you will choose your favorate subsystem and double down on it.

Lovely article with a happy ending!

One thing that I am glad to have been taught early on in my career when it comes to debugging, especially anything involving HW, is to `make no assumptions'. Bugs can be anywhere and everywhere.

Well done and great writeup! Any idea why the bug hadn’t shown up sooner, like when running self tests?

Nice blogpost. Was an really interesting read. Would be interesting to read about the experience of getting the patch accepted and merged.

One thing I noticed: The last footnote is missing.

Integer promotion rules in C are so deceptive.

I don't believe there's anybody who can reason about them at code skimming speeds. It's probably the best place to hide underhanded code.

Huge congrats on tracking that down and getting your first Linux kernel patch merged!

Congrats and happy for you, you had a lot of fun and did something genuinely interesting

I love these kind of posts.

Everybody seems to be missing the forest for the trees on this.

There is absolutely no "sign extension" in the C standard (go ahead, search it). "Sign extension" is a feature of some assembly instructions on some architectures, but C has nothing to do with it.

> The integer promotions are performed on each of the operands. The type of the result is that of the promoted left operand. [...]

BTW this bug (invalid left shift of a signed integer) is common, sadly.

The root problem is actually that the C language allows implicit conversions from an unsigned type to a signed type and from a signed type to an unsigned type, and in certain contexts such implicit conversions are actually mandated by the standard, like in the buggy expression from the parent article.

It does not matter which is the relationship between the sizes of such types, there will always be values of the operand that cannot be represented in the result.

Saying that the behavior is sometimes undefined is not acceptable. Any implicit conversion of this kind must be an error. Whenever a conversion between signed and unsigned or unsigned and signed is desired, it must be explicit.

This may be the worst mistake that has ever been made in the design of the C language and it has not been corrected even after 50 years.

Making this an error would indeed produce a deluge of error messages in many carelessly written legacy programs, but the program conversion is trivial and it is extremely likely that many of these cases where the compilers do not signal errors can cause bugs in certain corner cases, like in the parent article.

It's incredibly common for people talking about C online or even in books (be that blog posts, side notes, tutorials, guides) to constantly make mistakes like these.

C seems to be one of those languages where people think they know it based on prior and adjacent experience. But it is not a language which can be learned based on experience alone. The language is full of cases where things will go badly wrong in a way which is neither obvious nor immediately evident. The negative side effects of what you did often only become evident long after you "learn" it as something you "can" do.

If you want to write C for anything where any security, safety, or reliability requirement needs to be met, you should commit to this strategy: Do not write any code which you are not absolutely certain you could justify the behaviour of by referencing the standard or (in the case of reliance on a specific definition of implementation defined, unspecified, or even (e.g. -ftrapv) undefined behaviour) the implementation documentation.

If you cannot commit to such a (rightfully mentally arduous) policy, you have no business writing C.

The same can actually be applied to C++ and Bash.

It was implementation defined for shifting negative numbers, but now the standard specifies twos-complement for this and all related IB

The first one always takes way longer than the code itself deserves. Most of the work is figuring out the unwritten rules, not writing the patch.

This is a big problem in open source that seems taboo to discuss.

In my opinion, unwritten rules are for gatekeeping. And if a new person follows all the unwritten rules, magically there's no one willing to review.

I think this is how large BFDL-style open source projects slowly become less and less relevant over the next few decades.

Can confirm that it also happens in other complex systems! Still a lot of good time and the novelty factor helps with pushing through

Sand that after so many years these rules are still not written down.

That makes me worry that you code actually has more issues because with small _BitInt you would run into signed overflow more often.

It's incredibly common for people talking about C online or even in books (be that blog posts, side notes, tutorials, guides) to constantly make mistakes like these.

If you cannot commit to such a (rightfully mentally arduous) policy, you have no business writing C.

The same can actually be applied to C++ and Bash.

Can confirm that it also happens in other complex systems! Still a lot of good time and the novelty factor helps with pushing through

That makes me worry that you code actually has more issues because with small _BitInt you would run into signed overflow more often.

Sand that after so many years these rules are still not written down.

A big thanks for making the Linux kernel better!

> Since virtualization is hardware assisted these days

It does not matter which is the relationship between the sizes of such types, there will always be values of the operand that cannot be represented in the result.

This may be the worst mistake that has ever been made in the design of the C language and it has not been corrected even after 50 years.

You could just use -Wsign-conversion.

It was implementation defined for shifting negative numbers, but now the standard specifies twos-complement for this and all related IB

While standard requires twos-complement we did not make all shift cases defined so far.

This is a big problem in open source that seems taboo to discuss.

In my opinion, unwritten rules are for gatekeeping. And if a new person follows all the unwritten rules, magically there's no one willing to review.

I think this is how large BFDL-style open source projects slowly become less and less relevant over the next few decades.

Agreed. The level of aggressive gatekeepers is just crazy, take Linux ARM mailing list for example. I found the Central and Eastern Europeans particularly aggressive there and I'm saying this as on myself. They sure do like to feel special there, with very little soft skills.

This will likely be alleviated when Ai first projects take over as important OSS projects.

Fir these projects everything "tribal" has to be explicitly codified.

On a more general note: this is likely going to have a rather big impact on software in general - the "engineer to company can not afford to loose" is likely loosing their moat entirely.

Welcome to the club my friend! Its very exited. Very soon you will choose your favorate subsystem and double down on it.

While standard requires twos-complement we did not make all shift cases defined so far.

This will likely be alleviated when Ai first projects take over as important OSS projects.

Fir these projects everything "tribal" has to be explicitly codified.

On a more general note: this is likely going to have a rather big impact on software in general - the "engineer to company can not afford to loose" is likely loosing their moat entirely.

You could just use -Wsign-conversion.

Obviously, that should always be used, like also the compiler options for checking integer overflow and accesses out-of-bounds.

However, this kind of implicit conversions must really be forbidden in the standard, because the correct program source is different from the one permitted by the standard.

When you activate most compiler options that detect undefined behaviors, the correct program source remains the same, even if the compiler now implements a better behavior for the translated program than the minimal behavior specified by the standard.

That happens because most undefined behaviors are detected at run time. On the other hand, incorrect implicit conversions are a property of the source code, which is always detected during compilation, so such programs must be rejected.

Obviously, that should always be used, like also the compiler options for checking integer overflow and accesses out-of-bounds.

However, this kind of implicit conversions must really be forbidden in the standard, because the correct program source is different from the one permitted by the standard.

The standard will not forbid anything that breaks billions of lines of code still be used and maintained.

But it is easy enough to use modern tooling and coding styles to deal with signed overflow. Nowadays, silent unsigned wrap around causing logic errors is the more vexing issue, which indicates the undefined behavior actually helps rather than hurts when used with good tooling.

The standard will not forbid anything that breaks billions of lines of code still be used and maintained.

Silent unsigned wrap around is caused by another mistake of the C language (and of all later languages inspired by C), there is only a single unsigned type.

The hardware of modern CPUs actually implements 5 distinct data types that must be declared as "unsigned" in C: non-negative integers, integer residues a.k.a. modular integers, bit strings, binary polynomials and binary polynomial residues.

A modern programming language should better have these 5 distinct types, but it must have at least distinct types for non-negative integers and for integer residues. There are several programming languages that provide at least this distinction. The other data types would be more difficult to support in a high-level language, as they use certain machine instructions that compilers typically do not know how to use.

The change in the C standard that was made so that now "unsigned" means integer residue, has left the language without any means to specify a data type for non-negative integers, which is extremely wrong, because there are more programs that use "unsigned" for non-negative integers than programs that use "unsigned" for integer residues.

The hardware of most CPUs implements very well non-negative integers so non-negative integer overflow is easily detected, but the current standard makes impossible to use the hardware.

Silent unsigned wrap around is caused by another mistake of the C language (and of all later languages inspired by C), there is only a single unsigned type.

The hardware of most CPUs implements very well non-negative integers so non-negative integer overflow is easily detected, but the current standard makes impossible to use the hardware.

There are other languages such as Ada that allow you to more precisely specify such things. Before requesting many new types for C, one should clarify why those languages did not already replace C.

I agree though that using "unsigned" for non-negative integers is problematic and that there should be a way to specify non-negative integers. I would be fine with an attribute.

The problem is also that the standard committee is not the ruling body of the C language. It is the place where people come together to negotiate some minimal requirements. If you want something, you need to first convince the compilers vendors to implement it as an extension.

There are other languages such as Ada that allow you to more precisely specify such things. Before requesting many new types for C, one should clarify why those languages did not already replace C.

I agree though that using "unsigned" for non-negative integers is problematic and that there should be a way to specify non-negative integers. I would be fine with an attribute.

Mar 19, 2026 Updated on Mar 22, 2026

How a sign-extension bug in C made me pull my hair out for days but became my first patch to the Linux kernel!

Intro

A while ago, I started dipping my toe into virtualization. It's a topic that many people have heard of or are using on a daily basis but a few know and think about how it works under the hood.

I like to learn by reinventing the wheel, and naturally, to learn virtualization I started by trying to build a Type-2 hypervisor. This approach is similar to how KVM (Linux) or bhyve (FreeBSD) are built.

My experimental hypervisor (and VMM) is still a work-in-progress and is available on my Github: pooladkhay/evmm.

Since virtualization is hardware assisted these days 1, the hypervisor needs to communicate directly with the CPU by running certain privileged instructions; which means a Type-2 hypervisor is essentially a Kernel Module that exposes an API 2 to the user-space where a Virtual Machine Monitor (VMM) 3 like QEMU or Firecracker is running and orchestrating VMs by utilizing that API.

In this post, I want to describe exactly how I found that bug. But to make it a bit more educational, I'm going to set the stage first and talk about a few core concepts so you can see exactly where the bug emerges.

x86 Task State Segment (TSS)

The x86 architecture in protected mode (32-bit mode) envisions a task switching mechanism that is facilitated by the hardware. The architecture defines a Task State Segment (TSS) which is a region in the memory that holds information about a task (General purpose registers, segment registers, etc.). The idea was that any given task or thread would have its own TSS, and when the switch happens, a specific register (Task Register or TR) would get updated to point to the new task 4.

This was abandoned in favor of software-defined task switching which gives more granular control and portability to the operating system kernel.

But the TSS was not entirely abandoned. In modern days (64-bit systems) the kernel uses a TSS-per-core approach where the main job of TSS is to hold a few stack pointers that are very critical for the kernel and CPU's normal operation. More specifically, it holds the kernel stack of the current thread which is used when the system wants to switch from user-space to the kernel-space.
It also holds a few known good stacks for critical events like Non-Maskable Interrupts (NMIs) and Double Faults. These are events that if not handled correctly, can cause a triple fault and crash a CPU core or cause an immediate system reboot.

We know that memory access is generally considered to be expensive and caching values somewhere on the CPU die is the preferred approach if possible. This is where the TR register comes into the picture. It has a visible part which is a 16-bit offset that we have already discussed as well as a hidden part that holds direct information about the TSS (Base address, Limit, and Access rights). This saves the CPU the trouble of indexing into the GDT to eventually find the TSS every time it's needed.

Why do hypervisors care?

A hypervisor is essentially a task switcher where tasks are operating systems. In order for multiple operating systems to run on the same silicon chip, the hypervisor must swap the entire state of the CPU which includes updating the hidden part of the TR register as well.

In a previous blog post 1 I described how Intel implemented their virtualization extension (VT-x) and how each vCPU (vCore) is given its own VMCS (Virtual Machine Control Structure) block where its state is saved to or restored from by the hardware when switching between host and guest OSes.

I suggest reading that post if you're interested in the topic but VMCS consists of four main areas:

Host-state area
Guest-state area
Control fields
VM-exit information area

Host-state area has two fields which correspond to the visible part and one of the hidden parts (base address) the TR register:

HOST_TR_SELECTOR (16 bits)
HOST_TR_BASE (natural width 5)

While guest-state area has four (one visible plus all three hidden parts):

GUEST_TR_SELECTOR (16 bits)
GUEST_TR_BASE (natural width 5)
GUEST_TR_LIMIT (32 bits)
GUEST_TR_ACCESS_RIGHTS (32 bits)

The reason is that the hardware assumes the host OS to be a modern 64-bit operating system where TR limit and Access Rights are fixed known values (0x67 and 0x11 respectively). But the guest OS can be virtually any operating system with any constraints.

Naturally, it is the hypervisor's job to set these values on initial run and to update them when needed (e.g. when the kernel thread that is running a vCPU is migrated to another physical CPU core, the hypervisor must update the host state to match the new core).

To set these values, I "borrowed" some code from the linux kernel tree (KVM selftests):

vmwrite(HOST_TR_BASE,
		get_desc64_base((struct desc64 *)(get_gdt().address + get_tr())));

This piece of code does the following:

Gets the address of GDT.
Indexes into it using the value of TR register.
Parses the TSS segment descriptor and extracts the memory address of TSS.
Writes the address into the HOST_TR_BASE section of the VMCS using the special VMWRITE instruction 6.

So far, so good!

If for any reason this operation fails to extract and write the correct address, upon the next context switch from user-space to kernel-space (or next NMI or next Double fault), when the CPU hardware tries to read the kernel stack from the TSS to update the Stack Pointer register, it either receives garbage or an unmapped address. Either way, the CPU will eventually face a double fault (a fault that happens when trying to handle another fault like a page fault) and when trying to use one of the known good stacks for handling the double fault, it will fail again which will make it a triple fault and BOOM! The core dies or we get a sudden reboot.

Symptoms

Now lets talk about the issue that I was facing.

I started developing my hypervisor on a virtualized instance of Fedora, to avoid crashing my machine in case something went wrong. By the time I realized something is indeed wrong, I had already developed the ability to put the CPU in VMX operation, run a hardcoded loop in VMX non-root mode that would use the VMCALL instruction to trap into the hypervisor (VMX root) and ask it to print a message, then resume the loop (VMRESUME).

Additionally, VMCS was programmed to trap external interrupts (e.g. timer ticks). Upon an exit, the hypervisor would check if we (the current kernel thread) needs to be rescheduled, keeping the kernel scheduler happy.

I was using preempt notifier api which lets threads provide two custom functions (sched_in and sched_out) that are called by the scheduler when it's about to deschedule the thread as well as right before rescheduling it. These functions are then responsible for cleanups and initialization work that is required.

In my case, sched_out would unload the VMCS from the current core, and sched_in would load it on the new core 7 while reinitializing it using a series of VMWRITEs 6 to match the new core's state.

On my virtualized dev environment with only three vCPUs, everything was working just fine. Until I decided to give it a try on my main machine 8 where the hypervisor would talk to an actual physical CPU.

And BOOM!

Seconds after running the loop, the system crashed, in a very unpredictable way. I was logging the core switches and didn't find any meaningful correlation between the last core number and the crash. Additionally, sometimes it would last longer and sometimes it was immediate. After investigating kernel logs a few times, I saw a pattern in the sequence of events that caused the system to eventually hang:

The Fatal VM-Exit: An NMI triggered a VM-Exit on CPU 5 and naturally the hardware tried to locate a valid kernel stack from TSS to handle the privilege transition.
Core Death: CPU 5 hit a fatal Page Fault attempting to read an unmapped memory address, resulting in a Kernel Oops. CPU 5 was left completely paralyzed with interrupts disabled.
IPI Lockup: CPU 6 attempted a routine system-wide update (kernel text patching) requiring an Inter-Processor Interrupt (IPI) acknowledgment from all cores. CPU 6 became permanently stuck in an infinite loop waiting for the dead CPU 5 to respond.
Cascading Paralysis: As other cores (3, 8, 11, etc.) attempted standard cross-core communications (like memory map TLB flushes and RCU synchronizations), they too fell into the IPI trap, waiting indefinitely for CPU 5.
Terminal State: The RCU subsystem starved, peripheral drivers (like Wi-Fi) crashed from timeouts, and the system entered a total, unrecoverable deadlock.

So why no triple faults?!

The Kernel Oops killed the active task and halted operations on CPU 5. However, it left CPU 5 in a "zombie" state. Alive enough to keep the motherboard powered on, but with its interrupts disabled, making it entirely unresponsive to the rest of the system.

The quest

Soon I realized that the hypervisor works absolutely fine 9 when pinned to one core (e.g. via taskset command), so there must be something happening while moving between cores. Additionally, I didn't dare to question the code I stole from the Linux kernel source, and I was trying hard to find an issue in the code I wrote myself. This eventually led to rewriting a portion of the hypervisor code with an alternative method which would achieve the same goal.

For example, from reading Intel's Software Developer Manual (SDM) 10, I knew that when moving from core A to core B, core A must run the VMCLEAR instruction to unload the VMCS, and only then can core B load the VMCS using the VMPTRLD to be able to execute the guest code. For that, I was using smp_call_function_single which relies on IPIs to run a piece of code on another CPU, that I replaced with the preempt notifiers.

Eventually, (while pulling my hair out) I realized I have eliminated all possible parts of the hypervisor that played a role in moving between cores.

Then there was another clue!

While running the hypervisor on my virtual dev environment (QEMU + Fedora) I observed that by increasing the number of vCores, I can reproduce the issue and there is also a new behavior. Sometimes the VM reboots immediately (instead of freezing) and after the reboot, there is no trace of any logs related to the previous session. And I concluded that a triple fault has happened.

This turned my attention to the TR and TSS. I started looking for alternative ways of setting the HOST_TR_BASE and realized that the KVM itself (not KVM selftests) uses a different method:

/*
 * Linux uses per-cpu TSS and GDT, so set these when switching
 * processors.  See 22.2.4.
 */
vmcs_writel(HOST_TR_BASE, (unsigned long)&get_cpu_entry_area(cpu)->tss.x86_tss);

And that was it! Using this method to set HOST_TR_BASE fixed my hypervisor and helped me keep whatever sanity I had left.

The smoking gun

Remember that piece of code I took from the kernel source. It used the get_desc64_base function to extract and write the address of TSS into the HOST_TR_BASE. This function has this definition:

static inline uint64_t get_desc64_base(const struct desc64 *desc)
{
    return ((uint64_t)desc->base3 << 32) |
		(desc->base0 | ((desc->base1) << 16) | ((desc->base2) << 24));
}

TSS segment descriptor has four fields that must be stitched together to form the address of the TSS 11.

base0 is uint16_t.
base1 is uint8_t.
base2 is uint8_t.
base3 is uint32_t.

The C standard 12 dictates Integer Promotion. Whenever a type smaller than an int is used in an expression, the compiler automatically promotes it to a standard int (which is a 32-bit signed integer on modern x86-64 architectures) before performing the operation.

If an int can represent all values of the original type (as restricted by the width, for a bit-field), the value is converted to an int; otherwise, it is converted to an unsigned int. These are called the integer promotions. All other types are unchanged by the integer promotions.

— Section 6.3.1.1

This promotion has a consequence: if the resulting value after promotion has a 1 in its most significant bit (32nd bit), this value considered negative by the compiler and if casted to a larger type like a uint64_t in our case, sign extension happens.

Lets see an example:

We have an 8-bit unsigned integer (uint8_t) with 11001100 bit pattern. If we left-shift it by 24, it still can be represented by an int which is 32 bits long. So the compiler generates this value: 11001100000000000000000000000000 and considers it to be an int which is a signed type.

Now if we try to perform any operation on this value, it would follow the protocol for signed values. In our case, we are ORing it with a uint64_t. So the compiler would cast our int (a 32-bit signed value) into uint64_t (a 64-bit unsigned value), which is where the sign-extension happens which would turn our value to 11111111111111111111111111111111_11001100000000000000000000000000 before OR happens.

Saw the problem?
Because the upper 32 bits are sign-extended to all 1s (Hex: 0xFFFFFFFF), the bitwise OR operation completely destroys base3 (In a bitwise OR, 1 | X equals 1). Therefore, whatever data was in base3 is permanently overwritten by the 1s from the sign extension.

Here is an actual example with "real" addresses:

base0 = 0x5000
base1 = 0xd6
base2 = 0xf8
base3 = 0xfffffe7c

Expected return: 0xfffffe7cf8d65000
Actual return:   0xfffffffff8d65000

This also explains when the problem would happen: Only and only if base2 has a 1 as its most significant bit. Any other value would not corrupt the resulting address.

hide-the-pain

The Kernel patch

The fix is actually very simple. We must cast values to unsigned types before the bit-shift operation:

static inline uint64_t get_desc64_base(const struct desc64 *desc)
{
    return (uint64_t)desc->base3 << 32 |
           (uint64_t)desc->base2 << 24 |
           (uint64_t)desc->base1 << 16 |
           (uint64_t)desc->base0;
}

This will prevent the sign-extension from happening.

Finally, this is the patch I sent, which was approved and merged:
https://lore.kernel.org/kvm/20251222174207.107331-1-mj@pooladkhay.com/

Outro

I can't finish this post without talking about AI!

You may wonder whether I tried asking an LLM for help or not. Well, I did. In fact it was very helpful in some tasks like summarizing kernel logs [^13] and extracting the gist of them. But when it came to debugging based on all the clues that were available, it concluded that my code didn't have any bugs, and that the CPU hardware was faulty.

CASE CLOSED.

Hypervisor and Virtual Machine Monitor (VMM) are generally interchangeable terms, while some might differentiate them slightly (e.g. VMM as user-space part of a kernel-space hypervisor).

TR register does not directly point to TSS. It holds an offset that is used to index into a region of memory called the Global Descriptor Table (GDT). This offset is where the TSS segment descriptor lives, and is the entity that actually holds the address of the TSS.
At this point I hope you're asking "WTF Intel?!"
Well, these design decisions were made back in the 80s where memory was scarce, paging hadn't been fully adopted yet, and segmentation was "the way" of managing memory and privilege levels.

32 bits on 32-bit machines and 64 bits on 64-bit machines.

It's not possible to write to and read from the VMCS using usual memory read and write operations. There are special instructions to do so: VMREAD and VMWRITE.

Yes, this path must be optimized since this loading and unloading is relatively heavy. And hypervisors usually pin threads to cores to avoid paying this fee.

It's an Intel Core i7-12700H with 14 Cores (6 Performance, 8 Efficient) and a total of 20 threads.

Looking back, that was purely luck. Continue reading to know why...

Volume 3C of the SDM covers the virtual machine extension (VMX).

Another remnant of old hardware design that is kept for backward compatibility purposes, but "WTF Intel?!" indeed.

Hacker Times