Emacs internals: Tagged pointers vs. C++ std:variant and LLVM (Part 3)

Happy to see discussion of LLVM's interesting implementation of Static Polymorphism using CRTP. Some recommended reads:

It's not clear to me (and as an unsafe language it's not called out by your compiler if you do something illegal) what the correct way to spell this kind of trick is in C++

I had thought you need the pointer-sized integer types and mustn't do this directly to an actual pointer, but maybe I was wrong (in theory, obviously practice doesn't follow but that's a dangerous game)

Emacs internal part 2 HN link:

https://news.ycombinator.com/item?id=47259961

LLVM now has another way to implement RTTI using the `CastInfo` trait instead of `classof`: https://llvm.org/doxygen/structllvm_1_1CastInfo.html

But it's really just an implementation difference, the idea is still to have a lightweight RTTI.

Happy to see discussion of LLVM's interesting implementation of Static Polymorphism using CRTP. Some recommended reads:

1. https://en.wikipedia.org/wiki/Curiously_recurring_template_p...

2. https://david.alvarezrosa.com/posts/devirtualization-and-sta...

3. https://llvm.org/docs/ProgrammersManual.html#the-isa-cast-an...

LLVM now has another way to implement RTTI using the `CastInfo` trait instead of `classof`: https://llvm.org/doxygen/structllvm_1_1CastInfo.html

But it's really just an implementation difference, the idea is still to have a lightweight RTTI.

Emacs internal part 2 HN link:

https://news.ycombinator.com/item?id=47259961

It's not clear to me (and as an unsafe language it's not called out by your compiler if you do something illegal) what the correct way to spell this kind of trick is in C++

Doing bitwise operations directly on raw pointers is a fast track to Undefined Behavior in standard C/C++. Emacs gets away with it largely due to its age, its heavy reliance on specific GCC behaviors/extensions, and how its build system configures compiler optimizations.

In modern C++, the technically "correct" and safe way to spell this trick is exactly as you suggested: using uintptr_t (or intptr_t).

The idiomatic way to safely do pointer tagging in C++ works through uintptr_t.

If you don't care about portability or using every theoretically available bit then it is trivial. A maximalist implementation must be architecture aware and isn't entirely knowable at compile-time. This makes standardization more complicated since the lowest common denominator is unnecessarily limited.

In C++ this really should be implemented through a tagged pointer wrapper class that abstracts the architectural assumptions and limitations.

Do the way LLVM does it.

The idiomatic way to safely do pointer tagging in C++ works through uintptr_t.

In C++ this really should be implemented through a tagged pointer wrapper class that abstracts the architectural assumptions and limitations.

In modern C++, the technically "correct" and safe way to spell this trick is exactly as you suggested: using uintptr_t (or intptr_t).

There’s a paper in flight to add a stdlib type to handle pointer tagging as well while preserving pointer provenance and so-forth. It’s currently best to use the intptr types, but the goal is to make it so that an implementation can provide specializations based on what bits of a pointer are insignificant, or even ignored, on a given target without user code having to be specialized. Not sure where it has landed since discussion in SG1 but seemed like a good idea.

Do (u)intptr_t preserve provenance? Or does this count as exposed provenance when you convert back and forth?

Maybe that is not the correct C++ terminology, I'm more familiar with how provenance works in Rust, where large parts of it got stabilised a little over a year ago. (What was stabilised was "strict provenance", which is a set of rules that if you abide them will definitely be correct, but it is possible the rules might be loosened in the future to be more lenient.)

https://doc.rust-lang.org/std/ptr/index.html#provenance

Is there a similar solution to doing this in Rust? I suppose inside `unsafe` you can do basically anything.

Do the way LLVM does it.

Do (u)intptr_t preserve provenance? Or does this count as exposed provenance when you convert back and forth?

https://doc.rust-lang.org/std/ptr/index.html#provenance

Well, C++ does not have any promises about how Pointer Provenance works, so AFAIK the answer is "mu" meaning that's a bad question, don't ask that.

But the likely destiny of C++ is to inherit the provenance rules that are an adjunct to C23, PNVI-ae-udi, Provenance Not Via Integers, Addresses Exposed, User Disambiguates

As that name suggests, in this model provenance is not transmitted via integers. Every 123456 is always just the integer 123456 and there aren't magic 123456 values which are different and transmit some form of provenance from a pointer to some value which happened perhaps to be stored at address 123456 in memory.

However, PNVI-ae-udi has Exposure, which means if we exposed the pointer in an approved way then the associated provenance is somehow magically "out there" in the ether, as a result if we have exposed this pointer then just having that integer 123456 works fine because we combined that integer 123456 with that provenance from the ether and make a working pointer. User disambiguation means that the compiler has to give you "benefit of the doubt" e.g. if you could mean to make a pointer to that Doodad which no longer exists as of a minute ago or to this other Doodad which does exist, well, benefit of the doubt means it was the latter and so your pointer is valid even though the addresses of both Doodads were the same.

Pointer provenance is not properly defined in C or C++. (There is a C TS that introduces pointer provenance, but it's not part of the main standard).

The problem of pointer provenance is more finding a workable theoretical model rather than one causing miscompiles on realistic code. While there are definitely miscompiles on carefully constructed examples, I'm not aware of any bugs on actual code. This is in comparison to topics like restrict(/noalias) semantics or lifetime semantics, where there is a steady drip of bug reports that turn out to be actual optimization failures.

Given you aren't sure since SG1 this might be useless but... do you have a paper number? Or, more likely, know an author's name ?

Is there a similar solution to doing this in Rust? I suppose inside `unsafe` you can do basically anything.

Unlike C++ all of Rust's primitive types get the same first class treatment as your user defined types and so the appropriate API is provided as methods on pointer types. For this you want ptr::map_addr which takes a callable (such as your own function for this mapping or a lambda) to fiddle with the pointer.

https://doc.rust-lang.org/std/primitive.pointer.html#method....

Rust's MIRI is able to run code which uses this (a strict provenance API) because although MIRI's pointers are some mysterious internal type, it can track that we mapped them to hide our tags, and then later mapped back from the tagged pointer to recover our "real" pointer and see that's fine.

This isn't an unsafe operation. Dereferencing a pointer is unsafe, but twiddling the bits is fine, it just means whoever writes the unsafe dereferencing part of your codebase needs to be very careful about these pointers e.g. making sure the ones you've smuggled a tag in aren't dereferenced 'cos that's Undefined Behaviour.

It's clear to me how this works in Rust, it's just unclear still in C++

Everything else in the siblings is true, but remember that the language and std types in rust all do this already. Most of the time it’s better to use a native enum or optional/result because they do this in the compiler/lib. It’s only really worth it if you need more than a few types or need precise control of the representation for C interop or something.

Rust is basically in the same place as C++, i.e. provenance rules are currently ad-hoc/conventional, meaning that pointer tagging is a grey area.

Waiting for Rust experts.

Given you aren't sure since SG1 this might be useless but... do you have a paper number? Or, more likely, know an author's name ?

It’s Hana Dusikova’s paper IIRC.

Seems like its p3125r0

Waiting for Rust experts.

https://doc.rust-lang.org/std/primitive.pointer.html#method....

It's clear to me how this works in Rust, it's just unclear still in C++

Pointer provenance is not properly defined in C or C++. (There is a C TS that introduces pointer provenance, but it's not part of the main standard).

To expand on the sibling answer: sort of! Rust will do niche optimisation, but for references and NonNull pointers this is limited to "the value 0 is invalid and can thus be used as a niche". But Rust does not (currently) take advantage of alignment niches in pointers. Nor does it use high bit on architectures where you know your whole theoretical address space isn't actually in use.

Is doing that manually worth it? Usually not, but for some core types (classical example is strings) or in language runtimes it can be.

Would it be awesome if this could be done automatically? Absolutely, but I understand it is a large change, and the plan is to later build upon the pattern types that are currently work in progress (and would allow you to specify custom ranged integer typed).

I mean, kinda, sorta? Rust's guaranteed niche optimisation means Option<&T> [which might be Some(&T) or just None] is promised to be the same size in memory as &T the reference to a T

So that's one tiny use of this sort of idea which is guaranteed unnecessary in Rust, and indeed although it isn't guaranteed the optimiser will typically spot less obvious opportunities so that Option<Option<bool>> which might be None, or Some(None) or Some(Some(true)) or Some(Some(false)) is the same size (one byte) as bool.

But hiding stuff in a pointer is applicable in places your Rust compiler won't try to take advantage unless you do something like this. A novel tiny String-like type I saw recently does this, https://crates.io/crates/cold-string ColdString is 8 bytes, if your text is 8 or fewer bytes of UTF-8 then you're done, that'll fit, but, if you have more text ColdString allocates on the heap to store not only your text but also its length and so it needs to actually "be" in some sense a raw pointer to that structure, but if the string is shorter that pointer is nonsense, we've hidden our text in the pointer itself.

Implementation requires knowing how pointers work, and how UTF-8 encoding works. I actually really like one of the other Rust tiny strings, CompactString but if you have a lot of very small strings (e.g. UK postcodes fit great) then ColdString might be as much as three times smaller than your existing Rust or C++ approach and it's really hard to beat that for such use cases.

Edited: To remove suggestion ColdString has a distinct storage capacity, this isn't intended as a conventional string buffer, it can't grow after creation

Well, C++ does not have any promises about how Pointer Provenance works, so AFAIK the answer is "mu" meaning that's a bad question, don't ask that.

But the likely destiny of C++ is to inherit the provenance rules that are an adjunct to C23, PNVI-ae-udi, Provenance Not Via Integers, Addresses Exposed, User Disambiguates

> But the likely destiny of C++ is to inherit the provenance rules that are an adjunct to C23, PNVI-ae-udi, Provenance Not Via Integers, Addresses Exposed, User Disambiguates

There's a competing proposal in C++ land to add provenance via angelic nondeterminism: if there's some provenance that makes the code non-UB, then use that provenance. (As you might imagine, I'm not a big fan of that proposal, but WG21 seems to love it a lot more than I do.)

Rust is basically in the same place as C++, i.e. provenance rules are currently ad-hoc/conventional, meaning that pointer tagging is a grey area.

Nope. Rust stabilized strict provenance over a year ago. Some details about aliasing aren't tied down, but so long as you can obey the strict provenance rules you're golden today in Rust to hide flags in pointers etc.

https://blog.rust-lang.org/2025/01/09/Rust-1.84.0/#strict-pr...

It’s Hana Dusikova’s paper IIRC.

Is doing that manually worth it? Usually not, but for some core types (classical example is strings) or in language runtimes it can be.

Seems like its p3125r0

Thanks! https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p31...

(is the current version of that paper, the tracking ticket insisted there's a P3125R5 and that LEWG had seen it in 2025, but it isn't listed in a mailing so it might be a mirage)

You know it's a Hana paper because it wants this to be allowed at compile time (C++ constrexpr) but joking aside this seems like a nice approach for C++ which stays agnostic about future implementation details.

I mean, kinda, sorta? Rust's guaranteed niche optimisation means Option<&T> [which might be Some(&T) or just None] is promised to be the same size in memory as &T the reference to a T

Edited: To remove suggestion ColdString has a distinct storage capacity, this isn't intended as a conventional string buffer, it can't grow after creation

> But the likely destiny of C++ is to inherit the provenance rules that are an adjunct to C23, PNVI-ae-udi, Provenance Not Via Integers, Addresses Exposed, User Disambiguates

https://blog.rust-lang.org/2025/01/09/Rust-1.84.0/#strict-pr...

But LLVM's optimizations aren't sound and this affects Rust too.

Thanks! https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p31...

(is the current version of that paper, the tracking ticket insisted there's a P3125R5 and that LEWG had seen it in 2025, but it isn't listed in a mailing so it might be a mirage)

But LLVM's optimizations aren't sound and this affects Rust too.

March 12, 2026 [systems-programming] #tagged-pointer #fat-pointer #custom-rtti #struct-embedding

Recap

From the previous article, we examined how GNU Emacs represents every Lisp value — integers, symbols, cons cells, strings, buffers — inside a single 64-bit slot called Lisp_Object. Because all heap objects are 8-byte aligned and the lowest 3 bits of any valid pointer are always zero, Emacs reclaims these "free" bits and uses them as a type tag.

The more fundamental question is: when a single variable must hold values of different types at runtime, how do we preserve enough information to use that data correctly?

01. Back to the basics: how to write a polymorphic type

The static typing

Given memory and a struct type, how do we access the data?

Data in memory is represented as bits. To operate on it, we define its shape. If we modify the value of score, we load the data from base + 4 bytes, write a 32-bit value, and store it back.

#include <stdint.h>

struct Person {
    uint8_t  age;         offset  0, size 1
     [3 bytes padding]    offset  1  (align next field to 4)
    uint32_t score;       offset  4, size 4
    uint64_t id;          offset  8, size 8
    char     name[12];    offset 16, size 12
};
// sizeof(Person) == 28, no trailing padding needed

The compiler remembers the shape of the data so the runtime does not have to.

Dynamic typing

Given a memory address and a set of possible types, how do we access the data?

One approach is to add a field at the beginning of a struct to indicate the active type.

struct TaggedValue {
    int type_id;         which type is active?
    union {
        TypeA a;                TypeB b;                TypeC c;                    } payload;
};

This allows type checking before casting:

// predicate: is this a TypeA?
bool is_type_a(struct TaggedValue* v) {
    return v->type_id == TYPE_A;   just an integer comparison
}

// check cast: give me TypeA* or NULL
TypeA* as_type_a(struct TaggedValue* v) {
    if (v->type_id == TYPE_A) {
        return &v->payload.a;      }
    return NULL;   wrong type
}

Functions can be dispatched according to this tag.

// Each function pointer represents "what to do when you encounter this type".
// A Visitor bundles all these handlers together into one struct.
struct Visitor {
    void (*visit_a)(TypeA* a);   called when type_id == TYPE_A
    void (*visit_b)(TypeB* b);   called when type_id == TYPE_B
    void (*visit_c)(TypeC* c);   called when type_id == TYPE_C
};

// The dispatch function is the bridge between data and behavior.
// It reads the type_id, then hands control to the correct handler.
void visit(struct TaggedValue* v, struct Visitor* visitor) {
    switch (v->type_id) {
        case TYPE_A: visitor->visit_a(&v->payload.a); break;
        case TYPE_B: visitor->visit_b(&v->payload.b); break;
        case TYPE_C: visitor->visit_c(&v->payload.c); break;
                    }
}

Now you have invented C++ std::variant and std::visit that were introduced in C++17, which utilize a "tagged union."

02. Tagged Union (Unboxed): `std::variant` and `std::visit`

People claim std::variant and std::visit provide a more "TYPE SAFE" way, but in fact they just provide some checks. It simply ensures that an invalid cast like (TypeA*) type_b_object is caught either at compile time (if the type is not in the variant at all) or at runtime (if the active type does not match), rather than silently producing undefined behavior as the hand-written version would.

There are two things that need attention here:

First, the problem with a tagged union is that the size of the struct is the size of the largest union element. So for the example below, even if most objects are bool or int, the overall size will be 64 bytes, which is very memory inefficient. Therefore, tagged union technique is mostly applied where different sizes of different types are similar, or temporary objects on stack that can be freed after this call.

PS. std::variant is originally designed to handle a closed set of known types in a type-safe manner — similar in spirit to Haskell's Either, where a value is one of several explicitly listed possibilities. The closest C++ analogue to Haskell's Maybe is actually std::optional.

struct TaggedValue {
    int type_id;
    union {
        int a;                 bool b;                char c[64];                } payload;
};

Second, this way of handling data and types is called a "tagged union", or "unboxed".

The naming of boxed and unboxed is a programming language (PL) term that looks very weird to C/C++ programmers. The "unboxed" way looks actually wrapped in a struct box. But the actual meaning in PL theory refers to memory indirection.

Boxed: The data lives on the heap (inside a "box"). You only hold a reference or pointer to it.
Unboxed: The raw bits of the data are laid out flat right where they are declared (on the stack, or inline within an array). There are no pointers to follow. std::variant is unboxed because all the bytes required for the largest possible variant are allocated inline right there.

boxed-vs-unboxed

References on memory representation:

03. Tagged Pointer (Boxed)

A tagged union allocates the maximum required size for every variant. If an Abstract Syntax Tree (AST) contains an integer node requiring 8 bytes and a function definition node requiring 256 bytes, an unboxed tagged union allocates 256 bytes for every integer node. This affects the overall memory footprint and cache usage.

An alternative approach keeps data on the heap (boxed) and uses the lowest 3 bits of the 8-byte aligned pointer as the type tag. This provides $2^3 = 8$ possible types without allocating additional inline bytes. This is called a tagged pointer.

struct tagged_pointer {
    void* pointer_and_tag;
}
// sizeof(tagged_pointer) == 8 bytes

What if there are more than 8 types?

One common solution is using fat pointer. Modern languages like Go (interfaces) and Rust (trait objects) use this approach extensively.

Stealing 3 bit is not enough, so, just add a 64-bit space to store the tag information.

struct fat_pointer {
    int tag;         Could be a type ID, a vtable pointer, or a size/length
    void* payload;   Pointer to the actual data on the heap
}
// sizeof(fat_pointer) == 16 bytes (doubled)

"Since both Go and Rust think fat pointers are great, why doesn't GNU Emacs just use fat pointers for Lisp_Object?"

Comparing fat pointers to tagged pointers: A fat pointer is smaller than a tagged union, but it doubles the memory size compared to a 64-bit tagged pointer. This changes the memory footprint and the Garbage Collector (GC) scanning workload. Emacs was designed to keep the Lisp_Object within a single 64-bit word.

PS. It's a crime to waste memory in the 1980s — a typical workstation had around 256 KB of RAM, which is less than the size of a single modern emoji in a Unicode font file. Doubling every Lisp_Object from 8 to 16 bytes wasn't an engineering tradeoff. It was a confession.

04. Emacs: Tagged Pointer + Poor Man's Inheritance

Since GNU Emacs uses the lowest 3 bits for tags, it is strictly limited to 8 fundamental types. If you look at enum Lisp_Type in src/lisp.h, you'll see exactly that:

Lisp_Symbol
Lisp_Int0
Lisp_Int1
Lisp_String
Lisp_Vectorlike
Lisp_Cons
Lisp_Float (plus one unused type)

McCarthy's Lisp (1960) abstract math atom eq car cdr cons quote cond │ │ Emacs engineers bridge: │ "statically typed C must represent │ dynamically typed Lisp" ▼ Lisp_Object (src/lisp.h) C layer ┌──────────────────────┬────┐ │ pointer or value │tag │ ← one machine word (64 bits) │ 61 bits │ 3b │ └──────────────────────┴────┘ │ ├─ tag = Cons → struct Lisp_Cons ├─ tag = String → struct Lisp_String ├─ tag = Float → struct Lisp_Float ├─ tag = Int0/1 → EMACS_INT (immediate value, no pointer!) ├─ tag = Symbol → struct Lisp_Symbol └─ tag = Vectorlike │ ▼ union vectorlike_header (The Escape Hatch) │ ├─ PVEC_BUFFER → struct buffer ├─ PVEC_WINDOW → struct window ├─ PVEC_HASH_TABLE → struct Lisp_Hash_Table ├─ PVEC_SYMBOL_WITH_POS → struct Lisp_Symbol_With_Pos └─ ... (over 30+ complex types!)

A basic Lisp interpreter might only need a few primitive types. But Emacs is a text editor designed for performance (?!), so it implements its core objects (like buffers and windows) directly in C. So, 8 types are not enough.

How can we represent more types when the pointer tag is limited to 3 bits?

To expand the type space, Emacs uses a C pattern sometimes called "Poor Man's Inheritance" or struct embedding. By placing a common header as the very first field of a struct, a pointer can be cast to the header type, checked for its sub-type, and then cast to the specific object type.

When Lisp_Object carries the Lisp_Vectorlike tag, it points to a struct starting with union vectorlike_header. Emacs reads this header to find a sub-tag indicating the specific type (pvec_type).

This is how vectorlike_header is embedded:

// Source: lisp.h
struct Lisp_Vector // The generic "Base Data Structure"
{
  union vectorlike_header header;
  Lisp_Object contents[FLEXIBLE_ARRAY_MEMBER];
} GCALIGNED_STRUCT;

// Derived type Example
struct Lisp_Symbol_With_Pos
{
  union vectorlike_header header;  <--- The "Base Class" must be the first field!
  Lisp_Object sym;                 A symbol 
  Lisp_Object pos;                 A fixnum 
} GCALIGNED_STRUCT;

When Emacs identifies a Lisp_Object, it performs two checks:

Primary Tag Check: Check if the lowest 3 bits are Lisp_Vectorlike (0b101).
Sub-Tag Check: Cast the pointer to union vectorlike_header*, read the TYPE field. If it equals PVEC_SYMBOL_WITH_POS, cast the pointer to struct Lisp_Symbol_With_Pos*.

The definition of union vectorlike_header packs the subtype into its size field:

// Source: src/lisp.h
union vectorlike_header
  {
     The `size' header word, W bits wide, has one of two forms
       discriminated by the second-highest bit (PSEUDOVECTOR_FLAG):

         1   1                    W-2
       +---+---+-------------------------------------+
       | M | 0 |                 SIZE                |  vector
       +---+---+-------------------------------------+

         1   1    W-32      6       12         12
       +---+---+--------+------+----------+----------+
       | M | 1 | unused | TYPE | RESTSIZE | LISPSIZE |  pseudovector
       +---+---+--------+------+----------+----------+

       M (ARRAY_MARK_FLAG) holds the GC mark bit.

       SIZE     is the length (number of slots) of a regular Lisp vector,
                and the object layout is struct Lisp_Vector.

       TYPE     is the pseudovector subtype (enum pvec_type).

       LISPSIZE is the number of Lisp_Object fields at the beginning of the
                object (after the header).  These are always traced by the GC.

       RESTSIZE is the number of fields (in word_size units) following.
                These are not automatically traced by the GC.
                For PVEC_BOOL and statically allocated PVEC_SUBR, RESTSIZE is 0.
                (The block size for PVEC_BOOL is computed from its own size
                field, to avoid being restricted by the 12-bit RESTSIZE field.)
    
    ptrdiff_t size;
  };

And the following contains all the concrete pvec_type sub-types it can represent:

// Source: src/lisp.h
enum pvec_type
{
  PVEC_NORMAL_VECTOR,   Should be first, for sxhash_obj.  
  PVEC_FREE,
  PVEC_BIGNUM,
  PVEC_MARKER,
  PVEC_OVERLAY,
  PVEC_FINALIZER,
  PVEC_SYMBOL_WITH_POS,
  PVEC_MISC_PTR,
  PVEC_USER_PTR,
  PVEC_PROCESS,
  PVEC_FRAME,
  PVEC_WINDOW,
  PVEC_BOOL_VECTOR,
  PVEC_BUFFER,
  PVEC_HASH_TABLE,
  PVEC_OBARRAY,
  PVEC_TERMINAL,
  PVEC_WINDOW_CONFIGURATION,
  PVEC_SUBR,
  PVEC_OTHER,             Should never be visible to Elisp code.  
  PVEC_XWIDGET,
  PVEC_XWIDGET_VIEW,
  PVEC_THREAD,
  PVEC_MUTEX,
  PVEC_CONDVAR,
  PVEC_MODULE_FUNCTION,
  PVEC_NATIVE_COMP_UNIT,
  PVEC_TS_PARSER,
  PVEC_TS_NODE,
  PVEC_TS_COMPILED_QUERY,
  PVEC_SQLITE,

   These should be last, for internal_equal and sxhash_obj.  
  PVEC_CLOSURE,
  PVEC_CHAR_TABLE,
  PVEC_SUB_CHAR_TABLE,
  PVEC_RECORD,
  PVEC_FONT,
  PVEC_TAG_MAX = PVEC_FONT   Keep this equal to the highest member.  
};

This elegant combination of Tagged Pointers (for high-speed, core types and immediate integers without memory allocations) and Poor Man's Inheritance (for an extensible array of complex types) is how Emacs achieves dynamic typing in statically-typed C without sacrificing critical GC performance.

05. The Modern Reincarnation: LLVM's Custom RTTI

The most fascinating part about reading Emacs's 1980s source code is discovering that these techniques are still highly applicable and relevant even in the modern C++ era. The combination of Tagged Pointer + Poor Man's Inheritance is no exception.

By looking at the source code of LLVM, engineers explicitly disable the C++ standard RTTI (-fno-rtti) and dynamic_cast. Instead, LLVM literally reinvents Emacs's "Poor Man's Inheritance" and Tagging system, but wraps it in modern C++ templates. It's called Custom RTTI.

Why LLVM abandons standard RTTI?

Standard C++ RTTI works by embedding a pointer to a type_info object inside every polymorphic class's vtable. A dynamic_cast then traverses a chain of these type_info objects at runtime, comparing strings or pointers until it finds a match or exhausts the hierarchy. For a compiler that performs millions of type checks per second while traversing an AST, this traversal cost is unacceptable.

Instead, LLVM defines a single integer field called SubclassID in its base class to identify concrete types:

// Source: llvm/include/llvm/IR/Value.h
class Value {
private:
   A single integer tag — the same idea as Emacs's pvec_type enum.
   Private so subclasses cannot corrupt it accidentally.
  const unsigned char SubclassID;

public:
  unsigned getValueID() const { return SubclassID; }

   The base class accepts all Values — this is the identity check.
  static bool classof(const Value *) { return true; }
};

// A derived class registers its own ID range and implements classof().
class Argument : public Value {
public:
   isa<Argument>(ptr) compiles down to this — a single integer comparison.
   No vtable traversal. No string comparison. Just: is the tag == ArgumentVal?
  static bool classof(const Value *V) {
    return V->getValueID() == ArgumentVal;
  }
};

Invoking isa<Argument>(Val) evaluates at compile time to Argument::classof(Val), resulting in an integer comparison on SubclassID.

// How LLVM developers write type dispatch
if (isa<Argument>(Val)) {
    Argument *Arg = cast<Argument>(Val);
    // Do something with Arg...
}

Both Emacs and LLVM handle dynamic dispatch over a hierarchy of types by embedding tag information directly in the base structures and performing integer comparisons before casting.

06. Other Tagged Pointer Usages

The pattern of storing information in unused bits of pointers or headers is found in other system implementations:

Linux Kernel Red-Black Trees: Uses the lowest bits of parent pointers to store the node color.
LuaJIT and V8 (NaN Boxing): Uses the payload space of IEEE 754 "Not-a-Number" doubles to encode pointers.
PostgreSQL: Encodes transaction visibility metadata in the bit-fields of tuple headers.
LLVM PointerIntPair<>: A C++ template utility for packing integers into pointer alignment padding.
ARM64 Top Byte Ignore (TBI): Hardware configuration that allows the top 8 bits of a 64-bit pointer to be used for tags (utilized in iOS/macOS).

Conclusion

This article outlines three ways memory is structured to handle dynamic typing:

Tagged Union (Unboxed): Allocates inline memory based on the largest variant. (std::variant)
Fat Pointer: Allocates additional bytes alongside the pointer to store type information. (Go interfaces, Rust traits)
Tagged Pointer (Boxed): Uses the alignment padding of pointers to store tags, relying on heap allocation for the data. (Emacs Lisp_Object, V8)

Different memory layouts serve different requirements. Emacs and LLVM utilize Tagged Pointers and struct embedding to manage dynamic typing within their specific memory constraints.

Next step

Looking into the weird Lisp_String object... there is an interval tree in it.

Emacs Internal Series:

#01: Emacs is a Lisp Runtime in C, Not an Editor
#02: Data First — Deconstructing Lisp_Object in C
#03: Tagged Union, Tagged Pointer, and Poor Man's Inheritance

> share on twitter / bluesky / mastodon / facebook / reddit / telegram / email

Hacker Times

Hacker Times

Emacs internals: Tagged pointers vs. C++ std:variant and LLVM (Part 3)

Discussion

Discussion

Recap

01. Back to the basics: how to write a polymorphic type

02. Tagged Union (Unboxed): `std::variant` and `std::visit`

03. Tagged Pointer (Boxed)

04. Emacs: Tagged Pointer + Poor Man's Inheritance

05. The Modern Reincarnation: LLVM's Custom RTTI

06. Other Tagged Pointer Usages

Conclusion

Next step

Hacker Times

Hacker Times

Emacs internals: Tagged pointers vs. C++ std:variant and LLVM (Part 3)

Discussion

Discussion

Recap

01. Back to the basics: how to write a polymorphic type

02. Tagged Union (Unboxed): std::variant and std::visit

03. Tagged Pointer (Boxed)

04. Emacs: Tagged Pointer + Poor Man's Inheritance

05. The Modern Reincarnation: LLVM's Custom RTTI

06. Other Tagged Pointer Usages

Conclusion

Next step

02. Tagged Union (Unboxed): `std::variant` and `std::visit`