1. https://en.wikipedia.org/wiki/Curiously_recurring_template_p...
2. https://david.alvarezrosa.com/posts/devirtualization-and-sta...
3. https://llvm.org/docs/ProgrammersManual.html#the-isa-cast-an...
But it's really just an implementation difference, the idea is still to have a lightweight RTTI.
I had thought you need the pointer-sized integer types and mustn't do this directly to an actual pointer, but maybe I was wrong (in theory, obviously practice doesn't follow but that's a dangerous game)
If you don't care about portability or using every theoretically available bit then it is trivial. A maximalist implementation must be architecture aware and isn't entirely knowable at compile-time. This makes standardization more complicated since the lowest common denominator is unnecessarily limited.
In C++ this really should be implemented through a tagged pointer wrapper class that abstracts the architectural assumptions and limitations.
In modern C++, the technically "correct" and safe way to spell this trick is exactly as you suggested: using uintptr_t (or intptr_t).
Maybe that is not the correct C++ terminology, I'm more familiar with how provenance works in Rust, where large parts of it got stabilised a little over a year ago. (What was stabilised was "strict provenance", which is a set of rules that if you abide them will definitely be correct, but it is possible the rules might be loosened in the future to be more lenient.)
https://doc.rust-lang.org/std/primitive.pointer.html#method....
Rust's MIRI is able to run code which uses this (a strict provenance API) because although MIRI's pointers are some mysterious internal type, it can track that we mapped them to hide our tags, and then later mapped back from the tagged pointer to recover our "real" pointer and see that's fine.
This isn't an unsafe operation. Dereferencing a pointer is unsafe, but twiddling the bits is fine, it just means whoever writes the unsafe dereferencing part of your codebase needs to be very careful about these pointers e.g. making sure the ones you've smuggled a tag in aren't dereferenced 'cos that's Undefined Behaviour.
It's clear to me how this works in Rust, it's just unclear still in C++
The problem of pointer provenance is more finding a workable theoretical model rather than one causing miscompiles on realistic code. While there are definitely miscompiles on carefully constructed examples, I'm not aware of any bugs on actual code. This is in comparison to topics like restrict(/noalias) semantics or lifetime semantics, where there is a steady drip of bug reports that turn out to be actual optimization failures.
But the likely destiny of C++ is to inherit the provenance rules that are an adjunct to C23, PNVI-ae-udi, Provenance Not Via Integers, Addresses Exposed, User Disambiguates
As that name suggests, in this model provenance is not transmitted via integers. Every 123456 is always just the integer 123456 and there aren't magic 123456 values which are different and transmit some form of provenance from a pointer to some value which happened perhaps to be stored at address 123456 in memory.
However, PNVI-ae-udi has Exposure, which means if we exposed the pointer in an approved way then the associated provenance is somehow magically "out there" in the ether, as a result if we have exposed this pointer then just having that integer 123456 works fine because we combined that integer 123456 with that provenance from the ether and make a working pointer. User disambiguation means that the compiler has to give you "benefit of the doubt" e.g. if you could mean to make a pointer to that Doodad which no longer exists as of a minute ago or to this other Doodad which does exist, well, benefit of the doubt means it was the latter and so your pointer is valid even though the addresses of both Doodads were the same.
Is doing that manually worth it? Usually not, but for some core types (classical example is strings) or in language runtimes it can be.
Would it be awesome if this could be done automatically? Absolutely, but I understand it is a large change, and the plan is to later build upon the pattern types that are currently work in progress (and would allow you to specify custom ranged integer typed).
So that's one tiny use of this sort of idea which is guaranteed unnecessary in Rust, and indeed although it isn't guaranteed the optimiser will typically spot less obvious opportunities so that Option<Option<bool>> which might be None, or Some(None) or Some(Some(true)) or Some(Some(false)) is the same size (one byte) as bool.
But hiding stuff in a pointer is applicable in places your Rust compiler won't try to take advantage unless you do something like this. A novel tiny String-like type I saw recently does this, https://crates.io/crates/cold-string ColdString is 8 bytes, if your text is 8 or fewer bytes of UTF-8 then you're done, that'll fit, but, if you have more text ColdString allocates on the heap to store not only your text but also its length and so it needs to actually "be" in some sense a raw pointer to that structure, but if the string is shorter that pointer is nonsense, we've hidden our text in the pointer itself.
Implementation requires knowing how pointers work, and how UTF-8 encoding works. I actually really like one of the other Rust tiny strings, CompactString but if you have a lot of very small strings (e.g. UK postcodes fit great) then ColdString might be as much as three times smaller than your existing Rust or C++ approach and it's really hard to beat that for such use cases.
Edited: To remove suggestion ColdString has a distinct storage capacity, this isn't intended as a conventional string buffer, it can't grow after creation
There's a competing proposal in C++ land to add provenance via angelic nondeterminism: if there's some provenance that makes the code non-UB, then use that provenance. (As you might imagine, I'm not a big fan of that proposal, but WG21 seems to love it a lot more than I do.)
https://blog.rust-lang.org/2025/01/09/Rust-1.84.0/#strict-pr...
(is the current version of that paper, the tracking ticket insisted there's a P3125R5 and that LEWG had seen it in 2025, but it isn't listed in a mailing so it might be a mirage)
You know it's a Hana paper because it wants this to be allowed at compile time (C++ constrexpr) but joking aside this seems like a nice approach for C++ which stays agnostic about future implementation details.
March 12, 2026 [systems-programming] #tagged-pointer #fat-pointer #custom-rtti #struct-embedding
From the previous article, we examined how GNU Emacs represents every Lisp value β integers, symbols, cons cells, strings, buffers β inside a single 64-bit slot called Lisp_Object. Because all heap objects are 8-byte aligned and the lowest 3 bits of any valid pointer are always zero, Emacs reclaims these "free" bits and uses them as a type tag.
The more fundamental question is: when a single variable must hold values of different types at runtime, how do we preserve enough information to use that data correctly?
The static typing
Given memory and a struct type, how do we access the data?
Data in memory is represented as bits. To operate on it, we define its shape. If we modify the value of score, we load the data from base + 4 bytes, write a 32-bit value, and store it back.
#include <stdint.h>
struct Person {
uint8_t age; offset 0, size 1
[3 bytes padding] offset 1 (align next field to 4)
uint32_t score; offset 4, size 4
uint64_t id; offset 8, size 8
char name[12]; offset 16, size 12
};
// sizeof(Person) == 28, no trailing padding needed
The compiler remembers the shape of the data so the runtime does not have to.
Dynamic typing
Given a memory address and a set of possible types, how do we access the data?
One approach is to add a field at the beginning of a struct to indicate the active type.
struct TaggedValue {
int type_id; which type is active?
union {
TypeA a; TypeB b; TypeC c; } payload;
};
This allows type checking before casting:
// predicate: is this a TypeA?
bool is_type_a(struct TaggedValue* v) {
return v->type_id == TYPE_A; just an integer comparison
}
// check cast: give me TypeA* or NULL
TypeA* as_type_a(struct TaggedValue* v) {
if (v->type_id == TYPE_A) {
return &v->payload.a; }
return NULL; wrong type
}
Functions can be dispatched according to this tag.
// Each function pointer represents "what to do when you encounter this type".
// A Visitor bundles all these handlers together into one struct.
struct Visitor {
void (*visit_a)(TypeA* a); called when type_id == TYPE_A
void (*visit_b)(TypeB* b); called when type_id == TYPE_B
void (*visit_c)(TypeC* c); called when type_id == TYPE_C
};
// The dispatch function is the bridge between data and behavior.
// It reads the type_id, then hands control to the correct handler.
void visit(struct TaggedValue* v, struct Visitor* visitor) {
switch (v->type_id) {
case TYPE_A: visitor->visit_a(&v->payload.a); break;
case TYPE_B: visitor->visit_b(&v->payload.b); break;
case TYPE_C: visitor->visit_c(&v->payload.c); break;
}
}
Now you have invented C++ std::variant and std::visit that were introduced in C++17, which utilize a "tagged union."
std::variant and std::visitPeople claim std::variant and std::visit provide a more "TYPE SAFE" way, but in fact they just provide some checks. It simply ensures that an invalid cast like (TypeA*) type_b_object is caught either at compile time (if the type is not in the variant at all) or at runtime (if the active type does not match), rather than silently producing undefined behavior as the hand-written version would.
There are two things that need attention here:
First, the problem with a tagged union is that the size of the struct is the size of the largest union element. So for the example below, even if most objects are bool or int, the overall size will be 64 bytes, which is very memory inefficient. Therefore, tagged union technique is mostly applied where different sizes of different types are similar, or temporary objects on stack that can be freed after this call.
PS. std::variant is originally designed to handle a closed set of known types in a type-safe manner β similar in spirit to Haskell's Either, where a value is one of several explicitly listed possibilities. The closest C++ analogue to Haskell's Maybe is actually std::optional.
struct TaggedValue {
int type_id;
union {
int a; bool b; char c[64]; } payload;
};
Second, this way of handling data and types is called a "tagged union", or "unboxed".
The naming of boxed and unboxed is a programming language (PL) term that looks very weird to C/C++ programmers. The "unboxed" way looks actually wrapped in a struct box. But the actual meaning in PL theory refers to memory indirection.
std::variant is unboxed because all the bytes required for the largest possible variant are allocated inline right there.
References on memory representation:
A tagged union allocates the maximum required size for every variant. If an Abstract Syntax Tree (AST) contains an integer node requiring 8 bytes and a function definition node requiring 256 bytes, an unboxed tagged union allocates 256 bytes for every integer node. This affects the overall memory footprint and cache usage.
An alternative approach keeps data on the heap (boxed) and uses the lowest 3 bits of the 8-byte aligned pointer as the type tag. This provides $2^3 = 8$ possible types without allocating additional inline bytes. This is called a tagged pointer.
struct tagged_pointer {
void* pointer_and_tag;
}
// sizeof(tagged_pointer) == 8 bytes
What if there are more than 8 types?
One common solution is using fat pointer. Modern languages like Go (interfaces) and Rust (trait objects) use this approach extensively.
Stealing 3 bit is not enough, so, just add a 64-bit space to store the tag information.
struct fat_pointer {
int tag; Could be a type ID, a vtable pointer, or a size/length
void* payload; Pointer to the actual data on the heap
}
// sizeof(fat_pointer) == 16 bytes (doubled)
"Since both Go and Rust think fat pointers are great, why doesn't GNU Emacs just use fat pointers for
Lisp_Object?"
Comparing fat pointers to tagged pointers: A fat pointer is smaller than a tagged union, but it doubles the memory size compared to a 64-bit tagged pointer. This changes the memory footprint and the Garbage Collector (GC) scanning workload. Emacs was designed to keep the Lisp_Object within a single 64-bit word.
PS. It's a crime to waste memory in the 1980s β a typical workstation had around 256 KB of RAM, which is less than the size of a single modern emoji in a Unicode font file. Doubling every Lisp_Object from 8 to 16 bytes wasn't an engineering tradeoff. It was a confession.
Since GNU Emacs uses the lowest 3 bits for tags, it is strictly limited to 8 fundamental types. If you look at enum Lisp_Type in src/lisp.h, you'll see exactly that:
Lisp_Symbol
Lisp_Int0
Lisp_Int1
Lisp_String
Lisp_Vectorlike
Lisp_Cons
Lisp_Float (plus one unused type)
McCarthy's Lisp (1960) abstract math atom eq car cdr cons quote cond β β Emacs engineers bridge: β "statically typed C must represent β dynamically typed Lisp" βΌ Lisp_Object (src/lisp.h) C layer ββββββββββββββββββββββββ¬βββββ β pointer or value βtag β β one machine word (64 bits) β 61 bits β 3b β ββββββββββββββββββββββββ΄βββββ β ββ tag = Cons β struct Lisp_Cons ββ tag = String β struct Lisp_String ββ tag = Float β struct Lisp_Float ββ tag = Int0/1 β EMACS_INT (immediate value, no pointer!) ββ tag = Symbol β struct Lisp_Symbol ββ tag = Vectorlike β βΌ union vectorlike_header (The Escape Hatch) β ββ PVEC_BUFFER β struct buffer ββ PVEC_WINDOW β struct window ββ PVEC_HASH_TABLE β struct Lisp_Hash_Table ββ PVEC_SYMBOL_WITH_POS β struct Lisp_Symbol_With_Pos ββ ... (over 30+ complex types!)
A basic Lisp interpreter might only need a few primitive types. But Emacs is a text editor designed for performance (?!), so it implements its core objects (like buffers and windows) directly in C. So, 8 types are not enough.
How can we represent more types when the pointer tag is limited to 3 bits?
To expand the type space, Emacs uses a C pattern sometimes called "Poor Man's Inheritance" or struct embedding. By placing a common header as the very first field of a struct, a pointer can be cast to the header type, checked for its sub-type, and then cast to the specific object type.
When Lisp_Object carries the Lisp_Vectorlike tag, it points to a struct starting with union vectorlike_header. Emacs reads this header to find a sub-tag indicating the specific type (pvec_type).
This is how vectorlike_header is embedded:
// Source: lisp.h
struct Lisp_Vector // The generic "Base Data Structure"
{
union vectorlike_header header;
Lisp_Object contents[FLEXIBLE_ARRAY_MEMBER];
} GCALIGNED_STRUCT;
// Derived type Example
struct Lisp_Symbol_With_Pos
{
union vectorlike_header header; <--- The "Base Class" must be the first field!
Lisp_Object sym; A symbol
Lisp_Object pos; A fixnum
} GCALIGNED_STRUCT;
When Emacs identifies a Lisp_Object, it performs two checks:
Lisp_Vectorlike (0b101).union vectorlike_header*, read the TYPE field. If it equals PVEC_SYMBOL_WITH_POS, cast the pointer to struct Lisp_Symbol_With_Pos*.The definition of union vectorlike_header packs the subtype into its size field:
// Source: src/lisp.h
union vectorlike_header
{
The `size' header word, W bits wide, has one of two forms
discriminated by the second-highest bit (PSEUDOVECTOR_FLAG):
1 1 W-2
+---+---+-------------------------------------+
| M | 0 | SIZE | vector
+---+---+-------------------------------------+
1 1 W-32 6 12 12
+---+---+--------+------+----------+----------+
| M | 1 | unused | TYPE | RESTSIZE | LISPSIZE | pseudovector
+---+---+--------+------+----------+----------+
M (ARRAY_MARK_FLAG) holds the GC mark bit.
SIZE is the length (number of slots) of a regular Lisp vector,
and the object layout is struct Lisp_Vector.
TYPE is the pseudovector subtype (enum pvec_type).
LISPSIZE is the number of Lisp_Object fields at the beginning of the
object (after the header). These are always traced by the GC.
RESTSIZE is the number of fields (in word_size units) following.
These are not automatically traced by the GC.
For PVEC_BOOL and statically allocated PVEC_SUBR, RESTSIZE is 0.
(The block size for PVEC_BOOL is computed from its own size
field, to avoid being restricted by the 12-bit RESTSIZE field.)
ptrdiff_t size;
};
And the following contains all the concrete pvec_type sub-types it can represent:
// Source: src/lisp.h
enum pvec_type
{
PVEC_NORMAL_VECTOR, Should be first, for sxhash_obj.
PVEC_FREE,
PVEC_BIGNUM,
PVEC_MARKER,
PVEC_OVERLAY,
PVEC_FINALIZER,
PVEC_SYMBOL_WITH_POS,
PVEC_MISC_PTR,
PVEC_USER_PTR,
PVEC_PROCESS,
PVEC_FRAME,
PVEC_WINDOW,
PVEC_BOOL_VECTOR,
PVEC_BUFFER,
PVEC_HASH_TABLE,
PVEC_OBARRAY,
PVEC_TERMINAL,
PVEC_WINDOW_CONFIGURATION,
PVEC_SUBR,
PVEC_OTHER, Should never be visible to Elisp code.
PVEC_XWIDGET,
PVEC_XWIDGET_VIEW,
PVEC_THREAD,
PVEC_MUTEX,
PVEC_CONDVAR,
PVEC_MODULE_FUNCTION,
PVEC_NATIVE_COMP_UNIT,
PVEC_TS_PARSER,
PVEC_TS_NODE,
PVEC_TS_COMPILED_QUERY,
PVEC_SQLITE,
These should be last, for internal_equal and sxhash_obj.
PVEC_CLOSURE,
PVEC_CHAR_TABLE,
PVEC_SUB_CHAR_TABLE,
PVEC_RECORD,
PVEC_FONT,
PVEC_TAG_MAX = PVEC_FONT Keep this equal to the highest member.
};
This elegant combination of Tagged Pointers (for high-speed, core types and immediate integers without memory allocations) and Poor Man's Inheritance (for an extensible array of complex types) is how Emacs achieves dynamic typing in statically-typed C without sacrificing critical GC performance.
The most fascinating part about reading Emacs's 1980s source code is discovering that these techniques are still highly applicable and relevant even in the modern C++ era. The combination of Tagged Pointer + Poor Man's Inheritance is no exception.
By looking at the source code of LLVM, engineers explicitly disable the C++ standard RTTI (-fno-rtti) and dynamic_cast. Instead, LLVM literally reinvents Emacs's "Poor Man's Inheritance" and Tagging system, but wraps it in modern C++ templates. It's called Custom RTTI.
Why LLVM abandons standard RTTI?
Standard C++ RTTI works by embedding a pointer to a type_info object inside every polymorphic class's vtable. A dynamic_cast then traverses a chain of these type_info objects at runtime, comparing strings or pointers until it finds a match or exhausts the hierarchy. For a compiler that performs millions of type checks per second while traversing an AST, this traversal cost is unacceptable.
Instead, LLVM defines a single integer field called SubclassID in its base class to identify concrete types:
// Source: llvm/include/llvm/IR/Value.h
class Value {
private:
A single integer tag β the same idea as Emacs's pvec_type enum.
Private so subclasses cannot corrupt it accidentally.
const unsigned char SubclassID;
public:
unsigned getValueID() const { return SubclassID; }
The base class accepts all Values β this is the identity check.
static bool classof(const Value *) { return true; }
};
// A derived class registers its own ID range and implements classof().
class Argument : public Value {
public:
isa<Argument>(ptr) compiles down to this β a single integer comparison.
No vtable traversal. No string comparison. Just: is the tag == ArgumentVal?
static bool classof(const Value *V) {
return V->getValueID() == ArgumentVal;
}
};
Invoking isa<Argument>(Val) evaluates at compile time to Argument::classof(Val), resulting in an integer comparison on SubclassID.
// How LLVM developers write type dispatch
if (isa<Argument>(Val)) {
Argument *Arg = cast<Argument>(Val);
// Do something with Arg...
}
Both Emacs and LLVM handle dynamic dispatch over a hierarchy of types by embedding tag information directly in the base structures and performing integer comparisons before casting.
The pattern of storing information in unused bits of pointers or headers is found in other system implementations:
doubles to encode pointers.PointerIntPair<>: A C++ template utility for packing integers into pointer alignment padding.This article outlines three ways memory is structured to handle dynamic typing:
std::variant)Lisp_Object, V8)Different memory layouts serve different requirements. Emacs and LLVM utilize Tagged Pointers and struct embedding to manage dynamic typing within their specific memory constraints.
Looking into the weird Lisp_String object... there is an interval tree in it.
Emacs Internal Series:
> share on twitter / bluesky / mastodon / facebook / reddit / telegram / email
> related articles:
1) Emacs Internal #02: Data First β Deconstructing Lisp_Object in C