Part I: Foundations
"In theory, theory and practice are the same. In practice, they are not." — Attributed to various computer scientists
It was 2:00 AM, and I was staring at profiling data that made no sense.
I was working on a bootloader for a RISC-V SoC, and we had a performance problem. The bootloader needed to look up device configurations from a table—about 500 entries, each with a 32-bit device ID and a pointer to configuration data. Simple enough.
My colleague had implemented it using a hash table. "O(1) lookup," he said confidently. "Can't beat that."
But the bootloader was slow. Unacceptably slow. We were missing our 100ms boot time target by a factor of three.
I tried the obvious optimization: replacing the hash table with a binary search on a sorted array. Binary search is O(log n), which is theoretically worse than O(1). The textbooks say so. My algorithms professor would have frowned.
The result? The bootloader was now 40% faster.
How could O(log n) beat O(1)? What was going on?
I fired up perf, Linux's performance profiling tool, and ran both implementations:
# Hash table version
$ perf stat -e cache-references,cache-misses ./bootloader_hash
Performance counter stats:
1,247,832 cache-references
892,441 cache-misses (71.5% miss rate)
# Binary search version
$ perf stat -e cache-references,cache-misses ./bootloader_binsearch
Performance counter stats:
423,156 cache-references
89,234 cache-misses (21.1% miss rate)
There it was. The hash table had a 71.5% cache miss rate. The binary search had only 21.1%.
Each cache miss costs roughly 100 CPU cycles on this system. The hash table was spending most of its time waiting for memory.
The O(1) hash table was doing fewer operations, but each operation was expensive. The O(log n) binary search was doing more operations, but each operation was cheap.
The hardware had overruled the algorithm.
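For the curious, here is the shape of the binary-search version. This is my reconstruction for illustration; the struct layout and function names are assumptions, not the production code:

#include <stddef.h>
#include <stdint.h>

// Hypothetical device table entry; the table is kept sorted by id.
typedef struct {
    uint32_t id;          // 32-bit device ID
    const void *config;   // pointer to configuration data
} dev_entry_t;

// Classic binary search: O(log n) compares on a contiguous,
// sorted array. For ~500 entries that is at most 9 probes.
const void *lookup_config(const dev_entry_t *table, size_t n, uint32_t id) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].id == id)
            return table[mid].config;
        else if (table[mid].id < id)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;  // not found
}

Note why this layout is cache-friendly: 500 entries at 16 bytes each is roughly 8 KB of contiguous memory, and the early probes hit the same few midpoints on every lookup, so those cache lines stay hot. A hash table, by design, scatters each probe across its bucket array.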
This book is about that gap—the gap between what the textbooks teach and what actually happens when your code runs on real silicon.
Traditional data structures courses teach you to think in terms of Big-O complexity: O(1) for a hash table lookup, O(log n) for binary search, O(n) for a linear scan.
These are useful abstractions. They help us reason about algorithms at scale. But they're incomplete.
They assume all memory accesses cost the same. They assume operations happen in isolation. They assume an idealized computer that doesn't exist.
Real computers have:

- A deep memory hierarchy: registers, several levels of cache, then DRAM
- Cache lines: memory moves in 64-byte blocks, not individual bytes
- Hardware prefetchers that reward predictable access patterns and give up on pointer chasing

And if you're working on embedded systems, you have even more constraints:

- Small caches (a 16 KB L1 is typical) and often no L2 or L3
- Limited RAM, where every byte of a data structure counts
- Real-time deadlines, where worst-case latency matters more than the average
Here's a better mental model for modern computers:
Time = Operations × (Computation Cost + Memory Cost)
Where:

- Operations is the count that Big-O analysis gives you
- Computation Cost is the cycles the CPU spends actually computing
- Memory Cost is the cycles it spends waiting on the memory hierarchy
For many algorithms, Memory Cost dominates.
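To see why the hash table lost, plug the bootloader's perf numbers into this model, using the ~100-cycle miss penalty mentioned earlier. This is a back-of-the-envelope estimate, not a measurement:

Hash table:     892,441 misses × ~100 cycles ≈ 89M cycles stalled on memory
Binary search:   89,234 misses × ~100 cycles ≈  9M cycles stalled on memory

The binary search runs more comparisons, but it spends roughly a tenth as many cycles waiting on DRAM.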
Let's quantify this with real numbers from a typical embedded RISC-V system:
| Operation | Latency | Relative Cost |
|---|---|---|
| Register access | 1 cycle | 1× |
| L1 cache hit | 3-4 cycles | 3× |
| L2 cache hit | 12-15 cycles | 12× |
| L3 cache hit | 40-50 cycles | 40× |
| DRAM access | 100-200 cycles | 100× |
A single cache miss can cost as much as 100 register operations.
This means:

- A cache-friendly O(n) algorithm can beat a cache-hostile O(log n) one
- Data layout matters as much as the algorithm that walks it
- Big-O alone cannot predict real-world performance
Let's make this concrete with a simple experiment. We'll compare two ways to sum 100,000 integers:

1. A contiguous array, traversed front to back
2. A singly linked list, traversed by following next pointers
Both are O(n). The textbooks say they should perform similarly. Let's see what really happens.
Here's the array version:
// Array: contiguous memory
int array[100000];
for (int i = 0; i < 100000; i++) {
array[i] = i;
}
// Sum all elements
long long sum = 0;
for (int i = 0; i < 100000; i++) {
sum += array[i];
}
And the linked list version:
// Linked list: scattered memory
#include <stdlib.h>  // for malloc
typedef struct node {
int value;
struct node *next;
} node_t;
node_t *head = NULL;
for (int i = 0; i < 100000; i++) {
node_t *node = malloc(sizeof(node_t));
node->value = i;
node->next = head;
head = node;
}
// Sum all elements
long long sum = 0;
node_t *curr = head;
while (curr) {
sum += curr->value;
curr = curr->next;
}
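The full benchmark framework arrives in Chapter 3. If you want to time these loops yourself right now, here is a minimal POSIX sketch using clock_gettime, not the book's framework:

#include <stdio.h>
#include <time.h>

// Minimal timing harness: difference between two monotonic timestamps.
static long long elapsed_ns(struct timespec start, struct timespec end) {
    return (end.tv_sec - start.tv_sec) * 1000000000LL
         + (end.tv_nsec - start.tv_nsec);
}

// Usage:
//   struct timespec t0, t1;
//   clock_gettime(CLOCK_MONOTONIC, &t0);
//   /* ... run one of the sum loops ... */
//   clock_gettime(CLOCK_MONOTONIC, &t1);
//   printf("%lld ns\n", elapsed_ns(t0, t1));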
Using our benchmark framework (which we'll explore in detail in Chapter 3), here are the results:
=== Array Sequential Sum ===
Mean time: 70,147 ns
Median time: 71,724 ns
Total cycles: 17,557,410
=== Linked List Sequential Sum ===
Mean time: 179,169 ns
Median time: 160,527 ns
Total cycles: 44,740,656
Array is 2.55× faster than Linked List
Same algorithm (sequential sum), same O(n) complexity, but the array is 2.5× faster.
Why? Let's look at the cache behavior:
Array access pattern:
Memory: [0][1][2][3][4][5][6][7][8][9]...
↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑
Access: Sequential, predictable
Cache: Fetch 64 bytes (16 ints) at once
Result: ~94% cache hit rate
Linked list access pattern:
Memory: [node] ... [node] ... [node] ... [node]
↑ ↑ ↑ ↑
Access: Random, unpredictable (follows pointers)
Cache: Each node likely in different cache line
Result: ~70% cache miss rate
The array benefits from spatial locality—when you access array[0], the CPU fetches an entire cache line (64 bytes), which includes array[0] through array[15]. The next 15 accesses are free.
The linked list suffers from pointer chasing—each node is allocated separately by malloc(), scattered randomly in memory. Each access likely requires a new cache line fetch.
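You can watch spatial locality directly by mapping element addresses to cache-line numbers yourself. A small sketch, assuming 64-byte lines (typical, but check your target):

#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64  // assumed cache-line size

int main(void) {
    int array[32];
    // Elements whose addresses fall in the same 64-byte-aligned
    // block share one cache line: sixteen 4-byte ints per line.
    for (int i = 0; i < 32; i++) {
        unsigned long line = (unsigned long)((uintptr_t)&array[i] / LINE_SIZE);
        printf("array[%2d] -> line %lu\n", i, line);
    }
    return 0;
}

On a typical system this prints only two or three distinct line numbers for all 32 elements: one miss buys you 16 ints.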
To understand why cache matters so much, we need to understand the memory hierarchy.
Modern computers are not the simple "CPU + RAM" model from introductory courses. They're more like this:
CPU Core
↓ 1 cycle
Registers (32-64 registers, ~256 bytes)
↓ 3-4 cycles
L1 Cache (32-64 KB, split I/D)
↓ 12-15 cycles
L2 Cache (256 KB - 1 MB, unified)
↓ 40-50 cycles
L3 Cache (4-32 MB, shared) [not on all systems]
↓ 100-200 cycles
DRAM (GB scale)
↓ 10,000+ cycles
SSD/Flash
Each level is:

- Bigger than the one above it
- Slower than the one above it (3-10× per step, as the latencies above show)
- Cheaper per byte
The speed gap is enormous. On a 1 GHz RISC-V processor, where one cycle is one nanosecond:

- A register access takes about 1 ns
- An L1 hit takes 3-4 ns
- A DRAM access takes 100-200 ns
For comparison, if an L1 cache access were 1 second, a DRAM access would be about 50 seconds. That's the difference between a quick response and going to make coffee.
Here's a critical insight: CPUs don't fetch individual bytes. They fetch cache lines.
A cache line is typically 64 bytes. When you access a single byte, the CPU fetches the entire 64-byte block containing that byte.
This has profound implications:
Good: If you access nearby data, it's already in cache (spatial locality)
// Excellent: sequential access
for (int i = 0; i < n; i++) {
sum += array[i]; // Next element likely in same cache line
}
Bad: If you access scattered data, you waste 63 bytes per fetch
// Terrible: random access
for (int i = 0; i < n; i++) {
sum += array[random() % n]; // Random in-bounds index; each access likely misses cache
}
Worse: If your data structure has poor layout, you pay for data you don't use
// Linked list node: 16 bytes (4-byte value + 8-byte pointer + padding)
// Cache line: 64 bytes
// Waste: 48 bytes (75% of cache line unused!)
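You can verify the node size on your own toolchain. A minimal check; 16 bytes assumes a typical 64-bit target, and a 32-bit MCU will usually report 8:

#include <stdio.h>

typedef struct node {
    int value;          // 4 bytes, then 4 bytes padding so next is 8-byte aligned
    struct node *next;  // 8 bytes on a 64-bit target
} node_t;

int main(void) {
    printf("sizeof(node_t) = %zu bytes\n", sizeof(node_t));
    return 0;
}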
Modern CPUs have hardware prefetchers that try to predict what you'll access next. They're good at detecting simple patterns:
Sequential access: Prefetcher loves this
for (int i = 0; i < n; i++) {
process(array[i]); // Prefetcher: "I see a pattern! Fetch ahead!"
}
Strided access: Prefetcher can handle this
for (int i = 0; i < n; i += 2) {
process(array[i]); // Prefetcher: "Stride of 2, got it!"
}
Pointer chasing: Prefetcher gives up
while (node) {
process(node->value);
node = node->next; // Prefetcher: "No idea what's next..."
}
This is why linked lists are so slow—the prefetcher can't help. Each pointer dereference is a surprise.
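One partial mitigation, if you are on GCC or Clang, is an explicit prefetch hint via the __builtin_prefetch extension. A sketch; the gain is usually modest, because each next pointer still gates the one after it:

while (curr) {
    if (curr->next)
        __builtin_prefetch(curr->next, 0, 1);  // 0 = read, 1 = low temporal locality
    sum += curr->value;
    curr = curr->next;
}

This overlaps fetching the next node with processing the current one, but it cannot look two nodes ahead without first loading the intervening pointer.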
If you're working on embedded systems, the situation is more extreme:
Typical embedded RISC-V MCU:

- A single in-order core, often in the tens to hundreds of MHz
- A 16 KB L1 cache (sometimes no cache at all), and no L2 or L3
- SRAM measured in KB, not GB
With a 16 KB L1 cache, your entire working set needs to fit in 16 KB or you'll thrash the cache.
For comparison:

- Desktop L1: 32-64 KB per core
- Desktop L2: 256 KB - 1 MB
- Desktop L3: 4-32 MB, shared

Your entire embedded cache budget is smaller than a desktop's L1.
This is why embedded systems developers obsess over data structure size and layout. Every byte counts.
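One example of that obsession: ordering struct fields to avoid padding. The sizes in the comments are typical for common ABIs, not guaranteed:

#include <stdint.h>
#include <stdio.h>

// Poor order: the compiler pads each small field up to the
// alignment of the next, larger one.
struct bad  { uint8_t a; uint32_t b; uint8_t c; };  // typically 12 bytes

// Better order: largest fields first, small fields packed together.
struct good { uint32_t b; uint8_t a; uint8_t c; };  // typically 8 bytes

int main(void) {
    printf("bad:  %zu bytes\n", sizeof(struct bad));
    printf("good: %zu bytes\n", sizeof(struct good));
    return 0;
}

A third of the structure evaporates just by reordering fields; across an array of thousands of these, that is a third fewer cache lines to fetch.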
In embedded systems, we often care about worst-case performance, not average-case.
Consider a real-time control loop running at 1 kHz (1 ms period):

- At 1 GHz, each period is a budget of 1,000,000 cycles
- A few thousand unexpected cache misses, at ~100 cycles each, can burn a third of that budget
If your algorithm has unpredictable cache behavior, you can't guarantee real-time deadlines.
This is why real-time systems often prefer:

- Statically allocated, fixed-size arrays
- Algorithms with tight, predictable worst-case bounds
- Access patterns the cache and prefetcher can handle (sketched below)
Even if they're "slower" in average-case Big-O terms.
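Here's a sketch of what that preference looks like in practice. Everything below, names and capacity included, is illustrative rather than taken from a real codebase:

#include <stdbool.h>
#include <stdint.h>

#define MAX_SENSORS 32  // fixed capacity chosen at design time

// Static allocation: no malloc, no fragmentation, footprint known
// at link time. Parallel arrays keep the scanned data contiguous.
static uint32_t sensor_ids[MAX_SENSORS];
static int32_t  sensor_values[MAX_SENSORS];
static int      sensor_count = 0;

// Bounded linear scan: at most MAX_SENSORS iterations over two
// cache lines of IDs. The worst case is fixed and easy to budget
// against a 1 ms deadline.
bool read_sensor(uint32_t id, int32_t *out) {
    for (int i = 0; i < sensor_count; i++) {
        if (sensor_ids[i] == id) {
            *out = sensor_values[i];
            return true;
        }
    }
    return false;
}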
This book will teach you to think about data structures in terms of hardware reality:
Part I: Foundations
Part II: Basic Data Structures
Part III: Trees and Hierarchies
Part IV: Advanced Topics
Part V: Case Studies
Each chapter will include:

- Working C implementations you can compile and run
- Benchmark measurements from the framework built in Chapter 3
- Analysis of the cache behavior behind the numbers
To get the most out of this book, you should:

Know:
- C, including pointers and structs
- Basic data structures (arrays, lists, trees, hash tables)
- Big-O notation, at least informally

Have:
- A C compiler (gcc or clang)
- A Linux machine with perf, to follow along with the measurements

Optional but helpful:
- RISC-V hardware, or an emulator such as QEMU
All code examples and benchmarks are available at:
github.com/dannyjiang/ds-in-practice (placeholder)
In the next chapter, we'll dive deep into the memory hierarchy. You'll learn:

- How caches are organized: lines, sets, and associativity
- What happens, step by step, on a cache miss
- How to predict which access patterns will hit and which will miss
Then in Chapter 3, we'll build a complete benchmarking framework—the same one used for all measurements in this book.
By the end of Part I, you'll have the tools and knowledge to analyze any data structure's real-world performance.
Let's get started.
The mystery from 2:00 AM was solved. The O(log n) binary search beat the O(1) hash table by 40% because cache behavior mattered more than algorithmic complexity. The hash table's 71.5% cache miss rate versus the binary search's 21.1% explained everything. The hardware had overruled the algorithm.
Key insights:

- Big-O counts operations but says nothing about what each operation costs
- On modern hardware, memory access cost often dominates computation
- Arrays exploit spatial locality and the prefetcher; linked lists defeat both
- Cache behavior can invert the textbook ranking of two algorithms

The Performance Gap:

- A DRAM access costs roughly 100× a register access
- Memory moves in 64-byte cache lines, whether you use 4 bytes of one or all 64
- Real-time deadlines care about worst-case behavior, not the average
Next Chapter: We'll explore the memory hierarchy in detail and learn how caches work at the hardware level.