Part I: Foundations
"In theory, theory and practice are the same. In practice, they are not." — Attributed to various computer scientists
It was 2:00 AM, and I was staring at profiling data that made no sense.
I was working on a bootloader for a RISC-V SoC, and we had a performance problem. The bootloader needed to look up device configurations from a table—about 500 entries, each with a 32-bit device ID and a pointer to configuration data. Simple enough.
My colleague had implemented it using a hash table. "O(1) lookup," he said confidently. "Can't beat that."
But the bootloader was slow. Unacceptably slow. We were missing our 100ms boot time target by a factor of three.
I tried the obvious optimization: replacing the hash table with a binary search on a sorted array. Binary search is O(log n), which is theoretically worse than O(1). The textbooks say so. My algorithms professor would have frowned.
The result? The bootloader was now 40% faster.
How could O(log n) beat O(1)? What was going on?
I fired up perf, Linux's performance profiling tool, and ran both implementations:
# Hash table version
$ perf stat -e cache-references,cache-misses ./bootloader_hash
Performance counter stats:
1,247,832 cache-references
892,441 cache-misses (71.5% miss rate)
# Binary search version
$ perf stat -e cache-references,cache-misses ./bootloader_binsearch
Performance counter stats:
423,156 cache-references
89,234 cache-misses (21.1% miss rate)
There it was. The hash table had a 71.5% cache miss rate. The binary search had only 21.1%.
Each cache miss costs roughly 100 CPU cycles on this system. The hash table was spending most of its time waiting for memory.
The O(1) hash table was doing fewer operations, but each operation was expensive. The O(log n) binary search was doing more operations, but each operation was cheap.
The hardware had overruled the algorithm.
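For the curious, here is the shape of the binary-search version. This is my reconstruction for illustration; the struct layout and function names are assumptions, not the production code:

#include <stddef.h>
#include <stdint.h>

// Hypothetical device table entry; the table is kept sorted by id.
typedef struct {
    uint32_t id;          // 32-bit device ID
    const void *config;   // pointer to configuration data
} dev_entry_t;

// Classic binary search: O(log n) compares on a contiguous,
// sorted array. For ~500 entries that is at most 9 probes.
const void *lookup_config(const dev_entry_t *table, size_t n, uint32_t id) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].id == id)
            return table[mid].config;
        else if (table[mid].id < id)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;  // not found
}

Note why this layout is cache-friendly: 500 entries at 16 bytes each is roughly 8 KB of contiguous memory, and the early probes hit the same few midpoints on every lookup, so those cache lines stay hot. A hash table, by design, scatters each probe across its bucket array.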
This book is about that gap—the gap between what the textbooks teach and what actually happens when your code runs on real silicon.
Traditional data structures courses teach you to think in terms of Big-O complexity: O(1) for a hash table lookup, O(log n) for binary search, O(n) for a linear scan.
These are useful abstractions. They help us reason about algorithms at scale. But they're incomplete.
They assume all memory accesses cost the same. They assume operations happen in isolation. They assume an idealized computer that doesn't exist.
Real computers have:

- A deep memory hierarchy: registers, several levels of cache, then DRAM
- Cache lines: memory moves in 64-byte blocks, not individual bytes
- Hardware prefetchers that reward predictable access patterns and give up on pointer chasing

And if you're working on embedded systems, you have even more constraints:

- Small caches (a 16 KB L1 is typical) and often no L2 or L3
- Limited RAM, where every byte of a data structure counts
- Real-time deadlines, where worst-case latency matters more than the average
Here's a better mental model for modern computers:
Time = Operations × (Computation Cost + Memory Cost)
Where:

- Operations is the count that Big-O analysis gives you
- Computation Cost is the cycles the CPU spends actually computing
- Memory Cost is the cycles it spends waiting on the memory hierarchy
For many algorithms, Memory Cost dominates.
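To see why the hash table lost, plug the bootloader's perf numbers into this model, using the ~100-cycle miss penalty mentioned earlier. This is a back-of-the-envelope estimate, not a measurement:

Hash table:     892,441 misses × ~100 cycles ≈ 89M cycles stalled on memory
Binary search:   89,234 misses × ~100 cycles ≈  9M cycles stalled on memory

The binary search runs more comparisons, but it spends roughly a tenth as many cycles waiting on DRAM.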
Let's quantify this with real numbers from a typical embedded RISC-V system:
| Operation | Latency | Relative Cost |
|---|---|---|
| Register access | 1 cycle | 1× |
| L1 cache hit | 3-4 cycles | 3× |
| L2 cache hit | 12-15 cycles | 12× |
| L3 cache hit | 40-50 cycles | 40× |
| DRAM access | 100-200 cycles | 100× |
A single cache miss can cost as much as 100 register operations.
This means:

- A cache-friendly O(n) algorithm can beat a cache-hostile O(log n) one
- Data layout matters as much as the algorithm that walks it
- Big-O alone cannot predict real-world performance
Let's make this concrete with a simple experiment. We'll compare two ways to sum 100,000 integers:

1. A contiguous array, traversed front to back
2. A singly linked list, traversed by following next pointers
Both are O(n). The textbooks say they should perform similarly. Let's see what really happens.
Here's the array version:
// Array: contiguous memory
int array[100000];
for (int i = 0; i < 100000; i++) {
array[i] = i;
}
// Sum all elements
long long sum = 0;
for (int i = 0; i < 100000; i++) {
sum += array[i];
}
And the linked list version:
// Linked list: scattered memory
#include <stdlib.h>  // for malloc
typedef struct node {
int value;
struct node *next;
} node_t;
node_t *head = NULL;
for (int i = 0; i < 100000; i++) {
node_t *node = malloc(sizeof(node_t));
node->value = i;
node->next = head;
head = node;
}
// Sum all elements
long long sum = 0;
node_t *curr = head;
while (curr) {
sum += curr->value;
curr = curr->next;
}
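The full benchmark framework arrives in Chapter 3. If you want to time these loops yourself right now, here is a minimal POSIX sketch using clock_gettime, not the book's framework:

#include <stdio.h>
#include <time.h>

// Minimal timing harness: difference between two monotonic timestamps.
static long long elapsed_ns(struct timespec start, struct timespec end) {
    return (end.tv_sec - start.tv_sec) * 1000000000LL
         + (end.tv_nsec - start.tv_nsec);
}

// Usage:
//   struct timespec t0, t1;
//   clock_gettime(CLOCK_MONOTONIC, &t0);
//   /* ... run one of the sum loops ... */
//   clock_gettime(CLOCK_MONOTONIC, &t1);
//   printf("%lld ns\n", elapsed_ns(t0, t1));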
Using our benchmark framework (which we'll explore in detail in Chapter 3), here are the results:
=== Array Sequential Sum ===
Mean time: 70,147 ns
Median time: 71,724 ns
Total cycles: 17,557,410
=== Linked List Sequential Sum ===
Mean time: 179,169 ns
Median time: 160,527 ns
Total cycles: 44,740,656
Array is 2.55× faster than Linked List
Same algorithm (sequential sum), same O(n) complexity, but the array is 2.5× faster.
Why? Let's look at the cache behavior:
Array access pattern:
Memory: [0][1][2][3][4][5][6][7][8][9]...
↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑
Access: Sequential, predictable
Cache: Fetch 64 bytes (16 ints) at once
Result: ~94% cache hit rate
Linked list access pattern:
Memory: [node] ... [node] ... [node] ... [node]
↑ ↑ ↑ ↑
Access: Random, unpredictable (follows pointers)
Cache: Each node likely in different cache line
Result: ~70% cache miss rate
The array benefits from spatial locality—when you access array[0], the CPU fetches an entire cache line (64 bytes), which includes array[0] through array[15]. The next 15 accesses are free.
The linked list suffers from pointer chasing—each node is allocated separately by malloc(), scattered randomly in memory. Each access likely requires a new cache line fetch.
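You can watch spatial locality directly by mapping element addresses to cache-line numbers yourself. A small sketch, assuming 64-byte lines (typical, but check your target):

#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64  // assumed cache-line size

int main(void) {
    int array[32];
    // Elements whose addresses fall in the same 64-byte-aligned
    // block share one cache line: sixteen 4-byte ints per line.
    for (int i = 0; i < 32; i++) {
        unsigned long line = (unsigned long)((uintptr_t)&array[i] / LINE_SIZE);
        printf("array[%2d] -> line %lu\n", i, line);
    }
    return 0;
}

On a typical system this prints only two or three distinct line numbers for all 32 elements: one miss buys you 16 ints.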
To understand why cache matters so much, we need to understand the memory hierarchy.
Modern computers are not the simple "CPU + RAM" model from introductory courses. They're more like this:
CPU Core
↓ 1 cycle
Registers (32-64 registers, ~256 bytes)
↓ 3-4 cycles
L1 Cache (32-64 KB, split I/D)
↓ 12-15 cycles
L2 Cache (256 KB - 1 MB, unified)
↓ 40-50 cycles
L3 Cache (4-32 MB, shared) [not on all systems]
↓ 100-200 cycles
DRAM (GB scale)
↓ 10,000+ cycles
SSD/Flash
Each level is:

- Bigger than the one above it
- Slower than the one above it (3-10× per step, as the latencies above show)
- Cheaper per byte
The speed gap is enormous. On a 1 GHz RISC-V processor, where one cycle is one nanosecond:

- A register access takes about 1 ns
- An L1 hit takes 3-4 ns
- A DRAM access takes 100-200 ns
For comparison, if an L1 cache access were 1 second, a DRAM access would be about 50 seconds. That's the difference between a quick response and going to make coffee.
Here's a critical insight: CPUs don't fetch individual bytes. They fetch cache lines.
A cache line is typically 64 bytes. When you access a single byte, the CPU fetches the entire 64-byte block containing that byte.
This has profound implications:
Good: If you access nearby data, it's already in cache (spatial locality)
// Excellent: sequential access
for (int i = 0; i < n; i++) {
sum += array[i]; // Next element likely in same cache line
}
Bad: If you access scattered data, you waste 63 bytes per fetch
// Terrible: random access
for (int i = 0; i < n; i++) {
sum += array[random() % n]; // Random in-bounds index; each access likely misses cache
}
Worse: If your data structure has poor layout, you pay for data you don't use
// Linked list node: 16 bytes (4-byte value + 8-byte pointer + padding)
// Cache line: 64 bytes
// Waste: 48 bytes (75% of cache line unused!)
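You can verify the node size on your own toolchain. A minimal check; 16 bytes assumes a typical 64-bit target, and a 32-bit MCU will usually report 8:

#include <stdio.h>

typedef struct node {
    int value;          // 4 bytes, then 4 bytes padding so next is 8-byte aligned
    struct node *next;  // 8 bytes on a 64-bit target
} node_t;

int main(void) {
    printf("sizeof(node_t) = %zu bytes\n", sizeof(node_t));
    return 0;
}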
Modern CPUs have hardware prefetchers that try to predict what you'll access next. They're good at detecting simple patterns:
Sequential access: Prefetcher loves this
for (int i = 0; i < n; i++) {
process(array[i]); // Prefetcher: "I see a pattern! Fetch ahead!"
}
Strided access: Prefetcher can handle this
for (int i = 0; i < n; i += 2) {
process(array[i]); // Prefetcher: "Stride of 2, got it!"
}
Pointer chasing: Prefetcher gives up
while (node) {
process(node->value);
node = node->next; // Prefetcher: "No idea what's next..."
}
This is why linked lists are so slow—the prefetcher can't help. Each pointer dereference is a surprise.
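One partial mitigation, if you are on GCC or Clang, is an explicit prefetch hint via the __builtin_prefetch extension. A sketch; the gain is usually modest, because each next pointer still gates the one after it:

while (curr) {
    if (curr->next)
        __builtin_prefetch(curr->next, 0, 1);  // 0 = read, 1 = low temporal locality
    sum += curr->value;
    curr = curr->next;
}

This overlaps fetching the next node with processing the current one, but it cannot look two nodes ahead without first loading the intervening pointer.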
If you're working on embedded systems, the situation is more extreme:
Typical embedded RISC-V MCU:

- A single in-order core, often in the tens to hundreds of MHz
- A 16 KB L1 cache (sometimes no cache at all), and no L2 or L3
- SRAM measured in KB, not GB
With a 16 KB L1 cache, your entire working set needs to fit in 16 KB or you'll thrash the cache.
For comparison:

- Desktop L1: 32-64 KB per core
- Desktop L2: 256 KB - 1 MB
- Desktop L3: 4-32 MB, shared

Your entire embedded cache budget is smaller than a desktop's L1.
This is why embedded systems developers obsess over data structure size and layout. Every byte counts.
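One example of that obsession: ordering struct fields to avoid padding. The sizes in the comments are typical for common ABIs, not guaranteed:

#include <stdint.h>
#include <stdio.h>

// Poor order: the compiler pads each small field up to the
// alignment of the next, larger one.
struct bad  { uint8_t a; uint32_t b; uint8_t c; };  // typically 12 bytes

// Better order: largest fields first, small fields packed together.
struct good { uint32_t b; uint8_t a; uint8_t c; };  // typically 8 bytes

int main(void) {
    printf("bad:  %zu bytes\n", sizeof(struct bad));
    printf("good: %zu bytes\n", sizeof(struct good));
    return 0;
}

A third of the structure evaporates just by reordering fields; across an array of thousands of these, that is a third fewer cache lines to fetch.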
In embedded systems, we often care about worst-case performance, not average-case.
Consider a real-time control loop running at 1 kHz (1 ms period):

- At 1 GHz, each period is a budget of 1,000,000 cycles
- A few thousand unexpected cache misses, at ~100 cycles each, can burn a third of that budget
If your algorithm has unpredictable cache behavior, you can't guarantee real-time deadlines.
This is why real-time systems often prefer:

- Statically allocated, fixed-size arrays
- Algorithms with tight, predictable worst-case bounds
- Access patterns the cache and prefetcher can handle (sketched below)
Even if they're "slower" in average-case Big-O terms.
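Here's a sketch of what that preference looks like in practice. Everything below, names and capacity included, is illustrative rather than taken from a real codebase:

#include <stdbool.h>
#include <stdint.h>

#define MAX_SENSORS 32  // fixed capacity chosen at design time

// Static allocation: no malloc, no fragmentation, footprint known
// at link time. Parallel arrays keep the scanned data contiguous.
static uint32_t sensor_ids[MAX_SENSORS];
static int32_t  sensor_values[MAX_SENSORS];
static int      sensor_count = 0;

// Bounded linear scan: at most MAX_SENSORS iterations over two
// cache lines of IDs. The worst case is fixed and easy to budget
// against a 1 ms deadline.
bool read_sensor(uint32_t id, int32_t *out) {
    for (int i = 0; i < sensor_count; i++) {
        if (sensor_ids[i] == id) {
            *out = sensor_values[i];
            return true;
        }
    }
    return false;
}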
This book will teach you to think about data structures in terms of hardware reality:
Part I: Foundations
Part II: Basic Data Structures
Part III: Trees and Hierarchies
Part IV: Advanced Topics
Part V: Case Studies
Each chapter will include:

- Working C implementations you can compile and run
- Benchmark measurements from the framework built in Chapter 3
- Analysis of the cache behavior behind the numbers
To get the most out of this book, you should:

Know:
- C, including pointers and structs
- Basic data structures (arrays, lists, trees, hash tables)
- Big-O notation, at least informally

Have:
- A C compiler (gcc or clang)
- A Linux machine with perf, to follow along with the measurements

Optional but helpful:
- RISC-V hardware, or an emulator such as QEMU
All code examples and benchmarks are available at:
github.com/dannyjiang/ds-in-practice (placeholder)
In the next chapter, we'll dive deep into the memory hierarchy. You'll learn:

- How caches are organized: lines, sets, and associativity
- What happens, step by step, on a cache miss
- How to predict which access patterns will hit and which will miss
Then in Chapter 3, we'll build a complete benchmarking framework—the same one used for all measurements in this book.
By the end of Part I, you'll have the tools and knowledge to analyze any data structure's real-world performance.
Let's get started.
The mystery from 2:00 AM was solved. The O(log n) binary search beat the O(1) hash table by 40% because cache behavior mattered more than algorithmic complexity. The hash table's 71.5% cache miss rate versus the binary search's 21.1% explained everything. The hardware had overruled the algorithm.
Key insights:

- Big-O counts operations but says nothing about what each operation costs
- On modern hardware, memory access cost often dominates computation
- Arrays exploit spatial locality and the prefetcher; linked lists defeat both
- Cache behavior can invert the textbook ranking of two algorithms

The Performance Gap:

- A DRAM access costs roughly 100× a register access
- Memory moves in 64-byte cache lines, whether you use 4 bytes of one or all 64
- Real-time deadlines care about worst-case behavior, not the average
Next Chapter: We'll explore the memory hierarchy in detail and learn how caches work at the hardware level.