Includes:

- AVX-512 SIMD path + scalar fallback
- Wait-free lookups with rebuild-and-swap dynamic FIB
- Benchmarks on synthetic data and real RIPE RIS BGP (~254K prefixes)
Interesting result: on real BGP + uniform random lookups, a plain Patricia trie can sometimes match or beat the SIMD tree due to cache locality and early exits.
Would love feedback, especially comparisons with PopTrie / CP-Trie.
[0] https://github.com/esutcu/planb-lpm/blob/748d19d5fbd945cefa3...
[1] https://github.com/esutcu/planb-lpm/blob/748d19d5fbd945cefa3...
__m512i vx = _mm512_set1_epi64(static_cast<long long>(x));
__m512i vk = _mm512_load_si512(reinterpret_cast<const __m512i*>(base));
__mmask8 m = _mm512_cmp_epu64_mask(vx, vk, _MM_CMPINT_GE);
return static_cast<std::uint32_t>(__builtin_popcount(m));
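For readers who don't think in intrinsics: the four lines above compute "how many of the node's 8 sorted keys are <= x", and that popcount is the child index to descend into. A scalar sketch of the same computation (my own illustration, not the library's code):

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>

// Scalar equivalent of the AVX-512 snippet above: compare x against all
// 8 keys of a node and count how many it is >= . The count is exactly
// the index of the child subtree to descend into.
std::uint32_t child_index_scalar(const std::uint64_t* base, std::uint64_t x) {
    std::uint32_t count = 0;
    for (std::size_t i = 0; i < 8; ++i)
        count += (x >= base[i]) ? 1u : 0u;  // one bit of the compare mask
    return count;                           // == popcount of the mask
}
```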
On RISC-V, the AVX-512 sequence would be replaced with:

return __riscv_vcpop(__riscv_vmsgeu(__riscv_vle64_v_u64m1(base, FANOUT), x, FANOUT), FANOUT);

and you set FANOUT to __riscv_vsetvlmax_e64m1() at runtime. Alternatively, if you don't want a dynamic FANOUT, you keep FANOUT=8 (or another constant) and do a strip-mining loop:
size_t cnt = 0;
for (size_t vl, n = 8; n > 0; n -= vl, base += vl) {
vl = __riscv_vsetvl_e64m1(n);
cnt += __riscv_vcpop(__riscv_vmsgeu(__riscv_vle64_v_u64m1(base, vl), x, vl), vl);
}
return cnt;
This will take FANOUT/vl iterations and the branches will be almost perfectly predicted. If you know FANOUT is always 8 and you'll never want to change it, you could alternatively select the optimal LMUL:
size_t vl = __riscv_vsetvlmax_e64m1();
if (vl == 2) return __riscv_vcpop(__riscv_vmsgeu(__riscv_vle64_v_u64m4(base, 8), x, 8), 8);
if (vl == 4) return __riscv_vcpop(__riscv_vmsgeu(__riscv_vle64_v_u64m2(base, 8), x, 8), 8);
return __riscv_vcpop(__riscv_vmsgeu(__riscv_vle64_v_u64m1(base, 8), x, 8), 8);

IPv6 longest-prefix-match (LPM) using a linearized B+-tree with AVX-512 SIMD and a scalar fallback. Based on the algorithm from:
Zhihao Zhang, Lanzheng Liu, Chen Chen, Huiba Li, Jiwu Shu, Windsor Hsu, Yiming Zhang. PlanB: Efficient Software IPv6 Lookup with Linearized B+-Tree. NSDI '26. arXiv:2604.14650. Author's reference code: https://github.com/nicexlab/planb.
This repository is an independent, clean-room reimplementation of the algorithm as a portable, MIT-licensed C++17 library with extras that were not part of the reference:
- A header-only core (include/lpm6.hpp) that compiles without
  AVX-512 and transparently falls back to a scalar path.
- A dynamic FIB (lpm6::Dynamic) using the paper's rebuild-and-swap model
  with std::atomic<std::shared_ptr>; lookups are wait-free.
- Python bindings (pip install -e .).
- Correctness tests (tests/test_correctness.cpp) that verify the tree
  against a brute-force LPM reference on synthetic FIBs.
- A sample-data generator (examples/generate_sample.py).

The PlanB paper is a solid engineering idea, but the reference code on
GitHub is not something you can drop into another project: it is Linux-
and AVX-512-only, has no license, no Python layer, and no dynamic-FIB
path. planb-lpm re-expresses the algorithm from the paper as a
portable, tested, and permissively licensed library so the technique is
actually usable in research, teaching, and production software.
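The rebuild-and-swap model mentioned above can be sketched as follows. This is an illustration with made-up names (Snapshot, DynamicSketch), not the library's real API; it uses C++17's atomic free functions on shared_ptr, which play the role the README assigns to std::atomic<std::shared_ptr>:

```cpp
#include <atomic>
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Readers load the current immutable snapshot with one atomic pointer
// load and never block; the writer rebuilds a fresh snapshot off to the
// side and publishes it with a single atomic store.
struct Snapshot {
    std::map<std::string, int> fib;  // stand-in for the immutable tree
};

class DynamicSketch {
    std::shared_ptr<const Snapshot> cur_ = std::make_shared<Snapshot>();
public:
    int lookup(const std::string& key) const {
        // Wait-free: one atomic load, then read-only traversal.
        std::shared_ptr<const Snapshot> snap = std::atomic_load(&cur_);
        auto it = snap->fib.find(key);
        return it == snap->fib.end() ? -1 : it->second;
    }
    void insert(const std::string& key, int nexthop) {
        // Rebuild-and-swap: copy, modify, publish atomically.
        auto next = std::make_shared<Snapshot>(*std::atomic_load(&cur_));
        next->fib[key] = nexthop;
        std::atomic_store(&cur_, std::shared_ptr<const Snapshot>(std::move(next)));
    }
};
```

In-flight lookups keep using the old snapshot until their shared_ptr drops; the last reader of a retired snapshot frees it.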
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build --output-on-failure
Options:
| CMake option | Default | Effect |
|---|---|---|
| LPM6_BUILD_PYTHON | OFF | Build the pybind11 extension |
AVX-512 is auto-detected via check_cxx_compiler_flag; on CPUs or compilers
without support, the scalar path is used.
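The compile-time check above decides whether the AVX-512 path is built at all. A runtime check can complement it when a binary built with AVX-512 may run on older CPUs; this is my own sketch using the GCC/Clang builtin, not the repo's detection code:

```cpp
#include <cassert>

// Returns true if the running CPU reports AVX-512 Foundation support.
// On non-x86 targets or unknown compilers we conservatively report
// false, i.e. take the scalar path.
bool have_avx512() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();                     // safe to call repeatedly
    return __builtin_cpu_supports("avx512f");
#else
    return false;
#endif
}
```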
python3 examples/generate_sample.py 100000 1000000 42
./build/planb-lpm examples/sample_fib.txt examples/sample_trace.txt 20
- Prefix lengths follow the RIPE rrc00-25 table distribution (roughly /48 45%, /32 11%,
  /40 10%, /44 10%, /36 4%, others), generated by examples/generate_sample.py.
- Runs are pinned with taskset -c <N> to cut scheduler noise.
  The binaries print the CPU they ran on, the affinity-mask width, the
  cpufreq governor, and the Intel turbo state so numbers are
  reproducible. (On WSL the /sys probes are unavailable and those
  lines are silently omitted.)
- Compiled at -O3. This is a consumer CPU; the
  paper's numbers are from a 24-core Xeon 8331C and a Zen5 Ryzen 9 370HX.

Command: taskset -c 2 ./build/planb-lpm examples/sample_fib.txt examples/sample_trace.txt 20 (20 timed runs + 1 warmup; batch paths use the real public API returning next-hops):
| Path | min | q1 | median | q3 | max | IQR |
|---|---|---|---|---|---|---|
| single | 34 | 45 | 48 | 51 | 62 | 6.6 |
| batch<1> | 32 | 44 | 47 | 51 | 60 | 7.6 |
| batch<2> | 36 | 48 | 54 | 58 | 68 | 10.4 |
| batch<4> | 42 | 60 | 66 | 71 | 84 | 11.3 |
| batch<8> | 53 | 66 | 73 | 81 | 101 | 13.4 |
| batch<16> | 43 | 57 | 62 | 66 | 78 | 9.1 |
| batch<32> | 41 | 70 | 76 | 83 | 94 | 13.4 |
Units: MLPS; medians from a rolling average of four 20-run sweeps. Build time for that FIB is ~25 ms, tree depth 6, 531 k keys across 200 k edges.
Observations:
- batch<1> ≈ single: the batched entry point adds no overhead
  at M=1, confirming the dispatch cost is negligible.
- The best batched path (batch<8>) runs at roughly 1.5× single.
- Throughput from M=8 through M=32 is essentially flat; going further would
  need larger-batch unrolling and register-allocation work.
- The batch<16> dip is likely the #pragma GCC unroll 16 boundary interacting
  with the 7-level traversal. A known-to-investigate item.

The paper's ablation reports batching worth 3–4.5× on AMD Zen5; we see ~1.5×
on Ice Lake. Part of the gap is the paper's lookup_batch_checksum-style
fast path (our binary uses the public API that also writes next-hops),
part is AVX-512 sustained-frequency differences between Ice Lake and
Zen5. The paper reports 191–197 MLPS single-core on Xeon and 374–393
MLPS on Zen5; a real server should land near those numbers.
The bench_naive binary builds a Patricia radix trie (tests/patricia.hpp)
on the same FIB and runs the same trace through it. Patricia is a
standard path-compressed binary trie, representative of the
pointer-chasing LPM structures PlanB is designed to replace.
taskset -c 2 ./build/bench_naive … (same 20-run discipline):
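To make the "pointer-chasing" contrast concrete, here is a deliberately simplified binary-trie LPM (no path compression, so it is not the repo's Patricia baseline) over the upper 64 bits of an address. Every level is a pointer dereference, and a lookup can exit early once it runs out of children:

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// One node per trie level: -1 means no prefix terminates here.
struct TrieNode {
    int nexthop = -1;
    std::unique_ptr<TrieNode> child[2];
};

void trie_insert(TrieNode& root, std::uint64_t prefix, unsigned len, int nh) {
    TrieNode* n = &root;
    for (unsigned i = 0; i < len; ++i) {
        unsigned bit = (prefix >> (63 - i)) & 1u;
        if (!n->child[bit]) n->child[bit] = std::make_unique<TrieNode>();
        n = n->child[bit].get();
    }
    n->nexthop = nh;  // prefix of length len ends at this node
}

int trie_lookup(const TrieNode& root, std::uint64_t addr) {
    const TrieNode* n = &root;
    int best = root.nexthop;  // longest match seen so far
    for (unsigned i = 0; i < 64 && n; ++i) {
        n = n->child[(addr >> (63 - i)) & 1u].get();  // one chase per bit
        if (n && n->nexthop != -1) best = n->nexthop;
        // Early exit: once n is null, no longer prefix can match.
    }
    return best;
}
```

Each matched bit is a dependent load, which is exactly the latency chain the linearized B+-tree replaces with a handful of wide compares.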
| Implementation | min | median | max | IQR |
|---|---|---|---|---|
| lpm6::Tree | 23 | 50 | 63 | 8.2 |
| Patricia trie | 2.4 | 2.4 | 2.6 | 0.06 |
Units: MLPS. Tree-vs-Patricia ≈ 20× on the median, bracketing the
upper end of the paper's reported 1.6–14× advantage over PopTrie,
CP-Trie, Neurotrie, and HBS. Patricia isn't one of those baselines;
it's a conventional-radix-trie stand-in, so the exact multiplier will
shift when we benchmark against the paper's actual algorithms (on the
roadmap, see plan.md). The binary also prints a linear-scan number
on a 50 000-address slice; that is O(N) vs. O(log N) and is only a
sanity check, not a comparison.
Same 100 000-prefix workload. footprint_bytes() counts the
algorithmic cost (our flat key + next-hop arrays for the tree; one
Node struct per Patricia node); RSS delta is the process-level
VmRSS change across the build and captures allocator overhead:
| Implementation | Algorithmic | RSS delta | Per-prefix |
|---|---|---|---|
| lpm6::Tree | 4.82 MB | 5.05 MB | ~50 B/prefix |
| Patricia trie | 6.04 MB | 9.02 MB | ~90 B/prefix |
At 100 K, the tree's flat cache-aligned layout costs roughly half the
RSS of the pointer-chasing Patricia. This does not hold at scale
(see the FIB-size sweep below): the linearized B+-tree grows in discrete
depth steps (9^depth buckets), so its footprint is bounded from
below by depth, not prefix count. Once depth transitions from 6 to 7 (around
250 K prefixes) the tree's algorithmic size jumps to ~38 MB and stays
there until depth 7 fills up. Patricia scales linearly with prefixes,
so at 250 K the tree is actually larger than Patricia. The paper
reports 56–92% memory reductions vs PopTrie/CP-Trie/Neurotrie/HBS;
those are heavier than plain Patricia, so the paper's advantage likely
still holds against them even at our 250 K crossover.
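The depth-step numbers can be reproduced from the structure itself. A full 9-way tree of depth d with 8 keys per node holds 8·(9^0 + … + 9^(d-1)) = 9^d − 1 keys; at depth 7 that is 4 782 968 keys, which at 8 bytes each gives the ~38 MB quoted above (and matches the "4.78 M sentinel-padded keys" reported for the real-BGP run below):

```cpp
#include <cassert>
#include <cstdint>

// Total key slots in a full linearized tree: 8 keys per node,
// 9-way fanout, summed over all levels. Equals 9^depth - 1.
std::uint64_t full_tree_keys(unsigned depth) {
    std::uint64_t total = 0, nodes = 1;
    for (unsigned level = 0; level < depth; ++level) {
        total += nodes * 8;  // 8 keys in each node at this level
        nodes *= 9;          // 9 children per node
    }
    return total;
}
```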
examples/run_sweep.sh drives planb-lpm, bench_update, and
bench_naive across five FIB sizes sharing one 1 M-address trace,
same 20-run discipline, pinned to core 2. Numbers are medians of
the timed runs; throughput units are MLPS:
| FIB | depth | single | batch<8> | batch<32> | rebuild | tree alg. | Patricia alg. |
|---|---|---|---|---|---|---|---|
| 10 K | 5 | 94 | 165 | 148 | 1.5 ms | 0.53 MB | 0.61 MB |
| 100 K | 6 | 46 | 69 | 75 | 17.7 ms | 4.82 MB | 6.04 MB |
| 250 K | 7 | 20 | 29 | 41 | 51.8 ms | 38.40 MB | 15.00 MB |
| 500 K | 7 | 12 | 18 | 27 | 107.9 ms | n/m | n/m |
| 1 M | 7 | 10 | 14 | 22 | 213.4 ms | n/m | n/m |
(n/m = not measured; Patricia allocation and the brute-force sanity
check become impractical at 500 K+ on a laptop, so the sweep script
only runs them for ≤250 K. Tree footprint at 500 K / 1 M is dominated
by the same depth-7 bucket skeleton as 250 K and grows only with the
filled-leaf count; we haven't added a separate probe yet.)
Observations:
- At 10 K, batch<8> beats batch<32> (165 vs 148 MLPS): everything is in
  L1/L2 and large unrolls just add register pressure. At 1 M, batch<32>
  is the best path (22 vs 14 for batch<8>); the long-latency DRAM
  fetches benefit from more in-flight loads.
- Dataset caveat: these are synthetic FIBs with a RIPE-weighted length
  distribution, not real BGP tables. Real BGP has stronger prefix-length
  clustering and spatial locality in advertised ranges, both of which
  change cache behavior; see the next section for a reproduction on an
  actual RIPE RIS RIB.
Real BGP table (RIPE RIS rrc00)

examples/mrt_to_fib.py parses a TABLE_DUMP_V2 / RIB_IPV6_UNICAST
MRT dump (RFC 6396) into a planb-lpm FIB. On the rrc00 full-table
latest-bview.gz of 2026-04-19 we get 254 197 unique IPv6
prefixes, right in the range (≈235 K) the paper cites for rrc00 in
its evaluation. Same 1 M uniform-random 64-bit trace, same 20-run
discipline, pinned to core 2:
| Path | min | median | max | IQR |
|---|---|---|---|---|
| single | 48.7 | 67.5 | 74.4 | 3.6 |
| batch<8> | 115.3 | 137.2 | 151.0 | 8.7 |
| batch<32> | 100.5 | 118.6 | 137.1 | 8.3 |
Units: MLPS. Tree depth 7, 4.78 M sentinel-padded keys, 38.4 MB
algorithmic footprint (same skeleton as the synthetic 250 K row).
Build is 63 ms; a full rebuild under bench_update is median 40 ms.
(The paper reports 850 ms/1 M on Zen5, i.e. ~213 ms at 250 K scale;
our Ice Lake laptop at 40 ms is ~5× faster, likely because the paper
bundles more preprocessing into its rebuild number.)
Throughput is roughly 3× higher on real BGP than on the same-size synthetic (67 vs 20 MLPS single). The synthetic distribution is length-weighted but picks each prefix's address uniformly; real BGP concentrates advertisements in a small fraction of the address space, so a uniform-random lookup trace against a real RIB tends to hit the same hot tree paths and stays L2-resident. This is the opposite direction of "real data is worse" that we expected; it mainly tells us our trace is the limiting factor, not the FIB.
Against real BGP the Patricia baseline gets faster too, and by more:
| Implementation | min | median | max | IQR |
|---|---|---|---|---|
| lpm6::Tree | 43.5 | 65.3 | 75.4 | 5.1 |
| Patricia trie | 51.2 | 95.7 | 110.9 | 24.2 |
Patricia jumps from 2.4 MLPS on the 100 K synthetic to 95 MLPS on real BGP, roughly 40× better on the same machine, and is now ahead of the tree. Two things are going on: Patricia's hot paths stay cache-resident for the same trace-concentration reason as the tree, and uniform-random addresses mostly fall outside the advertised space, so Patricia's early exits end most lookups after a few nodes while the tree always pays its full fixed-depth traversal.
The honest conclusion: on this hardware, on a real BGP table, against a uniform 64-bit trace, a plain Patricia trie is roughly at parity with the tree. The tree's advantage shows up with non-uniform traces (real packet captures concentrate on long prefixes where Patricia walks deep) or with the paper's actual baselines. Both are on the roadmap.
bench_mt runs batch<8> across T threads, each pinned to a distinct
logical CPU, all sharing one immutable lpm6::Tree. Threads release
in lockstep per run so L3 / DRAM contention is measured, not dodged.
Same 20-run discipline. Laptop: 4 physical cores / 8 logical (SMT):
| FIB | T | per-thread min | per-thread max | aggregate MLPS | efficiency |
|---|---|---|---|---|---|
| BGP 254 K | 1 | 95 | 140 | 130 | 1.00Γ |
| BGP 254 K | 2 | 95 | 143 | 278 | 1.07Γ |
| BGP 254 K | 4 | 68 | 142 | 396 | 0.76Γ |
| BGP 254 K | 8 | 17 | 98 | 626 | 0.60Γ |
| synth 1 M | 1 | 12 | 18 | 15 | 1.00Γ |
| synth 1 M | 2 | 13 | 18 | 31 | 1.02Γ |
| synth 1 M | 4 | 7 | 16 | 52 | 0.85Γ |
| synth 1 M | 8 | 3 | 15 | 85 | 0.69Γ |
Efficiency = aggregate at T ÷ (T × aggregate at T=1). Units MLPS.
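The efficiency column can be checked directly against its definition; for example, at T=2 on the BGP table, 278 ÷ (2 × 130) ≈ 1.07:

```cpp
#include <cassert>
#include <cmath>

// Scaling efficiency as defined above: aggregate throughput at T
// threads divided by T times the single-thread aggregate.
double efficiency(double aggregate_T, int T, double aggregate_1) {
    return aggregate_T / (T * aggregate_1);
}
```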
Observations:
Paper reports 3.4 BLPS on 12 Xeon cores (§5.4). Our 8-thread laptop
(4 real cores + SMT) hits 0.63 BLPS on the same-scale BGP table,
roughly 5× below the paper, which tracks with 12 vs 4 real cores
plus Xeon's wider L2 / higher sustained AVX-512 frequency. A proper
comparison needs a server-class box; see Phase 2.6 in plan.md.
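The "release in lockstep" discipline described above can be sketched as follows (illustrative code, not bench_mt's actual implementation): every worker parks on a shared flag, and the main thread flips it only once all T workers are waiting, so the memory-system contention is simultaneous rather than staggered:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Spawns T workers, releases them in lockstep, and returns how many ran.
// In the real benchmark the commented slot would be the lookup loop.
int run_lockstep(int T) {
    std::atomic<bool> go{false};
    std::atomic<int> ready{0}, done{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < T; ++t)
        workers.emplace_back([&] {
            ready.fetch_add(1);  // announce we are parked at the gate
            while (!go.load(std::memory_order_acquire)) { /* spin */ }
            done.fetch_add(1);   // the timed lookup loop would go here
        });
    while (ready.load() != T) { }               // wait for all workers
    go.store(true, std::memory_order_release);  // release in lockstep
    for (auto& w : workers) w.join();
    return done.load();
}
```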
PlanB's update model is batched: N pending prefix changes are coalesced
into one full rebuild (Β§3.6 of the paper). bench_update measures
rebuild time and per-op latency on the 100 000-prefix FIB:
tree rebuild : min 17.2 med 18.8 max 25.0 ms (IQR 1.2)
dynamic insert (incl. rebuild) : med 14.4 ms
amortized if coalesced per 1 K : ~14 Β΅s / update
amortized if coalesced per 10 K: ~1.4 Β΅s / update
The paper reports 850 ms to rebuild a 1 M-prefix FIB on an AMD Zen5 server; linear scaling gives ~85 ms for our 100 K setup and we see ~19 ms on Ice Lake, so the rebuild path is at parity within hardware noise.
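The amortization and scaling arithmetic above is simple enough to verify in two lines: coalescing N updates into one rebuild of R ms costs R·1000/N µs per update, and linear scaling maps the paper's 850 ms at 1 M prefixes to ~85 ms at 100 K:

```cpp
#include <cassert>
#include <cmath>

// Cost per update when N pending changes share one rebuild of R ms.
double amortized_us_per_update(double rebuild_ms, long coalesced) {
    return rebuild_ms * 1000.0 / coalesced;
}

// Linearly rescale a rebuild time from one FIB size to another.
double scaled_rebuild_ms(double ms, long from_prefixes, long to_prefixes) {
    return ms * to_prefixes / from_prefixes;
}
```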
Roadmap and caveats:

- A server-class run on a c2-standard-16 (Cascade Lake, AVX-512) is on the roadmap
  (Phase 2.6 in plan.md).
- Benchmarks against the paper's own baselines are planned in plan.md
  (Phase 2.7); the others are tracked but not scheduled.
- Throughput on the real rrc00 table is ~3× higher
  than on a same-size synthetic for uniform 64-bit traces, and
  Patricia's relative position against the tree flips entirely. Both
  the synthetic and the real-BGP runs use a uniform-random lookup
  trace; real packet captures concentrate on long prefixes, which
  would penalize Patricia further and is the regime where the
  linearized tree's fixed 7-cache-line cost looks best.

Treat these numbers as a reproducibility check that the algorithm behaves as the paper predicts on commodity hardware, not as a competitive evaluation.
pip install pybind11 scikit-build-core
pip install -e .
python examples/basic_usage.py
from planb_lpm import Tree, Dynamic
t = Tree()
t.build([("2001:db8::", 32, 10), ("fe80::", 10, 20)])
t.lookup("2001:db8::1") # -> 10
d = Dynamic()
d.load([("::", 0, 1)])
d.insert("2001:db8::", 32, 7)
d.lookup("2001:db8::abcd") # -> 7
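The linearization step at the heart of the build can be sketched as follows. This is my own illustration under stated assumptions (prefix lengths ≤ 64, addresses truncated to their upper 64 bits, as in the lookup traces above); prefix_to_interval is a made-up name, not the library's API:

```cpp
#include <cassert>
#include <cstdint>

// A prefix of length L on the upper 64 bits covers the half-open
// interval [start, start + 2^(64-L)). For L == 0 the interval is the
// whole space; we encode that as {0, 0} since the end wraps to zero.
struct Interval { std::uint64_t start, end; };

Interval prefix_to_interval(std::uint64_t upper64, unsigned len) {
    std::uint64_t mask = (len == 0) ? 0 : ~std::uint64_t(0) << (64 - len);
    std::uint64_t start = upper64 & mask;            // clear host bits
    std::uint64_t span  = (len == 0) ? 0 : std::uint64_t(1) << (64 - len);
    return {start, start + span};                    // wraps to 0 for /0
}
```

Longest-prefix match then becomes a predecessor search over the sorted interval endpoints, which is what the B+-tree's per-level compare-and-popcount implements.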
How it works:

- Each prefix/len is linearized into a [start, end) interval on the upper
  64 bits of the IPv6 address space.
- The tree is 9-way with 8 keys per node, so level L holds 9^L · 8 keys.
- Each inner level issues one vpcmpuq + popcnt to pick the child, then does the same at the
  leaf to locate the predecessor edge.

include/lpm6.hpp core header-only library
src/main.cpp CLI benchmark
src/ipv6_util.hpp FIB/trace loaders
tests/test_correctness.cpp brute-force vs tree vs patricia
tests/patricia.hpp baseline path-compressed radix trie (benchmark only)
tests/bench_naive.cpp three-way throughput: naive / patricia / tree
tests/bench_update.cpp rebuild + dynamic-update latency
tests/bench_mt.cpp multi-core scaling (shared const tree, per-thread slice)
tests/bench_stats.hpp warmup / percentile / env-probe helpers
examples/mrt_to_fib.py RFC 6396 TABLE_DUMP_V2 β FIB converter for RIPE RIS
python/binding.cpp pybind11 module
examples/ sample data + usage scripts
All files in this repository are original work under the MIT license; see
LICENSE.md. The author's reference implementation at
https://github.com/nicexlab/planb is not vendored; see the paper cited
above for the algorithm itself.
The 0.1.0 tree has been built and tested in the following configurations on
Ubuntu 24.04 / GCC 13.3 (Intel Ice Lake):
| Configuration | Result |
|---|---|
| Release, AVX-512 auto-detect | ctest 1/1, bench OK |
| Release, AVX-512 forced off (scalar fallback) | ctest 1/1, 0 zmm instructions in binary |
| Debug + -fsanitize=address,undefined | ctest 1/1, no leaks / UB |
| -Wall -Wextra -Wpedantic -Wshadow -Wconversion -Werror | Clean build |
| pip install -e . in a fresh venv + examples/basic_usage.py | All lookups correct |
The core algorithm is from the PlanB paper by Zhang et al. (see top of file); all source in this repository is an independent reimplementation.
Contributions of any size are welcome: bug reports, benchmark results on
new hardware, alternative SIMD backends, language bindings, documentation
fixes. See CONTRIBUTING for the dev workflow,
commit conventions, and how to run the full verification suite (scalar
fallback, ASAN/UBSAN, -Werror, pip install -e .) before opening a
pull request.