- cpu usage under concurrency: many of these spin-lock or use atomics, which can use up to 100% cpu time just spinning.
- latency under concurrency: atomics cause cache-line bouncing which kills latency, especially p99 latency
Pure overwrite workload (pre-allocated values): xsync.Map: 24 B/op 1 alloc/op 31.89 ns/op orcaman/concurrent-map: 0 B/op 0 alloc/op 70.72 ns/op
Real-world mixed (80% overwrites, 20% new): xsync.Map: 57 B/op 2 allocs/op 218.1 ns/op orcaman/concurrent-map: 63 B/op 3 allocs/op 283.1 ns/op
Go maps reuse memory on overwrites, which is why orcaman achieves 0 B/op for pure updates. xsync's custom bucket structure allocates 24 B/op per write even when overwriting existing keys.
At 1M writes/second with 90% overwrites: xsync allocates ~27 MB/s, orcaman ~6 MB/s. The trade is 24 bytes/op for 2x speed under contention. Whether this matters depends on whether your bottleneck is CPU or memory allocation.
Benchmark code: standard Go testing framework, 8 workers, 100k keys.
Benchmarks for concurrent hash map implementations in Go.
Disclaimer: I'm the author of xsync, one of the libraries benchmarked here. I did my best to keep this benchmark neutral and fair to all implementations. If you spot any issues or have suggestions for improvements, please open an issue.
Since Go 1.24, sync.Map is backed by a HashTrieMap — a concurrent hash trie with 16-way branching at each level. Reads are lock-free via atomic pointer traversal through the trie nodes. Writes acquire a per-node mutex, affecting only a small subtree. The trie grows lazily as entries are inserted.
A hash table organized into cache-line-sized buckets, each holding up to 5 entries. Each bucket has its own mutex for writes, while reads are fully lock-free using atomic loads. Lookups use SWAR (SIMD Within A Register) techniques on per-entry metadata bytes for fast key filtering. Table resizing is cooperative: all goroutines help migrate buckets during growth.
A lock-free hash map combining a hash table index with sorted linked lists for collision resolution. All mutations (insert, update, delete) use atomic CAS operations. A background goroutine triggers table resize when the fill factor exceeds 50%. The sorted list ordering enables efficient concurrent traversal.
Based on Harris's lock-free linked list algorithm with a hash table index layer. Uses xxHash for hashing and atomic CAS for all mutations. Deletions are lazy (nodes are logically marked before physical removal). Auto-resizes when the load factor exceeds 50%.
A straightforward sharded design with 32 fixed shards. Each shard is a regular Go map protected by a sync.RWMutex. Keys are assigned to shards using FNV-32 hashing. The fixed shard count makes it simple and predictable, but limits scalability under high parallelism.
Each benchmark uses permille-based random operation selection:
string keys (with a long prefix to stress hashing)int keys| Size | Approx Footprint | Target |
|---|---|---|
| 100 | ~15 KB | Fits in L1 |
| 1,000 | ~150 KB | Fits in L2 |
| 100,000 | ~15 MB | Fits in L3 |
| 1,000,000 | ~150 MB | Spills to RAM |
# Run all benchmarks
go test -bench . -benchtime 5s
# Run a specific library
go test -bench BenchmarkXsyncMapOf -benchtime 5s
# Run only string key benchmarks
go test -bench 'StringKeys' -benchtime 5s
# Run only warm-up benchmarks at size=1000
go test -bench 'WarmUp.*size=1000' -benchtime 5s
The benchmark results in this repository were collected on the following setup:
Each benchmark was run with -benchtime 3s -count 3 and results were collected for each GOMAXPROCS value separately.
Each plot shows ops/s (Y axis) vs GOMAXPROCS (X axis) for all libraries. For read/write workloads, rows correspond to read percentages and columns to map sizes. Range plots have a single row with columns for map sizes.
cornelk/hashmap was benchmarked at sizes 100 and 1,000 only due to significant performance degradation at larger sizes.






All libraries report 0 allocs/op across every read/write benchmark. The tables below show B/op (bytes allocated per operation, amortized) for the WarmUp variant. Values are consistent across map sizes and GOMAXPROCS values. Range benchmarks are excluded as they allocate due to iteration overhead (closures, internal snapshots, channel buffers).
String keys:
| Library | 100% reads | 99% reads | 90% reads | 75% reads |
|---|---|---|---|---|
sync.Map |
0 | 0 | 3 | 9 |
xsync.Map |
0 | 0 | 1 | 2 |
cornelk/hashmap* |
0 | 0 | 1 | 4 |
alphadose/haxmap |
0 | 0 | 1 | 3 |
orcaman/concurrent-map |
0 | 0 | 0 | 0 |
Int keys:
| Library | 100% reads | 99% reads | 90% reads | 75% reads |
|---|---|---|---|---|
sync.Map |
0 | 0 | 3 | 8 |
xsync.Map |
0 | 0 | 0 | 2 |
cornelk/hashmap* |
0 | 0 | 1 | 4 |
alphadose/haxmap |
0 | 0 | 1 | 3 |
orcaman/concurrent-map |
0 | 0 | 0 | 0 |
* cornelk/hashmap was benchmarked at sizes 100 and 1,000 only due to significant performance degradation at larger sizes.
orcaman/concurrent-map shows zero allocations because its shards use regular Go maps, which don't allocate when overwriting existing keys. Other libraries allocate small amounts during writes due to their internal data structure overhead. sync.Map has the highest per-write allocation cost, while xsync.Map has the lowest among the non-sharded implementations.
| Library | Strengths | Weaknesses |
|---|---|---|
sync.Map |
Stdlib, no dependencies; excellent read scaling; solid all-round since Go 1.24 | Highest per-write allocation cost; slower than xsync.Map under all workloads |
xsync.Map |
Fastest in nearly every scenario; best read, write, and iteration scaling; lowest allocations among non-sharded designs | External dependency; writes allocate |
cornelk/hashmap |
Competitive at small sizes with read-heavy workloads | Significant performance degradation at sizes ≥100K with writes; limited to small maps in practice |
alphadose/haxmap |
Good read-only performance at small sizes; lock-free design | Poor write scaling under contention; falls behind at higher parallelism |
orcaman/concurrent-map |
Zero allocations (read/write); simple and predictable; decent single-threaded performance | Fixed 32 shards limit scalability; worst read-only throughput due to mutex overhead; write scaling plateaus early; slowest iteration due to channel-based API |