I've been building some software recently whose performance is very sensitive to the capabilities of the CPU on which it's running. A portable version of the code does not perform all that well, but we cannot guarantee the presence of the optional instruction set extensions which we could use to speed it up. What to do? That's what we'll be looking at today, mostly for the wildly popular x86-64 family of processors (but the general techniques apply anywhere).
Compilers are very good at optimising for a particular target CPU microarchitecture, for example when you use -march=native (or e.g. -march=znver3). Amongst other things, they know the ISA capabilities of these CPUs, and they will quietly take advantage of them at the cost of portability.
So the first way to speed up C software is to build for a more recent architecture where the compiler has the tools to speed the code up for you. This won't work for every problem or scenario, but if it's an option for you, it's very easy.
This works surprisingly well on x86-64 because it's now a very mature architecture. But this also means that there's a wide span of capabilities between the original x86-64 CPUs and the CPUs you can buy nowadays. To help make things a bit more digestible, the x86-64 psABI defines microarchitecture levels, with each later level including all the features of its predecessors:
| Level | Contains e.g. | Intel | AMD |
|---|---|---|---|
| x86-64-v1 | (base) | All 64 bit | All 64 bit |
| x86-64-v2 | POPCNT, SSE4.2 | 2008 (Nehalem/Westmere) | 2011 (Bulldozer) |
| x86-64-v3 | AVX2, BMI2 | 2013 (Haswell/Broadwell) | 2015 (Excavator) |
| x86-64-v4 | AVX-512[1] | 2017 (Skylake) | 2022 (Zen 4) |
[1] AVX-512 is not actually one feature, but v4 includes the most useful parts of it.
There are some gotchas I won't dwell on: not all kit released after these dates actually has these capabilities. In particular, some budget and low-power parts shipped for years without AVX2, and some later Intel parts dropped AVX-512 entirely.
However, in general, microarchitecture levels give you a good set of baseline capabilities for optimisation. Two ways to use them: ship separate builds of your libraries for each level and let the dynamic linker pick the best one (glibc searches glibc-hwcaps directories for this), or simply build the whole program for a chosen level with e.g. -march=x86-64-v3.
Obviously the second is less than ideal if you don't control all the hardware you can run on. Fortunately there's an answer for that (for popular compilers): indirect functions (IFUNCs). IFUNCs essentially have the dynamic linker run a function at load time which returns the real function to use according to the hardware available. And the best bit is that, for the general case, the compiler can even do all the work for you:

```c
[[gnu::target_clones("avx2,default")]] // gcc/glibc and clang
void * my_func(void *data)
{
    ...
}
```
Note that the square brackets here are C23 syntax for attributes. The equivalent compiler-specific version is __attribute__((target_clones("avx2,default"))). It's the little things that make C23 great!
This will create two versions of my_func: one built with AVX2 enabled and one with the default flags. It will also generate a resolver function in the background for the dynamic linker to run. Calls to the function will thus be linked to the best version at program startup time!
If you're lucky, this did the trick. If you're slightly less lucky, you may be able to trigger autovectorisation with some small modifications (such as alignment annotations). Sadly this process is finicky and unreliable, and there isn't really space for it in this post.
Sometimes you need to write multiple versions of an algorithm to get the best performance. Either you can't get autovectorisation to work (if it's SIMD you're after), or you need to work with some specific intrinsics (as I do for this project).
To take advantage of intrinsics directly, we must provide two versions of an algorithm: a portable version and a version that uses the intrinsics. Here's how we might optimise for AVX2 statically:
```c
#ifdef __AVX2__ // defined by the compiler when AVX2 is supported on the target
#include <immintrin.h> // header with the AVX2 intrinsics
void * my_func(void *data)
{
    ...
}
#else
void * my_func(void *data)
{
    ...
}
#endif
```
With this sort of technique, we can once again support building for targets with specific capabilities, except now with direct access to the intrinsics that can make things faster.
But we're still building for a specific target and we'd like not to do that. Unfortunately there isn't a portable way to do this, but there are compiler-specific hacks. Here's how we do it for gcc and clang for avx2:
```c
// ask the compiler to enable avx2
#pragma GCC push_options
#pragma GCC target ("avx2")
#pragma clang attribute push \
    (__attribute__((target("avx2"))), apply_to = function)

// now include the header with avx2 enabled
#include <immintrin.h>

// now undo to stop our portable code requiring avx2
#pragma GCC pop_options
#pragma clang attribute pop

[[gnu::target("avx2")]] // this function must be compiled for avx2
void * my_func_avx2(void *data)
{
    ...
}

void * my_func_portable(void *data)
{
    ...
}
```
Now we need a way to dispatch between them. As we are limiting ourselves to gcc and clang on x86-64, we can use the compiler-provided runtime platform detection to switch implementations:
```c
void * my_func(void *data)
{
    return __builtin_cpu_supports("avx2") ? my_func_avx2(data)
                                          : my_func_portable(data);
}
```
We could use IFUNCs instead, although we have to write our own resolver this time:
```c
static void * (*resolve_my_func(void)) (void *)
{
    __builtin_cpu_init(); // ifunc resolvers run before this is triggered automatically
    return __builtin_cpu_supports("avx2") ? my_func_avx2
                                          : my_func_portable;
}

void * my_func(void *data) __attribute__ ((ifunc ("resolve_my_func")));
```
Okay, it's a bit gnarly because of C's atrocious function-pointer syntax, but it works. At program startup, this will pick the best version!
At this point, since we're writing our own resolver function, we can provide any logic we like over as many different versions as we like. This makes it possible to handle more complex scenarios, such as working around AMD's BMI2 implementation being slow before Zen 3, or Intel's AVX-512 implementation aggressively downclocking the CPU when you use ZMM registers before Ice Lake. Or probably the scenario you find yourself in, if your luck is anything like mine.
musl libc does not (yet) support IFUNCs. It's not a simple feature.
I haven't said a word about Windows support. I do not have a Windows machine to test on, and in any case the project I'm doing this for is written in C23, while the compiler of choice for Windows (outside of WSL), MSVC, only partially supports C11 and C17. You'd be forgiven for thinking microslop don't actually want people to port C software to Windows!