Allocating on the Stack

Nice to see common and natural patterns to have their performance improved. Theoretically appending to a slice would be possible to handle with just stack growth, but that would require having large gaps between goroutine stacks and mapping them lazily upon access instead of moving goroutines to the new contiguous blocks as it's implemented right now. But given how many questionable changes it requires from runtime it's certainly not going to happen :)

This article is about Go, but I wonder how many C/C++ developers realize that you've always had the ability to allocate on the stack using alloca() rather than malloc().

Of course use cases are limited (variable length buffers/strings, etc) since the lifetime of anything on the stack has to match the lifetime of the stack frame (i.e the calling function), but it's super fast since it's just bumping up the stack pointer.

Awesome stuff! Does Go have profile-guided optimization? I'm wondering whether a profile could hint to the compiler how large to make the pre-reserved stack space.

Nice! That's (seems) so simple yet also so very effective. Shouldn't other memory-managed languages be able to profit from this as well?

alloca() is not part of the C++ standard, and I can't imagine how it could used safely in a C++ environment

If I had a nickel for every article about avoiding implicit boxing in gc-heap languages...

I read that as "Allocating on the Slack" and immediately came up with three ways how to do that.

Having big stack frames is bad for cache locality. Stack is not something magical, it's mapped to the same physical memory as heap and needs to be loaded. Pretty sure such optimization would reduce performance in most cases.

alloca() is not part of the C++ standard, and I can't imagine how it could used safely in a C++ environment

Nice! That's (seems) so simple yet also so very effective. Shouldn't other memory-managed languages be able to profit from this as well?

I read that as "Allocating on the Slack" and immediately came up with three ways how to do that.

Awesome stuff! Does Go have profile-guided optimization? I'm wondering whether a profile could hint to the compiler how large to make the pre-reserved stack space.

Yep. `go build -pgo=foo.pprof`

https://go.dev/doc/pgo

This article is about Go, but I wonder how many C/C++ developers realize that you've always had the ability to allocate on the stack using alloca() rather than malloc().

alloca() is super useful, but it's also quite dangerous because you can easily overflow the stack.

The obvious issue is that you can't know how much space is left on the stack, so you basically have to guess and pick an arbitrary "safe" size limit. This gets even more tricky when functions may be called recursively.

The more subtle issue is that the stack memory returned by alloca() has function scope and therefore you must never call it directly in a loop.

I use alloca() on a regular basis, but I have to say there are safer and better alternatives, depending on the particular use case: arena/frame allocators, threadlocal pseudo-stacks, static vectors, small vector optimizations, etc.

If you're not doing recursion, I prefer using an appropriately sized thread_local buffer in this scenario. Saves you the allocation and does the bookkeeping of having one per thread

Most C compilers let you use variable length arrays on the stack. However they're problematic and mature code bases usually disable this (-Werror -Wvla) because if the size is derived from user input then it's exploitable.

For purely historical reasons the C/C++ stack is "small" with exactly how small being outside of programmer control. So you have to avoid using the stack even if it would be the better solution. Otherwise you risk your program crashing/failing with stack overflow errors.

This is more of a patch/hack solution as far as I can understand.

You can just as well pass a heap allocated buffer + size around and allocate by incrementing/decrementing size.

Or even better use something like zig's FixedSizeAllocator.

Correct me if I am wrong please

alloca()'s availability and correctness/bugginess is platform dependent, so it probably sees only niche usage since it's not portable. Furthermore, even its man page discourages its use in the general case:

>The alloca() function is machine- and compiler-dependent. Because it allocates from the stack, it's faster than malloc(3) and free(3). In certain cases, it can also simplify memory deallocation in applications that use longjmp(3) or siglongjmp(3). Otherwise, its use is discouraged.

Furthermore:

>The alloca() function returns a pointer to the beginning of the allocated space. If the allocation causes stack overflow, program behavior is undefined.

https://man7.org/linux/man-pages/man3/alloca.3.html

It becames super slow when you bump that pointer into a page that's missing from the TLB.

If I had a nickel for every article about avoiding implicit boxing in gc-heap languages...

...you would have the same balance as before, because this is not an article about implicit boxing. ;-)

In the case where you're using the top of the stack as a, well, stack, I don't see the problem. It would only work if you're not interleaving processing of dynamically-sized objects and function codegen works out. It's similar to TCO in the sense of maintaining certain invariants across calls (e.g. no temporaries need be preserved), and actually in languages with TCO, like Lua, you can hack an application-level stack data structure using tail recursion (and coroutines/threads if you need more than one) that can sometimes be more performant or more convenient than using a native data structure.

There's been a least one experiment (posted a few years ago to HN) where someone benchmarked a stackful coroutine implementation with hundreds of thousands (millions?) of stacks that could grow contiguously on-demand up to, e.g., 2MB, but were initially minimally sized and didn't reserve the maximum stack size upfront. The bottleneck was the VMA bookkeeping--the syscalls, exploding the page table, TLB flushing, etc. In principle it could work well and be even more performant than existing solutions, and it might work better today since Linux 6.13's lightweight guard page feature, MADV_GUARD_INSTALL, but we probably still need more architectural support from the system (kernel, if not hardware) to make it performant and competitive with language-level solutions like goroutines, Rust async, etc.

But that requires an explicit declaration and isn't done automatically under the hood, or am I missing something?

If you're not doing recursion, I prefer using an appropriately sized thread_local buffer in this scenario. Saves you the allocation and does the bookkeeping of having one per thread

Yep. `go build -pgo=foo.pprof`

https://go.dev/doc/pgo

I never noticed much difference with using pgo even after taking a very long real life profile. All the machinery required to get it and put it to CI was never worth the speed-up. Of course YMMV.

alloca() is super useful, but it's also quite dangerous because you can easily overflow the stack.

The more subtle issue is that the stack memory returned by alloca() has function scope and therefore you must never call it directly in a loop.

> The obvious issue is that you can't know how much space is left on the stack [...]

Oh, huh. I've never actually tried it, but I always assumed it would be possible to calculate this, at least for a given OS / arch. You just need 3 quantities, right? `remaining_stack_space = $stack_address - $rsp - $system_stack_size`.

But I guess there's no API for a program to get its own stack address unless it has access to `/proc/$pid/maps` or similar?

If you have well defined boundaries, you can move the stack to an arbitrarily large chunk of memory before the recursive call and restore it to the system stack upon completion.

> alloca() is super useful, but it's also quite dangerous because you can easily overflow the stack.

This is not a problem for Go, because it has resizable stacks.

...you would have the same balance as before, because this is not an article about implicit boxing. ;-)

But that requires an explicit declaration and isn't done automatically under the hood, or am I missing something?

The JIT does this automatically in some cases as of .NET 10 (https://learn.microsoft.com/en-us/dotnet/core/whats-new/dotn...).

I never noticed much difference with using pgo even after taking a very long real life profile. All the machinery required to get it and put it to CI was never worth the speed-up. Of course YMMV.

If you have well defined boundaries, you can move the stack to an arbitrarily large chunk of memory before the recursive call and restore it to the system stack upon completion.

And if you never do reach completion, you can just garbage collect that chunk. AKA "Cheney on the MTA": https://dl.acm.org/doi/10.1145/214448.214454

> The obvious issue is that you can't know how much space is left on the stack [...]

But I guess there's no API for a program to get its own stack address unless it has access to `/proc/$pid/maps` or similar?

> alloca() is super useful, but it's also quite dangerous because you can easily overflow the stack.

This is not a problem for Go, because it has resizable stacks.

If your API includes inline assembly, then it's trivial. Go's internals would need it to swap stacks like it does. But I doubt any of that is exposed at the language level.

> $system_stack_size

Does such thing even exist? And non-64 bit platforms the address space is small enough that with several threads of execution you may just be unable to grow your stack even up to $system_stack_size because it'd bump into something else.

You can do something like:

    void *get_sp(void) {
        volatile char c;
        return (void *)&c;
    }

Or, in GCC and Clang:

    void *get_sp(void) {
        return __builtin_frame_address(0);
    }

Which gets you close enough.

It's certainly possible on some systems. Even then, you have to fudge, as you don't know exactly how much stack space you need to save for other things.

Stack memory is weird in general. It's usually a fixed amount determined when the thread starts, with the size typically determined by vibes or "seems to work OK." Most programmers don't have much of a notion of how much stack space their code needs, or how much their program needs overall. We know that unbounded non-tail recursion can overflow the stack, but how about bounded-but-large? At what point do you need to start considering such things? A hundred recursive calls? A thousand? A million?

It's all kind of sketchy, but it works well enough in practice, I suppose.

Furthermore:

>The alloca() function returns a pointer to the beginning of the allocated space. If the allocation causes stack overflow, program behavior is undefined.

https://man7.org/linux/man-pages/man3/alloca.3.html

What do you mean outside of programmer control? What's stopping you from setting the stack size in the linker flags?

This is more of a patch/hack solution as far as I can understand.

You can just as well pass a heap allocated buffer + size around and allocate by incrementing/decrementing size.

Or even better use something like zig's FixedSizeAllocator.

Correct me if I am wrong please

It becames super slow when you bump that pointer into a page that's missing from the TLB.

I wouldn't call it a hack, but it's not a general alternative for memory allocated on the heap since the lifetime is tied to that of the allocating function.

I think what you're referring to is an arena allocator where you allocate a big chunk of memory from the heap, then sequentially sub-allocate from that, then eventually free the entire heap chunk (arena) in one go. Arena allocators are therefore also special use case since they are for when all the sub-allocations have the same (but arbitrary) lifetime, or at least you're willing to defer deallocation of everything to the same time.

So, heap, arena and stack allocation all serve different purposes, although you can just use heap for everything if memory allocation isn't a performance issue for your program, which nowadays is typically the case.

Back in the day when memory was scarce and computers were much slower, another common technique was to keep a reuse "free list" of allocated items of a given type/size, which was faster than heap allocate and free/coalesce, and avoided the heap fragmentation of random malloc/frees.

A TLB miss could happen when executing the next statement in your program. It's not something you have a lot of control over, and doesn't change the fact that allocating from the stack (when an option) is going to be faster than allocating from the heap.

The JIT does this automatically in some cases as of .NET 10 (https://learn.microsoft.com/en-us/dotnet/core/whats-new/dotn...).

And if you never do reach completion, you can just garbage collect that chunk. AKA "Cheney on the MTA": https://dl.acm.org/doi/10.1145/214448.214454

You can do something like:

    void *get_sp(void) {
        volatile char c;
        return (void *)&c;
    }

Or, in GCC and Clang:

    void *get_sp(void) {
        return __builtin_frame_address(0);
    }

Which gets you close enough.

If your API includes inline assembly, then it's trivial. Go's internals would need it to swap stacks like it does. But I doubt any of that is exposed at the language level.

> $system_stack_size

> Does such thing even exist?

AFAIK no. There are default stack sizes, but they're just that, defaults, and they can vary on the same system: main thread stacks are generally 8MiB (except for Windows where it's just 1) but the size of ancillary stacks is much smaller everywhere but on linux using glibc.

It should be possible to get the stack root and size using `pthread_getattr_np`, but I don't know if there's anyone bothering with that, and it's a glibc extension.

It's certainly possible on some systems. Even then, you have to fudge, as you don't know exactly how much stack space you need to save for other things.

It's all kind of sketchy, but it works well enough in practice, I suppose.

Personally, I only use alloca() if:

1. I know that the function will never be called recursively and

2. the total amount of stack allocation is limited to a few kilobytes at most.

alloca() is more problematic on embedded platforms because default stack sizes tend to be tiny. Either document your stack usage requirements or provide an option to disable all calls to alloca(). For example, Opus has the OPUS_NONTHREADSAFE_PSEUDOSTACK option.

What do you mean outside of programmer control? What's stopping you from setting the stack size in the linker flags?

With Linux the stack size is a process limit, set with ulimit (default 8MB?). You can even set it to unlimited if you want, meaning that essentially (but not quite) the stack and heap grow towards each other only limited by the size of the address space.

ulimit only affects the main program stack though. if you are using multi-threading then there is a per-thread stack limit, which you can configure with pthreads, but not until C++23 for std::thread.

I wouldn't call it a hack, but it's not a general alternative for memory allocated on the heap since the lifetime is tied to that of the allocating function.

So you don't allocate left and right, be it stack or heap.

It's all useless though unless you control the hardware. If you don't, you might as well prlimit --stack=unlimited and have at it.

> Does such thing even exist?

It should be possible to get the stack root and size using `pthread_getattr_np`, but I don't know if there's anyone bothering with that, and it's a glibc extension.

.NET bothers with it, to support RuntimeHelpers.EnsureSufficientExecutionStack [1] and other things. See the pthreads calls used to here [2].

[1]: https://learn.microsoft.com/en-us/dotnet/api/system.runtime....

[2]: https://github.com/dotnet/runtime/blob/b6a3e784f0bb418fd2fa7...

Personally, I only use alloca() if:

1. I know that the function will never be called recursively and

2. the total amount of stack allocation is limited to a few kilobytes at most.

ulimit only affects the main program stack though. if you are using multi-threading then there is a per-thread stack limit, which you can configure with pthreads, but not until C++23 for std::thread.

So you don't allocate left and right, be it stack or heap.

It's all useless though unless you control the hardware. If you don't, you might as well prlimit --stack=unlimited and have at it.

.NET bothers with it, to support RuntimeHelpers.EnsureSufficientExecutionStack [1] and other things. See the pthreads calls used to here [2].

[1]: https://learn.microsoft.com/en-us/dotnet/api/system.runtime....

[2]: https://github.com/dotnet/runtime/blob/b6a3e784f0bb418fd2fa7...

We’re always looking for ways to make Go programs faster. In the last 2 releases, we have concentrated on mitigating a particular source of slowness, heap allocations. Each time a Go program allocates memory from the heap, there’s a fairly large chunk of code that needs to run to satisfy that allocation. In addition, heap allocations present additional load on the garbage collector. Even with recent enhancements like Green Tea, the garbage collector still incurs substantial overhead.

So we’ve been working on ways to do more allocations on the stack instead of the heap. Stack allocations are considerably cheaper to perform (sometimes completely free). Moreover, they present no load to the garbage collector, as stack allocations can be collected automatically together with the stack frame itself. Stack allocations also enable prompt reuse, which is very cache friendly.

Stack allocation of constant-sized slices

Consider the task of building a slice of tasks to process:

func process(c chan task) {
    var tasks []task
    for t := range c {
        tasks = append(tasks, t)
    }
    processAll(tasks)
}

Let’s walk through what happens at runtime when pulling tasks from the channel c and adding them to the slice tasks.

On the first loop iteration, there is no backing store for tasks, so append has to allocate one. Because it doesn’t know how big the slice will eventually be, it can’t be too aggressive. Currently, it allocates a backing store of size 1.

On the second loop iteration, the backing store now exists, but it is full. append again has to allocate a new backing store, this time of size 2. The old backing store of size 1 is now garbage.

On the third loop iteration, the backing store of size 2 is full. append again has to allocate a new backing store, this time of size 4. The old backing store of size 2 is now garbage.

On the fourth loop iteration, the backing store of size 4 has only 3 items in it. append can just place the item in the existing backing store and bump up the slice length. Yay! No call to the allocator for this iteration.

On the fifth loop iteration, the backing store of size 4 is full, and append again has to allocate a new backing store, this time of size 8.

And so on. We generally double the size of the allocation each time it fills up, so we can eventually append most new tasks to the slice without allocation. But there is a fair amount of overhead in the “startup” phase when the slice is small. During this startup phase we spend a lot of time in the allocator, and produce a bunch of garbage, which seems pretty wasteful. And it may be that in your program, the slice never really gets large. This startup phase may be all you ever encounter.

If this code was a really hot part of your program, you might be tempted to start the slice out at a larger size, to avoid all of these allocations.

func process2(c chan task) {
    tasks := make([]task, 0, 10) // probably at most 10 tasks
    for t := range c {
        tasks = append(tasks, t)
    }
    processAll(tasks)
}

This is a reasonable optimization to do. It is never incorrect; your program still runs correctly. If the guess is too small, you get allocations from append as before. If the guess is too large, you waste some memory.

If your guess for the number of tasks was a good one, then there’s only one allocation site in this program. The make call allocates a slice backing store of the correct size, and append never has to do any reallocation.

The surprising thing is that if you benchmark this code with 10 elements in the channel, you’ll see that you didn’t reduce the number of allocations to 1, you reduced the number of allocations to 0!

The reason is that the compiler decided to allocate the backing store on the stack. Because it knows what size it needs to be (10 times the size of a task) it can allocate storage for it in the stack frame of process2 instead of on the heap1. Note that this depends on the fact that the backing store does not escape to the heap inside of processAll.

Stack allocation of variable-sized slices

But of course, hard coding a size guess is a bit rigid. Maybe we can pass in an estimated length?

func process3(c chan task, lengthGuess int) {
    tasks := make([]task, 0, lengthGuess)
    for t := range c {
        tasks = append(tasks, t)
    }
    processAll(tasks)
}

This lets the caller pick a good size for the tasks slice, which may vary depending on where this code is being called from.

Unfortunately, in Go 1.24 the non-constant size of the backing store means the compiler can no longer allocate the backing store on the stack. It will end up on the heap, converting our 0-allocation code to 1-allocation code. Still better than having append do all the intermediate allocations, but unfortunate.

But never fear, Go 1.25 is here!

Imagine you decide to do the following, to get the stack allocation only in cases where the guess is small:

func process4(c chan task, lengthGuess int) {
    var tasks []task
    if lengthGuess <= 10 {
        tasks = make([]task, 0, 10)
    } else {
        tasks = make([]task, 0, lengthGuess)
    }
    for t := range c {
        tasks = append(tasks, t)
    }
    processAll(tasks)
}

Kind of ugly, but it would work. When the guess is small, you use a constant size make and thus a stack-allocated backing store, and when the guess is larger you use a variable size make and allocate the backing store from the heap.

But in Go 1.25, you don’t need to head down this ugly road. The Go 1.25 compiler does this transformation for you! For certain slice allocation locations, the compiler automatically allocates a small (currently 32-byte) slice backing store, and uses that backing store for the result of the make if the size requested is small enough. Otherwise, it uses a heap allocation as normal.

In Go 1.25, process3 performs zero heap allocations, if lengthGuess is small enough that a slice of that length fits into 32 bytes. (And of course that lengthGuess is a correct guess for how many items are in c.)

We’re always improving the performance of Go, so upgrade to the latest Go release and be surprised by how much faster and memory efficient your program becomes!

Stack allocation of append-allocated slices

Ok, but you still don’t want to have to change your API to add this weird length guess. Anything else you could do?

Upgrade to Go 1.26!

func process(c chan task) {
    var tasks []task
    for t := range c {
        tasks = append(tasks, t)
    }
    processAll(tasks)
}

In Go 1.26, we allocate the same kind of small, speculative backing store on the stack, but now we can use it directly at the append site.

On the first loop iteration, there is no backing store for tasks, so append uses a small, stack-allocated backing store as the first allocation. If, for instance, we can fit 4 tasks in that backing store, the first append allocates a backing store of length 4 from the stack.

The next 3 loop iterations append directly to the stack backing store, requiring no allocation.

On the 4th iteration, the stack backing store is finally full and we have to go to the heap for more backing store. But we have avoided almost all of the startup overhead described earlier in this article. No heap allocations of size, 1, 2, and 4, and none of the garbage that they eventually become. If your slices are small, maybe you will never have a heap allocation.

Stack allocation of append-allocated escaping slices

Ok, this is all good when the tasks slice doesn’t escape. But what if I’m returning the slice? Then it can’t be allocated on the stack, right?

Right! The backing store for the slice returned by extract below can’t be allocated on the stack, because the stack frame for extract disappears when extract returns.

func extract(c chan task) []task {
    var tasks []task
    for t := range c {
        tasks = append(tasks, t)
    }
    return tasks
}

But you might think, the returned slice can’t be allocated on the stack. But what about all those intermediate slices that just become garbage? Maybe we can allocate those on the stack?

func extract2(c chan task) []task {
    var tasks []task
    for t := range c {
        tasks = append(tasks, t)
    }
    tasks2 := make([]task, len(tasks))
    copy(tasks2, tasks)
    return tasks2
}

Then the tasks slice never escapes extract2. It can benefit from all of the optimizations described above. Then at the very end of extract2, when we know the final size of the slice, we do one heap allocation of the required size, copy our tasks into it, and return the copy.

But do you really want to write all that additional code? It seems error prone. Maybe the compiler can do this transformation for us?

In Go 1.26, it can!

For escaping slices, the compiler will transform the original extract code to something like this:

func extract3(c chan task) []task {
    var tasks []task
    for t := range c {
        tasks = append(tasks, t)
    }
    tasks = runtime.move2heap(tasks)
    return tasks
}

runtime.move2heap is a special compiler+runtime function that is the identity function for slices that are already allocated in the heap. For slices that are on the stack, it allocates a new slice on the heap, copies the stack-allocated slice to the heap copy, and returns the heap copy.

This ensures that for our original extract code, if the number of items fits in our small stack-allocated buffer, we perform exactly 1 allocation of exactly the right size. If the number of items exceeds the capacity our small stack-allocated buffer, we do our normal doubling-allocation once the stack-allocated buffer overflows.

The optimization that Go 1.26 does is actually better than the hand-optimized code, because it does not require the extra allocation+copy that the hand-optimized code always does at the end. It requires the allocation+copy only in the case that we’ve exclusively operated on a stack-backed slice up to the return point.

We do pay the cost for a copy, but that cost is almost completely offset by the copies in the startup phase that we no longer have to do. (In fact, the the new scheme at worst has to copy one more element than the old scheme.)

Wrapping up

Hand optimization can still be beneficial, especially if you have a good estimate of the slice size ahead of time. But hopefully the compiler will now catch a lot of the simple cases for you and allow you to focus on the remaining ones that really matter.

There are a lot of details that the compiler needs to ensure to get all these optimizations right. If you think that one of these optimizations is causing correctness or (negative) performance issues for you, you can turn them off with -gcflags=all=-d=variablemakehash=n. If turning these optimizations off helps, please file an issue so we can investigate.

Footnotes

1 Go stacks do not have any alloca-style mechanism for dynamically-sized stack frames. All Go stack frames are constant sized.

Hacker Times