Box to save memory in Rust

Very often if you have text, which this does, you can make huge savings by being intelligent with the text.

Rust intentionally provides the simplest possible growable string buffer String, which is literally (under the hood, you can't poke this legitimately) Vec<u8> plus the promise that this is UTF-8 text.

But you might find your needs better served by one (or several) of:

Box<str> -- you don't need capacity, so, don't store it => length == capacity

CompactString -- use the entire 24 bytes for SSO, up to 24 bytes of UTF-8 inline, obviously doesn't make sense if all or the vast majority of your strings are 25 bytes or longer

ColdString -- same idea but for 8 bytes, and also not storing capacity, this only makes sense over Box<str> if you have plenty of <= 8 byte strings

If anyone's doing this kind of optimization, dhat-rs is worth a look, it shows you exactly which fields and call sites are eating memory, instead of just a total. Saves a lot of guessing about where to start.

tbh "trait" feels like a very problematic name for that type, for this kind of educational purpose - `trait` is already an established concept and keyword: https://doc.rust-lang.org/book/ch10-02-traits.html

It's especially problematic because traits don't have memory behaviors like this article in most cases - by default they're unsized, because it's a description of behavior, not data, and you can't even use them as a struct field without extra work.

Like, replace "trait" in here with "box" and see how confusing it would be to be describing how you saved memory by boxing your box, because option doesn't box like many other languages do.

Are there any tools that help finding these kinds of things? Like a profiler that says "80% of the allocated bytes are objects of this type, with 95% of those having that field set to None"

Small correction:

> a lot of boxes means a fragmented heap. In such case it's not a problem but this might be worth keeping in mind.

A good malloc will be able to handle this without issue due to various optimizations specifically that inherently fight fragmentation. Default Linux malloc (glibc) may have issues but I did say good malloc (and even glibc generally shouldn’t struggle with the pattern described I think).

I quite often have this issue with async. You get a state machine that is huge because of how Rust builds it.

I wonder from time to time whether you can decide the best “schema shape” beforehand, ie before you can run real workloads that stress the memory implications of such things. This can be very useful if you are trying to decide the boundary of some public facing API, but for whatever reason can’t run benchmarks (lack of impl, data, time, etc).

Without that, if you try to suggest a transformation like this when the schema is first conceived, it will likely be considered premature optimization.

TLDR: use a nullable pointer instead of fields in nested structs to save memory.

Like, replace "trait" in here with "box" and see how confusing it would be to be describing how you saved memory by boxing your box, because option doesn't box like many other languages do.

Small correction:

> a lot of boxes means a fragmented heap. In such case it's not a problem but this might be worth keeping in mind.

Very often if you have text, which this does, you can make huge savings by being intelligent with the text.

But you might find your needs better served by one (or several) of:

Box<str> -- you don't need capacity, so, don't store it => length == capacity

CompactString -- use the entire 24 bytes for SSO, up to 24 bytes of UTF-8 inline, obviously doesn't make sense if all or the vast majority of your strings are 25 bytes or longer

ColdString -- same idea but for 8 bytes, and also not storing capacity, this only makes sense over Box<str> if you have plenty of <= 8 byte strings

There's really an endless list of these optimizations. A few I've used (though not necessarily in rust):

Atoms: Each string can be referenced with a single u32 or even u16, and they're inherently deduplicated.

Bump allocator: your strings are &str, allocation is super fast with limited fragmentation.

Single pointer strings (this has a name, I can't think of it right now): you store the length inside the allocation instead of in each reference, so your strings are a single pointer.

> String, which is literally (under the hood, you can't poke this legitimately) Vec<u8>

`String::as_vec_mut` kinda implies that, since it gives you access to that underlying `Vec` which must then exist somewhere.

CompactStr doesnt have any additional runtime overhead iirc right? So in theory you can drop it in everywhere even when you expect > 25 chars. Maybe an extra branch in the >25 char case?

Are there any tools that help finding these kinds of things? Like a profiler that says "80% of the allocated bytes are objects of this type, with 95% of those having that field set to None"

It would be super useful since I think this is pretty likely to be surprising to many users. But the profiler would need to be a particularly-specific refinement of even that: you need to make it obvious that it's not "95% of your Option<Thing>s are None, and your Option<Things> are using X bytes", but that "95% of the bytes used for your Option<Thing>s are used for None versions." Otherwise you could just assume that your non-None ones are just that chunky, or you have that many of them... I haven't seen a profiler with that level of insight, unfortunately.

Perhaps because this feels like a fairly rust-specific gotcha. Especially if you're coming from languages where there's often not much syntactical distinction made between "this is a pointer because I don't want to be copying it" and "this is a pointer because it's optional."

For instance, it's not until now that I actually understood what the sibling comment about the Enum type size discrepancy lint meant: "This lint obviously cannot take the distribution of variants in your running program into account. It is possible that the smaller variants make up less than 1% of all instances, in which case the overhead is negligible and the boxing is counter-productive. Always measure the change this lint suggests." I had always accidentally read this backwards, thinking it meant something more to the effect of "if most of the instances are actually small, then it's not a problem here, but be aware that some of them are much larger so some of your calls to things with this could end up passing much larger types."

The closest I am aware of is clippy (`cargo clippy` in a standard Rust project will run it with default configurations).

Clippy is essentially a linter; and one of its checks catches cases where different enum variants have a significantly different size; with a suggestion to Box the larger variant.

Since this is just a linter, it doesn't actually have any knowledge of how frequently each variant is actually used. It also doesn't address the situation in the article at all.

I've personally found heaptrack[1] pretty good for this task, very straightforward to use and the info is detailed enough. Though, it'll only tell you where they are happening (e.j. allocation rate for Box::new), but not exactly what type they are given that info isn't available at runtime. Usually that kind of thing would be reserved to GC-based languages where they keep track of counts for each object.

1: https://github.com/kde/heaptrack

I'm a huge fan of perfetto. It requires some manual steps to get working with Linux, but it's a great tool: https://perfetto.dev/docs/data-sources/native-heap-profiler#...

I think the number of instances should be a clue that you need to look at the layout.

I quite often have this issue with async. You get a state machine that is huge because of how Rust builds it.

This clippy lint does a good job of warning you when this might happen: https://rust-lang.github.io/rust-clippy/master/index.html?se...

Box<str> is still two words (length and pointer). That's better than the 3 words (length, pointer, capacity) for strings, but Box<String> is one word (not including the heap allocation).

Without that, if you try to suggest a transformation like this when the schema is first conceived, it will likely be considered premature optimization.

TLDR: use a nullable pointer instead of fields in nested structs to save memory.

1: https://github.com/kde/heaptrack

I'm a huge fan of perfetto. It requires some manual steps to get working with Linux, but it's a great tool: https://perfetto.dev/docs/data-sources/native-heap-profiler#...

I think the number of instances should be a clue that you need to look at the layout.

This clippy lint does a good job of warning you when this might happen: https://rust-lang.github.io/rust-clippy/master/index.html?se...

There's really an endless list of these optimizations. A few I've used (though not necessarily in rust):

Atoms: Each string can be referenced with a single u32 or even u16, and they're inherently deduplicated.

Bump allocator: your strings are &str, allocation is super fast with limited fragmentation.

Single pointer strings (this has a name, I can't think of it right now): you store the length inside the allocation instead of in each reference, so your strings are a single pointer.

> There's really an endless list of these optimizations.

These aren't really optimizations. They are specialized implementations that introduce design and architectural tradeoffs.

For example, Rust's Atom represents a string that has been interned, and it's actually an implementation of a design pattern popular in the likes of Erlang/Elixir. This is essentially a specialized implementations of the old Flyweight design pattern, where managing N independent instances of an expensive read-only object is replaced with a singleton instance that's referenced through a key handle.

I would hardly call this an optimization. It actually represents a significant change to a system's architecture. You have to introduce a set of significant architectural constraints into your system to leverage a specific tradeoff. This isn't just a tweak that makes everything run magically leaner and faster.

Atoms: is this similar to interned strings?

> String, which is literally (under the hood, you can't poke this legitimately) Vec<u8>

`String::as_vec_mut` kinda implies that, since it gives you access to that underlying `Vec` which must then exist somewhere.

I looked it up: https://doc.rust-lang.org/std/string/struct.String.html#meth...

In case anyone else was wondering it, yes, it's "unsafe".

CompactStr doesnt have any additional runtime overhead iirc right? So in theory you can drop it in everywhere even when you expect > 25 chars. Maybe an extra branch in the >25 char case?

SSO does have overhead. Firstly, on every access you have a branch. Secondly, and more severely, the "most general" umbrella type that all string methods are defined on is a string slice, and whereas conversion from `String` to `&str` is literally a no-op, SSO strings require work to be done to convert them to string slices. Furthermore, note that in the (surprisingly common) case where the string is zero-length, String already skips the allocation, same as an SSO string.

"You have 400 megabytes of zeros in <this type>" is probably a pretty easy heuristic to add.

The closest I am aware of is clippy (`cargo clippy` in a standard Rust project will run it with default configurations).

Clippy is essentially a linter; and one of its checks catches cases where different enum variants have a significantly different size; with a suggestion to Box the larger variant.

Since this is just a linter, it doesn't actually have any knowledge of how frequently each variant is actually used. It also doesn't address the situation in the article at all.

Box<str> is still two words (length and pointer). That's better than the 3 words (length, pointer, capacity) for strings, but Box<String> is one word (not including the heap allocation).

> There's really an endless list of these optimizations.

These aren't really optimizations. They are specialized implementations that introduce design and architectural tradeoffs.

Atoms: is this similar to interned strings?

> Atoms: is this similar to interned strings?

Yes. It is exactly how they are described.

https://docs.rs/string_cache/latest/string_cache/struct.Atom...

> Represents a string that has been interned.

I looked it up: https://doc.rust-lang.org/std/string/struct.String.html#meth...

In case anyone else was wondering it, yes, it's "unsafe".

The thing they were gesturing at, correctly, is the naming. This is of course a convention and not a promise, but by convention Goose::as_crow would be a function that is cheap and gets you say &Crow instead of the &Goose you might have now, whereas Goose::to_donkey suggests that although we can have a Donkey instead of this Goose it's expensive to do that.

Commonly as... conversions are actually no-ops at runtime (the type changes but the data does not, no CPU instructions are emitted) whereas to... conversions might do quite a lot, especially if they bring into existence an actual thing at runtime -- maybe Goose::to_donkey actually needs to go allocate memory for a Donkey and destroy the Goose.

Yes it's unsafe because the Vec doesn't enforce the promise we made about this being UTF-8 text whereas String did, so now that promise is ours to keep and `unsafe` is how we signify that you the programmer took on the responsibility for safety here.

"You have 400 megabytes of zeros in <this type>" is probably a pretty easy heuristic to add.

That may be surprisingly difficult in Rust. We generally think of Option<T> using O to represent None. However, it can actually use any invalid value of T

> Atoms: is this similar to interned strings?

Yes. It is exactly how they are described.

https://docs.rs/string_cache/latest/string_cache/struct.Atom...

> Represents a string that has been interned.

That may be surprisingly difficult in Rust. We generally think of Option<T> using O to represent None. However, it can actually use any invalid value of T

I saved 475 MB out of the 895 MB used by a real-world Rust program by changing the layout of some structs and the way I was deserializing JSON files.

The real use case

My program deserializes all the JSON files of https://github.com/awslabs/aws-sdk-rust/tree/main/aws-models into "Smithy Shape" structs.

Those files contain thousands of structures similar to this one:

"com.amazonaws.iam#EnableOrganizationsRootSessionsResponse": {
    "type": "structure",
    "members": {
        "OrganizationId": {
            "target": "com.amazonaws.iam#OrganizationIdType",
            "traits": {
                "smithy.api#documentation": "<p>The unique identifier (ID) of an organization.</p>"
            }
        },
        "EnabledFeatures": {
            "target": "com.amazonaws.iam#FeaturesListType",
            "traits": {
                "smithy.api#documentation": "<p>The features you have enabled for centralized root access.</p>"
            }
        }
    },
    "traits": {
        "smithy.api#output": {}
    }
},

As is common in Rust, my program uses the very convenient serde.

I won't go into every details, but part of the structure needs to be shown at this point for clarity.

Don't read it entirely, just note that it's a bunch of structs containing structs, some optional, with serde attributes:

#[derive(Clone, Deserialize, Serialize)]
pub struct SmithyShape {
    #[serde(rename = "type")]
    pub shape_type: SmithyShapeType,
    #[serde(default, skip_serializing_if = "Vec::is_empty")]
    pub operations: Vec<SmithyReference>,
    #[serde(default)]
    pub members: FxHashMap<String, SmithyReference>,
    #[serde(default, skip_serializing_if = "Option::is_none")]
    pub key: Option<SmithyReference>,
    #[serde(default, skip_serializing_if = "Option::is_none")]
    pub value: Option<SmithyReference>,
    #[serde(default, skip_serializing_if = "Option::is_none")]
    pub member: Option<SmithyReference>,
    #[serde(default, skip_serializing_if = "Option::is_none")]
    pub input: Option<SmithyReference>,
    #[serde(default, skip_serializing_if = "Option::is_none")]
    pub output: Option<SmithyReference>,
    #[serde(default)]
    pub traits: SmithyTraits,
}

#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct SmithyReference {
    pub target: ShortShapeId,
    #[serde(default)]
    pub traits: SmithyTraits,
}

#[derive(Debug, Clone, Default, Deserialize, Serialize)]
pub struct SmithyTraits {
    #[serde(rename = "smithy.api#title", skip_serializing_if = "Option::is_none")]
    pub title: Option<String>,
    #[serde(rename = "aws.api#service", skip_serializing_if = "Option::is_none")]
    pub service: Option<SmithyServiceTrait>,
    #[serde(
        rename = "smithy.api#sensitive",
        skip_serializing_if = "Option::is_none"
    )]
    pub sensitive: Option<SmithySensitiveTrait>,
    #[serde(
        rename = "smithy.api#documentation",
        skip_serializing_if = "Option::is_none"
    )]
    pub documentation: Option<String>,
    #[serde(rename = "smithy.api#pattern", skip_serializing_if = "Option::is_none")]
    pub pattern: Option<String>,
    #[serde(rename = "aws.iam#iamAction", skip_serializing_if = "Option::is_none")]
    pub iam_action: Option<SmithyIamAction>,
}

#[derive(Debug, Clone, Deserialize, Serialize)]
#[serde(rename_all = "camelCase")]
pub struct SmithyServiceTrait {
    pub sdk_id: Option<String>,
    pub arn_namespace: Option<String>,
    pub cloud_formation_name: Option<String>,
    pub cloud_trail_event_source: Option<String>,
    pub endpoint_prefix: Option<String>,
}

This is some standard looking code, the current practice, but we can also call it naïve. By deserializing this way, the structures were taking 895MB in memory.

An analysis shows that most optional strings are missing, and that's what I leveraged to drastically reduce the memory footprint. But this requires to have in mind some Rust specificities, so a detour is needed:

About rust structs and memory

On a 64-bits platform, a word is made of 8 bytes. That's for example the memory needed to store a usize.

A String needs 3 words (address of the string, allocated size, and capacity) to which you need to add the allocated space for the string bytes. That's 24 bytes for a String (you can check it with dbg!(std::mem::size_of::<String>());), excluding the actual string content on the heap.

There's a niche compiler optimization which makes an Option<String> the same size (basically an option of a pointer type doesn't need an added byte to know whether it's None because it's None when the pointer is zero).

So the following structure, when all strings are missing (None), takes exactly 120 bytes (5*24) in memory:

pub struct SmithyServiceTrait {
    pub sdk_id: Option<String>,
    pub arn_namespace: Option<String>,
    pub cloud_formation_name: Option<String>,
    pub cloud_trail_event_source: Option<String>,
    pub endpoint_prefix: Option<String>,
}

Now to struct composition.

Have a look at a struct "containing" another struct. To simplify, let's imagine it contains our SmithyServiceTrait and another field:

pub struct Container1 {
    pub some_string: Option<String>,
    #[serde(default)]
    pub trait: SmithyServiceTrait,
}

The minimal size is, quite expectedly, 24+120 = 144 bytes.

But our SmithyShape only contains optional structs. What happens if we change our Container struct to use an Option<SmithyServiceTrait> ?

pub struct Container2 {
    pub some_string: Option<String>,
    #[serde(default)]
    pub trait: Option<SmithyServiceTrait>,
}

What's the size of a container when both some_string and trait are None ?

It's the same as the one of Container1, there's no memory gain in having an option (in fact, we're even lucky that our SmithyServiceTrait which contains only Option<String> can allow the compiler to elide the additional byte).

Applying this to our SmithyTraits, we see why a standard implementation balloons in memory.

This differs fundamentally from class composition in languages like Java, Python, JavaScript, etc.

In such language, when you have:

class Container {
    String someString;
    SmithyServiceTrait trait,
}

Then a null trait takes only one pointer-sized word in memory.

To allow our Rust Container to take only one word for the optional content when there's nothing to store, we need basically to do as is done in the languages we want to mimic: we need to put this content on the heap, outside of the container:

pub struct Container3 {
    pub some_string: Option<String>,
    pub trait: Option<Box<SmithyServiceTrait>>,
}

Now, when both some_string and trait are None, a container takes only 32 bytes in memory (3 words for the Option<String>, one for the Option<Box<...>>).

The niche optimization I mentioned before applies to Option<Box<...>> too: it doesn't consume more than a simple Box<...>.

The changes that recovered the memory

Basically, the change consists in

Detecting when structs are useless (i.e. when all their fields are None)
Making them optional in their parent struct, and moving them to the heap
Implementing a custom Deserializer to not store empty useless structs

#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct SmithyReference {
    pub target: ShortShapeId,
    #[serde(default)]
    pub traits: SmithyTraits,
}

becomes

#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct SmithyReference {
    pub target: ShortShapeId,
    #[serde(
        default,
        deserialize_with = "deserialize_boxed_traits",
        serialize_with = "serialize_boxed_traits"
    )]
    pub traits: Option<Box<SmithyTraits>>,
}

fn deserialize_boxed_traits<'de, D: Deserializer<'de>>(
    deserializer: D
) -> Result<Option<Box<SmithyTraits>>, D::Error> {
    let traits = SmithyTraits::deserialize(deserializer)?;
    if traits.is_empty() { // i.e. when all optional strings are none
        Ok(None)
    } else {
        Ok(Some(Box::new(traits)))
    }
}

Similarly, SmithyShape was changed to replace all Option<SmithyReference> by Option<Box<SmithyReference>>, some accessors were modified due to options in the way, and that's it, that's how the memory needed to store all deserialized AWS shapes was reduced twofold, sparing 475 MB.

A few notes:

this deserialization costs more in CPU as the object is deserialized before being discarded. It turns out that the trade-off is a full-win as not having to hunt for memory made the complete task faster even with this added step.
a lot of boxes means a fragmented heap. In such case it's not a problem but this might be worth keeping in mind.

Verification: Proving the Impact

With experience, you get an intuition of where to save space, and roughly how much. But to work seriously, you need to check that what you did worked, and verify it was worthwhile. So you need to measure.

There's no simple and light way in Rust to know the total space taken by a composite object following all pointers.

Here, my solution was to use an allocator which gives information about its state (I used jemalloc because the standard allocator provides limited visibility into internal statistics), and compare the memory used before deserialization to the memory used after.

As I don't always want to use this allocator, I defined a "profile" feature in my Cargo.toml:

[features]
profile = ["tikv-jemallocator", "tikv-jemalloc-ctl"]

[dependencies]
tikv-jemallocator = { optional = true, version = "0.6", features = ["stats", "profiling"] }
tikv-jemalloc-ctl = { optional = true, version="0.6", features = ["stats"] }

And I declare the use of this allocator in my main.rs:

#[cfg(feature = "profile")]
#[global_allocator]
static ALLOC: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

Then, in my function deserializing all those shapes, I do the measures:

#[cfg(feature = "profile")]
fn allocated_mb() -> usize {
    tikv_jemalloc_ctl::epoch::advance().unwrap();
    tikv_jemalloc_ctl::stats::allocated::read().unwrap_or(0) / (1024 * 1024)
}

#[cfg(feature = "profile")]
let base = allocated_mb();

... load all the shapes ...

#[cfg(feature = "profile")]
eprintln!(
    "Memory used for the shapes = {} MB (total)",
    allocated_mb() - base
);

Tip: tikv_jemalloc_ctl exposes many more details that may be interesting to follow in a server application

Conclusion: what's to remember, in a few words

Summarized, here's what any Rust developper needs to understand and remember:

Composite structs can consume significant memory
It can pay to make a field: BigStruct optional by detecting when its content doesn't matter
A field: Option<BigStruct> takes at least the space of the BigStruct even when it's None
You can break the chain by boxing with field: Option<Box<BigStruct>> (then a None takes only a word in the parent struct)
Those optimizations are still possible when deserializing with Serde

Hacker Times