Either way, I hope the user was informed of or alerted to what's going on.
At the same time, someone said that 800 GB container images are a problem in and of themselves no matter the circumstances and they got downvoted for saying so - yet I mostly agree.
Most of mine are 50-250 MB at most, and even if you need big ones with software that's gigabytes in size, you will still be happier if you treat them as something largely immutable. I've never had odd issues with them thanks to this. If you really care about data persistence, then you can use volumes/bind mounts, or if you don't, just throw things into tmpfs.
I'm not sure whether treating containers as something long lived with additional commits/layers is a great idea, but if it works for other people, then good for them. Must be a pain to run something so foundational for your clients, though, because you'll be exposed to most of the edge cases imaginable sooner or later.
"The key insight is to treat container images not as opaque black boxes, but as structured, manipulable archives. Deeply understanding the underlying technology, like the OCI image specification, allows for advanced optimization and troubleshooting that goes far beyond standard tooling. This knowledge is essential for preventing issues like Kubernetes disk space exhaustion before they start."
For stuff like security keys you should typically add them as build ~~args~~ secrets, not as content in the image.
Build args are content in the image: https://docs.docker.com/reference/build-checks/secrets-used-...
Do not use build arguments for anything secret. The values are committed into the image layers.
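Roughly, the BuildKit secret flow looks like this (file names and identifiers here are made up for the sketch):

```bash
# BuildKit mounts the secret only for the duration of the RUN step; nothing
# lands in a layer, unlike --build-arg values, which end up in image metadata.
DOCKER_BUILDKIT=1 docker build --secret id=api_key,src=./api_key.txt -t myapp .

# and in the Dockerfile:
#   RUN --mount=type=secret,id=api_key \
#       API_KEY="$(cat /run/secrets/api_key)" ./do-the-build.sh
```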
The thing here is they're using Docker container images as if they were VM disks, and they end up with images with almost 300 layers, like in this case. I think LXC or VMs would be a better fit for this (but I don't know if they've tested that or why they're using Docker).
If you absolutely have to do it that way, be very deliberate about what you actually need. Don't run an SSH daemon, don't run cron, don't run an SMTP daemon, don't run the suite of daemons that run on a typical Linux server. Only run precisely what you need to create the files that you need for a "docker commit".
Each service that you run can potentially generate log files, lock files, temp files, named pipes, unix sockets and other things you don't want in your image.
Taking a snapshot of a working, regular VM and using that as a docker image is one of the worst ways to build one.
Our users need to connect their local VS Code, Cursor, or JetBrains IDEs to the cloud environment. The industry-standard extensions for this only speak the SSH protocol. So, to give our users the tools they love, the container must run an SSHD to act as the host.
We aren't just a CDE like Coder or Codespaces. We're trying to provide a fully integrated, end-to-end application lifecycle in one place.
The idea is that a developer on Sealos can:
1. Spin up their DevBox instantly.
2. Code and test their feature in that environment (using their local IDE).
3. Then, from that same platform, package their application into a production-ready, versioned image.
4. And finally, deploy that image directly to a production Kubernetes environment with one click.
That "release" feature was how we let a developer "snapshot" their entire working environment into a deployable image without ever having to write a Dockerfile.
Do people not know that each layer comes with its own downsides?
Do people just do 272 layers and think that it’s normal?
This seems like people discovering that water is wet and fire is hot.
When I saw the HN title, I thought this was going to be something subtle like deleting package files (e.g. apt) in a separate layer, so you end up with a layer containing the files and then a subsequent layer that hides them.
They say they made an 800GB container image, so your issue is about singular vs plural?
Regardless, I don't really get why anyone would self-report like this. Is the next article going to be about how they don't encrypt passwords, and how when they accidentally dropped the prod DB they could restore accounts from the logs because the logs had the passwords in clear text?
Thankfully LXD is here to serve this need: very lightweight system containers, where your app runs in a complete ecosystem but stays very light on RAM usage.
Having /var/log set as a persistent volume would have worked, but ultimately they were using "docker commit" to amend/update their images, which is definitely the wrong way to do it.
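e.g. something as simple as:

```bash
# Keep /var/log on a named volume so log growth never lands in an image layer
docker run -d -v devbox-logs:/var/log my-devbox-image
```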
How are you going to orchestrate all those daemons without systemd? :P
As you mentioned, a container running systemd and a suite of background services is the typical use case of LXD, not docker. But the difference seems to be cultural -- there's nothing preventing one from using systemd as the entry point of a docker container.
This case study is a real-world story from the Sealos platform engineering team. We believe in transparency, and this is a detailed account of how we diagnosed and resolved a critical production issue, sharing our hands-on experience to help the broader cloud-native community.
We tackled critical container image bloat on our Sealos platform, fixing a severe disk space exhaustion issue by shrinking an 800GB, 272-layer image to just 2.05GB. Our solution involved a custom tool, image-manip, to surgically remove files and squash the image layers. This 390:1 reduction not only resolved all production alerts but also provides a powerful strategy to reduce container image size.
It was 2 PM when the PagerDuty alert blared for the fifth time that week: "Disk Usage > 90% on devbox-node-4." Our Sealos cluster's development environment node was once again evicting pods, grinding developer productivity to a halt. This was a classic symptom of Kubernetes disk space exhaustion, but the root cause was elusive. The node was equipped with a hefty 2TB SSD, yet a simple df -h confirmed only 10% of its space remained.
Our initial reaction was to treat the symptom. We expanded the node's storage to 2.5TB, assuming a transient workload spike. The next day, the alert returned, mocking our efforts. The problem wasn't a spike; it was a cryptic, relentless consumption of storage stemming from what we would later discover was extreme container image bloat. For a platform promising stable and predictable development environments, this failure was an unacceptable breach of trust.
The Sealos devbox feature is a cornerstone of our value proposition: providing developers with isolated, one-click, reproducible cloud-based environments. This persistent disk space exhaustion, a problem often linked to scenarios where a Docker image is too large, wasn't just a technical nuisance; it was a direct threat to that core promise. Unreliable environments lead to frustrated developers, lost productivity, and ultimately, customer churn. The stability of this single feature was directly tied to user trust and our platform's reputation in a competitive market. We weren't just fixing a disk; we were defending our product's integrity.
Our hands-on investigation began by hunting for the source of the bleeding. The first tool we reached for was iotop to identify any processes with abnormal I/O activity. The culprit was immediately apparent: multiple containerd processes were writing to disk at a sustained, alarming rate of over 100MB/s. For a container runtime managing mostly idle development environments, this was a massive red flag.
Terminal output from the iotop command showing containerd processes with disk write speeds over 100MB/s.
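An invocation along these lines gives the same view; the flags shown are illustrative rather than the exact command from the incident:

```bash
# Show only processes that are actively doing I/O, grouped per process rather
# than per thread, with accumulated totals
sudo iotop -o -P -a

# Batch mode is handy for capturing a snapshot for a ticket or postmortem
sudo iotop -o -P -b -n 5 > iotop-snapshot.txt
```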
This pointed to a problem within the containers themselves. We began hunting for the largest offenders within containerd's storage directory, using du to scan the overlayfs snapshots.
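With the stock containerd root directory, the scan looks roughly like this (adjust the path if your root dir differs):

```bash
# Rank the largest files under containerd's overlayfs snapshots
sudo du -ah /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots \
  | sort -rh | head -n 20
```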
The output was not what we expected. Instead of a mix of large user files, a single filename appeared repeatedly, each instance a monstrous 11GB.
The file /var/log/btmp is a standard Linux system file that records failed login attempts. On a healthy system, it measures in kilobytes. An 11GB btmp file is unheard of. We inspected the contents of one of these files using the last command.
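Because btmp is a binary file in the wtmp record format, it has to be read with last (or its btmp-specific sibling, lastb) rather than cat. A sketch of the check, with the snapshot ID left as a placeholder:

```bash
# <snapshot-id> is a placeholder; substitute the snapshot holding the 11GB file
SNAP=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/<snapshot-id>/fs
sudo last -f "$SNAP/var/log/btmp" | head -n 20   # sample of the failed attempts
sudo last -f "$SNAP/var/log/btmp" | wc -l        # total number of records
```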
The terminal was flooded with a scrolling wall of failed SSH login attempts, timestamped at a rate of dozens per second. This was clear evidence of a persistent, months-long brute-force attack. Our system had been dutifully recording every single failed attempt.
The discovery of the brute-force attack was only the first layer of the problem. Why did it cause such catastrophic disk usage? This analysis revealed the root cause of our container image bloat: a perfect storm created by the intersection of container image architecture and a series of security oversights.
Primary Technical Contradiction: Copy-on-Write vs. Log Files
The core of the issue was a disastrous interaction between OverlayFS's Copy-on-Write (CoW) mechanism and the ever-growing btmp file, a textbook example of poor OverlayFS copy-on-write performance when handling large, frequently modified files. The problematic user image had an astonishing 272 layers, each representing a commit operation.
Terminal screenshot showing a single container image composed of 272 layers, indicating image bloat.
Here's how the disaster unfolded:
- Under the constant brute-force attack, /var/log/btmp grows to 11GB.
- The user performs a commit, creating a new image layer on top.
- The next write to the log triggers Copy-on-Write, duplicating the entire 11GB /var/log/btmp into the new layer.
- The cycle repeats with every subsequent commit.

Even if the user deleted the btmp file in the newest layer, the 271 copies of the 11GB file would remain immutably stored in the layers underneath. The disk space was fundamentally unrecoverable through standard container operations.
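To see the mechanism in isolation, a minimal OverlayFS experiment (scratch paths, run as root) shows how a one-line append copies an entire file up into the new layer:

```bash
# Minimal copy-up demonstration with scratch directories (run as root)
mkdir -p /tmp/ovl/{lower,upper,work,merged}
dd if=/dev/zero of=/tmp/ovl/lower/btmp bs=1M count=100     # a 100MB "log" in the read-only lower layer
mount -t overlay overlay \
  -o lowerdir=/tmp/ovl/lower,upperdir=/tmp/ovl/upper,workdir=/tmp/ovl/work \
  /tmp/ovl/merged
echo "failed login" >> /tmp/ovl/merged/btmp                # one tiny append through the merged view...
du -h /tmp/ovl/upper/btmp                                  # ...and the full 100MB file now sits in the upper layer
```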
Compounding Factors: The Swiss Cheese Model
This technical failure was enabled by three distinct oversights in our platform's design:
- Our devbox base images prioritized ease-of-use, leaving SSH password authentication enabled and exposed to the public internet without rate-limiting tools like fail2ban.
- The base images shipped without a logrotate configuration for system logs like btmp. This fatal assumption allowed a single log file to grow without bounds.
- Nothing on the platform monitored or capped how many layers or how much storage a user image could accumulate, so the commit-based release flow was free to grow unchecked.

Standard docker commands were insufficient; we had to perform surgical OCI image manipulation on the image's immutable history. This required building our own specialized tooling and a dedicated, high-performance processing environment to truly reduce the container image size.
Architecture Rework: The image-manip Scalpel
We developed a new CLI tool, image-manip, to treat OCI images as manipulable data structures. For this task, we leveraged two of its core functions:
- image-manip remove /var/log/btmp <image>: This command adds a new top layer containing an OverlayFS "whiteout" file. This special marker instructs the container runtime to ignore the btmp file in all underlying layers, effectively deleting it from the merged view without altering the original layers.
- image-manip squash: This is the key to reclaiming disk space and the core of our strategy to squash the image layers. The tool creates a temporary container, applies all 272 layers in sequence to an empty root filesystem, and then exports the final, merged filesystem as a single new layer. This flattens the image's bloated history into a lean, optimized final state.
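To make these two operations concrete, here is a rough sketch of the same ideas using standard tooling; this is not the image-manip implementation itself, and the image names are placeholders:

```bash
# NOT the actual image-manip implementation - a sketch of the same two ideas
# using standard tooling. Image names are placeholders.

# 1) "remove": in the OCI layer format, a file from lower layers is deleted by
#    shipping a new top layer that contains an empty whiteout entry named
#    ".wh.<filename>" in the same directory.
mkdir -p layer/var/log
touch layer/var/log/.wh.btmp
tar -C layer -cf whiteout-layer.tar var
# Appended as a new layer, this tar hides /var/log/btmp from the merged rootfs.

# 2) "squash": flatten the history by exporting a container's merged filesystem
#    and re-importing it as a single-layer image.
docker create --name squash-tmp bloated-image:latest true
docker export squash-tmp | docker import - squashed-image:latest
docker rm squash-tmp
```

The trade-off with the export/import route is that it drops image metadata such as ENTRYPOINT, ENV, and exposed ports, which is part of why a purpose-built tool that rewrites the OCI manifest directly is worth the effort.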
Tool Innovation: The High-Performance Operating Room

Performing these intensive operations on production nodes was not an option. We built dedicated devbox-image-squash-server nodes in three regions using ecs.c7a.2xlarge instances (8-core CPU, 16GB RAM). To handle the I/O storm, we configured a striped LVM (Logical Volume Management) volume across two 1TB ESSD cloud disks.
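The device names below are placeholders, but the shape of the setup was roughly this: stripe one logical volume across both ESSDs so writes are split between them.

```bash
# Placeholder device names; stripe one logical volume across both 1TB ESSDs
pvcreate /dev/vdb /dev/vdc
vgcreate vg_squash /dev/vdb /dev/vdc
lvcreate -n lv_squash -i 2 -I 64 -l 100%FREE vg_squash   # -i 2: two stripes, -I 64: 64KB stripe size
mkfs.ext4 /dev/vg_squash/lv_squash
mount /dev/vg_squash/lv_squash /var/lib/containerd       # containerd root lives on the striped volume
```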
A fio benchmark confirmed our setup could handle the load, achieving 90.1k random write IOPS.
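A random-write benchmark along these lines is what's meant here; the exact job parameters are illustrative:

```bash
# 4K random-write benchmark against the striped volume (parameters illustrative)
fio --name=randwrite --rw=randwrite --bs=4k --iodepth=64 --numjobs=4 \
    --size=10G --runtime=60 --time_based --ioengine=libaio --direct=1 \
    --directory=/var/lib/containerd --group_reporting
```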
Finally, we fine-tuned the OS to give containerd maximum I/O priority.
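One way to express that tuning is a systemd drop-in for the containerd unit; this is a sketch with illustrative values, not the exact production configuration:

```bash
# Raise containerd's I/O scheduling class via a systemd drop-in (illustrative values)
mkdir -p /etc/systemd/system/containerd.service.d
cat <<'EOF' > /etc/systemd/system/containerd.service.d/io-priority.conf
[Service]
IOSchedulingClass=realtime
IOSchedulingPriority=0
EOF
systemctl daemon-reload
systemctl restart containerd
```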
At 10:00 AM on September 11, we began the procedure on the most critical image: 800GB spread across 272 layers. The remove operation, which adds a "whiteout" layer, was nearly instantaneous.
The squash operation was the main event. After an hour of intense processing, the logs delivered the news we were hoping for.
The new, squashed image was a mere 2.05GB. We had achieved a staggering 390:1 compression ratio.
After pushing the optimized image, we restarted the user's devbox. It started successfully. A quick check confirmed the operation's success: the user's environment was perfectly intact, and the btmp monster was gone.
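For anyone reproducing this kind of cleanup, the quickest sanity checks need nothing more than standard tooling (the image name below is a placeholder):

```bash
# Confirm the squashed image really has a single filesystem layer
docker image inspect -f '{{len .RootFS.Layers}}' squashed-image:latest
# Eyeball the remaining history and the on-disk size
docker history squashed-image:latest
docker image ls squashed-image:latest
```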
The quantitative impact on the platform was dramatic and immediate. The data unequivocally demonstrates the success of our approach to optimize a container image, leading to massive savings in storage, cost, and developer time.
| Metric | Before Fix | After Fix | Improvement |
|---|---|---|---|
| Disk Space Alerts (30 days) | 23 | 0 | 100% Reduction |
| Avg. Node Disk I/O | 120 MB/s | 26 MB/s | 78% Decrease |
| Avg. Container Image Pull Time | 75 seconds | 26 seconds | 65% Faster |
| Max Container Image Size | 800 GB | 2.05 GB | 390x Smaller |
| Estimated Storage Cost | ~$520/cluster/mo | ~$70/cluster/mo | $450/mo Savings |
This incident was a painful but invaluable lesson in the hidden complexities of containerized systems. Our solution was effective, but it was a reactive, manual procedure—a complex surgery to fix a preventable disease. Here are our key takeaways and future preventative measures, framed as a quick FAQ.
Q: What was the primary cause of the extreme container image bloat?
A: The primary cause was the interaction between OverlayFS's Copy-on-Write (CoW) mechanism and a large, frequently updated log file (/var/log/btmp). Each minor update caused the entire 11GB file to be copied into a new image layer, a process that repeated over 270 times, compounding the storage consumption.
Q: Why couldn't you just delete the file with a standard docker commit?
A: Deleting a file in a new layer only adds a "whiteout" marker that hides the file from the final view. The original 271 copies of the 11GB file would remain immutably stored in the underlying layers, continuing to consume disk space. A full layer squash was necessary to create a new, clean filesystem and truly reclaim the space.
Q: What is the key lesson for other platform engineers from this experience?
A: The key insight is to treat container images not as opaque black boxes, but as structured, manipulable archives. Deeply understanding the underlying technology, like the OCI image specification, allows for advanced optimization and troubleshooting that goes far beyond standard tooling. This knowledge is essential for preventing issues like Kubernetes disk space exhaustion before they start.
Our immediate next step is to move from firefighting to fire prevention. We have already implemented automated monitoring that triggers an alert if any user image exceeds 50 layers or 10GB in size. Furthermore, all new devbox base images now ship with password authentication disabled by default and a properly configured logrotate service.
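For reference, the kind of defaults now baked into new base images looks roughly like this; the snippets are illustrative, not the exact production configs:

```bash
# Illustrative hardening for a devbox base image, not the exact production config

# 1) Disable SSH password authentication (key-based auth only)
sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config

# 2) Rotate btmp so failed-login records can never grow without bounds
cat <<'EOF' > /etc/logrotate.d/btmp
/var/log/btmp {
    missingok
    monthly
    create 0660 root utmp
    rotate 1
    maxsize 50M
}
EOF
```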