I agree with the blog post's technical contents, but I feel we came across too strongly in the title. For Ubicloud as a managed Postgres provider, we use strict memory overcommit. Our experience with operating Postgres at scale taught us that it's better to enable this than to go with the defaults.
However, I can see many other scenarios where using strict memory overcommit would have unanticipated side effects. That's why Linux doesn't use strict memory overcommit as its default.
For now, we have overcommit_ratio set to a value that has proven stable in our experience, but there really seems to be no silver bullet. Go is very happy to allocate a lot of virtual memory, but so are most managed languages. The best solution would probably be to host the backend and the database on separate servers.
I've gone through this exercise in the past on much older kernels, which they cover as well. Personally, I ran into fewer issues by leaving overcommit at 0, dropping the overcommit ratio to 0, and setting oom_score_adj for programs as low as -1000 if I wanted the OOM killer to leave them alone, and of course using the Red Hat formulas for setting vm.min_free_kbytes, vm.admin_reserve_kbytes, and vm.user_reserve_kbytes. And, of course, be vigilant in disallowing app owners from using every last bit of memory.
Took k8s ages to get Swap support.
We lost something when we accepted that hyperscalers just tell you to use more memory. It was shitty 5 years ago, and it still is today, especially after the RAM price increases.
Unfortunately, many programs commit twice as much memory as they actually use. Often I see ~32GB committed and ~16GB resident.
GOMEMLIMIT works very well if you set it to around 90% of available memory as a rough heuristic. You should definitely profile your application to fine-tune this number (e.g. if you link with C libraries that hold large memory pools, Go doesn't account for that), but also to identify sources of spiky or leaky allocations. For example, encoding/json is notorious for its inner sync.Pool hanging on to outsized buffers. There's usually a lot of low-hanging fruit.
In my experience Go can be extremely stable in terms of memory footprint at both small (~O(1MiB)) and large (~O(256GiB)) scales, and it takes only a small amount of effort.
As far as GC languages go, it is by far the easiest to work with.
[1] - https://man7.org/linux/man-pages/man5/proc_pid_oom_score_adj...
And now, with PSI + MGLRU, the situation is much better, but there are still missing features/subsystems which would be nice to have. For example, there's no simple way to lock memory mlockall-style to ensure that a rarely used daemon does not face long cold-access latency the first time it is used after a long idle period.
Whether failed transactions are actually so much more desirable than an OOM-killed process isn't quite obvious, but it might be easier to troubleshoot.
I don't think it has an option for that.
I run Firefox, VSCodium with LSP, Discord, Signal and there's still space left for a game like CS2. I'm not a heavy user by any means.
> I'm not sure they would do much better than crash
I have yet to see a program that silently handles allocation failures and doesn't crash. These days everything is coded to crash if no memory :(
> About once a year a real runaway process (usually a throwaway program I'm working on) gets OOM-killed
In my case it killed system-critical processes with no way to recover. With overcommit disabled, it freezes for a while (usually for a minute or two), I close some random program of my choosing, and then see in Resource Monitor what's eating my RAM.
A memory allocator can implement overcommit, because you can separate reserving virtual memory and having it backed by physical memory into two different system calls. But from the point of view of the kernel, any time it promises to give you physical memory, that memory is backed either by RAM or by space reserved in the swap file.
The Linux Kernel OOM killer kills random things. Userspace OOM killers are meant to improve this, and they work well in a server situation when you already know in advance what is likely to go haywire and what is safe to kill. But they don't work well on desktop (some of them are improving but it doesn't seem to be a priority).
The Windows OOM killer by comparison usually kills something sensible (i.e. the program that is actually using all the memory), and asks the user for permission before killing it (when possible). You do see a lot of memes of situations where it fails.
If no memory is available where a page file would make a difference, this leads to application crashes instead. A crash is (usually) worse than paging.
Certain applications, Photoshop being the historical example, will outright fail to run with no page file present.
The purpose of the system commit limit and commit charge is to track all uses of these resources to ensure they are never overcommitted - that is, that there is never more virtual address space defined than there is space to store its contents, either in RAM or in backing store (on disk).
- Windows Internals, 7th Edition
Same happens if the page file is full. In that case, why don't those programs use disk directly instead?
No such problem would've ever occurred if programs hadn't allocated more than they actually use.
Typically, performance drops enough that the user kills the program or reboots before the page file expands to fill the disk. And other threads here suggest there is something that will prompt users to kill programs in states like this.
> No such problem would've ever occurred if programs hadn't allocated more than they actually use.
That's part of the issue, but sometimes things do in fact use too much memory as well as allocate too much.
Another part of the issue is that few programs are built to handle allocation failures.
And then you have a metrics issue. There's not really a good metric to know when you're out of memory, other than performance collapse. If your applications don't use disk, it's not too hard; but when they do use disk, performance will collapse once there's insufficient memory to provide the disk caching needed. In my experience, adding a small swap and monitoring swap i/o can be pretty helpful, and a small swap doesn't tend to allow long thrashing when memory use grows. But that's not universal and everybody loves to hate swap these days.
An application that grows in such a way (besides having backing stores for memory-mapped files, as well) will often perform so poorly that it requires addressing (adding RAM, looking for application faults, etc).
A page file is insurance, one that can last you much longer than available system memory.
April 27, 2026 · 10 min read

Burak Yucesoy
Principal Software Engineer
Our team members built and operated five managed PostgreSQL services over the past 15 years. Across all of them, one configuration has remained constant: strict memory overcommit.
In this blog post, we will explain how strict memory overcommit protects your database from catastrophic OOM (out of memory) kills. We will also share how a three-character kernel bug forced us to temporarily disable this setting. Finally, we will explain our heuristic for determining the right memory overcommit limit. Hopefully, this will help you find the right setting for your workloads.
Linux allows processes to allocate more virtual memory than what is physically available. When a process allocates memory, for example with malloc(), the kernel reserves virtual address space for it. However, the kernel does not immediately back that space with physical memory. Physical pages are only consumed when the process actually touches the memory.
The kernel relies on the assumption that not all allocated memory will be actively used at the same time. Usually, this assumption holds. When it doesn't, the kernel invokes the OOM killer to free memory by terminating a process.
For most processes, handling an OOM kill is simple: the process restarts, reconnects, and picks up where it left off. PostgreSQL is different.
PostgreSQL's postmaster (its main supervisor process) forks a backend process for each connection. These backends share memory segments that hold shared buffers, WAL buffers, lock tables, and other shared state. The OOM killer doesn't understand this architecture. It simply picks a process based on a heuristic (usually the process that uses the most memory) and terminates it. If that backend was modifying a shared memory segment, the segment may be left in an inconsistent state. Shared memory has no transactional guarantees at the OS level. A half-written page in shared buffers means silent data corruption.
PostgreSQL's postmaster knows this. When it detects that any of its child processes has been killed, it assumes the worst: shared memory may be corrupted. When shared memory is corrupted, there is a risk of corrupting the stored data as well. To prevent this, the postmaster terminates all remaining backends. Every active connection is dropped. Every in-flight transaction is aborted. On its next start, the database goes through crash recovery.
This is the correct behavior. PostgreSQL is protecting your data. But it means a single OOM kill doesn't just affect one connection. It takes down every connection on the server. On top of that, if the write volume was high, replaying all WAL files for crash recovery can take a long time. This means a single out of memory case can cause long outages.
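It is possible to configure how the kernel behaves when processes ask for memory. Linux provides three overcommit policies via vm.overcommit_memory: heuristic overcommit (0, the default), which refuses only obviously excessive allocations; always overcommit (1), which never refuses an allocation; and strict overcommit (2), which refuses any allocation that would push total committed memory past a limit called CommitLimit.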
Under strict overcommit, the kernel has two knobs to set CommitLimit: overcommit_kbytes and overcommit_ratio. The CommitLimit is calculated as:
CommitLimit = overcommit_kbytes + swap
Or, if overcommit_kbytes is not set:
CommitLimit = overcommit_ratio / 100 * available_memory + swap
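As a rough sketch of the two formulas above, the helper below computes CommitLimit in kB. The function name and its parameters are illustrative and not part of any kernel or Ubicloud API. One assumption worth flagging: the kernel's own documentation for /proc/meminfo subtracts huge pages from RAM in the ratio formula, so the sketch exposes that as an optional `hugetlb_kb` parameter.

```ruby
# Sketch of the CommitLimit calculation (all sizes in kB).
# Names are hypothetical; hugetlb_kb reflects the kernel documentation's
# note that huge pages are excluded from RAM in ratio mode.
def commit_limit_kb(memory_kb:, swap_kb: 0, overcommit_kbytes: 0,
                    overcommit_ratio: 50, hugetlb_kb: 0)
  base =
    if overcommit_kbytes.positive?
      # An explicit kB budget takes precedence over the ratio.
      overcommit_kbytes
    else
      (memory_kb - hugetlb_kb) * overcommit_ratio / 100
    end
  base + swap_kb
end

eight_gb_kb = 8 * 1024 * 1024
puts commit_limit_kb(memory_kb: eight_gb_kb)
puts commit_limit_kb(memory_kb: eight_gb_kb, overcommit_kbytes: 6_000_000, swap_kb: 1_048_576)
```

With the default ratio of 50 and no swap, an 8 GB machine ends up with a commit budget of only half its RAM, which is why the ratio (or the kB value) usually needs tuning before strict mode is enabled.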
Under strict overcommit, an allocation that would push committed memory past CommitLimit fails with the ENOMEM error code. PostgreSQL handles this gracefully. A backend that cannot allocate memory reports an error to the client, cancels the transaction, and continues. The postmaster stays up. Other connections remain unaffected. This is a routine error, not a catastrophe. The trade-off is that strict overcommit converts late, destructive failures into early, graceful ones.
This trade-off works best when the machine is dedicated to PostgreSQL and a small set of known sidecar processes. In that scenario, the committed memory profile is predictable and the limit can be tuned with confidence. On shared machines running diverse workloads, committed memory becomes harder to predict. An unrelated process can use up the commit budget. This can make PostgreSQL get an ENOMEM error, even if the database load is fine.
We always favored strict overcommit for PostgreSQL. We used it in previous managed PostgreSQL services we built and also in Ubicloud PostgreSQL. However, after enabling it this time, we quickly ran into trouble. A few weeks after we turned on strict memory overcommit, we started to get failures on some of the databases. They showed out of memory errors, even though there was plenty of free physical memory on the machines. We disabled strict memory overcommit and started investigating.
The first clue came from a routine check of /proc/meminfo on one of our servers with 8 GB memory:
$> cat /proc/meminfo | grep "Committed_AS"
Committed_AS: 683547672 kB
651 GB of committed memory on an 8 GB machine! For comparison, a healthy server of the same size showed:
$> cat /proc/meminfo | grep "Committed_AS"
Committed_AS: 2703940 kB
The counter was off by orders of magnitude.
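To make the scale of the mismatch concrete, here is a small sketch that extracts a field from /proc/meminfo-style text and compares Committed_AS with total memory. The helper name is hypothetical, and the MemTotal value in the sample is an assumption (exactly 8 GiB); only the Committed_AS value comes from the output above.

```ruby
KB_PER_GB = 1024 * 1024

# Extract one kB-valued field from /proc/meminfo-style text (nil if absent).
def meminfo_kb(text, field)
  match = text.match(/^#{Regexp.escape(field)}:\s+(\d+)\s+kB/)
  match && match[1].to_i
end

# MemTotal is an assumed value for an 8 GB machine.
sample = <<~MEMINFO
  MemTotal:        8388608 kB
  Committed_AS:  683547672 kB
MEMINFO

committed = meminfo_kb(sample, "Committed_AS")
total     = meminfo_kb(sample, "MemTotal")
puts format("Committed_AS: %.1f GB (%.1fx MemTotal)",
            committed.to_f / KB_PER_GB, committed.to_f / total)
```

On a live system, the same helper could be fed `File.read("/proc/meminfo")` instead of the sample text.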
We first looked at ps output.
$> ps -C postgres -o pid,vsz,rss,cmd --sort=-vsz
PID VSZ RSS CMD
96622 2242244 95416 postgres: 18/main: postgres postgres...
95721 2241668 94708 postgres: 18/main: postgres postgres...
96414 2241436 94892 postgres: 18/main: postgres postgres...
96619 2241076 93308 postgres: 18/main: postgres postgres...
96417 2240900 94300 postgres: 18/main: postgres postgres...
95728 2240736 93864 postgres: 18/main: postgres postgres...
96620 2240736 92852 postgres: 18/main: postgres postgres...
95727 2240428 93640 postgres: 18/main: postgres postgres...
96623 2239840 93164 postgres: 18/main: postgres postgres...
VSZ is the total virtual address space a process has mapped and RSS is the physical memory it's actually using. In the output above, each backend shows 2 GB of VSZ covering its entire mapped address space, but a much smaller RSS (95 MB) reflecting the memory it is actively using. On this 8 GB VM we configure 2 GB of shared_buffers, and if you think ~2 GB VSZ is suspiciously close to the shared_buffers size, you are right. Most of each backend's VSZ is actually the shared memory segment that holds shared_buffers. Every backend maps the same 2 GB region into its own address space, so it shows up in each backend's VSZ. With many backends, the VSZ numbers add up quickly.
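The effect of adding up per-process numbers can be sketched with a few lines of Ruby. This is only an illustration: the helper name is hypothetical, and the sample uses the first three rows of the ps output above.

```ruby
# Sum the VSZ and RSS columns (kB) of `ps -o pid,vsz,rss,cmd` output.
def sum_vsz_rss(ps_output)
  ps_output.lines.drop(1).each_with_object({ vsz_kb: 0, rss_kb: 0 }) do |line, totals|
    _pid, vsz, rss = line.split
    next if vsz.nil? || rss.nil? # skip blank or malformed lines
    totals[:vsz_kb] += vsz.to_i
    totals[:rss_kb] += rss.to_i
  end
end

sample = <<~PS
  PID VSZ RSS CMD
  96622 2242244 95416 postgres: 18/main: postgres postgres...
  95721 2241668 94708 postgres: 18/main: postgres postgres...
  96414 2241436 94892 postgres: 18/main: postgres postgres...
PS

totals = sum_vsz_rss(sample)
puts format("VSZ total: %.2f GB, RSS total: %.2f GB",
            totals[:vsz_kb] / 1048576.0, totals[:rss_kb] / 1048576.0)
```

Three backends already add up to more than 6 GB of VSZ on an 8 GB machine, because the same shared segment is counted once per process. That is why a naive VSZ sum says little about real memory pressure.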
That said, none of this should inflate Committed_AS. The shared memory segment appears in every backend's address space but physically exists only once, so it should be counted only once. On top of that, we run PostgreSQL with huge_pages = on, so shared_buffers is allocated from hugetlb. Hugetlb mappings have their own separate reservation accounting and are not supposed to count toward Committed_AS at all. Still, the 2 GB hugetlb region was by far the largest mapping in each backend, and hugetlb accounting is a special case in the kernel. That made it the most natural place to start looking, so our first hypothesis was that the kernel was somehow miscounting these mappings. For example, charging them once per process instead of ignoring them.
To verify, we checked the VMA (Virtual Memory Area) flags on the hugetlb mapping via /proc/<pid>/smaps. Each VMA has a set of flags, and the ac flag (VM_ACCOUNT) indicates that the region counts toward committed memory:
$> sudo cat /proc/321784/smaps | grep -A 25 "hugepage"
7fce75000000-7fcef0c00000 rw-s 00000000 00:10 10723551 /anon_hugepage (deleted)
Size: 2027520 kB
Shared_Hugetlb: 393216 kB
Private_Hugetlb: 0 kB
...
...
VmFlags: rd wr sh mr mw me ms de ht sd
No ac flag. Huge pages were correctly excluded from committed memory accounting. The hypothesis was ruled out.
We then summed accountable memory (VMAs with the ac flag) across all processes on the machine:
$> sudo awk '/^Size/{size=$2} /VmFlags:/ && / ac/{sum+=size} END{printf "%.2f GB\n", sum/1048576}' /proc/[0-9]*/smaps
2.43 GB
2.43 GB accountable vs 651 GB reported; 648 GB of phantom committed memory. The vm_committed_as counter was leaking. We suspected that the memory was being charged on allocation but was never recredited. This made us consider a potential kernel bug in committed memory calculation.
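For readers who find the awk one-liner terse, the same summation can be sketched in Ruby. The function name is hypothetical, and the second VMA in the sample (a small heap mapping with the ac flag) is made up for illustration; the first is the hugetlb mapping shown earlier.

```ruby
# Sum the Size of every VMA whose VmFlags include "ac" (VM_ACCOUNT), in kB.
def accountable_kb(smaps_text)
  size_kb = 0
  total_kb = 0
  smaps_text.each_line do |line|
    case line
    when /^Size:\s+(\d+)\s+kB/
      size_kb = Regexp.last_match(1).to_i # remember the current VMA's size
    when /^VmFlags:\s*(.*)$/
      flags = Regexp.last_match(1).split
      total_kb += size_kb if flags.include?("ac")
    end
  end
  total_kb
end

sample = <<~SMAPS
  7fce75000000-7fcef0c00000 rw-s 00000000 00:10 10723551 /anon_hugepage (deleted)
  Size:            2027520 kB
  VmFlags: rd wr sh mr mw me ms de ht sd
  55d0c0a00000-55d0c0a21000 rw-p 00000000 00:00 0 [heap]
  Size:                132 kB
  VmFlags: rd wr mr mw me ac sd
SMAPS

puts accountable_kb(sample) # only the heap mapping is counted
```

To cover a whole machine, one could call this for each readable file matched by `Dir.glob("/proc/[0-9]*/smaps")` and add up the results, which requires root just like the awk version.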
At that time, we had two different kernels being used on our fleet. We checked our entire fleet of PostgreSQL servers and compared the ratio of Committed_AS to MemTotal against kernel version and uptime:
| Metric | Kernel 6.5.0 | Kernel 6.8.0 |
|---|---|---|
| Median Ratio | 0.55 | 0.27 |
| Mean Ratio | 24.97 | 0.32 |
| Max Ratio | 3,405 | 1.86 |
| Servers with a ratio > 1.0 | 23% | < 1% |
We also ran a statistical analysis and found that a server running the 6.5 kernel was 52x more likely to have inflated committed memory.
On 6.5 servers, uptime was positively correlated with inflation. The inflation compounded at roughly 4.7% per week of uptime. On 6.8 servers, no correlation existed.
This analysis significantly strengthened our hypothesis that this was a kernel bug.
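A comparison like the one in the table can be sketched as follows. This is not the analysis code used for the fleet; the function name is illustrative and the ratios are hypothetical values chosen only to show the shape of the computation. It assumes a non-empty list of ratios.

```ruby
# Summarize per-server Committed_AS / MemTotal ratios (non-empty array).
def ratio_summary(ratios)
  sorted = ratios.sort
  mid = sorted.size / 2
  median = sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
  {
    median: median,
    mean: sorted.sum / sorted.size.to_f,
    max: sorted.last,
    share_above_one: sorted.count { |r| r > 1.0 } / sorted.size.to_f
  }
end

# Hypothetical ratios grouped by kernel version.
fleet = {
  "6.5.0" => [0.4, 0.5, 0.6, 35.0],
  "6.8.0" => [0.2, 0.25, 0.3, 0.5]
}
fleet.each { |kernel, ratios| puts "#{kernel}: #{ratio_summary(ratios)}" }
```

The gap between median and mean is the useful signal here: a few badly inflated servers pull the mean far above the median, which matches the pattern in the 6.5 column.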
To get definitive proof, we tasked an LLM with reviewing every commit between 6.5.0 and 6.8.0 for possible bug fixes in committed memory accounting. It quickly found the following.
The bug was introduced in Linux 6.5 by commit 408579c. This commit changed the return convention of do_vmi_align_munmap() so that success is always reported as 0, while a negative value still signals an error.
The commit updated callers throughout the mm subsystem. However, in mm/mremap.c, inside move_vma(), the error check was converted incorrectly:
BEFORE (correct): error handler runs on negative return (on error)
if (do_vmi_munmap(&vmi, mm, old_addr, old_len, uf_unmap, false) < 0) {
	/* OOM: unable to split vma, just get accounts right */
	if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP))
		vm_acct_memory(old_len >> PAGE_SHIFT);
}
AFTER (broken): error handler runs when return is 0 (on success)
if (!do_vmi_munmap(&vmi, mm, old_addr, old_len, uf_unmap, false)) {
	/* OOM: unable to split vma, just get accounts right */
	if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP))
		vm_acct_memory(old_len >> PAGE_SHIFT);
}
The change from < 0 to ! inverted the condition. To understand why this matters, consider what move_vma() does. It first decrements Committed_AS for the old region as part of the move, then calls do_vmi_munmap() to actually unmap it. If the unmap fails, the kernel needs to increment the counter back to keep accounting correct. After all, unmap has failed and the old region still exists. Its charge must be restored. With the inverted condition, this re-increment runs on every successful mremap instead of only on failure. The counter grew monotonically with every memory remap operation.
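The drift can be reproduced with a toy model. The sketch below is not kernel code: it only mimics the accounting steps described above, all names are hypothetical, and the charge for the destination mapping is a simplifying assumption of the model.

```ruby
ENOMEM = 12

# Stand-in for do_vmi_munmap(): 0 on success, negative errno on failure.
def fake_munmap(should_fail)
  should_fail ? -ENOMEM : 0
end

# Returns the committed page count after moving one accounted region.
def move_region(committed, old_pages, new_pages, buggy:, unmap_fails: false)
  committed += new_pages # charge the destination mapping (model assumption)
  committed -= old_pages # drop the old region's charge as part of the move
  ret = fake_munmap(unmap_fails)
  # Fixed kernel: restore the charge only on error. Buggy kernel: on success.
  restore_charge = buggy ? ret.zero? : ret.negative?
  committed += old_pages if restore_charge
  committed
end

# Apply the same-size remap repeatedly and return the final counter value.
def simulate(remaps:, pages:, buggy:, start_pages: 10_000)
  (1..remaps).reduce(start_pages) do |committed, _|
    move_region(committed, pages, pages, buggy: buggy)
  end
end

puts simulate(remaps: 1_000, pages: 256, buggy: false) # counter stays flat
puts simulate(remaps: 1_000, pages: 256, buggy: true)  # counter grows with every remap
```

In the model, the fixed condition leaves the counter where it started, while the inverted condition adds the old region's size on every successful remap. No memory is actually leaked; only the statistic drifts.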
The bug was reported and then bisected. Linus Torvalds analyzed the root cause and fixed it with a one-line change, restoring the condition to < 0:
- if (!do_vmi_munmap(&vmi, mm, old_addr, old_len, uf_unmap, false)) {
+ if (do_vmi_munmap(&vmi, mm, old_addr, old_len, uf_unmap, false) < 0) {
As Linus Torvalds wrote in the fix:
This didn't change any actual VM behavior _except_ for memory accounting when 'VM_ACCOUNT' was set on the vma. Which made the wrong return value test fairly subtle, since everything continues to work.
Or rather - it continues to work but the "Committed memory" accounting goes all wonky (Committed_AS value in /proc/meminfo), and depending on settings that then causes problems much much later as the VM relies on bogus statistics for its heuristics.
This is the kind of bug that hides in plain sight. Under heuristic overcommit (the default), Committed_AS is purely informational. The kernel doesn't use it to gate allocations. The bug only causes failures under non-default strict overcommit mode, so it went unnoticed. The failure is also indirect. The accounting drifts silently for weeks before Committed_AS finally crosses CommitLimit and allocations start failing.
With the kernel bug behind us, we can gradually go back to enabling strict memory overcommit. This is a good point to explain our heuristic in deciding the commit limit in case you want to enable it for your workloads as well.
We use the formula:
overcommit_kbytes = total_memory_kb × 0.8 + 2 × 1048576
In plain terms: 80% of total physical memory plus 2 GB.
The 20% holdback covers memory used by kernel data structures not seen in userspace. This includes items like page tables, slab caches, network buffers, and the kernel's own allocations.
It is important to note that this 20% is not wasted. The kernel still uses it for page cache, meaning it uses otherwise free physical memory to cache file I/O. Page cache is the biggest consumer of that memory and directly benefits PostgreSQL read performance. It doesn't count toward Committed_AS because it's reclaimable: the kernel can evict cached pages anytime a process actually needs the memory.
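As a quick worked example of the formula, the sketch below evaluates it for a few machine sizes. The function name is illustrative, and it applies the plain "80% plus 2 GB" rule to total memory without any other adjustments.

```ruby
KB_PER_GB = 1024 * 1024

# overcommit_kbytes = 80% of total memory + a fixed 2 GB, all in kB.
def overcommit_kbytes_for(total_memory_kb)
  total_memory_kb * 8 / 10 + 2 * KB_PER_GB
end

[4, 8, 64].each do |gb|
  limit = overcommit_kbytes_for(gb * KB_PER_GB)
  puts format("%2d GB server -> overcommit_kbytes = %d (%.1f GB)",
              gb, limit, limit.to_f / KB_PER_GB)
end
```

Integer arithmetic is used on purpose so the result can be written to a sysctl file without a rounding step.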
Every PostgreSQL server in our fleet runs several sidecar processes. Some examples are prometheus, node_exporter, postgres_exporter and wal-g. These are Go programs, and Go's runtime reserves large virtual memory regions upfront via mmap but only faults in pages as needed. Their committed memory contribution is far larger than their actual resident memory.
We surveyed the committed memory of these sidecar processes across our fleet:
| Sidecar Committed Memory | Percentage of Servers |
|---|---|
| 0.0 – 0.5 GB | ~64% |
| 0.5 – 1.0 GB | ~32% |
| 1.0 – 1.5 GB | ~1% |
| 1.5 – 2.0 GB | ~1% |
| 2.0 – 2.5 GB | ~1% |
| 2.5 – 3.0 GB | ~1% |
| 3.0 – 3.5 GB | ~1% |
96% of servers fall under 1 GB. We found a weak positive correlation between vCPU count and sidecar committed memory (r = 0.22). This is likely driven by Go's runtime scaling with available CPUs but it is not strong enough to justify proportional scaling.
The fixed 2 GB covers >99% of servers. It is deliberately generous. If this offset is too small, sidecars can silently consume the remaining commit budget, and PostgreSQL, not the sidecar, is the one that hits ENOMEM.
If you are curious about how we implemented this, it is actually pretty straightforward. You can read the code in our GitHub repo. I'm also adding the core part of it below for convenience.
def configure_memory_overcommit(strict: false)
  if strict
    total_mem_kb = File.read("/proc/meminfo").match(/MemTotal:\s+(\d+)/)[1].to_i
    # 25% of memory is reserved for hugepages, which do not count towards the
    # commit limit, so only the remaining 75% is available for overcommit.
    non_hugepage_mem_kb = total_mem_kb * 0.75
    overcommit_kbytes = (non_hugepage_mem_kb * 0.8 + 2 * 1048576).round
    safe_write_to_file("/etc/sysctl.d/99-overcommit.conf", "vm.overcommit_memory=2\nvm.overcommit_kbytes=#{overcommit_kbytes}\n")
  else
    r "sudo rm -f /etc/sysctl.d/99-overcommit.conf"
  end
  r "sudo sysctl --system"
end
Note that we use vm.overcommit_kbytes instead of vm.overcommit_ratio. We need overcommit_kbytes because our formula includes a fixed 2 GB component that can't be expressed as a percentage. On a 4 GB server, the 2 GB buffer is 50% of the physical memory; on a 64 GB server, it's 3%. A single ratio can't capture both.
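To see why one ratio cannot fit every machine size, the sketch below converts the "80% plus 2 GB" budget into the percentage of physical memory it represents. The helper name is hypothetical, and the calculation uses the plain formula on total memory, without the hugepage carve-out from the snippet above.

```ruby
KB_PER_GB = 1024 * 1024

# Percentage of physical memory that the "80% + 2 GB" budget represents.
def equivalent_ratio_percent(total_memory_kb)
  budget_kb = total_memory_kb * 0.8 + 2 * KB_PER_GB
  (budget_kb / total_memory_kb * 100).round(1)
end

[4, 8, 64].each do |gb|
  puts format("%2d GB server -> equivalent ratio %.1f%%", gb, equivalent_ratio_percent(gb * KB_PER_GB))
end
```

The equivalent ratio falls from about 130% on a 4 GB server to about 83% on a 64 GB server, so any fixed overcommit_ratio would be too tight on small machines or too loose on large ones.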
Strict memory overcommit is a small configuration change that provides a meaningful safety improvement for PostgreSQL. It converts catastrophic OOM kills into graceful allocation failures. This way, each backend can manage the issue without disrupting the whole system. Even though we had to disable it for a while due to a kernel bug, it remains a key configuration for healthy PostgreSQL deployments.
If you run PostgreSQL in production, we recommend enabling vm.overcommit_memory=2. However, it is important to configure this carefully. If CommitLimit is set too low, your application may experience frequent OOM errors. On the other hand, if it is set too high, you will not fully benefit from the protection that strict memory overcommit provides. Our recommendation is to monitor your memory usage over time and enable this setting only after you have a solid understanding of the memory characteristics of your workload.
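One simple way to start monitoring is to track how close Committed_AS gets to CommitLimit over time. The sketch below is a minimal example of that idea, not a full monitoring setup. The function name is hypothetical, and the sample CommitLimit is an assumed value for an 8 GB machine; the Committed_AS value is the healthy-server figure shown earlier.

```ruby
# Compute commit headroom from /proc/meminfo-style text.
def commit_headroom(meminfo_text)
  fields = meminfo_text.scan(/^(\w+):\s+(\d+)\s+kB/).to_h { |name, kb| [name, kb.to_i] }
  limit = fields.fetch("CommitLimit")
  committed = fields.fetch("Committed_AS")
  {
    headroom_kb: limit - committed,
    used_percent: (committed * 100.0 / limit).round(1)
  }
end

# CommitLimit here is an assumed value, not taken from a real server.
sample = <<~MEMINFO
  CommitLimit:     8808038 kB
  Committed_AS:    2703940 kB
MEMINFO

puts commit_headroom(sample)
# On a live system: commit_headroom(File.read("/proc/meminfo"))
```

Recording `used_percent` periodically shows both the normal range for a workload and any slow upward drift, which is the pattern the kernel bug described above produced.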