Vm.overcommit_memory=2 is always the right setting for servers

(ariadne.space)

33 points | by signa11 2 days ago

15 comments

c0l0 1 hour ago
I realize this is mostly tangential to the article, but a word of warning for those who are about to mess with overcommit for the first time: In my experience, the extreme stance of "always do [thing] with overcommit" is just not defensible, because most (yes, also "server") software is just not written under the assumption that being able to deal with allocation failures in a meaningful way is a necessity. At best, there's an "malloc() or die"-like stanza in the source, and that's that.
You can and maybe even should disable overcommit this way when running postgres on the server (and only a minimum of what you would these days call sidecar processes (monitoring and backup agents, etc.) on the same host/kernel), but once you have a typical zoo of stuff using dynamic languages living there, you WILL blow someone's leg off.
[-]
- kg 1 hour ago
  I run my development VM with overcommit disabled and the way stuff fails when it runs out of memory is really confusing and mysterious sometimes. It's useful for flushing out issues that would otherwise cause system degradation w/overcommit enabled, so I keep it that way, but yeah... doing it in production with a bunch of different applications running is probably asking for trouble.
  [-]
  - Tuna-Fish 54 minutes ago
    The fundamental problem is that your machine is running software from a thousand different projects or libraries just to provide the basic system, and most of them do not handle allocation failure gracefully. If program A allocates too much memory and overcommit is off, that doesn't necessarily mean that A gets an allocation failure. It might also mean that code in library B in background process C gets the failure, and fails in a way that puts the system in a state that's not easily recoverable, and is possibly very different every time it happens.
    For cleanly surfacing errors, overcommit=2 is a bad choice. For most servers, it's much better to leave overcommit on, but make the OOM killer always target your primary service/container, using oom-score-adj, and/or memory.oom.group to take out the whole cgroup. This way, you get to cleanly combine your OOM condition handling with the general failure case and can restart everything from a known foundation, instead of trying to soldier on while possibly lacking some piece of support infrastructure that is necessary but usually invisible.
  - vin10 1 hour ago
    > he way stuff fails when it runs out of memory is really confusing
    have you checked what your `vm.overcommit_ratio` is? If its < 100%, then you will get OOM kills even if plenty of RAM is free since the default is 50 i.e. 50% of RAM can be COMMITTED and no more.
    curious what kind of failures you are alluding to.
LordGrey 2 days ago
For anyone not familiar with the meaning of '2' in this context:
The Linux kernel supports the following overcommit handling modes
0 - Heuristic overcommit handling. Obvious overcommits of address space are refused. Used for a typical system. It ensures a seriously wild allocation fails while allowing overcommit to reduce swap usage. root is allowed to allocate slightly more memory in this mode. This is the default.
1 - Always overcommit. Appropriate for some scientific applications. Classic example is code using sparse arrays and just relying on the virtual memory consisting almost entirely of zero pages.
2 - Don't overcommit. The total address space commit for the system is not permitted to exceed swap + a configurable amount (default is 50%) of physical RAM. Depending on the amount you use, in most situations this means a process will not be killed while accessing pages but will receive errors on memory allocation as appropriate. Useful for applications that want to guarantee their memory allocations will be available in the future without having to initialize every page.
[-]
- dbdr 1 hour ago
  > exceed swap + a configurable amount (default is 50%) of physical RAM
  Naive question: why is this default 50%, and more generally why is this not the entire RAM, what happens to the rest?
  [-]
  - dasil003 36 minutes ago
    Not sure if I understand your question but nothing "happens to the rest", overcommitting just means processes can allocate memory in excess of RAM + swap. The percentage is arbitrary, could be 50%, 100% or 1000%. Allocating additional memory is not a problem per se, it only becomes a problem when you try to actually write (and subsequently read) more than you have.
  - crote 44 minutes ago
    Just a guess, but I reckon it doesn't account for things like kernel memory usage, such as caches and buffers. Assigning 100% of physical RAM to applications is probably going to have a Really Bad Outcome.
  - vin10 46 minutes ago
    it's a (then-)safe default from the age when having 1GB of RAM and 2GB of swap was the norm: https://linux-kernel.vger.kernel.narkive.com/U64kKQbW/should...
- sidewndr46 45 minutes ago
  Do any of the settings actually result in "malloc" or a similar function returning NULL?
  [-]
  - LordGrey 37 minutes ago
    malloc() and friends may always return NULL. From the man page:
    If successful, calloc(), malloc(), realloc(), reallocf(), valloc(), and aligned_alloc() functions return a pointer to allocated memory. If there is an error, they return a NULL pointer and set errno to ENOMEM.
    In practice, I find a lot of code that does not check for NULL, which is rather distressing.
    [-]
    - sidewndr46 1 minute ago
      It's been a while but while I agree the man page says that, my limited understanding was the typical libc on linux won't really return NULL under any sane scenario. Even when the memory can't be backed
vin10 1 hour ago
For anyone feeling brave enough to disable overcommit after reading this, be mindful that default `vm.overcommit_ratio` is 50% which means that if no swap is available, on a system with 2GB of total RAM, more than 1GB of RAM can't be allocated and requests will fail with preemptive OOMs. (e.g. postgresql servers typically disable overcommit)
- https://github.com/torvalds/linux/blob/master/mm/util.c#L753
laurencerowe 45 minutes ago
Disabling overcommit on V8 servers like Deno will be incredibly inefficient. Your process might only need ~100MB of memory or so but V8's cppgc caged heap requires a 64GB allocation in order to get a 32GB aligned area in which to contain its pointers. This is a security measure to prevent any possibility of out of cage access.
[-]
- silon42 9 minutes ago
  Maybe it should use MAP_NORESERVE ?
wmf 2 days ago
This doesn't address the fact that forking large processes requires either overcommit or a lot of swap. That may be the source of the Redis problem.
[-]
- loeg 58 minutes ago
  Why? Most COWed pages will remain untouched. They only need to allocate when touched.
  [-]
  - mahkoh 48 minutes ago
    The point of disabling overcommit, as per the article, is that all pages in virtual memory must be backed by physical memory at all times. Therefore all virtual memory must reserve physical memory at the time of the fork call, even if the contents of the pages only get copied when they are touched.
  - pm215 52 minutes ago
    Because the point of forbidding overcommit is to ensure that the only time you can discover you're out of memory is when you make a syscall that tries (explicitly or implicitly) to allocate more memory. If you don't account the COW pages to both the parent and the child process, you have a situation where you can discover the out of memory condition when the process tries to dirty the RAM and there's no page available to do that with...
  - Tuna-Fish 49 minutes ago
    If you have overcommit on, that happens. But if you have it off, it has to assume the worst case, otherwise there can be a failure when someone writes to a page.
deathanatos 1 hour ago
This is quite the bold statement to make with RAM prices sky high.
I want to agree with the locality of errors argument, and while in simple cases, yes, it holds true, it isn't necessarily true. If we don't overcommit, the allocation that kills us is simply the one that fails. Whether this allocation is the problematic one is a different question: if we have a slow leak that, every 10k allocation allocs and leaks, we're probably (9999 / 10k, assuming spherical allocations) going to fail on one that isn't the problem. We get about as much info as the oom-killer would have, anyways: this program is allocating too much.
Animats 1 hour ago
Setting 2 is still pretty generous. It means "Kernel does not allow allocations that exceed swap + (RAM × overcommit_ratio / 100)." It's not a "never swap or overcommit" setting. You can still get into thrashing by memory overload.
We may be entering an era when everyone in computing has to get serious about resource consumption. NVidia says GPUs are going to get more expensive for the next five years. DRAM prices are way up, and Samsung says it's not getting better for the next few years. Bulk electricity prices are up due to all those AI data centers. We have to assume for planning purposes that computing gets a little more expensive each year through at least 2030.
Somebody may make a breakthrough, but there's nothing in the fab pipeline likely to pay off before 2030, if then.
[-]
- silon42 1 hour ago
  For me, on the desktop, thrashing overload is the most common way the Linux system effectively crashes... (I've left it overnight a few times, sometimes it recovered, but not always).
  I'm not disabling overcommit for now, but maybe I should.
  [-]
  - Tuna-Fish 52 minutes ago
    disabling overcommit does not fix trashing. Reducing the size of your swap does.
    [-]
    - silon42 5 minutes ago
      Yes, but not fully, it may still thrash on mmaped files (especially readonly ones).
EdiX 41 minutes ago
This is completely wrong. First, disabling overcommit is wasteful because of fork and because of the way thread stacks are allocated. Sorry, you don't get exact memory accounting with C, not even Windows will do exact accounting of thread stacks.
Secondly, memory is a global resource so you don't get local failures when it's exhausted, whoever allocates first after memory has been exhausted will get an error they might be the application responsible for the exhaustion or they might not be. They might crash on the error or they might "handle it", keep going and render the system completely unusable.
No, exact accounting is not a solution. Ulimits and configuring the OOM killer are solutions.
[-]
- otabdeveloper4 38 minutes ago
  > because of fork and because of the way thread stacks are allocated
  For modern (post-x86_64) memory allocators a common strategy is to allocate hundreds of gigabytes of virtual memory and let the kernel handle deal with actually swapping in physical memory pages upon use.
  This way you can partition the virtual memory space into arenas as you like. This works really well.
charcircuit 41 minutes ago
>Would you rather debug a crash at the allocation site
The allocation site is not necessarily what is leaking memory. What you actually want in either case is a memory dump where you can tell what is leaking or using the memory.
jleyank 1 day ago
As I recall, this appeared in the 90’s and it was a real pain debugging then as well. Having errors deferred added a Heisenbug component to what should have been a quick, clean crash.
Has malloc ever returned zero since then? Or has somebody undone this, erm, feature at times?
[-]
- baq 1 hour ago
  This is exactly what the article’s title does
renehsz 2 days ago
Strongly agree with this article. It highlights really well why overcommit is so harmful.
Memory overcommit means that once you run out of physical memory, the OOM killer will forcefully terminate your processes with no way to handle the error. This is fundamentally incompatible with the goal of writing robust and stable software which should handle out-of-memory situations gracefully.
But it feels like a lost cause these days...
So much software breaks once you turn off overcommit, even in situations where you're nowhere close to running out of physical memory.
What's not helping the situation is the fact that the kernel has no good page allocation API that differentiates between reserving and committing memory. Large virtual memory buffers that aren't fully committed can be very useful in certain situations. But it should be something a program has to ask for, not the default behavior.
[-]
- 201984 45 minutes ago
  > What's not helping the situation is the fact that the kernel has no good page allocation API that differentiates between reserving and committing memory.
  mmap with PROT_NONE is such a reservation and doesn't count towards the commit limit. A later mmap with MAP_FIXED and PROT_READ | PROT_WRITE can commit parts of the reserved region, and mmap calls with PROT_NONE and MAP_FIXED will decommit.
- hparadiz 58 minutes ago
  That's a normal failure state that happens occasionally. Out of memory errors come up all the time when writing robust async job queues. There are a lot of other reasons a failure could happen but running out of memory is just one of them. Sure I can force the system to use swap but that would degrade performance for everything else so it's better to let it die and log the result and check your dead letter queue after.
pizlonator 53 minutes ago
This is such an old debate. The real answer, as with all such things, is "it depends".
Two reasons why overcommit is a good idea:
- It lets you reserve memory and use the dirtying of that memory to be the thing that commits it. Some algorithms and data structures rely on this strongly (i.e. you would have to use a significantly different algorithm, which is demonstrably slower or more memory intensive, if you couldn't rely on overcommit).
- Many applications have no story for out-of-memory other halting. You can scream and yell at them to do better, but that won't help, because those apps that find themselves in that supposedly-bad situation ended up there for complex and well-considered reasons. My favorite: having complex OOM error handling paths is the worst kind of attack surface, since it's hard to get test coverage for it. So, it's better to just have the program killed instead, because that nixes the untested code path. For those programs, there's zero value in having the memory allocator be able to report OOM conditions other than by asserting in prod that mmap/madvise always succeed, which then means that the value of not overcommitting is much smaller.
Are there server apps where the value of gracefully handling out of memory errors outweighs the perf benefits of overcommit and the attack surface mitigation of halting on OOM? Yeah! But I bet that not all server apps fall into that bucket
blibble 1 hour ago
redis uses the copy-on-write property of fork() to implement saving
which is elegant and completely legitimate
[-]
- ycombinatrix 1 hour ago
  How does fork() work with vm.overcommit=2?
  A forked process would assume memory is already allocated, but I guess it would fail when writing to it as if vm.overcommit is set to 0 or 1.
  [-]
  - pm215 47 minutes ago
    I believe (per the stuff at the bottom of https://www.kernel.org/doc/Documentation/vm/overcommit-accou... ) that the kernel does the accounting of how much memory the new child process needs and will fail the fork() if there isn't enough. All the COW pages should be in the "shared anonymous" category so get counted once per user (i.e. once for the parent process, once for the child), ensuring that the COW copy can't fail if the fork succeeded.
  - toast0 11 minutes ago
    As pm215 states, it doubles your memory commit. It's somewhat common for large programs/runtimes that may fork at runtime to spawn an intermediary process during startup to use for runtime forks to avoid the cost of CoW on memory and mapppings and etc where the CoW isn't needed or desirable; but redis has to fork the actual service process because it uses CoW to effectively snapshot memory.
- loeg 53 minutes ago
  Can you elaborate on how this comment is connected to the article?
  [-]
  - blibble 44 minutes ago
    did you read it the article? there's a large section on redis
    the author says it's bad design, but has entirely missed WHY it wants overcommit
jcalvinowens 1 hour ago
There's a reason nobody does this: RAM is expensive. Disabling overcommit on your typical server workload will waste a great deal of it. TFA completely ignores this.
This is one of those classic money vs idealism things. In my experience, the money always wins this particular argument: nobody is going to buy more RAM for you so you can do this.
[-]
- toast0 1 hour ago
  Even if you disable overcommit, I don't think you will get pages assigned when you allocate. If your allocations don't trigger an allocation failure, you should get the same behavior with respect to disk cache using otherwise unused pages.
  The difference is that you'll fail allocations, where there's a reasonable interface for errors, rather than failing at demand paging when writing to previously unused pages where there's not a good interface.
  Of course, there are many software patterns where excessive allocations are made without any intent of touching most of the pages; that's fine with overcommit, but it will lead to allocation failures when you disable overcommit.
  Disabling overcommit does make fork in a large process tricky; I don't think the rant about redis in the article is totally on target; fork to persist is a pretty good solution, copy on write is a reasonable cost to pay while dumping the data to disk and then it returns to normal when the dump is done. But without overcommit, it doubles the memory commitment while the dump is running, and that's likely to cause issues if redis is large relative to memory and that's worth checking for and warning about. The linked jemalloc issue seems like it could be problematic too, but I only skimmed; seems like that's worth warning about as well.
  For the fork path, it might be nice if you could request overcommit in certain circumstances... fork but only commit X% rather than the whole memory space.
  [-]
  - jcalvinowens 1 hour ago
    You're correct it doesn't prefault the mappings, but that's irrelevant: it accounts them as allocated, and a later allocation which goes over the limit will immediately fail.
    Remember, the limit is artificial and defined by the user with overcommit=2, by overcommit_ratio and user_reserve_kbytes. Using overcommit=2 necessarily wastes RAM (renders a larger portion of it unusable).
    [-]
    - toast0 47 minutes ago
      > Using overcommit=2 necessarily wastes RAM (renders a larger portion of it unusable).
      The RAM is not unusable, it will be used. Some portion of ram may be unallocatable, but that doesn't mean it's wasted.
      There's a tradeoff. With overcommit disabled, you will get allocation failure rather than OOM killer. But you'll likely get allocation failures at memory pressure below that needed to trigger the OOM killer. And if you're running a wide variety of software, you'll run into problems because overcommit is the mainstream default for Linux, so many things are only widely tested with it enabled.
      [-]
      - jcalvinowens 39 minutes ago
        > The RAM is not unusable, it will be used. Some portion of ram may be unallocatable
        I think that's a meaningless distinction: if userspace can't allocate it, it is functionally wasted.
        I completely agree with your second paragraph, but again, some portion of RAM obtainable with overcommit=0 will be unobtainable with overcommit=2.
        Maybe a better way to say it is that a system with overcommit=2 will fail at a lower memory pressure than one with overcommit=0. Additional RAM would have to be added to the former system to successfully run the same workload. That RAM is waste.
    - loeg 57 minutes ago
      If the overcommit ratio is 1, there is no portion rendered unusable? This seems to contradict your "necessarily" wastes RAM claim?
      [-]
      - jcalvinowens 54 minutes ago
        Read the comment again, that wasn't the only one I mentioned.
- loeg 58 minutes ago
  How does disabling overcommit waste RAM?
  [-]
  - jcalvinowens 56 minutes ago
    Because userspace rarely actually faults in all the pages it allocates.
    [-]
    - loeg 54 minutes ago
      Surely the source of the waste here is the userspace program not using the memory it allocated, rather than whether or not the kernel overcommits memory. Attributing this to overcommit behavior is invalid.
      [-]
      - nickelpro 38 minutes ago
        Reading COW memory doesn't cause a fault. It doesn't mean unused literally.
        And even if it's not COW, there's nothing wrong or inefficient about opportunistically allocating pages ahead of time to avoid syscall latency. Or mmapping files and deciding halfway through you don't need the whole thing.
        There are plenty of reasons overcommit is the default.
      - jcalvinowens 50 minutes ago
        Obviously. But all programs do that and have done it forever, it's literally the very reason overcommit exists.
- wmf 1 hour ago
  If you have enough swap there's no waste.
  [-]
  - jcalvinowens 1 hour ago
    Wasted swap is still waste, and the swapping costs cycles.
    [-]
    - wmf 1 hour ago
      With overcommit off the swap isn't used; it's only necessary for accounting purposes. I agree that it's a waste of disk space.