This is true insanity. Sure, you can disable swap or tune swappiness, but what's the reason for the crazy default behavior?
I used to experience this problem, but I haven't lately. I suspect what's going on is that the "-desktop" kernel variant of OpenSUSE uses a differently weighted swappiness algorithm. If your distro offers a choice of different kernel variants, you could try them; otherwise (or if that doesn't help), you could track down the knobs you need to tweak to make the problem go away.
The current default behavior, which you call crazy, does however favor programs which do actually need the data that was just read (e.g., databases).
I get the idea that the reason might be that a lot of programs allocate memory which they don't actually need regularly, which is then very convenient to swap out. Rather than enabling this bad habit using slow disk storage it would be much better to expect programs to be more frugal, or at least signify whether something should be kept in memory or not.
The intro was so well written that by the time I got to the first numa_maps output ("2aaaaad3e000 default anon=13240527 dirty=13223315 swapcache=3440324 active=13202235 N0=7865429 N1=5375098") I immediately thought "well geez look at that N0/N1 imbalance, there's your problem right there".
Point being, I haven't dealt with low-level hardware details since college, and yet your article's delightfully clear intro got me sufficiently educated to feel like I was right there with you.
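For anyone who wants to put a number on that N0/N1 imbalance, here is a minimal Python sketch that parses the numa_maps line quoted above and reports each node's share of the pages. The function names are made up for illustration, and it assumes the line follows the usual "address policy key=value ..." layout; fields without a numeric value are simply skipped.

```python
def parse_numa_maps_line(line):
    """Split one numa_maps line into address, policy, and numeric counters."""
    fields = line.split()
    address, policy = fields[0], fields[1]
    counters = {}
    for field in fields[2:]:
        key, sep, value = field.partition("=")
        # Keep only key=value fields whose value is a plain page count.
        if sep and value.isdigit():
            counters[key] = int(value)
    return address, policy, counters


def node_shares(counters):
    """Return {node: fraction of pages} for the N<id>=<pages> counters."""
    per_node = {
        key: pages
        for key, pages in counters.items()
        if key.startswith("N") and key[1:].isdigit()
    }
    total = sum(per_node.values())
    if total == 0:
        return {}
    return {node: pages / total for node, pages in per_node.items()}


line = ("2aaaaad3e000 default anon=13240527 dirty=13223315 "
        "swapcache=3440324 active=13202235 N0=7865429 N1=5375098")
address, policy, counters = parse_numa_maps_line(line)
for node, share in sorted(node_shares(counters).items()):
    print(f"{node}: {counters[node]} pages ({share:.1%})")
```

On the quoted line the two node counts add up exactly to the anon count, and node 0 holds roughly 59% of the pages against roughly 41% on node 1.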
A question well phrased is half answered...
On commodity servers, unless you have specific reasons to do otherwise, just switch from NUMA to SUMA. There are two things you should do:
* Change a BIOS setting. The term for this will vary by manufacturer. For Dell, it means enabling node interleaving.
* Pass numa=off to the Linux kernel (e.g. edit grub.cfg)
http://dl.dropbox.com/u/1620890/website/writings/mspc12-stre...
On an EC2 m1.xlarge:
$ numactl --hardware
available: 1 nodes (0)
libnuma: Warning: /sys not mounted or invalid. Assuming one node: No such file or directory
node 0 cpus:
node 0 size: <not available>
node 0 free: <not available>
libnuma: Warning: Cannot parse distance information in sysfs: No such file or directory
cmd="/usr/bin/numactl --interleave all $cmd"
I agree with you on a purely technical basis. However, this article was written for the MySQL community, was tested (only) on MySQL, and has primarily affected me only on MySQL systems, which I (and the others referenced in the article) primarily run on Linux.
But the article is great. You should definitely read it.
From the article:
"An aside on zone_reclaim_mode
The zone_reclaim_mode tunable in /proc/sys/vm can be used to fine-tune memory reclamation policies in a NUMA system. Subject to some clarifications from the linux-mm mailing list, it doesn’t seem to help in this case."
The real TL;DR is "run your mysql command under the auspices of '/usr/bin/numactl --interleave all' so that your big pool allocation is split evenly across nodes"
And an even better solution would be if _only_ the big pool allocation used interleaved allocation, and all the rest used normal node-bound allocation. This would require some sort of change to the malloc calls though, yes? All of the solutions listed in the article operate at the granularity of a process (or higher), not down to the individual allocation.
The new tldr is: numactl --interleave=all /path/to/daemon; echo 0 > /proc/sys/vm/zone_reclaim_mode
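If you start the daemon from a script rather than by hand, the same tldr can be sketched in Python. This is only an illustration: the helper names are hypothetical, the numactl path is the one from the snippet quoted earlier in the thread, and /path/to/daemon is the same placeholder as above. It builds the wrapped command line and reads the current zone_reclaim_mode without changing anything.

```python
import shlex

NUMACTL = "/usr/bin/numactl"  # path as quoted earlier in the thread
ZONE_RECLAIM_PATH = "/proc/sys/vm/zone_reclaim_mode"


def interleaved_argv(daemon_argv, numactl=NUMACTL):
    """Prefix a command so its allocations are interleaved across all nodes."""
    return [numactl, "--interleave=all", *daemon_argv]


def read_zone_reclaim_mode(path=ZONE_RECLAIM_PATH):
    """Return the current setting as an int, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


argv = interleaved_argv(["/path/to/daemon"])
print(shlex.join(argv))
print("zone_reclaim_mode:", read_zone_reclaim_mode())
```

Passing the resulting list to subprocess.run would launch the daemon; writing 0 to zone_reclaim_mode still needs root, so the sketch only reports the value.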
This file helps explain the different ways one can tweak memory with Linux: http://www.kernel.org/doc/Documentation/sysctl/vm.txt
This would affect C programs in particular, since they usually manage their memory manually. If bash can't malloc() a buffer for its input, for example, it will simply fail, and you might not be able to do anything to fix the system; the same goes for sshd, which might end up refusing new connections as a result. Programs that preallocate important data structures, and programs using garbage collection, would fare somewhat better.
In other words, if swap is disabled you will still need a sort of soft limit or reserved space to ensure that programs can survive memory starvation. I don't know if the Linux kernel (or the GNU C library) has anything of the sort.
It is not desirable for a machine to have a "wall" which, upon being hit, becomes a harsh restriction on its capabilities. This is because we often encounter the "wall" unexpectedly, at a time that might be critical.
1) it needs to either be told the various sizes, speeds, and quirks on each server to make best use. (just some work)
2) it needs to coordinate with the other processes running on the system to divide up the resources. This is hard. Generally people bail and just assign some share of RAM and hope for the best with the other layers.
I guess I want to argue that with the currently typical amounts of RAM, all critical data should fit in RAM and stay there. The idea of virtual memory was to abstract over the difference between RAM and disk, but perhaps this has become a harmful abstraction now that RAM is big enough while the disadvantage of slow disks remains. RAM and disks are fundamentally different parts of the memory hierarchy, and should be treated completely differently by applications.
> DO NOT TURN OFF SWAP to prevent this. Your box will crawl, kswapd will chew up a lot of the processor, Linux needs swap enabled, lets just hope its not used.
(from one of the blogs linked in the article).
However, I can't find a clear explanation of why this is so.
I suspect it would help quite a bit, if done right, and for the right query workload.
Thanks Jeremy, I will do.
But for a lot of systems your service will fail shortly after you start swapping anyway, because the performance cost of swapping is so high that it often starts a death spiral (can't handle enough requests, so they start piling up, eating even more memory, until your system dies or you hit connection limits etc.).
So "best case" in a typical configuration is that the wall is a bit higher. Worst case you gain nothing at all from the swap.
Personally I treat it as a failure if we ever hit swap - it means connection limits etc. have been set too high.
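To make the death spiral concrete, here is a toy Python model. Every number in it is hypothetical, and the one big assumption is that throughput drops by a fixed factor as soon as the queued requests no longer fit in RAM. It is a sketch of the argument, not a measurement of any real system.

```python
def simulate_backlog(arrivals_per_tick, base_capacity, ram_units,
                     mem_per_request, swap_slowdown, ticks):
    """Return the number of requests still queued after each tick."""
    queued = 0
    history = []
    for _ in range(ticks):
        queued += arrivals_per_tick
        memory_needed = queued * mem_per_request
        if memory_needed <= ram_units:
            capacity = base_capacity
        else:
            # Assumption: once we spill into swap, throughput collapses.
            capacity = base_capacity // swap_slowdown
        queued -= min(queued, capacity)
        history.append(queued)
    return history


# Slightly overloaded service: 110 requests arrive per tick, 100 are served.
overloaded = simulate_backlog(arrivals_per_tick=110, base_capacity=100,
                              ram_units=150, mem_per_request=1,
                              swap_slowdown=10, ticks=8)
print(overloaded)
```

In this model the backlog creeps up by 10 per tick while everything fits in RAM, then jumps by 100 per tick once the swap penalty applies, which is the "wall is a bit higher" effect described above.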
/agree. but still a useful feature.
The degraded performance a system will show when it starts hitting disk instead of memory is a great 'soft' failure.
I think it is good to have gradations. Going from 'OK' to 'Damn-this-is-slow' before 'Fail' is handy.
You can therefore swap it out and use the extra memory for cache.
Most long term applications only need a small fraction of their startup memory.
If they never modify those CoW pages, they can both happily keep using the same copy of the page in memory, and you can have two 1GB processes using a total of e.g. 1.01GB of real memory.
This is very useful in practice, but it means that the system needs the ability to over-commit memory allocations (allow CoW allocations etc. when there is no actual memory available to back them), and over-commit currently requires, and probably should require, swap (some place to dump pages in case an over-committed allocation comes calling).
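The arithmetic behind the two 1GB processes can be sketched in a few lines of Python. The 4 KiB page size and the count of dirtied pages are assumptions picked to reproduce the 1.01GB example; the function name is made up for illustration.

```python
PAGE_SIZE = 4096  # assumed page size in bytes
GIB = 1024 ** 3


def cow_memory(process_bytes, dirtied_pages, page_size=PAGE_SIZE):
    """Return (resident_bytes, worst_case_bytes) for a parent plus one forked child.

    resident_bytes counts the shared pages once plus one private copy for
    every page that has been written to since the fork.
    worst_case_bytes is what would be needed if every shared page were written.
    """
    total_pages = process_bytes // page_size
    dirtied_pages = min(dirtied_pages, total_pages)
    resident = (total_pages + dirtied_pages) * page_size
    worst_case = 2 * total_pages * page_size
    return resident, worst_case


resident, worst_case = cow_memory(GIB, dirtied_pages=2621)
print(f"resident:   {resident / GIB:.2f} GiB")
print(f"worst case: {worst_case / GIB:.2f} GiB")
```

The gap between the two numbers is the over-commit: the system has promised up to 2 GiB while only about 1.01 GiB is actually backed by memory.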
If you have 16GB of ram, and 0GB of swap, the OOM killer will kick in at 16GB.
Also, suppose you need the memory only for startup and shutdown (things like logfiles, network connections, command line parsing, etc).
Things like network connections and logfiles are used all the time, so they won't be swapped out (actually file handles are kernel side, so they are never swapped anyway). You can free the command-line parsing data after setting the options.
And clean shutdown is overrated: long running programs can just terminate fairly gracelessly if necessary, the OS cleans everything up.
If you increase the size of the memory available to you (sbrk) you can only decrease it if no memory is allocated between the new area and the end of it.
In practice the memory is never returned, and applications rely on swap to deal with that.
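The sbrk constraint is easy to model. The sketch below is a deliberately simplified picture of a heap that grows only via sbrk, written in Python with a made-up function name; real allocators can also obtain large blocks through mmap and return those independently, which this model ignores.

```python
def trimmable_bytes(blocks):
    """Return how many bytes could be given back by lowering the program break.

    blocks is a list of (size_bytes, in_use) tuples in ascending address
    order. Only free blocks at the very top of the heap can be released,
    because the break can move down only over memory nothing is using.
    """
    freed = 0
    for size, in_use in reversed(blocks):
        if in_use:
            break
        freed += size
    return freed


MIB = 1024 ** 2
# 128 MiB of free memory, pinned by one small live allocation at the top.
pinned_heap = [(64 * MIB, False), (64 * MIB, False), (4096, True)]
# The same heap after that last allocation is freed too.
empty_heap = [(64 * MIB, False), (64 * MIB, False), (4096, False)]

print(trimmable_bytes(pinned_heap))
print(trimmable_bytes(empty_heap))
```

A single 4 KiB allocation at the top is enough to keep 128 MiB from being returned, which is why the memory tends to stay with the process and ends up in swap instead.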
It's not the logfile (and network) handle that is swapped out - it's the code for deciding where it is, and opening it. Also initialization code.
Some programs can abort, but others will require a (slow) consistency check of their data if that happens to them.
And finally theory is all well and good, but in actual practice about 3/4 of the memory used by running programs can be swapped out.