The MySQL “swap insanity” problem and the effects of the NUMA architecture

The MySQL “swap insanity” problem and the effects of the NUMA architecture (blog.jcole.us)

92 points by admiun 14 years ago | 56 comments

One behavior I've noticed with Linux is that if you read files sequentially from disk (for example, with scp), Linux fills all of memory with those files' contents and then swaps out everything except the (obviously useless) disk cache. So you end up with memory full of data you will never need again, and trying to do anything causes a large and painful unswapping (which had the side effect of halting my qemu).

This is true insanity. Sure, you can disable swap or tune swappiness, but what's the reason for such a crazy default behavior?

justincormack 14 years ago | |

Use rsync. It now preserves the buffer cache status that files had before so it does not stomp on your allocations.

http://insights.oetiker.ch/linux/fadvise/
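To make the fadvise idea concrete, here is a minimal Python sketch of reading a file sequentially and then telling the kernel that its cached pages are not needed again. It is a simplification for illustration, not what rsync itself does: the function name `read_without_caching` and the chunk size are assumptions, and `POSIX_FADV_DONTNEED` is only a hint that the kernel is free to ignore.

```python
import os

def read_without_caching(path, chunk_size=1 << 20):
    """Read a file sequentially, then hint that its cached pages can be dropped.

    Returns the number of bytes read. Hypothetical helper for illustration.
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Hint that we will read the whole file front to back (len=0 means "to EOF").
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)
        # Hint that we will not touch this data again, so the page cache
        # does not need to keep it around at the expense of other memory.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return total
```

Note that this drops the file's pages unconditionally, including any that were already cached before the read, whereas the comment above describes rsync as preserving the prior cache status.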

mceachen 14 years ago | | |

Also, consider using --bwlimit to throttle the copy speed, so the spindle can still respond to other IO requests. (25-50% of unthrottled speed seemed to be a reasonable tradeoff).

pmjordan 14 years ago | |

I suspect the reason is that the system has noticed that your apps haven't touched their memory for a long time and thus "don't need it". This is a reasonably valid assumption on servers: if you've got some daemons that are backgrounded for minutes or hours at a time, keeping their memory resident is a waste. On the desktop, however, responsiveness (latency) is more important than throughput. Just because you only switch between apps on a timescale of minutes or hours doesn't mean the kernel should swap them out. So the algorithm needs different weighting.

I used to experience this problem, but I haven't lately. I suspect what's going on is that the "-desktop" kernel variant of OpenSUSE uses a differently weighted swappiness algorithm. If your distro offers a choice of different kernel variants, you could try them; otherwise (or if that doesn't help), you could track down the knobs you need to tweak to make the problem go away.

steerb 14 years ago | |

I guess that the kernel cannot know that the application (in your case scp) will not try to touch the data ever again.

The current default behavior, which you call crazy, does however favor programs which do actually need the data that was just read (e.g., databases).

masklinn 14 years ago | |

Would mmapping those files instead of reading them provide saner behavior, or does the kernel still do that?

FooBarWidget 14 years ago | | |

Reading a file with mmap() results in the same behavior.

nknight 14 years ago | |

I'm guessing you're either running a rather old distribution/kernel, or your swappiness is set far too aggressively. Try setting it to 0.

andreasvc 14 years ago |

I still don't comprehend why one needs swap at all. All the explanations I have come across talk about not having enough memory. Given that one has at least 8 GB of memory, or maybe even >100GB, why on earth would you need swap? Sure some process might allocate even more than that, but maybe it's better to refuse such a request than to slow down the whole system due to thrashing.

I get the idea that the reason might be that a lot of programs allocate memory which they don't actually need regularly, which is then very convenient to swap out. Rather than enabling this bad habit using slow disk storage it would be much better to expect programs to be more frugal, or at least signify whether something should be kept in memory or not.

xxjaba 14 years ago |

I am very impressed with how well written this article is. A brief description of the problem, links to relevant discussions for less informed readers to come up to speed, and clear examples of how key pieces of information were gathered. I learned more about the topic at hand from this article than I have from any blog post in recent memory.

jeremycole 14 years ago | |

Thanks! I am glad you learned something, and happy to get great feedback!

finnh 14 years ago | | |

I'll second this.

The intro was so well written that by the time I got to the first numa_maps output ("2aaaaad3e000 default anon=13240527 dirty=13223315 swapcache=3440324 active=13202235 N0=7865429 N1=5375098") I immediately thought "well geez look at that N0/N1 imbalance, there's your problem right there".
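For anyone who wants to quantify that imbalance, here is a small Python sketch that pulls the per-node page counts out of the numa_maps line quoted above. The helper name `node_pages` is made up for this example, and it only looks at the `N<node>=<pages>` fields.

```python
import re

def node_pages(numa_maps_line):
    """Return {node_id: page_count} from the N<id>=<pages> fields of one numa_maps line."""
    return {int(n): int(p) for n, p in re.findall(r"\bN(\d+)=(\d+)", numa_maps_line)}

# The line quoted from the article.
line = ("2aaaaad3e000 default anon=13240527 dirty=13223315 "
        "swapcache=3440324 active=13202235 N0=7865429 N1=5375098")

pages = node_pages(line)
total = sum(pages.values())
for node, count in sorted(pages.items()):
    # Share of this mapping's pages that live on each node.
    print(f"node {node}: {count} pages ({count / total:.1%})")
```

The two node counts add up to 13240527, the same value as the `anon=` field, with roughly 59% of the pages on node 0 and 41% on node 1.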

Point being, I haven't dealt with low-level hardware details since college, and yet your article's delightfully clear intro got me sufficiently educated to feel like I was right there with you.

A question well phrased is half answered...

sciurus 14 years ago |

Another good article is https://kevinclosson.wordpress.com/2009/05/14/you-buy-a-numa...

On commodity servers, unless you have specific reasons to do otherwise, just switch from NUMA to SUMA. There are two things you should do:

* Change a BIOS setting. The term for this will vary by manufacturer. For Dell, it means enabling node interleaving.

* Pass numa=off to the linux kernel (e.g. edit grub.cfg)

larsberg 14 years ago |

Yes, NUMA effects will really kill you, though how much depends on the particular quad-proc topology. I have some measurements for the interested in a small workshop paper I put together (I gathered the numbers in the context of tuning our garbage collector anyway):

http://dl.dropbox.com/u/1620890/website/writings/mspc12-stre...

WALoeIII 14 years ago |

Will this optimization help on virtualized machines like Xen? Or does all memory appear to be the same?

On an EC2 m1.xlarge:

$ numactl --hardware
available: 1 nodes (0)
libnuma: Warning: /sys not mounted or invalid. Assuming one node: No such file or directory
node 0 cpus:
node 0 size: <not available>
node 0 free: <not available>
libnuma: Warning: Cannot parse distance information in sysfs: No such file or directory

jakejake 14 years ago |

For those of us mere mortals, would it be safe to assume that adding the suggested line to mysqld_safe would be ok to do?

cmd="/usr/bin/numactl --interleave all $cmd"

corford 14 years ago |

Interesting read. Does anyone know if things have improved/changed significantly since the article was posted (Sep 2010)?

jeremycole 14 years ago | |

They have not changed in any way; however, there is a patchset currently proposed to change how NUMA works a bit. It is unclear whether it will change this situation.

corford 14 years ago | | |

Would that be a mysql or linux kernel patch (assume the latter)? Also want to echo xxjaba's comment further down - thanks for doing the work on that post, it was really enlightening!

lawnchair_larry 14 years ago |

The title is inaccurate. It should say, "the linux swap insanity problem" because this is entirely related to the linux kernel. It just happens to affect MySQL and similar workloads, but it is not MySQL's fault. It doesn't behave that way on other platforms either.

jeremycole 14 years ago | |

Pretty hard to make you happy, eh?

I agree with you on a purely technical basis; however, this article was written for the MySQL community, was tested (only) on MySQL, and has primarily affected me only on MySQL systems, which I (and the others referenced in the article) primarily run on Linux.

bifrost 14 years ago | |

That's a very good point. These types of problems exist to some extent on other OSes, but this case seems pretty specific to Linux. I suspect testing on Solaris/FreeBSD would show better results in this area.

defen 14 years ago |

Are the lessons here applicable to other commonly used databases (mongo, postgres, redis, etc)?

wmf 14 years ago | |

This applies to any case where you want a single process to use more than half the server's RAM.
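A toy calculation shows why "more than half" is the threshold. The sketch below is not the kernel's allocator, just arithmetic under assumed numbers: a hypothetical two-node machine with 32 GB per node and a 48 GB allocation, placed either node-local first (fill one node, then spill to the next) or interleaved evenly.

```python
def place_local_first(request_gb, node_free_gb):
    """Toy model: fill each node in order before spilling to the next one."""
    placed = []
    remaining = request_gb
    for free in node_free_gb:
        take = min(remaining, free)
        placed.append(take)
        remaining -= take
    return placed

def place_interleaved(request_gb, node_free_gb):
    """Toy model: split the request evenly across nodes (capped at each node's free memory)."""
    share = request_gb / len(node_free_gb)
    return [min(share, free) for free in node_free_gb]

nodes = [32, 32]   # hypothetical free GB on node 0 and node 1
request = 48       # hypothetical single-process allocation in GB

print("local first:", place_local_first(request, nodes))
print("interleaved:", place_interleaved(request, nodes))
```

In the local-first case node 0 ends up completely full while node 1 still has 16 GB free, which is the kind of imbalance the article describes; with interleaving each node keeps 8 GB free.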

jeremycole 14 years ago | |

Yes, absolutely. In fact one of the most common longer term referrers for that post is about MongoDB, not MySQL:

http://www.mongodb.org/display/DOCS/NUMA

j2labs 14 years ago |

tl;dr - If you're running a database, or a generally memory-intensive system, on a machine with multiple CPUs, you should run this command: echo 0 > /proc/sys/vm/zone_reclaim_mode

But the article is great. You should definitely read it.

finnh 14 years ago | |

Except that's not the TL;DR at all.

From the article:

"An aside on zone_reclaim_mode

The zone_reclaim_mode tunable in /proc/sys/vm can be used to fine-tune memory reclamation policies in a NUMA system. Subject to some clarifications from the linux-mm mailing list, it doesn’t seem to help in this case."

The real TL;DR is "run your mysql command under the auspices of '/usr/bin/numactl --interleave all' so that your big pool allocation is split evenly across nodes"

And an even better solution would be if _only_ the big pool allocation used interleaved allocation, and all the rest used normal node-bound allocation. This would require some sort of change to the malloc calls though, yes? All of the solutions listed in the article operate at the granularity of a process (or higher), not down to the individual allocation.

j2labs 14 years ago | | |

My mistake, you are correct that I forgot the second command.

The new tldr is: numactl --interleave=all /path/to/daemon; echo 0 > /proc/sys/vm/zone_reclaim_mode

This file helps explain the different ways one can tweak memory with Linux: http://www.kernel.org/doc/Documentation/sysctl/vm.txt
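Before changing anything, it may help to check what the tunables mentioned in this thread are currently set to. The following is a minimal Python sketch that reads them from /proc/sys/vm; the helper name `read_vm_tunable` and its `root` parameter are assumptions for illustration, and `zone_reclaim_mode` may simply be absent on kernels built without NUMA support, in which case the helper returns None.

```python
import os

def read_vm_tunable(name, root="/proc/sys/vm"):
    """Return the integer value of a vm tunable, or None if it is missing or unreadable."""
    path = os.path.join(root, name)
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        # File does not exist, is not readable, or does not hold a single integer.
        return None

for name in ("swappiness", "zone_reclaim_mode"):
    print(name, "=", read_vm_tunable(name))
```

Reading these files does not need root; writing to them (as in the echo command above) does.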