Here’s one case we’ve spent some time debugging, involving StorPool, VMs, cgroups, the OOM (out-of-memory) killer and caching. Some of it should be useful to a lot of sysadmins out there.
The issue started with a customer complaining:
> It’s happening again, on this hypervisor VMs are being killed
> by the OOM killer. This doesn’t happen on hypervisors where no
> cgroups are configured, can you not configure/use them?
The reason we have to set up cgroups on all StorPool deployments is that there is no other way to reserve memory and CPUs for a specific set of processes. As the storage system is one of the parts of the system most sensitive to latencies and problems (or, to put it better, everything else is very sensitive to problems in the storage system), it really pays off to make sure it’s protected from everything else. It also makes a lot of sense to do this for your VMs and system processes, just in case of a runaway allocation. There is also always the possibility of a memory deadlock (as with any storage system not entirely in the kernel), but that is a topic for another post.
Runaway allocations would not be an issue if Linux’s OOM killer worked properly, but the versions of the OOM killer in most enterprise or enterprise-like distributions have their own quirks and don’t behave properly. All in all, they make your life miserable and kill your sshd instead of the allocating process. This is not helped by the new ideas (https://lwn.net/Articles/761118/) for the interaction of the OOM killer and cgroups.
So, to reserve a specific amount of memory for each task on a KVM hypervisor, you should create memory cgroups for system tasks (system.slice), virtual machines (machine.slice), maybe user tasks (user.slice), and in our case a StorPool cgroup (storpool.slice). Then, for all these cgroups you set memory limits whose sum is ~1-2GB less than the total memory available on the host (as some kernel memory is not accounted in cgroups), and you make sure that there are no processes in the root cgroup, only in the cgroups above or their children. This is accomplished through different configuration options in libvirt and systemd, and it ensures that even if one cgroup overflows, no other will suffer.
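The arithmetic behind "the limits should add up to a bit less than the total" can be sketched as below. This is only an illustration: the function names, the 256 GiB host and the split between slices are made up for the example, not taken from a real deployment.

```python
GIB = 1024 ** 3

def unallocated_headroom(total_bytes, limits):
    """Return the memory left unassigned after summing all cgroup limits."""
    return total_bytes - sum(limits.values())

def limits_fit(total_bytes, limits, reserve_bytes=2 * GIB):
    """True if the limits leave at least reserve_bytes outside any cgroup."""
    return unallocated_headroom(total_bytes, limits) >= reserve_bytes

# Hypothetical 256 GiB hypervisor; the numbers are illustrative only.
total = 256 * GIB
limits = {
    "system.slice": 4 * GIB,
    "user.slice": 2 * GIB,
    "storpool.slice": 16 * GIB,
    "machine.slice": 232 * GIB,
}

print(unallocated_headroom(total, limits) // GIB, "GiB left for unaccounted kernel memory")
print("limits fit:", limits_fit(total, limits))
```

If the check fails, the usual knob to turn is the machine.slice limit, since it is by far the largest.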
There is also a known issue with memory cgroups, buffer cache and the OOM killer. If you don’t use cgroups and you’re short on memory, the kernel is able to start flushing dirty cache and dropping clean cache, reclaim some of that memory and give it to whoever needs it. In the case of cgroups, for some reason there is no such reclaim logic for the clean cache, and the kernel prefers to trigger the OOM killer, which then gets rid of some useful process.
(There’s some very useful information about the actual memory usage of each cgroup in the “memory.stat” file in its directory in /sys, for example /sys/fs/cgroup/memory/machine.slice/memory.stat)
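A minimal sketch of reading that file, assuming the cgroup v1 layout shown in the path above, where each line is a key followed by a number and the page cache shows up under `cache`. The helper names are ours, and the sample text is made up so the snippet runs anywhere.

```python
def parse_memory_stat(text):
    """Parse 'key value' lines from a cgroup memory.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats

def read_memory_stat(path):
    """Read and parse a memory.stat file from the given path."""
    with open(path) as f:
        return parse_memory_stat(f.read())

# Made-up sample in the cgroup v1 format; real files have many more fields.
sample = "cache 1048576\nrss 4194304\nmapped_file 0\n"
stats = parse_memory_stat(sample)
print("cache MiB:", stats["cache"] // (1024 * 1024))
```

On a hypervisor you would call `read_memory_stat("/sys/fs/cgroup/memory/machine.slice/memory.stat")` and compare `cache` against `rss` to see how much of the cgroup is page cache.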
Now, in the case of VMs there is no good reason to have any clean cache in that cgroup: VMs should be configured not to use the buffer cache (for example, in libvirt all disks should have cache=none). Having such a cache prevents you from doing live migrations (libvirt complains that this might lead to data loss). Besides, the VM should already have cached the blocks it needs, so you’re just using at least double the memory for this cache, which is sub-optimal, to say the least.
(This has been fixed in https://github.com/qemu/qemu/commit/dd577a26ff03b6829721b1ffbbf9e7c411b72378, but that hasn’t been merged in any released version as of the time of this post)
Most orchestration systems nowadays do use “cache=none” in their configurations (and the integrations of StorPool with most of the orchestrations set that, if there’s an interface for it), but in this case the orchestration system had some very old templates (some of them were even using IDE emulation instead of virtio), and the default is to use the buffer cache. The proper solution would be to just fix the templates, restart the VMs and be done with it, but people don’t seem to be happy to reboot their customers’ VMs, and I imagine the customers are not really keen on the idea either. Also, changing the templates would have been a problem in itself with this orchestration system.
The issue with “too much clean cache in memory that we don’t really need” has been looked into before. Here are the solutions that came to mind and the one we finally implemented:
The first idea that came to mind was to just flush the whole buffer cache periodically, by doing “echo 3 > /proc/sys/vm/drop_caches”. That would work, but it is a somewhat blunt axe, as it also drops cache which the system needs (so it will be rereading libc and everything else that needs to be cached, and will slow down a lot of processes).
For a second idea, we remembered that there’s something wonderful called LD_PRELOAD, which can be used to override functions and inject code. With it, we could intercept the open() call, detect if we’re opening a block device, and set the O_DIRECT flag, which basically means “try not to use the buffer cache”. The problem is that O_DIRECT has some annoying limitations (http://man7.org/linux/man-pages/man2/open.2.html#NOTES): the position in the file/device and the memory being read into or written from have to be aligned in some way. According to the man page that is 512 bytes if you’re not using a file system (with a file system it might have to be aligned to the system page size or more). Because we can’t be sure what the VM on top would do, we’d also have to handle all read() and write() calls, and maybe allocate our own memory to redo the failed transfers, which would be too much work and too prone to errors.
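To make the alignment problem concrete, here is a small sketch of the check an interposer would have to run on every single transfer. The function name is ours, the 512-byte default is the figure from the paragraph above, and the buffer addresses are made-up values. The man page also lists the transfer length among the things that may need aligning, so the sketch checks it too.

```python
def direct_io_ok(offset, length, buffer_address, alignment=512):
    """True if offset, length and buffer address are all multiples of alignment."""
    return all(v % alignment == 0 for v in (offset, length, buffer_address))

# Hypothetical page-aligned buffer address, as an allocator might return it.
aligned_buf = 0x7F0000001000

print(direct_io_ok(4096, 8192, aligned_buf))      # everything aligned
print(direct_io_ok(100, 8192, aligned_buf))       # odd file offset
print(direct_io_ok(4096, 8192, aligned_buf + 8))  # buffer offset by 8 bytes
```

Any request that fails the check would have to be redone through a bounce buffer allocated by the interposer, which is exactly the extra work we wanted to avoid.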
Then, there is an interface in the kernel called “posix_fadvise”, which you can use to mark some data in the cache as “not needed any more” (which the kernel then drops). It could be used with LD_PRELOAD for read() calls, to just mark any data that was read as POSIX_FADV_DONTNEED. This idea was already implemented in some way in https://code.google.com/archive/p/pagecache-mangagement/ by Andrew Morton and some other nice people, and I started rewriting it a bit to be simpler (i.e. to just do posix_fadvise() directly after the read, instead of later and based on some high/low watermarks).
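The "advise right after the read" idea boils down to the call sequence below. This is a Python illustration of the sequence only, not the LD_PRELOAD shim itself (which would be written in C); `read_and_drop` is a name made up for the example, and the demo uses a throwaway temporary file instead of a block device. It assumes Linux, where `os.posix_fadvise` is available.

```python
import os
import tempfile

def read_and_drop(fd, length, offset):
    """Read from fd at offset, then tell the kernel the range is no longer needed."""
    data = os.pread(fd, length, offset)
    if data:
        # Advise only the range that was actually read.
        os.posix_fadvise(fd, offset, len(data), os.POSIX_FADV_DONTNEED)
    return data

# Demo on a temporary file rather than a real block device.
with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(b"x" * 8192)
    tmp.flush()
    fd = os.open(tmp.name, os.O_RDONLY)
    try:
        chunk = read_and_drop(fd, 4096, 0)
    finally:
        os.close(fd)

print(len(chunk), "bytes read and marked as not needed")
```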
At this point our CTO asked “do you actually need to call posix_fadvise() from the process, or can it be done from anywhere”? It turns out that in the same repo there is a small tool called “fadv” that just ejects a file (or block device) from the cache (which I found out after I wrote basically the same 5 lines).
The end result was a small script that runs “fadv” for all StorPool block devices and keeps them out of the cache, which was deemed an acceptable workaround. It was quick, too: it took around a minute on its first run to free ~100GB of memory, and less than half a second afterward.
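For the curious, the same workaround can be sketched in a few lines of Python. This is not the script we deployed (that one just calls “fadv”); the function names are made up, and the device glob in the usage note is an assumption you should adjust to wherever your block devices live. The demo runs against temporary files so it is safe to try anywhere on Linux.

```python
import glob
import os
import tempfile

def drop_from_cache(path):
    """Ask the kernel to drop the cached pages of path (length 0 means to the end)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

def drop_all(pattern):
    """Drop the cache for every path matching pattern; return the paths processed."""
    done = []
    for path in sorted(glob.glob(pattern)):
        try:
            drop_from_cache(path)
            done.append(path)
        except OSError:
            # Device vanished or is not readable; skip it in this sketch.
            continue
    return done

# Demo on two temporary files instead of real block devices.
with tempfile.TemporaryDirectory() as d:
    for name in ("a", "b"):
        with open(os.path.join(d, name), "wb") as f:
            f.write(b"data")
    processed = drop_all(os.path.join(d, "*"))

print(len(processed), "files ejected from the cache")
```

On a hypervisor this would be run periodically (from cron, for example) with something like `drop_all("/dev/storpool/*")`, where the pattern is an assumed location for the devices.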