While debugging an issue in a new StorPool deployment we re-discovered a kernel bug that exists in some of the latest Linux kernels and causes silent data corruption. The setup is a virtualized environment where the kernel containing the bug was running on the host. The guests were KVM virtual machines with io=threads (the default in standard libvirt/qemu packages) and virtual disk devices with cache=none. The bug has been fixed upstream in Linux kernel 4.18 in July 2018, but the fix has not yet propagated to commonly used affected kernels.
We have opened a bug with Ubuntu at https://bugs.launchpad.net/bugs/1796542 to track the issue.
The bug was introduced in Linux kernel 4.10 and fixed in Linux kernel 4.18.
Known affected distributions:
– Ubuntu 16.04 with “HWE” kernels 4.13 or 4.15
– Ubuntu 18.04
Known not affected distributions:
– Debian 9 (Stretch)
– CentOS 6
– CentOS 7
As of right now there is no readily available fix for the affected distributions. The possible workarounds are:
– Use io=native instead of io=threads in libvirt for your virtual machines. io=native is recommended for all StorPool deployments and is the default in Red Hat Enterprise Virtualization and in CentOS’s qemu-kvm-ev package;
– If running an affected kernel (e.g. Ubuntu HWE kernel 4.13 or 4.15) and you have the option to downgrade (e.g. with Ubuntu 16.04), downgrade to a kernel version which is not affected;
– Switch to a Linux distribution which isn’t affected;
– Rebuild the kernel with the relevant fixes described below.
The first workaround (using io=native instead of io=threads) is currently the easiest to deploy. The remaining workarounds carry their own risks and limitations, so we cannot recommend them.
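For the io=native workaround, the change is a one-attribute edit in each virtual machine’s libvirt domain XML. A minimal sketch of the relevant element follows; the disk path and target name are placeholders, not taken from any particular deployment:

```xml
<!-- In the <devices> section of the domain XML:
     change io= from 'threads' to 'native' on each disk's <driver> element.
     Source path and target dev below are illustrative placeholders. -->
<disk type='file' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native'/>
  <source file='/var/lib/libvirt/images/example-vm.img'/>
  <target dev='vda' bus='virtio'/>
</disk>
```

The edit can be made with virsh edit on the domain; the virtual machine must be powered off and started again (not just rebooted from inside the guest) for the new I/O mode to take effect.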
The issue was demonstrated to StorPool by a customer with the following test setup:
– Hypervisor host running the affected kernel;
– KVM virtual machine with libvirt/qemu configured as follows:
– cache=none for the disk
– Guest running Linux with XFS root filesystem
– Running the following command in the virtual machine:
fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k --direct=0 --size=512M --numjobs=8 --runtime=240 --group_reporting
– Rebooting after the command is finished
– On boot, the virtual machine would show a message similar to this:
XFS (vda1): Corruption warning: Metadata has LSN (35:14053) ahead of current LSN (35:11489). Please unmount and run xfs_repair (>= v4.3) to resolve.
A simpler test program to reproduce the issue was devised by Jan Kara from SUSE. It is linked from this post on the linux-block mailing list: https://www.spinics.net/lists/linux-block/msg28507.html . This test program triggers the bug when run directly on the host OS, so it does not require the more complex virtualization setup.
The issue was introduced in 4.10-rc1 with commit 72ecad22d9f198aafee64218512e02ffa7818671 (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v4.19-rc6&id=72ecad22d9f198aafee64218512e02ffa7818671). As of this writing, it is included in the kernel/distribution combinations described above.
It affects writes to block devices performed with O_DIRECT (qemu/libvirt cache=none). The silent data corruption can happen only if the writes are not page-aligned (while otherwise being correctly 512-byte aligned, as O_DIRECT requires). This is why the issue is not reproducible with ext4 in the VM: all of its writes are page-aligned.
The issue was first reported to SUSE, whose developers created and upstreamed the fixes in the patches below. We ran tests, first with logging added at the appropriate places to confirm the affected code path, and then with the patches applied to verify that the problem was no longer present.
commit 0aa69fd32a5f766e997ca8ab4723c5a1146efa8b – block: add a lower-level bio_add_page interface
commit b403ea2404889e1227812fa9657667a1deb9c694 – block: bio_iov_iter_get_pages: fix size of last iovec
commit 9362dd1109f87a9d0a798fbc890cb339c171ed35 – blkdev: __blkdev_direct_IO_simple: fix leak in error case
commit 17d51b10d7773e4618bcac64648f30f12d4078fb – block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs
These commits first appeared in 4.18-rc7.
We recommend that everyone affected take immediate steps to mitigate the issue.