Message-ID: <1349962403.4696.51.camel@storm>
From: Tiziano Müller
Reply-To: tiziano.mueller@stepping-stone.ch
Date: Thu, 11 Oct 2012 15:33:23 +0200
Subject: [Qemu-devel] Silent filesystem/qcow2 corruptions with qemu-kvm-1.0 and 1.1.1
To: qemu-devel

Hi everyone

We have a couple of VMs, each with qcow2 files as disk images, and we are seeing random filesystem/qcow2 corruptions of the VM filesystems. We see two types of corruption:

* If the guest OS uses XFS, we get corruption errors inside the VM, originating most of the time in xfs_da_do_buf.
* If the guest OS uses ext4 (on rare occasions also XFS), there are most of the time no noticeable faults in the guest until qemu shuts down with the following messages (maybe that's a different case, unrelated to the first one):

handle_dev_stop: stop
handle_dev_stop: stop
handle_dev_stop: stop
qcow2_free_clusters failed: Invalid argument
[...]
qcow2_free_clusters failed: Invalid argument
handle_dev_stop: stop
handle_dev_stop: stop
2012-10-09 00:52:49.294+0000: shutting down

(One time there was also a message before the end about a failed assert in a qcow-related function.)

Checking the image using `qemu-img check` then gives something like this:

ERROR OFLAG_COPIED: offset=3bc30000 refcount=1
ERROR offset=c7e331: Cluster is not properly aligned; L2 entry corrupted.
ERROR: invalid cluster offset=0x3aa4810400fa0000
ERROR offset=fa000000001000: Cluster is not properly aligned; L2 entry corrupted.
ERROR: invalid cluster offset=0x80966181ff0000
ERROR: invalid cluster offset=0x80966182000000
ERROR offset=80966181ffffff: Cluster is not properly aligned; L2 entry corrupted.
ERROR: cluster 280376143330560: copied flag must never be set for compressed clusters
Warning: cluster offset=0x286d3d000000 is after the end of the image file, can't properly check refcounts.
Warning: cluster offset=0x286d3d010000 is after the end of the image file, can't properly check refcounts.

We can observe such corruptions on two servers:

* server 1: qemu-kvm-1.1 with kernel 3.5.2 (for the kernel configuration see [1])
* server 2: qemu-kvm-1.0 with kernel 3.2.6 (for the kernel configuration see [2])

The corruptions happen a lot faster and more often on server 1 with qemu-kvm-1.1, and they do not depend on whether write operations happen or not. One test case was:

* xfs_repair /dev/vda2 -> corruptions found and fixed
* xfs_repair /dev/vda2 -> no corruptions found
* mount /dev/vda2 /mnt/something
* find /mnt/something > /dev/null
* umount /mnt/something
* xfs_repair /dev/vda2 -> corruptions found again

The interesting thing here is that xfs_repair reports bad magic numbers of 0x0 most of the time, which would indicate that blocks get zeroed out rather than randomly overwritten.

The guest kernels range from 3.2 to 3.5; the distros used are Gentoo, Fedora and Ubuntu, all x86_64.
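For illustration, a minimal sketch of how such `qemu-img check` output could be scanned automatically (the sample lines are taken from the check output above; in practice you would pipe in `qemu-img check <image> 2>&1` on an image that is not in use by a running guest):

```shell
#!/bin/sh
# Sketch: count ERROR lines in `qemu-img check` output so a periodic job
# can raise an alert before the guest runs into the corruption.
count_qcow2_errors() {
    # Reads check output on stdin and counts lines starting with "ERROR".
    grep -c '^ERROR' -
}

# Fed with a sample of the output quoted above instead of a live check:
errors=$(count_qcow2_errors <<'EOF'
ERROR OFLAG_COPIED: offset=3bc30000 refcount=1
ERROR offset=c7e331: Cluster is not properly aligned; L2 entry corrupted.
Warning: cluster offset=0x286d3d000000 is after the end of the image file, can't properly check refcounts.
EOF
)
echo "qcow2 errors: $errors"   # prints "qcow2 errors: 2"
```

This is only a convenience wrapper around the message format shown above, not something from our production setup.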
The host filesystem is XFS as well, on an LVM volume on a RAID1 (MegaRAID SAS). Management is done with libvirt and we are using virtio-blk with cache=writeback (an example configuration of a VM can be found in [3]).

We were unable to trigger this deliberately: some of the VMs ran without problems for more than a month (server 2), some of them show problems after a couple of hours (server 1).

We checked:

* in the host: memory and LVM using memtester, verify-data and badblocks. We also tried with a brand new XFS volume and brand new qcow2 files.
* in the VMs: memory using memtest86+ and the virtual disk using verify-data (verify-data -e /dev/vda3 1M 10000).

Both tests are currently running without problems so far. The only suspicious thing is that verify-data sometimes hangs for up to a minute (as do all block operations in that VM), but `dmesg` does not show anything.

Does anyone have any ideas on how to narrow this problem down further or how to debug it? We are currently setting up another server with qemu-1.1.2 or qemu-1.2.0 and would like to be able to reproduce this problem beforehand, to make sure we don't face the same problems there again.

We can provide logs of almost everything if required. We could also provide SSH and/or Spice access to one or more VMs running on the server if that helps and someone would be willing to help track down this problem.
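For context, the virtio-blk/cache=writeback combination mentioned above corresponds to a libvirt disk definition of roughly the following shape (an illustrative fragment only; the file path and target device are placeholders, our real configuration is in [3]):

```xml
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' cache='writeback'/>
  <source file='/var/lib/libvirt/images/guest.qcow2'/>
  <target dev='vda' bus='virtio'/>
</disk>
```

Switching the cache attribute here (e.g. to 'none' or 'writethrough') would be one of the knobs to vary when trying to isolate the problem.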
Similar problems were already mentioned here, but without a solution:

* https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/1040033
* http://lists.gnu.org/archive/html/qemu-devel/2012-04/msg01756.html

Thanks in advance,
Tiziano

[1] server1 kernel config: http://bpaste.net/show/rajkzUNGLdmo95sl35YE/
[2] server2 kernel config: http://bpaste.net/show/cAFQ7FhJzfjAorIirUj6/
[3] libvirt vm xml: http://bpaste.net/show/D7nRCbr26f9nmapq5gTo/
[4] qemu-kvm invocation: http://bpaste.net/show/Mgrq9AeD2ehBfI43SH5u/

-- 
stepping stone GmbH
Neufeldstrasse 9
CH-3012 Bern

Telefon: +41 31 332 53 63
www.stepping-stone.ch
tiziano.mueller@stepping-stone.ch