From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:56496)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <gceq-qemu-devel@m.gmane.org>) id 1asZkp-0007zr-9e
	for qemu-devel@nongnu.org; Tue, 19 Apr 2016 13:48:04 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <gceq-qemu-devel@m.gmane.org>) id 1asZkl-0002A4-AO
	for qemu-devel@nongnu.org; Tue, 19 Apr 2016 13:48:03 -0400
Received: from plane.gmane.org ([80.91.229.3]:51573)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <gceq-qemu-devel@m.gmane.org>) id 1asZkl-00029o-3J
	for qemu-devel@nongnu.org; Tue, 19 Apr 2016 13:47:59 -0400
Received: from list by plane.gmane.org with local (Exim 4.69)
	(envelope-from <gceq-qemu-devel@m.gmane.org>) id 1asZkg-0002SX-U0
	for qemu-devel@nongnu.org; Tue, 19 Apr 2016 19:47:55 +0200
Received: from barriere.frankfurter-softwarefabrik.de ([217.11.197.1])
	by main.gmane.org with esmtp (Gmexim 0.1 (Debian))
	id 1AlnuQ-0007hv-00
	for <qemu-devel@nongnu.org>; Tue, 19 Apr 2016 19:47:54 +0200
Received: from lvml by barriere.frankfurter-softwarefabrik.de with local
	(Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00
	for <qemu-devel@nongnu.org>; Tue, 19 Apr 2016 19:47:54 +0200
From: Lutz Vieweg <lvml@5t9.de>
Date: Tue, 19 Apr 2016 19:47:44 +0200
Message-ID: <nf5r01$klb$1@ger.gmane.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Subject: [Qemu-devel] I/O errors reported to guest for raw-image-file backed
 /dev/vda - but host sees no I/O errors
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel/>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: qemu-devel@nongnu.org

Hi,

I have been investigating strange stalls of virtual machines,
and realized that the VMs were (silently) paused because qemu
thinks there were I/O errors when writing to the host.

After using "werror=report,rerror=report" with "-drive" we now see
actual reporting of I/O errors to the guest, where they look like this:

> end_request: I/O error, dev vda, sector 7243680
> EXT4-fs warning (device vda1): ext4_end_bio:259: I/O error writing to inode 951097 (offset 3096576 size 4096 starting block 905461)
> end_request: I/O error, dev vda, sector 22018120
> Buffer I/O error on device vda1, logical block 2752009
> lost page write due to I/O error on vda1
> end_request: I/O error, dev vda, sector 12857032
> JBD2: Detected IO errors while flushing file data on vda1-8
> Aborting journal on device vda1-8.
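
To relate these guest-side messages to offsets in the image file on the host, the sector numbers can be converted to byte offsets. The following is only a sketch under assumptions: the kernel's "sector" values are in 512-byte units, the image is raw so a guest disk offset equals the file offset in image.raw, the ext4 block size is 4096, and vda1 starts at sector 2048 (a guess, though it agrees with the "sector 22018120" / "logical block 2752009" pair above). The function names are made up for illustration.

```python
SECTOR_SIZE = 512  # assumption: block-layer sector numbers are in 512-byte units


def sector_to_byte_offset(sector, sector_size=SECTOR_SIZE):
    """Byte offset on the guest disk (and, for a raw image, in the image file)."""
    return sector * sector_size


def partition_block_to_disk_sector(block, block_size, partition_start_sector,
                                   sector_size=SECTOR_SIZE):
    """Translate a filesystem block inside a partition to a whole-disk sector."""
    return partition_start_sector + block * block_size // sector_size


if __name__ == "__main__":
    # Sector numbers taken from the guest messages quoted above.
    for sector in (7243680, 22018120, 12857032):
        print(sector, sector_to_byte_offset(sector))

    # Assumed partition start (2048) and block size (4096), see the note above.
    print(partition_block_to_disk_sector(2752009, 4096, 2048))
```

The resulting byte offsets are what one would expect to see as the offset argument of a write to the image file on the host, provided the assumptions above hold.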

The qemu instance in question is using an executable compiled
from current sources, running on vanilla linux-4.4.2 - and qemu
is started directly, not via any library or VM management framework.

The guest drive parameters are:
>  -drive "file=image.raw,if=virtio,format=raw,media=disk,cache=unsafe,werror=report,rerror=report"


I've searched the Web and found some people reporting similar symptoms,
but those reports involved either time-outs with NFS or direct use of
LVM / DRBD partitions on the host - these circumstances do not apply here.
(We do use DRBD and LVM, but qemu is not accessing raw partitions,
just an ordinary file on an XFS filesystem, and the host does not report
any I/O errors on the device or filesystem layers.)


There seems to be a relationship between the occurrence of the
I/O errors reported to the guest and the load on the I/O system of
the host - the errors become more frequent (like "once per day")
when there is high load.


Is there any kind of timeout or similar mechanism that might make qemu
assume a write operation on the host has failed?

Can you provide any hint on how to pursue the cause of these errors?
(I thought about using "strace -f -p ..." on qemu, but I don't
know what exactly to look for in the output - some failed "pwrite()"
to the image file?)
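
For the strace idea, a minimal sketch of what I would filter for: lines where a read, write or flush syscall returned -1, together with the errno name. This assumes the usual strace line format ("name(args) = -1 ERRNO (text)", plus "<... name resumed>" lines when -f interleaves threads). The list of syscalls is my guess at what qemu might use for a raw file; it is not taken from the qemu sources, and with a different AIO backend the failures could show up elsewhere. The names WATCHED and find_failed_calls are made up for illustration.

```python
import re

# Assumption: syscalls through which I/O to a raw image file is likely to fail.
WATCHED = {"pread64", "pwrite64", "preadv", "pwritev",
           "fsync", "fdatasync", "fallocate"}

# Matches "name(... = -1 ERRNO (text)" and "<... name resumed> ... = -1 ERRNO (text)".
FAILED_CALL = re.compile(
    r"(?:<\.\.\. )?(?P<name>\w+)(?: resumed>|\().*"
    r"= -1 (?P<errno>E[A-Z0-9]+) \((?P<text>[^)]*)\)"
)


def find_failed_calls(lines, watched=WATCHED):
    """Return (syscall, errno_name, raw_line) for every failed watched syscall."""
    hits = []
    for line in lines:
        match = FAILED_CALL.search(line)
        if match and match.group("name") in watched:
            hits.append((match.group("name"), match.group("errno"),
                         line.rstrip("\n")))
    return hits


def scan_trace_file(path):
    """Scan a saved strace log, e.g. one written with 'strace -f -o FILE -p PID'."""
    with open(path, errors="replace") as trace:
        return find_failed_calls(trace)
```

Calling scan_trace_file() on a log captured while the guest reports an error should list any failing call with its errno (EIO, ENOSPC, ...), if the failure originates from one of these syscalls at all.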

Regards,

Lutz Vieweg