From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1KZ4YU-0008Dq-PT
	for qemu-devel@nongnu.org; Fri, 29 Aug 2008 10:02:26 -0400
Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1KZ4YS-0008DP-9l
	for qemu-devel@nongnu.org; Fri, 29 Aug 2008 10:02:26 -0400
Received: from [199.232.76.173] (port=46481 helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1KZ4YS-0008DM-6y
	for qemu-devel@nongnu.org; Fri, 29 Aug 2008 10:02:24 -0400
Received: from host36-195-149-62.serverdedicati.aruba.it
	([62.149.195.36]:50858 helo=mx.cpushare.com)
	by monty-python.gnu.org with esmtps
	(TLS-1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.60)
	(envelope-from <andrea@qumranet.com>) id 1KZ4YR-0006ua-UR
	for qemu-devel@nongnu.org; Fri, 29 Aug 2008 10:02:24 -0400
Date: Fri, 29 Aug 2008 15:52:49 +0200
From: Andrea Arcangeli <andrea@qumranet.com>
Message-ID: <20080829135249.GI24884@duo.random>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Subject: [Qemu-devel] [PATCH] ide_dma_cancel will result in partial DMA
	transfer
Reply-To: qemu-devel@nongnu.org
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: qemu-devel@nongnu.org

Hello,

while trying to track down weird fs corruption, I noticed that the way
the bmdma_cmd_writeb ioport write cancels the I/O is not atomic from a
DMA point of view if an SG table with more than one entry is used: with
qemu/kvm the DMA command is aborted partially. Unfortunately I don't
know the IDE specs well enough to know what happens in the hardware,
but I doubt hardware will abort the DMA command in the middle. I wonder
whether this could explain the fs corruption. If it does, this patch
should possibly fix it.
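
To make the partial-abort concern concrete, here is a minimal toy model
in C. It is not qemu code: sg_entry, sg_progress, sg_bytes_on_disk and
sg_flush are hypothetical names made up for this sketch, and it assumes
each SG entry is submitted as its own independent request.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: one scatter-gather entry = one independent request. */
struct sg_entry {
    size_t len;   /* bytes described by this entry */
    int    done;  /* 1 once the request has reached the "disk" */
};

/* Mark the first n_done requests as completed; the rest stay in flight. */
static void sg_progress(struct sg_entry *sg, size_t n, size_t n_done)
{
    assert(n_done <= n);
    for (size_t i = 0; i < n_done; i++)
        sg[i].done = 1;
}

/* Bytes that reached the disk if the in-flight requests were dropped now. */
static size_t sg_bytes_on_disk(const struct sg_entry *sg, size_t n)
{
    size_t bytes = 0;
    for (size_t i = 0; i < n; i++)
        if (sg[i].done)
            bytes += sg[i].len;
    return bytes;
}

/* Wait-for-completion model: finish every request, then count the bytes. */
static size_t sg_flush(struct sg_entry *sg, size_t n)
{
    sg_progress(sg, n, n);
    return sg_bytes_on_disk(sg, n);
}
```

With a three-entry table where only the first request has completed,
dropping the rest leaves just that entry's bytes on disk, while the
flush variant always ends with the full transfer length.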

There's also the possibility that the reboot handler is canceling a
DMA in the middle of SG table processing, but if the guest still has
DMA writes in flight to disk when it issues a reboot, that sounds like
a guest bug. Furthermore, during poweroff (a kind of reboot) a large
DMA may not reach the disk. So I'm more worried about the "software"
behavior of bmdma_cmd_writeb than the reboot handler. I wonder if
perhaps killing a task in a proprietary guest OS (when the guest OS
tries to reboot) could lead the task to call aio_cancel during cleanup,
which would ask the guest kernel IDE driver to cancel the I/O as a
whole (not partially).

In general I think there's a small chance that this really is the
source of the fs corruption, and if it is then the corruption would
also happen after killing kvm/qemu with SIGKILL (but we don't kill kvm
as often as we reboot the guest in it). So I'm not really sure if this
is needed or not, but it certainly rings a bell that ide_dma_cancel is
definitely issued with aio commands in flight in the reboot loop that
reproduces the corruption (it always triggers a few seconds before
reboot).

The fix is mostly untested as hitting this path takes time, so it's
more of an RFC so far.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

Index: hw/ide.c
===================================================================
--- hw/ide.c	(revision 5089)
+++ hw/ide.c	(working copy)
@@ -2894,8 +2904,21 @@
     printf("%s: 0x%08x\n", __func__, val);
 #endif
     if (!(val & BM_CMD_START)) {
-        /* XXX: do it better */
-        ide_dma_cancel(bm);
+        /*
+	 * We can't cancel Scatter Gather DMA in the middle of the
+	 * operation or a partial (not full) DMA transfer would reach
+	 * the storage so we wait for completion instead (we behave
+	 * as if the DMA was completed by the time the guest tried
+	 * to cancel dma with bmdma_cmd_writeb with BM_CMD_START not
+	 * set).
+	 *
+	 * In the future we'll be able to safely cancel the I/O if the
+	 * whole DMA operation will be submitted to disk with a single
+	 * aio operation in the form of aio_readv/aio_writev
+	 * (supported by linux kernel AIO but not by glibc pthread aio
+	 * lib).
+	 */
+	qemu_aio_flush();
         bm->cmd = val & 0x09;
     } else {
         if (!(bm->status & BM_STATUS_DMAING)) {