* [Qemu-devel] [PATCH] ide_dma_cancel will result in partial DMA transfer
From: Andrea Arcangeli @ 2008-08-29 13:52 UTC
To: qemu-devel

Hello,

while trying to track down weird fs corruption, I noticed that the way the
cmd_writeb ioport write cancels the I/O is not atomic from a DMA point of
view if an SG table with more than one entry is used. The DMA command with
qemu/kvm is aborted partially. Unfortunately I don't know the IDE specs
well enough to know what happens in the hardware, but I doubt hardware
will abort the DMA command in the middle. I wonder whether this could
explain fs corruption or not. If yes, this should possibly fix it.

There's also the possibility that the reboot handler is canceling a DMA in
the middle of SG table processing, but if the guest still has DMA writes
in flight to disk when it issues a reboot, that sounds like a guest bug.
Furthermore, during poweroff (a kind of reboot) a large DMA may not reach
the disk. So I'm more worried about the "software" behavior of
bmdma_cmd_writeb than about the reboot handler. I wonder if perhaps
killing a task in a proprietary guest OS (when the guest OS tries to
reboot) could lead the task to call aio_cancel during cleanup, which would
ask the IDE guest kernel driver to cancel the I/O as a whole (not
partially).

In general I think there's a small chance that this really is the source
of the fs corruption, and if it is then the corruption would also happen
after killing kvm/qemu with SIGKILL (but we don't kill kvm as often as we
reboot the guest in it). So I'm not really sure if this is needed or not,
but it certainly rings a bell that this ide_dma_cancel is definitely
issued with aio commands in flight in the reboot loop that reproduces the
corruption (it always triggers a few seconds before reboot).

The fix is mostly untested as hitting this path takes time, so it's more
of an RFC so far.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

Index: hw/ide.c
===================================================================
--- hw/ide.c	(revision 5089)
+++ hw/ide.c	(working copy)
@@ -2894,8 +2904,21 @@
     printf("%s: 0x%08x\n", __func__, val);
 #endif
     if (!(val & BM_CMD_START)) {
-        /* XXX: do it better */
-        ide_dma_cancel(bm);
+        /*
+         * We can't cancel Scatter Gather DMA in the middle of the
+         * operation or a partial (not full) DMA transfer would reach
+         * the storage, so we wait for completion instead (we behave
+         * as if the DMA had completed by the time the guest tried
+         * to cancel the DMA with bmdma_cmd_writeb with BM_CMD_START
+         * not set).
+         *
+         * In the future we'll be able to safely cancel the I/O if the
+         * whole DMA operation is submitted to disk with a single
+         * aio operation in the form of aio_readv/aio_writev
+         * (supported by the linux kernel AIO but not by the glibc
+         * pthread aio lib).
+         */
+        qemu_aio_flush();
         bm->cmd = val & 0x09;
     } else {
         if (!(bm->status & BM_STATUS_DMAING)) {
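To make the hazard described above concrete, here is a minimal stand-alone
sketch, plain C and not qemu code, of a DMA emulated as one synchronous
write per SG entry; every name and size below is made up for illustration.
Cancelling between entries leaves a torn transfer on the "disk", which is
exactly the window the patch above closes by draining in-flight aio with
qemu_aio_flush() instead of aborting mid-table:

/* Stand-alone illustration only -- not qemu code; names are made up. */
#include <stdio.h>
#include <string.h>

#define SECTOR 512
#define NSEG   4                  /* entries in a hypothetical SG/PRD table */

struct sg_entry { const char *buf; size_t len; };

static char disk[NSEG * SECTOR];  /* stands in for the image file */

/* Emulate the DMA as one copy per SG entry, the way the pre-patch code
 * issued one aio op per entry; "cancel_after" models the guest clearing
 * BM_CMD_START part way through the table. */
static int dma_write(const struct sg_entry *sg, int nseg, int cancel_after)
{
    for (int i = 0; i < nseg; i++) {
        if (i == cancel_after)
            return i;             /* aborted: entries i..nseg-1 never land */
        memcpy(disk + i * SECTOR, sg[i].buf, sg[i].len);
    }
    return nseg;
}

int main(void)
{
    char seg[NSEG][SECTOR];
    struct sg_entry sg[NSEG];

    for (int i = 0; i < NSEG; i++) {
        memset(seg[i], 'A' + i, SECTOR);
        sg[i].buf = seg[i];
        sg[i].len = SECTOR;
    }

    int done = dma_write(sg, NSEG, 2);
    printf("guest asked for %d sectors, only %d reached the disk\n",
           NSEG, done);
    return 0;
}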
* [Qemu-devel] [PATCH 1/2] ide_dma_cancel will result in partial DMA transfer
From: Andrea Arcangeli @ 2008-09-01 10:43 UTC
To: qemu-devel

The reason for not actually canceling the I/O is that with virtualization
and lots of VMs running, a guest fs may mistake an overload of the host
for an IDE timeout. So rather than canceling the I/O, it's safer to wait
for I/O completion and simulate that the I/O completed just before the
cancellation was requested by the guest. This way if ntfs or an app writes
data without checking for an -EIO retval, and it thinks the write has
succeeded, it's less likely to run into trouble. Similar issues apply to
reads.

Furthermore, because the DMA operation is split into many synchronous
aio_read/write calls if there's more than one entry in the SG table,
without this patch the DMA would be cancelled in the middle, and we have
no idea whether that happens on real hardware too or not. Overall this
seems a great deal of risk of generating data corruption, for absolutely
_zero_ gain at runtime. So regardless of whether this is found to fix the
fs corruption (so far so good, no corruption reproduced, but it takes
several days of heavy workload to reproduce so nothing is sure yet) I
think it's a good idea to apply.

This approach is surely safer than the previous code given we can't expect
all guest fs code out there to check for errors and retry the DMA if it
completed only partially, given a timeout would never materialize on a
real harddisk unless there are defective blocks (and defective blocks are
practically only an issue for reads, never for writes, in any recent
hardware, as writing to blocks is the way to fix them) or the harddisk
breaks as a whole.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
---

This is an update for the dma cancellation fix.

Index: hw/ide.c
===================================================================
--- hw/ide.c	(revision 5119)
+++ hw/ide.c	(working copy)
@@ -2894,8 +2894,24 @@
     printf("%s: 0x%08x\n", __func__, val);
 #endif
     if (!(val & BM_CMD_START)) {
-        /* XXX: do it better */
-        ide_dma_cancel(bm);
+        /*
+         * If the guest tries to cancel the DMA we behave as if the DMA
+         * had completed by the time the guest tried to cancel it
+         * with bmdma_cmd_writeb with BM_CMD_START not set. This is
+         * safer than cancelling whatever partial DMA is in flight
+         * because it has a chance to be bug-compatible if a guest fs
+         * isn't checking for I/O errors triggered by guest I/O
+         * timeouts when the host is overloaded.
+         */
+        if (bm->aiocb) {
+            qemu_aio_flush();
+#ifdef DEBUG_IDE
+            if (bm->aiocb)
+                printf("aiocb still pending");
+            if (bm->status & BM_STATUS_DMAING)
+                printf("BM_STATUS_DMAING still pending");
+#endif
+        }
         bm->cmd = val & 0x09;
     } else {
         if (!(bm->status & BM_STATUS_DMAING)) {
* [Qemu-devel] [PATCH 2/2] fix bdrv_aio_read API breakage in qcow2
From: Andrea Arcangeli @ 2008-09-01 10:53 UTC
To: qemu-devel

While testing the dma cancel patch (1/2) I noticed that qemu_aio_flush was
doing nothing at all, while a flood of cmd_writeb commands led to no-op
invocations of qemu_aio_flush. I tracked it down, and the major bug in
this area (not sure if it could be the one responsible for the fs
corruption) is that if the aio callback runs before bdrv_aio_read returns,
IDE's bm->aiocb will be non-null and point at an already completed aiocb,
so a later cmd_writeb will be mistaken for a dma cancellation.

In short, all 'memset;goto redo' places must be fixed to use a bh and not
to call the callback in the context of bdrv_aio_read, or the bdrv_aio_read
model falls apart. Reading from qcow2 holes is possible with physical
readahead (a kind of breada in the linux buffer cache).

All the implications of this bug aren't clear due to the amount of code
affected (qcow2 itself with hd_aiocb in qcow_aio_cancel, scsi etc..). IDE
might have been safe by pure luck because of a DMAING bitflag check before
canceling the I/O, otherwise a double free would happen there too.

This makes 1/2 behave perfectly well (aiocb is always null after
qemu_aio_flush returns).

The same bug exists in qcow of course; it can be fixed later as it's less
urgent.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

Index: Makefile.target
===================================================================
--- Makefile.target	(revision 5119)
+++ Makefile.target	(working copy)
@@ -474,9 +474,9 @@
 OBJS=vl.o osdep.o monitor.o pci.o loader.o isa_mmio.o machine.o net-checksum.o
 
 ifdef CONFIG_WIN32
-OBJS+=block-raw-win32.o
+OBJS+=block-raw-win32.o block-qcow2.o
 else
-OBJS+=block-raw-posix.o
+OBJS+=block-raw-posix.o block-qcow2.o
 endif
 
 LIBS+=-lz
Index: Makefile
===================================================================
--- Makefile	(revision 5119)
+++ Makefile	(working copy)
@@ -46,7 +46,7 @@
 BLOCK_OBJS=cutils.o qemu-malloc.o
 BLOCK_OBJS+=block-cow.o block-qcow.o aes.o block-vmdk.o block-cloop.o
 BLOCK_OBJS+=block-dmg.o block-bochs.o block-vpc.o block-vvfat.o
-BLOCK_OBJS+=block-qcow2.o block-parallels.o
+BLOCK_OBJS+=block-parallels.o
 ifndef CONFIG_WIN32
 BLOCK_OBJS+=block-nbd.o
 endif
@@ -175,9 +175,9 @@
 QEMU_IMG_BLOCK_OBJS = $(BLOCK_OBJS)
 ifdef CONFIG_WIN32
-QEMU_IMG_BLOCK_OBJS += qemu-img-block-raw-win32.o
+QEMU_IMG_BLOCK_OBJS += qemu-img-block-raw-win32.o qemu-img-block-qcow2.o
 else
-QEMU_IMG_BLOCK_OBJS += nbd.o qemu-img-block-raw-posix.o
+QEMU_IMG_BLOCK_OBJS += nbd.o qemu-img-block-raw-posix.o qemu-img-block-qcow2.o
 endif
 
 ######################################################################
@@ -195,7 +195,8 @@
 	$(CC) $(CFLAGS) $(CPPFLAGS) -DQEMU_NBD -c -o $@ $<
 
 qemu-nbd$(EXESUF): qemu-nbd.o qemu-nbd-nbd.o qemu-img-block.o \
-	osdep.o qemu-nbd-block-raw-posix.o $(BLOCK_OBJS)
+	osdep.o qemu-nbd-block-raw-posix.o \
+	qemu-nbd-block-qcow2.o $(BLOCK_OBJS)
 	$(CC) $(LDFLAGS) -o $@ $^ -lz $(LIBS)
 
 # dyngen host tool
Index: block-qcow2.c
===================================================================
--- block-qcow2.c	(revision 5119)
+++ block-qcow2.c	(working copy)
@@ -1169,8 +1169,20 @@
     uint64_t cluster_offset;
     uint8_t *cluster_data;
     BlockDriverAIOCB *hd_aiocb;
+    QEMUBH *bh;
 } QCowAIOCB;
 
+#if !defined(QEMU_IMG) && !defined(QEMU_NBD)
+static void qcow_aio_read_cb(void *opaque, int ret);
+static void qcow_aio_read_bh(void *opaque)
+{
+    QCowAIOCB *acb = opaque;
+    qemu_bh_delete(acb->bh);
+    acb->bh = NULL;
+    qcow_aio_read_cb(opaque, 0);
+}
+#endif
+
 static void qcow_aio_read_cb(void *opaque, int ret)
 {
     QCowAIOCB *acb = opaque;
@@ -1186,7 +1198,9 @@
         return;
     }
 
+#if defined(QEMU_IMG) || defined(QEMU_NBD)
  redo:
+#endif
     /* post process the read buffer */
     if (!acb->cluster_offset) {
         /* nothing to do */
@@ -1227,12 +1241,38 @@
                 if (acb->hd_aiocb == NULL)
                     goto fail;
             } else {
+#if defined(QEMU_IMG) || defined(QEMU_NBD)
                 goto redo;
+#else
+                if (acb->bh) {
+                    ret = -EIO;
+                    goto fail;
+                }
+                acb->bh = qemu_bh_new(qcow_aio_read_bh, acb);
+                if (!acb->bh) {
+                    ret = -EIO;
+                    goto fail;
+                }
+                qemu_bh_schedule(acb->bh);
+#endif
             }
         } else {
            /* Note: in this case, no need to wait */
            memset(acb->buf, 0, 512 * acb->n);
+#if defined(QEMU_IMG) || defined(QEMU_NBD)
            goto redo;
+#else
+           if (acb->bh) {
+               ret = -EIO;
+               goto fail;
+           }
+           acb->bh = qemu_bh_new(qcow_aio_read_bh, acb);
+           if (!acb->bh) {
+               ret = -EIO;
+               goto fail;
+           }
+           qemu_bh_schedule(acb->bh);
+#endif
        }
    } else if (acb->cluster_offset & QCOW_OFLAG_COMPRESSED) {
        /* add AIO support for compressed blocks ? */
@@ -1240,7 +1280,20 @@
            goto fail;
        memcpy(acb->buf,
               s->cluster_cache + index_in_cluster * 512, 512 * acb->n);
+#if defined(QEMU_IMG) || defined(QEMU_NBD)
        goto redo;
+#else
+       if (acb->bh) {
+           ret = -EIO;
+           goto fail;
+       }
+       acb->bh = qemu_bh_new(qcow_aio_read_bh, acb);
+       if (!acb->bh) {
+           ret = -EIO;
+           goto fail;
+       }
+       qemu_bh_schedule(acb->bh);
+#endif
    } else {
        if ((acb->cluster_offset & 511) != 0) {
            ret = -EIO;
* [Qemu-devel] [PATCH] fix bdrv_aio_read API breakage in qcow2
From: Andrea Arcangeli @ 2008-10-22 14:14 UTC
To: qemu-devel

From: Andrea Arcangeli <aarcange@redhat.com>

I noticed that qemu_aio_flush was doing nothing at all, while a flood of
cmd_writeb commands led to no-op invocations of qemu_aio_flush.

In short, all 'memset;goto redo' places must be fixed to use a bh and not
to call the callback in the context of bdrv_aio_read, or the bdrv_aio_read
model falls apart. Reading from qcow2 holes is possible with physical
readahead (a kind of breada in the linux buffer cache).

This is needed at least for scsi; ide is lucky (or it has been band-aided
against this API breakage by fixing the symptom and not the real bug).

The same bug exists in qcow of course; it can be fixed later as it's less
urgent.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
--

Index: block-qcow2.c
===================================================================
--- block-qcow2.c	(revision 5506)
+++ block-qcow2.c	(working copy)
@@ -1165,8 +1165,18 @@
     uint64_t cluster_offset;
     uint8_t *cluster_data;
     BlockDriverAIOCB *hd_aiocb;
+    QEMUBH *bh;
 } QCowAIOCB;
 
+static void qcow_aio_read_cb(void *opaque, int ret);
+static void qcow_aio_read_bh(void *opaque)
+{
+    QCowAIOCB *acb = opaque;
+    qemu_bh_delete(acb->bh);
+    acb->bh = NULL;
+    qcow_aio_read_cb(opaque, 0);
+}
+
 static void qcow_aio_read_cb(void *opaque, int ret)
 {
     QCowAIOCB *acb = opaque;
@@ -1182,7 +1192,6 @@
         return;
     }
 
- redo:
     /* post process the read buffer */
     if (!acb->cluster_offset) {
         /* nothing to do */
@@ -1223,12 +1232,30 @@
                 if (acb->hd_aiocb == NULL)
                     goto fail;
             } else {
-                goto redo;
+                if (acb->bh) {
+                    ret = -EIO;
+                    goto fail;
+                }
+                acb->bh = qemu_bh_new(qcow_aio_read_bh, acb);
+                if (!acb->bh) {
+                    ret = -EIO;
+                    goto fail;
+                }
+                qemu_bh_schedule(acb->bh);
            }
        } else {
            /* Note: in this case, no need to wait */
            memset(acb->buf, 0, 512 * acb->n);
-           goto redo;
+           if (acb->bh) {
+               ret = -EIO;
+               goto fail;
+           }
+           acb->bh = qemu_bh_new(qcow_aio_read_bh, acb);
+           if (!acb->bh) {
+               ret = -EIO;
+               goto fail;
+           }
+           qemu_bh_schedule(acb->bh);
        }
    } else if (acb->cluster_offset & QCOW_OFLAG_COMPRESSED) {
        /* add AIO support for compressed blocks ? */
@@ -1236,7 +1263,16 @@
            goto fail;
        memcpy(acb->buf,
               s->cluster_cache + index_in_cluster * 512, 512 * acb->n);
-       goto redo;
+       if (acb->bh) {
+           ret = -EIO;
+           goto fail;
+       }
+       acb->bh = qemu_bh_new(qcow_aio_read_bh, acb);
+       if (!acb->bh) {
+           ret = -EIO;
+           goto fail;
+       }
+       qemu_bh_schedule(acb->bh);
    } else {
        if ((acb->cluster_offset & 511) != 0) {
            ret = -EIO;
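The API rule this patch enforces (never run the completion callback from
inside the aio submit path; defer it to a bottom half so the caller gets
to record the returned handle first) can be modeled with a small
stand-alone program. This is an illustrative sketch, not qemu code: the
one-slot "bottom half" queue and every name below are invented for the
example, standing in for QEMUBH/qemu_bh_new/qemu_bh_schedule.

/* Stand-alone model only -- not qemu code; all names are hypothetical. */
#include <stdio.h>
#include <stddef.h>

typedef void completion_cb(void *opaque, int ret);

struct aiocb { int dummy; };
static struct aiocb the_handle;

/* --- a one-slot "bottom half" queue, run from the main loop --- */
static completion_cb *pending_cb;
static void *pending_opaque;

static void bh_schedule(completion_cb *cb, void *opaque)
{
    pending_cb = cb;
    pending_opaque = opaque;
}

static void main_loop_run_bh(void)
{
    if (pending_cb) {
        completion_cb *cb = pending_cb;
        pending_cb = NULL;
        cb(pending_opaque, 0);
    }
}

/* --- device-model bookkeeping, mimicking ide.c's bm->aiocb --- */
static struct aiocb *inflight;

static void dma_complete(void *opaque, int ret)
{
    inflight = NULL;              /* nothing in flight any more */
    printf("completion ran, ret=%d\n", ret);
}

/* BROKEN: data is already available, so the callback runs before the
 * caller has stored the returned handle. */
static struct aiocb *aio_read_broken(completion_cb *cb, void *opaque)
{
    cb(opaque, 0);
    return &the_handle;
}

/* FIXED: always return first; the main loop delivers the completion. */
static struct aiocb *aio_read_fixed(completion_cb *cb, void *opaque)
{
    bh_schedule(cb, opaque);
    return &the_handle;
}

int main(void)
{
    inflight = aio_read_broken(dma_complete, NULL);
    printf("broken: inflight=%p (stale handle to a completed op)\n",
           (void *)inflight);

    inflight = aio_read_fixed(dma_complete, NULL);
    main_loop_run_bh();
    printf("fixed:  inflight=%p\n", (void *)inflight);
    return 0;
}

In the broken case the caller is left holding a non-null handle to an
already-completed operation, which is exactly how IDE's bm->aiocb ended up
stale and a later cmd_writeb got mistaken for a dma cancellation.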
* Re: [Qemu-devel] [PATCH] fix bdrv_aio_read API breakage in qcow2
From: Anthony Liguori @ 2008-10-27 13:49 UTC
To: qemu-devel

Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> I noticed that qemu_aio_flush was doing nothing at all, while a flood of
> cmd_writeb commands led to no-op invocations of qemu_aio_flush.
>
> In short, all 'memset;goto redo' places must be fixed to use a bh and not
> to call the callback in the context of bdrv_aio_read, or the
> bdrv_aio_read model falls apart. Reading from qcow2 holes is possible
> with physical readahead (a kind of breada in the linux buffer cache).
>
> This is needed at least for scsi; ide is lucky (or it has been band-aided
> against this API breakage by fixing the symptom and not the real bug).
>
> The same bug exists in qcow of course; it can be fixed later as it's less
> urgent.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

You've got whitespace damage (tabs instead of spaces).

Regards,

Anthony Liguori

> [...]
* Re: [Qemu-devel] [PATCH] fix bdrv_aio_read API breakage in qcow2
From: Anthony Liguori @ 2008-10-31 17:32 UTC
To: qemu-devel

Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> I noticed that qemu_aio_flush was doing nothing at all, while a flood of
> cmd_writeb commands led to no-op invocations of qemu_aio_flush.
>
> In short, all 'memset;goto redo' places must be fixed to use a bh and not
> to call the callback in the context of bdrv_aio_read, or the
> bdrv_aio_read model falls apart. Reading from qcow2 holes is possible
> with physical readahead (a kind of breada in the linux buffer cache).
>
> This is needed at least for scsi; ide is lucky (or it has been band-aided
> against this API breakage by fixing the symptom and not the real bug).
>
> The same bug exists in qcow of course; it can be fixed later as it's less
> urgent.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> --

Applied.  Thanks.

Regards,

Anthony Liguori
* [Qemu-devel] [PATCH] ide_dma_cancel will result in partial DMA transfer
From: Andrea Arcangeli @ 2009-01-14 18:06 UTC
To: qemu-devel

From: Andrea Arcangeli <aarcange@redhat.com>

The reason for not actually canceling the I/O is that with virtualization
and lots of VMs running, a guest fs may mistake an overload of the host
for an IDE timeout. So rather than canceling the I/O, it's safer to wait
for I/O completion and simulate that the I/O completed just before the
cancellation was requested by the guest. This way if ntfs or an app writes
data without checking for an -EIO retval, and it thinks the write has
succeeded, it's less likely to run into trouble. Similar issues apply to
reads.

Furthermore, because the DMA operation is split into many synchronous
aio_read/write calls if there's more than one entry in the SG table,
without this patch the DMA would be cancelled in the middle, and we have
no idea whether that happens on real hardware too or not. Overall this
seems a great risk for zero gain.

This approach is surely safer than the previous code given we can't expect
all guest fs code out there to check for errors and retry the DMA if it
completed only partially, given a timeout would never materialize on a
real harddisk unless there are defective blocks (and defective blocks are
practically only an issue for reads, never for writes, in any recent
hardware, as writing to blocks is the way to fix them) or the harddisk
breaks as a whole.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

This is a resubmit of an old patch in my queue. I wonder if it'll ever be
merged. I think it's obviously safer (especially once we have
preadv/pwritev driven I/O), even if a noop.

Index: hw/ide.c
===================================================================
--- hw/ide.c	(revision 6296)
+++ hw/ide.c	(working copy)
@@ -2878,8 +2878,28 @@
     printf("%s: 0x%08x\n", __func__, val);
 #endif
     if (!(val & BM_CMD_START)) {
-        /* XXX: do it better */
-        ide_dma_cancel(bm);
+        /*
+         * We can't cancel Scatter Gather DMA in the middle of the
+         * operation or a partial (not full) DMA transfer would reach
+         * the storage, so we wait for completion instead (we behave
+         * as if the DMA had completed by the time the guest tried
+         * to cancel the DMA with bmdma_cmd_writeb with BM_CMD_START
+         * not set).
+         *
+         * In the future we'll be able to safely cancel the I/O if the
+         * whole DMA operation is submitted to disk with a single
+         * aio operation in the form of aio_readv/aio_writev
+         * (supported by the linux kernel AIO but not by the glibc
+         * pthread aio lib).
+         */
+        if (bm->aiocb) {
+            QEMU_WARN("qemu_aio_flush called");
+            qemu_aio_flush();
+            if (bm->aiocb)
+                QEMU_WARN("aiocb still pending");
+            if (bm->status & BM_STATUS_DMAING)
+                QEMU_WARN("BM_STATUS_DMAING still pending");
+        }
         bm->cmd = val & 0x09;
     } else {
         if (!(bm->status & BM_STATUS_DMAING)) {
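The "in the future" note in the comment above points at submitting the
whole scatter/gather list as one vectored request, so that there is no
per-entry boundary at which a cancellation can leave a torn transfer. The
following stand-alone sketch, not qemu code, shows the shape of that with
plain POSIX writev; the file name and sizes are made up for illustration.

/* Stand-alone illustration only -- not qemu code. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/uio.h>

#define SECTOR 512
#define NSEG   4

int main(void)
{
    char seg[NSEG][SECTOR];
    struct iovec iov[NSEG];

    for (int i = 0; i < NSEG; i++) {
        memset(seg[i], 'A' + i, SECTOR);   /* one buffer per SG entry */
        iov[i].iov_base = seg[i];
        iov[i].iov_len  = SECTOR;
    }

    int fd = open("dma-demo.img", O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (fd < 0) { perror("open"); return 1; }

    /* One request covers the whole scatter/gather list: the transfer is
     * submitted (and could be cancelled) as a single unit, instead of
     * one request per SG entry. */
    ssize_t n = writev(fd, iov, NSEG);
    if (n < 0) { perror("writev"); close(fd); return 1; }

    printf("wrote %zd bytes in a single vectored request\n", n);
    close(fd);
    return 0;
}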
* Re: [Qemu-devel] [PATCH] ide_dma_cancel will result in partial DMA transfer
From: Ian Jackson @ 2009-01-16 16:41 UTC
To: qemu-devel

Andrea Arcangeli writes ("[Qemu-devel] [PATCH] ide_dma_cancel will result in partial DMA transfer"):
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> The reason for not actually canceling the I/O is that with virtualization
> and lots of VMs running, a guest fs may mistake an overload of the host
> for an IDE timeout. So rather than canceling the I/O, it's safer to wait
> for I/O completion and simulate that the I/O completed just before the
> cancellation was requested by the guest. [...]

I haven't tested this patch, but the reasoning seems sound. The only
downside would seem to be that this may make the cancellation IO port
write take a rather long time while we wait for a flush, so a guest that
did it a lot might have performance problems.

Acked-By: Ian Jackson <ian.jackson@eu.citrix.com>

Ian.
* Re: [Qemu-devel] [PATCH] ide_dma_cancel will result in partial DMA transfer
From: Anthony Liguori @ 2009-01-22 19:02 UTC
To: qemu-devel

Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> The reason for not actually canceling the I/O is that with virtualization
> and lots of VMs running, a guest fs may mistake an overload of the host
> for an IDE timeout. So rather than canceling the I/O, it's safer to wait
> for I/O completion and simulate that the I/O completed just before the
> cancellation was requested by the guest. This way if ntfs or an app
> writes data without checking for an -EIO retval, and it thinks the write
> has succeeded, it's less likely to run into trouble. Similar issues apply
> to reads.
> [...]
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

I agree with the patch in principle but it doesn't build:

/home/anthony/git/qemu/hw/ide.c: In function ‘bmdma_cmd_writeb’:
/home/anthony/git/qemu/hw/ide.c:3053: error: implicit declaration of function ‘QEMU_WARN’

Regards,

Anthony Liguori
* Re: [Qemu-devel] [PATCH] ide_dma_cancel will result in partial DMA transfer
From: Andrea Arcangeli @ 2009-02-26 16:43 UTC
To: qemu-devel; +Cc: Anthony Liguori

On Thu, Jan 22, 2009 at 01:02:27PM -0600, Anthony Liguori wrote:
> /home/anthony/git/qemu/hw/ide.c: In function ‘bmdma_cmd_writeb’:
> /home/anthony/git/qemu/hw/ide.c:3053: error: implicit declaration of
> function ‘QEMU_WARN’

Ahhh, I've only now seen your replies to #2... I'd expected to get CC'ed
on replies, sorry.
* Re: [Qemu-devel] [PATCH] ide_dma_cancel will result in partial DMA transfer
From: Ian Jackson @ 2008-09-01 11:21 UTC
To: qemu-devel

Andrea Arcangeli writes ("[Qemu-devel] [PATCH] ide_dma_cancel will result in partial DMA transfer"):
> while trying to track down weird fs corruption, I noticed that the way
> the cmd_writeb ioport write cancels the I/O is not atomic from a DMA
> point of view if an SG table with more than one entry is used. The DMA
> command with qemu/kvm is aborted partially. Unfortunately I don't know
> the IDE specs well enough to know what happens in the hardware, but I
> doubt hardware will abort the DMA command in the middle. I wonder
> whether this could explain fs corruption or not. If yes, this should
> possibly fix it.

We're talking about scatter/gather to host memory during DMA rather than
SG to different disk sectors (which is done via queued commands), right?
In which case we're talking about one of the READ or WRITE DMA family
being aborted. That's an unrecoverable error.

According to the specification I have here, various registers are supposed
to indicate the `address of the sector where the first unrecoverable error
occurred' but it also says that `the amount of data transferred is
indeterminate'. So I think the host can't safely assume anything about
which parts of the transfer have completed. `First' must surely mean
temporally first, rather than the sector with the lowest address, and
would seem to be intended for diagnostic logging rather than error
recovery.

Ian.
* Re: [Qemu-devel] [PATCH] ide_dma_cancel will result in partial DMA transfer
From: Andrea Arcangeli @ 2008-09-01 12:13 UTC
To: qemu-devel

On Mon, Sep 01, 2008 at 12:21:54PM +0100, Ian Jackson wrote:
> So I think the host can't safely assume anything about which parts of
> the transfer have completed. `First' must surely mean temporally

The point is that the guest will never issue those dma cancel ops on real
hardware. So if it doesn't notice that its own block-write operation has
failed, it will go ahead without retrying the write, and fs corruption
will be generated silently without any apparent error message or fs
shutdown.

With qemu the aio thread can stall indefinitely if there's heavy disk (or
cpu) load; that kind of IRQ-completion delay wouldn't happen on real
hardware. I can't see why anybody should take any risk by aborting such a
timed-out I/O in the middle, given we know perfectly well that timeouts
never trigger on real hardware, and not all filesystem software is perfect
and checks for I/O errors in every single path (this includes userland
that may not notice a -EIO out of a write() syscall).