* [Qemu-devel] [PATCH] ide_dma_cancel will result in partial DMA transfer
From: Andrea Arcangeli @ 2008-08-29 13:52 UTC
To: qemu-devel

Hello,

while trying to track down weird fs corruption, I noticed that the way the
cmd_writeb ioport write cancels the I/O is not atomic from a DMA point of
view if an SG table with more than one entry is used. The DMA command with
qemu/kvm is aborted partially. Unfortunately I don't know the IDE specs
well enough to know what happens in the hardware, but I doubt hardware
will abort the DMA command in the middle. I wonder whether this could
explain fs corruption or not. If yes, this should possibly fix it.

There's also the possibility that the reboot handler is canceling a DMA in
the middle of SG table processing, but if the guest still has DMA writes
in flight to disk when it issues a reboot, that sounds like a guest bug.
Furthermore, during poweroff (a kind of reboot) a large DMA may not reach
the disk. So I'm more worried about the "software" behavior of
bmdma_cmd_writeb than about the reboot handler. I wonder if perhaps
killing a task in a proprietary guest OS (when the guest OS tries to
reboot) could lead the task to call aio_cancel during cleanup, which would
ask the IDE guest kernel driver to cancel the I/O as a whole (not
partially).

In general I think there's a small chance that this really is the source
of the fs corruption, and if it is then the corruption would also happen
after killing kvm/qemu with SIGKILL (but we don't kill kvm as often as we
reboot the guest in it). So I'm not really sure if this is needed or not,
but it certainly rings a bell that this ide_dma_cancel is definitely
issued with aio commands in flight in the reboot loop that reproduces the
corruption (it always triggers a few seconds before reboot).

The fix is mostly untested as hitting this path takes time, so it's more
of an RFC so far.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

Index: hw/ide.c
===================================================================
--- hw/ide.c	(revision 5089)
+++ hw/ide.c	(working copy)
@@ -2894,8 +2904,21 @@
     printf("%s: 0x%08x\n", __func__, val);
 #endif
     if (!(val & BM_CMD_START)) {
-        /* XXX: do it better */
-        ide_dma_cancel(bm);
+        /*
+         * We can't cancel Scatter Gather DMA in the middle of the
+         * operation or a partial (not full) DMA transfer would reach
+         * the storage, so we wait for completion instead (we behave
+         * as if the DMA had completed by the time the guest tried
+         * to cancel the DMA with bmdma_cmd_writeb with BM_CMD_START
+         * not set).
+         *
+         * In the future we'll be able to safely cancel the I/O if the
+         * whole DMA operation is submitted to disk with a single
+         * aio operation in the form of aio_readv/aio_writev
+         * (supported by the linux kernel AIO but not by the glibc
+         * pthread aio lib).
+         */
+        qemu_aio_flush();
         bm->cmd = val & 0x09;
     } else {
         if (!(bm->status & BM_STATUS_DMAING)) {
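To make the hazard described above concrete, here is a minimal stand-alone
sketch, plain C and not qemu code, of a DMA emulated as one synchronous
write per SG entry; every name and size below is made up for illustration.
Cancelling between entries leaves a torn transfer on the "disk", which is
exactly the window the patch above closes by draining in-flight aio with
qemu_aio_flush() instead of aborting mid-table:

/* Stand-alone illustration only -- not qemu code; names are made up. */
#include <stdio.h>
#include <string.h>

#define SECTOR 512
#define NSEG   4                  /* entries in a hypothetical SG/PRD table */

struct sg_entry { const char *buf; size_t len; };

static char disk[NSEG * SECTOR];  /* stands in for the image file */

/* Emulate the DMA as one copy per SG entry, the way the pre-patch code
 * issued one aio op per entry; "cancel_after" models the guest clearing
 * BM_CMD_START part way through the table. */
static int dma_write(const struct sg_entry *sg, int nseg, int cancel_after)
{
    for (int i = 0; i < nseg; i++) {
        if (i == cancel_after)
            return i;             /* aborted: entries i..nseg-1 never land */
        memcpy(disk + i * SECTOR, sg[i].buf, sg[i].len);
    }
    return nseg;
}

int main(void)
{
    char seg[NSEG][SECTOR];
    struct sg_entry sg[NSEG];

    for (int i = 0; i < NSEG; i++) {
        memset(seg[i], 'A' + i, SECTOR);
        sg[i].buf = seg[i];
        sg[i].len = SECTOR;
    }

    int done = dma_write(sg, NSEG, 2);
    printf("guest asked for %d sectors, only %d reached the disk\n",
           NSEG, done);
    return 0;
}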
* [Qemu-devel] [PATCH 1/2] ide_dma_cancel will result in partial DMA transfer
From: Andrea Arcangeli @ 2008-09-01 10:43 UTC
To: qemu-devel

The reason for not actually canceling the I/O is that with virtualization
and lots of VMs running, a guest fs may mistake an overload of the host
for an IDE timeout. So rather than canceling the I/O, it's safer to wait
for I/O completion and simulate that the I/O completed just before the
cancellation was requested by the guest. This way if ntfs or an app writes
data without checking for an -EIO retval, and it thinks the write has
succeeded, it's less likely to run into trouble. Similar issues apply to
reads.

Furthermore, because the DMA operation is split into many synchronous
aio_read/write calls if there's more than one entry in the SG table,
without this patch the DMA would be cancelled in the middle, and we have
no idea whether that happens on real hardware too or not. Overall this
seems a great deal of risk of generating data corruption, for absolutely
_zero_ gain at runtime. So regardless of whether this is found to fix the
fs corruption (so far so good, no corruption reproduced, but it takes
several days of heavy workload to reproduce so nothing is sure yet) I
think it's a good idea to apply.

This approach is surely safer than the previous code given we can't expect
all guest fs code out there to check for errors and retry the DMA if it
completed only partially, given a timeout would never materialize on a
real harddisk unless there are defective blocks (and defective blocks are
practically only an issue for reads, never for writes, in any recent
hardware, as writing to blocks is the way to fix them) or the harddisk
breaks as a whole.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
---

This is an update for the dma cancellation fix.

Index: hw/ide.c
===================================================================
--- hw/ide.c	(revision 5119)
+++ hw/ide.c	(working copy)
@@ -2894,8 +2894,24 @@
     printf("%s: 0x%08x\n", __func__, val);
 #endif
     if (!(val & BM_CMD_START)) {
-        /* XXX: do it better */
-        ide_dma_cancel(bm);
+        /*
+         * If the guest tries to cancel the DMA we behave as if the DMA
+         * had completed by the time the guest tried to cancel it
+         * with bmdma_cmd_writeb with BM_CMD_START not set. This is
+         * safer than cancelling whatever partial DMA is in flight
+         * because it has a chance to be bug-compatible if a guest fs
+         * isn't checking for I/O errors triggered by guest I/O
+         * timeouts when the host is overloaded.
+         */
+        if (bm->aiocb) {
+            qemu_aio_flush();
+#ifdef DEBUG_IDE
+            if (bm->aiocb)
+                printf("aiocb still pending");
+            if (bm->status & BM_STATUS_DMAING)
+                printf("BM_STATUS_DMAING still pending");
+#endif
+        }
         bm->cmd = val & 0x09;
     } else {
         if (!(bm->status & BM_STATUS_DMAING)) {
* [Qemu-devel] [PATCH 2/2] fix bdrv_aio_read API breakage in qcow2
From: Andrea Arcangeli @ 2008-09-01 10:53 UTC
To: qemu-devel

While testing the dma cancel patch (1/2) I noticed that qemu_aio_flush was
doing nothing at all, while a flood of cmd_writeb commands led to no-op
invocations of qemu_aio_flush. I tracked it down, and the major bug in
this area (not sure if it could be the one responsible for the fs
corruption) is that if the aio callback runs before bdrv_aio_read returns,
IDE's bm->aiocb will be non-null and point at an already completed aiocb,
so a later cmd_writeb will be mistaken for a dma cancellation.

In short, all 'memset;goto redo' places must be fixed to use a bh and not
to call the callback in the context of bdrv_aio_read, or the bdrv_aio_read
model falls apart. Reading from qcow2 holes is possible with physical
readahead (a kind of breada in the linux buffer cache).

All the implications of this bug aren't clear due to the amount of code
affected (qcow2 itself with hd_aiocb in qcow_aio_cancel, scsi etc..). IDE
might have been safe by pure luck because of a DMAING bitflag check before
canceling the I/O, otherwise a double free would happen there too.

This makes 1/2 behave perfectly well (aiocb is always null after
qemu_aio_flush returns).

The same bug exists in qcow of course; it can be fixed later as it's less
urgent.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

Index: Makefile.target
===================================================================
--- Makefile.target	(revision 5119)
+++ Makefile.target	(working copy)
@@ -474,9 +474,9 @@
 OBJS=vl.o osdep.o monitor.o pci.o loader.o isa_mmio.o machine.o net-checksum.o
 
 ifdef CONFIG_WIN32
-OBJS+=block-raw-win32.o
+OBJS+=block-raw-win32.o block-qcow2.o
 else
-OBJS+=block-raw-posix.o
+OBJS+=block-raw-posix.o block-qcow2.o
 endif
 
 LIBS+=-lz
Index: Makefile
===================================================================
--- Makefile	(revision 5119)
+++ Makefile	(working copy)
@@ -46,7 +46,7 @@
 BLOCK_OBJS=cutils.o qemu-malloc.o
 BLOCK_OBJS+=block-cow.o block-qcow.o aes.o block-vmdk.o block-cloop.o
 BLOCK_OBJS+=block-dmg.o block-bochs.o block-vpc.o block-vvfat.o
-BLOCK_OBJS+=block-qcow2.o block-parallels.o
+BLOCK_OBJS+=block-parallels.o
 ifndef CONFIG_WIN32
 BLOCK_OBJS+=block-nbd.o
 endif
@@ -175,9 +175,9 @@
 QEMU_IMG_BLOCK_OBJS = $(BLOCK_OBJS)
 ifdef CONFIG_WIN32
-QEMU_IMG_BLOCK_OBJS += qemu-img-block-raw-win32.o
+QEMU_IMG_BLOCK_OBJS += qemu-img-block-raw-win32.o qemu-img-block-qcow2.o
 else
-QEMU_IMG_BLOCK_OBJS += nbd.o qemu-img-block-raw-posix.o
+QEMU_IMG_BLOCK_OBJS += nbd.o qemu-img-block-raw-posix.o qemu-img-block-qcow2.o
 endif
 
 ######################################################################
@@ -195,7 +195,8 @@
 	$(CC) $(CFLAGS) $(CPPFLAGS) -DQEMU_NBD -c -o $@ $<
 
 qemu-nbd$(EXESUF): qemu-nbd.o qemu-nbd-nbd.o qemu-img-block.o \
-	osdep.o qemu-nbd-block-raw-posix.o $(BLOCK_OBJS)
+	osdep.o qemu-nbd-block-raw-posix.o \
+	qemu-nbd-block-qcow2.o $(BLOCK_OBJS)
 	$(CC) $(LDFLAGS) -o $@ $^ -lz $(LIBS)
 
 # dyngen host tool
Index: block-qcow2.c
===================================================================
--- block-qcow2.c	(revision 5119)
+++ block-qcow2.c	(working copy)
@@ -1169,8 +1169,20 @@
     uint64_t cluster_offset;
     uint8_t *cluster_data;
     BlockDriverAIOCB *hd_aiocb;
+    QEMUBH *bh;
 } QCowAIOCB;
 
+#if !defined(QEMU_IMG) && !defined(QEMU_NBD)
+static void qcow_aio_read_cb(void *opaque, int ret);
+static void qcow_aio_read_bh(void *opaque)
+{
+    QCowAIOCB *acb = opaque;
+    qemu_bh_delete(acb->bh);
+    acb->bh = NULL;
+    qcow_aio_read_cb(opaque, 0);
+}
+#endif
+
 static void qcow_aio_read_cb(void *opaque, int ret)
 {
     QCowAIOCB *acb = opaque;
@@ -1186,7 +1198,9 @@
         return;
     }
 
+#if defined(QEMU_IMG) || defined(QEMU_NBD)
  redo:
+#endif
     /* post process the read buffer */
     if (!acb->cluster_offset) {
         /* nothing to do */
@@ -1227,12 +1241,38 @@
                 if (acb->hd_aiocb == NULL)
                     goto fail;
             } else {
+#if defined(QEMU_IMG) || defined(QEMU_NBD)
                 goto redo;
+#else
+                if (acb->bh) {
+                    ret = -EIO;
+                    goto fail;
+                }
+                acb->bh = qemu_bh_new(qcow_aio_read_bh, acb);
+                if (!acb->bh) {
+                    ret = -EIO;
+                    goto fail;
+                }
+                qemu_bh_schedule(acb->bh);
+#endif
             }
         } else {
            /* Note: in this case, no need to wait */
            memset(acb->buf, 0, 512 * acb->n);
+#if defined(QEMU_IMG) || defined(QEMU_NBD)
            goto redo;
+#else
+           if (acb->bh) {
+               ret = -EIO;
+               goto fail;
+           }
+           acb->bh = qemu_bh_new(qcow_aio_read_bh, acb);
+           if (!acb->bh) {
+               ret = -EIO;
+               goto fail;
+           }
+           qemu_bh_schedule(acb->bh);
+#endif
        }
    } else if (acb->cluster_offset & QCOW_OFLAG_COMPRESSED) {
        /* add AIO support for compressed blocks ? */
@@ -1240,7 +1280,20 @@
            goto fail;
        memcpy(acb->buf,
               s->cluster_cache + index_in_cluster * 512, 512 * acb->n);
+#if defined(QEMU_IMG) || defined(QEMU_NBD)
        goto redo;
+#else
+       if (acb->bh) {
+           ret = -EIO;
+           goto fail;
+       }
+       acb->bh = qemu_bh_new(qcow_aio_read_bh, acb);
+       if (!acb->bh) {
+           ret = -EIO;
+           goto fail;
+       }
+       qemu_bh_schedule(acb->bh);
+#endif
    } else {
        if ((acb->cluster_offset & 511) != 0) {
            ret = -EIO;
* [Qemu-devel] [PATCH] fix bdrv_aio_read API breakage in qcow2
From: Andrea Arcangeli @ 2008-10-22 14:14 UTC
To: qemu-devel

From: Andrea Arcangeli <aarcange@redhat.com>

I noticed that qemu_aio_flush was doing nothing at all, while a flood of
cmd_writeb commands led to no-op invocations of qemu_aio_flush.

In short, all 'memset;goto redo' places must be fixed to use a bh and not
to call the callback in the context of bdrv_aio_read, or the bdrv_aio_read
model falls apart. Reading from qcow2 holes is possible with physical
readahead (a kind of breada in the linux buffer cache).

This is needed at least for scsi; ide is lucky (or it has been band-aided
against this API breakage by fixing the symptom and not the real bug).

The same bug exists in qcow of course; it can be fixed later as it's less
urgent.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
--

Index: block-qcow2.c
===================================================================
--- block-qcow2.c	(revision 5506)
+++ block-qcow2.c	(working copy)
@@ -1165,8 +1165,18 @@
     uint64_t cluster_offset;
     uint8_t *cluster_data;
     BlockDriverAIOCB *hd_aiocb;
+    QEMUBH *bh;
 } QCowAIOCB;
 
+static void qcow_aio_read_cb(void *opaque, int ret);
+static void qcow_aio_read_bh(void *opaque)
+{
+    QCowAIOCB *acb = opaque;
+    qemu_bh_delete(acb->bh);
+    acb->bh = NULL;
+    qcow_aio_read_cb(opaque, 0);
+}
+
 static void qcow_aio_read_cb(void *opaque, int ret)
 {
     QCowAIOCB *acb = opaque;
@@ -1182,7 +1192,6 @@
         return;
     }
 
- redo:
     /* post process the read buffer */
     if (!acb->cluster_offset) {
         /* nothing to do */
@@ -1223,12 +1232,30 @@
                 if (acb->hd_aiocb == NULL)
                     goto fail;
             } else {
-                goto redo;
+                if (acb->bh) {
+                    ret = -EIO;
+                    goto fail;
+                }
+                acb->bh = qemu_bh_new(qcow_aio_read_bh, acb);
+                if (!acb->bh) {
+                    ret = -EIO;
+                    goto fail;
+                }
+                qemu_bh_schedule(acb->bh);
            }
        } else {
            /* Note: in this case, no need to wait */
            memset(acb->buf, 0, 512 * acb->n);
-           goto redo;
+           if (acb->bh) {
+               ret = -EIO;
+               goto fail;
+           }
+           acb->bh = qemu_bh_new(qcow_aio_read_bh, acb);
+           if (!acb->bh) {
+               ret = -EIO;
+               goto fail;
+           }
+           qemu_bh_schedule(acb->bh);
        }
    } else if (acb->cluster_offset & QCOW_OFLAG_COMPRESSED) {
        /* add AIO support for compressed blocks ? */
@@ -1236,7 +1263,16 @@
            goto fail;
        memcpy(acb->buf,
               s->cluster_cache + index_in_cluster * 512, 512 * acb->n);
-       goto redo;
+       if (acb->bh) {
+           ret = -EIO;
+           goto fail;
+       }
+       acb->bh = qemu_bh_new(qcow_aio_read_bh, acb);
+       if (!acb->bh) {
+           ret = -EIO;
+           goto fail;
+       }
+       qemu_bh_schedule(acb->bh);
    } else {
        if ((acb->cluster_offset & 511) != 0) {
            ret = -EIO;
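The API rule this patch enforces (never run the completion callback from
inside the aio submit path; defer it to a bottom half so the caller gets
to record the returned handle first) can be modeled with a small
stand-alone program. This is an illustrative sketch, not qemu code: the
one-slot "bottom half" queue and every name below are invented for the
example, standing in for QEMUBH/qemu_bh_new/qemu_bh_schedule.

/* Stand-alone model only -- not qemu code; all names are hypothetical. */
#include <stdio.h>
#include <stddef.h>

typedef void completion_cb(void *opaque, int ret);

struct aiocb { int dummy; };
static struct aiocb the_handle;

/* --- a one-slot "bottom half" queue, run from the main loop --- */
static completion_cb *pending_cb;
static void *pending_opaque;

static void bh_schedule(completion_cb *cb, void *opaque)
{
    pending_cb = cb;
    pending_opaque = opaque;
}

static void main_loop_run_bh(void)
{
    if (pending_cb) {
        completion_cb *cb = pending_cb;
        pending_cb = NULL;
        cb(pending_opaque, 0);
    }
}

/* --- device-model bookkeeping, mimicking ide.c's bm->aiocb --- */
static struct aiocb *inflight;

static void dma_complete(void *opaque, int ret)
{
    inflight = NULL;              /* nothing in flight any more */
    printf("completion ran, ret=%d\n", ret);
}

/* BROKEN: data is already available, so the callback runs before the
 * caller has stored the returned handle. */
static struct aiocb *aio_read_broken(completion_cb *cb, void *opaque)
{
    cb(opaque, 0);
    return &the_handle;
}

/* FIXED: always return first; the main loop delivers the completion. */
static struct aiocb *aio_read_fixed(completion_cb *cb, void *opaque)
{
    bh_schedule(cb, opaque);
    return &the_handle;
}

int main(void)
{
    inflight = aio_read_broken(dma_complete, NULL);
    printf("broken: inflight=%p (stale handle to a completed op)\n",
           (void *)inflight);

    inflight = aio_read_fixed(dma_complete, NULL);
    main_loop_run_bh();
    printf("fixed:  inflight=%p\n", (void *)inflight);
    return 0;
}

In the broken case the caller is left holding a non-null handle to an
already-completed operation, which is exactly how IDE's bm->aiocb ended up
stale and a later cmd_writeb got mistaken for a dma cancellation.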
* Re: [Qemu-devel] [PATCH] fix bdrv_aio_read API breakage in qcow2
From: Anthony Liguori @ 2008-10-27 13:49 UTC
To: qemu-devel

Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> I noticed that qemu_aio_flush was doing nothing at all, while a flood of
> cmd_writeb commands led to no-op invocations of qemu_aio_flush.
>
> In short, all 'memset;goto redo' places must be fixed to use a bh and not
> to call the callback in the context of bdrv_aio_read, or the
> bdrv_aio_read model falls apart. Reading from qcow2 holes is possible
> with physical readahead (a kind of breada in the linux buffer cache).
>
> This is needed at least for scsi; ide is lucky (or it has been band-aided
> against this API breakage by fixing the symptom and not the real bug).
>
> The same bug exists in qcow of course; it can be fixed later as it's less
> urgent.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

You've got whitespace damage (tabs instead of spaces).

Regards,

Anthony Liguori

> [...]
* Re: [Qemu-devel] [PATCH] fix bdrv_aio_read API breakage in qcow2
From: Anthony Liguori @ 2008-10-31 17:32 UTC
To: qemu-devel

Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> I noticed that qemu_aio_flush was doing nothing at all, while a flood of
> cmd_writeb commands led to no-op invocations of qemu_aio_flush.
>
> In short, all 'memset;goto redo' places must be fixed to use a bh and not
> to call the callback in the context of bdrv_aio_read, or the
> bdrv_aio_read model falls apart. Reading from qcow2 holes is possible
> with physical readahead (a kind of breada in the linux buffer cache).
>
> This is needed at least for scsi; ide is lucky (or it has been band-aided
> against this API breakage by fixing the symptom and not the real bug).
>
> The same bug exists in qcow of course; it can be fixed later as it's less
> urgent.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> --

Applied.  Thanks.

Regards,

Anthony Liguori
* [Qemu-devel] [PATCH] ide_dma_cancel will result in partial DMA transfer
From: Andrea Arcangeli @ 2009-01-14 18:06 UTC
To: qemu-devel

From: Andrea Arcangeli <aarcange@redhat.com>

The reason for not actually canceling the I/O is that with virtualization
and lots of VMs running, a guest fs may mistake an overload of the host
for an IDE timeout. So rather than canceling the I/O, it's safer to wait
for I/O completion and simulate that the I/O completed just before the
cancellation was requested by the guest. This way if ntfs or an app writes
data without checking for an -EIO retval, and it thinks the write has
succeeded, it's less likely to run into trouble. Similar issues apply to
reads.

Furthermore, because the DMA operation is split into many synchronous
aio_read/write calls if there's more than one entry in the SG table,
without this patch the DMA would be cancelled in the middle, and we have
no idea whether that happens on real hardware too or not. Overall this
seems a great risk for zero gain.

This approach is surely safer than the previous code given we can't expect
all guest fs code out there to check for errors and retry the DMA if it
completed only partially, given a timeout would never materialize on a
real harddisk unless there are defective blocks (and defective blocks are
practically only an issue for reads, never for writes, in any recent
hardware, as writing to blocks is the way to fix them) or the harddisk
breaks as a whole.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

This is a resubmit of an old patch in my queue. I wonder if it'll ever be
merged. I think it's obviously safer (especially once we have
preadv/pwritev driven I/O), even if a noop.

Index: hw/ide.c
===================================================================
--- hw/ide.c	(revision 6296)
+++ hw/ide.c	(working copy)
@@ -2878,8 +2878,28 @@
     printf("%s: 0x%08x\n", __func__, val);
 #endif
     if (!(val & BM_CMD_START)) {
-        /* XXX: do it better */
-        ide_dma_cancel(bm);
+        /*
+         * We can't cancel Scatter Gather DMA in the middle of the
+         * operation or a partial (not full) DMA transfer would reach
+         * the storage, so we wait for completion instead (we behave
+         * as if the DMA had completed by the time the guest tried
+         * to cancel the DMA with bmdma_cmd_writeb with BM_CMD_START
+         * not set).
+         *
+         * In the future we'll be able to safely cancel the I/O if the
+         * whole DMA operation is submitted to disk with a single
+         * aio operation in the form of aio_readv/aio_writev
+         * (supported by the linux kernel AIO but not by the glibc
+         * pthread aio lib).
+         */
+        if (bm->aiocb) {
+            QEMU_WARN("qemu_aio_flush called");
+            qemu_aio_flush();
+            if (bm->aiocb)
+                QEMU_WARN("aiocb still pending");
+            if (bm->status & BM_STATUS_DMAING)
+                QEMU_WARN("BM_STATUS_DMAING still pending");
+        }
         bm->cmd = val & 0x09;
     } else {
         if (!(bm->status & BM_STATUS_DMAING)) {
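The "in the future" note in the comment above points at submitting the
whole scatter/gather list as one vectored request, so that there is no
per-entry boundary at which a cancellation can leave a torn transfer. The
following stand-alone sketch, not qemu code, shows the shape of that with
plain POSIX writev; the file name and sizes are made up for illustration.

/* Stand-alone illustration only -- not qemu code. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/uio.h>

#define SECTOR 512
#define NSEG   4

int main(void)
{
    char seg[NSEG][SECTOR];
    struct iovec iov[NSEG];

    for (int i = 0; i < NSEG; i++) {
        memset(seg[i], 'A' + i, SECTOR);   /* one buffer per SG entry */
        iov[i].iov_base = seg[i];
        iov[i].iov_len  = SECTOR;
    }

    int fd = open("dma-demo.img", O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (fd < 0) { perror("open"); return 1; }

    /* One request covers the whole scatter/gather list: the transfer is
     * submitted (and could be cancelled) as a single unit, instead of
     * one request per SG entry. */
    ssize_t n = writev(fd, iov, NSEG);
    if (n < 0) { perror("writev"); close(fd); return 1; }

    printf("wrote %zd bytes in a single vectored request\n", n);
    close(fd);
    return 0;
}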
* Re: [Qemu-devel] [PATCH] ide_dma_cancel will result in partial DMA transfer
From: Ian Jackson @ 2009-01-16 16:41 UTC
To: qemu-devel

Andrea Arcangeli writes ("[Qemu-devel] [PATCH] ide_dma_cancel will result in partial DMA transfer"):
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> The reason for not actually canceling the I/O is that with virtualization
> and lots of VMs running, a guest fs may mistake an overload of the host
> for an IDE timeout. So rather than canceling the I/O, it's safer to wait
> for I/O completion and simulate that the I/O completed just before the
> cancellation was requested by the guest. [...]

I haven't tested this patch, but the reasoning seems sound. The only
downside would seem to be that this may make the cancellation IO port
write take a rather long time while we wait for a flush, so a guest that
did it a lot might have performance problems.

Acked-By: Ian Jackson <ian.jackson@eu.citrix.com>

Ian.
* Re: [Qemu-devel] [PATCH] ide_dma_cancel will result in partial DMA transfer
From: Anthony Liguori @ 2009-01-22 19:02 UTC
To: qemu-devel

Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> The reason for not actually canceling the I/O is that with virtualization
> and lots of VMs running, a guest fs may mistake an overload of the host
> for an IDE timeout. So rather than canceling the I/O, it's safer to wait
> for I/O completion and simulate that the I/O completed just before the
> cancellation was requested by the guest. This way if ntfs or an app
> writes data without checking for an -EIO retval, and it thinks the write
> has succeeded, it's less likely to run into trouble. Similar issues apply
> to reads.
> [...]
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

I agree with the patch in principle but it doesn't build:

/home/anthony/git/qemu/hw/ide.c: In function ‘bmdma_cmd_writeb’:
/home/anthony/git/qemu/hw/ide.c:3053: error: implicit declaration of function ‘QEMU_WARN’

Regards,

Anthony Liguori
* Re: [Qemu-devel] [PATCH] ide_dma_cancel will result in partial DMA transfer
From: Andrea Arcangeli @ 2009-02-26 16:43 UTC
To: qemu-devel; +Cc: Anthony Liguori

On Thu, Jan 22, 2009 at 01:02:27PM -0600, Anthony Liguori wrote:
> /home/anthony/git/qemu/hw/ide.c: In function ‘bmdma_cmd_writeb’:
> /home/anthony/git/qemu/hw/ide.c:3053: error: implicit declaration of
> function ‘QEMU_WARN’

Ahhh, I've only now seen your replies to #2... I'd expected to get CC'ed
on replies, sorry.
* Re: [Qemu-devel] [PATCH] ide_dma_cancel will result in partial DMA transfer
From: Ian Jackson @ 2008-09-01 11:21 UTC
To: qemu-devel

Andrea Arcangeli writes ("[Qemu-devel] [PATCH] ide_dma_cancel will result in partial DMA transfer"):
> while trying to track down weird fs corruption, I noticed that the way
> the cmd_writeb ioport write cancels the I/O is not atomic from a DMA
> point of view if an SG table with more than one entry is used. The DMA
> command with qemu/kvm is aborted partially. Unfortunately I don't know
> the IDE specs well enough to know what happens in the hardware, but I
> doubt hardware will abort the DMA command in the middle. I wonder
> whether this could explain fs corruption or not. If yes, this should
> possibly fix it.

We're talking about scatter/gather to host memory during DMA rather than
SG to different disk sectors (which is done via queued commands), right?
In which case we're talking about one of the READ or WRITE DMA family
being aborted. That's an unrecoverable error.

According to the specification I have here, various registers are supposed
to indicate the `address of the sector where the first unrecoverable error
occurred' but it also says that `the amount of data transferred is
indeterminate'. So I think the host can't safely assume anything about
which parts of the transfer have completed. `First' must surely mean
temporally first, rather than the sector with the lowest address, and
would seem to be intended for diagnostic logging rather than error
recovery.

Ian.
* Re: [Qemu-devel] [PATCH] ide_dma_cancel will result in partial DMA transfer
From: Andrea Arcangeli @ 2008-09-01 12:13 UTC
To: qemu-devel

On Mon, Sep 01, 2008 at 12:21:54PM +0100, Ian Jackson wrote:
> So I think the host can't safely assume anything about which parts of
> the transfer have completed. `First' must surely mean temporally

The point is that the guest will never issue those dma cancel ops on real
hardware. So if it doesn't notice that its own block-write operation has
failed, it will go ahead without retrying the write, and fs corruption
will be generated silently without any apparent error message or fs
shutdown.

With qemu the aio thread can stall indefinitely if there's heavy disk (or
cpu) load; that kind of IRQ-completion delay wouldn't happen on real
hardware. I can't see why anybody should take any risk by aborting such a
timed-out I/O in the middle, given we know perfectly well that timeouts
never trigger on real hardware, and not all filesystem software is perfect
and checks for I/O errors in every single path (this includes userland
that may not notice a -EIO out of a write() syscall).