* [Qemu-devel] [PATCH 0/3] block: fix and detect qemu_aio_wait() issues
From: Stefan Hajnoczi @ 2011-11-30 12:23 UTC
To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi
This series fixes an issue where qemu_aio_flush() would return to IDE hardware
emulation with requests still pending. Kevin Wolf <kwolf@redhat.com> found it
in the block tree before qemu.git/master merge, so qemu.git/master is not
affected and these patches are against his tree.
A recent interaction between copy-on-read and "qcow2: Unlock during COW"
emphasized the need for strong assertion checking. Patch 1 fixes an assertion
failure with qcow2 images in hw/ide/pci.c:bmdma_cmd_writeb().
Patches 2 and 3 then introduce assertions that detect
wait_for_overlapping_requests() deadlock and broken qemu_aio_wait(). In the
future we will have better information if issues related to qemu_aio_wait()
come up.
I have run qemu-iotests for qcow2 and qed. I have also performed a RHEL6 guest
install to qed and tested booting the guest with IDE and qcow2.
Stefan Hajnoczi (3):
qcow2: avoid reentrant bdrv_read() in copy_sectors()
block: wait_for_overlapping_requests() deadlock detection
block: convert qemu_aio_flush() calls to bdrv_drain_all()
block-migration.c | 2 +-
block.c | 27 +++++++++++++++++++++++++++
block.h | 1 +
block/qcow2-cluster.c | 27 +++++++++++++++++++--------
blockdev.c | 4 ++--
cpus.c | 2 +-
hw/ide/macio.c | 5 +++--
hw/ide/pci.c | 2 +-
hw/virtio-blk.c | 2 +-
hw/xen_platform.c | 2 +-
qemu-io.c | 4 ++--
savevm.c | 2 +-
xen-mapcache.c | 2 +-
13 files changed, 61 insertions(+), 21 deletions(-)
--
1.7.7.3
* [Qemu-devel] [PATCH 1/3] qcow2: avoid reentrant bdrv_read() in copy_sectors()
From: Stefan Hajnoczi @ 2011-11-30 12:23 UTC
To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi
A BlockDriverState should not issue requests on itself through the
public block layer interface. Nested, or reentrant, requests are
problematic because they do I/O throttling and request tracking twice.
Features like block layer copy-on-read use request tracking to avoid
race conditions between concurrent requests. The reentrant request has
to "wait" for its parent request to complete, but the parent is waiting
for the reentrant request to make progress, so we have reached
deadlock.
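The same failure mode can be illustrated with a tiny standalone model
(plain C for illustration only -- none of the names below are QEMU
code): the waiter is the only thing that could ever complete the
request it is waiting on.

    /* Toy model of the deadlock.  The "parent" request is marked in
     * flight, then the reentrant request waits for that same tracking
     * entry to go away.  Nothing else can clear it, so the wait never
     * finishes and the program hangs. */
    #include <stdbool.h>

    struct toy_tracked_request {
        volatile bool in_flight;    /* volatile: keep the spin loop honest */
    };

    static void toy_wait_for_overlapping(struct toy_tracked_request *parent)
    {
        while (parent->in_flight) {
            /* In QEMU this would yield to other coroutines; here the
             * only owner of "parent" is the caller itself, so nothing
             * ever clears in_flight. */
        }
    }

    int main(void)
    {
        struct toy_tracked_request parent = { .in_flight = true };

        /* The parent now issues a request that overlaps itself... */
        toy_wait_for_overlapping(&parent);    /* never returns */
        return 0;
    }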
The solution is for block drivers to avoid the public block layer
interfaces for reentrant requests. Instead they should call their own
internal functions if they wish to perform reentrant requests.
This is also a good opportunity to make copy_sectors() a true
coroutine_fn. That means calling bdrv_co_writev() instead of
bdrv_write(). Behavior is unchanged but we're being explicit that this
executes in coroutine context.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
block/qcow2-cluster.c | 27 +++++++++++++++++++--------
1 files changed, 19 insertions(+), 8 deletions(-)
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 0e33707..07a2e93 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -289,12 +289,15 @@ void qcow2_encrypt_sectors(BDRVQcowState *s, int64_t sector_num,
}
}
-static int copy_sectors(BlockDriverState *bs, uint64_t start_sect,
- uint64_t cluster_offset, int n_start, int n_end)
+static int coroutine_fn copy_sectors(BlockDriverState *bs,
+ uint64_t start_sect,
+ uint64_t cluster_offset,
+ int n_start, int n_end)
{
BDRVQcowState *s = bs->opaque;
+ QEMUIOVector qiov;
+ struct iovec iov;
int n, ret;
- void *buf;
/*
* If this is the last cluster and it is only partially used, we must only
@@ -310,29 +313,37 @@ static int copy_sectors(BlockDriverState *bs, uint64_t start_sect,
return 0;
}
- buf = qemu_blockalign(bs, n * BDRV_SECTOR_SIZE);
+ iov.iov_len = n * BDRV_SECTOR_SIZE;
+ iov.iov_base = qemu_blockalign(bs, iov.iov_len);
+
+ qemu_iovec_init_external(&qiov, &iov, 1);
BLKDBG_EVENT(bs->file, BLKDBG_COW_READ);
- ret = bdrv_read(bs, start_sect + n_start, buf, n);
+
+ /* Call .bdrv_co_readv() directly instead of using the public block-layer
+ * interface. This avoids double I/O throttling and request tracking,
+ * which can lead to deadlock when block layer copy-on-read is enabled.
+ */
+ ret = bs->drv->bdrv_co_readv(bs, start_sect + n_start, n, &qiov);
if (ret < 0) {
goto out;
}
if (s->crypt_method) {
qcow2_encrypt_sectors(s, start_sect + n_start,
- buf, buf, n, 1,
+ iov.iov_base, iov.iov_base, n, 1,
&s->aes_encrypt_key);
}
BLKDBG_EVENT(bs->file, BLKDBG_COW_WRITE);
- ret = bdrv_write(bs->file, (cluster_offset >> 9) + n_start, buf, n);
+ ret = bdrv_co_writev(bs->file, (cluster_offset >> 9) + n_start, n, &qiov);
if (ret < 0) {
goto out;
}
ret = 0;
out:
- qemu_vfree(buf);
+ qemu_vfree(iov.iov_base);
return ret;
}
--
1.7.7.3
* [Qemu-devel] [PATCH 2/3] block: wait_for_overlapping_requests() deadlock detection
From: Stefan Hajnoczi @ 2011-11-30 12:23 UTC
To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi
Debugging a reentrant request deadlock was fun but in the future we need
a quick and obvious way of detecting such bugs. Add an assert that
checks we are not about to deadlock when waiting for another request.
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
block.c | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)
diff --git a/block.c b/block.c
index 4310eb5..37a96f3 100644
--- a/block.c
+++ b/block.c
@@ -1099,6 +1099,7 @@ struct BdrvTrackedRequest {
int nb_sectors;
bool is_write;
QLIST_ENTRY(BdrvTrackedRequest) list;
+ Coroutine *co; /* owner, used for deadlock detection */
CoQueue wait_queue; /* coroutines blocked on this request */
};
@@ -1126,6 +1127,7 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
.sector_num = sector_num,
.nb_sectors = nb_sectors,
.is_write = is_write,
+ .co = qemu_coroutine_self(),
};
qemu_co_queue_init(&req->wait_queue);
@@ -1189,6 +1191,12 @@ static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
QLIST_FOREACH(req, &bs->tracked_requests, list) {
if (tracked_request_overlaps(req, cluster_sector_num,
cluster_nb_sectors)) {
+ /* Hitting this means there was a reentrant request, for
+ * example, a block driver issuing nested requests. This must
+ * never happen since it means deadlock.
+ */
+ assert(qemu_coroutine_self() != req->co);
+
qemu_co_queue_wait(&req->wait_queue);
retry = true;
break;
--
1.7.7.3
* [Qemu-devel] [PATCH 3/3] block: convert qemu_aio_flush() calls to bdrv_drain_all()
From: Stefan Hajnoczi @ 2011-11-30 12:23 UTC
To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi
Many places in QEMU call qemu_aio_flush() to complete all pending
asynchronous I/O. Most of these places actually want to drain all block
requests but there is block layer API to do so.
This patch introduces the bdrv_drain_all() API to wait for requests
across all BlockDriverStates to complete. As a bonus we perform checks
after qemu_aio_wait() to ensure that requests really have finished.
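For callers that also need the data on the host disk, the intended
pattern (mirroring the do_vm_stop() hunk below) is to drain first and
flush second, since bdrv_drain_all() itself does not write anything
out:

    bdrv_drain_all();   /* wait for in-flight requests on all BlockDriverStates */
    bdrv_flush_all();   /* then flush dirty data to disk */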
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
block-migration.c | 2 +-
block.c | 19 +++++++++++++++++++
block.h | 1 +
blockdev.c | 4 ++--
cpus.c | 2 +-
hw/ide/macio.c | 5 +++--
hw/ide/pci.c | 2 +-
hw/virtio-blk.c | 2 +-
hw/xen_platform.c | 2 +-
qemu-io.c | 4 ++--
savevm.c | 2 +-
xen-mapcache.c | 2 +-
12 files changed, 34 insertions(+), 13 deletions(-)
diff --git a/block-migration.c b/block-migration.c
index 5f104864..423c5a0 100644
--- a/block-migration.c
+++ b/block-migration.c
@@ -387,7 +387,7 @@ static int mig_save_device_dirty(Monitor *mon, QEMUFile *f,
for (sector = bmds->cur_dirty; sector < bmds->total_sectors;) {
if (bmds_aio_inflight(bmds, sector)) {
- qemu_aio_flush();
+ bdrv_drain_all();
}
if (bdrv_get_dirty(bmds->bs, sector)) {
diff --git a/block.c b/block.c
index 37a96f3..86807d6 100644
--- a/block.c
+++ b/block.c
@@ -846,6 +846,25 @@ void bdrv_close_all(void)
}
}
+/*
+ * Wait for pending requests to complete across all BlockDriverStates
+ *
+ * This function does not flush data to disk, use bdrv_flush_all() for that
+ * after calling this function.
+ */
+void bdrv_drain_all(void)
+{
+ BlockDriverState *bs;
+
+ qemu_aio_flush();
+
+ /* If requests are still pending there is a bug somewhere */
+ QTAILQ_FOREACH(bs, &bdrv_states, list) {
+ assert(QLIST_EMPTY(&bs->tracked_requests));
+ assert(qemu_co_queue_empty(&bs->throttled_reqs));
+ }
+}
+
/* make a BlockDriverState anonymous by removing from bdrv_state list.
Also, NULL terminate the device_name to prevent double remove */
void bdrv_make_anon(BlockDriverState *bs)
diff --git a/block.h b/block.h
index 1d28e85..0dc4905 100644
--- a/block.h
+++ b/block.h
@@ -214,6 +214,7 @@ int bdrv_flush(BlockDriverState *bs);
int coroutine_fn bdrv_co_flush(BlockDriverState *bs);
void bdrv_flush_all(void);
void bdrv_close_all(void);
+void bdrv_drain_all(void);
int bdrv_discard(BlockDriverState *bs, int64_t sector_num, int nb_sectors);
int bdrv_co_discard(BlockDriverState *bs, int64_t sector_num, int nb_sectors);
diff --git a/blockdev.c b/blockdev.c
index af4e239..dbf0251 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -653,7 +653,7 @@ int do_snapshot_blkdev(Monitor *mon, const QDict *qdict, QObject **ret_data)
goto out;
}
- qemu_aio_flush();
+ bdrv_drain_all();
bdrv_flush(bs);
bdrv_close(bs);
@@ -840,7 +840,7 @@ int do_drive_del(Monitor *mon, const QDict *qdict, QObject **ret_data)
}
/* quiesce block driver; prevent further io */
- qemu_aio_flush();
+ bdrv_drain_all();
bdrv_flush(bs);
bdrv_close(bs);
diff --git a/cpus.c b/cpus.c
index 82530c4..58e173c 100644
--- a/cpus.c
+++ b/cpus.c
@@ -396,7 +396,7 @@ static void do_vm_stop(RunState state)
pause_all_vcpus();
runstate_set(state);
vm_state_notify(0, state);
- qemu_aio_flush();
+ bdrv_drain_all();
bdrv_flush_all();
monitor_protocol_event(QEVENT_STOP, NULL);
}
diff --git a/hw/ide/macio.c b/hw/ide/macio.c
index 70b3342..c09d2e0 100644
--- a/hw/ide/macio.c
+++ b/hw/ide/macio.c
@@ -200,8 +200,9 @@ static void pmac_ide_flush(DBDMA_io *io)
{
MACIOIDEState *m = io->opaque;
- if (m->aiocb)
- qemu_aio_flush();
+ if (m->aiocb) {
+ bdrv_drain_all();
+ }
}
/* PowerMac IDE memory IO */
diff --git a/hw/ide/pci.c b/hw/ide/pci.c
index 49b823d..5078c0b 100644
--- a/hw/ide/pci.c
+++ b/hw/ide/pci.c
@@ -309,7 +309,7 @@ void bmdma_cmd_writeb(BMDMAState *bm, uint32_t val)
* aio operation with preadv/pwritev.
*/
if (bm->bus->dma->aiocb) {
- qemu_aio_flush();
+ bdrv_drain_all();
assert(bm->bus->dma->aiocb == NULL);
assert((bm->status & BM_STATUS_DMAING) == 0);
}
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index d6d1f87..4b0d113 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -474,7 +474,7 @@ static void virtio_blk_reset(VirtIODevice *vdev)
* This should cancel pending requests, but can't do nicely until there
* are per-device request lists.
*/
- qemu_aio_flush();
+ bdrv_drain_all();
}
/* coalesce internal state, copy to pci i/o region 0
diff --git a/hw/xen_platform.c b/hw/xen_platform.c
index 5e792f5..e62eaef 100644
--- a/hw/xen_platform.c
+++ b/hw/xen_platform.c
@@ -120,7 +120,7 @@ static void platform_fixed_ioport_writew(void *opaque, uint32_t addr, uint32_t v
devices, and bit 2 the non-primary-master IDE devices. */
if (val & UNPLUG_ALL_IDE_DISKS) {
DPRINTF("unplug disks\n");
- qemu_aio_flush();
+ bdrv_drain_all();
bdrv_flush_all();
pci_unplug_disks(s->pci_dev.bus);
}
diff --git a/qemu-io.c b/qemu-io.c
index de26422..b35adbb 100644
--- a/qemu-io.c
+++ b/qemu-io.c
@@ -1853,9 +1853,9 @@ int main(int argc, char **argv)
command_loop();
/*
- * Make sure all outstanding requests get flushed the program exits.
+ * Make sure all outstanding requests complete before the program exits.
*/
- qemu_aio_flush();
+ bdrv_drain_all();
if (bs) {
bdrv_delete(bs);
diff --git a/savevm.c b/savevm.c
index f53cd4c..0443554 100644
--- a/savevm.c
+++ b/savevm.c
@@ -2104,7 +2104,7 @@ int load_vmstate(const char *name)
}
/* Flush all IO requests so they don't interfere with the new state. */
- qemu_aio_flush();
+ bdrv_drain_all();
bs = NULL;
while ((bs = bdrv_next(bs))) {
diff --git a/xen-mapcache.c b/xen-mapcache.c
index 7bcb86e..9fecc64 100644
--- a/xen-mapcache.c
+++ b/xen-mapcache.c
@@ -351,7 +351,7 @@ void xen_invalidate_map_cache(void)
MapCacheRev *reventry;
/* Flush pending AIO before destroying the mapcache */
- qemu_aio_flush();
+ bdrv_drain_all();
QTAILQ_FOREACH(reventry, &mapcache->locked_entries, next) {
DPRINTF("There should be no locked mappings at this time, "
--
1.7.7.3
* Re: [Qemu-devel] [PATCH 3/3] block: convert qemu_aio_flush() calls to bdrv_drain_all()
From: Christoph Hellwig @ 2011-11-30 12:28 UTC
To: Stefan Hajnoczi; +Cc: Kevin Wolf, qemu-devel
On Wed, Nov 30, 2011 at 12:23:43PM +0000, Stefan Hajnoczi wrote:
> Many places in QEMU call qemu_aio_flush() to complete all pending
> asynchronous I/O. Most of these places actually want to drain all block
> requests but there is block layer API to do so.
there seems to be a "not" missing in the last half sentence.
>
> This patch introduces the bdrv_drain_all() API to wait for requests
> across all BlockDriverStates to complete. As a bonus we perform checks
> after qemu_aio_wait() to ensure that requests really have finished.
It looks like all but four of the callers actually just want to drain a
single BlockDriverState. And one of those four already has its own loop
over all BlockDriverStates.
* Re: [Qemu-devel] [PATCH 3/3] block: convert qemu_aio_flush() calls to bdrv_drain_all()
From: Stefan Hajnoczi @ 2011-11-30 12:54 UTC
To: Christoph Hellwig; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel
On Wed, Nov 30, 2011 at 12:28 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Wed, Nov 30, 2011 at 12:23:43PM +0000, Stefan Hajnoczi wrote:
>> Many places in QEMU call qemu_aio_flush() to complete all pending
>> asynchronous I/O. Most of these places actually want to drain all block
>> requests but there is block layer API to do so.
>
> there seems to be a "not" missing in the last half sentence.
>
>>
>> This patch introduces the bdrv_drain_all() API to wait for requests
>> across all BlockDriverStates to complete. As a bonus we perform checks
>> after qemu_aio_wait() to ensure that requests really have finished.
>
> It looks like all but four of the callers actually just want to drain a
> single BlockDriverState. And one of those four already has its own loop
> over all BlockDriverStates.
Yes, we still don't have an interface for waiting on just one
BlockDriverState. virtio-blk even has a comment about the fact that
there is no per-block device way of waiting.
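If we do add one, it could presumably be layered on the same
primitives. A purely hypothetical sketch, not something implemented in
this series:

    /* Hypothetical per-device drain, for illustration only.  It assumes
     * that qemu_aio_wait() makes progress on pending AIO for all
     * devices, so looping until this BlockDriverState's request lists
     * are empty would eventually terminate. */
    void bdrv_drain(BlockDriverState *bs)
    {
        while (!QLIST_EMPTY(&bs->tracked_requests) ||
               !qemu_co_queue_empty(&bs->throttled_reqs)) {
            qemu_aio_wait();
        }
    }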
I think this should be done later since it is independent of adding
these asserts after qemu_aio_flush() (which is what this patch does).
Stefan
* Re: [Qemu-devel] [PATCH 0/3] block: fix and detect qemu_aio_wait() issues
From: Kevin Wolf @ 2011-12-05 13:57 UTC
To: Stefan Hajnoczi; +Cc: qemu-devel
On 30.11.2011 13:23, Stefan Hajnoczi wrote:
> This series fixes an issue where qemu_aio_flush() would return to IDE hardware
> emulation with requests still pending. Kevin Wolf <kwolf@redhat.com> found it
> in the block tree before qemu.git/master merge, so qemu.git/master is not
> affected and these patches are against his tree.
>
> A recent interaction between copy-on-read and "qcow2: Unlock during COW"
> emphasized the need for strong assertion checking. Patch 1 fixes an assertion
> failure with qcow2 images in hw/ide/pci.c:bmdma_cmd_writeb().
>
> Patches 2 and 3 then introduce assertions that detect
> wait_for_overlapping_requests() deadlock and broken qemu_aio_wait(). In the
> future we will have better information if issues related to qemu_aio_wait()
> come up.
>
> I have run qemu-iotests for qcow2 and qed. I have also performed a RHEL6 guest
> install to qed and tested booting the guest with IDE and qcow2.
>
> Stefan Hajnoczi (3):
> qcow2: avoid reentrant bdrv_read() in copy_sectors()
> block: wait_for_overlapping_requests() deadlock detection
> block: convert qemu_aio_flush() calls to bdrv_drain_all()
Thanks, applied all to the block branch.
Kevin