* [RFC 00/11] do not drain while holding the graph lock
@ 2025-05-08 14:09 Fiona Ebner
  2025-05-08 14:09 ` [PATCH 01/11] block: remove outdated comments about AioContext locking Fiona Ebner
                   ` (10 more replies)
  0 siblings, 11 replies; 30+ messages in thread
From: Fiona Ebner @ 2025-05-08 14:09 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov

This series is an attempt to fix a deadlock issue reported by Andrey
here [0].

bdrv_drained_begin() polls and is not allowed to be called with the
block graph lock held. Mark the function as GRAPH_UNLOCKED.

This does not catch the issue reported by Andrey, because there
is a bdrv_graph_rdunlock_main_loop() before bdrv_drained_begin() in
the function bdrv_change_aio_context(). That unlock is of course
ineffective if the exclusive lock is held, but it seems to prevent TSA
from finding the issue.
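
Roughly, the problematic sequence looks like this (a simplified
sketch, not a literal excerpt from bdrv_change_aio_context()):

    bdrv_graph_rdunlock_main_loop(); /* TSA treats the graph lock as
                                      * released here, but this is
                                      * ineffective for a caller that
                                      * holds the exclusive write lock */
    bdrv_drained_begin(bs);          /* so TSA raises no warning, yet
                                      * this can still poll while the
                                      * graph is write-locked */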

Thus the bdrv_drained_begin() call from inside
bdrv_change_aio_context() needs to be moved up the call stack before
acquiring the locks. This required the most changes; see patch 09/11.

Granular draining is not trivially possible, because many of the
affected functions can recursively call themselves.

In places where bdrv_drained_begin() calls were removed, assertions
are added that check the quiesce_counter to ensure that the nodes
have already been drained further up in the call stack.
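
As a sketch (the function names here are illustrative, not code from
the series), the resulting pattern looks like this:

    /* Recursive helper: can call itself, so it must not drain by
     * itself. Instead, it asserts that the caller already drained. */
    static int GRAPH_RDLOCK helper(BlockDriverState *bs)
    {
        assert(qatomic_read(&bs->quiesce_counter) > 0);
        /* ... recursive graph/node manipulation ... */
        return 0;
    }

    /* Outermost caller: drain before taking the graph lock. */
    int caller(BlockDriverState *bs)
    {
        int ret;

        bdrv_drain_all_begin();
        bdrv_graph_rdlock_main_loop();
        ret = helper(bs);
        bdrv_graph_rdunlock_main_loop();
        bdrv_drain_all_end();
        return ret;
    }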

In bdrv_graph_wrlock() there is a comment that it uses
bdrv_drain_all_begin_nopoll() to make sure that constantly arriving
new I/O doesn't cause starvation. The changes from this series are at
odds with that, but there don't seem to be any (new) test failures.

[0]: https://lore.kernel.org/qemu-devel/73839c04-7616-407e-b057-80ca69e63f51@virtuozzo.com/

I did not CC the maintainers for blkverify, vmdk, etc., because this
is just an RFC and the changes there are mechanical, adapting some
callers of block layer functions.

Andrey Drobyshev (1):
  iotests/graph-changes-while-io: add test case with removal of lower
    snapshot

Fiona Ebner (10):
  block: remove outdated comments about AioContext locking
  block: move drain outside of read-locked bdrv_reopen_queue_child()
  block/snapshot: move drain outside of read-locked
    bdrv_snapshot_delete()
  block: drain while unlocked in bdrv_reopen_parse_file_or_backing()
  block: move drain outside of read-locked bdrv_inactivate_recurse()
  blockdev: drain while unlocked in internal_snapshot_action()
  blockdev: drain while unlocked in external_snapshot_action()
  block: mark bdrv_drained_begin() as GRAPH_UNLOCKED
  block: move drain out of bdrv_change_aio_context()
  block/graph-lock: add drain flag to bdrv_graph_wr{,un}lock

 block.c                                       | 170 +++++++++++-------
 block/backup.c                                |   4 +-
 block/blklogwrites.c                          |   8 +-
 block/blkverify.c                             |   4 +-
 block/block-backend.c                         |   8 +-
 block/commit.c                                |  16 +-
 block/graph-lock.c                            |  22 ++-
 block/mirror.c                                |  22 +--
 block/qcow2.c                                 |   4 +-
 block/quorum.c                                |   8 +-
 block/replication.c                           |  14 +-
 block/snapshot.c                              |  22 ++-
 block/stream.c                                |  18 +-
 block/vmdk.c                                  |  20 +--
 blockdev.c                                    |  59 ++++--
 blockjob.c                                    |  12 +-
 include/block/block-global-state.h            |   8 +-
 include/block/block-io.h                      |   3 +-
 include/block/graph-lock.h                    |   8 +-
 qemu-img.c                                    |   2 +
 scripts/block-coroutine-wrapper.py            |   4 +-
 .../qemu-iotests/tests/graph-changes-while-io | 101 ++++++++++-
 .../tests/graph-changes-while-io.out          |   4 +-
 tests/unit/test-bdrv-drain.c                  |  40 ++---
 tests/unit/test-bdrv-graph-mod.c              |  20 +--
 25 files changed, 384 insertions(+), 217 deletions(-)

-- 
2.39.5

* [PATCH 01/11] block: remove outdated comments about AioContext locking
  2025-05-08 14:09 [RFC 00/11] do not drain while holding the graph lock Fiona Ebner
@ 2025-05-08 14:09 ` Fiona Ebner
  2025-05-14 16:22   ` Kevin Wolf
  2025-05-08 14:09 ` [PATCH 02/11] block: move drain outside of read-locked bdrv_reopen_queue_child() Fiona Ebner
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 30+ messages in thread
From: Fiona Ebner @ 2025-05-08 14:09 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov

AioContext locking was removed in commit b49f4755c7 ("block: remove
AioContext locking").

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
 block.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/block.c b/block.c
index 0ece805e41..33f78dbe1c 100644
--- a/block.c
+++ b/block.c
@@ -4359,8 +4359,6 @@ bdrv_recurse_has_child(BlockDriverState *bs, BlockDriverState *child)
  * bs_queue, or the existing bs_queue being used.
  *
  * bs is drained here and undrained by bdrv_reopen_queue_free().
- *
- * To be called with bs->aio_context locked.
  */
 static BlockReopenQueue * GRAPH_RDLOCK
 bdrv_reopen_queue_child(BlockReopenQueue *bs_queue, BlockDriverState *bs,
@@ -4519,7 +4517,6 @@ bdrv_reopen_queue_child(BlockReopenQueue *bs_queue, BlockDriverState *bs,
     return bs_queue;
 }
 
-/* To be called with bs->aio_context locked */
 BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue *bs_queue,
                                     BlockDriverState *bs,
                                     QDict *options, bool keep_old_opts)
-- 
2.39.5

* [PATCH 02/11] block: move drain outside of read-locked bdrv_reopen_queue_child()
  2025-05-08 14:09 [RFC 00/11] do not drain while holding the graph lock Fiona Ebner
  2025-05-08 14:09 ` [PATCH 01/11] block: remove outdated comments about AioContext locking Fiona Ebner
@ 2025-05-08 14:09 ` Fiona Ebner
  2025-05-14 16:36   ` Kevin Wolf
  2025-05-08 14:09 ` [PATCH 03/11] block/snapshot: move drain outside of read-locked bdrv_snapshot_delete() Fiona Ebner
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 30+ messages in thread
From: Fiona Ebner @ 2025-05-08 14:09 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov

This is in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED.

More granular draining is not trivially possible, because
bdrv_reopen_queue_child() can recursively call itself.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
 block.c | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/block.c b/block.c
index 33f78dbe1c..0085dbfa74 100644
--- a/block.c
+++ b/block.c
@@ -4358,7 +4358,7 @@ bdrv_recurse_has_child(BlockDriverState *bs, BlockDriverState *child)
  * returns a pointer to bs_queue, which is either the newly allocated
  * bs_queue, or the existing bs_queue being used.
  *
- * bs is drained here and undrained by bdrv_reopen_queue_free().
+ * bs needs to be drained
  */
 static BlockReopenQueue * GRAPH_RDLOCK
 bdrv_reopen_queue_child(BlockReopenQueue *bs_queue, BlockDriverState *bs,
@@ -4368,6 +4368,7 @@ bdrv_reopen_queue_child(BlockReopenQueue *bs_queue, BlockDriverState *bs,
                         bool keep_old_opts)
 {
     assert(bs != NULL);
+    assert(qatomic_read(&bs->quiesce_counter) > 0);
 
     BlockReopenQueueEntry *bs_entry;
     BdrvChild *child;
@@ -4377,13 +4378,6 @@ bdrv_reopen_queue_child(BlockReopenQueue *bs_queue, BlockDriverState *bs,
 
     GLOBAL_STATE_CODE();
 
-    /*
-     * Strictly speaking, draining is illegal under GRAPH_RDLOCK. We know that
-     * we've been called with bdrv_graph_rdlock_main_loop(), though, so it's ok
-     * in practice.
-     */
-    bdrv_drained_begin(bs);
-
     if (bs_queue == NULL) {
         bs_queue = g_new0(BlockReopenQueue, 1);
         QTAILQ_INIT(bs_queue);
@@ -4522,6 +4516,14 @@ BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue *bs_queue,
                                     QDict *options, bool keep_old_opts)
 {
     GLOBAL_STATE_CODE();
+
+    if (bs_queue == NULL) {
+        /*
+         * paired with bdrv_drain_all_end() in bdrv_reopen_queue_free().
+         */
+        bdrv_drain_all_begin();
+    }
+
     GRAPH_RDLOCK_GUARD_MAINLOOP();
 
     return bdrv_reopen_queue_child(bs_queue, bs, options, NULL, 0, false,
@@ -4534,12 +4536,16 @@ void bdrv_reopen_queue_free(BlockReopenQueue *bs_queue)
     if (bs_queue) {
         BlockReopenQueueEntry *bs_entry, *next;
         QTAILQ_FOREACH_SAFE(bs_entry, bs_queue, entry, next) {
-            bdrv_drained_end(bs_entry->state.bs);
             qobject_unref(bs_entry->state.explicit_options);
             qobject_unref(bs_entry->state.options);
             g_free(bs_entry);
         }
         g_free(bs_queue);
+
+        /*
+         * paired with bdrv_drain_all_begin() in bdrv_reopen_queue().
+         */
+        bdrv_drain_all_end();
     }
 }
 
-- 
2.39.5

* [PATCH 03/11] block/snapshot: move drain outside of read-locked bdrv_snapshot_delete()
  2025-05-08 14:09 [RFC 00/11] do not drain while holding the graph lock Fiona Ebner
  2025-05-08 14:09 ` [PATCH 01/11] block: remove outdated comments about AioContext locking Fiona Ebner
  2025-05-08 14:09 ` [PATCH 02/11] block: move drain outside of read-locked bdrv_reopen_queue_child() Fiona Ebner
@ 2025-05-08 14:09 ` Fiona Ebner
  2025-05-14 16:52   ` Kevin Wolf
  2025-05-08 14:09 ` [PATCH 04/11] block: drain while unlocked in bdrv_reopen_parse_file_or_backing() Fiona Ebner
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 30+ messages in thread
From: Fiona Ebner @ 2025-05-08 14:09 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov

This is in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED.

More granular draining is not trivially possible, because
bdrv_snapshot_delete() can recursively call itself.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
 block/snapshot.c | 18 ++++++++++++------
 blockdev.c       | 25 +++++++++++++++++--------
 qemu-img.c       |  2 ++
 3 files changed, 31 insertions(+), 14 deletions(-)

diff --git a/block/snapshot.c b/block/snapshot.c
index 22567f1fb9..7788e1130b 100644
--- a/block/snapshot.c
+++ b/block/snapshot.c
@@ -327,7 +327,7 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
 
 /**
  * Delete an internal snapshot by @snapshot_id and @name.
- * @bs: block device used in the operation
+ * @bs: block device used in the operation, needs to be drained
  * @snapshot_id: unique snapshot ID, or NULL
  * @name: snapshot name, or NULL
  * @errp: location to store error
@@ -356,6 +356,8 @@ int bdrv_snapshot_delete(BlockDriverState *bs,
     BlockDriverState *fallback_bs = bdrv_snapshot_fallback(bs);
     int ret;
 
+    assert(qatomic_read(&bs->quiesce_counter) > 0);
+
     GLOBAL_STATE_CODE();
 
     if (!drv) {
@@ -368,9 +370,6 @@ int bdrv_snapshot_delete(BlockDriverState *bs,
         return -EINVAL;
     }
 
-    /* drain all pending i/o before deleting snapshot */
-    bdrv_drained_begin(bs);
-
     if (drv->bdrv_snapshot_delete) {
         ret = drv->bdrv_snapshot_delete(bs, snapshot_id, name, errp);
     } else if (fallback_bs) {
@@ -382,7 +381,6 @@ int bdrv_snapshot_delete(BlockDriverState *bs,
         ret = -ENOTSUP;
     }
 
-    bdrv_drained_end(bs);
     return ret;
 }
 
@@ -573,9 +571,13 @@ int bdrv_all_delete_snapshot(const char *name,
     GList *iterbdrvs;
 
     GLOBAL_STATE_CODE();
-    GRAPH_RDLOCK_GUARD_MAINLOOP();
+
+    bdrv_drain_all_begin();
+    bdrv_graph_rdlock_main_loop();
 
     if (bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp) < 0) {
+        bdrv_graph_rdunlock_main_loop();
+        bdrv_drain_all_end();
         return -1;
     }
 
@@ -594,12 +596,16 @@ int bdrv_all_delete_snapshot(const char *name,
         if (ret < 0) {
             error_prepend(errp, "Could not delete snapshot '%s' on '%s': ",
                           name, bdrv_get_device_or_node_name(bs));
+            bdrv_graph_rdunlock_main_loop();
+            bdrv_drain_all_end();
             return -1;
         }
 
         iterbdrvs = iterbdrvs->next;
     }
 
+    bdrv_graph_rdunlock_main_loop();
+    bdrv_drain_all_end();
     return 0;
 }
 
diff --git a/blockdev.c b/blockdev.c
index 1d1f27cfff..1272b9a745 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1132,39 +1132,41 @@ SnapshotInfo *qmp_blockdev_snapshot_delete_internal_sync(const char *device,
     int ret;
 
     GLOBAL_STATE_CODE();
-    GRAPH_RDLOCK_GUARD_MAINLOOP();
+
+    bdrv_drain_all_begin();
+    bdrv_graph_rdlock_main_loop();
 
     bs = qmp_get_root_bs(device, errp);
     if (!bs) {
-        return NULL;
+        goto error;
     }
 
     if (!id && !name) {
         error_setg(errp, "Name or id must be provided");
-        return NULL;
+        goto error;
     }
 
     if (bdrv_op_is_blocked(bs, BLOCK_OP_TYPE_INTERNAL_SNAPSHOT_DELETE, errp)) {
-        return NULL;
+        goto error;
     }
 
     ret = bdrv_snapshot_find_by_id_and_name(bs, id, name, &sn, &local_err);
     if (local_err) {
         error_propagate(errp, local_err);
-        return NULL;
+        goto error;
     }
     if (!ret) {
         error_setg(errp,
                    "Snapshot with id '%s' and name '%s' does not exist on "
                    "device '%s'",
                    STR_OR_NULL(id), STR_OR_NULL(name), device);
-        return NULL;
+        goto error;
     }
 
     bdrv_snapshot_delete(bs, id, name, &local_err);
     if (local_err) {
         error_propagate(errp, local_err);
-        return NULL;
+        goto error;
     }
 
     info = g_new0(SnapshotInfo, 1);
@@ -1180,6 +1182,9 @@ SnapshotInfo *qmp_blockdev_snapshot_delete_internal_sync(const char *device,
         info->has_icount = true;
     }
 
+error:
+    bdrv_graph_rdunlock_main_loop();
+    bdrv_drain_all_end();
     return info;
 }
 
@@ -1295,12 +1300,14 @@ static void internal_snapshot_abort(void *opaque)
     Error *local_error = NULL;
 
     GLOBAL_STATE_CODE();
-    GRAPH_RDLOCK_GUARD_MAINLOOP();
 
     if (!state->created) {
         return;
     }
 
+    bdrv_drain_all_begin();
+    bdrv_graph_rdlock_main_loop();
+
     if (bdrv_snapshot_delete(bs, sn->id_str, sn->name, &local_error) < 0) {
         error_reportf_err(local_error,
                           "Failed to delete snapshot with id '%s' and "
@@ -1308,6 +1315,8 @@ static void internal_snapshot_abort(void *opaque)
                           sn->id_str, sn->name,
                           bdrv_get_device_name(bs));
     }
+    bdrv_graph_rdunlock_main_loop();
+    bdrv_drain_all_end();
 }
 
 static void internal_snapshot_clean(void *opaque)
diff --git a/qemu-img.c b/qemu-img.c
index 76ac5d3028..e81b0fbb6c 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3505,6 +3505,7 @@ static int img_snapshot(int argc, char **argv)
         break;
 
     case SNAPSHOT_DELETE:
+        bdrv_drain_all_begin();
         bdrv_graph_rdlock_main_loop();
         ret = bdrv_snapshot_find(bs, &sn, snapshot_name);
         if (ret < 0) {
@@ -3520,6 +3521,7 @@ static int img_snapshot(int argc, char **argv)
             }
         }
         bdrv_graph_rdunlock_main_loop();
+        bdrv_drain_all_end();
         break;
     }
 
-- 
2.39.5

* [PATCH 04/11] block: drain while unlocked in bdrv_reopen_parse_file_or_backing()
  2025-05-08 14:09 [RFC 00/11] do not drain while holding the graph lock Fiona Ebner
                   ` (2 preceding siblings ...)
  2025-05-08 14:09 ` [PATCH 03/11] block/snapshot: move drain outside of read-locked bdrv_snapshot_delete() Fiona Ebner
@ 2025-05-08 14:09 ` Fiona Ebner
  2025-05-14 16:59   ` Kevin Wolf
  2025-05-08 14:09 ` [PATCH 05/11] block: move drain outside of read-locked bdrv_inactivate_recurse() Fiona Ebner
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 30+ messages in thread
From: Fiona Ebner @ 2025-05-08 14:09 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov

This is in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
 block.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/block.c b/block.c
index 0085dbfa74..c7c26533c9 100644
--- a/block.c
+++ b/block.c
@@ -4805,10 +4805,11 @@ bdrv_reopen_parse_file_or_backing(BDRVReopenState *reopen_state,
 
     if (old_child_bs) {
         bdrv_ref(old_child_bs);
+        bdrv_graph_rdunlock_main_loop();
         bdrv_drained_begin(old_child_bs);
+    } else {
+        bdrv_graph_rdunlock_main_loop();
     }
-
-    bdrv_graph_rdunlock_main_loop();
     bdrv_graph_wrlock();
 
     ret = bdrv_set_file_or_backing_noperm(bs, new_child_bs, is_backing,
-- 
2.39.5

* [PATCH 05/11] block: move drain outside of read-locked bdrv_inactivate_recurse()
  2025-05-08 14:09 [RFC 00/11] do not drain while holding the graph lock Fiona Ebner
                   ` (3 preceding siblings ...)
  2025-05-08 14:09 ` [PATCH 04/11] block: drain while unlocked in bdrv_reopen_parse_file_or_backing() Fiona Ebner
@ 2025-05-08 14:09 ` Fiona Ebner
  2025-05-14 17:06   ` Kevin Wolf
  2025-05-08 14:09 ` [PATCH 06/11] blockdev: drain while unlocked in internal_snapshot_action() Fiona Ebner
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 30+ messages in thread
From: Fiona Ebner @ 2025-05-08 14:09 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov

This is in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED.

More granular draining is not trivially possible, because
bdrv_inactivate_recurse() can recursively call itself.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
 block.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/block.c b/block.c
index c7c26533c9..10052027cd 100644
--- a/block.c
+++ b/block.c
@@ -6991,6 +6991,8 @@ bdrv_inactivate_recurse(BlockDriverState *bs, bool top_level)
     int ret;
     uint64_t cumulative_perms, cumulative_shared_perms;
 
+    assert(qatomic_read(&bs->quiesce_counter) > 0);
+
     GLOBAL_STATE_CODE();
 
     if (!bs->drv) {
@@ -7036,9 +7038,7 @@ bdrv_inactivate_recurse(BlockDriverState *bs, bool top_level)
         return -EPERM;
     }
 
-    bdrv_drained_begin(bs);
     bs->open_flags |= BDRV_O_INACTIVE;
-    bdrv_drained_end(bs);
 
     /*
      * Update permissions, they may differ for inactive nodes.
@@ -7063,14 +7063,20 @@ int bdrv_inactivate(BlockDriverState *bs, Error **errp)
     int ret;
 
     GLOBAL_STATE_CODE();
-    GRAPH_RDLOCK_GUARD_MAINLOOP();
+
+    bdrv_drain_all_begin();
+    bdrv_graph_rdlock_main_loop();
 
     if (bdrv_has_bds_parent(bs, true)) {
         error_setg(errp, "Node has active parent node");
+        bdrv_graph_rdunlock_main_loop();
+        bdrv_drain_all_end();
         return -EPERM;
     }
 
     ret = bdrv_inactivate_recurse(bs, true);
+    bdrv_graph_rdunlock_main_loop();
+    bdrv_drain_all_end();
     if (ret < 0) {
         error_setg_errno(errp, -ret, "Failed to inactivate node");
         return ret;
@@ -7086,7 +7092,9 @@ int bdrv_inactivate_all(void)
     int ret = 0;
 
     GLOBAL_STATE_CODE();
-    GRAPH_RDLOCK_GUARD_MAINLOOP();
+
+    bdrv_drain_all_begin();
+    bdrv_graph_rdlock_main_loop();
 
     for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) {
         /* Nodes with BDS parents are covered by recursion from the last
@@ -7102,6 +7110,9 @@ int bdrv_inactivate_all(void)
         }
     }
 
+    bdrv_graph_rdunlock_main_loop();
+    bdrv_drain_all_end();
+
     return ret;
 }
 
-- 
2.39.5

* [PATCH 06/11] blockdev: drain while unlocked in internal_snapshot_action()
  2025-05-08 14:09 [RFC 00/11] do not drain while holding the graph lock Fiona Ebner
                   ` (4 preceding siblings ...)
  2025-05-08 14:09 ` [PATCH 05/11] block: move drain outside of read-locked bdrv_inactivate_recurse() Fiona Ebner
@ 2025-05-08 14:09 ` Fiona Ebner
  2025-05-14 17:26   ` Kevin Wolf
  2025-05-08 14:09 ` [PATCH 07/11] blockdev: drain while unlocked in external_snapshot_action() Fiona Ebner
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 30+ messages in thread
From: Fiona Ebner @ 2025-05-08 14:09 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---

Could the bs associated with the device change because of polling
when draining? If so, does that mean we need to drain all nodes at
the beginning instead of temporarily unlocking?

 blockdev.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/blockdev.c b/blockdev.c
index 1272b9a745..f2b4fdf1b3 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1216,7 +1216,7 @@ static void internal_snapshot_action(BlockdevSnapshotInternal *internal,
     int ret1;
 
     GLOBAL_STATE_CODE();
-    GRAPH_RDLOCK_GUARD_MAINLOOP();
+    bdrv_graph_rdlock_main_loop();
 
     tran_add(tran, &internal_snapshot_drv, state);
 
@@ -1225,14 +1225,18 @@ static void internal_snapshot_action(BlockdevSnapshotInternal *internal,
 
     bs = qmp_get_root_bs(device, errp);
     if (!bs) {
+        bdrv_graph_rdunlock_main_loop();
         return;
     }
 
     state->bs = bs;
 
+    bdrv_graph_rdunlock_main_loop();
     /* Paired with .clean() */
     bdrv_drained_begin(bs);
 
+    GRAPH_RDLOCK_GUARD_MAINLOOP();
+
     if (bdrv_op_is_blocked(bs, BLOCK_OP_TYPE_INTERNAL_SNAPSHOT, errp)) {
         return;
     }
-- 
2.39.5

* [PATCH 07/11] blockdev: drain while unlocked in external_snapshot_action()
  2025-05-08 14:09 [RFC 00/11] do not drain while holding the graph lock Fiona Ebner
                   ` (5 preceding siblings ...)
  2025-05-08 14:09 ` [PATCH 06/11] blockdev: drain while unlocked in internal_snapshot_action() Fiona Ebner
@ 2025-05-08 14:09 ` Fiona Ebner
  2025-05-08 14:09 ` [PATCH 08/11] block: mark bdrv_drained_begin() as GRAPH_UNLOCKED Fiona Ebner
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Fiona Ebner @ 2025-05-08 14:09 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---

Could the bs associated with the device change because of polling
when draining? If so, does that mean we need to drain all nodes at
the beginning instead of temporarily unlocking?

 blockdev.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/blockdev.c b/blockdev.c
index f2b4fdf1b3..4a672f9bac 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1368,7 +1368,7 @@ static void external_snapshot_action(TransactionAction *action,
     uint64_t perm, shared;
 
     /* TODO We'll eventually have to take a writer lock in this function */
-    GRAPH_RDLOCK_GUARD_MAINLOOP();
+    bdrv_graph_rdlock_main_loop();
 
     tran_add(tran, &external_snapshot_drv, state);
 
@@ -1401,11 +1401,14 @@ static void external_snapshot_action(TransactionAction *action,
 
     state->old_bs = bdrv_lookup_bs(device, node_name, errp);
     if (!state->old_bs) {
+        bdrv_graph_rdunlock_main_loop();
         return;
     }
 
+    bdrv_graph_rdunlock_main_loop();
     /* Paired with .clean() */
     bdrv_drained_begin(state->old_bs);
+    GRAPH_RDLOCK_GUARD_MAINLOOP();
 
     if (!bdrv_is_inserted(state->old_bs)) {
         error_setg(errp, "Device '%s' has no medium",
-- 
2.39.5

* [PATCH 08/11] block: mark bdrv_drained_begin() as GRAPH_UNLOCKED
  2025-05-08 14:09 [RFC 00/11] do not drain while holding the graph lock Fiona Ebner
                   ` (6 preceding siblings ...)
  2025-05-08 14:09 ` [PATCH 07/11] blockdev: drain while unlocked in external_snapshot_action() Fiona Ebner
@ 2025-05-08 14:09 ` Fiona Ebner
  2025-05-14 17:50   ` Kevin Wolf
  2025-05-08 14:09 ` [PATCH 09/11] block: move drain out of bdrv_change_aio_context() Fiona Ebner
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 30+ messages in thread
From: Fiona Ebner @ 2025-05-08 14:09 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov

bdrv_drained_begin() polls and is not allowed to be called with the
block graph lock held. Mark the function as such.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---

This does not catch the issue reported by Andrey, because there
is a bdrv_graph_rdunlock_main_loop() before bdrv_drained_begin() in
the function bdrv_change_aio_context(). That unlock is of course
ineffective if the exclusive lock is held, but it seems to prevent TSA
from finding the issue.

The next patch is concerned with that.

 include/block/block-io.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/block/block-io.h b/include/block/block-io.h
index b49e0537dd..34b9f1cbfc 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -429,7 +429,8 @@ bdrv_drain_poll(BlockDriverState *bs, BdrvChild *ignore_parent,
  *
  * This function can be recursive.
  */
-void bdrv_drained_begin(BlockDriverState *bs);
+void GRAPH_UNLOCKED
+bdrv_drained_begin(BlockDriverState *bs);
 
 /**
  * bdrv_do_drained_begin_quiesce:
-- 
2.39.5

* [PATCH 09/11] block: move drain out of bdrv_change_aio_context()
  2025-05-08 14:09 [RFC 00/11] do not drain while holding the graph lock Fiona Ebner
                   ` (7 preceding siblings ...)
  2025-05-08 14:09 ` [PATCH 08/11] block: mark bdrv_drained_begin() as GRAPH_UNLOCKED Fiona Ebner
@ 2025-05-08 14:09 ` Fiona Ebner
  2025-05-14 19:45   ` Kevin Wolf
  2025-05-08 14:09 ` [PATCH 10/11] block/graph-lock: add drain flag to bdrv_graph_wr{,un}lock Fiona Ebner
  2025-05-08 14:09 ` [PATCH 11/11] iotests/graph-changes-while-io: add test case with removal of lower snapshot Fiona Ebner
  10 siblings, 1 reply; 30+ messages in thread
From: Fiona Ebner @ 2025-05-08 14:09 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov

While bdrv_drained_begin() is now marked as GRAPH_UNLOCKED, TSA is not
able to catch a remaining instance where the function is called with
the graph lock held in bdrv_change_aio_context(). This is because the
call is preceded by a bdrv_graph_rdunlock_main_loop(), but for callers
that hold the exclusive lock that does not actually release the lock.

In combination with block-stream, there is a deadlock that can happen
because of this [0]. In particular, it can happen that
main thread              IO thread
1. acquires write lock
                         in blk_co_do_preadv_part():
                         2. have non-zero blk->in_flight
                         3. try to acquire read lock
4. begin drain

Steps 3 and 4 can also occur in the opposite order. Draining will poll and get stuck,
because it will see the non-zero in_flight counter. But the IO thread
will not make any progress either, because it cannot acquire the read
lock.
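
Sketched as code (a simplified sketch; in reality, the in_flight
increment and the read lock acquisition happen inside block-backend
helpers and the generated coroutine wrappers):

    /* main thread */
    bdrv_graph_wrlock();     /* 1. acquires the exclusive write lock */
    bdrv_drained_begin(bs);  /* 4. polls until in_flight reaches 0   */

    /* IO thread, around blk_co_do_preadv_part() */
    blk_inc_in_flight(blk);  /* 2. blk->in_flight is now non-zero    */
    bdrv_graph_co_rdlock();  /* 3. blocks while the write lock is
                              *    held, so in_flight never drops    */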

More granular draining is not trivially possible, because
bdrv_change_aio_context() can recursively call itself e.g. via
bdrv_child_change_aio_context().

[0]: https://lore.kernel.org/qemu-devel/73839c04-7616-407e-b057-80ca69e63f51@virtuozzo.com/

Reported-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
 block.c                            | 91 ++++++++++++++++++++----------
 block/backup.c                     |  2 +
 block/blklogwrites.c               |  4 ++
 block/blkverify.c                  |  2 +
 block/block-backend.c              |  4 ++
 block/commit.c                     |  4 ++
 block/mirror.c                     |  5 ++
 block/qcow2.c                      |  2 +
 block/quorum.c                     |  4 ++
 block/replication.c                |  7 +++
 block/snapshot.c                   |  2 +
 block/stream.c                     | 10 ++--
 block/vmdk.c                       | 10 ++++
 blockdev.c                         | 17 ++++--
 blockjob.c                         |  6 ++
 include/block/block-global-state.h |  8 ++-
 tests/unit/test-bdrv-drain.c       | 18 ++++++
 tests/unit/test-bdrv-graph-mod.c   | 10 ++++
 18 files changed, 164 insertions(+), 42 deletions(-)

diff --git a/block.c b/block.c
index 10052027cd..d84ac2f394 100644
--- a/block.c
+++ b/block.c
@@ -1720,12 +1720,14 @@ bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv, const char *node_name,
 open_failed:
     bs->drv = NULL;
 
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     if (bs->file != NULL) {
         bdrv_unref_child(bs, bs->file);
         assert(!bs->file);
     }
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
 
     g_free(bs->opaque);
     bs->opaque = NULL;
@@ -3027,7 +3029,8 @@ static void GRAPH_WRLOCK bdrv_attach_child_common_abort(void *opaque)
     bdrv_replace_child_noperm(s->child, NULL);
 
     if (bdrv_get_aio_context(bs) != s->old_child_ctx) {
-        bdrv_try_change_aio_context(bs, s->old_child_ctx, NULL, &error_abort);
+        bdrv_try_change_aio_context_locked(bs, s->old_child_ctx, NULL,
+                                           &error_abort);
     }
 
     if (bdrv_child_get_parent_aio_context(s->child) != s->old_parent_ctx) {
@@ -3112,8 +3115,8 @@ bdrv_attach_child_common(BlockDriverState *child_bs,
     parent_ctx = bdrv_child_get_parent_aio_context(new_child);
     if (child_ctx != parent_ctx) {
         Error *local_err = NULL;
-        int ret = bdrv_try_change_aio_context(child_bs, parent_ctx, NULL,
-                                              &local_err);
+        int ret = bdrv_try_change_aio_context_locked(child_bs, parent_ctx, NULL,
+                                                     &local_err);
 
         if (ret < 0 && child_class->change_aio_ctx) {
             Transaction *aio_ctx_tran = tran_new();
@@ -3314,8 +3317,8 @@ void bdrv_root_unref_child(BdrvChild *child)
          * When the parent requiring a non-default AioContext is removed, the
          * node moves back to the main AioContext
          */
-        bdrv_try_change_aio_context(child_bs, qemu_get_aio_context(), NULL,
-                                    NULL);
+        bdrv_try_change_aio_context_locked(child_bs, qemu_get_aio_context(),
+                                           NULL, NULL);
     }
 
     bdrv_schedule_unref(child_bs);
@@ -3584,11 +3587,11 @@ int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
     bdrv_graph_rdunlock_main_loop();
 
     bdrv_ref(drain_bs);
-    bdrv_drained_begin(drain_bs);
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     ret = bdrv_set_backing_hd_drained(bs, backing_hd, errp);
     bdrv_graph_wrunlock();
-    bdrv_drained_end(drain_bs);
+    bdrv_drain_all_end();
     bdrv_unref(drain_bs);
 
     return ret;
@@ -3780,10 +3783,12 @@ static BdrvChild *bdrv_open_child_common(const char *filename,
         return NULL;
     }
 
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     child = bdrv_attach_child(parent, bs, bdref_key, child_class, child_role,
                               errp);
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
 
     return child;
 }
@@ -4805,20 +4810,19 @@ bdrv_reopen_parse_file_or_backing(BDRVReopenState *reopen_state,
 
     if (old_child_bs) {
         bdrv_ref(old_child_bs);
-        bdrv_graph_rdunlock_main_loop();
-        bdrv_drained_begin(old_child_bs);
-    } else {
-        bdrv_graph_rdunlock_main_loop();
     }
+
+    bdrv_graph_rdunlock_main_loop();
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
 
     ret = bdrv_set_file_or_backing_noperm(bs, new_child_bs, is_backing,
                                           tran, errp);
 
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
 
     if (old_child_bs) {
-        bdrv_drained_end(old_child_bs);
         bdrv_unref(old_child_bs);
     }
 
@@ -5160,6 +5164,7 @@ static void bdrv_close(BlockDriverState *bs)
         bs->drv = NULL;
     }
 
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     QLIST_FOREACH_SAFE(child, &bs->children, next, next) {
         bdrv_unref_child(bs, child);
@@ -5168,6 +5173,7 @@ static void bdrv_close(BlockDriverState *bs)
     assert(!bs->backing);
     assert(!bs->file);
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
 
     g_free(bs->opaque);
     bs->opaque = NULL;
@@ -5493,9 +5499,7 @@ int bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top,
     assert(!bs_new->backing);
     bdrv_graph_rdunlock_main_loop();
 
-    bdrv_drained_begin(bs_top);
-    bdrv_drained_begin(bs_new);
-
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
 
     child = bdrv_attach_child_noperm(bs_new, bs_top, "backing",
@@ -5517,9 +5521,7 @@ out:
 
     bdrv_refresh_limits(bs_top, NULL, NULL);
     bdrv_graph_wrunlock();
-
-    bdrv_drained_end(bs_top);
-    bdrv_drained_end(bs_new);
+    bdrv_drain_all_end();
 
     return ret;
 }
@@ -7626,10 +7628,6 @@ bool bdrv_child_change_aio_context(BdrvChild *c, AioContext *ctx,
 static void bdrv_set_aio_context_clean(void *opaque)
 {
     BdrvStateSetAioContext *state = (BdrvStateSetAioContext *) opaque;
-    BlockDriverState *bs = (BlockDriverState *) state->bs;
-
-    /* Paired with bdrv_drained_begin in bdrv_change_aio_context() */
-    bdrv_drained_end(bs);
 
     g_free(state);
 }
@@ -7693,8 +7691,7 @@ static bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
         .bs = bs,
     };
 
-    /* Paired with bdrv_drained_end in bdrv_set_aio_context_clean() */
-    bdrv_drained_begin(bs);
+    assert(qatomic_read(&bs->quiesce_counter) > 0);
 
     tran_add(tran, &set_aio_context, state);
 
@@ -7702,14 +7699,12 @@ static bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
 }
 
 /*
- * Change bs's and recursively all of its parents' and children's AioContext
- * to the given new context, returning an error if that isn't possible.
- *
- * If ignore_child is not NULL, that child (and its subgraph) will not
- * be touched.
+ * Use bdrv_try_change_aio_context() or bdrv_try_change_aio_context_locked().
  */
-int bdrv_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
-                                BdrvChild *ignore_child, Error **errp)
+static int bdrv_try_change_aio_context_common(BlockDriverState *bs,
+                                              AioContext *ctx,
+                                              BdrvChild *ignore_child,
+                                              Error **errp)
 {
     Transaction *tran;
     GHashTable *visited;
@@ -7747,6 +7742,40 @@ int bdrv_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
     return 0;
 }
 
+/*
+ * Change bs's and recursively all of its parents' and children's AioContext
+ * to the given new context, returning an error if that isn't possible.
+ *
+ * If ignore_child is not NULL, that child (and its subgraph) will not
+ * be touched.
+ */
+int bdrv_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
+                                BdrvChild *ignore_child, Error **errp)
+{
+    int ret;
+    bdrv_drain_all_begin();
+    ret = bdrv_try_change_aio_context_common(bs, ctx, ignore_child, errp);
+    bdrv_drain_all_end();
+    return ret;
+}
+
+/*
+ * Change bs's and recursively all of its parents' and children's AioContext
+ * to the given new context, returning an error if that isn't possible.
+ *
+ * If ignore_child is not NULL, that child (and its subgraph) will not
+ * be touched.
+ *
+ * Called with the graph lock held.
+ *
+ * Called while all bs are drained.
+ */
+int bdrv_try_change_aio_context_locked(BlockDriverState *bs, AioContext *ctx,
+                                       BdrvChild *ignore_child, Error **errp)
+{
+    return bdrv_try_change_aio_context_common(bs, ctx, ignore_child, errp);
+}
+
 void bdrv_add_aio_context_notifier(BlockDriverState *bs,
         void (*attached_aio_context)(AioContext *new_context, void *opaque),
         void (*detach_aio_context)(void *opaque), void *opaque)
diff --git a/block/backup.c b/block/backup.c
index 79652bf57b..9d55e55b79 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -497,10 +497,12 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
     block_copy_set_speed(bcs, speed);
 
     /* Required permissions are taken by copy-before-write filter target */
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     block_job_add_bdrv(&job->common, "target", target, 0, BLK_PERM_ALL,
                        &error_abort);
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
 
     return &job->common;
 
diff --git a/block/blklogwrites.c b/block/blklogwrites.c
index b0f78c4bc7..70ac76f401 100644
--- a/block/blklogwrites.c
+++ b/block/blklogwrites.c
@@ -281,9 +281,11 @@ static int blk_log_writes_open(BlockDriverState *bs, QDict *options, int flags,
     ret = 0;
 fail_log:
     if (ret < 0) {
+        bdrv_drain_all_begin();
         bdrv_graph_wrlock();
         bdrv_unref_child(bs, s->log_file);
         bdrv_graph_wrunlock();
+        bdrv_drain_all_end();
         s->log_file = NULL;
         qemu_mutex_destroy(&s->mutex);
     }
@@ -296,10 +298,12 @@ static void blk_log_writes_close(BlockDriverState *bs)
 {
     BDRVBlkLogWritesState *s = bs->opaque;
 
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     bdrv_unref_child(bs, s->log_file);
     s->log_file = NULL;
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
     qemu_mutex_destroy(&s->mutex);
 }
 
diff --git a/block/blkverify.c b/block/blkverify.c
index db79a36681..3a71f7498c 100644
--- a/block/blkverify.c
+++ b/block/blkverify.c
@@ -151,10 +151,12 @@ static void blkverify_close(BlockDriverState *bs)
 {
     BDRVBlkverifyState *s = bs->opaque;
 
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     bdrv_unref_child(bs, s->test_file);
     s->test_file = NULL;
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
 }
 
 static int64_t coroutine_fn GRAPH_RDLOCK
diff --git a/block/block-backend.c b/block/block-backend.c
index a402db13f2..2b30a34662 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -889,9 +889,11 @@ void blk_remove_bs(BlockBackend *blk)
     root = blk->root;
     blk->root = NULL;
 
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     bdrv_root_unref_child(root);
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
 }
 
 /*
@@ -904,6 +906,7 @@ int blk_insert_bs(BlockBackend *blk, BlockDriverState *bs, Error **errp)
 
     GLOBAL_STATE_CODE();
     bdrv_ref(bs);
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
 
     if ((bs->open_flags & BDRV_O_INACTIVE) && blk_can_inactivate(blk)) {
@@ -919,6 +922,7 @@ int blk_insert_bs(BlockBackend *blk, BlockDriverState *bs, Error **errp)
                                        BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
                                        perm, shared_perm, blk, errp);
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
     if (blk->root == NULL) {
         return -EPERM;
     }
diff --git a/block/commit.c b/block/commit.c
index 5df3d05346..6c06b894ff 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -342,6 +342,7 @@ void commit_start(const char *job_id, BlockDriverState *bs,
      * this is the responsibility of the interface (i.e. whoever calls
      * commit_start()).
      */
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     s->base_overlay = bdrv_find_overlay(top, base);
     assert(s->base_overlay);
@@ -374,18 +375,21 @@ void commit_start(const char *job_id, BlockDriverState *bs,
                                  iter_shared_perms, errp);
         if (ret < 0) {
             bdrv_graph_wrunlock();
+            bdrv_drain_all_end();
             goto fail;
         }
     }
 
     if (bdrv_freeze_backing_chain(commit_top_bs, base, errp) < 0) {
         bdrv_graph_wrunlock();
+        bdrv_drain_all_end();
         goto fail;
     }
     s->chain_frozen = true;
 
     ret = block_job_add_bdrv(&s->common, "base", base, 0, BLK_PERM_ALL, errp);
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
 
     if (ret < 0) {
         goto fail;
diff --git a/block/mirror.c b/block/mirror.c
index a53582f17b..5c1acd650b 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1904,6 +1904,7 @@ static BlockJob *mirror_start_job(
      */
     bdrv_disable_dirty_bitmap(s->dirty_bitmap);
 
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     ret = block_job_add_bdrv(&s->common, "source", bs, 0,
                              BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE |
@@ -1911,6 +1912,7 @@ static BlockJob *mirror_start_job(
                              errp);
     if (ret < 0) {
         bdrv_graph_wrunlock();
+        bdrv_drain_all_end();
         goto fail;
     }
 
@@ -1956,16 +1958,19 @@ static BlockJob *mirror_start_job(
                                      iter_shared_perms, errp);
             if (ret < 0) {
                 bdrv_graph_wrunlock();
+                bdrv_drain_all_end();
                 goto fail;
             }
         }
 
         if (bdrv_freeze_backing_chain(mirror_top_bs, target, errp) < 0) {
             bdrv_graph_wrunlock();
+            bdrv_drain_all_end();
             goto fail;
         }
     }
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
 
     QTAILQ_INIT(&s->ops_in_flight);
 
diff --git a/block/qcow2.c b/block/qcow2.c
index 7774e7f090..bba7c90713 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2821,9 +2821,11 @@ qcow2_do_close(BlockDriverState *bs, bool close_data_file)
     if (close_data_file && has_data_file(bs)) {
         GLOBAL_STATE_CODE();
         bdrv_graph_rdunlock_main_loop();
+        bdrv_drain_all_begin();
         bdrv_graph_wrlock();
         bdrv_unref_child(bs, s->data_file);
         bdrv_graph_wrunlock();
+        bdrv_drain_all_end();
         s->data_file = NULL;
         bdrv_graph_rdlock_main_loop();
     }
diff --git a/block/quorum.c b/block/quorum.c
index 30747a6df9..3a5928d574 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -1037,6 +1037,7 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
 
 close_exit:
     /* cleanup on error */
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     for (i = 0; i < s->num_children; i++) {
         if (!opened[i]) {
@@ -1045,6 +1046,7 @@ close_exit:
         bdrv_unref_child(bs, s->children[i]);
     }
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
     g_free(s->children);
     g_free(opened);
 exit:
@@ -1057,11 +1059,13 @@ static void quorum_close(BlockDriverState *bs)
     BDRVQuorumState *s = bs->opaque;
     int i;
 
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     for (i = 0; i < s->num_children; i++) {
         bdrv_unref_child(bs, s->children[i]);
     }
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
 
     g_free(s->children);
 }
diff --git a/block/replication.c b/block/replication.c
index d6625c51fe..ec54a9a907 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -540,6 +540,7 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,
             return;
         }
 
+        bdrv_drain_all_begin();
         bdrv_graph_wrlock();
 
         bdrv_ref(hidden_disk->bs);
@@ -549,6 +550,7 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,
         if (local_err) {
             error_propagate(errp, local_err);
             bdrv_graph_wrunlock();
+            bdrv_drain_all_end();
             return;
         }
 
@@ -559,6 +561,7 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,
         if (local_err) {
             error_propagate(errp, local_err);
             bdrv_graph_wrunlock();
+            bdrv_drain_all_end();
             return;
         }
 
@@ -571,12 +574,14 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,
             !check_top_bs(top_bs, bs)) {
             error_setg(errp, "No top_bs or it is invalid");
             bdrv_graph_wrunlock();
+            bdrv_drain_all_end();
             reopen_backing_file(bs, false, NULL);
             return;
         }
         bdrv_op_block_all(top_bs, s->blocker);
 
         bdrv_graph_wrunlock();
+        bdrv_drain_all_end();
 
         s->backup_job = backup_job_create(
                                 NULL, s->secondary_disk->bs, s->hidden_disk->bs,
@@ -649,12 +654,14 @@ static void replication_done(void *opaque, int ret)
     if (ret == 0) {
         s->stage = BLOCK_REPLICATION_DONE;
 
+        bdrv_drain_all_begin();
         bdrv_graph_wrlock();
         bdrv_unref_child(bs, s->secondary_disk);
         s->secondary_disk = NULL;
         bdrv_unref_child(bs, s->hidden_disk);
         s->hidden_disk = NULL;
         bdrv_graph_wrunlock();
+        bdrv_drain_all_end();
 
         s->error = 0;
     } else {
diff --git a/block/snapshot.c b/block/snapshot.c
index 7788e1130b..88c3190eff 100644
--- a/block/snapshot.c
+++ b/block/snapshot.c
@@ -291,9 +291,11 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
         }
 
         /* .bdrv_open() will re-attach it */
+        bdrv_drain_all_begin();
         bdrv_graph_wrlock();
         bdrv_unref_child(bs, fallback);
         bdrv_graph_wrunlock();
+        bdrv_drain_all_end();
 
         ret = bdrv_snapshot_goto(fallback_bs, snapshot_id, errp);
         memset(bs->opaque, 0, drv->instance_size);
diff --git a/block/stream.c b/block/stream.c
index 999d9e56d4..6e68b0c9f8 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -80,10 +80,9 @@ static int stream_prepare(Job *job)
      * may end up working with the wrong base node (or it might even have gone
      * away by the time we want to use it).
      */
-    bdrv_drained_begin(unfiltered_bs);
+    bdrv_drain_all_begin();
     if (unfiltered_bs_cow) {
         bdrv_ref(unfiltered_bs_cow);
-        bdrv_drained_begin(unfiltered_bs_cow);
     }
 
     bdrv_graph_rdlock_main_loop();
@@ -124,10 +123,9 @@ static int stream_prepare(Job *job)
 
 out:
     if (unfiltered_bs_cow) {
-        bdrv_drained_end(unfiltered_bs_cow);
         bdrv_unref(unfiltered_bs_cow);
     }
-    bdrv_drained_end(unfiltered_bs);
+    bdrv_drain_all_end();
     return ret;
 }
 
@@ -373,10 +371,12 @@ void stream_start(const char *job_id, BlockDriverState *bs,
      * already have our own plans. Also don't allow resize as the image size is
      * queried only at the job start and then cached.
      */
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     if (block_job_add_bdrv(&s->common, "active node", bs, 0,
                            basic_flags | BLK_PERM_WRITE, errp)) {
         bdrv_graph_wrunlock();
+        bdrv_drain_all_end();
         goto fail;
     }
 
@@ -397,10 +397,12 @@ void stream_start(const char *job_id, BlockDriverState *bs,
                                  basic_flags, errp);
         if (ret < 0) {
             bdrv_graph_wrunlock();
+            bdrv_drain_all_end();
             goto fail;
         }
     }
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
 
     s->base_overlay = base_overlay;
     s->above_base = above_base;
diff --git a/block/vmdk.c b/block/vmdk.c
index 2adec49912..e76464685e 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c
@@ -271,6 +271,7 @@ static void vmdk_free_extents(BlockDriverState *bs)
     BDRVVmdkState *s = bs->opaque;
     VmdkExtent *e;
 
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     for (i = 0; i < s->num_extents; i++) {
         e = &s->extents[i];
@@ -283,6 +284,7 @@ static void vmdk_free_extents(BlockDriverState *bs)
         }
     }
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
 
     g_free(s->extents);
 }
@@ -1247,9 +1249,11 @@ vmdk_parse_extents(const char *desc, BlockDriverState *bs, QDict *options,
                             0, 0, 0, 0, 0, &extent, errp);
             if (ret < 0) {
                 bdrv_graph_rdunlock_main_loop();
+                bdrv_drain_all_begin();
                 bdrv_graph_wrlock();
                 bdrv_unref_child(bs, extent_file);
                 bdrv_graph_wrunlock();
+                bdrv_drain_all_end();
                 bdrv_graph_rdlock_main_loop();
                 goto out;
             }
@@ -1266,9 +1270,11 @@ vmdk_parse_extents(const char *desc, BlockDriverState *bs, QDict *options,
             g_free(buf);
             if (ret) {
                 bdrv_graph_rdunlock_main_loop();
+                bdrv_drain_all_begin();
                 bdrv_graph_wrlock();
                 bdrv_unref_child(bs, extent_file);
                 bdrv_graph_wrunlock();
+                bdrv_drain_all_end();
                 bdrv_graph_rdlock_main_loop();
                 goto out;
             }
@@ -1277,9 +1283,11 @@ vmdk_parse_extents(const char *desc, BlockDriverState *bs, QDict *options,
             ret = vmdk_open_se_sparse(bs, extent_file, bs->open_flags, errp);
             if (ret) {
                 bdrv_graph_rdunlock_main_loop();
+                bdrv_drain_all_begin();
                 bdrv_graph_wrlock();
                 bdrv_unref_child(bs, extent_file);
                 bdrv_graph_wrunlock();
+                bdrv_drain_all_end();
                 bdrv_graph_rdlock_main_loop();
                 goto out;
             }
@@ -1287,9 +1295,11 @@ vmdk_parse_extents(const char *desc, BlockDriverState *bs, QDict *options,
         } else {
             error_setg(errp, "Unsupported extent type '%s'", type);
             bdrv_graph_rdunlock_main_loop();
+            bdrv_drain_all_begin();
             bdrv_graph_wrlock();
             bdrv_unref_child(bs, extent_file);
             bdrv_graph_wrunlock();
+            bdrv_drain_all_end();
             bdrv_graph_rdlock_main_loop();
             ret = -ENOTSUP;
             goto out;
diff --git a/blockdev.c b/blockdev.c
index 4a672f9bac..b6a4bc2bfc 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3539,6 +3539,7 @@ void qmp_x_blockdev_change(const char *parent, const char *child,
     BlockDriverState *parent_bs, *new_bs = NULL;
     BdrvChild *p_child;
 
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
 
     parent_bs = bdrv_lookup_bs(parent, parent, errp);
@@ -3576,6 +3577,7 @@ void qmp_x_blockdev_change(const char *parent, const char *child,
 
 out:
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
 }
 
 BlockJobInfoList *qmp_query_block_jobs(Error **errp)
@@ -3609,12 +3611,13 @@ void qmp_x_blockdev_set_iothread(const char *node_name, StrOrNull *iothread,
     AioContext *new_context;
     BlockDriverState *bs;
 
-    GRAPH_RDLOCK_GUARD_MAINLOOP();
+    bdrv_drain_all_begin();
+    bdrv_graph_rdlock_main_loop();
 
     bs = bdrv_find_node(node_name);
     if (!bs) {
         error_setg(errp, "Failed to find node with node-name='%s'", node_name);
-        return;
+        goto out;
     }
 
     /* Protects against accidents. */
@@ -3622,14 +3625,14 @@ void qmp_x_blockdev_set_iothread(const char *node_name, StrOrNull *iothread,
         error_setg(errp, "Node %s is associated with a BlockBackend and could "
                          "be in use (use force=true to override this check)",
                          node_name);
-        return;
+        goto out;
     }
 
     if (iothread->type == QTYPE_QSTRING) {
         IOThread *obj = iothread_by_id(iothread->u.s);
         if (!obj) {
             error_setg(errp, "Cannot find iothread %s", iothread->u.s);
-            return;
+            goto out;
         }
 
         new_context = iothread_get_aio_context(obj);
@@ -3637,7 +3640,11 @@ void qmp_x_blockdev_set_iothread(const char *node_name, StrOrNull *iothread,
         new_context = qemu_get_aio_context();
     }
 
-    bdrv_try_change_aio_context(bs, new_context, NULL, errp);
+    bdrv_try_change_aio_context_locked(bs, new_context, NULL, errp);
+
+out:
+    bdrv_graph_rdunlock_main_loop();
+    bdrv_drain_all_end();
 }
 
 QemuOptsList qemu_common_drive_opts = {
diff --git a/blockjob.c b/blockjob.c
index 32007f31a9..390301d3cf 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -198,6 +198,7 @@ void block_job_remove_all_bdrv(BlockJob *job)
      * one to make sure that such a concurrent access does not attempt
      * to process an already freed BdrvChild.
      */
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     while (job->nodes) {
         GSList *l = job->nodes;
@@ -211,6 +212,7 @@ void block_job_remove_all_bdrv(BlockJob *job)
         g_slist_free_1(l);
     }
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
 }
 
 bool block_job_has_bdrv(BlockJob *job, BlockDriverState *bs)
@@ -496,6 +498,7 @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
     int ret;
     GLOBAL_STATE_CODE();
 
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
 
     if (job_id == NULL && !(flags & JOB_INTERNAL)) {
@@ -506,6 +509,7 @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
                      flags, cb, opaque, errp);
     if (job == NULL) {
         bdrv_graph_wrunlock();
+        bdrv_drain_all_end();
         return NULL;
     }
 
@@ -544,10 +548,12 @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
     }
 
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
     return job;
 
 fail:
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
     job_early_fail(&job->job);
     return NULL;
 }
diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
index 9be34b3c99..da842e6019 100644
--- a/include/block/block-global-state.h
+++ b/include/block/block-global-state.h
@@ -277,8 +277,12 @@ bool bdrv_debug_is_suspended(BlockDriverState *bs, const char *tag);
 bool bdrv_child_change_aio_context(BdrvChild *c, AioContext *ctx,
                                    GHashTable *visited, Transaction *tran,
                                    Error **errp);
-int bdrv_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
-                                BdrvChild *ignore_child, Error **errp);
+int GRAPH_UNLOCKED
+bdrv_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
+                            BdrvChild *ignore_child, Error **errp);
+int GRAPH_RDLOCK
+bdrv_try_change_aio_context_locked(BlockDriverState *bs, AioContext *ctx,
+                                   BdrvChild *ignore_child, Error **errp);
 
 int GRAPH_RDLOCK bdrv_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz);
 int bdrv_probe_geometry(BlockDriverState *bs, HDGeometry *geo);
diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
index 290cd2a70e..d931c5183e 100644
--- a/tests/unit/test-bdrv-drain.c
+++ b/tests/unit/test-bdrv-drain.c
@@ -772,9 +772,11 @@ static void test_blockjob_common_drain_node(enum drain_type drain_type,
     tjob->bs = src;
     job = &tjob->common;
 
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     block_job_add_bdrv(job, "target", target, 0, BLK_PERM_ALL, &error_abort);
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
 
     switch (result) {
     case TEST_JOB_SUCCESS:
@@ -953,11 +955,13 @@ static void bdrv_test_top_close(BlockDriverState *bs)
 {
     BdrvChild *c, *next_c;
 
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     QLIST_FOREACH_SAFE(c, &bs->children, next, next_c) {
         bdrv_unref_child(bs, c);
     }
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
 }
 
 static int coroutine_fn GRAPH_RDLOCK
@@ -1047,10 +1051,12 @@ static void do_test_delete_by_drain(bool detach_instead_of_delete,
 
     null_bs = bdrv_open("null-co://", NULL, NULL, BDRV_O_RDWR | BDRV_O_PROTOCOL,
                         &error_abort);
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     bdrv_attach_child(bs, null_bs, "null-child", &child_of_bds,
                       BDRV_CHILD_DATA, &error_abort);
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
 
     /* This child will be the one to pass to requests through to, and
      * it will stall until a drain occurs */
@@ -1058,21 +1064,25 @@ static void do_test_delete_by_drain(bool detach_instead_of_delete,
                                     &error_abort);
     child_bs->total_sectors = 65536 >> BDRV_SECTOR_BITS;
     /* Takes our reference to child_bs */
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     tts->wait_child = bdrv_attach_child(bs, child_bs, "wait-child",
                                         &child_of_bds,
                                         BDRV_CHILD_DATA | BDRV_CHILD_PRIMARY,
                                         &error_abort);
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
 
     /* This child is just there to be deleted
      * (for detach_instead_of_delete == true) */
     null_bs = bdrv_open("null-co://", NULL, NULL, BDRV_O_RDWR | BDRV_O_PROTOCOL,
                         &error_abort);
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     bdrv_attach_child(bs, null_bs, "null-child", &child_of_bds, BDRV_CHILD_DATA,
                       &error_abort);
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
 
     blk = blk_new(qemu_get_aio_context(), BLK_PERM_ALL, BLK_PERM_ALL);
     blk_insert_bs(blk, bs, &error_abort);
@@ -1155,6 +1165,7 @@ static void no_coroutine_fn detach_indirect_bh(void *opaque)
 
     bdrv_dec_in_flight(data->child_b->bs);
 
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     bdrv_unref_child(data->parent_b, data->child_b);
 
@@ -1163,6 +1174,7 @@ static void no_coroutine_fn detach_indirect_bh(void *opaque)
                                       &child_of_bds, BDRV_CHILD_DATA,
                                       &error_abort);
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
 }
 
 static void coroutine_mixed_fn detach_by_parent_aio_cb(void *opaque, int ret)
@@ -1260,6 +1272,7 @@ static void TSA_NO_TSA test_detach_indirect(bool by_parent_cb)
     /* Set child relationships */
     bdrv_ref(b);
     bdrv_ref(a);
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     child_b = bdrv_attach_child(parent_b, b, "PB-B", &child_of_bds,
                                 BDRV_CHILD_DATA, &error_abort);
@@ -1271,6 +1284,7 @@ static void TSA_NO_TSA test_detach_indirect(bool by_parent_cb)
                       by_parent_cb ? &child_of_bds : &detach_by_driver_cb_class,
                       BDRV_CHILD_DATA, &error_abort);
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
 
     g_assert_cmpint(parent_a->refcnt, ==, 1);
     g_assert_cmpint(parent_b->refcnt, ==, 1);
@@ -1687,6 +1701,7 @@ static void test_drop_intermediate_poll(void)
      * Establish the chain last, so the chain links are the first
      * elements in the BDS.parents lists
      */
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     for (i = 0; i < 3; i++) {
         if (i) {
@@ -1696,6 +1711,7 @@ static void test_drop_intermediate_poll(void)
         }
     }
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
 
     job = block_job_create("job", &test_simple_job_driver, NULL, job_node,
                            0, BLK_PERM_ALL, 0, 0, NULL, NULL, &error_abort);
@@ -1942,10 +1958,12 @@ static void do_test_replace_child_mid_drain(int old_drain_count,
     new_child_bs->total_sectors = 1;
 
     bdrv_ref(old_child_bs);
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     bdrv_attach_child(parent_bs, old_child_bs, "child", &child_of_bds,
                       BDRV_CHILD_COW, &error_abort);
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
     parent_s->setup_completed = true;
 
     for (i = 0; i < old_drain_count; i++) {
diff --git a/tests/unit/test-bdrv-graph-mod.c b/tests/unit/test-bdrv-graph-mod.c
index d743abb4bb..7b03ebe4b0 100644
--- a/tests/unit/test-bdrv-graph-mod.c
+++ b/tests/unit/test-bdrv-graph-mod.c
@@ -137,10 +137,12 @@ static void test_update_perm_tree(void)
 
     blk_insert_bs(root, bs, &error_abort);
 
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     bdrv_attach_child(filter, bs, "child", &child_of_bds,
                       BDRV_CHILD_DATA, &error_abort);
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
 
     ret = bdrv_append(filter, bs, NULL);
     g_assert_cmpint(ret, <, 0);
@@ -204,11 +206,13 @@ static void test_should_update_child(void)
 
     bdrv_set_backing_hd(target, bs, &error_abort);
 
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     g_assert(target->backing->bs == bs);
     bdrv_attach_child(filter, target, "target", &child_of_bds,
                       BDRV_CHILD_DATA, &error_abort);
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
     bdrv_append(filter, bs, &error_abort);
 
     bdrv_graph_rdlock_main_loop();
@@ -244,6 +248,7 @@ static void test_parallel_exclusive_write(void)
     bdrv_ref(base);
     bdrv_ref(fl1);
 
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     bdrv_attach_child(top, fl1, "backing", &child_of_bds,
                       BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
@@ -257,6 +262,7 @@ static void test_parallel_exclusive_write(void)
 
     bdrv_replace_node(fl1, fl2, &error_abort);
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
 
     bdrv_drained_end(fl2);
     bdrv_drained_end(fl1);
@@ -363,6 +369,7 @@ static void test_parallel_perm_update(void)
      */
     bdrv_ref(base);
 
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     bdrv_attach_child(top, ws, "file", &child_of_bds, BDRV_CHILD_DATA,
                       &error_abort);
@@ -377,6 +384,7 @@ static void test_parallel_perm_update(void)
                       BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
                       &error_abort);
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
 
     /* Select fl1 as first child to be active */
     s->selected = c_fl1;
@@ -430,11 +438,13 @@ static void test_append_greedy_filter(void)
     BlockDriverState *base = no_perm_node("base");
     BlockDriverState *fl = exclusive_writer_node("fl1");
 
+    bdrv_drain_all_begin();
     bdrv_graph_wrlock();
     bdrv_attach_child(top, base, "backing", &child_of_bds,
                       BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
                       &error_abort);
     bdrv_graph_wrunlock();
+    bdrv_drain_all_end();
 
     bdrv_append(fl, base, &error_abort);
     bdrv_unref(fl);
-- 
2.39.5

* [PATCH 10/11] block/graph-lock: add drain flag to bdrv_graph_wr{,un}lock
  2025-05-08 14:09 [RFC 00/11] do not drain while holding the graph lock Fiona Ebner
                   ` (8 preceding siblings ...)
  2025-05-08 14:09 ` [PATCH 09/11] block: move drain out of bdrv_change_aio_context() Fiona Ebner
@ 2025-05-08 14:09 ` Fiona Ebner
  2025-05-14 19:54   ` [PATCH 10/11] block/graph-lock: add drain flag to bdrv_graph_wr{,un}lock Kevin Wolf
  2025-05-08 14:09 ` [PATCH 11/11] iotests/graph-changes-while-io: add test case with removal of lower snapshot Fiona Ebner
  10 siblings, 1 reply; 30+ messages in thread
From: Fiona Ebner @ 2025-05-08 14:09 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov

For convenience, allow callers to indicate directly when taking the
write lock whether the write-locked section should also be a drained
section.
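
As a rough sketch of the resulting calling convention (this just
restates the pattern that the tree-wide conversion below applies,
nothing beyond it):

    /* before this patch */
    bdrv_drain_all_begin();
    bdrv_graph_wrlock();
    /* ... modify the block graph ... */
    bdrv_graph_wrunlock();
    bdrv_drain_all_end();

    /* after this patch */
    bdrv_graph_wrlock(true);
    /* ... modify the block graph ... */
    bdrv_graph_wrunlock(true);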

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---

In bdrv_graph_wrlock() there is a comment that it uses
bdrv_drain_all_begin_nopoll() to make sure that constantly arriving
new I/O doesn't cause starvation. The changes from this series are at
odds with that, but there don't seem to be any (new) test failures.

The following script was used:

> find . -name '*.c' -exec sed -i -z 's/bdrv_drain_all_begin();\n\s*bdrv_graph_wrlock();/bdrv_graph_wrlock(true);/g' {} ';'
> find . -name '*.c' -exec sed -i -z 's/bdrv_graph_wrunlock();\n\s*bdrv_drain_all_end();/bdrv_graph_wrunlock(true);/g' {} ';'
> find . -name '*.c' -exec sed -i -z 's/bdrv_graph_wrlock();/bdrv_graph_wrlock(false);/g' {} ';'
> find . -name '*.c' -exec sed -i -z 's/bdrv_graph_wrunlock();/bdrv_graph_wrunlock(false);/g' {} ';'

 block.c                            | 62 ++++++++++++------------------
 block/backup.c                     |  6 +--
 block/blklogwrites.c               | 12 ++----
 block/blkverify.c                  |  6 +--
 block/block-backend.c              | 12 ++----
 block/commit.c                     | 20 ++++------
 block/graph-lock.c                 | 22 ++++++++---
 block/mirror.c                     | 27 ++++++-------
 block/qcow2.c                      |  6 +--
 block/quorum.c                     | 12 ++----
 block/replication.c                | 21 ++++------
 block/snapshot.c                   |  6 +--
 block/stream.c                     | 16 +++-----
 block/vmdk.c                       | 30 +++++----------
 blockdev.c                         | 10 ++---
 blockjob.c                         | 18 +++------
 include/block/graph-lock.h         |  8 +++-
 scripts/block-coroutine-wrapper.py |  4 +-
 tests/unit/test-bdrv-drain.c       | 58 ++++++++++------------------
 tests/unit/test-bdrv-graph-mod.c   | 30 +++++----------
 20 files changed, 152 insertions(+), 234 deletions(-)

diff --git a/block.c b/block.c
index d84ac2f394..836e23b989 100644
--- a/block.c
+++ b/block.c
@@ -1720,14 +1720,12 @@ bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv, const char *node_name,
 open_failed:
     bs->drv = NULL;
 
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     if (bs->file != NULL) {
         bdrv_unref_child(bs, bs->file);
         assert(!bs->file);
     }
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
 
     g_free(bs->opaque);
     bs->opaque = NULL;
@@ -3587,11 +3585,9 @@ int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
     bdrv_graph_rdunlock_main_loop();
 
     bdrv_ref(drain_bs);
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     ret = bdrv_set_backing_hd_drained(bs, backing_hd, errp);
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
     bdrv_unref(drain_bs);
 
     return ret;
@@ -3783,12 +3779,10 @@ static BdrvChild *bdrv_open_child_common(const char *filename,
         return NULL;
     }
 
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     child = bdrv_attach_child(parent, bs, bdref_key, child_class, child_role,
                               errp);
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
 
     return child;
 }
@@ -4640,9 +4634,9 @@ int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, Error **errp)
         bdrv_reopen_commit(&bs_entry->state);
     }
 
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(false);
     tran_commit(tran);
-    bdrv_graph_wrunlock();
+    bdrv_graph_wrunlock(false);
 
     QTAILQ_FOREACH_REVERSE(bs_entry, bs_queue, entry) {
         BlockDriverState *bs = bs_entry->state.bs;
@@ -4656,9 +4650,9 @@ int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, Error **errp)
     goto cleanup;
 
 abort:
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(false);
     tran_abort(tran);
-    bdrv_graph_wrunlock();
+    bdrv_graph_wrunlock(false);
 
     QTAILQ_FOREACH_SAFE(bs_entry, bs_queue, entry, next) {
         if (bs_entry->prepared) {
@@ -4813,14 +4807,12 @@ bdrv_reopen_parse_file_or_backing(BDRVReopenState *reopen_state,
     }
 
     bdrv_graph_rdunlock_main_loop();
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
 
     ret = bdrv_set_file_or_backing_noperm(bs, new_child_bs, is_backing,
                                           tran, errp);
 
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
 
     if (old_child_bs) {
         bdrv_unref(old_child_bs);
@@ -5164,16 +5156,14 @@ static void bdrv_close(BlockDriverState *bs)
         bs->drv = NULL;
     }
 
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     QLIST_FOREACH_SAFE(child, &bs->children, next, next) {
         bdrv_unref_child(bs, child);
     }
 
     assert(!bs->backing);
     assert(!bs->file);
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
 
     g_free(bs->opaque);
     bs->opaque = NULL;
@@ -5466,9 +5456,9 @@ int bdrv_drop_filter(BlockDriverState *bs, Error **errp)
     bdrv_graph_rdunlock_main_loop();
 
     bdrv_drained_begin(child_bs);
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(false);
     ret = bdrv_replace_node_common(bs, child_bs, true, true, errp);
-    bdrv_graph_wrunlock();
+    bdrv_graph_wrunlock(false);
     bdrv_drained_end(child_bs);
 
     return ret;
@@ -5499,8 +5489,7 @@ int bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top,
     assert(!bs_new->backing);
     bdrv_graph_rdunlock_main_loop();
 
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
 
     child = bdrv_attach_child_noperm(bs_new, bs_top, "backing",
                                      &child_of_bds, bdrv_backing_role(bs_new),
@@ -5520,8 +5509,7 @@ out:
     tran_finalize(tran, ret);
 
     bdrv_refresh_limits(bs_top, NULL, NULL);
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
 
     return ret;
 }
@@ -5540,7 +5528,7 @@ int bdrv_replace_child_bs(BdrvChild *child, BlockDriverState *new_bs,
     bdrv_ref(old_bs);
     bdrv_drained_begin(old_bs);
     bdrv_drained_begin(new_bs);
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(false);
 
     bdrv_replace_child_tran(child, new_bs, tran);
 
@@ -5551,7 +5539,7 @@ int bdrv_replace_child_bs(BdrvChild *child, BlockDriverState *new_bs,
 
     tran_finalize(tran, ret);
 
-    bdrv_graph_wrunlock();
+    bdrv_graph_wrunlock(false);
     bdrv_drained_end(old_bs);
     bdrv_drained_end(new_bs);
     bdrv_unref(old_bs);
@@ -5633,9 +5621,9 @@ BlockDriverState *bdrv_insert_node(BlockDriverState *bs, QDict *options,
     bdrv_ref(bs);
     bdrv_drained_begin(bs);
     bdrv_drained_begin(new_node_bs);
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(false);
     ret = bdrv_replace_node(bs, new_node_bs, errp);
-    bdrv_graph_wrunlock();
+    bdrv_graph_wrunlock(false);
     bdrv_drained_end(new_node_bs);
     bdrv_drained_end(bs);
     bdrv_unref(bs);
@@ -5891,7 +5879,7 @@ int bdrv_drop_intermediate(BlockDriverState *top, BlockDriverState *base,
 
     bdrv_ref(top);
     bdrv_drained_begin(base);
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(false);
 
     if (!top->drv || !base->drv) {
         goto exit_wrlock;
@@ -5931,7 +5919,7 @@ int bdrv_drop_intermediate(BlockDriverState *top, BlockDriverState *base,
      * That's a FIXME.
      */
     bdrv_replace_node_common(top, base, false, false, &local_err);
-    bdrv_graph_wrunlock();
+    bdrv_graph_wrunlock(false);
 
     if (local_err) {
         error_report_err(local_err);
@@ -5969,7 +5957,7 @@ int bdrv_drop_intermediate(BlockDriverState *top, BlockDriverState *base,
     goto exit;
 
 exit_wrlock:
-    bdrv_graph_wrunlock();
+    bdrv_graph_wrunlock(false);
 exit:
     bdrv_drained_end(base);
     bdrv_unref(top);
diff --git a/block/backup.c b/block/backup.c
index 9d55e55b79..02f2cd8276 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -497,12 +497,10 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
     block_copy_set_speed(bcs, speed);
 
     /* Required permissions are taken by copy-before-write filter target */
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     block_job_add_bdrv(&job->common, "target", target, 0, BLK_PERM_ALL,
                        &error_abort);
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
 
     return &job->common;
 
diff --git a/block/blklogwrites.c b/block/blklogwrites.c
index 70ac76f401..b08c584e08 100644
--- a/block/blklogwrites.c
+++ b/block/blklogwrites.c
@@ -281,11 +281,9 @@ static int blk_log_writes_open(BlockDriverState *bs, QDict *options, int flags,
     ret = 0;
 fail_log:
     if (ret < 0) {
-        bdrv_drain_all_begin();
-        bdrv_graph_wrlock();
+        bdrv_graph_wrlock(true);
         bdrv_unref_child(bs, s->log_file);
-        bdrv_graph_wrunlock();
-        bdrv_drain_all_end();
+        bdrv_graph_wrunlock(true);
         s->log_file = NULL;
         qemu_mutex_destroy(&s->mutex);
     }
@@ -298,12 +296,10 @@ static void blk_log_writes_close(BlockDriverState *bs)
 {
     BDRVBlkLogWritesState *s = bs->opaque;
 
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     bdrv_unref_child(bs, s->log_file);
     s->log_file = NULL;
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
     qemu_mutex_destroy(&s->mutex);
 }
 
diff --git a/block/blkverify.c b/block/blkverify.c
index 3a71f7498c..554817f88c 100644
--- a/block/blkverify.c
+++ b/block/blkverify.c
@@ -151,12 +151,10 @@ static void blkverify_close(BlockDriverState *bs)
 {
     BDRVBlkverifyState *s = bs->opaque;
 
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     bdrv_unref_child(bs, s->test_file);
     s->test_file = NULL;
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
 }
 
 static int64_t coroutine_fn GRAPH_RDLOCK
diff --git a/block/block-backend.c b/block/block-backend.c
index 2b30a34662..63393ddb73 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -889,11 +889,9 @@ void blk_remove_bs(BlockBackend *blk)
     root = blk->root;
     blk->root = NULL;
 
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     bdrv_root_unref_child(root);
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
 }
 
 /*
@@ -906,8 +904,7 @@ int blk_insert_bs(BlockBackend *blk, BlockDriverState *bs, Error **errp)
 
     GLOBAL_STATE_CODE();
     bdrv_ref(bs);
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
 
     if ((bs->open_flags & BDRV_O_INACTIVE) && blk_can_inactivate(blk)) {
         blk->disable_perm = true;
@@ -921,8 +918,7 @@ int blk_insert_bs(BlockBackend *blk, BlockDriverState *bs, Error **errp)
     blk->root = bdrv_root_attach_child(bs, "root", &child_root,
                                        BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
                                        perm, shared_perm, blk, errp);
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
     if (blk->root == NULL) {
         return -EPERM;
     }
diff --git a/block/commit.c b/block/commit.c
index 6c06b894ff..9791ac5d2a 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -102,9 +102,9 @@ static void commit_abort(Job *job)
     bdrv_graph_rdunlock_main_loop();
 
     bdrv_drained_begin(commit_top_backing_bs);
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(false);
     bdrv_replace_node(s->commit_top_bs, commit_top_backing_bs, &error_abort);
-    bdrv_graph_wrunlock();
+    bdrv_graph_wrunlock(false);
     bdrv_drained_end(commit_top_backing_bs);
 
     bdrv_unref(s->commit_top_bs);
@@ -342,8 +342,7 @@ void commit_start(const char *job_id, BlockDriverState *bs,
      * this is the responsibility of the interface (i.e. whoever calls
      * commit_start()).
      */
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     s->base_overlay = bdrv_find_overlay(top, base);
     assert(s->base_overlay);
 
@@ -374,22 +373,19 @@ void commit_start(const char *job_id, BlockDriverState *bs,
         ret = block_job_add_bdrv(&s->common, "intermediate node", iter, 0,
                                  iter_shared_perms, errp);
         if (ret < 0) {
-            bdrv_graph_wrunlock();
-            bdrv_drain_all_end();
+            bdrv_graph_wrunlock(true);
             goto fail;
         }
     }
 
     if (bdrv_freeze_backing_chain(commit_top_bs, base, errp) < 0) {
-        bdrv_graph_wrunlock();
-        bdrv_drain_all_end();
+        bdrv_graph_wrunlock(true);
         goto fail;
     }
     s->chain_frozen = true;
 
     ret = block_job_add_bdrv(&s->common, "base", base, 0, BLK_PERM_ALL, errp);
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
 
     if (ret < 0) {
         goto fail;
@@ -442,9 +438,9 @@ fail:
      * otherwise this would fail because of lack of permissions. */
     if (commit_top_bs) {
         bdrv_drained_begin(top);
-        bdrv_graph_wrlock();
+        bdrv_graph_wrlock(false);
         bdrv_replace_node(commit_top_bs, top, &error_abort);
-        bdrv_graph_wrunlock();
+        bdrv_graph_wrunlock(false);
         bdrv_drained_end(top);
     }
 }
diff --git a/block/graph-lock.c b/block/graph-lock.c
index c81162b147..a2bc0de060 100644
--- a/block/graph-lock.c
+++ b/block/graph-lock.c
@@ -106,14 +106,20 @@ static uint32_t reader_count(void)
     return rd;
 }
 
-void no_coroutine_fn bdrv_graph_wrlock(void)
+void no_coroutine_fn bdrv_graph_wrlock(bool drain)
 {
     GLOBAL_STATE_CODE();
     assert(!qatomic_read(&has_writer));
     assert(!qemu_in_coroutine());
 
-    /* Make sure that constantly arriving new I/O doesn't cause starvation */
-    bdrv_drain_all_begin_nopoll();
+    if (drain) {
+        bdrv_drain_all_begin();
+    } else {
+        /*
+         * Make sure that constantly arriving new I/O doesn't cause starvation
+         */
+        bdrv_drain_all_begin_nopoll();
+    }
 
     /*
      * reader_count == 0: this means writer will read has_reader as 1
@@ -139,10 +145,12 @@ void no_coroutine_fn bdrv_graph_wrlock(void)
         smp_mb();
     } while (reader_count() >= 1);
 
-    bdrv_drain_all_end();
+    if (!drain) {
+        bdrv_drain_all_end();
+    }
 }
 
-void no_coroutine_fn bdrv_graph_wrunlock(void)
+void no_coroutine_fn bdrv_graph_wrunlock(bool drain)
 {
     GLOBAL_STATE_CODE();
     assert(qatomic_read(&has_writer));
@@ -168,6 +176,10 @@ void no_coroutine_fn bdrv_graph_wrunlock(void)
      * progress.
      */
     aio_bh_poll(qemu_get_aio_context());
+
+    if (drain) {
+        bdrv_drain_all_end();
+    }
 }
 
 void coroutine_fn bdrv_graph_co_rdlock(void)
diff --git a/block/mirror.c b/block/mirror.c
index 5c1acd650b..4acb03e61c 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -762,7 +762,7 @@ static int mirror_exit_common(Job *job)
          * check for an op blocker on @to_replace, and we have our own
          * there.
          */
-        bdrv_graph_wrlock();
+        bdrv_graph_wrlock(false);
         if (bdrv_recurse_can_replace(src, to_replace)) {
             bdrv_replace_node(to_replace, target_bs, &local_err);
         } else {
@@ -771,7 +771,7 @@ static int mirror_exit_common(Job *job)
                        "would not lead to an abrupt change of visible data",
                        to_replace->node_name, target_bs->node_name);
         }
-        bdrv_graph_wrunlock();
+        bdrv_graph_wrunlock(false);
         bdrv_drained_end(to_replace);
         if (local_err) {
             error_report_err(local_err);
@@ -791,9 +791,9 @@ static int mirror_exit_common(Job *job)
      * valid.
      */
     block_job_remove_all_bdrv(bjob);
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(false);
     bdrv_replace_node(mirror_top_bs, mirror_top_bs->backing->bs, &error_abort);
-    bdrv_graph_wrunlock();
+    bdrv_graph_wrunlock(false);
 
     if (abort && s->base_ro && !bdrv_is_read_only(target_bs)) {
         bdrv_reopen_set_read_only(target_bs, true, NULL);
@@ -1904,15 +1904,13 @@ static BlockJob *mirror_start_job(
      */
     bdrv_disable_dirty_bitmap(s->dirty_bitmap);
 
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     ret = block_job_add_bdrv(&s->common, "source", bs, 0,
                              BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE |
                              BLK_PERM_CONSISTENT_READ,
                              errp);
     if (ret < 0) {
-        bdrv_graph_wrunlock();
-        bdrv_drain_all_end();
+        bdrv_graph_wrunlock(true);
         goto fail;
     }
 
@@ -1957,20 +1955,17 @@ static BlockJob *mirror_start_job(
             ret = block_job_add_bdrv(&s->common, "intermediate node", iter, 0,
                                      iter_shared_perms, errp);
             if (ret < 0) {
-                bdrv_graph_wrunlock();
-                bdrv_drain_all_end();
+                bdrv_graph_wrunlock(true);
                 goto fail;
             }
         }
 
         if (bdrv_freeze_backing_chain(mirror_top_bs, target, errp) < 0) {
-            bdrv_graph_wrunlock();
-            bdrv_drain_all_end();
+            bdrv_graph_wrunlock(true);
             goto fail;
         }
     }
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
 
     QTAILQ_INIT(&s->ops_in_flight);
 
@@ -1996,12 +1991,12 @@ fail:
 
     bs_opaque->stop = true;
     bdrv_drained_begin(bs);
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(false);
     assert(mirror_top_bs->backing->bs == bs);
     bdrv_child_refresh_perms(mirror_top_bs, mirror_top_bs->backing,
                              &error_abort);
     bdrv_replace_node(mirror_top_bs, bs, &error_abort);
-    bdrv_graph_wrunlock();
+    bdrv_graph_wrunlock(false);
     bdrv_drained_end(bs);
 
     bdrv_unref(mirror_top_bs);
diff --git a/block/qcow2.c b/block/qcow2.c
index bba7c90713..8d0bc49e85 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2821,11 +2821,9 @@ qcow2_do_close(BlockDriverState *bs, bool close_data_file)
     if (close_data_file && has_data_file(bs)) {
         GLOBAL_STATE_CODE();
         bdrv_graph_rdunlock_main_loop();
-        bdrv_drain_all_begin();
-        bdrv_graph_wrlock();
+        bdrv_graph_wrlock(true);
         bdrv_unref_child(bs, s->data_file);
-        bdrv_graph_wrunlock();
-        bdrv_drain_all_end();
+        bdrv_graph_wrunlock(true);
         s->data_file = NULL;
         bdrv_graph_rdlock_main_loop();
     }
diff --git a/block/quorum.c b/block/quorum.c
index 3a5928d574..fd542b25f2 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -1037,16 +1037,14 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
 
 close_exit:
     /* cleanup on error */
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     for (i = 0; i < s->num_children; i++) {
         if (!opened[i]) {
             continue;
         }
         bdrv_unref_child(bs, s->children[i]);
     }
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
     g_free(s->children);
     g_free(opened);
 exit:
@@ -1059,13 +1057,11 @@ static void quorum_close(BlockDriverState *bs)
     BDRVQuorumState *s = bs->opaque;
     int i;
 
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     for (i = 0; i < s->num_children; i++) {
         bdrv_unref_child(bs, s->children[i]);
     }
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
 
     g_free(s->children);
 }
diff --git a/block/replication.c b/block/replication.c
index ec54a9a907..a2640a97f1 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -540,8 +540,7 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,
             return;
         }
 
-        bdrv_drain_all_begin();
-        bdrv_graph_wrlock();
+        bdrv_graph_wrlock(true);
 
         bdrv_ref(hidden_disk->bs);
         s->hidden_disk = bdrv_attach_child(bs, hidden_disk->bs, "hidden disk",
@@ -549,8 +548,7 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,
                                            &local_err);
         if (local_err) {
             error_propagate(errp, local_err);
-            bdrv_graph_wrunlock();
-            bdrv_drain_all_end();
+            bdrv_graph_wrunlock(true);
             return;
         }
 
@@ -560,8 +558,7 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,
                                               BDRV_CHILD_DATA, &local_err);
         if (local_err) {
             error_propagate(errp, local_err);
-            bdrv_graph_wrunlock();
-            bdrv_drain_all_end();
+            bdrv_graph_wrunlock(true);
             return;
         }
 
@@ -573,15 +570,13 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,
         if (!top_bs || !bdrv_is_root_node(top_bs) ||
             !check_top_bs(top_bs, bs)) {
             error_setg(errp, "No top_bs or it is invalid");
-            bdrv_graph_wrunlock();
-            bdrv_drain_all_end();
+            bdrv_graph_wrunlock(true);
             reopen_backing_file(bs, false, NULL);
             return;
         }
         bdrv_op_block_all(top_bs, s->blocker);
 
-        bdrv_graph_wrunlock();
-        bdrv_drain_all_end();
+        bdrv_graph_wrunlock(true);
 
         s->backup_job = backup_job_create(
                                 NULL, s->secondary_disk->bs, s->hidden_disk->bs,
@@ -654,14 +649,12 @@ static void replication_done(void *opaque, int ret)
     if (ret == 0) {
         s->stage = BLOCK_REPLICATION_DONE;
 
-        bdrv_drain_all_begin();
-        bdrv_graph_wrlock();
+        bdrv_graph_wrlock(true);
         bdrv_unref_child(bs, s->secondary_disk);
         s->secondary_disk = NULL;
         bdrv_unref_child(bs, s->hidden_disk);
         s->hidden_disk = NULL;
-        bdrv_graph_wrunlock();
-        bdrv_drain_all_end();
+        bdrv_graph_wrunlock(true);
 
         s->error = 0;
     } else {
diff --git a/block/snapshot.c b/block/snapshot.c
index 88c3190eff..7f79fe8056 100644
--- a/block/snapshot.c
+++ b/block/snapshot.c
@@ -291,11 +291,9 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
         }
 
         /* .bdrv_open() will re-attach it */
-        bdrv_drain_all_begin();
-        bdrv_graph_wrlock();
+        bdrv_graph_wrlock(true);
         bdrv_unref_child(bs, fallback);
-        bdrv_graph_wrunlock();
-        bdrv_drain_all_end();
+        bdrv_graph_wrunlock(true);
 
         ret = bdrv_snapshot_goto(fallback_bs, snapshot_id, errp);
         memset(bs->opaque, 0, drv->instance_size);
diff --git a/block/stream.c b/block/stream.c
index 6e68b0c9f8..665e75099e 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -104,9 +104,9 @@ static int stream_prepare(Job *job)
             }
         }
 
-        bdrv_graph_wrlock();
+        bdrv_graph_wrlock(false);
         bdrv_set_backing_hd_drained(unfiltered_bs, base, &local_err);
-        bdrv_graph_wrunlock();
+        bdrv_graph_wrunlock(false);
 
         /*
          * This call will do I/O, so the graph can change again from here on.
@@ -371,12 +371,10 @@ void stream_start(const char *job_id, BlockDriverState *bs,
      * already have our own plans. Also don't allow resize as the image size is
      * queried only at the job start and then cached.
      */
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     if (block_job_add_bdrv(&s->common, "active node", bs, 0,
                            basic_flags | BLK_PERM_WRITE, errp)) {
-        bdrv_graph_wrunlock();
-        bdrv_drain_all_end();
+        bdrv_graph_wrunlock(true);
         goto fail;
     }
 
@@ -396,13 +394,11 @@ void stream_start(const char *job_id, BlockDriverState *bs,
         ret = block_job_add_bdrv(&s->common, "intermediate node", iter, 0,
                                  basic_flags, errp);
         if (ret < 0) {
-            bdrv_graph_wrunlock();
-            bdrv_drain_all_end();
+            bdrv_graph_wrunlock(true);
             goto fail;
         }
     }
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
 
     s->base_overlay = base_overlay;
     s->above_base = above_base;
diff --git a/block/vmdk.c b/block/vmdk.c
index e76464685e..c626cf6157 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c
@@ -271,8 +271,7 @@ static void vmdk_free_extents(BlockDriverState *bs)
     BDRVVmdkState *s = bs->opaque;
     VmdkExtent *e;
 
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     for (i = 0; i < s->num_extents; i++) {
         e = &s->extents[i];
         g_free(e->l1_table);
@@ -283,8 +282,7 @@ static void vmdk_free_extents(BlockDriverState *bs)
             bdrv_unref_child(bs, e->file);
         }
     }
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
 
     g_free(s->extents);
 }
@@ -1249,11 +1247,9 @@ vmdk_parse_extents(const char *desc, BlockDriverState *bs, QDict *options,
                             0, 0, 0, 0, 0, &extent, errp);
             if (ret < 0) {
                 bdrv_graph_rdunlock_main_loop();
-                bdrv_drain_all_begin();
-                bdrv_graph_wrlock();
+                bdrv_graph_wrlock(true);
                 bdrv_unref_child(bs, extent_file);
-                bdrv_graph_wrunlock();
-                bdrv_drain_all_end();
+                bdrv_graph_wrunlock(true);
                 bdrv_graph_rdlock_main_loop();
                 goto out;
             }
@@ -1270,11 +1266,9 @@ vmdk_parse_extents(const char *desc, BlockDriverState *bs, QDict *options,
             g_free(buf);
             if (ret) {
                 bdrv_graph_rdunlock_main_loop();
-                bdrv_drain_all_begin();
-                bdrv_graph_wrlock();
+                bdrv_graph_wrlock(true);
                 bdrv_unref_child(bs, extent_file);
-                bdrv_graph_wrunlock();
-                bdrv_drain_all_end();
+                bdrv_graph_wrunlock(true);
                 bdrv_graph_rdlock_main_loop();
                 goto out;
             }
@@ -1283,11 +1277,9 @@ vmdk_parse_extents(const char *desc, BlockDriverState *bs, QDict *options,
             ret = vmdk_open_se_sparse(bs, extent_file, bs->open_flags, errp);
             if (ret) {
                 bdrv_graph_rdunlock_main_loop();
-                bdrv_drain_all_begin();
-                bdrv_graph_wrlock();
+                bdrv_graph_wrlock(true);
                 bdrv_unref_child(bs, extent_file);
-                bdrv_graph_wrunlock();
-                bdrv_drain_all_end();
+                bdrv_graph_wrunlock(true);
                 bdrv_graph_rdlock_main_loop();
                 goto out;
             }
@@ -1295,11 +1287,9 @@ vmdk_parse_extents(const char *desc, BlockDriverState *bs, QDict *options,
         } else {
             error_setg(errp, "Unsupported extent type '%s'", type);
             bdrv_graph_rdunlock_main_loop();
-            bdrv_drain_all_begin();
-            bdrv_graph_wrlock();
+            bdrv_graph_wrlock(true);
             bdrv_unref_child(bs, extent_file);
-            bdrv_graph_wrunlock();
-            bdrv_drain_all_end();
+            bdrv_graph_wrunlock(true);
             bdrv_graph_rdlock_main_loop();
             ret = -ENOTSUP;
             goto out;
diff --git a/blockdev.c b/blockdev.c
index b6a4bc2bfc..267746172a 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1576,9 +1576,9 @@ static void external_snapshot_abort(void *opaque)
             }
 
             bdrv_drained_begin(state->new_bs);
-            bdrv_graph_wrlock();
+            bdrv_graph_wrlock(false);
             bdrv_replace_node(state->new_bs, state->old_bs, &error_abort);
-            bdrv_graph_wrunlock();
+            bdrv_graph_wrunlock(false);
             bdrv_drained_end(state->new_bs);
 
             bdrv_unref(state->old_bs); /* bdrv_replace_node() ref'ed old_bs */
@@ -3539,8 +3539,7 @@ void qmp_x_blockdev_change(const char *parent, const char *child,
     BlockDriverState *parent_bs, *new_bs = NULL;
     BdrvChild *p_child;
 
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
 
     parent_bs = bdrv_lookup_bs(parent, parent, errp);
     if (!parent_bs) {
@@ -3576,8 +3575,7 @@ void qmp_x_blockdev_change(const char *parent, const char *child,
     }
 
 out:
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
 }
 
 BlockJobInfoList *qmp_query_block_jobs(Error **errp)
diff --git a/blockjob.c b/blockjob.c
index 390301d3cf..5a1d74a672 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -198,8 +198,7 @@ void block_job_remove_all_bdrv(BlockJob *job)
      * one to make sure that such a concurrent access does not attempt
      * to process an already freed BdrvChild.
      */
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     while (job->nodes) {
         GSList *l = job->nodes;
         BdrvChild *c = l->data;
@@ -211,8 +210,7 @@ void block_job_remove_all_bdrv(BlockJob *job)
 
         g_slist_free_1(l);
     }
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
 }
 
 bool block_job_has_bdrv(BlockJob *job, BlockDriverState *bs)
@@ -498,8 +496,7 @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
     int ret;
     GLOBAL_STATE_CODE();
 
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
 
     if (job_id == NULL && !(flags & JOB_INTERNAL)) {
         job_id = bdrv_get_device_name(bs);
@@ -508,8 +505,7 @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
     job = job_create(job_id, &driver->job_driver, txn, bdrv_get_aio_context(bs),
                      flags, cb, opaque, errp);
     if (job == NULL) {
-        bdrv_graph_wrunlock();
-        bdrv_drain_all_end();
+        bdrv_graph_wrunlock(true);
         return NULL;
     }
 
@@ -547,13 +543,11 @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
         goto fail;
     }
 
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
     return job;
 
 fail:
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
     job_early_fail(&job->job);
     return NULL;
 }
diff --git a/include/block/graph-lock.h b/include/block/graph-lock.h
index 2c26c72108..f291ccbc97 100644
--- a/include/block/graph-lock.h
+++ b/include/block/graph-lock.h
@@ -108,17 +108,21 @@ void unregister_aiocontext(AioContext *ctx);
  *
  * The wrlock can only be taken from the main loop, with BQL held, as only the
  * main loop is allowed to modify the graph.
+ *
+ * @drain whether bdrv_drain_all_begin() should be called before taking the lock
  */
 void no_coroutine_fn TSA_ACQUIRE(graph_lock) TSA_NO_TSA
-bdrv_graph_wrlock(void);
+bdrv_graph_wrlock(bool drain);
 
 /*
  * bdrv_graph_wrunlock:
  * Write finished, reset global has_writer to 0 and restart
  * all readers that are waiting.
+ *
+ * @drain whether bdrv_drain_all_end() should be called after releasing the lock
  */
 void no_coroutine_fn TSA_RELEASE(graph_lock) TSA_NO_TSA
-bdrv_graph_wrunlock(void);
+bdrv_graph_wrunlock(bool drain);
 
 /*
  * bdrv_graph_co_rdlock:
diff --git a/scripts/block-coroutine-wrapper.py b/scripts/block-coroutine-wrapper.py
index dbbde99e39..3286c4fe02 100644
--- a/scripts/block-coroutine-wrapper.py
+++ b/scripts/block-coroutine-wrapper.py
@@ -259,8 +259,8 @@ def gen_no_co_wrapper(func: FuncDecl) -> str:
         graph_lock='    bdrv_graph_rdlock_main_loop();'
         graph_unlock='    bdrv_graph_rdunlock_main_loop();'
     elif func.graph_wrlock:
-        graph_lock='    bdrv_graph_wrlock();'
-        graph_unlock='    bdrv_graph_wrunlock();'
+        graph_lock='    bdrv_graph_wrlock(false);'
+        graph_unlock='    bdrv_graph_wrunlock(false);'
 
     return f"""\
 /*
diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
index d931c5183e..589ef66a11 100644
--- a/tests/unit/test-bdrv-drain.c
+++ b/tests/unit/test-bdrv-drain.c
@@ -772,11 +772,9 @@ static void test_blockjob_common_drain_node(enum drain_type drain_type,
     tjob->bs = src;
     job = &tjob->common;
 
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     block_job_add_bdrv(job, "target", target, 0, BLK_PERM_ALL, &error_abort);
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
 
     switch (result) {
     case TEST_JOB_SUCCESS:
@@ -955,13 +953,11 @@ static void bdrv_test_top_close(BlockDriverState *bs)
 {
     BdrvChild *c, *next_c;
 
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     QLIST_FOREACH_SAFE(c, &bs->children, next, next_c) {
         bdrv_unref_child(bs, c);
     }
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
 }
 
 static int coroutine_fn GRAPH_RDLOCK
@@ -1051,12 +1047,10 @@ static void do_test_delete_by_drain(bool detach_instead_of_delete,
 
     null_bs = bdrv_open("null-co://", NULL, NULL, BDRV_O_RDWR | BDRV_O_PROTOCOL,
                         &error_abort);
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     bdrv_attach_child(bs, null_bs, "null-child", &child_of_bds,
                       BDRV_CHILD_DATA, &error_abort);
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
 
     /* This child will be the one to pass to requests through to, and
      * it will stall until a drain occurs */
@@ -1064,25 +1058,21 @@ static void do_test_delete_by_drain(bool detach_instead_of_delete,
                                     &error_abort);
     child_bs->total_sectors = 65536 >> BDRV_SECTOR_BITS;
     /* Takes our reference to child_bs */
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     tts->wait_child = bdrv_attach_child(bs, child_bs, "wait-child",
                                         &child_of_bds,
                                         BDRV_CHILD_DATA | BDRV_CHILD_PRIMARY,
                                         &error_abort);
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
 
     /* This child is just there to be deleted
      * (for detach_instead_of_delete == true) */
     null_bs = bdrv_open("null-co://", NULL, NULL, BDRV_O_RDWR | BDRV_O_PROTOCOL,
                         &error_abort);
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     bdrv_attach_child(bs, null_bs, "null-child", &child_of_bds, BDRV_CHILD_DATA,
                       &error_abort);
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
 
     blk = blk_new(qemu_get_aio_context(), BLK_PERM_ALL, BLK_PERM_ALL);
     blk_insert_bs(blk, bs, &error_abort);
@@ -1165,16 +1155,14 @@ static void no_coroutine_fn detach_indirect_bh(void *opaque)
 
     bdrv_dec_in_flight(data->child_b->bs);
 
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     bdrv_unref_child(data->parent_b, data->child_b);
 
     bdrv_ref(data->c);
     data->child_c = bdrv_attach_child(data->parent_b, data->c, "PB-C",
                                       &child_of_bds, BDRV_CHILD_DATA,
                                       &error_abort);
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
 }
 
 static void coroutine_mixed_fn detach_by_parent_aio_cb(void *opaque, int ret)
@@ -1272,8 +1260,7 @@ static void TSA_NO_TSA test_detach_indirect(bool by_parent_cb)
     /* Set child relationships */
     bdrv_ref(b);
     bdrv_ref(a);
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     child_b = bdrv_attach_child(parent_b, b, "PB-B", &child_of_bds,
                                 BDRV_CHILD_DATA, &error_abort);
     child_a = bdrv_attach_child(parent_b, a, "PB-A", &child_of_bds,
@@ -1283,8 +1270,7 @@ static void TSA_NO_TSA test_detach_indirect(bool by_parent_cb)
     bdrv_attach_child(parent_a, a, "PA-A",
                       by_parent_cb ? &child_of_bds : &detach_by_driver_cb_class,
                       BDRV_CHILD_DATA, &error_abort);
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
 
     g_assert_cmpint(parent_a->refcnt, ==, 1);
     g_assert_cmpint(parent_b->refcnt, ==, 1);
@@ -1701,8 +1687,7 @@ static void test_drop_intermediate_poll(void)
      * Establish the chain last, so the chain links are the first
      * elements in the BDS.parents lists
      */
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     for (i = 0; i < 3; i++) {
         if (i) {
             /* Takes the reference to chain[i - 1] */
@@ -1710,8 +1695,7 @@ static void test_drop_intermediate_poll(void)
                               &chain_child_class, BDRV_CHILD_COW, &error_abort);
         }
     }
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
 
     job = block_job_create("job", &test_simple_job_driver, NULL, job_node,
                            0, BLK_PERM_ALL, 0, 0, NULL, NULL, &error_abort);
@@ -1958,12 +1942,10 @@ static void do_test_replace_child_mid_drain(int old_drain_count,
     new_child_bs->total_sectors = 1;
 
     bdrv_ref(old_child_bs);
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     bdrv_attach_child(parent_bs, old_child_bs, "child", &child_of_bds,
                       BDRV_CHILD_COW, &error_abort);
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
     parent_s->setup_completed = true;
 
     for (i = 0; i < old_drain_count; i++) {
@@ -1994,9 +1976,9 @@ static void do_test_replace_child_mid_drain(int old_drain_count,
     g_assert(parent_bs->quiesce_counter == old_drain_count);
     bdrv_drained_begin(old_child_bs);
     bdrv_drained_begin(new_child_bs);
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(false);
     bdrv_replace_node(old_child_bs, new_child_bs, &error_abort);
-    bdrv_graph_wrunlock();
+    bdrv_graph_wrunlock(false);
     bdrv_drained_end(new_child_bs);
     bdrv_drained_end(old_child_bs);
     g_assert(parent_bs->quiesce_counter == new_drain_count);
diff --git a/tests/unit/test-bdrv-graph-mod.c b/tests/unit/test-bdrv-graph-mod.c
index 7b03ebe4b0..f4961d3f09 100644
--- a/tests/unit/test-bdrv-graph-mod.c
+++ b/tests/unit/test-bdrv-graph-mod.c
@@ -137,12 +137,10 @@ static void test_update_perm_tree(void)
 
     blk_insert_bs(root, bs, &error_abort);
 
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     bdrv_attach_child(filter, bs, "child", &child_of_bds,
                       BDRV_CHILD_DATA, &error_abort);
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
 
     ret = bdrv_append(filter, bs, NULL);
     g_assert_cmpint(ret, <, 0);
@@ -206,13 +204,11 @@ static void test_should_update_child(void)
 
     bdrv_set_backing_hd(target, bs, &error_abort);
 
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     g_assert(target->backing->bs == bs);
     bdrv_attach_child(filter, target, "target", &child_of_bds,
                       BDRV_CHILD_DATA, &error_abort);
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
     bdrv_append(filter, bs, &error_abort);
 
     bdrv_graph_rdlock_main_loop();
@@ -248,8 +244,7 @@ static void test_parallel_exclusive_write(void)
     bdrv_ref(base);
     bdrv_ref(fl1);
 
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     bdrv_attach_child(top, fl1, "backing", &child_of_bds,
                       BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
                       &error_abort);
@@ -261,8 +256,7 @@ static void test_parallel_exclusive_write(void)
                       &error_abort);
 
     bdrv_replace_node(fl1, fl2, &error_abort);
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
 
     bdrv_drained_end(fl2);
     bdrv_drained_end(fl1);
@@ -369,8 +363,7 @@ static void test_parallel_perm_update(void)
      */
     bdrv_ref(base);
 
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     bdrv_attach_child(top, ws, "file", &child_of_bds, BDRV_CHILD_DATA,
                       &error_abort);
     c_fl1 = bdrv_attach_child(ws, fl1, "first", &child_of_bds,
@@ -383,8 +376,7 @@ static void test_parallel_perm_update(void)
     bdrv_attach_child(fl2, base, "backing", &child_of_bds,
                       BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
                       &error_abort);
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
 
     /* Select fl1 as first child to be active */
     s->selected = c_fl1;
@@ -438,13 +430,11 @@ static void test_append_greedy_filter(void)
     BlockDriverState *base = no_perm_node("base");
     BlockDriverState *fl = exclusive_writer_node("fl1");
 
-    bdrv_drain_all_begin();
-    bdrv_graph_wrlock();
+    bdrv_graph_wrlock(true);
     bdrv_attach_child(top, base, "backing", &child_of_bds,
                       BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
                       &error_abort);
-    bdrv_graph_wrunlock();
-    bdrv_drain_all_end();
+    bdrv_graph_wrunlock(true);
 
     bdrv_append(fl, base, &error_abort);
     bdrv_unref(fl);
-- 
2.39.5

* [PATCH 11/11] iotests/graph-changes-while-io: add test case with removal of lower snapshot
  2025-05-08 14:09 [RFC 00/11] do not drain while holding the graph lock Fiona Ebner
                   ` (9 preceding siblings ...)
  2025-05-08 14:09 ` [PATCH 10/11] block/graph-lock: add drain flag to bdrv_graph_wr{,un}lock Fiona Ebner
@ 2025-05-08 14:09 ` Fiona Ebner
  2025-05-14 20:14   ` Kevin Wolf
  10 siblings, 1 reply; 30+ messages in thread
From: Fiona Ebner @ 2025-05-08 14:09 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov

From: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>

This case catches a potential deadlock which takes place when job-dismiss
is issued while I/O requests are being processed in a separate iothread.

See https://mail.gnu.org/archive/html/qemu-devel/2025-04/msg04421.html
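
Abridged, the test repeatedly runs the following QMP sequence against
'node0' while qemu-img bench keeps submitting I/O (see the diff below
for the full commands):

    blockdev-add      node-name=snap1
    blockdev-snapshot node=node0 overlay=snap1
    blockdev-add      node-name=snap2
    blockdev-snapshot node=snap1 overlay=snap2
    block-commit      job-id=commit-snap1 top-node=snap1 base-node=node0
    job-dismiss       id=commit-snap1     <-- the deadlock-prone step
    block-commit      job-id=commit-snap2 top-node=snap2 base-node=node0
    job-complete      id=commit-snap2
    job-dismiss       id=commit-snap2
    blockdev-del      node-name=snap1
    blockdev-del      node-name=snap2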

Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
---
 .../qemu-iotests/tests/graph-changes-while-io | 101 ++++++++++++++++--
 .../tests/graph-changes-while-io.out          |   4 +-
 2 files changed, 96 insertions(+), 9 deletions(-)

diff --git a/tests/qemu-iotests/tests/graph-changes-while-io b/tests/qemu-iotests/tests/graph-changes-while-io
index 194fda500e..e30f823da4 100755
--- a/tests/qemu-iotests/tests/graph-changes-while-io
+++ b/tests/qemu-iotests/tests/graph-changes-while-io
@@ -27,6 +27,8 @@ from iotests import imgfmt, qemu_img, qemu_img_create, qemu_io, \
 
 
 top = os.path.join(iotests.test_dir, 'top.img')
+snap1 = os.path.join(iotests.test_dir, 'snap1.img')
+snap2 = os.path.join(iotests.test_dir, 'snap2.img')
 nbd_sock = os.path.join(iotests.sock_dir, 'nbd.sock')
 
 
@@ -58,6 +60,15 @@ class TestGraphChangesWhileIO(QMPTestCase):
     def tearDown(self) -> None:
         self.qsd.stop()
 
+    def _wait_for_blockjob(self, status) -> None:
+        done = False
+        while not done:
+            for event in self.qsd.get_qmp().get_events(wait=10.0):
+                if event['event'] != 'JOB_STATUS_CHANGE':
+                    continue
+                if event['data']['status'] == status:
+                    done = True
+
     def test_blockdev_add_while_io(self) -> None:
         # Run qemu-img bench in the background
         bench_thr = Thread(target=do_qemu_img_bench)
@@ -116,13 +127,89 @@ class TestGraphChangesWhileIO(QMPTestCase):
                 'device': 'job0',
             })
 
-            cancelled = False
-            while not cancelled:
-                for event in self.qsd.get_qmp().get_events(wait=10.0):
-                    if event['event'] != 'JOB_STATUS_CHANGE':
-                        continue
-                    if event['data']['status'] == 'null':
-                        cancelled = True
+            self._wait_for_blockjob('null')
+
+        bench_thr.join()
+
+    def test_remove_lower_snapshot_while_io(self) -> None:
+        # Run qemu-img bench in the background
+        bench_thr = Thread(target=do_qemu_img_bench, args=(100000, ))
+        bench_thr.start()
+
+        # While I/O is performed on 'node0', consecutively add two snapshots
+        # on top of it, then remove (commit) them starting from the lower one.
+        while bench_thr.is_alive():
+            # Recreate snapshot images on every iteration
+            qemu_img_create('-f', imgfmt, snap1, '1G')
+            qemu_img_create('-f', imgfmt, snap2, '1G')
+
+            self.qsd.cmd('blockdev-add', {
+                'driver': imgfmt,
+                'node-name': 'snap1',
+                'file': {
+                    'driver': 'file',
+                    'filename': snap1
+                }
+            })
+
+            self.qsd.cmd('blockdev-snapshot', {
+                'node': 'node0',
+                'overlay': 'snap1',
+            })
+
+            self.qsd.cmd('blockdev-add', {
+                'driver': imgfmt,
+                'node-name': 'snap2',
+                'file': {
+                    'driver': 'file',
+                    'filename': snap2
+                }
+            })
+
+            self.qsd.cmd('blockdev-snapshot', {
+                'node': 'snap1',
+                'overlay': 'snap2',
+            })
+
+            self.qsd.cmd('block-commit', {
+                'job-id': 'commit-snap1',
+                'device': 'snap2',
+                'top-node': 'snap1',
+                'base-node': 'node0',
+                'auto-finalize': True,
+                'auto-dismiss': False,
+            })
+
+            self._wait_for_blockjob('concluded')
+            self.qsd.cmd('job-dismiss', {
+                'id': 'commit-snap1',
+            })
+
+            self.qsd.cmd('block-commit', {
+                'job-id': 'commit-snap2',
+                'device': 'snap2',
+                'top-node': 'snap2',
+                'base-node': 'node0',
+                'auto-finalize': True,
+                'auto-dismiss': False,
+            })
+
+            self._wait_for_blockjob('ready')
+            self.qsd.cmd('job-complete', {
+                'id': 'commit-snap2',
+            })
+
+            self._wait_for_blockjob('concluded')
+            self.qsd.cmd('job-dismiss', {
+                'id': 'commit-snap2',
+            })
+
+            self.qsd.cmd('blockdev-del', {
+                'node-name': 'snap1'
+            })
+            self.qsd.cmd('blockdev-del', {
+                'node-name': 'snap2'
+            })
 
         bench_thr.join()
 
diff --git a/tests/qemu-iotests/tests/graph-changes-while-io.out b/tests/qemu-iotests/tests/graph-changes-while-io.out
index fbc63e62f8..8d7e996700 100644
--- a/tests/qemu-iotests/tests/graph-changes-while-io.out
+++ b/tests/qemu-iotests/tests/graph-changes-while-io.out
@@ -1,5 +1,5 @@
-..
+...
 ----------------------------------------------------------------------
-Ran 2 tests
+Ran 3 tests
 
 OK
-- 
2.39.5

* Re: [PATCH 01/11] block: remove outdated comments about AioContext locking
  2025-05-08 14:09 ` [PATCH 01/11] block: remove outdated comments about AioContext locking Fiona Ebner
@ 2025-05-14 16:22   ` Kevin Wolf
  0 siblings, 0 replies; 30+ messages in thread
From: Kevin Wolf @ 2025-05-14 16:22 UTC (permalink / raw)
  To: Fiona Ebner
  Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov

Am 08.05.2025 um 16:09 hat Fiona Ebner geschrieben:
> AioContext locking was removed in commit b49f4755c7 ("block: remove
> AioContext locking").
> 
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>

Reviewed-by: Kevin Wolf <kwolf@redhat.com>

* Re: [PATCH 02/11] block: move drain outside of read-locked bdrv_reopen_queue_child()
  2025-05-08 14:09 ` [PATCH 02/11] block: move drain outside of read-locked bdrv_reopen_queue_child() Fiona Ebner
@ 2025-05-14 16:36   ` Kevin Wolf
  2025-05-15 11:48     ` Fiona Ebner
  0 siblings, 1 reply; 30+ messages in thread
From: Kevin Wolf @ 2025-05-14 16:36 UTC (permalink / raw)
  To: Fiona Ebner
  Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov

Am 08.05.2025 um 16:09 hat Fiona Ebner geschrieben:
> This is in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED.
> 
> More granular draining is not trivially possible, because
> bdrv_reopen_queue_child() can recursively call itself.

Yeah, we would have to collect the affected nodes first and only then do
whatever the draining is needed for. And consider that draining could
invalidate the list of nodes so that we would have to start over...

Draining only selectively was probably premature optimisation anyway.
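
Just to illustrate the shape of the problem (a sketch only; all helper
names here are made up, nothing like this exists yet):

    /* hypothetical granular drain: collect, drain, revalidate */
    again:
        collect_affected_nodes(bs, &nodes);      /* made-up helper */
        drain_node_list(&nodes);                 /* polls; the graph may change */
        if (!node_list_unchanged(bs, &nodes)) {  /* made-up check */
            undrain_node_list(&nodes);
            goto again;
        }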

> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
>  block.c | 24 +++++++++++++++---------
>  1 file changed, 15 insertions(+), 9 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 33f78dbe1c..0085dbfa74 100644
> --- a/block.c
> +++ b/block.c
> @@ -4358,7 +4358,7 @@ bdrv_recurse_has_child(BlockDriverState *bs, BlockDriverState *child)
>   * returns a pointer to bs_queue, which is either the newly allocated
>   * bs_queue, or the existing bs_queue being used.
>   *
> - * bs is drained here and undrained by bdrv_reopen_queue_free().
> + * bs needs to be drained
>   */
>  static BlockReopenQueue * GRAPH_RDLOCK
>  bdrv_reopen_queue_child(BlockReopenQueue *bs_queue, BlockDriverState *bs,
> @@ -4368,6 +4368,7 @@ bdrv_reopen_queue_child(BlockReopenQueue *bs_queue, BlockDriverState *bs,
>                          bool keep_old_opts)
>  {
>      assert(bs != NULL);
> +    assert(qatomic_read(&bs->quiesce_counter) > 0);

BlockDriverState.quiesce_counter isn't accessed with atomics elsewhere.
Did you confuse it with BlockBackend.quiesce_counter?

>  
>      BlockReopenQueueEntry *bs_entry;
>      BdrvChild *child;
> @@ -4377,13 +4378,6 @@ bdrv_reopen_queue_child(BlockReopenQueue *bs_queue, BlockDriverState *bs,
>  
>      GLOBAL_STATE_CODE();
>  
> -    /*
> -     * Strictly speaking, draining is illegal under GRAPH_RDLOCK. We know that
> -     * we've been called with bdrv_graph_rdlock_main_loop(), though, so it's ok
> -     * in practice.
> -     */
> -    bdrv_drained_begin(bs);
> -
>      if (bs_queue == NULL) {
>          bs_queue = g_new0(BlockReopenQueue, 1);
>          QTAILQ_INIT(bs_queue);
> @@ -4522,6 +4516,14 @@ BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue *bs_queue,
>                                      QDict *options, bool keep_old_opts)
>  {
>      GLOBAL_STATE_CODE();
> +
> +    if (bs_queue == NULL) {
> +        /*
> +         * paired with bdrv_drain_all_end() in bdrv_reopen_queue_free().
> +         */
> +        bdrv_drain_all_begin();

Having this comment is a good idea. It fits on a single line, though.

> +    }
> +
>      GRAPH_RDLOCK_GUARD_MAINLOOP();
>  
>      return bdrv_reopen_queue_child(bs_queue, bs, options, NULL, 0, false,
> @@ -4534,12 +4536,16 @@ void bdrv_reopen_queue_free(BlockReopenQueue *bs_queue)
>      if (bs_queue) {
>          BlockReopenQueueEntry *bs_entry, *next;
>          QTAILQ_FOREACH_SAFE(bs_entry, bs_queue, entry, next) {
> -            bdrv_drained_end(bs_entry->state.bs);
>              qobject_unref(bs_entry->state.explicit_options);
>              qobject_unref(bs_entry->state.options);
>              g_free(bs_entry);
>          }
>          g_free(bs_queue);
> +
> +        /*
> +         * paired with bdrv_drain_all_begin() in bdrv_reopen_queue().
> +         */
> +        bdrv_drain_all_end();

This could be a single line comment, too.

Kevin




* Re: [PATCH 03/11] block/snapshot: move drain outside of read-locked bdrv_snapshot_delete()
  2025-05-08 14:09 ` [PATCH 03/11] block/snapshot: move drain outside of read-locked bdrv_snapshot_delete() Fiona Ebner
@ 2025-05-14 16:52   ` Kevin Wolf
  2025-05-15 12:03     ` Fiona Ebner
  0 siblings, 1 reply; 30+ messages in thread
From: Kevin Wolf @ 2025-05-14 16:52 UTC (permalink / raw)
  To: Fiona Ebner
  Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov

Am 08.05.2025 um 16:09 hat Fiona Ebner geschrieben:
> This is in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED.
> 
> More granular draining is not trivially possible, because
> bdrv_snapshot_delete() can recursively call itself.
> 
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
>  block/snapshot.c | 18 ++++++++++++------
>  blockdev.c       | 25 +++++++++++++++++--------
>  qemu-img.c       |  2 ++
>  3 files changed, 31 insertions(+), 14 deletions(-)
> 
> diff --git a/block/snapshot.c b/block/snapshot.c
> index 22567f1fb9..7788e1130b 100644
> --- a/block/snapshot.c
> +++ b/block/snapshot.c
> @@ -327,7 +327,7 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
>  
>  /**
>   * Delete an internal snapshot by @snapshot_id and @name.
> - * @bs: block device used in the operation
> + * @bs: block device used in the operation, needs to be drained

Forgot to add this piece of nitpicking on the previous patch: Other
places say "must be drained", which I slightly prefer because of how
RFC 2119 has "MUST", but not "NEEDS TO". Matter of taste, I guess, but
if you agree, we could change it for the non-RFC series.

>   * @snapshot_id: unique snapshot ID, or NULL
>   * @name: snapshot name, or NULL
>   * @errp: location to store error
> @@ -356,6 +356,8 @@ int bdrv_snapshot_delete(BlockDriverState *bs,
>      BlockDriverState *fallback_bs = bdrv_snapshot_fallback(bs);
>      int ret;
>  
> +    assert(qatomic_read(&bs->quiesce_counter) > 0);
> +
>      GLOBAL_STATE_CODE();
>  
>      if (!drv) {
> @@ -368,9 +370,6 @@ int bdrv_snapshot_delete(BlockDriverState *bs,
>          return -EINVAL;
>      }
>  
> -    /* drain all pending i/o before deleting snapshot */
> -    bdrv_drained_begin(bs);
> -
>      if (drv->bdrv_snapshot_delete) {
>          ret = drv->bdrv_snapshot_delete(bs, snapshot_id, name, errp);
>      } else if (fallback_bs) {
> @@ -382,7 +381,6 @@ int bdrv_snapshot_delete(BlockDriverState *bs,
>          ret = -ENOTSUP;
>      }
>  
> -    bdrv_drained_end(bs);
>      return ret;
>  }
>  
> @@ -573,9 +571,13 @@ int bdrv_all_delete_snapshot(const char *name,
>      GList *iterbdrvs;
>  
>      GLOBAL_STATE_CODE();
> -    GRAPH_RDLOCK_GUARD_MAINLOOP();
> +
> +    bdrv_drain_all_begin();
> +    bdrv_graph_rdlock_main_loop();
>  
>      if (bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp) < 0) {
> +        bdrv_graph_rdunlock_main_loop();
> +        bdrv_drain_all_end();
>          return -1;
>      }

I think this wants to be changed into:

    ret = bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp);
    if (ret < 0) {
        goto out;
    }

(Changing the return value from -1 to -errno is fine for the callers.)

> @@ -594,12 +596,16 @@ int bdrv_all_delete_snapshot(const char *name,
>          if (ret < 0) {
>              error_prepend(errp, "Could not delete snapshot '%s' on '%s': ",
>                            name, bdrv_get_device_or_node_name(bs));
> +            bdrv_graph_rdunlock_main_loop();
> +            bdrv_drain_all_end();
>              return -1;
>          }

Same here.

>  
>          iterbdrvs = iterbdrvs->next;
>      }
>  
> +    bdrv_graph_rdunlock_main_loop();
> +    bdrv_drain_all_end();
>      return 0;
>  }
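
So that both error paths jump to a shared cleanup label at the end,
roughly (sketch):

        ret = 0;
    out:
        bdrv_graph_rdunlock_main_loop();
        bdrv_drain_all_end();
        return ret;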

Kevin




* Re: [PATCH 04/11] block: drain while unlocked in bdrv_reopen_parse_file_or_backing()
  2025-05-08 14:09 ` [PATCH 04/11] block: drain while unlocked in bdrv_reopen_parse_file_or_backing() Fiona Ebner
@ 2025-05-14 16:59   ` Kevin Wolf
  0 siblings, 0 replies; 30+ messages in thread
From: Kevin Wolf @ 2025-05-14 16:59 UTC (permalink / raw)
  To: Fiona Ebner
  Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov

Am 08.05.2025 um 16:09 hat Fiona Ebner geschrieben:
> This is in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED.
> 
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
>  block.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 0085dbfa74..c7c26533c9 100644
> --- a/block.c
> +++ b/block.c
> @@ -4805,10 +4805,11 @@ bdrv_reopen_parse_file_or_backing(BDRVReopenState *reopen_state,
>  
>      if (old_child_bs) {
>          bdrv_ref(old_child_bs);
> +        bdrv_graph_rdunlock_main_loop();
>          bdrv_drained_begin(old_child_bs);
> +    } else {
> +        bdrv_graph_rdunlock_main_loop();
>      }

This pattern is a bit ugly, so good that you get rid of it again later
in the series.

> -
> -    bdrv_graph_rdunlock_main_loop();
>      bdrv_graph_wrlock();
>  
>      ret = bdrv_set_file_or_backing_noperm(bs, new_child_bs, is_backing,

Reviewed-by: Kevin Wolf <kwolf@redhat.com>




* Re: [PATCH 05/11] block: move drain outside of read-locked bdrv_inactivate_recurse()
  2025-05-08 14:09 ` [PATCH 05/11] block: move drain outside of read-locked bdrv_inactivate_recurse() Fiona Ebner
@ 2025-05-14 17:06   ` Kevin Wolf
  0 siblings, 0 replies; 30+ messages in thread
From: Kevin Wolf @ 2025-05-14 17:06 UTC (permalink / raw)
  To: Fiona Ebner
  Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov

Am 08.05.2025 um 16:09 hat Fiona Ebner geschrieben:
> This is in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED.
> 
> More granular draining is not trivially possible, because
> bdrv_inactivate_recurse() can recursively call itself.
> 
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
>  block.c | 19 +++++++++++++++----
>  1 file changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/block.c b/block.c
> index c7c26533c9..10052027cd 100644
> --- a/block.c
> +++ b/block.c
> @@ -6991,6 +6991,8 @@ bdrv_inactivate_recurse(BlockDriverState *bs, bool top_level)
>      int ret;
>      uint64_t cumulative_perms, cumulative_shared_perms;
>  
> +    assert(qatomic_read(&bs->quiesce_counter) > 0);
> +
>      GLOBAL_STATE_CODE();
>  
>      if (!bs->drv) {
> @@ -7036,9 +7038,7 @@ bdrv_inactivate_recurse(BlockDriverState *bs, bool top_level)
>          return -EPERM;
>      }
>  
> -    bdrv_drained_begin(bs);
>      bs->open_flags |= BDRV_O_INACTIVE;
> -    bdrv_drained_end(bs);
>  
>      /*
>       * Update permissions, they may differ for inactive nodes.
> @@ -7063,14 +7063,20 @@ int bdrv_inactivate(BlockDriverState *bs, Error **errp)
>      int ret;
>  
>      GLOBAL_STATE_CODE();
> -    GRAPH_RDLOCK_GUARD_MAINLOOP();
> +
> +    bdrv_drain_all_begin();
> +    bdrv_graph_rdlock_main_loop();
>  
>      if (bdrv_has_bds_parent(bs, true)) {
>          error_setg(errp, "Node has active parent node");
> +        bdrv_graph_rdunlock_main_loop();
> +        bdrv_drain_all_end();
>          return -EPERM;
>      }
>  
>      ret = bdrv_inactivate_recurse(bs, true);
> +    bdrv_graph_rdunlock_main_loop();
> +    bdrv_drain_all_end();
>      if (ret < 0) {
>          error_setg_errno(errp, -ret, "Failed to inactivate node");
>          return ret;

This might be another one that would become a bit clearer if you put the
unlock/drain_end at the end and use a goto out for the error cases.

Kevin




* Re: [PATCH 06/11] blockdev: drain while unlocked in internal_snapshot_action()
  2025-05-08 14:09 ` [PATCH 06/11] blockdev: drain while unlocked in internal_snapshot_action() Fiona Ebner
@ 2025-05-14 17:26   ` Kevin Wolf
  2025-05-19 12:37     ` Fiona Ebner
  0 siblings, 1 reply; 30+ messages in thread
From: Kevin Wolf @ 2025-05-14 17:26 UTC (permalink / raw)
  To: Fiona Ebner
  Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov

Am 08.05.2025 um 16:09 hat Fiona Ebner geschrieben:
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
> 
> Could the bs associated to the device change because of polling
> when draining? If yes, does that mean we need to drain all in the
> beginning and not temporarily unlock?

I'm starting to hate this pattern. :-)

Maybe it would eventually be a good idea to have some helpers to deal
with this common problem of "I want to drain all nodes that will be
affected by the operation, but draining could change which nodes are
part of it".

For this specific case, though, I think drain all is right simply
because we may already rely on it: I suppose qmp_transaction() calls
bdrv_drain_all() for a reason, but it should actually be a drain
section. I'm pretty sure we're not just waiting for some requests to
complete, but want to keep other users away. Maybe the graph lock
actually covers whatever this drain was initially supposed to solve, but
bdrv_drain_all() doesn't look right.

So how about making it bdrv_drain_all_begin/end() in qmp_transaction()
just because that's the safe thing to do, and then we can just drop any
individual drains in action implementations?
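
That is, roughly (sketch; where exactly the end goes depends on where
the transaction completes or aborts):

    bdrv_drain_all_begin();
    /* ... prepare and commit/abort all transaction actions ... */
    bdrv_drain_all_end();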

Kevin




* Re: [PATCH 08/11] block: mark bdrv_drained_begin() as GRAPH_UNLOCKED
  2025-05-08 14:09 ` [PATCH 08/11] block: mark bdrv_drained_begin() as GRAPH_UNLOCKED Fiona Ebner
@ 2025-05-14 17:50   ` Kevin Wolf
  0 siblings, 0 replies; 30+ messages in thread
From: Kevin Wolf @ 2025-05-14 17:50 UTC (permalink / raw)
  To: Fiona Ebner
  Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov

Am 08.05.2025 um 16:09 hat Fiona Ebner geschrieben:
> bdrv_drained_begin() polls and is not allowed to be called with the
> block graph lock held. Mark the function as such.
> 
> Suggested-by: Kevin Wolf <kwolf@redhat.com>
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
> 
> This does not catch the issue reported by Andrey, because there
> is a bdrv_graph_rdunlock_main_loop() before bdrv_drained_begin() in
> the function bdrv_change_aio_context(). That unlock is of course
> ineffective if the exclusive lock is held, but it seems to prevent TSA
> from finding the issue.

Yes, GRAPH_UNLOCKED is weak. It doesn't actually prove that the graph is
unlocked, but just flags direct callers that are known to hold the lock.
So because bdrv_try_change_aio_context() and bdrv_change_aio_context()
don't have TSA annotations, it misses the problem. If they had an
annotation, it would find the problem. (If they were GRAPH_UNLOCKED,
because they can't be called, or if they were GRAPH_RDLOCK/WRLOCK
because of double locking.)

TSA does have negative requirements that would prove it, but that would
essentially mean annotating everything in QEMU, even outside the block
layer, down to main() as GRAPH_UNLOCKED. I don't think that's practical.

Hmm, or maybe it wouldn't be that bad if we could have something like
assume_graph_unlocked()...? Not for this series anyway.
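
For illustration, this is the kind of caller TSA can't see through
(hypothetical function, just showing how TSA's one-level view works):

    /* No TSA annotation on this function, so TSA doesn't know whether
     * the lock is held here and can't flag the call below, even if
     * every caller actually holds the graph lock. */
    static void some_unannotated_caller(BlockDriverState *bs)
    {
        bdrv_drained_begin(bs);    /* GRAPH_UNLOCKED, but not caught */
    }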

> The next patch is concerned with that.
> 
>  include/block/block-io.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/include/block/block-io.h b/include/block/block-io.h
> index b49e0537dd..34b9f1cbfc 100644
> --- a/include/block/block-io.h
> +++ b/include/block/block-io.h
> @@ -429,7 +429,8 @@ bdrv_drain_poll(BlockDriverState *bs, BdrvChild *ignore_parent,
>   *
>   * This function can be recursive.
>   */
> -void bdrv_drained_begin(BlockDriverState *bs);
> +void GRAPH_UNLOCKED
> +bdrv_drained_begin(BlockDriverState *bs);

No need for the line break, it fits easily on a single line.

Reviewed-by: Kevin Wolf <kwolf@redhat.com>




* Re: [PATCH 09/11] block: move drain out of bdrv_change_aio_context()
  2025-05-08 14:09 ` [PATCH 09/11] block: move drain out of bdrv_change_aio_context() Fiona Ebner
@ 2025-05-14 19:45   ` Kevin Wolf
  0 siblings, 0 replies; 30+ messages in thread
From: Kevin Wolf @ 2025-05-14 19:45 UTC (permalink / raw)
  To: Fiona Ebner
  Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov

Am 08.05.2025 um 16:09 hat Fiona Ebner geschrieben:
> While bdrv_drained_begin() is now marked as GRAPH_UNLOCKED, TSA is not
> able to catch a remaining instance where the function is called with
> the graph lock held in bdrv_change_aio_context(). This is because the
> call is preceded by a bdrv_graph_rdunlock_main_loop(), but for callers
> that hold the exclusive lock that does not actually release the lock.

And releasing it wouldn't be correct anyway.

> In combination with block-stream, there is a deadlock that can happen
> because of this [0]. In particular, it can happen that
> main thread              IO thread
> 1. acquires write lock
>                          in blk_co_do_preadv_part():
>                          2. have non-zero blk->in_flight
>                          3. try to acquire read lock
> 4. begin drain
> 
> Steps 3 and 4 might be switched. Draining will poll and get stuck,
> because it will see the non-zero in_flight counter. But the IO thread
> will not make any progress either, because it cannot acquire the read
> lock.
> 
> More granular draining is not trivially possible, because
> bdrv_change_aio_context() can recursively call itself e.g. via
> bdrv_child_change_aio_context().
> 
> [0]: https://lore.kernel.org/qemu-devel/73839c04-7616-407e-b057-80ca69e63f51@virtuozzo.com/
> 
> Reported-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
>  block.c                            | 91 ++++++++++++++++++++----------
>  block/backup.c                     |  2 +
>  block/blklogwrites.c               |  4 ++
>  block/blkverify.c                  |  2 +
>  block/block-backend.c              |  4 ++
>  block/commit.c                     |  4 ++
>  block/mirror.c                     |  5 ++
>  block/qcow2.c                      |  2 +
>  block/quorum.c                     |  4 ++
>  block/replication.c                |  7 +++
>  block/snapshot.c                   |  2 +
>  block/stream.c                     | 10 ++--
>  block/vmdk.c                       | 10 ++++
>  blockdev.c                         | 17 ++++--
>  blockjob.c                         |  6 ++
>  include/block/block-global-state.h |  8 ++-
>  tests/unit/test-bdrv-drain.c       | 18 ++++++
>  tests/unit/test-bdrv-graph-mod.c   | 10 ++++
>  18 files changed, 164 insertions(+), 42 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 10052027cd..d84ac2f394 100644
> --- a/block.c
> +++ b/block.c
> @@ -1720,12 +1720,14 @@ bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv, const char *node_name,
>  open_failed:
>      bs->drv = NULL;
>  
> +    bdrv_drain_all_begin();
>      bdrv_graph_wrlock();
>      if (bs->file != NULL) {
>          bdrv_unref_child(bs, bs->file);
>          assert(!bs->file);
>      }
>      bdrv_graph_wrunlock();
> +    bdrv_drain_all_end();

I feel this patch is doing too much at once, pushing drain up three
levels to all callers of bdrv_unref_child().

Please split it, doing only one level per patch, i.e. the first patch
only marks bdrv_change_aio_context() GRAPH_RDLOCK, removes the double
locking and moves the drain (which becomes drain_all) to
bdrv_try_change_aio_context(). Then the next patch marks that one
GRAPH_RDLOCK, etc.

This will make it more obvious that each step is done correctly. For
example, checking this hunk I noticed that bdrv_unref_child() doesn't
document that child->bs must be drained, but doing everything at once,
it's not obvious if and which other functions have the same problem.

> @@ -7702,14 +7699,12 @@ static bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
>  }
>  
>  /*
> - * Change bs's and recursively all of its parents' and children's AioContext
> - * to the given new context, returning an error if that isn't possible.
> - *
> - * If ignore_child is not NULL, that child (and its subgraph) will not
> - * be touched.
> + * Use bdrv_try_change_aio_context() or bdrv_try_change_aio_context_locked().
>   */
> -int bdrv_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
> -                                BdrvChild *ignore_child, Error **errp)
> +static int bdrv_try_change_aio_context_common(BlockDriverState *bs,
> +                                              AioContext *ctx,
> +                                              BdrvChild *ignore_child,
> +                                              Error **errp)

This is exactly the same as bdrv_try_change_aio_context_locked(), so
this should be called bdrv_try_change_aio_context_locked() and the
wrapper below should go away.

Don't forget to add TSA annotations to functions whose interface you
touch (but I already mentioned this above).

>  {
>      Transaction *tran;
>      GHashTable *visited;

I'll defer review of the rest of the patch to the next version when it's
split. As it is, it's too hard to understand where each specific change
comes from. (That is, I could probably work it out for each individual
hunk, but I'd lose the big picture of what I'm really looking at and if
the changes are complete.)

Kevin




* Re: [PATCH 10/11] block/graph-lock: add drain flag to bdrv_graph_wr{,un}lock
  2025-05-08 14:09 ` [PATCH 10/11] block/graph-lock: add drain flag to bdrv_graph_wr{, un}lock Fiona Ebner
@ 2025-05-14 19:54   ` Kevin Wolf
  2025-05-19 12:10     ` Fiona Ebner
  0 siblings, 1 reply; 30+ messages in thread
From: Kevin Wolf @ 2025-05-14 19:54 UTC (permalink / raw)
  To: Fiona Ebner
  Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov

Am 08.05.2025 um 16:09 hat Fiona Ebner geschrieben:
> For convenience, allow indicating whether the write-locked section
> should also be a drained section directly when taking the write
> lock.
> 
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>

Good idea, the pattern seems to be common enough.

> In bdrv_graph_wrlock() there is a comment that it uses
> bdrv_drain_all_begin_nopoll() to make sure that constantly arriving
> new I/O doesn't cause starvation. The changes from this series are at
> odds with that, but there doesn't seem to be any (new) test failures.

I don't see why they are at odds with it? Draining an already drained
node isn't a problem, it just increases the counter without doing
anything else.
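
That is, nested drains just count (sketch of the semantics):

    bdrv_drained_begin(bs);    /* quiesce_counter 0 -> 1, actually quiesces */
    bdrv_drained_begin(bs);    /* 1 -> 2, otherwise a no-op */
    bdrv_drained_end(bs);      /* 2 -> 1 */
    bdrv_drained_end(bs);      /* 1 -> 0, I/O resumes */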

> The following script was used:
> 
> > find . -name '*.c' -exec sed -i -z 's/bdrv_drain_all_begin();\n\s*bdrv_graph_wrlock();/bdrv_graph_wrlock(true);/g' {} ';'
> > find . -name '*.c' -exec sed -i -z 's/bdrv_graph_wrunlock();\n\s*bdrv_drain_all_end();/bdrv_graph_wrunlock(true);/g' {} ';'
> > find . -name '*.c' -exec sed -i -z 's/bdrv_graph_wrlock();/bdrv_graph_wrlock(false);/g' {} ';'
> > find . -name '*.c' -exec sed -i -z 's/bdrv_graph_wrunlock();/bdrv_graph_wrunlock(false);/g' {} ';'
> 
>  block.c                            | 62 ++++++++++++------------------
>  block/backup.c                     |  6 +--
>  block/blklogwrites.c               | 12 ++----
>  block/blkverify.c                  |  6 +--
>  block/block-backend.c              | 12 ++----
>  block/commit.c                     | 20 ++++------
>  block/graph-lock.c                 | 22 ++++++++---
>  block/mirror.c                     | 27 ++++++-------
>  block/qcow2.c                      |  6 +--
>  block/quorum.c                     | 12 ++----
>  block/replication.c                | 21 ++++------
>  block/snapshot.c                   |  6 +--
>  block/stream.c                     | 16 +++-----
>  block/vmdk.c                       | 30 +++++----------
>  blockdev.c                         | 10 ++---
>  blockjob.c                         | 18 +++------
>  include/block/graph-lock.h         |  8 +++-
>  scripts/block-coroutine-wrapper.py |  4 +-
>  tests/unit/test-bdrv-drain.c       | 58 ++++++++++------------------
>  tests/unit/test-bdrv-graph-mod.c   | 30 +++++----------
>  20 files changed, 152 insertions(+), 234 deletions(-)

> diff --git a/include/block/graph-lock.h b/include/block/graph-lock.h
> index 2c26c72108..f291ccbc97 100644
> --- a/include/block/graph-lock.h
> +++ b/include/block/graph-lock.h
> @@ -108,17 +108,21 @@ void unregister_aiocontext(AioContext *ctx);
>   *
>   * The wrlock can only be taken from the main loop, with BQL held, as only the
>   * main loop is allowed to modify the graph.
> + *
> + * @drain whether bdrv_drain_all_begin() should be called before taking the lock
>   */
>  void no_coroutine_fn TSA_ACQUIRE(graph_lock) TSA_NO_TSA
> -bdrv_graph_wrlock(void);
> +bdrv_graph_wrlock(bool drain);

I would prefer having two separate functions instead of a bool
parameter.

bdrv_graph_wrlock() could stay as it is, and bdrv_graph_wrlock_drained()
could be the convenience wrapper that drains first.
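
Something like (sketch):

    void bdrv_graph_wrlock_drained(void)
    {
        bdrv_drain_all_begin();
        bdrv_graph_wrlock();
    }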

>  /*
>   * bdrv_graph_wrunlock:
>   * Write finished, reset global has_writer to 0 and restart
>   * all readers that are waiting.
> + *
> + * @drain whether bdrv_drain_all_end() should be called after releasing the lock
>   */
>  void no_coroutine_fn TSA_RELEASE(graph_lock) TSA_NO_TSA
> -bdrv_graph_wrunlock(void);
> +bdrv_graph_wrunlock(bool drain);

Here I would prefer to only keep the old bdrv_graph_wrunlock() without
a parameter. Can we just remember @drain from bdrv_graph_wrlock() in a
global variable? This would prevent callers from mismatching lock and
unlock variants (which TSA wouldn't be able to catch).
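
Continuing the sketch from above, with a hypothetical global flag (a
plain global is enough because the wrlock is only ever taken from the
main loop with the BQL held):

    static bool wrlock_drained;    /* hypothetical, sketch only */

    void bdrv_graph_wrlock_drained(void)
    {
        bdrv_drain_all_begin();
        wrlock_drained = true;
        bdrv_graph_wrlock();
    }

    void bdrv_graph_wrunlock(void)
    {
        /* ... existing unlock logic ... */
        if (wrlock_drained) {
            wrlock_drained = false;
            bdrv_drain_all_end();
        }
    }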

Kevin




* Re: [PATCH 11/11] iotests/graph-changes-while-io: add test case with removal of lower snapshot
  2025-05-08 14:09 ` [PATCH 11/11] iotests/graph-changes-while-io: add test case with removal of lower snapshot Fiona Ebner
@ 2025-05-14 20:14   ` Kevin Wolf
  0 siblings, 0 replies; 30+ messages in thread
From: Kevin Wolf @ 2025-05-14 20:14 UTC (permalink / raw)
  To: Fiona Ebner
  Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov

Am 08.05.2025 um 16:09 hat Fiona Ebner geschrieben:
> From: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
> 
> This case is catching potential deadlock which takes place when job-dismiss
> is issued when I/O requests are processed in a separate iothread.
> 
> See https://mail.gnu.org/archive/html/qemu-devel/2025-04/msg04421.html
> 
> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>

Keeping Andrey in From: is fine, but it does need your S-o-b if you
submit the patch.

> diff --git a/tests/qemu-iotests/tests/graph-changes-while-io b/tests/qemu-iotests/tests/graph-changes-while-io
> index 194fda500e..e30f823da4 100755
> --- a/tests/qemu-iotests/tests/graph-changes-while-io
> +++ b/tests/qemu-iotests/tests/graph-changes-while-io
> @@ -27,6 +27,8 @@ from iotests import imgfmt, qemu_img, qemu_img_create, qemu_io, \
>  
>  
>  top = os.path.join(iotests.test_dir, 'top.img')
> +snap1 = os.path.join(iotests.test_dir, 'snap1.img')
> +snap2 = os.path.join(iotests.test_dir, 'snap2.img')
>  nbd_sock = os.path.join(iotests.sock_dir, 'nbd.sock')

top could be reused for snap2, it serves the same purpose in the
existing tests. Maybe mid for snap1 would make more sense then.

The only other problem I can see is that the test forgets to remove the
overlay image files (a preexisting issue for top).

It did reproduce the problem on the first attempt, though, so:
Tested-by: Kevin Wolf <kwolf@redhat.com>




* Re: [PATCH 02/11] block: move drain outside of read-locked bdrv_reopen_queue_child()
  2025-05-14 16:36   ` Kevin Wolf
@ 2025-05-15 11:48     ` Fiona Ebner
  2025-05-15 12:28       ` Kevin Wolf
  0 siblings, 1 reply; 30+ messages in thread
From: Fiona Ebner @ 2025-05-15 11:48 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov

Am 14.05.25 um 18:36 schrieb Kevin Wolf:
> Am 08.05.2025 um 16:09 hat Fiona Ebner geschrieben:
>> @@ -4368,6 +4368,7 @@ bdrv_reopen_queue_child(BlockReopenQueue *bs_queue, BlockDriverState *bs,
>>                          bool keep_old_opts)
>>  {
>>      assert(bs != NULL);
>> +    assert(qatomic_read(&bs->quiesce_counter) > 0);
> 
> BlockDriverState.quiesce_counter isn't accessed with atomics elsewhere.
> Did you confuse it with BlockBackend.quiesce_counter?

No, but I saw that it is modified via qatomic_fetch_inc/dec(). And those
modifications are in bdrv_do_drained_begin/end() which are
IO_OR_GS_CODE(). So isn't it more correct to read via atomics here?

The documentation in include/block/block_int-common.h for struct
BlockDriverState also states:
>     /* Accessed with atomic ops.  */
>     int quiesce_counter;

Should I rather add a patch to have the other readers use atomics too?

>> @@ -4522,6 +4516,14 @@ BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue *bs_queue,
>>                                      QDict *options, bool keep_old_opts)
>>  {
>>      GLOBAL_STATE_CODE();
>> +
>> +    if (bs_queue == NULL) {
>> +        /*
>> +         * paired with bdrv_drain_all_end() in bdrv_reopen_queue_free().
>> +         */
>> +        bdrv_drain_all_begin();
> 
> Having this comment is a good idea. It fits on a single line, though.

Ack.

> 
>> +    }
>> +
>>      GRAPH_RDLOCK_GUARD_MAINLOOP();
>>  
>>      return bdrv_reopen_queue_child(bs_queue, bs, options, NULL, 0, false,
>> @@ -4534,12 +4536,16 @@ void bdrv_reopen_queue_free(BlockReopenQueue *bs_queue)
>>      if (bs_queue) {
>>          BlockReopenQueueEntry *bs_entry, *next;
>>          QTAILQ_FOREACH_SAFE(bs_entry, bs_queue, entry, next) {
>> -            bdrv_drained_end(bs_entry->state.bs);
>>              qobject_unref(bs_entry->state.explicit_options);
>>              qobject_unref(bs_entry->state.options);
>>              g_free(bs_entry);
>>          }
>>          g_free(bs_queue);
>> +
>> +        /*
>> +         * paired with bdrv_drain_all_begin() in bdrv_reopen_queue().
>> +         */
>> +        bdrv_drain_all_end();
> 
> This could be a single line comment, too.

Ack.

Best Regards,
Fiona




* Re: [PATCH 03/11] block/snapshot: move drain outside of read-locked bdrv_snapshot_delete()
  2025-05-14 16:52   ` Kevin Wolf
@ 2025-05-15 12:03     ` Fiona Ebner
  0 siblings, 0 replies; 30+ messages in thread
From: Fiona Ebner @ 2025-05-15 12:03 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov

Am 14.05.25 um 18:52 schrieb Kevin Wolf:
> Am 08.05.2025 um 16:09 hat Fiona Ebner geschrieben:
>> This is in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED.
>>
>> More granular draining is not trivially possible, because
>> bdrv_snapshot_delete() can recursively call itself.
>>
>> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
>> ---
>>  block/snapshot.c | 18 ++++++++++++------
>>  blockdev.c       | 25 +++++++++++++++++--------
>>  qemu-img.c       |  2 ++
>>  3 files changed, 31 insertions(+), 14 deletions(-)
>>
>> diff --git a/block/snapshot.c b/block/snapshot.c
>> index 22567f1fb9..7788e1130b 100644
>> --- a/block/snapshot.c
>> +++ b/block/snapshot.c
>> @@ -327,7 +327,7 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
>>  
>>  /**
>>   * Delete an internal snapshot by @snapshot_id and @name.
>> - * @bs: block device used in the operation
>> + * @bs: block device used in the operation, needs to be drained
> 
> Forgot to add this piece of nitpicking on the previous patch: Other
> places say "must be drained", which I slightly prefer because of how
> RFC 2119 has "MUST", but not "NEEDS TO". Matter of taste, I guess, but
> if you agree, we could change it for the non-RFC series.

Sure, will do!

>> @@ -573,9 +571,13 @@ int bdrv_all_delete_snapshot(const char *name,
>>      GList *iterbdrvs;
>>  
>>      GLOBAL_STATE_CODE();
>> -    GRAPH_RDLOCK_GUARD_MAINLOOP();
>> +
>> +    bdrv_drain_all_begin();
>> +    bdrv_graph_rdlock_main_loop();
>>  
>>      if (bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp) < 0) {
>> +        bdrv_graph_rdunlock_main_loop();
>> +        bdrv_drain_all_end();
>>          return -1;
>>      }
> 
> I think this wants to be changed into:
> 
>     ret = bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp);
>     if (ret < 0) {
>         goto out;
>     }
> 
> (Changing the return value from -1 to -errno is fine for the callers.)
> 
>> @@ -594,12 +596,16 @@ int bdrv_all_delete_snapshot(const char *name,
>>          if (ret < 0) {
>>              error_prepend(errp, "Could not delete snapshot '%s' on '%s': ",
>>                            name, bdrv_get_device_or_node_name(bs));
>> +            bdrv_graph_rdunlock_main_loop();
>> +            bdrv_drain_all_end();
>>              return -1;
>>          }
> 
> Same here.

Ack.

Best Regards,
Fiona




* Re: [PATCH 02/11] block: move drain outside of read-locked bdrv_reopen_queue_child()
  2025-05-15 11:48     ` Fiona Ebner
@ 2025-05-15 12:28       ` Kevin Wolf
  2025-05-15 13:41         ` Fiona Ebner
  0 siblings, 1 reply; 30+ messages in thread
From: Kevin Wolf @ 2025-05-15 12:28 UTC (permalink / raw)
  To: Fiona Ebner
  Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov, pbonzini

Am 15.05.2025 um 13:48 hat Fiona Ebner geschrieben:
> Am 14.05.25 um 18:36 schrieb Kevin Wolf:
> > Am 08.05.2025 um 16:09 hat Fiona Ebner geschrieben:
> >> @@ -4368,6 +4368,7 @@ bdrv_reopen_queue_child(BlockReopenQueue *bs_queue, BlockDriverState *bs,
> >>                          bool keep_old_opts)
> >>  {
> >>      assert(bs != NULL);
> >> +    assert(qatomic_read(&bs->quiesce_counter) > 0);
> > 
> > BlockDriverState.quiesce_counter isn't accessed with atomics elsewhere.
> > Did you confuse it with BlockBackend.quiesce_counter?
> 
> No, but I saw that it is modified via qatomic_fetch_inc/dec(). And those
> modifications are in bdrv_do_drained_begin/end() which are
> IO_OR_GS_CODE(). So isn't it more correct to read via atomics here?

Aha, I missed these two places. Looks like Paolo's commit 414c2ec wasn't
very thorough with converting.

The commit message is also empty, so I don't know why we made this
change. Both places are GLOBAL_STATE_CODE(), so I don't think we
actually need atomics to synchronise these two places. Maybe there are
other accesses in iothreads, but then those should have been using
atomics, too.

> The documentation in include/block/block_int-common.h for struct
> BlockDriverState also states:
> >     /* Accessed with atomic ops.  */
> >     int quiesce_counter;
> 
> Should I rather add a patch to have the other readers use atomics too?

Either all accesses should use atomics or none of them. I'm not
completely sure which way is the right one. Using atomics everywhere is
the safe option, but I'm not sure if we ever access quiesce_counter
outside of the main thread.

Kevin




* Re: [PATCH 02/11] block: move drain outside of read-locked bdrv_reopen_queue_child()
  2025-05-15 12:28       ` Kevin Wolf
@ 2025-05-15 13:41         ` Fiona Ebner
  0 siblings, 0 replies; 30+ messages in thread
From: Fiona Ebner @ 2025-05-15 13:41 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov, pbonzini

Am 15.05.25 um 14:28 schrieb Kevin Wolf:
> Am 15.05.2025 um 13:48 hat Fiona Ebner geschrieben:
>> Am 14.05.25 um 18:36 schrieb Kevin Wolf:
>>> Am 08.05.2025 um 16:09 hat Fiona Ebner geschrieben:
>>>> @@ -4368,6 +4368,7 @@ bdrv_reopen_queue_child(BlockReopenQueue *bs_queue, BlockDriverState *bs,
>>>>                          bool keep_old_opts)
>>>>  {
>>>>      assert(bs != NULL);
>>>> +    assert(qatomic_read(&bs->quiesce_counter) > 0);
>>>
>>> BlockDriverState.quiesce_counter isn't accessed with atomics elsewhere.
>>> Did you confuse it with BlockBackend.quiesce_counter?
>>
>> No, but I saw that it is modified via qatomic_fetch_inc/dec(). And those
>> modifications are in bdrv_do_drained_begin/end() which are
>> IO_OR_GS_CODE(). So isn't it more correct to read via atomics here?
> 
> Aha, I missed these two places. Looks like Paolo's commit 414c2ec wasn't
> very thorough with converting.
> 
> The commit message is also empty, so I don't know why we made this
> change. Both places are GLOBAL_STATE_CODE(), so I don't think we
> actually need atomics to synchronise these two places. Maybe there are
> other accesses in iothreads, but then those should have been using
> atomics, too.

AFAICT, all accesses are either in a function with GLOBAL_STATE_CODE()
directly or in a function with GRAPH_WRLOCK (in the cases of
bdrv_replace_child_tran() and bdrv_replace_child_noperm()).

This does not change for the new accesses added by the series here.

Back then, there was no GLOBAL_STATE_CODE() macro yet, so I suppose this
wasn't easy to verify.

>> The documentation in include/block/block_int-common.h for struct
>> BlockDriverState also states:
>>>     /* Accessed with atomic ops.  */
>>>     int quiesce_counter;
>>
>> Should I rather add a patch to have the other readers use atomics too?
> 
> Either all accesses should use atomics or none of them. I'm not
> completely sure which way is the right one. Using atomics everywhere is
> the safe option, but I'm not sure if we ever access quiesce_counter
> outside of the main thread.

I'd add a patch dropping the atomic accesses and document that
quiesce_counter can only be accessed in the main thread.
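
I.e. something like this for include/block/block_int-common.h (sketch):

    /* Only accessed from the main thread with the BQL held (or under
     * the graph write lock). No atomics needed. */
    int quiesce_counter;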

Best Regards,
Fiona




* Re: [PATCH 10/11] block/graph-lock: add drain flag to bdrv_graph_wr{,un}lock
  2025-05-14 19:54   ` [PATCH 10/11] block/graph-lock: add drain flag to bdrv_graph_wr{,un}lock Kevin Wolf
@ 2025-05-19 12:10     ` Fiona Ebner
  2025-05-20  6:09       ` Fiona Ebner
  0 siblings, 1 reply; 30+ messages in thread
From: Fiona Ebner @ 2025-05-19 12:10 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov

Am 14.05.25 um 21:54 schrieb Kevin Wolf:
> Am 08.05.2025 um 16:09 hat Fiona Ebner geschrieben:
>> In bdrv_graph_wrlock() there is a comment that it uses
>> bdrv_drain_all_begin_nopoll() to make sure that constantly arriving
>> new I/O doesn't cause starvation. The changes from this series are at
>> odds with that, but there doesn't seem to be any (new) test failures.
> 
> I don't see why they are at odds with it? Draining an already drained
> node isn't a problem, it just increases the counter without doing
> anything else.

What I mean is: the introduction of calls to bdrv_drain_all_begin()
before bdrv_drain_all_begin_nopoll() could introduce potential for
starvation when new I/O is constantly arriving. Or is this not true?

>> diff --git a/include/block/graph-lock.h b/include/block/graph-lock.h
>> index 2c26c72108..f291ccbc97 100644
>> --- a/include/block/graph-lock.h
>> +++ b/include/block/graph-lock.h
>> @@ -108,17 +108,21 @@ void unregister_aiocontext(AioContext *ctx);
>>   *
>>   * The wrlock can only be taken from the main loop, with BQL held, as only the
>>   * main loop is allowed to modify the graph.
>> + *
>> + * @drain whether bdrv_drain_all_begin() should be called before taking the lock
>>   */
>>  void no_coroutine_fn TSA_ACQUIRE(graph_lock) TSA_NO_TSA
>> -bdrv_graph_wrlock(void);
>> +bdrv_graph_wrlock(bool drain);
> 
> I would prefer having two separate functions instead of a bool
> parameter.
> 
> bdrv_graph_wrlock() could stay as it is, and bdrv_graph_wrlock_drained()
> could be the convenience wrapper that drains first.

Will do!

>>  /*
>>   * bdrv_graph_wrunlock:
>>   * Write finished, reset global has_writer to 0 and restart
>>   * all readers that are waiting.
>> + *
>> + * @drain whether bdrv_drain_all_end() should be called after releasing the lock
>>   */
>>  void no_coroutine_fn TSA_RELEASE(graph_lock) TSA_NO_TSA
>> -bdrv_graph_wrunlock(void);
>> +bdrv_graph_wrunlock(bool drain);
> 
> Here I would prefer to only keep the old bdrv_graph_wrunlock() without
> a parameter. Can we just remember @drain from bdrv_graph_wrlock() in a
> global variable? This would prevent callers from mismatching lock and
> unlock variants (which TSA wouldn't be able to catch).

Okay.

Best Regards,
Fiona




* Re: [PATCH 06/11] blockdev: drain while unlocked in internal_snapshot_action()
  2025-05-14 17:26   ` Kevin Wolf
@ 2025-05-19 12:37     ` Fiona Ebner
  0 siblings, 0 replies; 30+ messages in thread
From: Fiona Ebner @ 2025-05-19 12:37 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov

Am 14.05.25 um 19:26 schrieb Kevin Wolf:
> Am 08.05.2025 um 16:09 hat Fiona Ebner geschrieben:
>> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
>> ---
>>
>> Could the bs associated to the device change because of polling
>> when draining? If yes, does that mean we need to drain all in the
>> beginning and not temporarily unlock?
> 
> I'm starting to hate this pattern. :-)
> 
> Maybe it would eventually be a good idea to have some helpers to deal
> with this common problem of "I want to drain all nodes that will be
> affected by the operation, but draining could change which nodes are
> part of it".

Yes, that would be nice :)

> For this specific case, though, I think drain all is right simply
> because we may already rely on it: I suppose qmp_transaction() calls
> bdrv_drain_all() for a reason, but it should actually be a drain
> section. I'm pretty sure we're not just waiting for some requests to
> complete, but want to keep other users away. Maybe the graph lock
> actually covers whatever this drain was initially supposed to solve, but
> bdrv_drain_all() doesn't look right.
>
> So how about making it bdrv_drain_all_begin/end() in qmp_transaction()
> just because that's the safe thing to do, and then we can just drop any
> individual drains in action implementations?

Unfortunately, this does not work on its own. Among others, iotest 096
for qcow2 will hang, because bdrv_img_create() is called by
external_snapshot_action() while drained, and then the qcow2_co_create()
coroutine will hang in
blk_co_do_pwritev_part()/blk_wait_while_drained(). So it seems having
the more targeted drain here is important.

Would it be sensible to go with the "temporarily unlock" solution, but
adding an extra check that the root bs of the device is still the one we
drained after re-locking? And returning an error if not. At least for
now, because the series is already getting quite big (in particular with
the split of 09/11).
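
Roughly like this (a sketch only; using the device's BlockBackend just
for illustration, error handling abbreviated):

    bs = blk_bs(blk);
    bdrv_ref(bs);
    bdrv_graph_rdunlock_main_loop();
    bdrv_drained_begin(bs);            /* polls, the graph may change */
    bdrv_graph_rdlock_main_loop();
    if (blk_bs(blk) != bs) {
        error_setg(errp, "block node of device changed during drain");
        goto out;                      /* undrain, unref, fail the action */
    }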

Best Regards,
Fiona




* Re: [PATCH 10/11] block/graph-lock: add drain flag to bdrv_graph_wr{,un}lock
  2025-05-19 12:10     ` Fiona Ebner
@ 2025-05-20  6:09       ` Fiona Ebner
  2025-05-20  8:42         ` Kevin Wolf
  0 siblings, 1 reply; 30+ messages in thread
From: Fiona Ebner @ 2025-05-20  6:09 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov

On 19.05.25 2:10 PM, Fiona Ebner wrote:
> Am 14.05.25 um 21:54 schrieb Kevin Wolf:
>> Am 08.05.2025 um 16:09 hat Fiona Ebner geschrieben:
>>> In bdrv_graph_wrlock() there is a comment that it uses
>>> bdrv_drain_all_begin_nopoll() to make sure that constantly arriving
>>> new I/O doesn't cause starvation. The changes from this series are at
>>> odds with that, but there doesn't seem to be any (new) test failures.
>>
>> I don't see why they are at odds with it? Draining an already drained
>> node isn't a problem, it just increases the counter without doing
>> anything else.
> 
> What I mean is: the introduction of calls to bdrv_drain_all_begin()
> before bdrv_drain_all_begin_nopoll() could introduce potential for
> starvation when new I/O is constantly arriving. Or is this not true?

Oh, I guess I know why I was confused now: I thought the comment is the
rationale for why the _nopoll variant is used, but the comment is the
rationale for the draining itself :)

Best Regards,
Fiona




* Re: [PATCH 10/11] block/graph-lock: add drain flag to bdrv_graph_wr{,un}lock
  2025-05-20  6:09       ` Fiona Ebner
@ 2025-05-20  8:42         ` Kevin Wolf
  0 siblings, 0 replies; 30+ messages in thread
From: Kevin Wolf @ 2025-05-20  8:42 UTC (permalink / raw)
  To: Fiona Ebner
  Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
	eblake, jsnow, vsementsov

Am 20.05.2025 um 08:09 hat Fiona Ebner geschrieben:
> On 19.05.25 2:10 PM, Fiona Ebner wrote:
> > Am 14.05.25 um 21:54 schrieb Kevin Wolf:
> >> Am 08.05.2025 um 16:09 hat Fiona Ebner geschrieben:
> >>> In bdrv_graph_wrlock() there is a comment that it uses
> >>> bdrv_drain_all_begin_nopoll() to make sure that constantly arriving
> >>> new I/O doesn't cause starvation. The changes from this series are at
> >>> odds with that, but there doesn't seem to be any (new) test failures.
> >>
> >> I don't see why they are at odds with it? Draining an already drained
> >> node isn't a problem, it just increases the counter without doing
> >> anything else.
> > 
> > What I mean is: the introduction of calls to bdrv_drain_all_begin()
> > before bdrv_drain_all_begin_nopoll() could introduce potential for
> > starvation when there is constantly arriving new I/O. Or is this not true?
> 
> Oh, I guess I know why I was confused now: I thought the comment is the
> rationale for why the _nopoll variant is used, but the comment is the
> rationale for the draining itself :)

Ah, yes, it is! Or for both together, but in the sense of why nopoll is
good enough, not why it does something that the normal drain wouldn't
do. The nopoll variant is used simply because we don't really need to
have all requests drained; it's good enough if no new requests are coming
in.
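
That is, the difference is just (sketch):

    bdrv_drain_all_begin_nopoll();    /* only keep new requests from coming in */
    /* vs. */
    bdrv_drain_all_begin();           /* additionally poll until all in-flight
                                       * requests have completed */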

Kevin



