* [PATCH v2 01/24] block: remove outdated comments about AioContext locking
From: Fiona Ebner @ 2025-05-20 10:29 UTC
To: qemu-block
Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
AioContext locking was removed in commit b49f4755c7 ("block: remove
AioContext locking").
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
Changes in v2:
* Remove yet another one.
block.c | 7 -------
1 file changed, 7 deletions(-)
diff --git a/block.c b/block.c
index f222e1a50a..a5399888ba 100644
--- a/block.c
+++ b/block.c
@@ -4359,8 +4359,6 @@ bdrv_recurse_has_child(BlockDriverState *bs, BlockDriverState *child)
* bs_queue, or the existing bs_queue being used.
*
* bs is drained here and undrained by bdrv_reopen_queue_free().
- *
- * To be called with bs->aio_context locked.
*/
static BlockReopenQueue * GRAPH_RDLOCK
bdrv_reopen_queue_child(BlockReopenQueue *bs_queue, BlockDriverState *bs,
@@ -4519,7 +4517,6 @@ bdrv_reopen_queue_child(BlockReopenQueue *bs_queue, BlockDriverState *bs,
return bs_queue;
}
-/* To be called with bs->aio_context locked */
BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue *bs_queue,
BlockDriverState *bs,
QDict *options, bool keep_old_opts)
@@ -7278,10 +7275,6 @@ bool bdrv_op_blocker_is_empty(BlockDriverState *bs)
return true;
}
-/*
- * Must not be called while holding the lock of an AioContext other than the
- * current one.
- */
void bdrv_img_create(const char *filename, const char *fmt,
const char *base_filename, const char *base_fmt,
char *options, uint64_t img_size, int flags, bool quiet,
--
2.39.5
* Re: [PATCH v2 01/24] block: remove outdated comments about AioContext locking
From: Kevin Wolf @ 2025-05-21 9:41 UTC
To: Fiona Ebner
Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
On 20.05.2025 at 12:29, Fiona Ebner wrote:
> AioContext locking was removed in commit b49f4755c7 ("block: remove
> AioContext locking").
>
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
* [PATCH v2 02/24] block: move drain outside of read-locked bdrv_reopen_queue_child()
From: Fiona Ebner @ 2025-05-20 10:29 UTC
To: qemu-block
Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
This is in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED.
More granular draining is not trivially possible, because
bdrv_reopen_queue_child() can recursively call itself.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
Changes in v2:
* Use 'must be drained' instead of 'needs to be drained'.
* Use single-line comments when the text fits in a single line.
* Don't use atomics to access bs->quiesce_counter.
block.c | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)
diff --git a/block.c b/block.c
index a5399888ba..3065d5c91e 100644
--- a/block.c
+++ b/block.c
@@ -4358,7 +4358,7 @@ bdrv_recurse_has_child(BlockDriverState *bs, BlockDriverState *child)
* returns a pointer to bs_queue, which is either the newly allocated
* bs_queue, or the existing bs_queue being used.
*
- * bs is drained here and undrained by bdrv_reopen_queue_free().
+ * bs must be drained.
*/
static BlockReopenQueue * GRAPH_RDLOCK
bdrv_reopen_queue_child(BlockReopenQueue *bs_queue, BlockDriverState *bs,
@@ -4377,12 +4377,7 @@ bdrv_reopen_queue_child(BlockReopenQueue *bs_queue, BlockDriverState *bs,
GLOBAL_STATE_CODE();
- /*
- * Strictly speaking, draining is illegal under GRAPH_RDLOCK. We know that
- * we've been called with bdrv_graph_rdlock_main_loop(), though, so it's ok
- * in practice.
- */
- bdrv_drained_begin(bs);
+ assert(bs->quiesce_counter > 0);
if (bs_queue == NULL) {
bs_queue = g_new0(BlockReopenQueue, 1);
@@ -4522,6 +4517,12 @@ BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue *bs_queue,
QDict *options, bool keep_old_opts)
{
GLOBAL_STATE_CODE();
+
+ if (bs_queue == NULL) {
+ /* Paired with bdrv_drain_all_end() in bdrv_reopen_queue_free(). */
+ bdrv_drain_all_begin();
+ }
+
GRAPH_RDLOCK_GUARD_MAINLOOP();
return bdrv_reopen_queue_child(bs_queue, bs, options, NULL, 0, false,
@@ -4534,12 +4535,14 @@ void bdrv_reopen_queue_free(BlockReopenQueue *bs_queue)
if (bs_queue) {
BlockReopenQueueEntry *bs_entry, *next;
QTAILQ_FOREACH_SAFE(bs_entry, bs_queue, entry, next) {
- bdrv_drained_end(bs_entry->state.bs);
qobject_unref(bs_entry->state.explicit_options);
qobject_unref(bs_entry->state.options);
g_free(bs_entry);
}
g_free(bs_queue);
+
+ /* Paired with bdrv_drain_all_begin() in bdrv_reopen_queue(). */
+ bdrv_drain_all_end();
}
}
--
2.39.5
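The resulting caller contract, as a minimal sketch (error handling and the
actual reopen elided; bs and options stand for whatever the caller passes):

    BlockReopenQueue *queue;

    /* The first call, with bs_queue == NULL, begins a drain-all section. */
    queue = bdrv_reopen_queue(NULL, bs, options, true);

    /* ... prepare and commit the reopen ... */

    /* Frees the queue and ends the drain-all section begun above. */
    bdrv_reopen_queue_free(queue);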
* Re: [PATCH v2 02/24] block: move drain outside of read-locked bdrv_reopen_queue_child()
From: Kevin Wolf @ 2025-05-21 9:45 UTC
To: Fiona Ebner
Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
On 20.05.2025 at 12:29, Fiona Ebner wrote:
> This is in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED.
>
> More granular draining is not trivially possible, because
> bdrv_reopen_queue_child() can recursively call itself.
>
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
* [PATCH v2 03/24] block/snapshot: move drain outside of read-locked bdrv_snapshot_delete()
From: Fiona Ebner @ 2025-05-20 10:29 UTC
To: qemu-block
Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
This is in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED.
More granular draining is not trivially possible, because
bdrv_snapshot_delete() can recursively call itself.
The return value of bdrv_all_delete_snapshot() changes from -1 to
-errno propagated from failed sub-calls. This is fine for the existing
callers of bdrv_all_delete_snapshot().
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
Changes in v2:
* Use 'must be drained' instead of 'needs to be drained'.
* Use goto for cleanup/error handling in bdrv_all_delete_snapshot().
* Don't use atomics to access bs->quiesce_counter.
block/snapshot.c | 26 +++++++++++++++-----------
blockdev.c | 25 +++++++++++++++++--------
qemu-img.c | 2 ++
3 files changed, 34 insertions(+), 19 deletions(-)
diff --git a/block/snapshot.c b/block/snapshot.c
index 22567f1fb9..9f300a78bd 100644
--- a/block/snapshot.c
+++ b/block/snapshot.c
@@ -327,7 +327,7 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
/**
* Delete an internal snapshot by @snapshot_id and @name.
- * @bs: block device used in the operation
+ * @bs: block device used in the operation, must be drained
* @snapshot_id: unique snapshot ID, or NULL
* @name: snapshot name, or NULL
* @errp: location to store error
@@ -358,6 +358,8 @@ int bdrv_snapshot_delete(BlockDriverState *bs,
GLOBAL_STATE_CODE();
+ assert(bs->quiesce_counter > 0);
+
if (!drv) {
error_setg(errp, "Device '%s' has no medium",
bdrv_get_device_name(bs));
@@ -368,9 +370,6 @@ int bdrv_snapshot_delete(BlockDriverState *bs,
return -EINVAL;
}
- /* drain all pending i/o before deleting snapshot */
- bdrv_drained_begin(bs);
-
if (drv->bdrv_snapshot_delete) {
ret = drv->bdrv_snapshot_delete(bs, snapshot_id, name, errp);
} else if (fallback_bs) {
@@ -382,7 +381,6 @@ int bdrv_snapshot_delete(BlockDriverState *bs,
ret = -ENOTSUP;
}
- bdrv_drained_end(bs);
return ret;
}
@@ -571,19 +569,22 @@ int bdrv_all_delete_snapshot(const char *name,
ERRP_GUARD();
g_autoptr(GList) bdrvs = NULL;
GList *iterbdrvs;
+ int ret = 0;
GLOBAL_STATE_CODE();
- GRAPH_RDLOCK_GUARD_MAINLOOP();
- if (bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp) < 0) {
- return -1;
+ bdrv_drain_all_begin();
+ bdrv_graph_rdlock_main_loop();
+
+ ret = bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp);
+ if (ret < 0) {
+ goto out;
}
iterbdrvs = bdrvs;
while (iterbdrvs) {
BlockDriverState *bs = iterbdrvs->data;
QEMUSnapshotInfo sn1, *snapshot = &sn1;
- int ret = 0;
if ((devices || bdrv_all_snapshots_includes_bs(bs)) &&
bdrv_snapshot_find(bs, snapshot, name) >= 0)
@@ -594,13 +595,16 @@ int bdrv_all_delete_snapshot(const char *name,
if (ret < 0) {
error_prepend(errp, "Could not delete snapshot '%s' on '%s': ",
name, bdrv_get_device_or_node_name(bs));
- return -1;
+ goto out;
}
iterbdrvs = iterbdrvs->next;
}
- return 0;
+out:
+ bdrv_graph_rdunlock_main_loop();
+ bdrv_drain_all_end();
+ return ret;
}
diff --git a/blockdev.c b/blockdev.c
index 21443b4514..3982f9776b 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1132,39 +1132,41 @@ SnapshotInfo *qmp_blockdev_snapshot_delete_internal_sync(const char *device,
int ret;
GLOBAL_STATE_CODE();
- GRAPH_RDLOCK_GUARD_MAINLOOP();
+
+ bdrv_drain_all_begin();
+ bdrv_graph_rdlock_main_loop();
bs = qmp_get_root_bs(device, errp);
if (!bs) {
- return NULL;
+ goto error;
}
if (!id && !name) {
error_setg(errp, "Name or id must be provided");
- return NULL;
+ goto error;
}
if (bdrv_op_is_blocked(bs, BLOCK_OP_TYPE_INTERNAL_SNAPSHOT_DELETE, errp)) {
- return NULL;
+ goto error;
}
ret = bdrv_snapshot_find_by_id_and_name(bs, id, name, &sn, &local_err);
if (local_err) {
error_propagate(errp, local_err);
- return NULL;
+ goto error;
}
if (!ret) {
error_setg(errp,
"Snapshot with id '%s' and name '%s' does not exist on "
"device '%s'",
STR_OR_NULL(id), STR_OR_NULL(name), device);
- return NULL;
+ goto error;
}
bdrv_snapshot_delete(bs, id, name, &local_err);
if (local_err) {
error_propagate(errp, local_err);
- return NULL;
+ goto error;
}
info = g_new0(SnapshotInfo, 1);
@@ -1180,6 +1182,9 @@ SnapshotInfo *qmp_blockdev_snapshot_delete_internal_sync(const char *device,
info->has_icount = true;
}
+error:
+ bdrv_graph_rdunlock_main_loop();
+ bdrv_drain_all_end();
return info;
}
@@ -1295,12 +1300,14 @@ static void internal_snapshot_abort(void *opaque)
Error *local_error = NULL;
GLOBAL_STATE_CODE();
- GRAPH_RDLOCK_GUARD_MAINLOOP();
if (!state->created) {
return;
}
+ bdrv_drain_all_begin();
+ bdrv_graph_rdlock_main_loop();
+
if (bdrv_snapshot_delete(bs, sn->id_str, sn->name, &local_error) < 0) {
error_reportf_err(local_error,
"Failed to delete snapshot with id '%s' and "
@@ -1308,6 +1315,8 @@ static void internal_snapshot_abort(void *opaque)
sn->id_str, sn->name,
bdrv_get_device_name(bs));
}
+ bdrv_graph_rdunlock_main_loop();
+ bdrv_drain_all_end();
}
static void internal_snapshot_clean(void *opaque)
diff --git a/qemu-img.c b/qemu-img.c
index 76ac5d3028..e81b0fbb6c 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3505,6 +3505,7 @@ static int img_snapshot(int argc, char **argv)
break;
case SNAPSHOT_DELETE:
+ bdrv_drain_all_begin();
bdrv_graph_rdlock_main_loop();
ret = bdrv_snapshot_find(bs, &sn, snapshot_name);
if (ret < 0) {
@@ -3520,6 +3521,7 @@ static int img_snapshot(int argc, char **argv)
}
}
bdrv_graph_rdunlock_main_loop();
+ bdrv_drain_all_end();
break;
}
--
2.39.5
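The calling convention for bdrv_snapshot_delete() after this patch, as a
minimal sketch mirroring the qemu-img.c hunk above (error handling elided):

    bdrv_drain_all_begin();
    bdrv_graph_rdlock_main_loop();

    /* bs is now drained, satisfying the new assertion. */
    ret = bdrv_snapshot_delete(bs, snapshot_id, name, errp);

    bdrv_graph_rdunlock_main_loop();
    bdrv_drain_all_end();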
* Re: [PATCH v2 03/24] block/snapshot: move drain outside of read-locked bdrv_snapshot_delete()
From: Kevin Wolf @ 2025-05-21 9:53 UTC
To: Fiona Ebner
Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
On 20.05.2025 at 12:29, Fiona Ebner wrote:
> This is in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED.
>
> More granular draining is not trivially possible, because
> bdrv_snapshot_delete() can recursively call itself.
>
> The return value of bdrv_all_delete_snapshot() changes from -1 to
> -errno propagated from failed sub-calls. This is fine for the existing
> callers of bdrv_all_delete_snapshot().
>
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
>
> Changes in v2:
> * Use 'must be drained' instead of 'needs to be drained'.
> * Use goto for cleanup/error handling in bdrv_all_delete_snapshot().
> * Don't use atomics to access bs->quiesce_counter.
>
> block/snapshot.c | 26 +++++++++++++++-----------
> blockdev.c | 25 +++++++++++++++++--------
> qemu-img.c | 2 ++
> 3 files changed, 34 insertions(+), 19 deletions(-)
> @@ -571,19 +569,22 @@ int bdrv_all_delete_snapshot(const char *name,
> ERRP_GUARD();
> g_autoptr(GList) bdrvs = NULL;
> GList *iterbdrvs;
> + int ret = 0;
The initialisation here is technically not needed...
> GLOBAL_STATE_CODE();
> - GRAPH_RDLOCK_GUARD_MAINLOOP();
>
> - if (bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp) < 0) {
> - return -1;
> + bdrv_drain_all_begin();
> + bdrv_graph_rdlock_main_loop();
> +
> + ret = bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp);
> + if (ret < 0) {
> + goto out;
> }
...because ret is assigned here before anyone reads it. But it doesn't
hurt either, so:
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
* Re: [PATCH v2 03/24] block/snapshot: move drain outside of read-locked bdrv_snapshot_delete()
From: Andrey Drobyshev @ 2025-05-23 18:12 UTC
To: Fiona Ebner, qemu-block
Cc: qemu-devel, kwolf, den, hreitz, stefanha, eblake, jsnow,
vsementsov, xiechanglong.d, wencongyang2, berto, fam, ari
On 5/20/25 1:29 PM, Fiona Ebner wrote:
> This is in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED.
>
> More granular draining is not trivially possible, because
> bdrv_snapshot_delete() can recursively call itself.
>
> The return value of bdrv_all_delete_snapshot() changes from -1 to
> -errno propagated from failed sub-calls. This is fine for the existing
> callers of bdrv_all_delete_snapshot().
>
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
>
> Changes in v2:
> * Use 'must be drained' instead of 'needs to be drained'.
> * Use goto for cleanup/error handling in bdrv_all_delete_snapshot().
> * Don't use atomics to access bs->quiesce_counter.
>
> block/snapshot.c | 26 +++++++++++++++-----------
> blockdev.c | 25 +++++++++++++++++--------
> qemu-img.c | 2 ++
> 3 files changed, 34 insertions(+), 19 deletions(-)
>
> diff --git a/block/snapshot.c b/block/snapshot.c
> index 22567f1fb9..9f300a78bd 100644
> --- a/block/snapshot.c
> +++ b/block/snapshot.c
> @@ -327,7 +327,7 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
>
> /**
> * Delete an internal snapshot by @snapshot_id and @name.
> - * @bs: block device used in the operation
> + * @bs: block device used in the operation, must be drained
> * @snapshot_id: unique snapshot ID, or NULL
> * @name: snapshot name, or NULL
> * @errp: location to store error
> @@ -358,6 +358,8 @@ int bdrv_snapshot_delete(BlockDriverState *bs,
>
> GLOBAL_STATE_CODE();
>
> + assert(bs->quiesce_counter > 0);
> +
> if (!drv) {
> error_setg(errp, "Device '%s' has no medium",
> bdrv_get_device_name(bs));
> @@ -368,9 +370,6 @@ int bdrv_snapshot_delete(BlockDriverState *bs,
> return -EINVAL;
> }
>
> - /* drain all pending i/o before deleting snapshot */
> - bdrv_drained_begin(bs);
> -
> if (drv->bdrv_snapshot_delete) {
> ret = drv->bdrv_snapshot_delete(bs, snapshot_id, name, errp);
> } else if (fallback_bs) {
> @@ -382,7 +381,6 @@ int bdrv_snapshot_delete(BlockDriverState *bs,
> ret = -ENOTSUP;
> }
>
> - bdrv_drained_end(bs);
> return ret;
> }
>
> @@ -571,19 +569,22 @@ int bdrv_all_delete_snapshot(const char *name,
> ERRP_GUARD();
> g_autoptr(GList) bdrvs = NULL;
> GList *iterbdrvs;
> + int ret = 0;
>
> GLOBAL_STATE_CODE();
> - GRAPH_RDLOCK_GUARD_MAINLOOP();
>
> - if (bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp) < 0) {
> - return -1;
> + bdrv_drain_all_begin();
> + bdrv_graph_rdlock_main_loop();
> +
> + ret = bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp);
> + if (ret < 0) {
> + goto out;
> }
>
> iterbdrvs = bdrvs;
> while (iterbdrvs) {
> BlockDriverState *bs = iterbdrvs->data;
> QEMUSnapshotInfo sn1, *snapshot = &sn1;
> - int ret = 0;
>
> if ((devices || bdrv_all_snapshots_includes_bs(bs)) &&
> bdrv_snapshot_find(bs, snapshot, name) >= 0)
> @@ -594,13 +595,16 @@ int bdrv_all_delete_snapshot(const char *name,
> if (ret < 0) {
> error_prepend(errp, "Could not delete snapshot '%s' on '%s': ",
> name, bdrv_get_device_or_node_name(bs));
> - return -1;
> + goto out;
> }
>
> iterbdrvs = iterbdrvs->next;
> }
>
> - return 0;
> +out:
> + bdrv_graph_rdunlock_main_loop();
> + bdrv_drain_all_end();
> + return ret;
> }
>
>
> diff --git a/blockdev.c b/blockdev.c
> index 21443b4514..3982f9776b 100644
> --- a/blockdev.c
> +++ b/blockdev.c
> @@ -1132,39 +1132,41 @@ SnapshotInfo *qmp_blockdev_snapshot_delete_internal_sync(const char *device,
> int ret;
>
> GLOBAL_STATE_CODE();
> - GRAPH_RDLOCK_GUARD_MAINLOOP();
> +
> + bdrv_drain_all_begin();
> + bdrv_graph_rdlock_main_loop();
>
> bs = qmp_get_root_bs(device, errp);
> if (!bs) {
> - return NULL;
> + goto error;
> }
>
> if (!id && !name) {
> error_setg(errp, "Name or id must be provided");
> - return NULL;
> + goto error;
> }
>
> if (bdrv_op_is_blocked(bs, BLOCK_OP_TYPE_INTERNAL_SNAPSHOT_DELETE, errp)) {
> - return NULL;
> + goto error;
> }
>
> ret = bdrv_snapshot_find_by_id_and_name(bs, id, name, &sn, &local_err);
> if (local_err) {
> error_propagate(errp, local_err);
> - return NULL;
> + goto error;
> }
> if (!ret) {
> error_setg(errp,
> "Snapshot with id '%s' and name '%s' does not exist on "
> "device '%s'",
> STR_OR_NULL(id), STR_OR_NULL(name), device);
> - return NULL;
> + goto error;
> }
>
> bdrv_snapshot_delete(bs, id, name, &local_err);
> if (local_err) {
> error_propagate(errp, local_err);
> - return NULL;
> + goto error;
> }
>
> info = g_new0(SnapshotInfo, 1);
> @@ -1180,6 +1182,9 @@ SnapshotInfo *qmp_blockdev_snapshot_delete_internal_sync(const char *device,
> info->has_icount = true;
> }
>
> +error:
> + bdrv_graph_rdunlock_main_loop();
> + bdrv_drain_all_end();
> return info;
> }
>
> @@ -1295,12 +1300,14 @@ static void internal_snapshot_abort(void *opaque)
> Error *local_error = NULL;
>
> GLOBAL_STATE_CODE();
> - GRAPH_RDLOCK_GUARD_MAINLOOP();
>
> if (!state->created) {
> return;
> }
>
> + bdrv_drain_all_begin();
> + bdrv_graph_rdlock_main_loop();
> +
> if (bdrv_snapshot_delete(bs, sn->id_str, sn->name, &local_error) < 0) {
> error_reportf_err(local_error,
> "Failed to delete snapshot with id '%s' and "
> @@ -1308,6 +1315,8 @@ static void internal_snapshot_abort(void *opaque)
> sn->id_str, sn->name,
> bdrv_get_device_name(bs));
> }
> + bdrv_graph_rdunlock_main_loop();
> + bdrv_drain_all_end();
> }
>
Okay, I've got a very simple and naive question to ask. We've got this
pattern recurring throughout the series:
> GLOBAL_STATE_CODE();
> bdrv_drain_all_begin();
> bdrv_graph_rdlock_main_loop();
>
> ...
>
> bdrv_graph_rdunlock_main_loop();
> bdrv_drain_all_end();
bdrv_graph_rdlock_main_loop() doesn't actually take any locks; it
asserts that we're running in the main thread and not in a coroutine.
bdrv_graph_rdunlock_main_loop() does the same.
GRAPH_RDLOCK_GUARD_MAINLOOP() does both those calls, at the beginning of
a function and when leaving its scope, so essentially it also just does
assert(qemu_in_main_thread() && !qemu_in_coroutine()).
Therefore:
1. Is there any real benefit from using those
{rdlock/rdunlock}_main_loop() constructions, or are they here due to
historical reasons only?
2. Would it hurt if we only left GRAPH_RDLOCK_GUARD_MAINLOOP() in all
such occurrences? At least when it's obvious we can't get out of the
main thread. That would simply save us from performing the same checks
several times, similar to what's done in commit 22/24 ("block/io: remove
duplicate GLOBAL_STATE_CODE() in bdrv_do_drained_end()").
> static void internal_snapshot_clean(void *opaque)
> diff --git a/qemu-img.c b/qemu-img.c
> index 76ac5d3028..e81b0fbb6c 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -3505,6 +3505,7 @@ static int img_snapshot(int argc, char **argv)
> break;
>
> case SNAPSHOT_DELETE:
> + bdrv_drain_all_begin();
> bdrv_graph_rdlock_main_loop();
> ret = bdrv_snapshot_find(bs, &sn, snapshot_name);
> if (ret < 0) {
> @@ -3520,6 +3521,7 @@ static int img_snapshot(int argc, char **argv)
> }
> }
> bdrv_graph_rdunlock_main_loop();
> + bdrv_drain_all_end();
> break;
> }
>
* Re: [PATCH v2 03/24] block/snapshot: move drain outside of read-locked bdrv_snapshot_delete()
From: Kevin Wolf @ 2025-05-26 9:10 UTC
To: Andrey Drobyshev
Cc: Fiona Ebner, qemu-block, qemu-devel, den, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
On 23.05.2025 at 20:12, Andrey Drobyshev wrote:
> On 5/20/25 1:29 PM, Fiona Ebner wrote:
> > This is in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED.
> >
> > More granular draining is not trivially possible, because
> > bdrv_snapshot_delete() can recursively call itself.
> >
> > The return value of bdrv_all_delete_snapshot() changes from -1 to
> > -errno propagated from failed sub-calls. This is fine for the existing
> > callers of bdrv_all_delete_snapshot().
> >
> > Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> > ---
> >
> > Changes in v2:
> > * Use 'must be drained' instead of 'needs to be drained'.
> > * Use goto for cleanup/error handling in bdrv_all_delete_snapshot().
> > * Don't use atomics to access bs->quiesce_counter.
> >
> > block/snapshot.c | 26 +++++++++++++++-----------
> > blockdev.c | 25 +++++++++++++++++--------
> > qemu-img.c | 2 ++
> > 3 files changed, 34 insertions(+), 19 deletions(-)
> >
> > diff --git a/block/snapshot.c b/block/snapshot.c
> > index 22567f1fb9..9f300a78bd 100644
> > --- a/block/snapshot.c
> > +++ b/block/snapshot.c
> > @@ -327,7 +327,7 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
> >
> > /**
> > * Delete an internal snapshot by @snapshot_id and @name.
> > - * @bs: block device used in the operation
> > + * @bs: block device used in the operation, must be drained
> > * @snapshot_id: unique snapshot ID, or NULL
> > * @name: snapshot name, or NULL
> > * @errp: location to store error
> > @@ -358,6 +358,8 @@ int bdrv_snapshot_delete(BlockDriverState *bs,
> >
> > GLOBAL_STATE_CODE();
> >
> > + assert(bs->quiesce_counter > 0);
> > +
> > if (!drv) {
> > error_setg(errp, "Device '%s' has no medium",
> > bdrv_get_device_name(bs));
> > @@ -368,9 +370,6 @@ int bdrv_snapshot_delete(BlockDriverState *bs,
> > return -EINVAL;
> > }
> >
> > - /* drain all pending i/o before deleting snapshot */
> > - bdrv_drained_begin(bs);
> > -
> > if (drv->bdrv_snapshot_delete) {
> > ret = drv->bdrv_snapshot_delete(bs, snapshot_id, name, errp);
> > } else if (fallback_bs) {
> > @@ -382,7 +381,6 @@ int bdrv_snapshot_delete(BlockDriverState *bs,
> > ret = -ENOTSUP;
> > }
> >
> > - bdrv_drained_end(bs);
> > return ret;
> > }
> >
> > @@ -571,19 +569,22 @@ int bdrv_all_delete_snapshot(const char *name,
> > ERRP_GUARD();
> > g_autoptr(GList) bdrvs = NULL;
> > GList *iterbdrvs;
> > + int ret = 0;
> >
> > GLOBAL_STATE_CODE();
> > - GRAPH_RDLOCK_GUARD_MAINLOOP();
> >
> > - if (bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp) < 0) {
> > - return -1;
> > + bdrv_drain_all_begin();
> > + bdrv_graph_rdlock_main_loop();
> > +
> > + ret = bdrv_all_get_snapshot_devices(has_devices, devices, &bdrvs, errp);
> > + if (ret < 0) {
> > + goto out;
> > }
> >
> > iterbdrvs = bdrvs;
> > while (iterbdrvs) {
> > BlockDriverState *bs = iterbdrvs->data;
> > QEMUSnapshotInfo sn1, *snapshot = &sn1;
> > - int ret = 0;
> >
> > if ((devices || bdrv_all_snapshots_includes_bs(bs)) &&
> > bdrv_snapshot_find(bs, snapshot, name) >= 0)
> > @@ -594,13 +595,16 @@ int bdrv_all_delete_snapshot(const char *name,
> > if (ret < 0) {
> > error_prepend(errp, "Could not delete snapshot '%s' on '%s': ",
> > name, bdrv_get_device_or_node_name(bs));
> > - return -1;
> > + goto out;
> > }
> >
> > iterbdrvs = iterbdrvs->next;
> > }
> >
> > - return 0;
> > +out:
> > + bdrv_graph_rdunlock_main_loop();
> > + bdrv_drain_all_end();
> > + return ret;
> > }
> >
> >
> > diff --git a/blockdev.c b/blockdev.c
> > index 21443b4514..3982f9776b 100644
> > --- a/blockdev.c
> > +++ b/blockdev.c
> > @@ -1132,39 +1132,41 @@ SnapshotInfo *qmp_blockdev_snapshot_delete_internal_sync(const char *device,
> > int ret;
> >
> > GLOBAL_STATE_CODE();
> > - GRAPH_RDLOCK_GUARD_MAINLOOP();
> > +
> > + bdrv_drain_all_begin();
> > + bdrv_graph_rdlock_main_loop();
> >
> > bs = qmp_get_root_bs(device, errp);
> > if (!bs) {
> > - return NULL;
> > + goto error;
> > }
> >
> > if (!id && !name) {
> > error_setg(errp, "Name or id must be provided");
> > - return NULL;
> > + goto error;
> > }
> >
> > if (bdrv_op_is_blocked(bs, BLOCK_OP_TYPE_INTERNAL_SNAPSHOT_DELETE, errp)) {
> > - return NULL;
> > + goto error;
> > }
> >
> > ret = bdrv_snapshot_find_by_id_and_name(bs, id, name, &sn, &local_err);
> > if (local_err) {
> > error_propagate(errp, local_err);
> > - return NULL;
> > + goto error;
> > }
> > if (!ret) {
> > error_setg(errp,
> > "Snapshot with id '%s' and name '%s' does not exist on "
> > "device '%s'",
> > STR_OR_NULL(id), STR_OR_NULL(name), device);
> > - return NULL;
> > + goto error;
> > }
> >
> > bdrv_snapshot_delete(bs, id, name, &local_err);
> > if (local_err) {
> > error_propagate(errp, local_err);
> > - return NULL;
> > + goto error;
> > }
> >
> > info = g_new0(SnapshotInfo, 1);
> > @@ -1180,6 +1182,9 @@ SnapshotInfo *qmp_blockdev_snapshot_delete_internal_sync(const char *device,
> > info->has_icount = true;
> > }
> >
> > +error:
> > + bdrv_graph_rdunlock_main_loop();
> > + bdrv_drain_all_end();
> > return info;
> > }
> >
> > @@ -1295,12 +1300,14 @@ static void internal_snapshot_abort(void *opaque)
> > Error *local_error = NULL;
> >
> > GLOBAL_STATE_CODE();
> > - GRAPH_RDLOCK_GUARD_MAINLOOP();
> >
> > if (!state->created) {
> > return;
> > }
> >
> > + bdrv_drain_all_begin();
> > + bdrv_graph_rdlock_main_loop();
> > +
> > if (bdrv_snapshot_delete(bs, sn->id_str, sn->name, &local_error) < 0) {
> > error_reportf_err(local_error,
> > "Failed to delete snapshot with id '%s' and "
> > @@ -1308,6 +1315,8 @@ static void internal_snapshot_abort(void *opaque)
> > sn->id_str, sn->name,
> > bdrv_get_device_name(bs));
> > }
> > + bdrv_graph_rdunlock_main_loop();
> > + bdrv_drain_all_end();
> > }
> >
>
> Okay, I've got a very simple and naive question to ask. We've got this
> pattern recurring throughout the series:
>
> > GLOBAL_STATE_CODE();
> > bdrv_drain_all_begin();
> > bdrv_graph_rdlock_main_loop();
> >
> > ...
> >
> > bdrv_graph_rdunlock_main_loop();
> > bdrv_drain_all_end();
>
> bdrv_graph_rdlock_main_loop() doesn't actually take any locks; it
> asserts that we're running in the main thread and not in a coroutine.
> bdrv_graph_rdunlock_main_loop() does the same.
> GRAPH_RDLOCK_GUARD_MAINLOOP() does both those calls, at the beginning of
> a function and when leaving its scope, so essentially it also just does
> assert(qemu_in_main_thread() && !qemu_in_coroutine()).
>
> Therefore:
>
> 1. Is there any real benefit from using those
> {rdlock/rdunlock}_main_loop() constructions, or are they here due to
> historical reasons only?
It's the price we pay for the compiler to verify our locking rules.
> 2. Would it hurt if we only left GRAPH_RDLOCK_GUARD_MAINLOOP() in all
> such occurrences? At least when it's obvious we can't get out of the
> main thread. That would simply save us from performing the same checks
> several times, similar to what's done in commit 22/24 ("block/io: remove
> duplicate GLOBAL_STATE_CODE() in bdrv_do_drained_end()").
Once bdrv_drain_all_begin() is marked GRAPH_UNLOCKED, calling it after
GRAPH_RDLOCK_GUARD_MAINLOOP() would be wrong according to TSA rules
(which don't know anything about this being only a fake lock) and the
build would fail.
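Roughly, assuming the planned annotation, TSA would then reject a sketch
like the following (f() is hypothetical):

    void GRAPH_UNLOCKED bdrv_drain_all_begin(void);

    void f(void)
    {
        GRAPH_RDLOCK_GUARD_MAINLOOP(); /* TSA: reader lock held from here */
        bdrv_drain_all_begin();        /* TSA error: requires the graph
                                        * lock to not be held */
    }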
Kevin
* Re: [PATCH v2 03/24] block/snapshot: move drain outside of read-locked bdrv_snapshot_delete()
From: Fiona Ebner @ 2025-05-26 10:33 UTC
To: Kevin Wolf, Andrey Drobyshev
Cc: qemu-block, qemu-devel, den, hreitz, stefanha, eblake, jsnow,
vsementsov, xiechanglong.d, wencongyang2, berto, fam, ari
On 26.05.25 at 11:10, Kevin Wolf wrote:
> On 23.05.2025 at 20:12, Andrey Drobyshev wrote:
>> Okay, I've got a very simple and naive question to ask. We've got this
>> pattern recurring throughout the series:
>>
>>> GLOBAL_STATE_CODE();
>>> bdrv_drain_all_begin();
>>> bdrv_graph_rdlock_main_loop();
>>>
>>> ...
>>>
>>> bdrv_graph_rdunlock_main_loop();
>>> bdrv_drain_all_end();
>>
>> bdrv_graph_rdlock_main_loop() doesn't actually take any locks; it
>> asserts that we're running in the main thread and not in a coroutine.
>> bdrv_graph_rdunlock_main_loop() does the same.
>> GRAPH_RDLOCK_GUARD_MAINLOOP() does both those calls, at the beginning of
>> a function and when leaving its scope, so essentially it also just does
>> assert(qemu_in_main_thread() && !qemu_in_coroutine()).
>>
>> Therefore:
>>
>> 1. Is there any real benefit from using those
>> {rdlock/rdunlock}_main_loop() constructions, or are they here due to
>> historical reasons only?
>
> It's the price we pay for the compiler to verify our locking rules.
>
>> 2. Would it hurt if we only left GRAPH_RDLOCK_GUARD_MAINLOOP() in all
>> such occurrences? At least when it's obvious we can't get out of the
>> main thread. That would simply save us from performing the same checks
>> several times, similar to what's done in commit 22/24 ("block/io: remove
>> duplicate GLOBAL_STATE_CODE() in bdrv_do_drained_end()").
>
> Once bdrv_drain_all_begin() is marked GRAPH_UNLOCKED, calling it after
> GRAPH_RDLOCK_GUARD_MAINLOOP() would be wrong according to TSA rules
> (which don't know anything about this being only a fake lock) and the
> build would fail.
Note that I did not mark bdrv_drain_all_begin() as GRAPH_UNLOCKED in the
series yet. The reason is that I wasn't fully sure if that is okay,
given that it can also be called from a coroutine and then does
bdrv_co_yield_to_drain(). But I suppose that doesn't do anything
with the graph lock, so I'll add the GRAPH_UNLOCKED marker in v3.

I don't see any callers that actually are in coroutine context, except
test_quiesce_co_drain_all() and test_drv_cb_co_drain_all() in
tests/unit/test-bdrv-drain.c.

Adding

    GLOBAL_STATE_CODE();
    assert(!qemu_in_coroutine());

to the beginning of the function seems not to cause any test failures
except for the two unit tests already mentioned.
Best Regards,
Fiona
* Re: [PATCH v2 03/24] block/snapshot: move drain outside of read-locked bdrv_snapshot_delete()
From: Kevin Wolf @ 2025-05-26 11:42 UTC
To: Fiona Ebner
Cc: Andrey Drobyshev, qemu-block, qemu-devel, den, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
On 26.05.2025 at 12:33, Fiona Ebner wrote:
> On 26.05.25 at 11:10, Kevin Wolf wrote:
> > On 23.05.2025 at 20:12, Andrey Drobyshev wrote:
> >> Okay, I've got a very simple and naive question to ask. We've got this
> >> pattern recurring throughout the series:
> >>
> >>> GLOBAL_STATE_CODE();
> >>> bdrv_drain_all_begin();
> >>> bdrv_graph_rdlock_main_loop();
> >>>
> >>> ...
> >>>
> >>> bdrv_graph_rdunlock_main_loop();
> >>> bdrv_drain_all_end();
> >>
> >> bdrv_graph_rdlock_main_loop() doesn't actually take any locks; it
> >> asserts that we're running in the main thread and not in a coroutine.
> >> bdrv_graph_rdunlock_main_loop() does the same.
> >> GRAPH_RDLOCK_GUARD_MAINLOOP() does both those calls, at the beginning of
> >> a function and when leaving its scope, so essentially it also just does
> >> assert(qemu_in_main_thread() && !qemu_in_coroutine()).
> >>
> >> Therefore:
> >>
> >> 1. Is there any real benefit from using those
> >> {rdlock/rdunlock}_main_loop() constructions, or are they here due to
> >> historical reasons only?
> >
> > It's the price we pay for the compiler to verify our locking rules.
> >
> >> 2. Would it hurt if we only left GRAPH_RDLOCK_GUARD_MAINLOOP() in all
> >> such occurrences? At least when it's obvious we can't get out of the
> >> main thread. That would simply save us from performing the same checks
> >> several times, similar to what's done in commit 22/24 ("block/io: remove
> >> duplicate GLOBAL_STATE_CODE() in bdrv_do_drained_end()").
> >
> > Once bdrv_drain_all_begin() is marked GRAPH_UNLOCKED, calling it after
> > GRAPH_RDLOCK_GUARD_MAINLOOP() would be wrong according to TSA rules
> > (which don't know anything about this being only a fake lock) and the
> > build would fail.
>
> Note that I did not mark bdrv_drain_all_begin() as GRAPH_UNLOCKED in the
> series yet. The reason is that I wasn't fully sure if that is okay,
> given that it can also be called from a coroutine and then does
> bdrv_co_yield_to_drain(). But I suppose that doesn't do anything
> with the graph lock, so I'll add the GRAPH_UNLOCKED marker in v3.
I think it's still GRAPH_UNLOCKED even when called from a coroutine,
because otherwise the polling could wait for the calling coroutine to
make progress and release the lock, resulting in a deadlock.
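Sketched with a hypothetical coroutine, for illustration only:

    static void coroutine_fn example_co(void *opaque)
    {
        GRAPH_RDLOCK_GUARD();   /* reader lock stays held across the yield */
        /*
         * Yields via bdrv_co_yield_to_drain(); draining then polls until
         * in-flight requests are done, but completing them may require
         * the lock still held by this coroutine -> deadlock.
         */
        bdrv_drain_all_begin();
    }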
Kevin
* [PATCH v2 04/24] block: move drain outside of read-locked bdrv_inactivate_recurse()
From: Fiona Ebner @ 2025-05-20 10:29 UTC
To: qemu-block
Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
This is in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED.
More granular draining is not trivially possible, because
bdrv_inactivate_recurse() can recursively call itself.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
Changes in v2:
* Don't use atomics to access bs->quiesce_counter.
* Use goto for error handling/cleanup in bdrv_inactivate().
block.c | 25 ++++++++++++++++++-------
1 file changed, 18 insertions(+), 7 deletions(-)
diff --git a/block.c b/block.c
index 3065d5c91e..fa55dfba68 100644
--- a/block.c
+++ b/block.c
@@ -6989,6 +6989,8 @@ bdrv_inactivate_recurse(BlockDriverState *bs, bool top_level)
GLOBAL_STATE_CODE();
+ assert(bs->quiesce_counter > 0);
+
if (!bs->drv) {
return -ENOMEDIUM;
}
@@ -7032,9 +7034,7 @@ bdrv_inactivate_recurse(BlockDriverState *bs, bool top_level)
return -EPERM;
}
- bdrv_drained_begin(bs);
bs->open_flags |= BDRV_O_INACTIVE;
- bdrv_drained_end(bs);
/*
* Update permissions, they may differ for inactive nodes.
@@ -7059,20 +7059,26 @@ int bdrv_inactivate(BlockDriverState *bs, Error **errp)
int ret;
GLOBAL_STATE_CODE();
- GRAPH_RDLOCK_GUARD_MAINLOOP();
+
+ bdrv_drain_all_begin();
+ bdrv_graph_rdlock_main_loop();
if (bdrv_has_bds_parent(bs, true)) {
error_setg(errp, "Node has active parent node");
- return -EPERM;
+ ret = -EPERM;
+ goto out;
}
ret = bdrv_inactivate_recurse(bs, true);
if (ret < 0) {
error_setg_errno(errp, -ret, "Failed to inactivate node");
- return ret;
+ goto out;
}
- return 0;
+out:
+ bdrv_graph_rdunlock_main_loop();
+ bdrv_drain_all_end();
+ return ret;
}
int bdrv_inactivate_all(void)
@@ -7082,7 +7088,9 @@ int bdrv_inactivate_all(void)
int ret = 0;
GLOBAL_STATE_CODE();
- GRAPH_RDLOCK_GUARD_MAINLOOP();
+
+ bdrv_drain_all_begin();
+ bdrv_graph_rdlock_main_loop();
for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) {
/* Nodes with BDS parents are covered by recursion from the last
@@ -7098,6 +7106,9 @@ int bdrv_inactivate_all(void)
}
}
+ bdrv_graph_rdunlock_main_loop();
+ bdrv_drain_all_end();
+
return ret;
}
--
2.39.5
* Re: [PATCH v2 04/24] block: move drain outside of read-locked bdrv_inactivate_recurse()
From: Kevin Wolf @ 2025-05-21 9:56 UTC
To: Fiona Ebner
Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
On 20.05.2025 at 12:29, Fiona Ebner wrote:
> This is in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED.
>
> More granular draining is not trivially possible, because
> bdrv_inactivate_recurse() can recursively call itself.
>
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
* [PATCH v2 05/24] block: mark bdrv_parent_change_aio_context() GRAPH_RDLOCK
From: Fiona Ebner @ 2025-05-20 10:29 UTC
To: qemu-block
Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
This is a small step in preparation to mark bdrv_drained_begin() as
GRAPH_UNLOCKED. More concretely, it allows marking the
change_aio_ctx() callback GRAPH_RDLOCK_PTR, which is the next step.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
New in v2.
block.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/block.c b/block.c
index fa55dfba68..7207978e53 100644
--- a/block.c
+++ b/block.c
@@ -7575,10 +7575,10 @@ typedef struct BdrvStateSetAioContext {
BlockDriverState *bs;
} BdrvStateSetAioContext;
-static bool bdrv_parent_change_aio_context(BdrvChild *c, AioContext *ctx,
- GHashTable *visited,
- Transaction *tran,
- Error **errp)
+static bool GRAPH_RDLOCK
+bdrv_parent_change_aio_context(BdrvChild *c, AioContext *ctx,
+ GHashTable *visited, Transaction *tran,
+ Error **errp)
{
GLOBAL_STATE_CODE();
if (g_hash_table_contains(visited, c)) {
--
2.39.5
* Re: [PATCH v2 05/24] block: mark bdrv_parent_change_aio_context() GRAPH_RDLOCK
From: Kevin Wolf @ 2025-05-21 10:04 UTC
To: Fiona Ebner
Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
On 20.05.2025 at 12:29, Fiona Ebner wrote:
> This is a small step in preparation to mark bdrv_drained_begin() as
> GRAPH_UNLOCKED. More concretely, it allows marking the
> change_aio_ctx() callback GRAPH_RDLOCK_PTR, which is the next step.
>
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
* [PATCH v2 06/24] block: mark change_aio_ctx() callback and instances as GRAPH_RDLOCK(_PTR)
From: Fiona Ebner @ 2025-05-20 10:29 UTC
To: qemu-block
Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
This is a small step in preparation to mark bdrv_drained_begin() as
GRAPH_UNLOCKED. More concretely, it is in preparatoin to move the
drain out of bdrv_change_aio_context() and marking that function as
GRAPH_RDLOCK.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
New in v2.
Checkpatch seems to report a false positive here, but it's the same
for other callbacks in that header file:
ERROR: space prohibited between function name and open parenthesis '('
#86: FILE: include/block/block_int-common.h:986:
+ bool GRAPH_RDLOCK_PTR (*change_aio_ctx)(BdrvChild *child, AioContext *ctx,
block.c | 7 ++++---
block/block-backend.c | 6 +++---
blockjob.c | 6 +++---
include/block/block_int-common.h | 6 +++---
4 files changed, 13 insertions(+), 12 deletions(-)
diff --git a/block.c b/block.c
index 7207978e53..01144c895e 100644
--- a/block.c
+++ b/block.c
@@ -1226,9 +1226,10 @@ static int bdrv_child_cb_inactivate(BdrvChild *child)
return 0;
}
-static bool bdrv_child_cb_change_aio_ctx(BdrvChild *child, AioContext *ctx,
- GHashTable *visited, Transaction *tran,
- Error **errp)
+static bool GRAPH_RDLOCK
+bdrv_child_cb_change_aio_ctx(BdrvChild *child, AioContext *ctx,
+ GHashTable *visited, Transaction *tran,
+ Error **errp)
{
BlockDriverState *bs = child->opaque;
return bdrv_change_aio_context(bs, ctx, visited, tran, errp);
diff --git a/block/block-backend.c b/block/block-backend.c
index a402db13f2..6a6949edeb 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -136,9 +136,9 @@ static void blk_root_drained_end(BdrvChild *child);
static void blk_root_change_media(BdrvChild *child, bool load);
static void blk_root_resize(BdrvChild *child);
-static bool blk_root_change_aio_ctx(BdrvChild *child, AioContext *ctx,
- GHashTable *visited, Transaction *tran,
- Error **errp);
+static bool GRAPH_RDLOCK
+blk_root_change_aio_ctx(BdrvChild *child, AioContext *ctx, GHashTable *visited,
+ Transaction *tran, Error **errp);
static char *blk_root_get_parent_desc(BdrvChild *child)
{
diff --git a/blockjob.c b/blockjob.c
index 32007f31a9..34185d7715 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -144,9 +144,9 @@ static TransactionActionDrv change_child_job_context = {
.clean = g_free,
};
-static bool child_job_change_aio_ctx(BdrvChild *c, AioContext *ctx,
- GHashTable *visited, Transaction *tran,
- Error **errp)
+static bool GRAPH_RDLOCK
+child_job_change_aio_ctx(BdrvChild *c, AioContext *ctx, GHashTable *visited,
+ Transaction *tran, Error **errp)
{
BlockJob *job = c->opaque;
BdrvStateChildJobContext *s;
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 2982dd3118..37466c7841 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -983,9 +983,9 @@ struct BdrvChildClass {
bool backing_mask_protocol,
Error **errp);
- bool (*change_aio_ctx)(BdrvChild *child, AioContext *ctx,
- GHashTable *visited, Transaction *tran,
- Error **errp);
+ bool GRAPH_RDLOCK_PTR (*change_aio_ctx)(BdrvChild *child, AioContext *ctx,
+ GHashTable *visited,
+ Transaction *tran, Error **errp);
/*
* I/O API functions. These functions are thread-safe.
--
2.39.5
* Re: [PATCH v2 06/24] block: mark change_aio_ctx() callback and instances as GRAPH_RDLOCK(_PTR)
From: Eric Blake @ 2025-05-20 15:47 UTC
To: Fiona Ebner
Cc: qemu-block, qemu-devel, kwolf, den, andrey.drobyshev, hreitz,
stefanha, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
On Tue, May 20, 2025 at 12:29:54PM +0200, Fiona Ebner wrote:
> This is a small step in preparation to mark bdrv_drained_begin() as
> GRAPH_UNLOCKED. More concretely, it is in preparatoin to move the
preparation
> drain out of bdrv_change_aio_context() and marking that function as
> GRAPH_RDLOCK.
>
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
>
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
* Re: [PATCH v2 06/24] block: mark change_aio_ctx() callback and instances as GRAPH_RDLOCK(_PTR)
From: Kevin Wolf @ 2025-05-21 15:23 UTC
To: Fiona Ebner
Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
On 20.05.2025 at 12:29, Fiona Ebner wrote:
> This is a small step in preparation to mark bdrv_drained_begin() as
> GRAPH_UNLOCKED. More concretely, it is in preparatoin to move the
> drain out of bdrv_change_aio_context() and marking that function as
> GRAPH_RDLOCK.
>
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
I noticed that all existing callers of .change_aio_ctx(), apart from the
one annotated only in the previous patch, are in fact even GRAPH_WRLOCK.
Maybe it's not important any more nowadays given that the assigned
AioContext of a block node is more of a default, but traditionally I
think it would have made sense to treat changing the AioContext as
changing the graph, i.e. bs->aio_context could have been GRAPH_WRLOCK.
Just posting this in case someone has more thoughts on it. I don't
necessarily disagree with using GRAPH_RDLOCK.
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
* [PATCH v2 07/24] block: mark bdrv_child_change_aio_context() GRAPH_RDLOCK
From: Fiona Ebner @ 2025-05-20 10:29 UTC
To: qemu-block
Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
This is a small step in preparation to mark bdrv_drained_begin() as
GRAPH_UNLOCKED. More concretely, it is in preparatoin to move the
drain out of bdrv_change_aio_context() and marking that function as
GRAPH_RDLOCK.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
New in v2.
include/block/block-global-state.h | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
index 9be34b3c99..aad160956a 100644
--- a/include/block/block-global-state.h
+++ b/include/block/block-global-state.h
@@ -274,9 +274,10 @@ int bdrv_debug_remove_breakpoint(BlockDriverState *bs, const char *tag);
int bdrv_debug_resume(BlockDriverState *bs, const char *tag);
bool bdrv_debug_is_suspended(BlockDriverState *bs, const char *tag);
-bool bdrv_child_change_aio_context(BdrvChild *c, AioContext *ctx,
- GHashTable *visited, Transaction *tran,
- Error **errp);
+bool GRAPH_RDLOCK
+bdrv_child_change_aio_context(BdrvChild *c, AioContext *ctx,
+ GHashTable *visited, Transaction *tran,
+ Error **errp);
int bdrv_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
BdrvChild *ignore_child, Error **errp);
--
2.39.5
* Re: [PATCH v2 07/24] block: mark bdrv_child_change_aio_context() GRAPH_RDLOCK
From: Eric Blake @ 2025-05-20 15:48 UTC
To: Fiona Ebner
Cc: qemu-block, qemu-devel, kwolf, den, andrey.drobyshev, hreitz,
stefanha, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
On Tue, May 20, 2025 at 12:29:55PM +0200, Fiona Ebner wrote:
> This is a small step in preparation to mark bdrv_drained_begin() as
> GRAPH_UNLOCKED. More concretely, it is in preparatoin to move the
preparation
> drain out of bdrv_change_aio_context() and marking that function as
> GRAPH_RDLOCK.
>
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
>
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
* Re: [PATCH v2 07/24] block: mark bdrv_child_change_aio_context() GRAPH_RDLOCK
From: Kevin Wolf @ 2025-05-21 15:28 UTC
To: Fiona Ebner
Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
On 20.05.2025 at 12:29, Fiona Ebner wrote:
> This is a small step in preparation to mark bdrv_drained_begin() as
> GRAPH_UNLOCKED. More concretely, it is in preparatoin to move the
> drain out of bdrv_change_aio_context() and marking that function as
> GRAPH_RDLOCK.
>
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
With the typo mentioned by Eric fixed:
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
* Re: [PATCH v2 07/24] block: mark bdrv_child_change_aio_context() GRAPH_RDLOCK
From: Andrey Drobyshev @ 2025-05-23 17:20 UTC
To: Fiona Ebner, qemu-block
Cc: qemu-devel, kwolf, den, hreitz, stefanha, eblake, jsnow,
vsementsov, xiechanglong.d, wencongyang2, berto, fam, ari
On 5/20/25 1:29 PM, Fiona Ebner wrote:
> This is a small step in preparation to mark bdrv_drained_begin() as
> GRAPH_UNLOCKED. More concretely, it is in preparatoin to move the
> drain out of bdrv_change_aio_context() and marking that function as
> GRAPH_RDLOCK.
>
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
>
> New in v2.
>
> include/block/block-global-state.h | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
> index 9be34b3c99..aad160956a 100644
> --- a/include/block/block-global-state.h
> +++ b/include/block/block-global-state.h
> @@ -274,9 +274,10 @@ int bdrv_debug_remove_breakpoint(BlockDriverState *bs, const char *tag);
> int bdrv_debug_resume(BlockDriverState *bs, const char *tag);
> bool bdrv_debug_is_suspended(BlockDriverState *bs, const char *tag);
>
> -bool bdrv_child_change_aio_context(BdrvChild *c, AioContext *ctx,
> - GHashTable *visited, Transaction *tran,
> - Error **errp);
> +bool GRAPH_RDLOCK
> +bdrv_child_change_aio_context(BdrvChild *c, AioContext *ctx,
> + GHashTable *visited, Transaction *tran,
> + Error **errp);
> int bdrv_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
> BdrvChild *ignore_child, Error **errp);
>
I think we might as well add the GRAPH_RDLOCK mark to the actual
function definition in block.c for better readability.
Andrey
* Re: [PATCH v2 07/24] block: mark bdrv_child_change_aio_context() GRAPH_RDLOCK
From: Kevin Wolf @ 2025-05-26 9:05 UTC
To: Andrey Drobyshev
Cc: Fiona Ebner, qemu-block, qemu-devel, den, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari, tbaeder
On 23.05.2025 at 19:20, Andrey Drobyshev wrote:
> On 5/20/25 1:29 PM, Fiona Ebner wrote:
> > This is a small step in preparation to mark bdrv_drained_begin() as
> > GRAPH_UNLOCKED. More concretely, it is in preparatoin to move the
> > drain out of bdrv_change_aio_context() and marking that function as
> > GRAPH_RDLOCK.
> >
> > Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> > ---
> >
> > New in v2.
> >
> > include/block/block-global-state.h | 7 ++++---
> > 1 file changed, 4 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
> > index 9be34b3c99..aad160956a 100644
> > --- a/include/block/block-global-state.h
> > +++ b/include/block/block-global-state.h
> > @@ -274,9 +274,10 @@ int bdrv_debug_remove_breakpoint(BlockDriverState *bs, const char *tag);
> > int bdrv_debug_resume(BlockDriverState *bs, const char *tag);
> > bool bdrv_debug_is_suspended(BlockDriverState *bs, const char *tag);
> >
> > -bool bdrv_child_change_aio_context(BdrvChild *c, AioContext *ctx,
> > - GHashTable *visited, Transaction *tran,
> > - Error **errp);
> > +bool GRAPH_RDLOCK
> > +bdrv_child_change_aio_context(BdrvChild *c, AioContext *ctx,
> > + GHashTable *visited, Transaction *tran,
> > + Error **errp);
> > int bdrv_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
> > BdrvChild *ignore_child, Error **errp);
> >
>
> I think we might as well add the GRAPH_RDLOCK mark to the actual
> function definition in block.c for better readability.
We don't do this for public functions. The reason is that TSA can't
catch it if you annotate the function definition, but forget it in the
header file. So to the human reader (and in code after the definition in
the same file) it would look like the lock is taken, but in reality
callers in other source files can call the function without holding the
lock.
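As a minimal sketch (not taken from the series) of why the header is the
place that matters, assume the annotation were only on the definition:

    /* include/block/block-global-state.h -- annotation forgotten here */
    bool bdrv_child_change_aio_context(BdrvChild *c, AioContext *ctx,
                                       GHashTable *visited, Transaction *tran,
                                       Error **errp);

    /* block.c -- annotated only at the definition */
    bool GRAPH_RDLOCK
    bdrv_child_change_aio_context(BdrvChild *c, AioContext *ctx,
                                  GHashTable *visited, Transaction *tran,
                                  Error **errp)
    {
        /*
         * TSA checks callers against the declaration they can see, so a
         * caller in another file that only includes the header above is
         * not required to hold the graph lock and compiles without
         * warnings.
         */
        return true; /* body irrelevant to the example */
    }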
If https://github.com/llvm/llvm-project/pull/67520 is merged eventually,
we can reconsider this rule, but for now, it's better to keep it.
Kevin
^ permalink raw reply [flat|nested] 49+ messages in thread
* [PATCH v2 08/24] block: move drain outside of bdrv_change_aio_context() and mark GRAPH_RDLOCK
2025-05-20 10:29 [PATCH v2 00/24] block: do not drain while holding the graph lock Fiona Ebner
` (6 preceding siblings ...)
2025-05-20 10:29 ` [PATCH v2 07/24] block: mark bdrv_child_change_aio_context() GRAPH_RDLOCK Fiona Ebner
@ 2025-05-20 10:29 ` Fiona Ebner
2025-05-21 16:05 ` Kevin Wolf
2025-05-20 10:29 ` [PATCH v2 09/24] block: move drain outside of bdrv_try_change_aio_context() Fiona Ebner
` (16 subsequent siblings)
24 siblings, 1 reply; 49+ messages in thread
From: Fiona Ebner @ 2025-05-20 10:29 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
This is in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED.
Note that even if bdrv_drained_begin() would already be marked as
GRAPH_UNLOCKED, TSA would not complain about the instance in
bdrv_change_aio_context() before this change, because it is preceded
by a bdrv_graph_rdunlock_main_loop() call. It is not correct to
release the lock here, and in case the caller holds a write lock, it
wouldn't actually release the lock.
In combination with block-stream, there is a deadlock that can happen
because of this [0]. In particular, it can happen that
main thread                               IO thread
1. acquires write lock
                                          in blk_co_do_preadv_part():
                                          2. have non-zero blk->in_flight
3. try to acquire read lock
4. begin drain
Steps 3 and 4 might be switched. Draining will poll and get stuck,
because it will see the non-zero in_flight counter. But the IO thread
will not make any progress either, because it cannot acquire the read
lock.
After this change, all paths to bdrv_change_aio_context() drain:
bdrv_change_aio_context() is called by:
1. bdrv_child_cb_change_aio_ctx() which is only called via the
change_aio_ctx() callback, see below.
2. bdrv_child_change_aio_context(), see below.
3. bdrv_try_change_aio_context(), where a drained section is
introduced.
The change_aio_ctx() callback is called by:
1. bdrv_attach_child_common_abort(), where a drained section is
introduced.
2. bdrv_attach_child_common(), where a drained section is introduced.
3. bdrv_parent_change_aio_context(), see below.
bdrv_child_change_aio_context() is called by:
1. bdrv_change_aio_context(), i.e. recursive, so being in a drained
section is invariant.
2. child_job_change_aio_ctx(), which is only called via the
change_aio_ctx() callback, see above.
bdrv_parent_change_aio_context() is called by:
1. bdrv_change_aio_context(), i.e. recursive, so being in a drained
section is invariant.
This resolves all code paths. Note that bdrv_attach_child_common()
and bdrv_attach_child_common_abort() hold the graph write lock and
callers of bdrv_try_change_aio_context() might too, so they are not
actually allowed to drain either. This will be addressed in the
following commits.
More granular draining is not trivially possible, because
bdrv_change_aio_context() can recursively call itself e.g. via
bdrv_child_change_aio_context().
[0]: https://lore.kernel.org/qemu-devel/73839c04-7616-407e-b057-80ca69e63f51@virtuozzo.com/
Reported-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
Changes in v2:
* Split up into smaller pieces, flesh out commit messages.
block.c | 27 ++++++++++++++-------------
1 file changed, 14 insertions(+), 13 deletions(-)
diff --git a/block.c b/block.c
index 01144c895e..7148618504 100644
--- a/block.c
+++ b/block.c
@@ -106,9 +106,9 @@ static void bdrv_reopen_abort(BDRVReopenState *reopen_state);
static bool bdrv_backing_overridden(BlockDriverState *bs);
-static bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
- GHashTable *visited, Transaction *tran,
- Error **errp);
+static bool GRAPH_RDLOCK
+bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
+ GHashTable *visited, Transaction *tran, Error **errp);
/* If non-zero, use only whitelisted block drivers */
static int use_bdrv_whitelist;
@@ -3040,8 +3040,10 @@ static void GRAPH_WRLOCK bdrv_attach_child_common_abort(void *opaque)
/* No need to visit `child`, because it has been detached already */
visited = g_hash_table_new(NULL, NULL);
+ bdrv_drain_all_begin();
ret = s->child->klass->change_aio_ctx(s->child, s->old_parent_ctx,
visited, tran, &error_abort);
+ bdrv_drain_all_end();
g_hash_table_destroy(visited);
/* transaction is supposed to always succeed */
@@ -3122,9 +3124,11 @@ bdrv_attach_child_common(BlockDriverState *child_bs,
bool ret_child;
g_hash_table_add(visited, new_child);
+ bdrv_drain_all_begin();
ret_child = child_class->change_aio_ctx(new_child, child_ctx,
visited, aio_ctx_tran,
NULL);
+ bdrv_drain_all_end();
if (ret_child == true) {
error_free(local_err);
ret = 0;
@@ -7619,10 +7623,6 @@ bool bdrv_child_change_aio_context(BdrvChild *c, AioContext *ctx,
static void bdrv_set_aio_context_clean(void *opaque)
{
BdrvStateSetAioContext *state = (BdrvStateSetAioContext *) opaque;
- BlockDriverState *bs = (BlockDriverState *) state->bs;
-
- /* Paired with bdrv_drained_begin in bdrv_change_aio_context() */
- bdrv_drained_end(bs);
g_free(state);
}
@@ -7650,6 +7650,8 @@ static TransactionActionDrv set_aio_context = {
*
* @visited will accumulate all visited BdrvChild objects. The caller is
* responsible for freeing the list afterwards.
+ *
+ * @bs must be drained.
*/
static bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
GHashTable *visited, Transaction *tran,
@@ -7664,21 +7666,17 @@ static bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
return true;
}
- bdrv_graph_rdlock_main_loop();
QLIST_FOREACH(c, &bs->parents, next_parent) {
if (!bdrv_parent_change_aio_context(c, ctx, visited, tran, errp)) {
- bdrv_graph_rdunlock_main_loop();
return false;
}
}
QLIST_FOREACH(c, &bs->children, next) {
if (!bdrv_child_change_aio_context(c, ctx, visited, tran, errp)) {
- bdrv_graph_rdunlock_main_loop();
return false;
}
}
- bdrv_graph_rdunlock_main_loop();
state = g_new(BdrvStateSetAioContext, 1);
*state = (BdrvStateSetAioContext) {
@@ -7686,8 +7684,7 @@ static bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
.bs = bs,
};
- /* Paired with bdrv_drained_end in bdrv_set_aio_context_clean() */
- bdrv_drained_begin(bs);
+ assert(bs->quiesce_counter > 0);
tran_add(tran, &set_aio_context, state);
@@ -7720,7 +7717,11 @@ int bdrv_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
if (ignore_child) {
g_hash_table_add(visited, ignore_child);
}
+ bdrv_drain_all_begin();
+ bdrv_graph_rdlock_main_loop();
ret = bdrv_change_aio_context(bs, ctx, visited, tran, errp);
+ bdrv_graph_rdunlock_main_loop();
+ bdrv_drain_all_end();
g_hash_table_destroy(visited);
/*
--
2.39.5
^ permalink raw reply related [flat|nested] 49+ messages in thread
* Re: [PATCH v2 08/24] block: move drain outside of bdrv_change_aio_context() and mark GRAPH_RDLOCK
2025-05-20 10:29 ` [PATCH v2 08/24] block: move drain outside of bdrv_change_aio_context() and mark GRAPH_RDLOCK Fiona Ebner
@ 2025-05-21 16:05 ` Kevin Wolf
2025-05-22 9:09 ` Fiona Ebner
0 siblings, 1 reply; 49+ messages in thread
From: Kevin Wolf @ 2025-05-21 16:05 UTC (permalink / raw)
To: Fiona Ebner
Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
Am 20.05.2025 um 12:29 hat Fiona Ebner geschrieben:
> This is in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED.
>
> Note that even if bdrv_drained_begin() would already be marked as
"if ... were already marked"
> GRAPH_UNLOCKED, TSA would not complain about the instance in
> bdrv_change_aio_context() before this change, because it is preceded
> by a bdrv_graph_rdunlock_main_loop() call. It is not correct to
> release the lock here, and in case the caller holds a write lock, it
> wouldn't actually release the lock.
>
> In combination with block-stream, there is a deadlock that can happen
> because of this [0]. In particular, it can happen that
> main thread                               IO thread
> 1. acquires write lock
>                                           in blk_co_do_preadv_part():
>                                           2. have non-zero blk->in_flight
> 3. try to acquire read lock
> 4. begin drain
>
> Steps 3 and 4 might be switched. Draining will poll and get stuck,
> because it will see the non-zero in_flight counter. But the IO thread
> will not make any progress either, because it cannot acquire the read
> lock.
>
> After this change, all paths to bdrv_change_aio_context() drain:
> bdrv_change_aio_context() is called by:
> 1. bdrv_child_cb_change_aio_ctx() which is only called via the
> change_aio_ctx() callback, see below.
> 2. bdrv_child_change_aio_context(), see below.
> 3. bdrv_try_change_aio_context(), where a drained section is
> introduced.
>
> The change_aio_ctx() callback is called by:
> 1. bdrv_attach_child_common_abort(), where a drained section is
> introduced.
> 2. bdrv_attach_child_common(), where a drained section is introduced.
> 3. bdrv_parent_change_aio_context(), see below.
>
> bdrv_child_change_aio_context() is called by:
> 1. bdrv_change_aio_context(), i.e. recursive, so being in a drained
> section is invariant.
> 2. child_job_change_aio_ctx(), which is only called via the
> change_aio_ctx() callback, see above.
>
> bdrv_parent_change_aio_context() is called by:
> 1. bdrv_change_aio_context(), i.e. recursive, so being in a drained
> section is invariant.
>
> This resolves all code paths. Note that bdrv_attach_child_common()
> and bdrv_attach_child_common_abort() hold the graph write lock and
> callers of bdrv_try_change_aio_context() might too, so they are not
> actually allowed to drain either. This will be addressed in the
> following commits.
>
> More granular draining is not trivially possible, because
> bdrv_change_aio_context() can recursively call itself e.g. via
> bdrv_child_change_aio_context().
>
> [0]: https://lore.kernel.org/qemu-devel/73839c04-7616-407e-b057-80ca69e63f51@virtuozzo.com/
>
> Reported-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
>
> Changes in v2:
> * Split up into smaller pieces, flesh out commit messages.
>
> block.c | 27 ++++++++++++++-------------
> 1 file changed, 14 insertions(+), 13 deletions(-)
>
> diff --git a/block.c b/block.c
> index 01144c895e..7148618504 100644
> --- a/block.c
> +++ b/block.c
> @@ -106,9 +106,9 @@ static void bdrv_reopen_abort(BDRVReopenState *reopen_state);
>
> static bool bdrv_backing_overridden(BlockDriverState *bs);
>
> -static bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
> - GHashTable *visited, Transaction *tran,
> - Error **errp);
> +static bool GRAPH_RDLOCK
> +bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
> + GHashTable *visited, Transaction *tran, Error **errp);
For static functions, we should have the GRAPH_RDLOCK annotation both
here and in the actual definition, too.
> /* If non-zero, use only whitelisted block drivers */
> static int use_bdrv_whitelist;
> @@ -3040,8 +3040,10 @@ static void GRAPH_WRLOCK bdrv_attach_child_common_abort(void *opaque)
>
> /* No need to visit `child`, because it has been detached already */
> visited = g_hash_table_new(NULL, NULL);
> + bdrv_drain_all_begin();
> ret = s->child->klass->change_aio_ctx(s->child, s->old_parent_ctx,
> visited, tran, &error_abort);
> + bdrv_drain_all_end();
> g_hash_table_destroy(visited);
>
> /* transaction is supposed to always succeed */
> @@ -3122,9 +3124,11 @@ bdrv_attach_child_common(BlockDriverState *child_bs,
> bool ret_child;
>
> g_hash_table_add(visited, new_child);
> + bdrv_drain_all_begin();
> ret_child = child_class->change_aio_ctx(new_child, child_ctx,
> visited, aio_ctx_tran,
> NULL);
> + bdrv_drain_all_end();
> if (ret_child == true) {
> error_free(local_err);
> ret = 0;
Should we document in the header file that BdrvChildClass.change_aio_ctx
is called with the node drained?
We could add assertions to bdrv_child/parent_change_aio_context or at
least comments to this effect. (Assertions might be over the top because
it's easy to verify that both are only called from
bdrv_change_aio_context().)
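For instance (just a sketch, placement hypothetical), mirroring the
assert this patch adds to bdrv_change_aio_context():

    /* at the top of bdrv_child_change_aio_context() */
    assert(c->bs->quiesce_counter > 0); /* caller must keep this drained */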
> @@ -7619,10 +7623,6 @@ bool bdrv_child_change_aio_context(BdrvChild *c, AioContext *ctx,
> static void bdrv_set_aio_context_clean(void *opaque)
> {
> BdrvStateSetAioContext *state = (BdrvStateSetAioContext *) opaque;
> - BlockDriverState *bs = (BlockDriverState *) state->bs;
> -
> - /* Paired with bdrv_drained_begin in bdrv_change_aio_context() */
> - bdrv_drained_end(bs);
>
> g_free(state);
> }
> @@ -7650,6 +7650,8 @@ static TransactionActionDrv set_aio_context = {
> *
> * @visited will accumulate all visited BdrvChild objects. The caller is
> * responsible for freeing the list afterwards.
> + *
> + * @bs must be drained.
> */
> static bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
> GHashTable *visited, Transaction *tran,
> @@ -7664,21 +7666,17 @@ static bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
> return true;
> }
>
> - bdrv_graph_rdlock_main_loop();
> QLIST_FOREACH(c, &bs->parents, next_parent) {
> if (!bdrv_parent_change_aio_context(c, ctx, visited, tran, errp)) {
> - bdrv_graph_rdunlock_main_loop();
> return false;
> }
> }
>
> QLIST_FOREACH(c, &bs->children, next) {
> if (!bdrv_child_change_aio_context(c, ctx, visited, tran, errp)) {
> - bdrv_graph_rdunlock_main_loop();
> return false;
> }
> }
> - bdrv_graph_rdunlock_main_loop();
>
> state = g_new(BdrvStateSetAioContext, 1);
> *state = (BdrvStateSetAioContext) {
> @@ -7686,8 +7684,7 @@ static bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
> .bs = bs,
> };
>
> - /* Paired with bdrv_drained_end in bdrv_set_aio_context_clean() */
> - bdrv_drained_begin(bs);
> + assert(bs->quiesce_counter > 0);
>
> tran_add(tran, &set_aio_context, state);
>
> @@ -7720,7 +7717,11 @@ int bdrv_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
> if (ignore_child) {
> g_hash_table_add(visited, ignore_child);
> }
> + bdrv_drain_all_begin();
> + bdrv_graph_rdlock_main_loop();
> ret = bdrv_change_aio_context(bs, ctx, visited, tran, errp);
> + bdrv_graph_rdunlock_main_loop();
> + bdrv_drain_all_end();
> g_hash_table_destroy(visited);
I think you're ending the drained section too early here. Previously,
the nodes were kept drained until after tran_abort/commit(), and I think
that's important (tran_commit() is the thing that actually switches the
AioContext).
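Roughly, as a sketch of the resulting ordering (assuming the existing
tran_commit()/tran_abort() structure of the function):

    bdrv_drain_all_begin();
    bdrv_graph_rdlock_main_loop();
    ret = bdrv_change_aio_context(bs, ctx, visited, tran, errp);
    bdrv_graph_rdunlock_main_loop();
    g_hash_table_destroy(visited);

    if (ret < 0) {
        tran_abort(tran);
    } else {
        tran_commit(tran); /* this is what actually switches the AioContext */
    }
    /* only undrain after the transaction has been committed/aborted */
    bdrv_drain_all_end();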
Kevin
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 08/24] block: move drain outside of bdrv_change_aio_context() and mark GRAPH_RDLOCK
2025-05-21 16:05 ` Kevin Wolf
@ 2025-05-22 9:09 ` Fiona Ebner
2025-05-22 12:05 ` Kevin Wolf
0 siblings, 1 reply; 49+ messages in thread
From: Fiona Ebner @ 2025-05-22 9:09 UTC (permalink / raw)
To: Kevin Wolf
Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
Am 21.05.25 um 18:05 schrieb Kevin Wolf:
> Am 20.05.2025 um 12:29 hat Fiona Ebner geschrieben:
>> This is in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED.
>>
>> Note that even if bdrv_drained_begin() would already be marked as
>
> "if ... were already marked"
Ack.
---snip 8<---
>> diff --git a/block.c b/block.c
>> index 01144c895e..7148618504 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -106,9 +106,9 @@ static void bdrv_reopen_abort(BDRVReopenState *reopen_state);
>>
>> static bool bdrv_backing_overridden(BlockDriverState *bs);
>>
>> -static bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
>> - GHashTable *visited, Transaction *tran,
>> - Error **errp);
>> +static bool GRAPH_RDLOCK
>> +bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
>> + GHashTable *visited, Transaction *tran, Error **errp);
>
> For static functions, we should have the GRAPH_RDLOCK annotation both
> here and in the actual definition, too.
Will fix in v3!
>> /* If non-zero, use only whitelisted block drivers */
>> static int use_bdrv_whitelist;
>> @@ -3040,8 +3040,10 @@ static void GRAPH_WRLOCK bdrv_attach_child_common_abort(void *opaque)
>>
>> /* No need to visit `child`, because it has been detached already */
>> visited = g_hash_table_new(NULL, NULL);
>> + bdrv_drain_all_begin();
>> ret = s->child->klass->change_aio_ctx(s->child, s->old_parent_ctx,
>> visited, tran, &error_abort);
>> + bdrv_drain_all_end();
>> g_hash_table_destroy(visited);
>>
>> /* transaction is supposed to always succeed */
>> @@ -3122,9 +3124,11 @@ bdrv_attach_child_common(BlockDriverState *child_bs,
>> bool ret_child;
>>
>> g_hash_table_add(visited, new_child);
>> + bdrv_drain_all_begin();
>> ret_child = child_class->change_aio_ctx(new_child, child_ctx,
>> visited, aio_ctx_tran,
>> NULL);
>> + bdrv_drain_all_end();
>> if (ret_child == true) {
>> error_free(local_err);
>> ret = 0;
>
> Should we document in the header file that BdrvChildClass.change_aio_ctx
> is called with the node drained?
>
> We could add assertions to bdrv_child/parent_change_aio_context or at
> least comments to this effect. (Assertions might be over the top because
> it's easy to verify that both are only called from
> bdrv_change_aio_context().)
Sounds good. Would the following be okay?
For bdrv_parent_change_aio_context() and for change_aio_ctx() the same
(except the name of @child is @c for the former):
> /*
> * Changes the AioContext of for @child to @ctx and recursively for the
> * associated block nodes and all their children and parents. Returns true if
> * the change is possible and the transaction @tran can be continued. Returns
> * false and sets @errp if not and the transaction must be aborted.
> *
> * @visited will accumulate all visited BdrvChild objects. The caller is
> * responsible for freeing the list afterwards.
> *
> * Must be called with the affected block nodes drained.
> */
and for bdrv_child_change_aio_context() slightly different:
> /*
> * Changes the AioContext of for @c->bs to @ctx and recursively for all its
> * children and parents. Returns true if the change is possible and the
> * transaction @tran can be continued. Returns false and sets @errp if not and
> * the transaction must be aborted.
> *
> * @visited will accumulate all visited BdrvChild objects. The caller is
> * responsible for freeing the list afterwards.
> *
> * Must be called with the affected block nodes drained.
> */
---snip 8<---
>> @@ -7720,7 +7717,11 @@ int bdrv_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
>> if (ignore_child) {
>> g_hash_table_add(visited, ignore_child);
>> }
>> + bdrv_drain_all_begin();
>> + bdrv_graph_rdlock_main_loop();
>> ret = bdrv_change_aio_context(bs, ctx, visited, tran, errp);
>> + bdrv_graph_rdunlock_main_loop();
>> + bdrv_drain_all_end();
>> g_hash_table_destroy(visited);
>
> I think you're ending the drained section too early here. Previously,
> the nodes were kept drained until after tran_abort/commit(), and I think
> that's important (tran_commit() is the thing that actually switches the
> AioContext).
Right, I'll change this in v3.
Best Regards,
Fiona
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 08/24] block: move drain outside of bdrv_change_aio_context() and mark GRAPH_RDLOCK
2025-05-22 9:09 ` Fiona Ebner
@ 2025-05-22 12:05 ` Kevin Wolf
0 siblings, 0 replies; 49+ messages in thread
From: Kevin Wolf @ 2025-05-22 12:05 UTC (permalink / raw)
To: Fiona Ebner
Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
Am 22.05.2025 um 11:09 hat Fiona Ebner geschrieben:
> Am 21.05.25 um 18:05 schrieb Kevin Wolf:
> > Am 20.05.2025 um 12:29 hat Fiona Ebner geschrieben:
> >> This is in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED.
> >>
> >> Note that even if bdrv_drained_begin() would already be marked as
> >
> > "if ... were already marked"
>
> Ack.
>
> ---snip 8<---
>
> >> diff --git a/block.c b/block.c
> >> index 01144c895e..7148618504 100644
> >> --- a/block.c
> >> +++ b/block.c
> >> @@ -106,9 +106,9 @@ static void bdrv_reopen_abort(BDRVReopenState *reopen_state);
> >>
> >> static bool bdrv_backing_overridden(BlockDriverState *bs);
> >>
> >> -static bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
> >> - GHashTable *visited, Transaction *tran,
> >> - Error **errp);
> >> +static bool GRAPH_RDLOCK
> >> +bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
> >> + GHashTable *visited, Transaction *tran, Error **errp);
> >
> > For static functions, we should have the GRAPH_RDLOCK annotation both
> > here and in the actual definition, too.
>
> Will fix in v3!
>
> >> /* If non-zero, use only whitelisted block drivers */
> >> static int use_bdrv_whitelist;
> >> @@ -3040,8 +3040,10 @@ static void GRAPH_WRLOCK bdrv_attach_child_common_abort(void *opaque)
> >>
> >> /* No need to visit `child`, because it has been detached already */
> >> visited = g_hash_table_new(NULL, NULL);
> >> + bdrv_drain_all_begin();
> >> ret = s->child->klass->change_aio_ctx(s->child, s->old_parent_ctx,
> >> visited, tran, &error_abort);
> >> + bdrv_drain_all_end();
> >> g_hash_table_destroy(visited);
> >>
> >> /* transaction is supposed to always succeed */
> >> @@ -3122,9 +3124,11 @@ bdrv_attach_child_common(BlockDriverState *child_bs,
> >> bool ret_child;
> >>
> >> g_hash_table_add(visited, new_child);
> >> + bdrv_drain_all_begin();
> >> ret_child = child_class->change_aio_ctx(new_child, child_ctx,
> >> visited, aio_ctx_tran,
> >> NULL);
> >> + bdrv_drain_all_end();
> >> if (ret_child == true) {
> >> error_free(local_err);
> >> ret = 0;
> >
> > Should we document in the header file that BdrvChildClass.change_aio_ctx
> > is called with the node drained?
> >
> > We could add assertions to bdrv_child/parent_change_aio_context or at
> > least comments to this effect. (Assertions might be over the top because
> > it's easy to verify that both are only called from
> > bdrv_change_aio_context().)
>
> Sounds good. Would the following be okay?
>
> For bdrv_parent_change_aio_context() and for change_aio_ctx() the same
> (except the name of @child is @c for the former):
>
> > /*
> > * Changes the AioContext of for @child to @ctx and recursively for the
> > * associated block nodes and all their children and parents. Returns true if
"of for"?
This might be a bit too specific for child_of_bds, while the function is
also implemented for BlockBackends and block jobs. Keeping the
description more in line with other BdrvChildClass functions, maybe
something generic like:
Notifies the parent that the child is trying to change its
AioContext. The parent may in turn change the AioContext of other
nodes in the same transaction.
> > * the change is possible and the transaction @tran can be continued. Returns
> > * false and sets @errp if not and the transaction must be aborted.
> > *
> > * @visited will accumulate all visited BdrvChild objects. The caller is
> > * responsible for freeing the list afterwards.
> > *
> > * Must be called with the affected block nodes drained.
> > */
>
> and for bdrv_child_change_aio_context() slightly different:
>
> > /*
> > * Changes the AioContext of for @c->bs to @ctx and recursively for all its
Again "of for".
> > * children and parents. Returns true if the change is possible and the
> > * transaction @tran can be continued. Returns false and sets @errp if not and
> > * the transaction must be aborted.
> > *
> > * @visited will accumulate all visited BdrvChild objects. The caller is
> > * responsible for freeing the list afterwards.
> > *
> > * Must be called with the affected block nodes drained.
> > */
The rest looks good. (Much better than what I would have done; I was
only thinking of the last line, without proper documentation of the
whole function.)
Kevin
^ permalink raw reply [flat|nested] 49+ messages in thread
* [PATCH v2 09/24] block: move drain outside of bdrv_try_change_aio_context()
2025-05-20 10:29 [PATCH v2 00/24] block: do not drain while holding the graph lock Fiona Ebner
` (7 preceding siblings ...)
2025-05-20 10:29 ` [PATCH v2 08/24] block: move drain outside of bdrv_change_aio_context() and mark GRAPH_RDLOCK Fiona Ebner
@ 2025-05-20 10:29 ` Fiona Ebner
2025-05-21 16:36 ` Kevin Wolf
2025-05-20 10:29 ` [PATCH v2 10/24] block: move drain outside of bdrv_attach_child_common(_abort)() Fiona Ebner
` (15 subsequent siblings)
24 siblings, 1 reply; 49+ messages in thread
From: Fiona Ebner @ 2025-05-20 10:29 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
This is part of resolving the deadlock mentioned in commit "block:
move draining out of bdrv_change_aio_context() and mark GRAPH_RDLOCK".
Convert the function to a _locked() version that has to be called with
the graph lock held and add a convenience wrapper that has to be
called with the graph unlocked, which drains and takes the lock
itself. Since bdrv_try_change_aio_context() is global state code, the
wrapper is too.
Callers are adapted to use the appropriate variant. In the
test_set_aio_context() unit test, prior drains can be removed, because
draining already happens inside the new wrapper.
Note that bdrv_attach_child_common_abort(), bdrv_attach_child_common()
and bdrv_root_unref_child() hold the graph lock and are not actually
allowed to drain either. This will be addressed in the following
commits.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
Changes in v2:
* Split up into smaller pieces, flesh out commit messages.
block.c | 58 ++++++++++++++++++++++--------
blockdev.c | 15 +++++---
include/block/block-global-state.h | 8 +++--
tests/unit/test-bdrv-drain.c | 4 ---
4 files changed, 60 insertions(+), 25 deletions(-)
diff --git a/block.c b/block.c
index 7148618504..5281ad1313 100644
--- a/block.c
+++ b/block.c
@@ -3028,7 +3028,10 @@ static void GRAPH_WRLOCK bdrv_attach_child_common_abort(void *opaque)
bdrv_replace_child_noperm(s->child, NULL);
if (bdrv_get_aio_context(bs) != s->old_child_ctx) {
- bdrv_try_change_aio_context(bs, s->old_child_ctx, NULL, &error_abort);
+ bdrv_drain_all_begin();
+ bdrv_try_change_aio_context_locked(bs, s->old_child_ctx, NULL,
+ &error_abort);
+ bdrv_drain_all_end();
}
if (bdrv_child_get_parent_aio_context(s->child) != s->old_parent_ctx) {
@@ -3115,8 +3118,10 @@ bdrv_attach_child_common(BlockDriverState *child_bs,
parent_ctx = bdrv_child_get_parent_aio_context(new_child);
if (child_ctx != parent_ctx) {
Error *local_err = NULL;
- int ret = bdrv_try_change_aio_context(child_bs, parent_ctx, NULL,
- &local_err);
+ bdrv_drain_all_begin();
+ int ret = bdrv_try_change_aio_context_locked(child_bs, parent_ctx, NULL,
+ &local_err);
+ bdrv_drain_all_end();
if (ret < 0 && child_class->change_aio_ctx) {
Transaction *aio_ctx_tran = tran_new();
@@ -3319,8 +3324,10 @@ void bdrv_root_unref_child(BdrvChild *child)
* When the parent requiring a non-default AioContext is removed, the
* node moves back to the main AioContext
*/
- bdrv_try_change_aio_context(child_bs, qemu_get_aio_context(), NULL,
- NULL);
+ bdrv_drain_all_begin();
+ bdrv_try_change_aio_context_locked(child_bs, qemu_get_aio_context(),
+ NULL, NULL);
+ bdrv_drain_all_end();
}
bdrv_schedule_unref(child_bs);
@@ -7697,9 +7704,13 @@ static bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
*
* If ignore_child is not NULL, that child (and its subgraph) will not
* be touched.
+ *
+ * Called with the graph lock held.
+ *
+ * Called while all bs are drained.
*/
-int bdrv_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
- BdrvChild *ignore_child, Error **errp)
+int bdrv_try_change_aio_context_locked(BlockDriverState *bs, AioContext *ctx,
+ BdrvChild *ignore_child, Error **errp)
{
Transaction *tran;
GHashTable *visited;
@@ -7708,20 +7719,16 @@ int bdrv_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
/*
* Recursion phase: go through all nodes of the graph.
- * Take care of checking that all nodes support changing AioContext
- * and drain them, building a linear list of callbacks to run if everything
- * is successful (the transaction itself).
+ * Take care of checking that all nodes support changing AioContext,
+ * building a linear list of callbacks to run if everything is successful
+ * (the transaction itself).
*/
tran = tran_new();
visited = g_hash_table_new(NULL, NULL);
if (ignore_child) {
g_hash_table_add(visited, ignore_child);
}
- bdrv_drain_all_begin();
- bdrv_graph_rdlock_main_loop();
ret = bdrv_change_aio_context(bs, ctx, visited, tran, errp);
- bdrv_graph_rdunlock_main_loop();
- bdrv_drain_all_end();
g_hash_table_destroy(visited);
/*
@@ -7741,6 +7748,29 @@ int bdrv_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
return 0;
}
+/*
+ * Change bs's and recursively all of its parents' and children's AioContext
+ * to the given new context, returning an error if that isn't possible.
+ *
+ * If ignore_child is not NULL, that child (and its subgraph) will not
+ * be touched.
+ */
+int bdrv_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
+ BdrvChild *ignore_child, Error **errp)
+{
+ int ret;
+
+ GLOBAL_STATE_CODE();
+
+ bdrv_drain_all_begin();
+ bdrv_graph_rdlock_main_loop();
+ ret = bdrv_try_change_aio_context_locked(bs, ctx, ignore_child, errp);
+ bdrv_graph_rdunlock_main_loop();
+ bdrv_drain_all_end();
+
+ return ret;
+}
+
void bdrv_add_aio_context_notifier(BlockDriverState *bs,
void (*attached_aio_context)(AioContext *new_context, void *opaque),
void (*detach_aio_context)(void *opaque), void *opaque)
diff --git a/blockdev.c b/blockdev.c
index 3982f9776b..750beba41f 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3601,12 +3601,13 @@ void qmp_x_blockdev_set_iothread(const char *node_name, StrOrNull *iothread,
AioContext *new_context;
BlockDriverState *bs;
- GRAPH_RDLOCK_GUARD_MAINLOOP();
+ bdrv_drain_all_begin();
+ bdrv_graph_rdlock_main_loop();
bs = bdrv_find_node(node_name);
if (!bs) {
error_setg(errp, "Failed to find node with node-name='%s'", node_name);
- return;
+ goto out;
}
/* Protects against accidents. */
@@ -3614,14 +3615,14 @@ void qmp_x_blockdev_set_iothread(const char *node_name, StrOrNull *iothread,
error_setg(errp, "Node %s is associated with a BlockBackend and could "
"be in use (use force=true to override this check)",
node_name);
- return;
+ goto out;
}
if (iothread->type == QTYPE_QSTRING) {
IOThread *obj = iothread_by_id(iothread->u.s);
if (!obj) {
error_setg(errp, "Cannot find iothread %s", iothread->u.s);
- return;
+ goto out;
}
new_context = iothread_get_aio_context(obj);
@@ -3629,7 +3630,11 @@ void qmp_x_blockdev_set_iothread(const char *node_name, StrOrNull *iothread,
new_context = qemu_get_aio_context();
}
- bdrv_try_change_aio_context(bs, new_context, NULL, errp);
+ bdrv_try_change_aio_context_locked(bs, new_context, NULL, errp);
+
+out:
+ bdrv_graph_rdunlock_main_loop();
+ bdrv_drain_all_end();
}
QemuOptsList qemu_common_drive_opts = {
diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
index aad160956a..91f249b5ad 100644
--- a/include/block/block-global-state.h
+++ b/include/block/block-global-state.h
@@ -278,8 +278,12 @@ bool GRAPH_RDLOCK
bdrv_child_change_aio_context(BdrvChild *c, AioContext *ctx,
GHashTable *visited, Transaction *tran,
Error **errp);
-int bdrv_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
- BdrvChild *ignore_child, Error **errp);
+int GRAPH_UNLOCKED
+bdrv_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
+ BdrvChild *ignore_child, Error **errp);
+int GRAPH_RDLOCK
+bdrv_try_change_aio_context_locked(BlockDriverState *bs, AioContext *ctx,
+ BdrvChild *ignore_child, Error **errp);
int GRAPH_RDLOCK bdrv_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz);
int bdrv_probe_geometry(BlockDriverState *bs, HDGeometry *geo);
diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
index 290cd2a70e..3185f3f429 100644
--- a/tests/unit/test-bdrv-drain.c
+++ b/tests/unit/test-bdrv-drain.c
@@ -1396,14 +1396,10 @@ static void test_set_aio_context(void)
bs = bdrv_new_open_driver(&bdrv_test, "test-node", BDRV_O_RDWR,
&error_abort);
- bdrv_drained_begin(bs);
bdrv_try_change_aio_context(bs, ctx_a, NULL, &error_abort);
- bdrv_drained_end(bs);
- bdrv_drained_begin(bs);
bdrv_try_change_aio_context(bs, ctx_b, NULL, &error_abort);
bdrv_try_change_aio_context(bs, qemu_get_aio_context(), NULL, &error_abort);
- bdrv_drained_end(bs);
bdrv_unref(bs);
iothread_join(a);
--
2.39.5
^ permalink raw reply related [flat|nested] 49+ messages in thread
* Re: [PATCH v2 09/24] block: move drain outside of bdrv_try_change_aio_context()
2025-05-20 10:29 ` [PATCH v2 09/24] block: move drain outside of bdrv_try_change_aio_context() Fiona Ebner
@ 2025-05-21 16:36 ` Kevin Wolf
2025-05-22 9:56 ` Fiona Ebner
0 siblings, 1 reply; 49+ messages in thread
From: Kevin Wolf @ 2025-05-21 16:36 UTC (permalink / raw)
To: Fiona Ebner
Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
Am 20.05.2025 um 12:29 hat Fiona Ebner geschrieben:
> This is part of resolving the deadlock mentioned in commit "block:
> move draining out of bdrv_change_aio_context() and mark GRAPH_RDLOCK".
>
> Convert the function to a _locked() version that has to be called with
> the graph lock held and add a convenience wrapper that has to be
> called with the graph unlocked, which drains and takes the lock
> itself. Since bdrv_try_change_aio_context() is global state code, the
> wrapper is too.
>
> Callers are adapted to use the appropriate variant. In the
> test_set_aio_context() unit test, prior drains can be removed, because
> draining already happens inside the new wrapper.
I'm not sure what the difference is between qmp_x_blockdev_set_iothread(),
which is converted to _locked(), and e.g. qmp_blockdev_mirror().
The reason I could see in qmp_x_blockdev_set_iothread() is that (as we
discussed in the RFC version of this series) draining can change the
graph and that could in theory invalidate bs. But the same is true for
other functions in blockdev.c.
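(As a sketch, the usual guard would follow the bdrv_ref()/bdrv_unref()
pattern used elsewhere in the series:

    bdrv_ref(bs);              /* keep bs alive across the drain */
    bdrv_drain_all_begin();
    /* the graph may change here, but bs itself cannot go away */
    bdrv_drain_all_end();
    bdrv_unref(bs);

though even then bs might no longer sit where the caller expects it in
the graph.)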
If it's just that qmp_x_blockdev_set_iothread() was easy to fix and the
others are more complex, that's fine with me, but maybe worth mentioning
in the commit message.
> Note that bdrv_attach_child_common_abort(), bdrv_attach_child_common()
> and bdrv_root_unref_child() hold the graph lock and are not actually
> allowed to drain either. This will be addressed in the following
> commits.
>
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 09/24] block: move drain outside of bdrv_try_change_aio_context()
2025-05-21 16:36 ` Kevin Wolf
@ 2025-05-22 9:56 ` Fiona Ebner
2025-05-22 12:36 ` Kevin Wolf
0 siblings, 1 reply; 49+ messages in thread
From: Fiona Ebner @ 2025-05-22 9:56 UTC (permalink / raw)
To: Kevin Wolf
Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
Am 21.05.25 um 18:36 schrieb Kevin Wolf:
> Am 20.05.2025 um 12:29 hat Fiona Ebner geschrieben:
>> This is part of resolving the deadlock mentioned in commit "block:
>> move draining out of bdrv_change_aio_context() and mark GRAPH_RDLOCK".
>>
>> Convert the function to a _locked() version that has to be called with
>> the graph lock held and add a convenience wrapper that has to be
>> called with the graph unlocked, which drains and takes the lock
>> itself. Since bdrv_try_change_aio_context() is global state code, the
>> wrapper is too.
>>
>> Callers are adapted to use the appropriate variant. In the
>> test_set_aio_context() unit test, prior drains can be removed, because
>> draining already happens inside the new wrapper.
>
> I'm not sure what the difference is between qmp_x_blockdev_set_iothread(),
> which is converted to _locked(), and e.g. qmp_blockdev_mirror().
For deciding on the variant, I only looked at whether the caller holds
the graph lock or not.
> The reason I could see in qmp_x_blockdev_set_iothread() is that (as we
> discussed in the RFC version of this series) draining can change the
> graph and that could in theory invalidate bs. But the same is true for
> other functions in blockdev.c.
Right, my patches are not enough to address that kind of problem.
> If it's just that qmp_x_blockdev_set_iothread() was easy to fix and the
> others are more complex, that's fine with me, but maybe worth mentioning
> in the commit message.
Yes, I can mention something in the commit message, maybe:
Functions like qmp_blockdev_mirror() query the nodes to act on before
draining and locking. In theory, draining could invalidate those nodes.
This kind of issue is not addressed by these commits.
Best Regards,
Fiona
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 09/24] block: move drain outside of bdrv_try_change_aio_context()
2025-05-22 9:56 ` Fiona Ebner
@ 2025-05-22 12:36 ` Kevin Wolf
0 siblings, 0 replies; 49+ messages in thread
From: Kevin Wolf @ 2025-05-22 12:36 UTC (permalink / raw)
To: Fiona Ebner
Cc: qemu-block, qemu-devel, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
Am 22.05.2025 um 11:56 hat Fiona Ebner geschrieben:
> Am 21.05.25 um 18:36 schrieb Kevin Wolf:
> > Am 20.05.2025 um 12:29 hat Fiona Ebner geschrieben:
> >> This is part of resolving the deadlock mentioned in commit "block:
> >> move draining out of bdrv_change_aio_context() and mark GRAPH_RDLOCK".
> >>
> >> Convert the function to a _locked() version that has to be called with
> >> the graph lock held and add a convenience wrapper that has to be
> >> called with the graph unlocked, which drains and takes the lock
> >> itself. Since bdrv_try_change_aio_context() is global state code, the
> >> wrapper is too.
> >>
> >> Callers are adapted to use the appropriate variant. In the
> >> test_set_aio_context() unit test, prior drains can be removed, because
> >> draining already happens inside the new wrapper.
> >
> > I'm not sure what the difference is between qmp_x_blockdev_set_iothread(),
> > which is converted to _locked(), and e.g. qmp_blockdev_mirror().
>
> For deciding on the variant, I only looked at whether the caller holds
> the graph lock or not.
Oh, I see. I was overcomplicating things once again.
Maybe just add "the appropriate variant, depending on whether the caller
already holds the lock"?
> > The reason I could see in qmp_x_blockdev_set_iothread() is that (as we
> > discussed in the RFC version of this series) draining can change the
> > graph and that could in theory invalidate bs. But the same is true for
> > other functions in blockdev.c.
>
> Right, my patches are not enough to address that kind of problem.
>
> > If it's just that qmp_x_blockdev_set_iothread() was easy to fix and the
> > others are more complex, that's fine with me, but maybe worth mentioning
> > in the commit message.
>
> Yes, I can mention something in the commit message, maybe:
>
> Functions like qmp_blockdev_mirror() query the nodes to act on before
> draining and locking. In theory, draining could invalidate those nodes.
> This kind of issue is not addressed by these commits.
Doesn't hurt, but it's also not really needed.
Kevin
^ permalink raw reply [flat|nested] 49+ messages in thread
* [PATCH v2 10/24] block: move drain outside of bdrv_attach_child_common(_abort)()
2025-05-20 10:29 [PATCH v2 00/24] block: do not drain while holding the graph lock Fiona Ebner
` (8 preceding siblings ...)
2025-05-20 10:29 ` [PATCH v2 09/24] block: move drain outside of bdrv_try_change_aio_context() Fiona Ebner
@ 2025-05-20 10:29 ` Fiona Ebner
2025-05-20 10:29 ` [PATCH v2 11/24] block: move drain outside of bdrv_set_backing_hd_drained() Fiona Ebner
` (14 subsequent siblings)
24 siblings, 0 replies; 49+ messages in thread
From: Fiona Ebner @ 2025-05-20 10:29 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
This is part of resolving the deadlock mentioned in commit "block:
move draining out of bdrv_change_aio_context() and mark GRAPH_RDLOCK".
The function bdrv_attach_child_common_abort() is used only as the
abort callback in bdrv_attach_child_common_drv transactions, so the
tran_finalize() calls of such transactions need to be in drained
sections too.
All code paths are covered:
The bdrv_attach_child_common_drv transactions are only used in
bdrv_attach_child_common(), so it is enough to check callers of
bdrv_attach_child_common() following the transactions.
bdrv_attach_child_common() is called by:
1. bdrv_attach_child_noperm(), which does not finalize the
transaction yet.
2. bdrv_root_attach_child(), where a drained section is introduced.
bdrv_attach_child_noperm() is called by:
1. bdrv_attach_child(), where a drained section is introduced.
2. bdrv_set_file_or_backing_noperm(), which does not finalize the
transaction yet.
3. bdrv_append(), where a drained section is introduced.
bdrv_set_file_or_backing_noperm() is called by:
1. bdrv_set_backing_hd_drained(), where a drained section is
introduced.
2. bdrv_reopen_parse_file_or_backing(), which does not finalize the
transaction yet. Draining the old child bs currently happens under
the graph lock there. This is replaced with an assertion, because
the drain will be moved further up to the caller.
bdrv_reopen_parse_file_or_backing() is called by:
1. bdrv_reopen_prepare(), which does not finalize the transaction yet.
bdrv_reopen_prepare() is called by:
1. bdrv_reopen_multiple(), which does finalize the transaction. It is
called after bdrv_reopen_queue(), which starts a drained section.
The drained section ends, when bdrv_reopen_queue_free() is called
at the end of bdrv_reopen_multiple().
This resolves all code paths.
The functions bdrv_set_backing_hd_drained(), bdrv_attach_child() and
bdrv_root_attach_child() run under the graph lock, so they are not
actually allowed to drain. This will be addressed in the following
commits.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
Changes in v2:
* Split up into smaller pieces, flesh out commit messages.
block.c | 27 +++++++++++----------------
1 file changed, 11 insertions(+), 16 deletions(-)
diff --git a/block.c b/block.c
index 5281ad1313..702a51ff4b 100644
--- a/block.c
+++ b/block.c
@@ -3028,10 +3028,8 @@ static void GRAPH_WRLOCK bdrv_attach_child_common_abort(void *opaque)
bdrv_replace_child_noperm(s->child, NULL);
if (bdrv_get_aio_context(bs) != s->old_child_ctx) {
- bdrv_drain_all_begin();
bdrv_try_change_aio_context_locked(bs, s->old_child_ctx, NULL,
&error_abort);
- bdrv_drain_all_end();
}
if (bdrv_child_get_parent_aio_context(s->child) != s->old_parent_ctx) {
@@ -3043,10 +3041,8 @@ static void GRAPH_WRLOCK bdrv_attach_child_common_abort(void *opaque)
/* No need to visit `child`, because it has been detached already */
visited = g_hash_table_new(NULL, NULL);
- bdrv_drain_all_begin();
ret = s->child->klass->change_aio_ctx(s->child, s->old_parent_ctx,
visited, tran, &error_abort);
- bdrv_drain_all_end();
g_hash_table_destroy(visited);
/* transaction is supposed to always succeed */
@@ -3118,10 +3114,8 @@ bdrv_attach_child_common(BlockDriverState *child_bs,
parent_ctx = bdrv_child_get_parent_aio_context(new_child);
if (child_ctx != parent_ctx) {
Error *local_err = NULL;
- bdrv_drain_all_begin();
int ret = bdrv_try_change_aio_context_locked(child_bs, parent_ctx, NULL,
&local_err);
- bdrv_drain_all_end();
if (ret < 0 && child_class->change_aio_ctx) {
Transaction *aio_ctx_tran = tran_new();
@@ -3129,11 +3123,9 @@ bdrv_attach_child_common(BlockDriverState *child_bs,
bool ret_child;
g_hash_table_add(visited, new_child);
- bdrv_drain_all_begin();
ret_child = child_class->change_aio_ctx(new_child, child_ctx,
visited, aio_ctx_tran,
NULL);
- bdrv_drain_all_end();
if (ret_child == true) {
error_free(local_err);
ret = 0;
@@ -3244,6 +3236,7 @@ BdrvChild *bdrv_root_attach_child(BlockDriverState *child_bs,
GLOBAL_STATE_CODE();
+ bdrv_drain_all_begin();
child = bdrv_attach_child_common(child_bs, child_name, child_class,
child_role, perm, shared_perm, opaque,
tran, errp);
@@ -3256,6 +3249,7 @@ BdrvChild *bdrv_root_attach_child(BlockDriverState *child_bs,
out:
tran_finalize(tran, ret);
+ bdrv_drain_all_end();
bdrv_schedule_unref(child_bs);
@@ -3283,6 +3277,7 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
GLOBAL_STATE_CODE();
+ bdrv_drain_all_begin();
child = bdrv_attach_child_noperm(parent_bs, child_bs, child_name,
child_class, child_role, tran, errp);
if (!child) {
@@ -3297,6 +3292,7 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
out:
tran_finalize(tran, ret);
+ bdrv_drain_all_end();
bdrv_schedule_unref(child_bs);
@@ -3573,6 +3569,7 @@ int bdrv_set_backing_hd_drained(BlockDriverState *bs,
assert(bs->backing->bs->quiesce_counter > 0);
}
+ bdrv_drain_all_begin();
ret = bdrv_set_file_or_backing_noperm(bs, backing_hd, true, tran, errp);
if (ret < 0) {
goto out;
@@ -3581,6 +3578,7 @@ int bdrv_set_backing_hd_drained(BlockDriverState *bs,
ret = bdrv_refresh_perms(bs, tran, errp);
out:
tran_finalize(tran, ret);
+ bdrv_drain_all_end();
return ret;
}
@@ -4721,6 +4719,8 @@ int bdrv_reopen_set_read_only(BlockDriverState *bs, bool read_only,
* Return 0 on success, otherwise return < 0 and set @errp.
*
* @reopen_state->bs can move to a different AioContext in this function.
+ *
+ * The old child bs must be drained.
*/
static int GRAPH_UNLOCKED
bdrv_reopen_parse_file_or_backing(BDRVReopenState *reopen_state,
@@ -4814,7 +4814,7 @@ bdrv_reopen_parse_file_or_backing(BDRVReopenState *reopen_state,
if (old_child_bs) {
bdrv_ref(old_child_bs);
- bdrv_drained_begin(old_child_bs);
+ assert(old_child_bs->quiesce_counter > 0);
}
bdrv_graph_rdunlock_main_loop();
@@ -4826,7 +4826,6 @@ bdrv_reopen_parse_file_or_backing(BDRVReopenState *reopen_state,
bdrv_graph_wrunlock();
if (old_child_bs) {
- bdrv_drained_end(old_child_bs);
bdrv_unref(old_child_bs);
}
@@ -5501,9 +5500,7 @@ int bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top,
assert(!bs_new->backing);
bdrv_graph_rdunlock_main_loop();
- bdrv_drained_begin(bs_top);
- bdrv_drained_begin(bs_new);
-
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
child = bdrv_attach_child_noperm(bs_new, bs_top, "backing",
@@ -5525,9 +5522,7 @@ out:
bdrv_refresh_limits(bs_top, NULL, NULL);
bdrv_graph_wrunlock();
-
- bdrv_drained_end(bs_top);
- bdrv_drained_end(bs_new);
+ bdrv_drain_all_end();
return ret;
}
--
2.39.5
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 11/24] block: move drain outside of bdrv_set_backing_hd_drained()
2025-05-20 10:29 [PATCH v2 00/24] block: do not drain while holding the graph lock Fiona Ebner
` (9 preceding siblings ...)
2025-05-20 10:29 ` [PATCH v2 10/24] block: move drain outside of bdrv_attach_child_common(_abort)() Fiona Ebner
@ 2025-05-20 10:29 ` Fiona Ebner
2025-05-20 10:30 ` [PATCH v2 12/24] block: move drain outside of bdrv_root_attach_child() Fiona Ebner
` (13 subsequent siblings)
24 siblings, 0 replies; 49+ messages in thread
From: Fiona Ebner @ 2025-05-20 10:29 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
This is part of resolving the deadlock mentioned in commit "block:
move draining out of bdrv_change_aio_context() and mark GRAPH_RDLOCK".
The function bdrv_set_backing_hd_drained() holds the graph lock, so it
is not allowed to drain. It is called by:
1. bdrv_set_backing_hd(), where a drained section is introduced,
replacing the previously present bs-specific drains.
2. stream_prepare(), where a drained section is introduced replacing
the previously present bs-specific drains.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
Changes in v2:
* Split up into smaller pieces, flesh out commit messages.
block.c | 6 ++----
block/stream.c | 6 ++----
2 files changed, 4 insertions(+), 8 deletions(-)
diff --git a/block.c b/block.c
index 702a51ff4b..f323588d15 100644
--- a/block.c
+++ b/block.c
@@ -3569,7 +3569,6 @@ int bdrv_set_backing_hd_drained(BlockDriverState *bs,
assert(bs->backing->bs->quiesce_counter > 0);
}
- bdrv_drain_all_begin();
ret = bdrv_set_file_or_backing_noperm(bs, backing_hd, true, tran, errp);
if (ret < 0) {
goto out;
@@ -3578,7 +3577,6 @@ int bdrv_set_backing_hd_drained(BlockDriverState *bs,
ret = bdrv_refresh_perms(bs, tran, errp);
out:
tran_finalize(tran, ret);
- bdrv_drain_all_end();
return ret;
}
@@ -3594,11 +3592,11 @@ int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
bdrv_graph_rdunlock_main_loop();
bdrv_ref(drain_bs);
- bdrv_drained_begin(drain_bs);
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
ret = bdrv_set_backing_hd_drained(bs, backing_hd, errp);
bdrv_graph_wrunlock();
- bdrv_drained_end(drain_bs);
+ bdrv_drain_all_end();
bdrv_unref(drain_bs);
return ret;
diff --git a/block/stream.c b/block/stream.c
index 999d9e56d4..6ba49cffd3 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -80,11 +80,10 @@ static int stream_prepare(Job *job)
* may end up working with the wrong base node (or it might even have gone
* away by the time we want to use it).
*/
- bdrv_drained_begin(unfiltered_bs);
if (unfiltered_bs_cow) {
bdrv_ref(unfiltered_bs_cow);
- bdrv_drained_begin(unfiltered_bs_cow);
}
+ bdrv_drain_all_begin();
bdrv_graph_rdlock_main_loop();
base = bdrv_filter_or_cow_bs(s->above_base);
@@ -123,11 +122,10 @@ static int stream_prepare(Job *job)
}
out:
+ bdrv_drain_all_end();
if (unfiltered_bs_cow) {
- bdrv_drained_end(unfiltered_bs_cow);
bdrv_unref(unfiltered_bs_cow);
}
- bdrv_drained_end(unfiltered_bs);
return ret;
}
--
2.39.5
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 12/24] block: move drain outside of bdrv_root_attach_child()
2025-05-20 10:29 [PATCH v2 00/24] block: do not drain while holding the graph lock Fiona Ebner
` (10 preceding siblings ...)
2025-05-20 10:29 ` [PATCH v2 11/24] block: move drain outside of bdrv_set_backing_hd_drained() Fiona Ebner
@ 2025-05-20 10:30 ` Fiona Ebner
2025-05-20 10:30 ` [PATCH v2 13/24] block: move drain outside of bdrv_attach_child() Fiona Ebner
` (12 subsequent siblings)
24 siblings, 0 replies; 49+ messages in thread
From: Fiona Ebner @ 2025-05-20 10:30 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
This is part of resolving the deadlock mentioned in commit "block:
move draining out of bdrv_change_aio_context() and mark GRAPH_RDLOCK".
The function bdrv_root_attach_child() runs under the graph lock, so it
is not allowed to drain. It is called by:
1. blk_insert_bs(), where a drained section is introduced.
2. block_job_add_bdrv(), which holds the graph lock itself.
block_job_add_bdrv() is called by:
1. mirror_start_job()
2. stream_start()
3. commit_start()
4. backup_job_create()
5. block_job_create()
6. In the test_blockjob_common_drain_node() unit test
In all callers, a drained section is introduced.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
Changes in v2:
* Split up into smaller pieces, flesh out commit messages.
block.c | 2 --
block/backup.c | 2 ++
block/block-backend.c | 2 ++
block/commit.c | 4 ++++
block/mirror.c | 5 +++++
block/stream.c | 4 ++++
blockjob.c | 4 ++++
tests/unit/test-bdrv-drain.c | 2 ++
8 files changed, 23 insertions(+), 2 deletions(-)
diff --git a/block.c b/block.c
index f323588d15..805e6955b3 100644
--- a/block.c
+++ b/block.c
@@ -3236,7 +3236,6 @@ BdrvChild *bdrv_root_attach_child(BlockDriverState *child_bs,
GLOBAL_STATE_CODE();
- bdrv_drain_all_begin();
child = bdrv_attach_child_common(child_bs, child_name, child_class,
child_role, perm, shared_perm, opaque,
tran, errp);
@@ -3249,7 +3248,6 @@ BdrvChild *bdrv_root_attach_child(BlockDriverState *child_bs,
out:
tran_finalize(tran, ret);
- bdrv_drain_all_end();
bdrv_schedule_unref(child_bs);
diff --git a/block/backup.c b/block/backup.c
index 0151e84395..909027c17a 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -498,10 +498,12 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
block_copy_set_speed(bcs, speed);
/* Required permissions are taken by copy-before-write filter target */
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
block_job_add_bdrv(&job->common, "target", target, 0, BLK_PERM_ALL,
&error_abort);
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
return &job->common;
diff --git a/block/block-backend.c b/block/block-backend.c
index 6a6949edeb..24cae3cb55 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -904,6 +904,7 @@ int blk_insert_bs(BlockBackend *blk, BlockDriverState *bs, Error **errp)
GLOBAL_STATE_CODE();
bdrv_ref(bs);
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
if ((bs->open_flags & BDRV_O_INACTIVE) && blk_can_inactivate(blk)) {
@@ -919,6 +920,7 @@ int blk_insert_bs(BlockBackend *blk, BlockDriverState *bs, Error **errp)
BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
perm, shared_perm, blk, errp);
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
if (blk->root == NULL) {
return -EPERM;
}
diff --git a/block/commit.c b/block/commit.c
index 7cc8c0f0df..6c4b736ff8 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -392,6 +392,7 @@ void commit_start(const char *job_id, BlockDriverState *bs,
* this is the responsibility of the interface (i.e. whoever calls
* commit_start()).
*/
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
s->base_overlay = bdrv_find_overlay(top, base);
assert(s->base_overlay);
@@ -424,18 +425,21 @@ void commit_start(const char *job_id, BlockDriverState *bs,
iter_shared_perms, errp);
if (ret < 0) {
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
goto fail;
}
}
if (bdrv_freeze_backing_chain(commit_top_bs, base, errp) < 0) {
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
goto fail;
}
s->chain_frozen = true;
ret = block_job_add_bdrv(&s->common, "base", base, 0, BLK_PERM_ALL, errp);
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
if (ret < 0) {
goto fail;
diff --git a/block/mirror.c b/block/mirror.c
index c2c5099c95..6e8caf4b49 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -2014,6 +2014,7 @@ static BlockJob *mirror_start_job(
*/
bdrv_disable_dirty_bitmap(s->dirty_bitmap);
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
ret = block_job_add_bdrv(&s->common, "source", bs, 0,
BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE |
@@ -2021,6 +2022,7 @@ static BlockJob *mirror_start_job(
errp);
if (ret < 0) {
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
goto fail;
}
@@ -2066,16 +2068,19 @@ static BlockJob *mirror_start_job(
iter_shared_perms, errp);
if (ret < 0) {
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
goto fail;
}
}
if (bdrv_freeze_backing_chain(mirror_top_bs, target, errp) < 0) {
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
goto fail;
}
}
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
QTAILQ_INIT(&s->ops_in_flight);
diff --git a/block/stream.c b/block/stream.c
index 6ba49cffd3..f5441f27f4 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -371,10 +371,12 @@ void stream_start(const char *job_id, BlockDriverState *bs,
* already have our own plans. Also don't allow resize as the image size is
* queried only at the job start and then cached.
*/
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
if (block_job_add_bdrv(&s->common, "active node", bs, 0,
basic_flags | BLK_PERM_WRITE, errp)) {
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
goto fail;
}
@@ -395,10 +397,12 @@ void stream_start(const char *job_id, BlockDriverState *bs,
basic_flags, errp);
if (ret < 0) {
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
goto fail;
}
}
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
s->base_overlay = base_overlay;
s->above_base = above_base;
diff --git a/blockjob.c b/blockjob.c
index 34185d7715..44991e3ff7 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -496,6 +496,7 @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
int ret;
GLOBAL_STATE_CODE();
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
if (job_id == NULL && !(flags & JOB_INTERNAL)) {
@@ -506,6 +507,7 @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
flags, cb, opaque, errp);
if (job == NULL) {
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
return NULL;
}
@@ -544,10 +546,12 @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
}
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
return job;
fail:
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
job_early_fail(&job->job);
return NULL;
}
diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
index 3185f3f429..4f3057844b 100644
--- a/tests/unit/test-bdrv-drain.c
+++ b/tests/unit/test-bdrv-drain.c
@@ -772,9 +772,11 @@ static void test_blockjob_common_drain_node(enum drain_type drain_type,
tjob->bs = src;
job = &tjob->common;
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
block_job_add_bdrv(job, "target", target, 0, BLK_PERM_ALL, &error_abort);
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
switch (result) {
case TEST_JOB_SUCCESS:
--
2.39.5
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 13/24] block: move drain outside of bdrv_attach_child()
2025-05-20 10:29 [PATCH v2 00/24] block: do not drain while holding the graph lock Fiona Ebner
` (11 preceding siblings ...)
2025-05-20 10:30 ` [PATCH v2 12/24] block: move drain outside of bdrv_root_attach_child() Fiona Ebner
@ 2025-05-20 10:30 ` Fiona Ebner
2025-05-20 10:30 ` [PATCH v2 14/24] block: move drain outside of quorum_add_child() Fiona Ebner
` (11 subsequent siblings)
24 siblings, 0 replies; 49+ messages in thread
From: Fiona Ebner @ 2025-05-20 10:30 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
This is part of resolving the deadlock mentioned in commit "block:
move draining out of bdrv_change_aio_context() and mark GRAPH_RDLOCK".
The function bdrv_attach_child() runs under the graph lock, so it is
not allowed to drain. It is called by:
1. replication_start()
2. quorum_add_child()
3. bdrv_open_child_common()
4. Several places throughout the test-bdrv-graph-mod.c and
test-bdrv-drain.c unit tests.
In all callers, a drained section is introduced.
The function quorum_add_child() runs under the graph lock, so it is
not actually allowed to drain. This will be addressed by the following
commit.
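The resulting caller-side pattern is sketched below (illustrative only,
with hypothetical arguments; the actual hunks below are authoritative):

    bdrv_drain_all_begin();   /* drain first, without holding the graph lock */
    bdrv_graph_wrlock();
    /* graph change; bdrv_attach_child() itself must no longer drain */
    child = bdrv_attach_child(parent_bs, child_bs, "child", &child_of_bds,
                              BDRV_CHILD_DATA, errp);
    bdrv_graph_wrunlock();
    bdrv_drain_all_end();     /* end the drained section after unlocking */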
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
Changes in v2:
* Split up into smaller pieces, flesh out commit messages.
block.c | 4 ++--
block/quorum.c | 2 ++
block/replication.c | 5 +++++
tests/unit/test-bdrv-drain.c | 14 ++++++++++++++
tests/unit/test-bdrv-graph-mod.c | 10 ++++++++++
5 files changed, 33 insertions(+), 2 deletions(-)
diff --git a/block.c b/block.c
index 805e6955b3..b9de3a25cf 100644
--- a/block.c
+++ b/block.c
@@ -3275,7 +3275,6 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
GLOBAL_STATE_CODE();
- bdrv_drain_all_begin();
child = bdrv_attach_child_noperm(parent_bs, child_bs, child_name,
child_class, child_role, tran, errp);
if (!child) {
@@ -3290,7 +3289,6 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
out:
tran_finalize(tran, ret);
- bdrv_drain_all_end();
bdrv_schedule_unref(child_bs);
@@ -3786,10 +3784,12 @@ static BdrvChild *bdrv_open_child_common(const char *filename,
return NULL;
}
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
child = bdrv_attach_child(parent, bs, bdref_key, child_class, child_role,
errp);
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
return child;
}
diff --git a/block/quorum.c b/block/quorum.c
index ed8ce801ee..ea17b0ec13 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -1096,8 +1096,10 @@ quorum_add_child(BlockDriverState *bs, BlockDriverState *child_bs, Error **errp)
/* We can safely add the child now */
bdrv_ref(child_bs);
+ bdrv_drain_all_begin();
child = bdrv_attach_child(bs, child_bs, indexstr, &child_of_bds,
BDRV_CHILD_DATA, errp);
+ bdrv_drain_all_end();
if (child == NULL) {
s->next_child_index--;
return;
diff --git a/block/replication.c b/block/replication.c
index 07f274de9e..54cbd03e00 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -540,6 +540,7 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,
return;
}
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
bdrv_ref(hidden_disk->bs);
@@ -549,6 +550,7 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,
if (local_err) {
error_propagate(errp, local_err);
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
return;
}
@@ -559,6 +561,7 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,
if (local_err) {
error_propagate(errp, local_err);
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
return;
}
@@ -571,12 +574,14 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,
!check_top_bs(top_bs, bs)) {
error_setg(errp, "No top_bs or it is invalid");
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
reopen_backing_file(bs, false, NULL);
return;
}
bdrv_op_block_all(top_bs, s->blocker);
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
s->backup_job = backup_job_create(
NULL, s->secondary_disk->bs, s->hidden_disk->bs,
diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
index 4f3057844b..ac76525e5a 100644
--- a/tests/unit/test-bdrv-drain.c
+++ b/tests/unit/test-bdrv-drain.c
@@ -1049,10 +1049,12 @@ static void do_test_delete_by_drain(bool detach_instead_of_delete,
null_bs = bdrv_open("null-co://", NULL, NULL, BDRV_O_RDWR | BDRV_O_PROTOCOL,
&error_abort);
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
bdrv_attach_child(bs, null_bs, "null-child", &child_of_bds,
BDRV_CHILD_DATA, &error_abort);
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
/* This child will be the one to pass to requests through to, and
* it will stall until a drain occurs */
@@ -1060,21 +1062,25 @@ static void do_test_delete_by_drain(bool detach_instead_of_delete,
&error_abort);
child_bs->total_sectors = 65536 >> BDRV_SECTOR_BITS;
/* Takes our reference to child_bs */
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
tts->wait_child = bdrv_attach_child(bs, child_bs, "wait-child",
&child_of_bds,
BDRV_CHILD_DATA | BDRV_CHILD_PRIMARY,
&error_abort);
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
/* This child is just there to be deleted
* (for detach_instead_of_delete == true) */
null_bs = bdrv_open("null-co://", NULL, NULL, BDRV_O_RDWR | BDRV_O_PROTOCOL,
&error_abort);
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
bdrv_attach_child(bs, null_bs, "null-child", &child_of_bds, BDRV_CHILD_DATA,
&error_abort);
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
blk = blk_new(qemu_get_aio_context(), BLK_PERM_ALL, BLK_PERM_ALL);
blk_insert_bs(blk, bs, &error_abort);
@@ -1157,6 +1163,7 @@ static void no_coroutine_fn detach_indirect_bh(void *opaque)
bdrv_dec_in_flight(data->child_b->bs);
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
bdrv_unref_child(data->parent_b, data->child_b);
@@ -1165,6 +1172,7 @@ static void no_coroutine_fn detach_indirect_bh(void *opaque)
&child_of_bds, BDRV_CHILD_DATA,
&error_abort);
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
}
static void coroutine_mixed_fn detach_by_parent_aio_cb(void *opaque, int ret)
@@ -1262,6 +1270,7 @@ static void TSA_NO_TSA test_detach_indirect(bool by_parent_cb)
/* Set child relationships */
bdrv_ref(b);
bdrv_ref(a);
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
child_b = bdrv_attach_child(parent_b, b, "PB-B", &child_of_bds,
BDRV_CHILD_DATA, &error_abort);
@@ -1273,6 +1282,7 @@ static void TSA_NO_TSA test_detach_indirect(bool by_parent_cb)
by_parent_cb ? &child_of_bds : &detach_by_driver_cb_class,
BDRV_CHILD_DATA, &error_abort);
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
g_assert_cmpint(parent_a->refcnt, ==, 1);
g_assert_cmpint(parent_b->refcnt, ==, 1);
@@ -1685,6 +1695,7 @@ static void test_drop_intermediate_poll(void)
* Establish the chain last, so the chain links are the first
* elements in the BDS.parents lists
*/
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
for (i = 0; i < 3; i++) {
if (i) {
@@ -1694,6 +1705,7 @@ static void test_drop_intermediate_poll(void)
}
}
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
job = block_job_create("job", &test_simple_job_driver, NULL, job_node,
0, BLK_PERM_ALL, 0, 0, NULL, NULL, &error_abort);
@@ -1940,10 +1952,12 @@ static void do_test_replace_child_mid_drain(int old_drain_count,
new_child_bs->total_sectors = 1;
bdrv_ref(old_child_bs);
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
bdrv_attach_child(parent_bs, old_child_bs, "child", &child_of_bds,
BDRV_CHILD_COW, &error_abort);
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
parent_s->setup_completed = true;
for (i = 0; i < old_drain_count; i++) {
diff --git a/tests/unit/test-bdrv-graph-mod.c b/tests/unit/test-bdrv-graph-mod.c
index d743abb4bb..7b03ebe4b0 100644
--- a/tests/unit/test-bdrv-graph-mod.c
+++ b/tests/unit/test-bdrv-graph-mod.c
@@ -137,10 +137,12 @@ static void test_update_perm_tree(void)
blk_insert_bs(root, bs, &error_abort);
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
bdrv_attach_child(filter, bs, "child", &child_of_bds,
BDRV_CHILD_DATA, &error_abort);
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
ret = bdrv_append(filter, bs, NULL);
g_assert_cmpint(ret, <, 0);
@@ -204,11 +206,13 @@ static void test_should_update_child(void)
bdrv_set_backing_hd(target, bs, &error_abort);
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
g_assert(target->backing->bs == bs);
bdrv_attach_child(filter, target, "target", &child_of_bds,
BDRV_CHILD_DATA, &error_abort);
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
bdrv_append(filter, bs, &error_abort);
bdrv_graph_rdlock_main_loop();
@@ -244,6 +248,7 @@ static void test_parallel_exclusive_write(void)
bdrv_ref(base);
bdrv_ref(fl1);
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
bdrv_attach_child(top, fl1, "backing", &child_of_bds,
BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
@@ -257,6 +262,7 @@ static void test_parallel_exclusive_write(void)
bdrv_replace_node(fl1, fl2, &error_abort);
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
bdrv_drained_end(fl2);
bdrv_drained_end(fl1);
@@ -363,6 +369,7 @@ static void test_parallel_perm_update(void)
*/
bdrv_ref(base);
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
bdrv_attach_child(top, ws, "file", &child_of_bds, BDRV_CHILD_DATA,
&error_abort);
@@ -377,6 +384,7 @@ static void test_parallel_perm_update(void)
BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
&error_abort);
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
/* Select fl1 as first child to be active */
s->selected = c_fl1;
@@ -430,11 +438,13 @@ static void test_append_greedy_filter(void)
BlockDriverState *base = no_perm_node("base");
BlockDriverState *fl = exclusive_writer_node("fl1");
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
bdrv_attach_child(top, base, "backing", &child_of_bds,
BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
&error_abort);
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
bdrv_append(fl, base, &error_abort);
bdrv_unref(fl);
--
2.39.5
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 14/24] block: move drain outside of quorum_add_child()
2025-05-20 10:29 [PATCH v2 00/24] block: do not drain while holding the graph lock Fiona Ebner
` (12 preceding siblings ...)
2025-05-20 10:30 ` [PATCH v2 13/24] block: move drain outside of bdrv_attach_child() Fiona Ebner
@ 2025-05-20 10:30 ` Fiona Ebner
2025-05-20 10:30 ` [PATCH v2 15/24] block: move drain outside of bdrv_root_unref_child() Fiona Ebner
` (10 subsequent siblings)
24 siblings, 0 replies; 49+ messages in thread
From: Fiona Ebner @ 2025-05-20 10:30 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
This is part of resolving the deadlock mentioned in commit "block:
move draining out of bdrv_change_aio_context() and mark GRAPH_RDLOCK".
The quorum_add_child() callback runs under the graph lock, so it is
not allowed to drain. It is only called as the .bdrv_add_child()
callback, which is only called in the bdrv_add_child() function, which
also runs under the graph lock.
The bdrv_add_child() function is called by qmp_x_blockdev_change(),
where a drained section is introduced.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
Changes in v2:
* Split up into smaller pieces, flesh out commit messages.
block/quorum.c | 2 --
blockdev.c | 2 ++
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/block/quorum.c b/block/quorum.c
index ea17b0ec13..ed8ce801ee 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -1096,10 +1096,8 @@ quorum_add_child(BlockDriverState *bs, BlockDriverState *child_bs, Error **errp)
/* We can safely add the child now */
bdrv_ref(child_bs);
- bdrv_drain_all_begin();
child = bdrv_attach_child(bs, child_bs, indexstr, &child_of_bds,
BDRV_CHILD_DATA, errp);
- bdrv_drain_all_end();
if (child == NULL) {
s->next_child_index--;
return;
diff --git a/blockdev.c b/blockdev.c
index 750beba41f..bd5ca77619 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3531,6 +3531,7 @@ void qmp_x_blockdev_change(const char *parent, const char *child,
BlockDriverState *parent_bs, *new_bs = NULL;
BdrvChild *p_child;
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
parent_bs = bdrv_lookup_bs(parent, parent, errp);
@@ -3568,6 +3569,7 @@ void qmp_x_blockdev_change(const char *parent, const char *child,
out:
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
}
BlockJobInfoList *qmp_query_block_jobs(Error **errp)
--
2.39.5
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 15/24] block: move drain outside of bdrv_root_unref_child()
2025-05-20 10:29 [PATCH v2 00/24] block: do not drain while holding the graph lock Fiona Ebner
` (13 preceding siblings ...)
2025-05-20 10:30 ` [PATCH v2 14/24] block: move drain outside of quorum_add_child() Fiona Ebner
@ 2025-05-20 10:30 ` Fiona Ebner
2025-05-20 10:30 ` [PATCH v2 16/24] block: move drain outside of quorum_del_child() Fiona Ebner
` (9 subsequent siblings)
24 siblings, 0 replies; 49+ messages in thread
From: Fiona Ebner @ 2025-05-20 10:30 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
This is part of resolving the deadlock mentioned in commit "block:
move draining out of bdrv_change_aio_context() and mark GRAPH_RDLOCK".
bdrv_root_unref_child() is called by:
1. blk_remove_bs(), where a drained section is introduced.
2. bdrv_unref_child(), which runs under the graph lock, so the drain
will be moved further up to its callers.
3. block_job_remove_all_bdrv(), where a drained section is introduced.
For all callers of bdrv_unref_child(), a drained section is introduced;
they are not explicitly listed here. The caller quorum_del_child()
holds the graph lock, so it is not actually allowed to drain. This
will be addressed in the next commit.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
Changes in v2:
* Split up into smaller pieces, flesh out commit messages.
block.c | 6 ++++--
block/blklogwrites.c | 4 ++++
block/blkverify.c | 2 ++
block/block-backend.c | 2 ++
block/qcow2.c | 2 ++
block/quorum.c | 6 ++++++
block/replication.c | 2 ++
block/snapshot.c | 2 ++
block/vmdk.c | 10 ++++++++++
blockjob.c | 2 ++
tests/unit/test-bdrv-drain.c | 2 ++
11 files changed, 38 insertions(+), 2 deletions(-)
diff --git a/block.c b/block.c
index b9de3a25cf..dfe665e457 100644
--- a/block.c
+++ b/block.c
@@ -1721,12 +1721,14 @@ bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv, const char *node_name,
open_failed:
bs->drv = NULL;
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
if (bs->file != NULL) {
bdrv_unref_child(bs, bs->file);
assert(!bs->file);
}
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
g_free(bs->opaque);
bs->opaque = NULL;
@@ -3316,10 +3318,8 @@ void bdrv_root_unref_child(BdrvChild *child)
* When the parent requiring a non-default AioContext is removed, the
* node moves back to the main AioContext
*/
- bdrv_drain_all_begin();
bdrv_try_change_aio_context_locked(child_bs, qemu_get_aio_context(),
NULL, NULL);
- bdrv_drain_all_end();
}
bdrv_schedule_unref(child_bs);
@@ -5163,6 +5163,7 @@ static void bdrv_close(BlockDriverState *bs)
bs->drv = NULL;
}
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
QLIST_FOREACH_SAFE(child, &bs->children, next, next) {
bdrv_unref_child(bs, child);
@@ -5171,6 +5172,7 @@ static void bdrv_close(BlockDriverState *bs)
assert(!bs->backing);
assert(!bs->file);
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
g_free(bs->opaque);
bs->opaque = NULL;
diff --git a/block/blklogwrites.c b/block/blklogwrites.c
index b0f78c4bc7..70ac76f401 100644
--- a/block/blklogwrites.c
+++ b/block/blklogwrites.c
@@ -281,9 +281,11 @@ static int blk_log_writes_open(BlockDriverState *bs, QDict *options, int flags,
ret = 0;
fail_log:
if (ret < 0) {
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
bdrv_unref_child(bs, s->log_file);
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
s->log_file = NULL;
qemu_mutex_destroy(&s->mutex);
}
@@ -296,10 +298,12 @@ static void blk_log_writes_close(BlockDriverState *bs)
{
BDRVBlkLogWritesState *s = bs->opaque;
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
bdrv_unref_child(bs, s->log_file);
s->log_file = NULL;
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
qemu_mutex_destroy(&s->mutex);
}
diff --git a/block/blkverify.c b/block/blkverify.c
index db79a36681..3a71f7498c 100644
--- a/block/blkverify.c
+++ b/block/blkverify.c
@@ -151,10 +151,12 @@ static void blkverify_close(BlockDriverState *bs)
{
BDRVBlkverifyState *s = bs->opaque;
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
bdrv_unref_child(bs, s->test_file);
s->test_file = NULL;
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
}
static int64_t coroutine_fn GRAPH_RDLOCK
diff --git a/block/block-backend.c b/block/block-backend.c
index 24cae3cb55..68209bb2f7 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -889,9 +889,11 @@ void blk_remove_bs(BlockBackend *blk)
root = blk->root;
blk->root = NULL;
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
bdrv_root_unref_child(root);
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
}
/*
diff --git a/block/qcow2.c b/block/qcow2.c
index 66fba89b41..420918b3c3 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2821,9 +2821,11 @@ qcow2_do_close(BlockDriverState *bs, bool close_data_file)
if (close_data_file && has_data_file(bs)) {
GLOBAL_STATE_CODE();
bdrv_graph_rdunlock_main_loop();
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
bdrv_unref_child(bs, s->data_file);
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
s->data_file = NULL;
bdrv_graph_rdlock_main_loop();
}
diff --git a/block/quorum.c b/block/quorum.c
index ed8ce801ee..81407a38ee 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -1037,6 +1037,7 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
close_exit:
/* cleanup on error */
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
for (i = 0; i < s->num_children; i++) {
if (!opened[i]) {
@@ -1045,6 +1046,7 @@ close_exit:
bdrv_unref_child(bs, s->children[i]);
}
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
g_free(s->children);
g_free(opened);
exit:
@@ -1057,11 +1059,13 @@ static void quorum_close(BlockDriverState *bs)
BDRVQuorumState *s = bs->opaque;
int i;
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
for (i = 0; i < s->num_children; i++) {
bdrv_unref_child(bs, s->children[i]);
}
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
g_free(s->children);
}
@@ -1143,7 +1147,9 @@ quorum_del_child(BlockDriverState *bs, BdrvChild *child, Error **errp)
(s->num_children - i - 1) * sizeof(BdrvChild *));
s->children = g_renew(BdrvChild *, s->children, --s->num_children);
+ bdrv_drain_all_begin();
bdrv_unref_child(bs, child);
+ bdrv_drain_all_end();
quorum_refresh_flags(bs);
}
diff --git a/block/replication.c b/block/replication.c
index 54cbd03e00..0879718854 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -656,12 +656,14 @@ static void replication_done(void *opaque, int ret)
if (ret == 0) {
s->stage = BLOCK_REPLICATION_DONE;
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
bdrv_unref_child(bs, s->secondary_disk);
s->secondary_disk = NULL;
bdrv_unref_child(bs, s->hidden_disk);
s->hidden_disk = NULL;
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
s->error = 0;
} else {
diff --git a/block/snapshot.c b/block/snapshot.c
index 9f300a78bd..28c9c43621 100644
--- a/block/snapshot.c
+++ b/block/snapshot.c
@@ -291,9 +291,11 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
}
/* .bdrv_open() will re-attach it */
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
bdrv_unref_child(bs, fallback);
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
ret = bdrv_snapshot_goto(fallback_bs, snapshot_id, errp);
memset(bs->opaque, 0, drv->instance_size);
diff --git a/block/vmdk.c b/block/vmdk.c
index 9c7ab037e1..89a7250120 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c
@@ -271,6 +271,7 @@ static void vmdk_free_extents(BlockDriverState *bs)
BDRVVmdkState *s = bs->opaque;
VmdkExtent *e;
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
for (i = 0; i < s->num_extents; i++) {
e = &s->extents[i];
@@ -283,6 +284,7 @@ static void vmdk_free_extents(BlockDriverState *bs)
}
}
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
g_free(s->extents);
}
@@ -1247,9 +1249,11 @@ vmdk_parse_extents(const char *desc, BlockDriverState *bs, QDict *options,
0, 0, 0, 0, 0, &extent, errp);
if (ret < 0) {
bdrv_graph_rdunlock_main_loop();
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
bdrv_unref_child(bs, extent_file);
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
bdrv_graph_rdlock_main_loop();
goto out;
}
@@ -1266,9 +1270,11 @@ vmdk_parse_extents(const char *desc, BlockDriverState *bs, QDict *options,
g_free(buf);
if (ret) {
bdrv_graph_rdunlock_main_loop();
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
bdrv_unref_child(bs, extent_file);
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
bdrv_graph_rdlock_main_loop();
goto out;
}
@@ -1277,9 +1283,11 @@ vmdk_parse_extents(const char *desc, BlockDriverState *bs, QDict *options,
ret = vmdk_open_se_sparse(bs, extent_file, bs->open_flags, errp);
if (ret) {
bdrv_graph_rdunlock_main_loop();
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
bdrv_unref_child(bs, extent_file);
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
bdrv_graph_rdlock_main_loop();
goto out;
}
@@ -1287,9 +1295,11 @@ vmdk_parse_extents(const char *desc, BlockDriverState *bs, QDict *options,
} else {
error_setg(errp, "Unsupported extent type '%s'", type);
bdrv_graph_rdunlock_main_loop();
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
bdrv_unref_child(bs, extent_file);
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
bdrv_graph_rdlock_main_loop();
ret = -ENOTSUP;
goto out;
diff --git a/blockjob.c b/blockjob.c
index 44991e3ff7..e68181a35b 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -198,6 +198,7 @@ void block_job_remove_all_bdrv(BlockJob *job)
* one to make sure that such a concurrent access does not attempt
* to process an already freed BdrvChild.
*/
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
while (job->nodes) {
GSList *l = job->nodes;
@@ -211,6 +212,7 @@ void block_job_remove_all_bdrv(BlockJob *job)
g_slist_free_1(l);
}
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
}
bool block_job_has_bdrv(BlockJob *job, BlockDriverState *bs)
diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
index ac76525e5a..c33f7d31c2 100644
--- a/tests/unit/test-bdrv-drain.c
+++ b/tests/unit/test-bdrv-drain.c
@@ -955,11 +955,13 @@ static void bdrv_test_top_close(BlockDriverState *bs)
{
BdrvChild *c, *next_c;
+ bdrv_drain_all_begin();
bdrv_graph_wrlock();
QLIST_FOREACH_SAFE(c, &bs->children, next, next_c) {
bdrv_unref_child(bs, c);
}
bdrv_graph_wrunlock();
+ bdrv_drain_all_end();
}
static int coroutine_fn GRAPH_RDLOCK
--
2.39.5
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 16/24] block: move drain outside of quorum_del_child()
2025-05-20 10:29 [PATCH v2 00/24] block: do not drain while holding the graph lock Fiona Ebner
` (14 preceding siblings ...)
2025-05-20 10:30 ` [PATCH v2 15/24] block: move drain outside of bdrv_root_unref_child() Fiona Ebner
@ 2025-05-20 10:30 ` Fiona Ebner
2025-05-20 10:30 ` [PATCH v2 17/24] blockdev: drain while unlocked in internal_snapshot_action() Fiona Ebner
` (8 subsequent siblings)
24 siblings, 0 replies; 49+ messages in thread
From: Fiona Ebner @ 2025-05-20 10:30 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
The quorum_del_child() callback runs under the graph lock, so it is
not allowed to drain. It is only called as the .bdrv_del_child()
callback, which is only called in the bdrv_del_child() function, which
also runs under the graph lock.
The bdrv_del_child() function is called by qmp_x_blockdev_change().
A drained section was already introduced there by commit "block: move
drain outside of quorum_add_child()".
This finishes the work started in commit "block: move draining out of
bdrv_change_aio_context() and mark GRAPH_RDLOCK" of moving drains out
to places that do not hold the graph lock.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
Changes in v2:
* Split up into smaller pieces, flesh out commit messages.
block/quorum.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/block/quorum.c b/block/quorum.c
index 81407a38ee..cc3bc5f4e7 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -1147,9 +1147,7 @@ quorum_del_child(BlockDriverState *bs, BdrvChild *child, Error **errp)
(s->num_children - i - 1) * sizeof(BdrvChild *));
s->children = g_renew(BdrvChild *, s->children, --s->num_children);
- bdrv_drain_all_begin();
bdrv_unref_child(bs, child);
- bdrv_drain_all_end();
quorum_refresh_flags(bs);
}
--
2.39.5
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 17/24] blockdev: drain while unlocked in internal_snapshot_action()
2025-05-20 10:29 [PATCH v2 00/24] block: do not drain while holding the graph lock Fiona Ebner
` (15 preceding siblings ...)
2025-05-20 10:30 ` [PATCH v2 16/24] block: move drain outside of quorum_del_child() Fiona Ebner
@ 2025-05-20 10:30 ` Fiona Ebner
2025-05-20 10:30 ` [PATCH v2 18/24] blockdev: drain while unlocked in external_snapshot_action() Fiona Ebner
` (7 subsequent siblings)
24 siblings, 0 replies; 49+ messages in thread
From: Fiona Ebner @ 2025-05-20 10:30 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
This is in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED.
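The pattern used here, sketched and simplified from the hunk below:
drop the read lock, drain, re-acquire the lock, then look the node up
again, since the graph may have changed while unlocked:

    bdrv_graph_rdunlock_main_loop();   /* must not drain while read-locked */
    bdrv_drained_begin(bs);            /* paired with .clean() */
    GRAPH_RDLOCK_GUARD_MAINLOOP();
    check_bs = qmp_get_root_bs(device, errp);
    if (bs != check_bs) {
        /* node changed (or disappeared) during the drain: error out */
    }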
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
Changes in v2:
* Check that associated bs did not change with the drain.
blockdev.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/blockdev.c b/blockdev.c
index bd5ca77619..506755bef1 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1208,7 +1208,7 @@ static void internal_snapshot_action(BlockdevSnapshotInternal *internal,
Error *local_err = NULL;
const char *device;
const char *name;
- BlockDriverState *bs;
+ BlockDriverState *bs, *check_bs;
QEMUSnapshotInfo old_sn, *sn;
bool ret;
int64_t rt;
@@ -1216,7 +1216,7 @@ static void internal_snapshot_action(BlockdevSnapshotInternal *internal,
int ret1;
GLOBAL_STATE_CODE();
- GRAPH_RDLOCK_GUARD_MAINLOOP();
+ bdrv_graph_rdlock_main_loop();
tran_add(tran, &internal_snapshot_drv, state);
@@ -1225,14 +1225,29 @@ static void internal_snapshot_action(BlockdevSnapshotInternal *internal,
bs = qmp_get_root_bs(device, errp);
if (!bs) {
+ bdrv_graph_rdunlock_main_loop();
return;
}
state->bs = bs;
+ /* Need to drain while unlocked. */
+ bdrv_graph_rdunlock_main_loop();
/* Paired with .clean() */
bdrv_drained_begin(bs);
+ GRAPH_RDLOCK_GUARD_MAINLOOP();
+
+ /* Make sure the root bs did not change with the drain. */
+ check_bs = qmp_get_root_bs(device, errp);
+ if (bs != check_bs) {
+ if (check_bs) {
+ error_setg(errp, "Block node of device '%s' unexpectedly changed",
+ device);
+ } /* else errp is already set */
+ return;
+ }
+
if (bdrv_op_is_blocked(bs, BLOCK_OP_TYPE_INTERNAL_SNAPSHOT, errp)) {
return;
}
--
2.39.5
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 18/24] blockdev: drain while unlocked in external_snapshot_action()
2025-05-20 10:29 [PATCH v2 00/24] block: do not drain while holding the graph lock Fiona Ebner
` (16 preceding siblings ...)
2025-05-20 10:30 ` [PATCH v2 17/24] blockdev: drain while unlocked in internal_snapshot_action() Fiona Ebner
@ 2025-05-20 10:30 ` Fiona Ebner
2025-05-20 10:30 ` [PATCH v2 19/24] block: mark bdrv_drained_begin() as GRAPH_UNLOCKED Fiona Ebner
` (6 subsequent siblings)
24 siblings, 0 replies; 49+ messages in thread
From: Fiona Ebner @ 2025-05-20 10:30 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
This is in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
Changes in v2:
* Check that associated bs did not change with the drain.
blockdev.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/blockdev.c b/blockdev.c
index 506755bef1..2e7fda6780 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1377,9 +1377,10 @@ static void external_snapshot_action(TransactionAction *action,
const char *new_image_file;
ExternalSnapshotState *state = g_new0(ExternalSnapshotState, 1);
uint64_t perm, shared;
+ BlockDriverState *check_bs;
/* TODO We'll eventually have to take a writer lock in this function */
- GRAPH_RDLOCK_GUARD_MAINLOOP();
+ bdrv_graph_rdlock_main_loop();
tran_add(tran, &external_snapshot_drv, state);
@@ -1412,11 +1413,25 @@ static void external_snapshot_action(TransactionAction *action,
state->old_bs = bdrv_lookup_bs(device, node_name, errp);
if (!state->old_bs) {
+ bdrv_graph_rdunlock_main_loop();
return;
}
+ /* Need to drain while unlocked. */
+ bdrv_graph_rdunlock_main_loop();
/* Paired with .clean() */
bdrv_drained_begin(state->old_bs);
+ GRAPH_RDLOCK_GUARD_MAINLOOP();
+
+ /* Make sure the associated bs did not change with the drain. */
+ check_bs = bdrv_lookup_bs(device, node_name, errp);
+ if (state->old_bs != check_bs) {
+ if (check_bs) {
+ error_setg(errp, "Block node of device '%s' unexpectedly changed",
+ device);
+ } /* else errp is already set */
+ return;
+ }
if (!bdrv_is_inserted(state->old_bs)) {
error_setg(errp, "Device '%s' has no medium",
--
2.39.5
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 19/24] block: mark bdrv_drained_begin() as GRAPH_UNLOCKED
2025-05-20 10:29 [PATCH v2 00/24] block: do not drain while holding the graph lock Fiona Ebner
` (17 preceding siblings ...)
2025-05-20 10:30 ` [PATCH v2 18/24] blockdev: drain while unlocked in external_snapshot_action() Fiona Ebner
@ 2025-05-20 10:30 ` Fiona Ebner
2025-05-23 17:20 ` Andrey Drobyshev
2025-05-20 10:30 ` [PATCH v2 20/24] iotests/graph-changes-while-io: remove image file after test Fiona Ebner
` (5 subsequent siblings)
24 siblings, 1 reply; 49+ messages in thread
From: Fiona Ebner @ 2025-05-20 10:30 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
bdrv_drained_begin() polls and is not allowed to be called with the
block graph lock held. Mark the function as such.
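With the annotation in place, a caller like the following (hypothetical)
can be flagged when building with clang's thread safety analysis:

    bdrv_graph_rdlock_main_loop();
    bdrv_drained_begin(bs);   /* polls while locked; now diagnosed */
    bdrv_graph_rdunlock_main_loop();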
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
No changes in v2, but ordered differently (in particular, it avoids
the need for patch 04/11 from last time).
include/block/block-io.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/block/block-io.h b/include/block/block-io.h
index b99cc98d26..4cf83fb367 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -431,7 +431,7 @@ bdrv_drain_poll(BlockDriverState *bs, BdrvChild *ignore_parent,
*
* This function can be recursive.
*/
-void bdrv_drained_begin(BlockDriverState *bs);
+void GRAPH_UNLOCKED bdrv_drained_begin(BlockDriverState *bs);
/**
* bdrv_do_drained_begin_quiesce:
--
2.39.5
^ permalink raw reply related [flat|nested] 49+ messages in thread
* Re: [PATCH v2 19/24] block: mark bdrv_drained_begin() as GRAPH_UNLOCKED
2025-05-20 10:30 ` [PATCH v2 19/24] block: mark bdrv_drained_begin() as GRAPH_UNLOCKED Fiona Ebner
@ 2025-05-23 17:20 ` Andrey Drobyshev
0 siblings, 0 replies; 49+ messages in thread
From: Andrey Drobyshev @ 2025-05-23 17:20 UTC (permalink / raw)
To: Fiona Ebner, qemu-block
Cc: qemu-devel, kwolf, den, hreitz, stefanha, eblake, jsnow,
vsementsov, xiechanglong.d, wencongyang2, berto, fam, ari
On 5/20/25 1:30 PM, Fiona Ebner wrote:
> bdrv_drained_begin() polls and is not allowed to be called with the
> block graph lock held. Mark the function as such.
>
> Suggested-by: Kevin Wolf <kwolf@redhat.com>
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
>
> No changes in v2, but ordered differently (in particular, it avoids
> the need for patch 04/11 from last time).
>
> include/block/block-io.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/block/block-io.h b/include/block/block-io.h
> index b99cc98d26..4cf83fb367 100644
> --- a/include/block/block-io.h
> +++ b/include/block/block-io.h
> @@ -431,7 +431,7 @@ bdrv_drain_poll(BlockDriverState *bs, BdrvChild *ignore_parent,
> *
> * This function can be recursive.
> */
> -void bdrv_drained_begin(BlockDriverState *bs);
> +void GRAPH_UNLOCKED bdrv_drained_begin(BlockDriverState *bs);
>
> /**
> * bdrv_do_drained_begin_quiesce:
Again, for readability we might as well add the GRAPH_UNLOCKED mark to
the definition in block/io.c.
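Something like this (untested sketch of the suggested additional hunk):

    --- a/block/io.c
    +++ b/block/io.c
    -void bdrv_drained_begin(BlockDriverState *bs)
    +void GRAPH_UNLOCKED bdrv_drained_begin(BlockDriverState *bs)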
Andrey
^ permalink raw reply [flat|nested] 49+ messages in thread
* [PATCH v2 20/24] iotests/graph-changes-while-io: remove image file after test
2025-05-20 10:29 [PATCH v2 00/24] block: do not drain while holding the graph lock Fiona Ebner
` (18 preceding siblings ...)
2025-05-20 10:30 ` [PATCH v2 19/24] block: mark bdrv_drained_begin() as GRAPH_UNLOCKED Fiona Ebner
@ 2025-05-20 10:30 ` Fiona Ebner
2025-05-20 10:30 ` [PATCH v2 21/24] iotests/graph-changes-while-io: add test case with removal of lower snapshot Fiona Ebner
` (4 subsequent siblings)
24 siblings, 0 replies; 49+ messages in thread
From: Fiona Ebner @ 2025-05-20 10:30 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
New in v2.
tests/qemu-iotests/tests/graph-changes-while-io | 1 +
1 file changed, 1 insertion(+)
diff --git a/tests/qemu-iotests/tests/graph-changes-while-io b/tests/qemu-iotests/tests/graph-changes-while-io
index 194fda500e..35489e3b5e 100755
--- a/tests/qemu-iotests/tests/graph-changes-while-io
+++ b/tests/qemu-iotests/tests/graph-changes-while-io
@@ -57,6 +57,7 @@ class TestGraphChangesWhileIO(QMPTestCase):
def tearDown(self) -> None:
self.qsd.stop()
+ os.remove(top)
def test_blockdev_add_while_io(self) -> None:
# Run qemu-img bench in the background
--
2.39.5
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 21/24] iotests/graph-changes-while-io: add test case with removal of lower snapshot
2025-05-20 10:29 [PATCH v2 00/24] block: do not drain while holding the graph lock Fiona Ebner
` (19 preceding siblings ...)
2025-05-20 10:30 ` [PATCH v2 20/24] iotests/graph-changes-while-io: remove image file after test Fiona Ebner
@ 2025-05-20 10:30 ` Fiona Ebner
2025-05-20 10:30 ` [PATCH v2 22/24] block/io: remove duplicate GLOBAL_STATE_CODE() in bdrv_do_drained_end() Fiona Ebner
` (3 subsequent siblings)
24 siblings, 0 replies; 49+ messages in thread
From: Fiona Ebner @ 2025-05-20 10:30 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
From: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
This test case catches a potential deadlock which takes place when
job-dismiss is issued while I/O requests are being processed in a
separate iothread.
See https://mail.gnu.org/archive/html/qemu-devel/2025-04/msg04421.html
Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
[FE: re-use top image and rename snap1->mid as suggested by Kevin Wolf
remove image file after test as suggested by Kevin Wolf
add type annotation for function argument to make mypy happy]
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
Changes in v2:
* Re-use 'top' image, rename other image to 'mid'.
* Remove image after test.
* Add type annotation for function argument for mypy.
.../qemu-iotests/tests/graph-changes-while-io | 101 ++++++++++++++++--
.../tests/graph-changes-while-io.out | 4 +-
2 files changed, 96 insertions(+), 9 deletions(-)
diff --git a/tests/qemu-iotests/tests/graph-changes-while-io b/tests/qemu-iotests/tests/graph-changes-while-io
index 35489e3b5e..dca1167b6d 100755
--- a/tests/qemu-iotests/tests/graph-changes-while-io
+++ b/tests/qemu-iotests/tests/graph-changes-while-io
@@ -27,6 +27,7 @@ from iotests import imgfmt, qemu_img, qemu_img_create, qemu_io, \
top = os.path.join(iotests.test_dir, 'top.img')
+mid = os.path.join(iotests.test_dir, 'mid.img')
nbd_sock = os.path.join(iotests.sock_dir, 'nbd.sock')
@@ -59,6 +60,15 @@ class TestGraphChangesWhileIO(QMPTestCase):
self.qsd.stop()
os.remove(top)
+ def _wait_for_blockjob(self, status: str) -> None:
+ done = False
+ while not done:
+ for event in self.qsd.get_qmp().get_events(wait=10.0):
+ if event['event'] != 'JOB_STATUS_CHANGE':
+ continue
+ if event['data']['status'] == status:
+ done = True
+
def test_blockdev_add_while_io(self) -> None:
# Run qemu-img bench in the background
bench_thr = Thread(target=do_qemu_img_bench)
@@ -117,16 +127,93 @@ class TestGraphChangesWhileIO(QMPTestCase):
'device': 'job0',
})
- cancelled = False
- while not cancelled:
- for event in self.qsd.get_qmp().get_events(wait=10.0):
- if event['event'] != 'JOB_STATUS_CHANGE':
- continue
- if event['data']['status'] == 'null':
- cancelled = True
+ self._wait_for_blockjob('null')
bench_thr.join()
+ def test_remove_lower_snapshot_while_io(self) -> None:
+ # Run qemu-img bench in the background
+ bench_thr = Thread(target=do_qemu_img_bench, args=(100000, ))
+ bench_thr.start()
+
+ # While I/O is performed on 'node0' node, consecutively add 2 snapshots
+ # on top of it, then remove (commit) them starting from the lower one.
+ while bench_thr.is_alive():
+ # Recreate snapshot images on every iteration
+ qemu_img_create('-f', imgfmt, mid, '1G')
+ qemu_img_create('-f', imgfmt, top, '1G')
+
+ self.qsd.cmd('blockdev-add', {
+ 'driver': imgfmt,
+ 'node-name': 'mid',
+ 'file': {
+ 'driver': 'file',
+ 'filename': mid
+ }
+ })
+
+ self.qsd.cmd('blockdev-snapshot', {
+ 'node': 'node0',
+ 'overlay': 'mid',
+ })
+
+ self.qsd.cmd('blockdev-add', {
+ 'driver': imgfmt,
+ 'node-name': 'top',
+ 'file': {
+ 'driver': 'file',
+ 'filename': top
+ }
+ })
+
+ self.qsd.cmd('blockdev-snapshot', {
+ 'node': 'mid',
+ 'overlay': 'top',
+ })
+
+ self.qsd.cmd('block-commit', {
+ 'job-id': 'commit-mid',
+ 'device': 'top',
+ 'top-node': 'mid',
+ 'base-node': 'node0',
+ 'auto-finalize': True,
+ 'auto-dismiss': False,
+ })
+
+ self._wait_for_blockjob('concluded')
+ self.qsd.cmd('job-dismiss', {
+ 'id': 'commit-mid',
+ })
+
+ self.qsd.cmd('block-commit', {
+ 'job-id': 'commit-top',
+ 'device': 'top',
+ 'top-node': 'top',
+ 'base-node': 'node0',
+ 'auto-finalize': True,
+ 'auto-dismiss': False,
+ })
+
+ self._wait_for_blockjob('ready')
+ self.qsd.cmd('job-complete', {
+ 'id': 'commit-top',
+ })
+
+ self._wait_for_blockjob('concluded')
+ self.qsd.cmd('job-dismiss', {
+ 'id': 'commit-top',
+ })
+
+ self.qsd.cmd('blockdev-del', {
+ 'node-name': 'mid'
+ })
+ self.qsd.cmd('blockdev-del', {
+ 'node-name': 'top'
+ })
+
+ bench_thr.join()
+ os.remove(mid)
+
if __name__ == '__main__':
# Format must support raw backing files
iotests.main(supported_fmts=['qcow', 'qcow2', 'qed'],
diff --git a/tests/qemu-iotests/tests/graph-changes-while-io.out b/tests/qemu-iotests/tests/graph-changes-while-io.out
index fbc63e62f8..8d7e996700 100644
--- a/tests/qemu-iotests/tests/graph-changes-while-io.out
+++ b/tests/qemu-iotests/tests/graph-changes-while-io.out
@@ -1,5 +1,5 @@
-..
+...
----------------------------------------------------------------------
-Ran 2 tests
+Ran 3 tests
OK
--
2.39.5
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 22/24] block/io: remove duplicate GLOBAL_STATE_CODE() in bdrv_do_drained_end()
2025-05-20 10:29 [PATCH v2 00/24] block: do not drain while holding the graph lock Fiona Ebner
` (20 preceding siblings ...)
2025-05-20 10:30 ` [PATCH v2 21/24] iotests/graph-changes-while-io: add test case with removal of lower snapshot Fiona Ebner
@ 2025-05-20 10:30 ` Fiona Ebner
2025-05-20 10:30 ` [PATCH v2 23/24] block: never use atomics to access bs->quiesce_counter Fiona Ebner
` (2 subsequent siblings)
24 siblings, 0 replies; 49+ messages in thread
From: Fiona Ebner @ 2025-05-20 10:30 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
Both commit ab61335025 ("block: drain from main loop thread in
bdrv_co_yield_to_drain()") and commit d05ab380db ("block: Mark drain
related functions GRAPH_RDLOCK") introduced a GLOBAL_STATE_CODE()
macro in bdrv_do_drained_end(). The assertion of being in the main
thread cannot change here, so keep only the earlier instance.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
New in v2.
block/io.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/block/io.c b/block/io.c
index 4fd7768f9c..ac5c7174f6 100644
--- a/block/io.c
+++ b/block/io.c
@@ -413,7 +413,6 @@ static void bdrv_do_drained_end(BlockDriverState *bs, BdrvChild *parent)
/* At this point, we should be always running in the main loop. */
GLOBAL_STATE_CODE();
assert(bs->quiesce_counter > 0);
- GLOBAL_STATE_CODE();
/* Re-enable things in child-to-parent order */
old_quiesce_counter = qatomic_fetch_dec(&bs->quiesce_counter);
--
2.39.5
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 23/24] block: never use atomics to access bs->quiesce_counter
2025-05-20 10:29 [PATCH v2 00/24] block: do not drain while holding the graph lock Fiona Ebner
` (21 preceding siblings ...)
2025-05-20 10:30 ` [PATCH v2 22/24] block/io: remove duplicate GLOBAL_STATE_CODE() in bdrv_do_drained_end() Fiona Ebner
@ 2025-05-20 10:30 ` Fiona Ebner
2025-05-20 10:30 ` [PATCH v2 24/24] block: add bdrv_graph_wrlock_drained() convenience wrapper Fiona Ebner
2025-05-20 15:44 ` [PATCH v2 00/24] block: do not drain while holding the graph lock Eric Blake
24 siblings, 0 replies; 49+ messages in thread
From: Fiona Ebner @ 2025-05-20 10:30 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
All accesses of bs->quiesce_counter are in the main thread, either
after a GLOBAL_STATE_CODE() macro or in a function with GRAPH_WRLOCK
annotation.
This is essentially a revert of 414c2ec358 ("block: access
quiesce_counter with atomic ops"). At that time, neither the
GLOBAL_STATE_CODE() macro nor the GRAPH_WRLOCK annotation existed.
Even if the field was only accessed in the main thread back then (I did
not check whether that is actually the case), it would not have been
easy to verify.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
New in v2.
block/io.c | 7 ++-----
include/block/block_int-common.h | 2 +-
2 files changed, 3 insertions(+), 6 deletions(-)
diff --git a/block/io.c b/block/io.c
index ac5c7174f6..9bd8ba8431 100644
--- a/block/io.c
+++ b/block/io.c
@@ -361,7 +361,7 @@ static void bdrv_do_drained_begin(BlockDriverState *bs, BdrvChild *parent,
GLOBAL_STATE_CODE();
/* Stop things in parent-to-child order */
- if (qatomic_fetch_inc(&bs->quiesce_counter) == 0) {
+ if (bs->quiesce_counter++ == 0) {
GRAPH_RDLOCK_GUARD_MAINLOOP();
bdrv_parent_drained_begin(bs, parent);
if (bs->drv && bs->drv->bdrv_drain_begin) {
@@ -401,8 +401,6 @@ bdrv_drained_begin(BlockDriverState *bs)
*/
static void bdrv_do_drained_end(BlockDriverState *bs, BdrvChild *parent)
{
- int old_quiesce_counter;
-
IO_OR_GS_CODE();
if (qemu_in_coroutine()) {
@@ -415,8 +413,7 @@ static void bdrv_do_drained_end(BlockDriverState *bs, BdrvChild *parent)
assert(bs->quiesce_counter > 0);
/* Re-enable things in child-to-parent order */
- old_quiesce_counter = qatomic_fetch_dec(&bs->quiesce_counter);
- if (old_quiesce_counter == 1) {
+ if (--bs->quiesce_counter == 0) {
GRAPH_RDLOCK_GUARD_MAINLOOP();
if (bs->drv && bs->drv->bdrv_drain_end) {
bs->drv->bdrv_drain_end(bs);
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 37466c7841..f9c60c1ae0 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -1227,7 +1227,7 @@ struct BlockDriverState {
/* do we need to tell the quest if we have a volatile write cache? */
int enable_write_cache;
- /* Accessed with atomic ops. */
+ /* Accessed only in the main thread. */
int quiesce_counter;
unsigned int write_gen; /* Current data generation */
--
2.39.5
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH v2 24/24] block: add bdrv_graph_wrlock_drained() convenience wrapper
2025-05-20 10:29 [PATCH v2 00/24] block: do not drain while holding the graph lock Fiona Ebner
` (22 preceding siblings ...)
2025-05-20 10:30 ` [PATCH v2 23/24] block: never use atomics to access bs->quiesce_counter Fiona Ebner
@ 2025-05-20 10:30 ` Fiona Ebner
2025-05-20 15:54 ` Eric Blake
2025-05-20 15:44 ` [PATCH v2 00/24] block: do not drain while holding the graph lock Eric Blake
24 siblings, 1 reply; 49+ messages in thread
From: Fiona Ebner @ 2025-05-20 10:30 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, kwolf, den, andrey.drobyshev, hreitz, stefanha,
eblake, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
Many write-locked sections are also drained sections. A new
bdrv_graph_wrlock_drained() wrapper around bdrv_graph_wrlock() is
introduced, which will begin a drained section first.
variable is used so bdrv_graph_wrunlock() knows if it also needs
to end such a drained section. Both the aio_poll call in
bdrv_graph_wrlock() and the aio_bh_poll() in bdrv_graph_wrunlock()
can re-enter a write-locked section. While for the latter, ending the
drain could be moved to before the call, the former requires that the
variable is a counter and not just a boolean.
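A hypothetical sequence illustrating why a boolean would not suffice:

    bdrv_graph_wrlock_drained()            counter 0 -> 1
      aio_poll() in bdrv_graph_wrlock() runs a BH that itself calls
      bdrv_graph_wrlock_drained()          counter 1 -> 2
      ... matching bdrv_graph_wrunlock()   counter 2 -> 1
    outer bdrv_graph_wrunlock()            counter 1 -> 0

With a boolean, the inner unlock would already have ended the outer
drained section.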
The switch to the new helpers was generated with the following
commands and then manually checked:
find . -name '*.c' -exec sed -i -z 's/bdrv_drain_all_begin();\n\s*bdrv_graph_wrlock();/bdrv_graph_wrlock_drained();/g' {} ';'
find . -name '*.c' -exec sed -i -z 's/bdrv_graph_wrunlock();\n\s*bdrv_drain_all_end();/bdrv_graph_wrunlock();/g' {} ';'
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
In the RFC, I had tried Coccinelle before sed, but it did not seem to
work in certain source files; maybe it was not happy with some macros?
Changes in v2:
* Add convenience wrapper rather than introducing a new flag argument.
block.c | 20 ++++------------
block/backup.c | 4 +---
block/blklogwrites.c | 8 ++-----
block/blkverify.c | 4 +---
block/block-backend.c | 8 ++-----
block/commit.c | 6 +----
block/graph-lock.c | 40 +++++++++++++++++++++++++++++---
block/mirror.c | 7 +-----
block/qcow2.c | 4 +---
block/quorum.c | 8 ++-----
block/replication.c | 11 ++-------
block/snapshot.c | 4 +---
block/stream.c | 6 +----
block/vmdk.c | 20 ++++------------
blockdev.c | 4 +---
blockjob.c | 10 ++------
include/block/graph-lock.h | 11 +++++++++
tests/unit/test-bdrv-drain.c | 36 +++++++---------------------
tests/unit/test-bdrv-graph-mod.c | 20 ++++------------
19 files changed, 90 insertions(+), 141 deletions(-)
diff --git a/block.c b/block.c
index dfe665e457..dbd07b1a42 100644
--- a/block.c
+++ b/block.c
@@ -1721,14 +1721,12 @@ bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv, const char *node_name,
open_failed:
bs->drv = NULL;
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
if (bs->file != NULL) {
bdrv_unref_child(bs, bs->file);
assert(!bs->file);
}
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
g_free(bs->opaque);
bs->opaque = NULL;
@@ -3588,11 +3586,9 @@ int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
bdrv_graph_rdunlock_main_loop();
bdrv_ref(drain_bs);
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
ret = bdrv_set_backing_hd_drained(bs, backing_hd, errp);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
bdrv_unref(drain_bs);
return ret;
@@ -3784,12 +3780,10 @@ static BdrvChild *bdrv_open_child_common(const char *filename,
return NULL;
}
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
child = bdrv_attach_child(parent, bs, bdref_key, child_class, child_role,
errp);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
return child;
}
@@ -5163,8 +5157,7 @@ static void bdrv_close(BlockDriverState *bs)
bs->drv = NULL;
}
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
QLIST_FOREACH_SAFE(child, &bs->children, next, next) {
bdrv_unref_child(bs, child);
}
@@ -5172,7 +5165,6 @@ static void bdrv_close(BlockDriverState *bs)
assert(!bs->backing);
assert(!bs->file);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
g_free(bs->opaque);
bs->opaque = NULL;
@@ -5498,8 +5490,7 @@ int bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top,
assert(!bs_new->backing);
bdrv_graph_rdunlock_main_loop();
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
child = bdrv_attach_child_noperm(bs_new, bs_top, "backing",
&child_of_bds, bdrv_backing_role(bs_new),
@@ -5520,7 +5511,6 @@ out:
bdrv_refresh_limits(bs_top, NULL, NULL);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
return ret;
}
diff --git a/block/backup.c b/block/backup.c
index 909027c17a..d4713fa1cd 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -498,12 +498,10 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
block_copy_set_speed(bcs, speed);
/* Required permissions are taken by copy-before-write filter target */
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
block_job_add_bdrv(&job->common, "target", target, 0, BLK_PERM_ALL,
&error_abort);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
return &job->common;
diff --git a/block/blklogwrites.c b/block/blklogwrites.c
index 70ac76f401..aa1f888869 100644
--- a/block/blklogwrites.c
+++ b/block/blklogwrites.c
@@ -281,11 +281,9 @@ static int blk_log_writes_open(BlockDriverState *bs, QDict *options, int flags,
ret = 0;
fail_log:
if (ret < 0) {
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
bdrv_unref_child(bs, s->log_file);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
s->log_file = NULL;
qemu_mutex_destroy(&s->mutex);
}
@@ -298,12 +296,10 @@ static void blk_log_writes_close(BlockDriverState *bs)
{
BDRVBlkLogWritesState *s = bs->opaque;
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
bdrv_unref_child(bs, s->log_file);
s->log_file = NULL;
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
qemu_mutex_destroy(&s->mutex);
}
diff --git a/block/blkverify.c b/block/blkverify.c
index 3a71f7498c..72efcbe7ef 100644
--- a/block/blkverify.c
+++ b/block/blkverify.c
@@ -151,12 +151,10 @@ static void blkverify_close(BlockDriverState *bs)
{
BDRVBlkverifyState *s = bs->opaque;
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
bdrv_unref_child(bs, s->test_file);
s->test_file = NULL;
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
}
static int64_t coroutine_fn GRAPH_RDLOCK
diff --git a/block/block-backend.c b/block/block-backend.c
index 68209bb2f7..f8d6ba65c1 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -889,11 +889,9 @@ void blk_remove_bs(BlockBackend *blk)
root = blk->root;
blk->root = NULL;
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
bdrv_root_unref_child(root);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
}
/*
@@ -906,8 +904,7 @@ int blk_insert_bs(BlockBackend *blk, BlockDriverState *bs, Error **errp)
GLOBAL_STATE_CODE();
bdrv_ref(bs);
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
if ((bs->open_flags & BDRV_O_INACTIVE) && blk_can_inactivate(blk)) {
blk->disable_perm = true;
@@ -922,7 +919,6 @@ int blk_insert_bs(BlockBackend *blk, BlockDriverState *bs, Error **errp)
BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
perm, shared_perm, blk, errp);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
if (blk->root == NULL) {
return -EPERM;
}
diff --git a/block/commit.c b/block/commit.c
index 6c4b736ff8..dc1942483b 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -392,8 +392,7 @@ void commit_start(const char *job_id, BlockDriverState *bs,
* this is the responsibility of the interface (i.e. whoever calls
* commit_start()).
*/
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
s->base_overlay = bdrv_find_overlay(top, base);
assert(s->base_overlay);
@@ -425,21 +424,18 @@ void commit_start(const char *job_id, BlockDriverState *bs,
iter_shared_perms, errp);
if (ret < 0) {
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
goto fail;
}
}
if (bdrv_freeze_backing_chain(commit_top_bs, base, errp) < 0) {
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
goto fail;
}
s->chain_frozen = true;
ret = block_job_add_bdrv(&s->common, "base", base, 0, BLK_PERM_ALL, errp);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
if (ret < 0) {
goto fail;
diff --git a/block/graph-lock.c b/block/graph-lock.c
index c81162b147..b7319473a1 100644
--- a/block/graph-lock.c
+++ b/block/graph-lock.c
@@ -33,6 +33,17 @@ static QemuMutex aio_context_list_lock;
/* Written and read with atomic operations. */
static int has_writer;
+/*
+ * Many write-locked sections are also drained sections. There is a convenience
+ * wrapper bdrv_graph_wrlock_drained() which begins a drained section before
+ * acquiring the lock. This variable here is used so bdrv_graph_wrunlock() knows
+ * if it also needs to end such a drained section. It needs to be a counter,
+ * because the aio_poll() call in bdrv_graph_wrlock() might re-enter
+ * bdrv_graph_wrlock_drained(). And note that aio_bh_poll() in
+ * bdrv_graph_wrunlock() might also re-enter a write-locked section.
+ */
+static int wrlock_quiesced_counter;
+
/*
* A reader coroutine could move from an AioContext to another.
* If this happens, there is no problem from the point of view of
@@ -112,8 +123,14 @@ void no_coroutine_fn bdrv_graph_wrlock(void)
assert(!qatomic_read(&has_writer));
assert(!qemu_in_coroutine());
- /* Make sure that constantly arriving new I/O doesn't cause starvation */
- bdrv_drain_all_begin_nopoll();
+ bool need_drain = wrlock_quiesced_counter == 0;
+
+ if (need_drain) {
+ /*
+ * Make sure that constantly arriving new I/O doesn't cause starvation
+ */
+ bdrv_drain_all_begin_nopoll();
+ }
/*
* reader_count == 0: this means writer will read has_reader as 1
@@ -139,7 +156,18 @@ void no_coroutine_fn bdrv_graph_wrlock(void)
smp_mb();
} while (reader_count() >= 1);
- bdrv_drain_all_end();
+ if (need_drain) {
+ bdrv_drain_all_end();
+ }
+}
+
+void no_coroutine_fn bdrv_graph_wrlock_drained(void)
+{
+ GLOBAL_STATE_CODE();
+
+ bdrv_drain_all_begin();
+ wrlock_quiesced_counter++;
+ bdrv_graph_wrlock();
}
void no_coroutine_fn bdrv_graph_wrunlock(void)
@@ -168,6 +196,12 @@ void no_coroutine_fn bdrv_graph_wrunlock(void)
* progress.
*/
aio_bh_poll(qemu_get_aio_context());
+
+ if (wrlock_quiesced_counter > 0) {
+ bdrv_drain_all_end();
+ wrlock_quiesced_counter--;
+ }
+
}
void coroutine_fn bdrv_graph_co_rdlock(void)
diff --git a/block/mirror.c b/block/mirror.c
index 6e8caf4b49..82a6e50cf8 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -2014,15 +2014,13 @@ static BlockJob *mirror_start_job(
*/
bdrv_disable_dirty_bitmap(s->dirty_bitmap);
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
ret = block_job_add_bdrv(&s->common, "source", bs, 0,
BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE |
BLK_PERM_CONSISTENT_READ,
errp);
if (ret < 0) {
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
goto fail;
}
@@ -2068,19 +2066,16 @@ static BlockJob *mirror_start_job(
iter_shared_perms, errp);
if (ret < 0) {
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
goto fail;
}
}
if (bdrv_freeze_backing_chain(mirror_top_bs, target, errp) < 0) {
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
goto fail;
}
}
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
QTAILQ_INIT(&s->ops_in_flight);
diff --git a/block/qcow2.c b/block/qcow2.c
index 420918b3c3..694ad797dc 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2821,11 +2821,9 @@ qcow2_do_close(BlockDriverState *bs, bool close_data_file)
if (close_data_file && has_data_file(bs)) {
GLOBAL_STATE_CODE();
bdrv_graph_rdunlock_main_loop();
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
bdrv_unref_child(bs, s->data_file);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
s->data_file = NULL;
bdrv_graph_rdlock_main_loop();
}
diff --git a/block/quorum.c b/block/quorum.c
index cc3bc5f4e7..76a4feb2d9 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -1037,8 +1037,7 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
close_exit:
/* cleanup on error */
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
for (i = 0; i < s->num_children; i++) {
if (!opened[i]) {
continue;
@@ -1046,7 +1045,6 @@ close_exit:
bdrv_unref_child(bs, s->children[i]);
}
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
g_free(s->children);
g_free(opened);
exit:
@@ -1059,13 +1057,11 @@ static void quorum_close(BlockDriverState *bs)
BDRVQuorumState *s = bs->opaque;
int i;
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
for (i = 0; i < s->num_children; i++) {
bdrv_unref_child(bs, s->children[i]);
}
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
g_free(s->children);
}
diff --git a/block/replication.c b/block/replication.c
index 0879718854..83978b61f5 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -540,8 +540,7 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,
return;
}
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
bdrv_ref(hidden_disk->bs);
s->hidden_disk = bdrv_attach_child(bs, hidden_disk->bs, "hidden disk",
@@ -550,7 +549,6 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,
if (local_err) {
error_propagate(errp, local_err);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
return;
}
@@ -561,7 +559,6 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,
if (local_err) {
error_propagate(errp, local_err);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
return;
}
@@ -574,14 +571,12 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,
!check_top_bs(top_bs, bs)) {
error_setg(errp, "No top_bs or it is invalid");
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
reopen_backing_file(bs, false, NULL);
return;
}
bdrv_op_block_all(top_bs, s->blocker);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
s->backup_job = backup_job_create(
NULL, s->secondary_disk->bs, s->hidden_disk->bs,
@@ -656,14 +651,12 @@ static void replication_done(void *opaque, int ret)
if (ret == 0) {
s->stage = BLOCK_REPLICATION_DONE;
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
bdrv_unref_child(bs, s->secondary_disk);
s->secondary_disk = NULL;
bdrv_unref_child(bs, s->hidden_disk);
s->hidden_disk = NULL;
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
s->error = 0;
} else {
diff --git a/block/snapshot.c b/block/snapshot.c
index 28c9c43621..bd9d759b32 100644
--- a/block/snapshot.c
+++ b/block/snapshot.c
@@ -291,11 +291,9 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
}
/* .bdrv_open() will re-attach it */
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
bdrv_unref_child(bs, fallback);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
ret = bdrv_snapshot_goto(fallback_bs, snapshot_id, errp);
memset(bs->opaque, 0, drv->instance_size);
diff --git a/block/stream.c b/block/stream.c
index f5441f27f4..a6ef840e29 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -371,12 +371,10 @@ void stream_start(const char *job_id, BlockDriverState *bs,
* already have our own plans. Also don't allow resize as the image size is
* queried only at the job start and then cached.
*/
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
if (block_job_add_bdrv(&s->common, "active node", bs, 0,
basic_flags | BLK_PERM_WRITE, errp)) {
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
goto fail;
}
@@ -397,12 +395,10 @@ void stream_start(const char *job_id, BlockDriverState *bs,
basic_flags, errp);
if (ret < 0) {
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
goto fail;
}
}
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
s->base_overlay = base_overlay;
s->above_base = above_base;
diff --git a/block/vmdk.c b/block/vmdk.c
index 89a7250120..04986c8d55 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c
@@ -271,8 +271,7 @@ static void vmdk_free_extents(BlockDriverState *bs)
BDRVVmdkState *s = bs->opaque;
VmdkExtent *e;
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
for (i = 0; i < s->num_extents; i++) {
e = &s->extents[i];
g_free(e->l1_table);
@@ -284,7 +283,6 @@ static void vmdk_free_extents(BlockDriverState *bs)
}
}
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
g_free(s->extents);
}
@@ -1249,11 +1247,9 @@ vmdk_parse_extents(const char *desc, BlockDriverState *bs, QDict *options,
0, 0, 0, 0, 0, &extent, errp);
if (ret < 0) {
bdrv_graph_rdunlock_main_loop();
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
bdrv_unref_child(bs, extent_file);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
bdrv_graph_rdlock_main_loop();
goto out;
}
@@ -1270,11 +1266,9 @@ vmdk_parse_extents(const char *desc, BlockDriverState *bs, QDict *options,
g_free(buf);
if (ret) {
bdrv_graph_rdunlock_main_loop();
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
bdrv_unref_child(bs, extent_file);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
bdrv_graph_rdlock_main_loop();
goto out;
}
@@ -1283,11 +1277,9 @@ vmdk_parse_extents(const char *desc, BlockDriverState *bs, QDict *options,
ret = vmdk_open_se_sparse(bs, extent_file, bs->open_flags, errp);
if (ret) {
bdrv_graph_rdunlock_main_loop();
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
bdrv_unref_child(bs, extent_file);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
bdrv_graph_rdlock_main_loop();
goto out;
}
@@ -1295,11 +1287,9 @@ vmdk_parse_extents(const char *desc, BlockDriverState *bs, QDict *options,
} else {
error_setg(errp, "Unsupported extent type '%s'", type);
bdrv_graph_rdunlock_main_loop();
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
bdrv_unref_child(bs, extent_file);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
bdrv_graph_rdlock_main_loop();
ret = -ENOTSUP;
goto out;
diff --git a/blockdev.c b/blockdev.c
index 2e7fda6780..e625534925 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3561,8 +3561,7 @@ void qmp_x_blockdev_change(const char *parent, const char *child,
BlockDriverState *parent_bs, *new_bs = NULL;
BdrvChild *p_child;
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
parent_bs = bdrv_lookup_bs(parent, parent, errp);
if (!parent_bs) {
@@ -3599,7 +3598,6 @@ void qmp_x_blockdev_change(const char *parent, const char *child,
out:
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
}
BlockJobInfoList *qmp_query_block_jobs(Error **errp)
diff --git a/blockjob.c b/blockjob.c
index e68181a35b..db7c3a69a0 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -198,8 +198,7 @@ void block_job_remove_all_bdrv(BlockJob *job)
* one to make sure that such a concurrent access does not attempt
* to process an already freed BdrvChild.
*/
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
while (job->nodes) {
GSList *l = job->nodes;
BdrvChild *c = l->data;
@@ -212,7 +211,6 @@ void block_job_remove_all_bdrv(BlockJob *job)
g_slist_free_1(l);
}
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
}
bool block_job_has_bdrv(BlockJob *job, BlockDriverState *bs)
@@ -498,8 +496,7 @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
int ret;
GLOBAL_STATE_CODE();
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
if (job_id == NULL && !(flags & JOB_INTERNAL)) {
job_id = bdrv_get_device_name(bs);
@@ -509,7 +506,6 @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
flags, cb, opaque, errp);
if (job == NULL) {
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
return NULL;
}
@@ -548,12 +544,10 @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
}
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
return job;
fail:
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
job_early_fail(&job->job);
return NULL;
}
diff --git a/include/block/graph-lock.h b/include/block/graph-lock.h
index 2c26c72108..95bf5ede40 100644
--- a/include/block/graph-lock.h
+++ b/include/block/graph-lock.h
@@ -112,10 +112,21 @@ void unregister_aiocontext(AioContext *ctx);
void no_coroutine_fn TSA_ACQUIRE(graph_lock) TSA_NO_TSA
bdrv_graph_wrlock(void);
+/*
+ * bdrv_graph_wrlock_drained:
+ * Similar to bdrv_graph_wrlock, but will begin a drained section before
+ * locking.
+ */
+void no_coroutine_fn TSA_ACQUIRE(graph_lock) TSA_NO_TSA
+bdrv_graph_wrlock_drained(void);
+
/*
* bdrv_graph_wrunlock:
* Write finished, reset global has_writer to 0 and restart
* all readers that are waiting.
+ *
+ * Also ends the drained section if bdrv_graph_wrlock_drained() was used to lock
+ * the graph.
*/
void no_coroutine_fn TSA_RELEASE(graph_lock) TSA_NO_TSA
bdrv_graph_wrunlock(void);
diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
index c33f7d31c2..ed60e1693d 100644
--- a/tests/unit/test-bdrv-drain.c
+++ b/tests/unit/test-bdrv-drain.c
@@ -772,11 +772,9 @@ static void test_blockjob_common_drain_node(enum drain_type drain_type,
tjob->bs = src;
job = &tjob->common;
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
block_job_add_bdrv(job, "target", target, 0, BLK_PERM_ALL, &error_abort);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
switch (result) {
case TEST_JOB_SUCCESS:
@@ -955,13 +953,11 @@ static void bdrv_test_top_close(BlockDriverState *bs)
{
BdrvChild *c, *next_c;
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
QLIST_FOREACH_SAFE(c, &bs->children, next, next_c) {
bdrv_unref_child(bs, c);
}
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
}
static int coroutine_fn GRAPH_RDLOCK
@@ -1051,12 +1047,10 @@ static void do_test_delete_by_drain(bool detach_instead_of_delete,
null_bs = bdrv_open("null-co://", NULL, NULL, BDRV_O_RDWR | BDRV_O_PROTOCOL,
&error_abort);
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
bdrv_attach_child(bs, null_bs, "null-child", &child_of_bds,
BDRV_CHILD_DATA, &error_abort);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
/* This child will be the one to pass to requests through to, and
* it will stall until a drain occurs */
@@ -1064,25 +1058,21 @@ static void do_test_delete_by_drain(bool detach_instead_of_delete,
&error_abort);
child_bs->total_sectors = 65536 >> BDRV_SECTOR_BITS;
/* Takes our reference to child_bs */
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
tts->wait_child = bdrv_attach_child(bs, child_bs, "wait-child",
&child_of_bds,
BDRV_CHILD_DATA | BDRV_CHILD_PRIMARY,
&error_abort);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
/* This child is just there to be deleted
* (for detach_instead_of_delete == true) */
null_bs = bdrv_open("null-co://", NULL, NULL, BDRV_O_RDWR | BDRV_O_PROTOCOL,
&error_abort);
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
bdrv_attach_child(bs, null_bs, "null-child", &child_of_bds, BDRV_CHILD_DATA,
&error_abort);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
blk = blk_new(qemu_get_aio_context(), BLK_PERM_ALL, BLK_PERM_ALL);
blk_insert_bs(blk, bs, &error_abort);
@@ -1165,8 +1155,7 @@ static void no_coroutine_fn detach_indirect_bh(void *opaque)
bdrv_dec_in_flight(data->child_b->bs);
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
bdrv_unref_child(data->parent_b, data->child_b);
bdrv_ref(data->c);
@@ -1174,7 +1163,6 @@ static void no_coroutine_fn detach_indirect_bh(void *opaque)
&child_of_bds, BDRV_CHILD_DATA,
&error_abort);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
}
static void coroutine_mixed_fn detach_by_parent_aio_cb(void *opaque, int ret)
@@ -1272,8 +1260,7 @@ static void TSA_NO_TSA test_detach_indirect(bool by_parent_cb)
/* Set child relationships */
bdrv_ref(b);
bdrv_ref(a);
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
child_b = bdrv_attach_child(parent_b, b, "PB-B", &child_of_bds,
BDRV_CHILD_DATA, &error_abort);
child_a = bdrv_attach_child(parent_b, a, "PB-A", &child_of_bds,
@@ -1284,7 +1271,6 @@ static void TSA_NO_TSA test_detach_indirect(bool by_parent_cb)
by_parent_cb ? &child_of_bds : &detach_by_driver_cb_class,
BDRV_CHILD_DATA, &error_abort);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
g_assert_cmpint(parent_a->refcnt, ==, 1);
g_assert_cmpint(parent_b->refcnt, ==, 1);
@@ -1697,8 +1683,7 @@ static void test_drop_intermediate_poll(void)
* Establish the chain last, so the chain links are the first
* elements in the BDS.parents lists
*/
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
for (i = 0; i < 3; i++) {
if (i) {
/* Takes the reference to chain[i - 1] */
@@ -1707,7 +1692,6 @@ static void test_drop_intermediate_poll(void)
}
}
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
job = block_job_create("job", &test_simple_job_driver, NULL, job_node,
0, BLK_PERM_ALL, 0, 0, NULL, NULL, &error_abort);
@@ -1954,12 +1938,10 @@ static void do_test_replace_child_mid_drain(int old_drain_count,
new_child_bs->total_sectors = 1;
bdrv_ref(old_child_bs);
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
bdrv_attach_child(parent_bs, old_child_bs, "child", &child_of_bds,
BDRV_CHILD_COW, &error_abort);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
parent_s->setup_completed = true;
for (i = 0; i < old_drain_count; i++) {
diff --git a/tests/unit/test-bdrv-graph-mod.c b/tests/unit/test-bdrv-graph-mod.c
index 7b03ebe4b0..b077f0e3e3 100644
--- a/tests/unit/test-bdrv-graph-mod.c
+++ b/tests/unit/test-bdrv-graph-mod.c
@@ -137,12 +137,10 @@ static void test_update_perm_tree(void)
blk_insert_bs(root, bs, &error_abort);
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
bdrv_attach_child(filter, bs, "child", &child_of_bds,
BDRV_CHILD_DATA, &error_abort);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
ret = bdrv_append(filter, bs, NULL);
g_assert_cmpint(ret, <, 0);
@@ -206,13 +204,11 @@ static void test_should_update_child(void)
bdrv_set_backing_hd(target, bs, &error_abort);
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
g_assert(target->backing->bs == bs);
bdrv_attach_child(filter, target, "target", &child_of_bds,
BDRV_CHILD_DATA, &error_abort);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
bdrv_append(filter, bs, &error_abort);
bdrv_graph_rdlock_main_loop();
@@ -248,8 +244,7 @@ static void test_parallel_exclusive_write(void)
bdrv_ref(base);
bdrv_ref(fl1);
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
bdrv_attach_child(top, fl1, "backing", &child_of_bds,
BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
&error_abort);
@@ -262,7 +257,6 @@ static void test_parallel_exclusive_write(void)
bdrv_replace_node(fl1, fl2, &error_abort);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
bdrv_drained_end(fl2);
bdrv_drained_end(fl1);
@@ -369,8 +363,7 @@ static void test_parallel_perm_update(void)
*/
bdrv_ref(base);
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
bdrv_attach_child(top, ws, "file", &child_of_bds, BDRV_CHILD_DATA,
&error_abort);
c_fl1 = bdrv_attach_child(ws, fl1, "first", &child_of_bds,
@@ -384,7 +377,6 @@ static void test_parallel_perm_update(void)
BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
&error_abort);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
/* Select fl1 as first child to be active */
s->selected = c_fl1;
@@ -438,13 +430,11 @@ static void test_append_greedy_filter(void)
BlockDriverState *base = no_perm_node("base");
BlockDriverState *fl = exclusive_writer_node("fl1");
- bdrv_drain_all_begin();
- bdrv_graph_wrlock();
+ bdrv_graph_wrlock_drained();
bdrv_attach_child(top, base, "backing", &child_of_bds,
BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
&error_abort);
bdrv_graph_wrunlock();
- bdrv_drain_all_end();
bdrv_append(fl, base, &error_abort);
bdrv_unref(fl);
--
2.39.5
^ permalink raw reply related [flat|nested] 49+ messages in thread
* Re: [PATCH v2 24/24] block: add bdrv_graph_wrlock_drained() convenience wrapper
2025-05-20 10:30 ` [PATCH v2 24/24] block: add bdrv_graph_wrlock_drained() convenience wrapper Fiona Ebner
@ 2025-05-20 15:54 ` Eric Blake
0 siblings, 0 replies; 49+ messages in thread
From: Eric Blake @ 2025-05-20 15:54 UTC (permalink / raw)
To: Fiona Ebner
Cc: qemu-block, qemu-devel, kwolf, den, andrey.drobyshev, hreitz,
stefanha, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
On Tue, May 20, 2025 at 12:30:12PM +0200, Fiona Ebner wrote:
> Many write-locked sections are also drained sections. A new
> convenience wrapper bdrv_graph_wrlock_drained() around
> bdrv_graph_wrlock() is
> introduced, which will begin a drained section first. A global
> variable is used so bdrv_graph_wrunlock() knows if it also needs
> to end such a drained section. Both the aio_poll call in
> bdrv_graph_wrlock() and the aio_bh_poll() in bdrv_graph_wrunlock()
> can re-enter a write-locked section. While for the latter, ending the
> drain could be moved to before the call, the former requires that the
> variable is a counter and not just a boolean.
>
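
For my own understanding, the calling pattern after this patch boils
down to the following sketch (bs and child are made-up placeholders
here; the bdrv_graph_*() and bdrv_unref_child() calls are the ones the
series touches):

    bdrv_graph_wrlock_drained();  /* begins the drained section, then locks */
    bdrv_unref_child(bs, child);  /* graph change under the write lock */
    bdrv_graph_wrunlock();        /* unlocks, then ends the drained section */

The counter-based bookkeeping keeps this correct even when one of these
sections is re-entered from aio_poll()/aio_bh_poll(), as described
above.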
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH v2 00/24] block: do not drain while holding the graph lock
2025-05-20 10:29 [PATCH v2 00/24] block: do not drain while holding the graph lock Fiona Ebner
` (23 preceding siblings ...)
2025-05-20 10:30 ` [PATCH v2 24/24] block: add bdrv_graph_wrlock_drained() convenience wrapper Fiona Ebner
@ 2025-05-20 15:44 ` Eric Blake
24 siblings, 0 replies; 49+ messages in thread
From: Eric Blake @ 2025-05-20 15:44 UTC (permalink / raw)
To: Fiona Ebner
Cc: qemu-block, qemu-devel, kwolf, den, andrey.drobyshev, hreitz,
stefanha, jsnow, vsementsov, xiechanglong.d, wencongyang2, berto,
fam, ari
On Tue, May 20, 2025 at 12:29:48PM +0200, Fiona Ebner wrote:
> Previous discussion [0].
>
> NOTE:
> there are pre-existing test failures on current master, e.g. '240' for
> all formats, '295 296 inactive-node-nbd luks-detached-header' for luks
> and 'mirror-sparse' for raw. For me, the failures do not change after
> this series.
I have not yet reproduced the mirror-sparse failure locally; I suspect
the culprit may be a difference in how well (or not) SEEK_HOLE and
FALLOC_FL_PUNCH_HOLE work across the filesystems involved.
Can I get more details on the failure as you see it, so I can work on
skipping the test when the filesystem does not support the operations
it relies on? (It might be better to reply in context to the pull
request that introduced the file [2].)
[2] https://lists.gnu.org/archive/html/qemu-devel/2025-05/msg03623.html
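
In case it helps to narrow things down, below is the kind of standalone
probe I plan to use locally. It is a hypothetical helper, not part of
the series or the test suite, and Linux/glibc-specific:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        char path[] = "./punch-probe-XXXXXX";
        int fd = mkstemp(path);

        if (fd < 0) {
            perror("mkstemp");
            return 1;
        }
        unlink(path);
        /* Grow the (sparse) file, then try to punch a hole into it. */
        if (ftruncate(fd, 1024 * 1024) < 0) {
            perror("ftruncate");
            return 1;
        }
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      0, 65536) < 0) {
            perror("fallocate(PUNCH_HOLE)");
            return 1;
        }
        printf("FALLOC_FL_PUNCH_HOLE works here\n");
        close(fd);
        return 0;
    }

A failure with EOPNOTSUPP on the filesystem running the tests would
support the theory above; SEEK_HOLE behavior can be checked in a
similar way with lseek().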
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
^ permalink raw reply [flat|nested] 49+ messages in thread