From: Fiona Ebner <f.ebner@proxmox.com>
To: qemu-block@nongnu.org
Cc: qemu-devel@nongnu.org, kwolf@redhat.com, den@virtuozzo.com,
	andrey.drobyshev@virtuozzo.com, hreitz@redhat.com,
	stefanha@redhat.com, eblake@redhat.com, jsnow@redhat.com,
	vsementsov@yandex-team.ru, xiechanglong.d@gmail.com,
	wencongyang2@huawei.com, berto@igalia.com, fam@euphon.net,
	ari@tuxera.com
Subject: [PATCH v4 08/48] block: move drain outside of bdrv_change_aio_context() and mark GRAPH_RDLOCK
Date: Fri, 30 May 2025 17:10:45 +0200
Message-ID: <20250530151125.955508-9-f.ebner@proxmox.com>
In-Reply-To: <20250530151125.955508-1-f.ebner@proxmox.com>

This is in preparation for marking bdrv_drained_begin() as
GRAPH_UNLOCKED.

Note that even if bdrv_drained_begin() were already marked as
GRAPH_UNLOCKED, TSA would not complain about the instance in
bdrv_change_aio_context() before this change, because it is preceded
by a bdrv_graph_rdunlock_main_loop() call. Releasing the lock there is
not correct, and if the caller holds the write lock, the call would
not actually release it.

In combination with block-stream, there is a deadlock that can happen
because of this [0]. In particular, the following sequence can occur:

main thread              IO thread
1. acquires write lock
                         in blk_co_do_preadv_part():
                         2. have non-zero blk->in_flight
                         3. try to acquire read lock
4. begin drain

Steps 3 and 4 might be switched. Draining will poll and get stuck,
because it will see the non-zero in_flight counter. But the IO thread
will not make any progress either, because it cannot acquire the read
lock.
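
The stuck state described above can be modelled with a small,
single-threaded sketch. This is not QEMU code: ToyState,
toy_io_thread_step() and toy_drain() are hypothetical names, the two
threads are collapsed into one loop, and polling is bounded so that the
model terminates instead of hanging.

```c
#include <stdbool.h>

/* Toy model only; none of these names exist in QEMU. */
typedef struct {
    bool writer_held; /* main thread holds the graph write lock */
    int in_flight;    /* requests already counted by the IO thread */
} ToyState;

/*
 * One step of the IO thread: a counted request can only finish once the
 * read lock is available, i.e. while no writer holds the lock.
 * Returns true if a request completed.
 */
static bool toy_io_thread_step(ToyState *s)
{
    if (s->in_flight == 0 || s->writer_held) {
        return false;
    }
    s->in_flight--;
    return true;
}

/*
 * Drain as seen from the main thread: poll until in_flight reaches
 * zero. The real drain polls without a bound; max_polls exists only so
 * the model can report "stuck" instead of looping forever.
 */
static bool toy_drain(ToyState *s, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        if (s->in_flight == 0) {
            return true;
        }
        toy_io_thread_step(s);
    }
    return s->in_flight == 0;
}
```

With writer_held set and in_flight at 1, toy_drain() never succeeds, no
matter how many polls are allowed. With the write lock not held, the
same request completes and the drain finishes.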

After this change, all paths to bdrv_change_aio_context() drain.

bdrv_change_aio_context() is called by:
1. bdrv_child_cb_change_aio_ctx() which is only called via the
   change_aio_ctx() callback, see below.
2. bdrv_child_change_aio_context(), see below.
3. bdrv_try_change_aio_context(), where a drained section is
   introduced.

The change_aio_ctx() callback is called by:
1. bdrv_attach_child_common_abort(), where a drained section is
   introduced.
2. bdrv_attach_child_common(), where a drained section is introduced.
3. bdrv_parent_change_aio_context(), see below.

bdrv_child_change_aio_context() is called by:
1. bdrv_change_aio_context(), i.e. recursive, so being in a drained
   section is invariant.
2. child_job_change_aio_ctx(), which is only called via the
   change_aio_ctx() callback, see above.

bdrv_parent_change_aio_context() is called by:
1. bdrv_change_aio_context(), i.e. recursive, so being in a drained
   section is invariant.

This resolves all code paths. Note that bdrv_attach_child_common()
and bdrv_attach_child_common_abort() hold the graph write lock, and
callers of bdrv_try_change_aio_context() might hold it too, so they
are not actually allowed to drain either. This will be addressed in
the following commits.

More granular draining is not trivially possible, because
bdrv_change_aio_context() can recursively call itself e.g. via
bdrv_child_change_aio_context().
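
The resulting pattern (drain once at the outermost caller, then only
assert the drained state inside the recursion) can be sketched as
follows. This is a simplified illustration under stated assumptions, not
the QEMU implementation: ToyNode and the toy_*() helpers are
hypothetical, the graph is reduced to a fixed-size child array, and an
integer stands in for the AioContext.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CHILDREN 2

/* Toy stand-in for a block node; not a QEMU type. */
typedef struct ToyNode {
    int ctx;             /* stands in for the node's AioContext */
    int quiesce_counter; /* > 0 while the node is drained */
    struct ToyNode *children[TOY_MAX_CHILDREN];
} ToyNode;

/* Drain every node once, before any recursion starts. */
static void toy_drain_all_begin(ToyNode **all, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        all[i]->quiesce_counter++;
    }
}

static void toy_drain_all_end(ToyNode **all, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(all[i]->quiesce_counter > 0);
        all[i]->quiesce_counter--;
    }
}

/*
 * Recursive part: it never drains itself. It relies on the caller
 * having drained already and only checks that invariant.
 */
static void toy_change_ctx(ToyNode *node, int ctx)
{
    if (node == NULL || node->ctx == ctx) {
        return; /* nothing to do, or already switched */
    }
    assert(node->quiesce_counter > 0);
    node->ctx = ctx;
    for (size_t i = 0; i < TOY_MAX_CHILDREN; i++) {
        toy_change_ctx(node->children[i], ctx);
    }
}

/* Outermost entry point: the only place where the drain happens. */
static void toy_try_change_ctx(ToyNode *root, int ctx,
                               ToyNode **all, size_t n)
{
    toy_drain_all_begin(all, n);
    toy_change_ctx(root, ctx);
    toy_drain_all_end(all, n);
}
```

Because the recursion can re-enter the same function through several
routes, pairing a begin/end around each individual node would require
tracking which nodes were drained on which route. Draining everything
at the entry point avoids that bookkeeping.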

[0]: https://lore.kernel.org/qemu-devel/73839c04-7616-407e-b057-80ca69e63f51@virtuozzo.com/

Reported-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c                          | 57 +++++++++++++++++++++++---------
 include/block/block_int-common.h | 12 +++++++
 2 files changed, 53 insertions(+), 16 deletions(-)

diff --git a/block.c b/block.c
index 01144c895e..6f42c0f1ab 100644
--- a/block.c
+++ b/block.c
@@ -106,9 +106,9 @@ static void bdrv_reopen_abort(BDRVReopenState *reopen_state);
 
 static bool bdrv_backing_overridden(BlockDriverState *bs);
 
-static bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
-                                    GHashTable *visited, Transaction *tran,
-                                    Error **errp);
+static bool GRAPH_RDLOCK
+bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
+                        GHashTable *visited, Transaction *tran, Error **errp);
 
 /* If non-zero, use only whitelisted block drivers */
 static int use_bdrv_whitelist;
@@ -3040,8 +3040,10 @@ static void GRAPH_WRLOCK bdrv_attach_child_common_abort(void *opaque)
 
         /* No need to visit `child`, because it has been detached already */
         visited = g_hash_table_new(NULL, NULL);
+        bdrv_drain_all_begin();
         ret = s->child->klass->change_aio_ctx(s->child, s->old_parent_ctx,
                                               visited, tran, &error_abort);
+        bdrv_drain_all_end();
         g_hash_table_destroy(visited);
 
         /* transaction is supposed to always succeed */
@@ -3122,9 +3124,11 @@ bdrv_attach_child_common(BlockDriverState *child_bs,
             bool ret_child;
 
             g_hash_table_add(visited, new_child);
+            bdrv_drain_all_begin();
             ret_child = child_class->change_aio_ctx(new_child, child_ctx,
                                                     visited, aio_ctx_tran,
                                                     NULL);
+            bdrv_drain_all_end();
             if (ret_child == true) {
                 error_free(local_err);
                 ret = 0;
@@ -7576,6 +7580,17 @@ typedef struct BdrvStateSetAioContext {
     BlockDriverState *bs;
 } BdrvStateSetAioContext;
 
+/*
+ * Changes the AioContext of @child to @ctx and recursively for the associated
+ * block nodes and all their children and parents. Returns true if the change is
+ * possible and the transaction @tran can be continued. Returns false and sets
+ * @errp if not and the transaction must be aborted.
+ *
+ * @visited will accumulate all visited BdrvChild objects. The caller is
+ * responsible for freeing the list afterwards.
+ *
+ * Must be called with the affected block nodes drained.
+ */
 static bool GRAPH_RDLOCK
 bdrv_parent_change_aio_context(BdrvChild *c, AioContext *ctx,
                                GHashTable *visited, Transaction *tran,
@@ -7604,6 +7619,17 @@ bdrv_parent_change_aio_context(BdrvChild *c, AioContext *ctx,
     return true;
 }
 
+/*
+ * Changes the AioContext of @c->bs to @ctx and recursively for all its children
+ * and parents. Returns true if the change is possible and the transaction @tran
+ * can be continued. Returns false and sets @errp if not and the transaction
+ * must be aborted.
+ *
+ * @visited will accumulate all visited BdrvChild objects. The caller is
+ * responsible for freeing the list afterwards.
+ *
+ * Must be called with the affected block nodes drained.
+ */
 bool bdrv_child_change_aio_context(BdrvChild *c, AioContext *ctx,
                                    GHashTable *visited, Transaction *tran,
                                    Error **errp)
@@ -7619,10 +7645,6 @@ bool bdrv_child_change_aio_context(BdrvChild *c, AioContext *ctx,
 static void bdrv_set_aio_context_clean(void *opaque)
 {
     BdrvStateSetAioContext *state = (BdrvStateSetAioContext *) opaque;
-    BlockDriverState *bs = (BlockDriverState *) state->bs;
-
-    /* Paired with bdrv_drained_begin in bdrv_change_aio_context() */
-    bdrv_drained_end(bs);
 
     g_free(state);
 }
@@ -7650,10 +7672,12 @@ static TransactionActionDrv set_aio_context = {
  *
  * @visited will accumulate all visited BdrvChild objects. The caller is
  * responsible for freeing the list afterwards.
+ *
+ * @bs must be drained.
  */
-static bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
-                                    GHashTable *visited, Transaction *tran,
-                                    Error **errp)
+static bool GRAPH_RDLOCK
+bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
+                        GHashTable *visited, Transaction *tran, Error **errp)
 {
     BdrvChild *c;
     BdrvStateSetAioContext *state;
@@ -7664,21 +7688,17 @@ static bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
         return true;
     }
 
-    bdrv_graph_rdlock_main_loop();
     QLIST_FOREACH(c, &bs->parents, next_parent) {
         if (!bdrv_parent_change_aio_context(c, ctx, visited, tran, errp)) {
-            bdrv_graph_rdunlock_main_loop();
             return false;
         }
     }
 
     QLIST_FOREACH(c, &bs->children, next) {
         if (!bdrv_child_change_aio_context(c, ctx, visited, tran, errp)) {
-            bdrv_graph_rdunlock_main_loop();
             return false;
         }
     }
-    bdrv_graph_rdunlock_main_loop();
 
     state = g_new(BdrvStateSetAioContext, 1);
     *state = (BdrvStateSetAioContext) {
@@ -7686,8 +7706,7 @@ static bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
         .bs = bs,
     };
 
-    /* Paired with bdrv_drained_end in bdrv_set_aio_context_clean() */
-    bdrv_drained_begin(bs);
+    assert(bs->quiesce_counter > 0);
 
     tran_add(tran, &set_aio_context, state);
 
@@ -7720,6 +7739,8 @@ int bdrv_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
     if (ignore_child) {
         g_hash_table_add(visited, ignore_child);
     }
+    bdrv_drain_all_begin();
+    bdrv_graph_rdlock_main_loop();
     ret = bdrv_change_aio_context(bs, ctx, visited, tran, errp);
     g_hash_table_destroy(visited);
 
@@ -7733,10 +7754,14 @@ int bdrv_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
     if (!ret) {
         /* Just run clean() callbacks. No AioContext changed. */
         tran_abort(tran);
+        bdrv_graph_rdunlock_main_loop();
+        bdrv_drain_all_end();
         return -EPERM;
     }
 
     tran_commit(tran);
+    bdrv_graph_rdunlock_main_loop();
+    bdrv_drain_all_end();
     return 0;
 }
 
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 37466c7841..168f703fa1 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -983,6 +983,18 @@ struct BdrvChildClass {
                            bool backing_mask_protocol,
                            Error **errp);
 
+    /*
+     * Notifies the parent that the child is trying to change its AioContext.
+     * The parent may in turn change the AioContext of other nodes in the same
+     * transaction. Returns true if the change is possible and the transaction
+     * can be continued. Returns false and sets @errp if not and the transaction
+     * must be aborted.
+     *
+     * @visited will accumulate all visited BdrvChild objects. The caller is
+     * responsible for freeing the list afterwards.
+     *
+     * Must be called with the affected block nodes drained.
+     */
     bool GRAPH_RDLOCK_PTR (*change_aio_ctx)(BdrvChild *child, AioContext *ctx,
                                             GHashTable *visited,
                                             Transaction *tran, Error **errp);
-- 
2.39.5



