qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Hanna Reitz <hreitz@redhat.com>
To: qemu-block@nongnu.org
Cc: Kevin Wolf <kwolf@redhat.com>, Hanna Reitz <hreitz@redhat.com>,
	qemu-devel@nongnu.org
Subject: [PATCH] block: Fix BB.root changing across bdrv_next()
Date: Tue,  1 Mar 2022 18:39:14 +0100	[thread overview]
Message-ID: <20220301173914.12279-1-hreitz@redhat.com> (raw)

bdrv_next() has no guarantee that its caller has stopped all block graph
operations; for example, bdrv_flush_all() does not.

The latter can actually provoke such operations, because its
bdrv_flush() call, which runs a coroutine (bdrv_co_flush()), may run
this coroutine in a different AioContext than the main one, and then
when this coroutine is done and invokes aio_wait_kick(), the monitor may
get a chance to run and start executing some graph-modifying QMP
command.

One example for this is when the VM encounters an I/O error on a block
device and stops, triggering a bdrv_flush_all(), and a blockdev-mirror
is started simultaneously on a block node in an I/O thread.  When
bdrv_flush_all() comes to that node[1] and flushes it, the
aio_wait_kick() at the end of bdrv_co_flush_entry() may cause the
monitor to process the mirror request, and mirror_start_job() will then
replace the node by a mirror filter node, before bdrv_flush_all()
resumes and can invoke bdrv_next() again to continue iterating.

[1] Say there is a BlockBackend on top of the node in question, and so
bdrv_next() finds that BB and returns the node as the BB's blk_bs().
bdrv_next() will bdrv_ref() the node such that it remains valid through
bdrv_flush_all()'s iteration, and drop the reference when it is called
the next time.

The problem is that bdrv_next() does not store to which BDS it retains a
strong reference when the BDS is a BB's child, so on the subsequent
call, it will just invoke blk_bs() again and bdrv_unref() the returned
node -- but as the example above shows, this node might be a different
one than the one that was bdrv_ref()-ed before.  This can lead to a
use-after-free (for the mirror filter node in our example), because this
negligent bdrv_unref() would steal a strong reference from someone else.

We can solve this problem by always storing the returned (and strongly
referenced) BDS in BdrvNextIterator.bs.  When we want to drop the strong
reference of a BDS previously returned, always drop BdrvNextIterator.bs
instead of using other ways of trying to figure out what that BDS was
that we returned last time.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2058457
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
---
Sadly, I didn't find a nice way to test this, primarily because I
believe reproducing this requires a bdrv_flush_all() to come from the
VM (without QMP command).  The only way I know that this can happen is
when the VM encounters an I/O error and responds with stopping the
guest.
It's also anything but easily reproducible, and I can't think of a way
to improve on that, because it really seems to be based on chance
whether the aio_wait_kick() wakes up the monitor and has it process an
incoming QMP command.
---
 block/block-backend.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 4ff6b4d785..c822f257dc 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -575,7 +575,7 @@ BlockBackend *blk_next(BlockBackend *blk)
  * the monitor or attached to a BlockBackend */
 BlockDriverState *bdrv_next(BdrvNextIterator *it)
 {
-    BlockDriverState *bs, *old_bs;
+    BlockDriverState *bs, *old_bs = it->bs;
 
     /* Must be called from the main loop */
     assert(qemu_get_current_aio_context() == qemu_get_aio_context());
@@ -586,8 +586,6 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it)
     if (it->phase == BDRV_NEXT_BACKEND_ROOTS) {
         BlockBackend *old_blk = it->blk;
 
-        old_bs = old_blk ? blk_bs(old_blk) : NULL;
-
         do {
             it->blk = blk_all_next(it->blk);
             bs = it->blk ? blk_bs(it->blk) : NULL;
@@ -601,11 +599,12 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it)
         if (bs) {
             bdrv_ref(bs);
             bdrv_unref(old_bs);
+            it->bs = bs;
             return bs;
         }
         it->phase = BDRV_NEXT_MONITOR_OWNED;
-    } else {
-        old_bs = it->bs;
+        /* Start from the first monitor-owned BDS */
+        it->bs = NULL;
     }
 
     /* Then return the monitor-owned BDSes without a BB attached. Ignore all
-- 
2.34.1



             reply	other threads:[~2022-03-01 17:41 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-01 17:39 Hanna Reitz [this message]
2022-03-11 15:20 ` [PATCH] block: Fix BB.root changing across bdrv_next() Hanna Reitz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220301173914.12279-1-hreitz@redhat.com \
    --to=hreitz@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).