Re: [PATCH] migration: Fix a possible crash when halting a guest during migration

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Fabiano Rosas <farosas@suse.de>
To: Thomas Huth <thuth@redhat.com>, Stefan Hajnoczi <stefanha@redhat.com>
Cc: "Peter Xu" <peterx@redhat.com>,
	qemu-devel@nongnu.org, "Kevin Wolf" <kwolf@redhat.com>,
	qemu-block@nongnu.org, "Eric Blake" <eblake@redhat.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Daniel P . Berrangé" <berrange@redhat.com>,
	"Markus Armbruster" <armbru@redhat.com>,
	"Peter Maydell" <peter.maydell@linaro.org>,
	"Alex Bennée" <alex.bennee@linaro.org>
Subject: Re: [PATCH] migration: Fix a possible crash when halting a guest during migration
Date: Fri, 12 Dec 2025 18:26:32 -0300	[thread overview]
Message-ID: <87jyyrv1br.fsf@suse.de> (raw)
In-Reply-To: <5b510f3b-796a-45fb-a63f-e87b02dace61@redhat.com>

Thomas Huth <thuth@redhat.com> writes:

> On 08/12/2025 16.26, Fabiano Rosas wrote:
>> Stefan Hajnoczi <stefanha@redhat.com> writes:
>> 
>>> On Mon, Dec 08, 2025 at 02:51:01PM +0100, Thomas Huth wrote:
>>>> From: Thomas Huth <thuth@redhat.com>
>>>>
>>>> When shutting down a guest that is currently in progress of being
>>>> migrated, there is a chance that QEMU might crash during bdrv_delete().
>>>> The backtrace looks like this:
>>>>
>>>>   Thread 74 "mig/src/main" received signal SIGSEGV, Segmentation fault.
>>>>
>>>>   [Switching to Thread 0x3f7de7fc8c0 (LWP 2161436)]
>>>>   0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:5560
>>>>   5560	        QTAILQ_REMOVE(&graph_bdrv_states, bs, node_list);
>>>>   (gdb) bt
>>>>   #0  0x000002aa00664012 in bdrv_delete (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:5560
>>>>   #1  bdrv_unref (bs=0x2aa00f875c0) at ../../devel/qemu/block.c:7170
>>>>   Backtrace stopped: Cannot access memory at address 0x3f7de7f83e0
>>>>
>> 
>> How does the migration thread reaches here? Is this from
>> migration_block_inactivate()?
>
> Unfortunately, gdb was not very helpful here (claiming that it cannot access 
> the memory and stack anymore), so I had to do some printf debugging. This is 
> what seems to happen:
>
> Main thread: qemu_cleanup() calls  migration_shutdown() --> 
> migration_cancel() which signals the migration thread to cancel the migration.
>
> Migration thread: migration_thread() got kicked out the loop and calls 
> migration_iteration_finish(), which tries to get the BQL via bql_lock() but 
> that is currently held by another thread, so the migration thread is blocked 
> here.
>
> Main thread: qemu_cleanup() advances to bdrv_close_all() that uses 
> blockdev_close_all_bdrv_states() to unref all BDS. The BDS with the name 
> 'libvirt-1-storage' gets deleted via bdrv_delete() that way.
>

Has qmp_blockdev_del() ever been called to remove the BDS from the
monitor_bdrv_states list? Otherwise your debugging seems to indicate
blockdev_close_all_bdrv_states() is dropping the last reference to bs,
but it's still accessible from bdrv_next() via
bdrv_next_monitor_owned().

> Migration thread: Later, migration_iteration_finish() finally gets the BQL, 
> and calls the migration_block_activate() function in the 
> MIGRATION_STATUS_CANCELLING case statement. This calls bdrv_activate_all().
> bdrv_activate_all() gets a pointer to that 'libvirt-1-storage' BDS again 
> from bdrv_first(), and during the bdrv_next() that BDS gets unref'ed again 
> which is causing the crash.
>
> ==> Why is bdrv_first() still providing a BDS that have been deleted by 
> other threads earlier?
>

>>> It sounds like the migration thread does not hold block graph refcounts
>>> and assumes the BlockDriverStates it uses have a long enough lifetime.
>>>
>>> I don't know the migration code well enough to say whether joining in
>>> migration_shutdown() is okay. Another option would be expicitly holding
>>> the necessary refcounts in the migration thread.
>> 
>> I agree. In principle and also because shuffling the joining around
>> feels like something that's prone to introduce other bugs.
>
> I'm a little bit lost here right now ... Can you suggest a place where we 
> would need to increase the refcounts in the migration thread?
>
>   Thomas

next prev parent reply	other threads:[~2025-12-12 21:27 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-08 13:51 [PATCH] migration: Fix a possible crash when halting a guest during migration Thomas Huth
2025-12-08 14:45 ` Stefan Hajnoczi
2025-12-08 15:26   ` Fabiano Rosas
2025-12-12 17:18     ` Thomas Huth
2025-12-12 21:26       ` Fabiano Rosas [this message]
2025-12-15 13:42         ` Kevin Wolf
2025-12-15 15:11           ` Thomas Huth
2025-12-15 15:28             ` Kevin Wolf
2025-12-15 16:12               ` Peter Xu
2025-12-16  7:18                 ` Thomas Huth
2025-12-15 13:42         ` Thomas Huth
2025-12-08 15:45 ` Peter Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87jyyrv1br.fsf@suse.de \
    --to=farosas@suse.de \
    --cc=alex.bennee@linaro.org \
    --cc=armbru@redhat.com \
    --cc=berrange@redhat.com \
    --cc=eblake@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=peter.maydell@linaro.org \
    --cc=peterx@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@redhat.com \
    --cc=thuth@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.