* [RFC PATCH] migration/savevm: do not schedule snapshot_save_job_bh in qemu_aio_context

From: Fiona Ebner <f.ebner@proxmox.com>
Date: 2024-06-05 12:08 UTC
To: qemu-devel; +Cc: peterx, farosas, stefanha, pbonzini

The fact that snapshot_save_job_bh() is scheduled in the main loop's
qemu_aio_context AioContext means that it might get executed during a
vCPU thread's aio_poll(). But the VM state cannot be saved while the
guest or devices are active, and doing so can lead to assertion
failures. See issue #2111 for two examples. Avoid the problem by
scheduling snapshot_save_job_bh() in the iohandler AioContext, which
is not polled by vCPU threads.

Solves issue #2111.

This change also solves the following issue:

Since commit effd60c878 ("monitor: only run coroutine commands in
qemu_aio_context"), the 'snapshot-save' QMP call would not respond
right after starting the job anymore, but only after the job finished,
which can take a long time. The reason is that after commit
effd60c878, do_qmp_dispatch_bh() runs in the iohandler AioContext.
When do_qmp_dispatch_bh() wakes the qmp_dispatch() coroutine, the
coroutine can no longer be entered immediately, but needs to be
scheduled to the main loop's qemu_aio_context AioContext. But
snapshot_save_job_bh() was scheduled first to the same AioContext and
thus gets executed first.

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/2111
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---

While initial smoke testing seems fine, I'm not familiar enough with
this to rule out any pitfalls with the approach. Any reason why
scheduling to the iohandler AioContext could be wrong here?

Should the same be done for snapshot_load_job_bh and
snapshot_delete_job_bh to keep it consistent?

 migration/savevm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index c621f2359b..0086b76ab0 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -3459,7 +3459,7 @@ static int coroutine_fn snapshot_save_job_run(Job *job, Error **errp)
     SnapshotJob *s = container_of(job, SnapshotJob, common);
     s->errp = errp;
     s->co = qemu_coroutine_self();
-    aio_bh_schedule_oneshot(qemu_get_aio_context(),
+    aio_bh_schedule_oneshot(iohandler_get_aio_context(),
                             snapshot_save_job_bh, job);
     qemu_coroutine_yield();
     return s->ret ? 0 : -1;
--
2.39.2
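To see why the one-line change above helps, here is a rough sketch of
the failure mechanism. The blk_drain() entry point is only an
illustrative assumption (the concrete triggers are the device code
paths referenced in issue #2111), but the BH dispatch rules are as
described in the commit message:

    /*
     * vCPU thread (BQL held, in the middle of device emulation)
     *   -> blk_drain()                                  (illustrative)
     *     -> AIO_WAIT_WHILE(qemu_get_aio_context(), ...)
     *       -> aio_poll(qemu_get_aio_context(), true)
     *         -> dispatches pending main-context BHs, including
     *            snapshot_save_job_bh()   <-- guest still active: asserts
     *
     * BHs scheduled via iohandler_get_aio_context() are dispatched only
     * from the main loop's own iteration, never from a vCPU thread's
     * aio_poll(), so rescheduling the BH there avoids the assertions.
     */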
* Re: [RFC PATCH] migration/savevm: do not schedule snapshot_save_job_bh in qemu_aio_context

From: Fabiano Rosas
Date: 2024-06-05 13:59 UTC
To: Fiona Ebner, qemu-devel; +Cc: peterx, stefanha, pbonzini

Fiona Ebner <f.ebner@proxmox.com> writes:

> The fact that snapshot_save_job_bh() is scheduled in the main loop's
> qemu_aio_context AioContext means that it might get executed during a
> vCPU thread's aio_poll(). But the VM state cannot be saved while the
> guest or devices are active, and doing so can lead to assertion
> failures. See issue #2111 for two examples. Avoid the problem by
> scheduling snapshot_save_job_bh() in the iohandler AioContext, which
> is not polled by vCPU threads.
>
> Solves issue #2111.
>
> This change also solves the following issue:
>
> Since commit effd60c878 ("monitor: only run coroutine commands in
> qemu_aio_context"), the 'snapshot-save' QMP call would not respond
> right after starting the job anymore, but only after the job finished,
> which can take a long time. The reason is that after commit
> effd60c878, do_qmp_dispatch_bh() runs in the iohandler AioContext.
> When do_qmp_dispatch_bh() wakes the qmp_dispatch() coroutine, the
> coroutine can no longer be entered immediately, but needs to be
> scheduled to the main loop's qemu_aio_context AioContext. But
> snapshot_save_job_bh() was scheduled first to the same AioContext and
> thus gets executed first.
>
> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/2111
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
>
> While initial smoke testing seems fine, I'm not familiar enough with
> this to rule out any pitfalls with the approach. Any reason why
> scheduling to the iohandler AioContext could be wrong here?

FWIW, I couldn't find any reason why this would not work. But let's
see what Stefan and Paolo have to say.

> Should the same be done for snapshot_load_job_bh and
> snapshot_delete_job_bh to keep it consistent?

Yep, I think it makes sense.

>
>  migration/savevm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/migration/savevm.c b/migration/savevm.c
> index c621f2359b..0086b76ab0 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -3459,7 +3459,7 @@ static int coroutine_fn snapshot_save_job_run(Job *job, Error **errp)
>      SnapshotJob *s = container_of(job, SnapshotJob, common);
>      s->errp = errp;
>      s->co = qemu_coroutine_self();
> -    aio_bh_schedule_oneshot(qemu_get_aio_context(),
> +    aio_bh_schedule_oneshot(iohandler_get_aio_context(),
>                              snapshot_save_job_bh, job);
>      qemu_coroutine_yield();
>      return s->ret ? 0 : -1;
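For reference, the consistent change that Fiona asks about would
presumably look like the sketch below. This is an untested sketch that
assumes snapshot_load_job_run() mirrors snapshot_save_job_run() from
the patch; the delete job would be analogous:

    static int coroutine_fn snapshot_load_job_run(Job *job, Error **errp)
    {
        SnapshotJob *s = container_of(job, SnapshotJob, common);
        s->errp = errp;
        s->co = qemu_coroutine_self();
        /* As for save: park the BH where vCPU threads cannot poll it. */
        aio_bh_schedule_oneshot(iohandler_get_aio_context(),
                                snapshot_load_job_bh, job);
        qemu_coroutine_yield();
        return s->ret ? 0 : -1;
    }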
* Re: [RFC PATCH] migration/savevm: do not schedule snapshot_save_job_bh in qemu_aio_context

From: Stefan Hajnoczi
Date: 2024-06-06 18:36 UTC
To: Fiona Ebner; +Cc: qemu-devel, peterx, farosas, pbonzini

On Wed, Jun 05, 2024 at 02:08:48PM +0200, Fiona Ebner wrote:
> The fact that snapshot_save_job_bh() is scheduled in the main loop's
> qemu_aio_context AioContext means that it might get executed during a
> vCPU thread's aio_poll(). But the VM state cannot be saved while the
> guest or devices are active, and doing so can lead to assertion
> failures. See issue #2111 for two examples. Avoid the problem by
> scheduling snapshot_save_job_bh() in the iohandler AioContext, which
> is not polled by vCPU threads.
>
> Solves issue #2111.
>
> This change also solves the following issue:
>
> Since commit effd60c878 ("monitor: only run coroutine commands in
> qemu_aio_context"), the 'snapshot-save' QMP call would not respond
> right after starting the job anymore, but only after the job finished,
> which can take a long time. The reason is that after commit
> effd60c878, do_qmp_dispatch_bh() runs in the iohandler AioContext.
> When do_qmp_dispatch_bh() wakes the qmp_dispatch() coroutine, the
> coroutine can no longer be entered immediately, but needs to be
> scheduled to the main loop's qemu_aio_context AioContext. But
> snapshot_save_job_bh() was scheduled first to the same AioContext and
> thus gets executed first.
>
> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/2111
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
>
> While initial smoke testing seems fine, I'm not familiar enough with
> this to rule out any pitfalls with the approach. Any reason why
> scheduling to the iohandler AioContext could be wrong here?

If something waits for a BlockJob to finish using aio_poll() from
qemu_aio_context, then a deadlock is possible since the iohandler_ctx
won't get a chance to execute. The only suspicious code path I found
was job_completed_txn_abort_locked() -> job_finish_sync_locked(), but
I'm not sure whether it triggers this scenario. Please check that code
path.

> Should the same be done for snapshot_load_job_bh and
> snapshot_delete_job_bh to keep it consistent?

In the long term it would be cleaner to move away from synchronous
APIs that rely on nested event loops. They have been a source of bugs
for years.

If vm_stop() and perhaps other operations in save_snapshot() were
asynchronous, then it would be safe to run the operation in
qemu_aio_context without using iohandler_ctx. vm_stop() wouldn't
invoke its callback until vCPUs were quiesced and outside device
emulation code.

I think this patch is fine as a one-line bug fix, but we should be
careful about falling back on this trick because it makes the codebase
harder to understand and more fragile.
>
>  migration/savevm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/migration/savevm.c b/migration/savevm.c
> index c621f2359b..0086b76ab0 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -3459,7 +3459,7 @@ static int coroutine_fn snapshot_save_job_run(Job *job, Error **errp)
>      SnapshotJob *s = container_of(job, SnapshotJob, common);
>      s->errp = errp;
>      s->co = qemu_coroutine_self();
> -    aio_bh_schedule_oneshot(qemu_get_aio_context(),
> +    aio_bh_schedule_oneshot(iohandler_get_aio_context(),
>                              snapshot_save_job_bh, job);
>      qemu_coroutine_yield();
>      return s->ret ? 0 : -1;
> --
> 2.39.2
* Re: [RFC PATCH] migration/savevm: do not schedule snapshot_save_job_bh in qemu_aio_context

From: Fiona Ebner <f.ebner@proxmox.com>
Date: 2024-06-11 12:08 UTC
To: Stefan Hajnoczi; +Cc: qemu-devel, peterx, farosas, pbonzini

On 06.06.24 at 20:36, Stefan Hajnoczi wrote:
> On Wed, Jun 05, 2024 at 02:08:48PM +0200, Fiona Ebner wrote:
>> The fact that snapshot_save_job_bh() is scheduled in the main loop's
>> qemu_aio_context AioContext means that it might get executed during a
>> vCPU thread's aio_poll(). But the VM state cannot be saved while the
>> guest or devices are active, and doing so can lead to assertion
>> failures. See issue #2111 for two examples. Avoid the problem by
>> scheduling snapshot_save_job_bh() in the iohandler AioContext, which
>> is not polled by vCPU threads.
>>
>> Solves issue #2111.
>>
>> This change also solves the following issue:
>>
>> Since commit effd60c878 ("monitor: only run coroutine commands in
>> qemu_aio_context"), the 'snapshot-save' QMP call would not respond
>> right after starting the job anymore, but only after the job finished,
>> which can take a long time. The reason is that after commit
>> effd60c878, do_qmp_dispatch_bh() runs in the iohandler AioContext.
>> When do_qmp_dispatch_bh() wakes the qmp_dispatch() coroutine, the
>> coroutine can no longer be entered immediately, but needs to be
>> scheduled to the main loop's qemu_aio_context AioContext. But
>> snapshot_save_job_bh() was scheduled first to the same AioContext and
>> thus gets executed first.
>>
>> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/2111
>> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
>> ---
>>
>> While initial smoke testing seems fine, I'm not familiar enough with
>> this to rule out any pitfalls with the approach. Any reason why
>> scheduling to the iohandler AioContext could be wrong here?
>
> If something waits for a BlockJob to finish using aio_poll() from
> qemu_aio_context, then a deadlock is possible since the iohandler_ctx
> won't get a chance to execute. The only suspicious code path I found
> was job_completed_txn_abort_locked() -> job_finish_sync_locked(), but
> I'm not sure whether it triggers this scenario. Please check that code
> path.
>

Sorry, I don't understand. Isn't executing the scheduled BH the only
additional progress that the iohandler_ctx needs to make compared to
before the patch? How exactly would that cause issues when waiting for
a BlockJob?

Or do you mean something waiting for the SnapshotJob from
qemu_aio_context before snapshot_save_job_bh had the chance to run?

Best Regards,
Fiona
* Re: [RFC PATCH] migration/savevm: do not schedule snapshot_save_job_bh in qemu_aio_context

From: Stefan Hajnoczi
Date: 2024-06-11 14:04 UTC
To: Fiona Ebner; +Cc: qemu-devel, peterx, farosas, pbonzini

On Tue, Jun 11, 2024 at 02:08:49PM +0200, Fiona Ebner wrote:
> On 06.06.24 at 20:36, Stefan Hajnoczi wrote:
> > On Wed, Jun 05, 2024 at 02:08:48PM +0200, Fiona Ebner wrote:
> >> The fact that snapshot_save_job_bh() is scheduled in the main loop's
> >> qemu_aio_context AioContext means that it might get executed during a
> >> vCPU thread's aio_poll(). But the VM state cannot be saved while the
> >> guest or devices are active, and doing so can lead to assertion
> >> failures. See issue #2111 for two examples. Avoid the problem by
> >> scheduling snapshot_save_job_bh() in the iohandler AioContext, which
> >> is not polled by vCPU threads.
> >>
> >> Solves issue #2111.
> >>
> >> This change also solves the following issue:
> >>
> >> Since commit effd60c878 ("monitor: only run coroutine commands in
> >> qemu_aio_context"), the 'snapshot-save' QMP call would not respond
> >> right after starting the job anymore, but only after the job finished,
> >> which can take a long time. The reason is that after commit
> >> effd60c878, do_qmp_dispatch_bh() runs in the iohandler AioContext.
> >> When do_qmp_dispatch_bh() wakes the qmp_dispatch() coroutine, the
> >> coroutine can no longer be entered immediately, but needs to be
> >> scheduled to the main loop's qemu_aio_context AioContext. But
> >> snapshot_save_job_bh() was scheduled first to the same AioContext and
> >> thus gets executed first.
> >>
> >> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/2111
> >> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> >> ---
> >>
> >> While initial smoke testing seems fine, I'm not familiar enough with
> >> this to rule out any pitfalls with the approach. Any reason why
> >> scheduling to the iohandler AioContext could be wrong here?
> >
> > If something waits for a BlockJob to finish using aio_poll() from
> > qemu_aio_context, then a deadlock is possible since the iohandler_ctx
> > won't get a chance to execute. The only suspicious code path I found
> > was job_completed_txn_abort_locked() -> job_finish_sync_locked(), but
> > I'm not sure whether it triggers this scenario. Please check that
> > code path.
> >
>
> Sorry, I don't understand. Isn't executing the scheduled BH the only
> additional progress that the iohandler_ctx needs to make compared to
> before the patch? How exactly would that cause issues when waiting for
> a BlockJob?
>
> Or do you mean something waiting for the SnapshotJob from
> qemu_aio_context before snapshot_save_job_bh had the chance to run?

Yes, exactly. job_finish_sync_locked() will hang since iohandler_ctx
has no chance to execute. But I haven't audited the code to understand
whether this can happen.

Stefan
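For context, the wait in question sits at the end of
job_finish_sync_locked(). Paraphrased from job.c (not the verbatim
source; details may differ):

    /*
     * Paraphrased from job_finish_sync_locked() in job.c. When called
     * from the main thread, AIO_WAIT_WHILE_UNLOCKED() repeatedly calls
     * aio_poll(qemu_get_aio_context(), true) until the condition turns
     * false. Only qemu_aio_context is polled here; a BH parked in
     * iohandler_ctx is never dispatched from this loop, so if job
     * completion depends on snapshot_save_job_bh() running first, the
     * loop can spin forever.
     */
    AIO_WAIT_WHILE_UNLOCKED(job->aio_context,
                            (job_enter(job), !job_is_completed(job)));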
* Re: [RFC PATCH] migration/savevm: do not schedule snapshot_save_job_bh in qemu_aio_context

From: Fiona Ebner <f.ebner@proxmox.com>
Date: 2024-06-12 9:20 UTC
To: Stefan Hajnoczi; +Cc: qemu-devel, peterx, farosas, pbonzini

On 11.06.24 at 16:04, Stefan Hajnoczi wrote:
> On Tue, Jun 11, 2024 at 02:08:49PM +0200, Fiona Ebner wrote:
>> On 06.06.24 at 20:36, Stefan Hajnoczi wrote:
>>> On Wed, Jun 05, 2024 at 02:08:48PM +0200, Fiona Ebner wrote:
>>>> The fact that snapshot_save_job_bh() is scheduled in the main loop's
>>>> qemu_aio_context AioContext means that it might get executed during a
>>>> vCPU thread's aio_poll(). But the VM state cannot be saved while the
>>>> guest or devices are active, and doing so can lead to assertion
>>>> failures. See issue #2111 for two examples. Avoid the problem by
>>>> scheduling snapshot_save_job_bh() in the iohandler AioContext, which
>>>> is not polled by vCPU threads.
>>>>
>>>> Solves issue #2111.
>>>>
>>>> This change also solves the following issue:
>>>>
>>>> Since commit effd60c878 ("monitor: only run coroutine commands in
>>>> qemu_aio_context"), the 'snapshot-save' QMP call would not respond
>>>> right after starting the job anymore, but only after the job finished,
>>>> which can take a long time. The reason is that after commit
>>>> effd60c878, do_qmp_dispatch_bh() runs in the iohandler AioContext.
>>>> When do_qmp_dispatch_bh() wakes the qmp_dispatch() coroutine, the
>>>> coroutine can no longer be entered immediately, but needs to be
>>>> scheduled to the main loop's qemu_aio_context AioContext. But
>>>> snapshot_save_job_bh() was scheduled first to the same AioContext and
>>>> thus gets executed first.
>>>>
>>>> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/2111
>>>> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
>>>> ---
>>>>
>>>> While initial smoke testing seems fine, I'm not familiar enough with
>>>> this to rule out any pitfalls with the approach. Any reason why
>>>> scheduling to the iohandler AioContext could be wrong here?
>>>
>>> If something waits for a BlockJob to finish using aio_poll() from
>>> qemu_aio_context, then a deadlock is possible since the iohandler_ctx
>>> won't get a chance to execute. The only suspicious code path I found
>>> was job_completed_txn_abort_locked() -> job_finish_sync_locked(), but
>>> I'm not sure whether it triggers this scenario. Please check that
>>> code path.
>>>
>>
>> Sorry, I don't understand. Isn't executing the scheduled BH the only
>> additional progress that the iohandler_ctx needs to make compared to
>> before the patch? How exactly would that cause issues when waiting for
>> a BlockJob?
>>
>> Or do you mean something waiting for the SnapshotJob from
>> qemu_aio_context before snapshot_save_job_bh had the chance to run?
>
> Yes, exactly. job_finish_sync_locked() will hang since iohandler_ctx
> has no chance to execute. But I haven't audited the code to understand
> whether this can happen.

So job_finish_sync_locked() is executed in
job_completed_txn_abort_locked() when the following branch is taken

> if (!job_is_completed_locked(other_job))

and there is no other job in the transaction, so we can assume
other_job is the snapshot-save job itself.

The callers of job_completed_txn_abort_locked():

1. in job_do_finalize_locked() if job->ret is non-zero. The callers of
which are:

1a. in job_finalize_locked() if JOB_VERB_FINALIZE is allowed, meaning
job->status is JOB_STATUS_PENDING, so job_is_completed_locked() will
be true.

1b. job_completed_txn_success_locked() sets job->status to
JOB_STATUS_WAITING beforehand, so job_is_completed_locked() will be
true.

2. in job_completed_locked(), it is only done if job->ret is non-zero,
in which case job->status was set to JOB_STATUS_ABORTING by the
preceding job_update_rc_locked(), and thus job_is_completed_locked()
will be true.

3. in job_cancel_locked() if job->deferred_to_main_loop is true, which
is set in job_co_entry() before job_exit() is scheduled as a BH, and
is also set in job_do_dismiss_locked(). In the former case,
snapshot_save_job_bh has already been executed. In the latter case,
job_is_completed_locked() will be true (since job_early_fail() is not
used for the snapshot job).

However, job_finish_sync_locked() is also executed via
job_cancel_sync_all() during qemu_cleanup(). With bad timing there, I
suppose the issue could arise.

In fact, I could trigger it with the following hack on top:

> diff --git a/migration/savevm.c b/migration/savevm.c
> index 0086b76ab0..42c93176ba 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -3429,6 +3429,17 @@ static void snapshot_load_job_bh(void *opaque)
>
>  static void snapshot_save_job_bh(void *opaque)
>  {
> +    static int n = 0;
> +    n++;
> +    if (n < 10000000) {
> +        aio_bh_schedule_oneshot(iohandler_get_aio_context(),
> +                                snapshot_save_job_bh, opaque);
> +        if (n % 1000000 == 0) {
> +            error_report("iteration %d", n);
> +        }
> +        return;
> +    }
> +
>      Job *job = opaque;
>      SnapshotJob *s = container_of(job, SnapshotJob, common);

Once the AIO_WAIT_WHILE_UNLOCKED(job->aio_context, ...) was hit in
job_finish_sync_locked(), snapshot_save_job_bh would never be executed
again, leading to the deadlock you described.

Best Regards,
Fiona
* Re: [RFC PATCH] migration/savevm: do not schedule snapshot_save_job_bh in qemu_aio_context

From: Stefan Hajnoczi
Date: 2024-06-12 15:34 UTC
To: Fiona Ebner; +Cc: Stefan Hajnoczi, qemu-devel, peterx, farosas, pbonzini, Kevin Wolf

On Wed, 12 Jun 2024 at 05:21, Fiona Ebner <f.ebner@proxmox.com> wrote:
>
> On 11.06.24 at 16:04, Stefan Hajnoczi wrote:
> > On Tue, Jun 11, 2024 at 02:08:49PM +0200, Fiona Ebner wrote:
> >> On 06.06.24 at 20:36, Stefan Hajnoczi wrote:
> >>> On Wed, Jun 05, 2024 at 02:08:48PM +0200, Fiona Ebner wrote:
> >>>> The fact that snapshot_save_job_bh() is scheduled in the main loop's
> >>>> qemu_aio_context AioContext means that it might get executed during a
> >>>> vCPU thread's aio_poll(). But the VM state cannot be saved while the
> >>>> guest or devices are active, and doing so can lead to assertion
> >>>> failures. See issue #2111 for two examples. Avoid the problem by
> >>>> scheduling snapshot_save_job_bh() in the iohandler AioContext, which
> >>>> is not polled by vCPU threads.
> >>>>
> >>>> Solves issue #2111.
> >>>>
> >>>> This change also solves the following issue:
> >>>>
> >>>> Since commit effd60c878 ("monitor: only run coroutine commands in
> >>>> qemu_aio_context"), the 'snapshot-save' QMP call would not respond
> >>>> right after starting the job anymore, but only after the job
> >>>> finished, which can take a long time. The reason is that after
> >>>> commit effd60c878, do_qmp_dispatch_bh() runs in the iohandler
> >>>> AioContext. When do_qmp_dispatch_bh() wakes the qmp_dispatch()
> >>>> coroutine, the coroutine can no longer be entered immediately, but
> >>>> needs to be scheduled to the main loop's qemu_aio_context
> >>>> AioContext. But snapshot_save_job_bh() was scheduled first to the
> >>>> same AioContext and thus gets executed first.
> >>>>
> >>>> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/2111
> >>>> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> >>>> ---
> >>>>
> >>>> While initial smoke testing seems fine, I'm not familiar enough with
> >>>> this to rule out any pitfalls with the approach. Any reason why
> >>>> scheduling to the iohandler AioContext could be wrong here?
> >>>
> >>> If something waits for a BlockJob to finish using aio_poll() from
> >>> qemu_aio_context, then a deadlock is possible since the
> >>> iohandler_ctx won't get a chance to execute. The only suspicious
> >>> code path I found was job_completed_txn_abort_locked() ->
> >>> job_finish_sync_locked(), but I'm not sure whether it triggers this
> >>> scenario. Please check that code path.
> >>>
> >>
> >> Sorry, I don't understand. Isn't executing the scheduled BH the only
> >> additional progress that the iohandler_ctx needs to make compared to
> >> before the patch? How exactly would that cause issues when waiting
> >> for a BlockJob?
> >>
> >> Or do you mean something waiting for the SnapshotJob from
> >> qemu_aio_context before snapshot_save_job_bh had the chance to run?
> >
> > Yes, exactly. job_finish_sync_locked() will hang since iohandler_ctx
> > has no chance to execute. But I haven't audited the code to
> > understand whether this can happen.
> So job_finish_sync_locked() is executed in
> job_completed_txn_abort_locked() when the following branch is taken
>
> > if (!job_is_completed_locked(other_job))
>
> and there is no other job in the transaction, so we can assume
> other_job is the snapshot-save job itself.
>
> The callers of job_completed_txn_abort_locked():
>
> 1. in job_do_finalize_locked() if job->ret is non-zero. The callers of
> which are:
>
> 1a. in job_finalize_locked() if JOB_VERB_FINALIZE is allowed, meaning
> job->status is JOB_STATUS_PENDING, so job_is_completed_locked() will
> be true.
>
> 1b. job_completed_txn_success_locked() sets job->status to
> JOB_STATUS_WAITING beforehand, so job_is_completed_locked() will be
> true.
>
> 2. in job_completed_locked(), it is only done if job->ret is non-zero,
> in which case job->status was set to JOB_STATUS_ABORTING by the
> preceding job_update_rc_locked(), and thus job_is_completed_locked()
> will be true.
>
> 3. in job_cancel_locked() if job->deferred_to_main_loop is true, which
> is set in job_co_entry() before job_exit() is scheduled as a BH, and
> is also set in job_do_dismiss_locked(). In the former case,
> snapshot_save_job_bh has already been executed. In the latter case,
> job_is_completed_locked() will be true (since job_early_fail() is not
> used for the snapshot job).
>
> However, job_finish_sync_locked() is also executed via
> job_cancel_sync_all() during qemu_cleanup(). With bad timing there, I
> suppose the issue could arise.
>
> In fact, I could trigger it with the following hack on top:
>
> > diff --git a/migration/savevm.c b/migration/savevm.c
> > index 0086b76ab0..42c93176ba 100644
> > --- a/migration/savevm.c
> > +++ b/migration/savevm.c
> > @@ -3429,6 +3429,17 @@ static void snapshot_load_job_bh(void *opaque)
> >
> >  static void snapshot_save_job_bh(void *opaque)
> >  {
> > +    static int n = 0;
> > +    n++;
> > +    if (n < 10000000) {
> > +        aio_bh_schedule_oneshot(iohandler_get_aio_context(),
> > +                                snapshot_save_job_bh, opaque);
> > +        if (n % 1000000 == 0) {
> > +            error_report("iteration %d", n);
> > +        }
> > +        return;
> > +    }
> > +
> >      Job *job = opaque;
> >      SnapshotJob *s = container_of(job, SnapshotJob, common);
>
> Once the AIO_WAIT_WHILE_UNLOCKED(job->aio_context, ...) was hit in
> job_finish_sync_locked(), snapshot_save_job_bh would never be executed
> again, leading to the deadlock you described.

Thank you for investigating! It looks like we would be trading one
issue (the assertion failures you mentioned) for another (a rare, but
possible, hang).

I'm not sure what the best solution is. It seems like vm_stop() is the
first place where things go awry. It's where we should exit device
emulation code. Doing that probably requires an asynchronous API that
takes a callback. Do you want to try that?

CCing Kevin in case he has ideas.

Stefan
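To make the suggestion concrete, such an API could look roughly like
the sketch below. This is purely hypothetical: vm_stop_async() and
VMStopCompletionFunc do not exist in QEMU today, and a real design
would also need error handling and ordering guarantees:

    /* HYPOTHETICAL sketch; no such API exists in QEMU today. */
    typedef void (*VMStopCompletionFunc)(void *opaque, int ret);

    /*
     * Request that the VM be stopped. The callback runs from a
     * main-loop BH only once all vCPUs are quiesced and no thread is
     * inside device emulation code, so the caller could then save the
     * VM state without resorting to iohandler_ctx.
     */
    void vm_stop_async(RunState state, VMStopCompletionFunc cb,
                       void *opaque);

    /* snapshot_save_job_bh() could then be split along these lines: */
    static void snapshot_save_vm_stopped(void *opaque, int ret)
    {
        SnapshotJob *s = opaque;
        /* ... do the actual save_snapshot() work, then wake s->co ... */
    }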
* Re: [RFC PATCH] migration/savevm: do not schedule snapshot_save_job_bh in qemu_aio_context

From: Fiona Ebner <f.ebner@proxmox.com>
Date: 2024-06-14 9:29 UTC
To: Stefan Hajnoczi; +Cc: Stefan Hajnoczi, qemu-devel, peterx, farosas, pbonzini, Kevin Wolf

On 12.06.24 at 17:34, Stefan Hajnoczi wrote:
>
> Thank you for investigating! It looks like we would be trading one
> issue (the assertion failures you mentioned) for another (a rare, but
> possible, hang).
>
> I'm not sure what the best solution is. It seems like vm_stop() is the
> first place where things go awry. It's where we should exit device
> emulation code. Doing that probably requires an asynchronous API that
> takes a callback. Do you want to try that?
>

I can try, but I'm afraid it will be a while (at least a few weeks)
until I can get around to it.

Best Regards,
Fiona
* Re: [RFC PATCH] migration/savevm: do not schedule snapshot_save_job_bh in qemu_aio_context

From: Stefan Hajnoczi
Date: 2024-06-18 20:34 UTC
To: Fiona Ebner; +Cc: Stefan Hajnoczi, qemu-devel, peterx, farosas, pbonzini, Kevin Wolf

On Fri, Jun 14, 2024 at 11:29:13AM +0200, Fiona Ebner wrote:
> On 12.06.24 at 17:34, Stefan Hajnoczi wrote:
> >
> > Thank you for investigating! It looks like we would be trading one
> > issue (the assertion failures you mentioned) for another (a rare,
> > but possible, hang).
> >
> > I'm not sure what the best solution is. It seems like vm_stop() is
> > the first place where things go awry. It's where we should exit
> > device emulation code. Doing that probably requires an asynchronous
> > API that takes a callback. Do you want to try that?
> >
>
> I can try, but I'm afraid it will be a while (at least a few weeks)
> until I can get around to it.

I am wrapping current work up and then going on vacation at the end of
June until mid-July. I'll let you know if I get a chance to look at it
when I'm back.

Stefan