* [PATCH 0/2] migration/multifd: Fix rb->receivedmap cleanup race
@ 2024-09-13 22:05 Fabiano Rosas
  2024-09-13 22:05 ` [PATCH 1/2] migration/savevm: Remove extra load cleanup calls Fabiano Rosas
  2024-09-13 22:05 ` [PATCH 2/2] migration/multifd: Fix rb->receivedmap cleanup race Fabiano Rosas
  0 siblings, 2 replies; 12+ messages in thread
From: Fabiano Rosas @ 2024-09-13 22:05 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Xu, Peter Maydell

This fixes the crash we've been seeing recently in migration-test.

The first patch is a cleanup to have only one place calling
qemu_loadvm_state_cleanup() and the second patch reorders the cleanup
calls to make multifd_recv_cleanup() run first and stop the recv
threads.

CI run: https://gitlab.com/farosas/qemu/-/pipelines/1453038652

Fabiano Rosas (2):
  migration/savevm: Remove extra load cleanup calls
  migration/multifd: Fix rb->receivedmap cleanup race

 migration/migration.c |  1 +
 migration/migration.h |  1 -
 migration/savevm.c    | 11 -----------
 3 files changed, 1 insertion(+), 12 deletions(-)

-- 
2.35.3

^ permalink raw reply	[flat|nested] 12+ messages in thread
* [PATCH 1/2] migration/savevm: Remove extra load cleanup calls
  2024-09-13 22:05 [PATCH 0/2] migration/multifd: Fix rb->receivedmap cleanup race Fabiano Rosas
@ 2024-09-13 22:05 ` Fabiano Rosas
  2024-09-17 16:42   ` Peter Xu
  2024-09-13 22:05 ` [PATCH 2/2] migration/multifd: Fix rb->receivedmap cleanup race Fabiano Rosas
  1 sibling, 1 reply; 12+ messages in thread
From: Fabiano Rosas @ 2024-09-13 22:05 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Xu, Peter Maydell, qemu-stable

There are two qemu_loadvm_state_cleanup() calls that were introduced
when qemu_loadvm_state_setup() was still called before loading the
configuration section, so there was state to be cleaned up if the
header checks failed.

However, commit 9e14b84908 ("migration/savevm: load_header before
load_setup") has moved that configuration section part to
qemu_loadvm_state_header() which now happens before
qemu_loadvm_state_setup().

Remove the cleanup calls that are now misplaced.

CC: qemu-stable@nongnu.org
Fixes: 9e14b84908 ("migration/savevm: load_header before load_setup")
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/savevm.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index d500eae979..d0759694fd 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2732,13 +2732,11 @@ static int qemu_loadvm_state_header(QEMUFile *f)
     if (migrate_get_current()->send_configuration) {
         if (qemu_get_byte(f) != QEMU_VM_CONFIGURATION) {
             error_report("Configuration section missing");
-            qemu_loadvm_state_cleanup();
             return -EINVAL;
         }
         ret = vmstate_load_state(f, &vmstate_configuration, &savevm_state, 0);

         if (ret) {
-            qemu_loadvm_state_cleanup();
             return ret;
         }
     }
-- 
2.35.3

^ permalink raw reply related	[flat|nested] 12+ messages in thread
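For context, the ordering that this patch relies on can be condensed into
the following standalone sketch (invented names and stub bodies, not QEMU
code; only the call order matters): after commit 9e14b84908, a header
failure happens before any per-device setup has allocated state, so there
is nothing to clean up on that path.

/* Condensed model of the load path after commit 9e14b84908.
 * All names are invented stand-ins with stub bodies. */
static int load_header(void)   { return 0; } /* config section checked here */
static int load_setup(void)    { return 0; } /* allocates per-device state */
static int load_main(void)     { return 0; }
static void load_cleanup(void) { }           /* frees what load_setup() made */

int main(void)
{
    int ret = load_header();
    if (ret) {
        /* Setup never ran, so there is nothing to clean up --
         * which is why the two removed calls were misplaced. */
        return ret;
    }
    ret = load_setup();
    if (ret) {
        return ret;
    }
    ret = load_main();
    load_cleanup();
    return ret;
}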
* Re: [PATCH 1/2] migration/savevm: Remove extra load cleanup calls
  2024-09-13 22:05 ` [PATCH 1/2] migration/savevm: Remove extra load cleanup calls Fabiano Rosas
@ 2024-09-17 16:42   ` Peter Xu
  2024-09-17 17:17     ` Fabiano Rosas
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Xu @ 2024-09-17 16:42 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: qemu-devel, Peter Maydell, qemu-stable

On Fri, Sep 13, 2024 at 07:05:41PM -0300, Fabiano Rosas wrote:
> There are two qemu_loadvm_state_cleanup() calls that were introduced
> when qemu_loadvm_state_setup() was still called before loading the
> configuration section, so there was state to be cleaned up if the
> header checks failed.
>
> However, commit 9e14b84908 ("migration/savevm: load_header before
> load_setup") has moved that configuration section part to
> qemu_loadvm_state_header() which now happens before
> qemu_loadvm_state_setup().
>
> Remove the cleanup calls that are now misplaced.
>
> CC: qemu-stable@nongnu.org
> Fixes: 9e14b84908 ("migration/savevm: load_header before load_setup")
> Signed-off-by: Fabiano Rosas <farosas@suse.de>

Reviewed-by: Peter Xu <peterx@redhat.com>

We don't need to copy stable, am I right? IIUC it's a good cleanup,
however not a bug fix, as qemu_loadvm_state_cleanup() can be invoked
safely without calling _setup()?

> ---
>  migration/savevm.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/migration/savevm.c b/migration/savevm.c
> index d500eae979..d0759694fd 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2732,13 +2732,11 @@ static int qemu_loadvm_state_header(QEMUFile *f)
>      if (migrate_get_current()->send_configuration) {
>          if (qemu_get_byte(f) != QEMU_VM_CONFIGURATION) {
>              error_report("Configuration section missing");
> -            qemu_loadvm_state_cleanup();
>              return -EINVAL;
>          }
>          ret = vmstate_load_state(f, &vmstate_configuration, &savevm_state, 0);
>
>          if (ret) {
> -            qemu_loadvm_state_cleanup();
>              return ret;
>          }
>      }
> -- 
> 2.35.3
>

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] migration/savevm: Remove extra load cleanup calls
  2024-09-17 16:42   ` Peter Xu
@ 2024-09-17 17:17     ` Fabiano Rosas
  0 siblings, 0 replies; 12+ messages in thread
From: Fabiano Rosas @ 2024-09-17 17:17 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, Peter Maydell, qemu-stable

Peter Xu <peterx@redhat.com> writes:

> On Fri, Sep 13, 2024 at 07:05:41PM -0300, Fabiano Rosas wrote:
>> There are two qemu_loadvm_state_cleanup() calls that were introduced
>> when qemu_loadvm_state_setup() was still called before loading the
>> configuration section, so there was state to be cleaned up if the
>> header checks failed.
>>
>> However, commit 9e14b84908 ("migration/savevm: load_header before
>> load_setup") has moved that configuration section part to
>> qemu_loadvm_state_header() which now happens before
>> qemu_loadvm_state_setup().
>>
>> Remove the cleanup calls that are now misplaced.
>>
>> CC: qemu-stable@nongnu.org
>> Fixes: 9e14b84908 ("migration/savevm: load_header before load_setup")
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
>
> We don't need to copy stable, am I right? IIUC it's a good cleanup,
> however not a bug fix, as qemu_loadvm_state_cleanup() can be invoked
> safely without calling _setup()?

Hm, I think you're right. If we fail in the header part the multifd
threads will still be waiting for the ram code to release them.

^ permalink raw reply	[flat|nested] 12+ messages in thread
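The safety property agreed on above -- cleanup tolerating a setup that
never ran -- typically reduces to NULL-safe teardown, as in this minimal
sketch (invented names, not the actual QEMU implementation; in C,
free(NULL) is defined to be a no-op):

#include <stdlib.h>

/* Invented-name sketch: state that setup may never have populated. */
typedef struct LoadState {
    unsigned long *receivedmap;  /* stays NULL until setup allocates it */
} LoadState;

static void load_state_cleanup(LoadState *s)
{
    /* free(NULL) is a no-op, so this is safe even when setup
     * never ran and receivedmap was never allocated. */
    free(s->receivedmap);
    s->receivedmap = NULL;
}

int main(void)
{
    LoadState s = { 0 };
    load_state_cleanup(&s);  /* fine: the setup equivalent never ran */
    return 0;
}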
* [PATCH 2/2] migration/multifd: Fix rb->receivedmap cleanup race
  2024-09-13 22:05 [PATCH 0/2] migration/multifd: Fix rb->receivedmap cleanup race Fabiano Rosas
  2024-09-13 22:05 ` [PATCH 1/2] migration/savevm: Remove extra load cleanup calls Fabiano Rosas
@ 2024-09-13 22:05 ` Fabiano Rosas
  2024-09-17 17:02   ` Peter Xu
  2024-09-20 18:55   ` Elena Ufimtseva
  1 sibling, 2 replies; 12+ messages in thread
From: Fabiano Rosas @ 2024-09-13 22:05 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Xu, Peter Maydell, qemu-stable

Fix a segmentation fault in multifd when rb->receivedmap is cleared
too early.

After commit 5ef7e26bdb ("migration/multifd: solve zero page causing
multiple page faults"), multifd started using the rb->receivedmap
bitmap, which belongs to ram.c and is initialized and *freed* from the
ram SaveVMHandlers.

Multifd threads are live until migration_incoming_state_destroy(),
which is called after qemu_loadvm_state_cleanup(), leading to a crash
when accessing rb->receivedmap.

  process_incoming_migration_co()        ...
    qemu_loadvm_state()                  multifd_nocomp_recv()
      qemu_loadvm_state_cleanup()          ramblock_recv_bitmap_set_offset()
        rb->receivedmap = NULL               set_bit_atomic(..., rb->receivedmap)
  ...
    migration_incoming_state_destroy()
      multifd_recv_cleanup()
        multifd_recv_terminate_threads(NULL)

Move the loadvm cleanup into migration_incoming_state_destroy(), after
multifd_recv_cleanup(), to ensure multifd threads have already exited
when rb->receivedmap is cleared.

The have_listen_thread logic can now be removed because its purpose
was to delay cleanup until postcopy_ram_listen_thread() had finished.

CC: qemu-stable@nongnu.org
Fixes: 5ef7e26bdb ("migration/multifd: solve zero page causing multiple page faults")
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/migration.c | 1 +
 migration/migration.h | 1 -
 migration/savevm.c    | 9 ---------
 3 files changed, 1 insertion(+), 10 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 3dea06d577..b190a574b1 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -378,6 +378,7 @@ void migration_incoming_state_destroy(void)
     struct MigrationIncomingState *mis = migration_incoming_get_current();

     multifd_recv_cleanup();
+    qemu_loadvm_state_cleanup();

     if (mis->to_src_file) {
         /* Tell source that we are done */
diff --git a/migration/migration.h b/migration/migration.h
index 38aa1402d5..20b0a5b66e 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -101,7 +101,6 @@ struct MigrationIncomingState {
     /* Set this when we want the fault thread to quit */
     bool fault_thread_quit;

-    bool have_listen_thread;
     QemuThread listen_thread;

     /* For the kernel to send us notifications */
diff --git a/migration/savevm.c b/migration/savevm.c
index d0759694fd..532ee5e4b0 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2076,10 +2076,8 @@ static void *postcopy_ram_listen_thread(void *opaque)
      * got a bad migration state).
      */
     migration_incoming_state_destroy();
-    qemu_loadvm_state_cleanup();

     rcu_unregister_thread();
-    mis->have_listen_thread = false;
     postcopy_state_set(POSTCOPY_INCOMING_END);

     object_unref(OBJECT(migr));
@@ -2130,7 +2128,6 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
         return -1;
     }

-    mis->have_listen_thread = true;
     postcopy_thread_create(mis, &mis->listen_thread, "mig/dst/listen",
                            postcopy_ram_listen_thread, QEMU_THREAD_DETACHED);
     trace_loadvm_postcopy_handle_listen("return");
@@ -2978,11 +2975,6 @@ int qemu_loadvm_state(QEMUFile *f)

     trace_qemu_loadvm_state_post_main(ret);

-    if (mis->have_listen_thread) {
-        /* Listen thread still going, can't clean up yet */
-        return ret;
-    }
-
     if (ret == 0) {
         ret = qemu_file_get_error(f);
     }
@@ -3022,7 +3014,6 @@ int qemu_loadvm_state(QEMUFile *f)
         }
     }

-    qemu_loadvm_state_cleanup();
     cpu_synchronize_all_post_init();

     return ret;
-- 
2.35.3

^ permalink raw reply related	[flat|nested] 12+ messages in thread
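For readers less familiar with this code, the crash boils down to a
classic teardown-ordering bug: one thread frees shared state while
another thread is still writing to it. The standalone pthread program
below models the race with invented names (it is not QEMU code); the fix
is to stop the worker before freeing, which is exactly the reordering the
patch makes.

/* Toy model of the rb->receivedmap teardown race.
 * Build with: cc -pthread race.c
 * All names are invented; this is not QEMU code. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

static unsigned long *receivedmap;   /* freed by state_cleanup() */
static atomic_bool quit;

static void *recv_thread(void *arg)
{
    (void)arg;
    while (!atomic_load(&quit)) {
        receivedmap[0] |= 1UL;   /* stands in for set_bit_atomic();
                                  * crashes if receivedmap was freed */
    }
    return NULL;
}

static void state_cleanup(void)           /* plays qemu_loadvm_state_cleanup() */
{
    free(receivedmap);
    receivedmap = NULL;
}

static void threads_cleanup(pthread_t t)  /* plays multifd_recv_cleanup() */
{
    atomic_store(&quit, true);
    pthread_join(t, NULL);
}

int main(void)
{
    pthread_t t;

    receivedmap = calloc(1, sizeof(*receivedmap));
    pthread_create(&t, NULL, recv_thread, NULL);

    /* The buggy pre-patch order would be:
     *     state_cleanup(); threads_cleanup(t);
     * letting the worker dereference freed (NULL) memory.
     * The patch enforces the safe order: */
    threads_cleanup(t);
    state_cleanup();
    return 0;
}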
* Re: [PATCH 2/2] migration/multifd: Fix rb->receivedmap cleanup race
  2024-09-13 22:05 ` [PATCH 2/2] migration/multifd: Fix rb->receivedmap cleanup race Fabiano Rosas
@ 2024-09-17 17:02   ` Peter Xu
  2024-09-17 17:41     ` Fabiano Rosas
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Xu @ 2024-09-17 17:02 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: qemu-devel, Peter Maydell, qemu-stable

On Fri, Sep 13, 2024 at 07:05:42PM -0300, Fabiano Rosas wrote:
> Fix a segmentation fault in multifd when rb->receivedmap is cleared
> too early.
>
> After commit 5ef7e26bdb ("migration/multifd: solve zero page causing
> multiple page faults"), multifd started using the rb->receivedmap
> bitmap, which belongs to ram.c and is initialized and *freed* from the
> ram SaveVMHandlers.
>
> Multifd threads are live until migration_incoming_state_destroy(),
> which is called after qemu_loadvm_state_cleanup(), leading to a crash
> when accessing rb->receivedmap.
>
>   process_incoming_migration_co()        ...
>     qemu_loadvm_state()                  multifd_nocomp_recv()
>       qemu_loadvm_state_cleanup()          ramblock_recv_bitmap_set_offset()
>         rb->receivedmap = NULL               set_bit_atomic(..., rb->receivedmap)
>   ...
>     migration_incoming_state_destroy()
>       multifd_recv_cleanup()
>         multifd_recv_terminate_threads(NULL)
>
> Move the loadvm cleanup into migration_incoming_state_destroy(), after
> multifd_recv_cleanup(), to ensure multifd threads have already exited
> when rb->receivedmap is cleared.
>
> The have_listen_thread logic can now be removed because its purpose
> was to delay cleanup until postcopy_ram_listen_thread() had finished.
>
> CC: qemu-stable@nongnu.org
> Fixes: 5ef7e26bdb ("migration/multifd: solve zero page causing multiple page faults")
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  migration/migration.c | 1 +
>  migration/migration.h | 1 -
>  migration/savevm.c    | 9 ---------
>  3 files changed, 1 insertion(+), 10 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 3dea06d577..b190a574b1 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -378,6 +378,7 @@ void migration_incoming_state_destroy(void)
>      struct MigrationIncomingState *mis = migration_incoming_get_current();
>
>      multifd_recv_cleanup();
> +    qemu_loadvm_state_cleanup();
>
>      if (mis->to_src_file) {
>          /* Tell source that we are done */
> diff --git a/migration/migration.h b/migration/migration.h
> index 38aa1402d5..20b0a5b66e 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -101,7 +101,6 @@ struct MigrationIncomingState {
>      /* Set this when we want the fault thread to quit */
>      bool fault_thread_quit;
>
> -    bool have_listen_thread;
>      QemuThread listen_thread;
>
>      /* For the kernel to send us notifications */
> diff --git a/migration/savevm.c b/migration/savevm.c
> index d0759694fd..532ee5e4b0 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2076,10 +2076,8 @@ static void *postcopy_ram_listen_thread(void *opaque)
>       * got a bad migration state).
>       */
>      migration_incoming_state_destroy();
> -    qemu_loadvm_state_cleanup();
>
>      rcu_unregister_thread();
> -    mis->have_listen_thread = false;
>      postcopy_state_set(POSTCOPY_INCOMING_END);
>
>      object_unref(OBJECT(migr));
> @@ -2130,7 +2128,6 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
>          return -1;
>      }
>
> -    mis->have_listen_thread = true;
>      postcopy_thread_create(mis, &mis->listen_thread, "mig/dst/listen",
>                             postcopy_ram_listen_thread, QEMU_THREAD_DETACHED);
>      trace_loadvm_postcopy_handle_listen("return");
> @@ -2978,11 +2975,6 @@ int qemu_loadvm_state(QEMUFile *f)
>
>      trace_qemu_loadvm_state_post_main(ret);
>
> -    if (mis->have_listen_thread) {
> -        /* Listen thread still going, can't clean up yet */
> -        return ret;
> -    }

Hmm, I wonder whether we would still need this. IIUC it's not only about
cleanup, but also that when postcopy is involved, dst QEMU postpones doing
any of the rest in the qemu_loadvm_state_main() call.

E.g. cpu put, aka, cpu_synchronize_all_post_init(), is also done in
loadvm_postcopy_handle_run_bh() later.

IOW, I'd then expect that with this patch applied we'll put the cpu twice?

I think the should_send_vmdesc() part is fine, as it returns false for
postcopy anyway. However, I'm not sure about the cpu post_init above.

> -
>      if (ret == 0) {
>          ret = qemu_file_get_error(f);
>      }
> @@ -3022,7 +3014,6 @@ int qemu_loadvm_state(QEMUFile *f)
>          }
>      }
>
> -    qemu_loadvm_state_cleanup();
>      cpu_synchronize_all_post_init();
>
>      return ret;
> -- 
> 2.35.3
>

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] migration/multifd: Fix rb->receivedmap cleanup race
  2024-09-17 17:02   ` Peter Xu
@ 2024-09-17 17:41     ` Fabiano Rosas
  0 siblings, 0 replies; 12+ messages in thread
From: Fabiano Rosas @ 2024-09-17 17:41 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, Peter Maydell, qemu-stable

Peter Xu <peterx@redhat.com> writes:

> On Fri, Sep 13, 2024 at 07:05:42PM -0300, Fabiano Rosas wrote:
>> Fix a segmentation fault in multifd when rb->receivedmap is cleared
>> too early.
>>
>> After commit 5ef7e26bdb ("migration/multifd: solve zero page causing
>> multiple page faults"), multifd started using the rb->receivedmap
>> bitmap, which belongs to ram.c and is initialized and *freed* from the
>> ram SaveVMHandlers.
>>
>> Multifd threads are live until migration_incoming_state_destroy(),
>> which is called after qemu_loadvm_state_cleanup(), leading to a crash
>> when accessing rb->receivedmap.
>>
>>   process_incoming_migration_co()        ...
>>     qemu_loadvm_state()                  multifd_nocomp_recv()
>>       qemu_loadvm_state_cleanup()          ramblock_recv_bitmap_set_offset()
>>         rb->receivedmap = NULL               set_bit_atomic(..., rb->receivedmap)
>>   ...
>>     migration_incoming_state_destroy()
>>       multifd_recv_cleanup()
>>         multifd_recv_terminate_threads(NULL)
>>
>> Move the loadvm cleanup into migration_incoming_state_destroy(), after
>> multifd_recv_cleanup(), to ensure multifd threads have already exited
>> when rb->receivedmap is cleared.
>>
>> The have_listen_thread logic can now be removed because its purpose
>> was to delay cleanup until postcopy_ram_listen_thread() had finished.
>>
>> CC: qemu-stable@nongnu.org
>> Fixes: 5ef7e26bdb ("migration/multifd: solve zero page causing multiple page faults")
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>>  migration/migration.c | 1 +
>>  migration/migration.h | 1 -
>>  migration/savevm.c    | 9 ---------
>>  3 files changed, 1 insertion(+), 10 deletions(-)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 3dea06d577..b190a574b1 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -378,6 +378,7 @@ void migration_incoming_state_destroy(void)
>>      struct MigrationIncomingState *mis = migration_incoming_get_current();
>>
>>      multifd_recv_cleanup();
>> +    qemu_loadvm_state_cleanup();
>>
>>      if (mis->to_src_file) {
>>          /* Tell source that we are done */
>> diff --git a/migration/migration.h b/migration/migration.h
>> index 38aa1402d5..20b0a5b66e 100644
>> --- a/migration/migration.h
>> +++ b/migration/migration.h
>> @@ -101,7 +101,6 @@ struct MigrationIncomingState {
>>      /* Set this when we want the fault thread to quit */
>>      bool fault_thread_quit;
>>
>> -    bool have_listen_thread;
>>      QemuThread listen_thread;
>>
>>      /* For the kernel to send us notifications */
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index d0759694fd..532ee5e4b0 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -2076,10 +2076,8 @@ static void *postcopy_ram_listen_thread(void *opaque)
>>       * got a bad migration state).
>>       */
>>      migration_incoming_state_destroy();
>> -    qemu_loadvm_state_cleanup();
>>
>>      rcu_unregister_thread();
>> -    mis->have_listen_thread = false;
>>      postcopy_state_set(POSTCOPY_INCOMING_END);
>>
>>      object_unref(OBJECT(migr));
>> @@ -2130,7 +2128,6 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
>>          return -1;
>>      }
>>
>> -    mis->have_listen_thread = true;
>>      postcopy_thread_create(mis, &mis->listen_thread, "mig/dst/listen",
>>                             postcopy_ram_listen_thread, QEMU_THREAD_DETACHED);
>>      trace_loadvm_postcopy_handle_listen("return");
>> @@ -2978,11 +2975,6 @@ int qemu_loadvm_state(QEMUFile *f)
>>
>>      trace_qemu_loadvm_state_post_main(ret);
>>
>> -    if (mis->have_listen_thread) {
>> -        /* Listen thread still going, can't clean up yet */
>> -        return ret;
>> -    }
>
> Hmm, I wonder whether we would still need this. IIUC it's not only about
> cleanup, but also that when postcopy is involved, dst QEMU postpones doing
> any of the rest in the qemu_loadvm_state_main() call.
>
> E.g. cpu put, aka, cpu_synchronize_all_post_init(), is also done in
> loadvm_postcopy_handle_run_bh() later.
>
> IOW, I'd then expect that with this patch applied we'll put the cpu twice?
>
> I think the should_send_vmdesc() part is fine, as it returns false for
> postcopy anyway. However, I'm not sure about the cpu post_init above.

I'm not sure either, but there are several ioctls in there, so it's
probably better to skip them. I'll keep the have_listen and adjust the
comment.

^ permalink raw reply	[flat|nested] 12+ messages in thread
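The resolution above -- keep have_listen_thread so the post-load CPU sync
(and its ioctls) runs exactly once -- follows a common "defer to the
owning thread" shape, sketched here with invented names purely as an
illustration:

#include <stdbool.h>
#include <stdio.h>

/* Invented-name sketch of deferring the post-load CPU sync to the
 * postcopy listen thread, so its ioctls run exactly once. */
typedef struct IncomingState {
    bool have_listen_thread;
} IncomingState;

static void cpu_sync_post_init(void)
{
    puts("putting cpu registers");  /* several ioctls in the real thing */
}

static int loadvm_finish(IncomingState *mis, int ret)
{
    if (mis->have_listen_thread) {
        /* The listen thread owns the rest of the load sequence;
         * syncing here as well would put the cpus twice. */
        return ret;
    }
    cpu_sync_post_init();
    return ret;
}

int main(void)
{
    IncomingState mis = { .have_listen_thread = false };
    return loadvm_finish(&mis, 0);
}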
* Re: [PATCH 2/2] migration/multifd: Fix rb->receivedmap cleanup race
  2024-09-13 22:05 ` [PATCH 2/2] migration/multifd: Fix rb->receivedmap cleanup race Fabiano Rosas
  2024-09-17 17:02   ` Peter Xu
@ 2024-09-20 18:55   ` Elena Ufimtseva
  2024-10-08 21:36     ` Fabiano Rosas
  1 sibling, 1 reply; 12+ messages in thread
From: Elena Ufimtseva @ 2024-09-20 18:55 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: qemu-devel, Peter Xu, Peter Maydell, qemu-stable

On Fri, Sep 13, 2024 at 3:07 PM Fabiano Rosas <farosas@suse.de> wrote:

> Fix a segmentation fault in multifd when rb->receivedmap is cleared
> too early.
>
> After commit 5ef7e26bdb ("migration/multifd: solve zero page causing
> multiple page faults"), multifd started using the rb->receivedmap
> bitmap, which belongs to ram.c and is initialized and *freed* from the
> ram SaveVMHandlers.
>
> Multifd threads are live until migration_incoming_state_destroy(),
> which is called after qemu_loadvm_state_cleanup(), leading to a crash
> when accessing rb->receivedmap.
>
>   process_incoming_migration_co()        ...
>     qemu_loadvm_state()                  multifd_nocomp_recv()
>       qemu_loadvm_state_cleanup()          ramblock_recv_bitmap_set_offset()
>         rb->receivedmap = NULL               set_bit_atomic(..., rb->receivedmap)
>   ...
>     migration_incoming_state_destroy()
>       multifd_recv_cleanup()
>         multifd_recv_terminate_threads(NULL)
>
> Move the loadvm cleanup into migration_incoming_state_destroy(), after
> multifd_recv_cleanup(), to ensure multifd threads have already exited
> when rb->receivedmap is cleared.
>
> The have_listen_thread logic can now be removed because its purpose
> was to delay cleanup until postcopy_ram_listen_thread() had finished.
>
> CC: qemu-stable@nongnu.org
> Fixes: 5ef7e26bdb ("migration/multifd: solve zero page causing multiple page faults")
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  migration/migration.c | 1 +
>  migration/migration.h | 1 -
>  migration/savevm.c    | 9 ---------
>  3 files changed, 1 insertion(+), 10 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 3dea06d577..b190a574b1 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -378,6 +378,7 @@ void migration_incoming_state_destroy(void)
>      struct MigrationIncomingState *mis = migration_incoming_get_current();
>
>      multifd_recv_cleanup();
> +    qemu_loadvm_state_cleanup();
>
>      if (mis->to_src_file) {
>          /* Tell source that we are done */
> diff --git a/migration/migration.h b/migration/migration.h
> index 38aa1402d5..20b0a5b66e 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -101,7 +101,6 @@ struct MigrationIncomingState {
>      /* Set this when we want the fault thread to quit */
>      bool fault_thread_quit;
>
> -    bool have_listen_thread;
>      QemuThread listen_thread;
>
>      /* For the kernel to send us notifications */
> diff --git a/migration/savevm.c b/migration/savevm.c
> index d0759694fd..532ee5e4b0 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2076,10 +2076,8 @@ static void *postcopy_ram_listen_thread(void *opaque)
>       * got a bad migration state).
>       */
>      migration_incoming_state_destroy();
> -    qemu_loadvm_state_cleanup();
>
>      rcu_unregister_thread();
> -    mis->have_listen_thread = false;
>      postcopy_state_set(POSTCOPY_INCOMING_END);
>
>      object_unref(OBJECT(migr));
> @@ -2130,7 +2128,6 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
>          return -1;
>      }
>
> -    mis->have_listen_thread = true;
>      postcopy_thread_create(mis, &mis->listen_thread, "mig/dst/listen",
>                             postcopy_ram_listen_thread, QEMU_THREAD_DETACHED);
>      trace_loadvm_postcopy_handle_listen("return");
> @@ -2978,11 +2975,6 @@ int qemu_loadvm_state(QEMUFile *f)
>
>      trace_qemu_loadvm_state_post_main(ret);
>
> -    if (mis->have_listen_thread) {
> -        /* Listen thread still going, can't clean up yet */
> -        return ret;
> -    }
> -
>      if (ret == 0) {
>          ret = qemu_file_get_error(f);
>      }
> @@ -3022,7 +3014,6 @@ int qemu_loadvm_state(QEMUFile *f)
>          }
>      }
>
> -    qemu_loadvm_state_cleanup();
>      cpu_synchronize_all_post_init();

Hi Fabiano

I have a question. By removing qemu_loadvm_state_cleanup() here, the
failure path that ends up with exit(EXIT_FAILURE) in
process_incoming_migration_co() ends up not calling
qemu_loadvm_state_cleanup(). I am not sure how important this is since
there is an exit, but VFIO, for example, will not call the VF reset.

Another more general question is why destination QEMU has to terminate
there if there was an error detected during live migration? Could just
failing the migration and leaving the destination running be a more
expected scenario?

Thank you!

>      return ret;
> --
> 2.35.3
>

-- 
Elena

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] migration/multifd: Fix rb->receivedmap cleanup race
  2024-09-20 18:55   ` Elena Ufimtseva
@ 2024-10-08 21:36     ` Fabiano Rosas
  0 siblings, 0 replies; 12+ messages in thread
From: Fabiano Rosas @ 2024-10-08 21:36 UTC (permalink / raw)
  To: Elena Ufimtseva; +Cc: qemu-devel, Peter Xu, Peter Maydell, qemu-stable

Elena Ufimtseva <ufimtseva@gmail.com> writes:

> On Fri, Sep 13, 2024 at 3:07 PM Fabiano Rosas <farosas@suse.de> wrote:
>
>> Fix a segmentation fault in multifd when rb->receivedmap is cleared
>> too early.
>>
>> After commit 5ef7e26bdb ("migration/multifd: solve zero page causing
>> multiple page faults"), multifd started using the rb->receivedmap
>> bitmap, which belongs to ram.c and is initialized and *freed* from the
>> ram SaveVMHandlers.
>>
>> Multifd threads are live until migration_incoming_state_destroy(),
>> which is called after qemu_loadvm_state_cleanup(), leading to a crash
>> when accessing rb->receivedmap.
>>
>>   process_incoming_migration_co()        ...
>>     qemu_loadvm_state()                  multifd_nocomp_recv()
>>       qemu_loadvm_state_cleanup()          ramblock_recv_bitmap_set_offset()
>>         rb->receivedmap = NULL               set_bit_atomic(..., rb->receivedmap)
>>   ...
>>     migration_incoming_state_destroy()
>>       multifd_recv_cleanup()
>>         multifd_recv_terminate_threads(NULL)
>>
>> Move the loadvm cleanup into migration_incoming_state_destroy(), after
>> multifd_recv_cleanup(), to ensure multifd threads have already exited
>> when rb->receivedmap is cleared.
>>
>> The have_listen_thread logic can now be removed because its purpose
>> was to delay cleanup until postcopy_ram_listen_thread() had finished.
>>
>> CC: qemu-stable@nongnu.org
>> Fixes: 5ef7e26bdb ("migration/multifd: solve zero page causing multiple page faults")
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>>  migration/migration.c | 1 +
>>  migration/migration.h | 1 -
>>  migration/savevm.c    | 9 ---------
>>  3 files changed, 1 insertion(+), 10 deletions(-)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 3dea06d577..b190a574b1 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -378,6 +378,7 @@ void migration_incoming_state_destroy(void)
>>      struct MigrationIncomingState *mis = migration_incoming_get_current();
>>
>>      multifd_recv_cleanup();
>> +    qemu_loadvm_state_cleanup();
>>
>>      if (mis->to_src_file) {
>>          /* Tell source that we are done */
>> diff --git a/migration/migration.h b/migration/migration.h
>> index 38aa1402d5..20b0a5b66e 100644
>> --- a/migration/migration.h
>> +++ b/migration/migration.h
>> @@ -101,7 +101,6 @@ struct MigrationIncomingState {
>>      /* Set this when we want the fault thread to quit */
>>      bool fault_thread_quit;
>>
>> -    bool have_listen_thread;
>>      QemuThread listen_thread;
>>
>>      /* For the kernel to send us notifications */
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index d0759694fd..532ee5e4b0 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -2076,10 +2076,8 @@ static void *postcopy_ram_listen_thread(void *opaque)
>>       * got a bad migration state).
>>       */
>>      migration_incoming_state_destroy();
>> -    qemu_loadvm_state_cleanup();
>>
>>      rcu_unregister_thread();
>> -    mis->have_listen_thread = false;
>>      postcopy_state_set(POSTCOPY_INCOMING_END);
>>
>>      object_unref(OBJECT(migr));
>> @@ -2130,7 +2128,6 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
>>          return -1;
>>      }
>>
>> -    mis->have_listen_thread = true;
>>      postcopy_thread_create(mis, &mis->listen_thread, "mig/dst/listen",
>>                             postcopy_ram_listen_thread, QEMU_THREAD_DETACHED);
>>      trace_loadvm_postcopy_handle_listen("return");
>> @@ -2978,11 +2975,6 @@ int qemu_loadvm_state(QEMUFile *f)
>>
>>      trace_qemu_loadvm_state_post_main(ret);
>>
>> -    if (mis->have_listen_thread) {
>> -        /* Listen thread still going, can't clean up yet */
>> -        return ret;
>> -    }
>> -
>>      if (ret == 0) {
>>          ret = qemu_file_get_error(f);
>>      }
>> @@ -3022,7 +3014,6 @@ int qemu_loadvm_state(QEMUFile *f)
>>          }
>>      }
>>
>> -    qemu_loadvm_state_cleanup();
>>      cpu_synchronize_all_post_init();
>
> Hi Fabiano

Hi, sorry for the delay.

> I have a question. By removing qemu_loadvm_state_cleanup() here, the
> failure path that ends up with exit(EXIT_FAILURE) in
> process_incoming_migration_co() ends up not calling
> qemu_loadvm_state_cleanup(). I am not sure how important this is since
> there is an exit, but VFIO, for example, will not call the VF reset.

In the fail label, migration_incoming_state_destroy() is called right
before the exit block.

> Another more general question is why destination QEMU has to terminate
> there if there was an error detected during live migration? Could just
> failing the migration and leaving the destination running be a more
> expected scenario?

After failure to load, the destination VM is not really usable, so
terminating it seems ok. This was changed with the addition of
mis->exit_on_error in dbea1c89da ("qapi: introduce exit-on-error
parameter for migrate-incoming"). We still exit on error by default,
but allow the user to change the behavior so that management
applications can collect the error or inspect the VM.

> Thank you!
>
>>      return ret;
>> --
>> 2.35.3
>>
>>

^ permalink raw reply	[flat|nested] 12+ messages in thread
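The default-exit policy described in this reply, together with the
opt-out that dbea1c89da added, has roughly the following shape; this is a
hand-written sketch with invented names, not the actual QEMU code:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Invented-name sketch of the exit_on_error policy for a failed
 * incoming migration. */
static void incoming_migration_failed(bool exit_on_error, const char *err)
{
    fprintf(stderr, "load of migration failed: %s\n", err);
    if (exit_on_error) {
        exit(EXIT_FAILURE);  /* the default behaviour */
    }
    /* Otherwise keep the process alive so a management application
     * can query the error and dispose of the destination itself. */
}

int main(void)
{
    incoming_migration_failed(false, "example error");
    return 0;
}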
* [PATCH 0/2] migration/multifd: Fix rb->receivedmap cleanup race
@ 2024-09-17 18:58 Fabiano Rosas
  2024-09-17 18:58 ` [PATCH 2/2] " Fabiano Rosas
  0 siblings, 1 reply; 12+ messages in thread
From: Fabiano Rosas @ 2024-09-17 18:58 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Xu, Peter Maydell

v2:
Keep skipping the cpu_synchronize_all_post_init() call if the postcopy
listen thread is live. Don't copy stable on the first patch.

CI run: https://gitlab.com/farosas/qemu/-/pipelines/1457418838
====
v1: https://lore.kernel.org/r/20240913220542.18305-1-farosas@suse.de

This fixes the crash we've been seeing recently in migration-test.

The first patch is a cleanup to have only one place calling
qemu_loadvm_state_cleanup() and the second patch reorders the cleanup
calls to make multifd_recv_cleanup() run first and stop the recv
threads.

Fabiano Rosas (2):
  migration/savevm: Remove extra load cleanup calls
  migration/multifd: Fix rb->receivedmap cleanup race

 migration/migration.c | 1 +
 migration/savevm.c    | 8 ++++----
 2 files changed, 5 insertions(+), 4 deletions(-)

-- 
2.35.3

^ permalink raw reply	[flat|nested] 12+ messages in thread
* [PATCH 2/2] migration/multifd: Fix rb->receivedmap cleanup race
  2024-09-17 18:58 [PATCH 0/2] " Fabiano Rosas
@ 2024-09-17 18:58 ` Fabiano Rosas
  2024-09-17 19:20   ` Peter Xu
  0 siblings, 1 reply; 12+ messages in thread
From: Fabiano Rosas @ 2024-09-17 18:58 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Xu, Peter Maydell, qemu-stable

Fix a segmentation fault in multifd when rb->receivedmap is cleared
too early.

After commit 5ef7e26bdb ("migration/multifd: solve zero page causing
multiple page faults"), multifd started using the rb->receivedmap
bitmap, which belongs to ram.c and is initialized and *freed* from the
ram SaveVMHandlers.

Multifd threads are live until migration_incoming_state_destroy(),
which is called after qemu_loadvm_state_cleanup(), leading to a crash
when accessing rb->receivedmap.

  process_incoming_migration_co()        ...
    qemu_loadvm_state()                  multifd_nocomp_recv()
      qemu_loadvm_state_cleanup()          ramblock_recv_bitmap_set_offset()
        rb->receivedmap = NULL               set_bit_atomic(..., rb->receivedmap)
  ...
    migration_incoming_state_destroy()
      multifd_recv_cleanup()
        multifd_recv_terminate_threads(NULL)

Move the loadvm cleanup into migration_incoming_state_destroy(), after
multifd_recv_cleanup(), to ensure multifd threads have already exited
when rb->receivedmap is cleared.

Adjust the postcopy listen thread comment to indicate that we still
want to skip the cpu synchronization.

CC: qemu-stable@nongnu.org
Fixes: 5ef7e26bdb ("migration/multifd: solve zero page causing multiple page faults")
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/migration.c | 1 +
 migration/savevm.c    | 6 ++++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 3dea06d577..b190a574b1 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -378,6 +378,7 @@ void migration_incoming_state_destroy(void)
     struct MigrationIncomingState *mis = migration_incoming_get_current();

     multifd_recv_cleanup();
+    qemu_loadvm_state_cleanup();

     if (mis->to_src_file) {
         /* Tell source that we are done */
diff --git a/migration/savevm.c b/migration/savevm.c
index d0759694fd..7e1e27182a 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2979,7 +2979,10 @@ int qemu_loadvm_state(QEMUFile *f)
     trace_qemu_loadvm_state_post_main(ret);

     if (mis->have_listen_thread) {
-        /* Listen thread still going, can't clean up yet */
+        /*
+         * Postcopy listen thread still going, don't synchronize the
+         * cpus yet.
+         */
         return ret;
     }

@@ -3022,7 +3025,6 @@ int qemu_loadvm_state(QEMUFile *f)
         }
     }

-    qemu_loadvm_state_cleanup();
     cpu_synchronize_all_post_init();

     return ret;
-- 
2.35.3

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] migration/multifd: Fix rb->receivedmap cleanup race
  2024-09-17 18:58 ` [PATCH 2/2] " Fabiano Rosas
@ 2024-09-17 19:20   ` Peter Xu
  2024-09-17 19:29     ` Fabiano Rosas
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Xu @ 2024-09-17 19:20 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: qemu-devel, Peter Maydell, qemu-stable

On Tue, Sep 17, 2024 at 03:58:02PM -0300, Fabiano Rosas wrote:
> Fix a segmentation fault in multifd when rb->receivedmap is cleared
> too early.
>
> After commit 5ef7e26bdb ("migration/multifd: solve zero page causing
> multiple page faults"), multifd started using the rb->receivedmap
> bitmap, which belongs to ram.c and is initialized and *freed* from the
> ram SaveVMHandlers.
>
> Multifd threads are live until migration_incoming_state_destroy(),
> which is called after qemu_loadvm_state_cleanup(), leading to a crash
> when accessing rb->receivedmap.
>
>   process_incoming_migration_co()        ...
>     qemu_loadvm_state()                  multifd_nocomp_recv()
>       qemu_loadvm_state_cleanup()          ramblock_recv_bitmap_set_offset()
>         rb->receivedmap = NULL               set_bit_atomic(..., rb->receivedmap)
>   ...
>     migration_incoming_state_destroy()
>       multifd_recv_cleanup()
>         multifd_recv_terminate_threads(NULL)
>
> Move the loadvm cleanup into migration_incoming_state_destroy(), after
> multifd_recv_cleanup(), to ensure multifd threads have already exited
> when rb->receivedmap is cleared.
>
> Adjust the postcopy listen thread comment to indicate that we still
> want to skip the cpu synchronization.
>
> CC: qemu-stable@nongnu.org
> Fixes: 5ef7e26bdb ("migration/multifd: solve zero page causing multiple page faults")
> Signed-off-by: Fabiano Rosas <farosas@suse.de>

Reviewed-by: Peter Xu <peterx@redhat.com>

One trivial question below..

> ---
>  migration/migration.c | 1 +
>  migration/savevm.c    | 6 ++++--
>  2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 3dea06d577..b190a574b1 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -378,6 +378,7 @@ void migration_incoming_state_destroy(void)
>      struct MigrationIncomingState *mis = migration_incoming_get_current();
>
>      multifd_recv_cleanup();

Would you mind if I add a comment, squashed in here when I queue the patch?

    /*
     * RAM state cleanup needs to happen after multifd cleanup, because
     * multifd threads can use some of its state (receivedmap).
     */

> +    qemu_loadvm_state_cleanup();
>
>      if (mis->to_src_file) {
>          /* Tell source that we are done */
> diff --git a/migration/savevm.c b/migration/savevm.c
> index d0759694fd..7e1e27182a 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2979,7 +2979,10 @@ int qemu_loadvm_state(QEMUFile *f)
>      trace_qemu_loadvm_state_post_main(ret);
>
>      if (mis->have_listen_thread) {
> -        /* Listen thread still going, can't clean up yet */
> +        /*
> +         * Postcopy listen thread still going, don't synchronize the
> +         * cpus yet.
> +         */
>          return ret;
>      }
>
> @@ -3022,7 +3025,6 @@ int qemu_loadvm_state(QEMUFile *f)
>          }
>      }
>
> -    qemu_loadvm_state_cleanup();
>      cpu_synchronize_all_post_init();
>
>      return ret;
> -- 
> 2.35.3
>

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] migration/multifd: Fix rb->receivedmap cleanup race
  2024-09-17 19:20   ` Peter Xu
@ 2024-09-17 19:29     ` Fabiano Rosas
  0 siblings, 0 replies; 12+ messages in thread
From: Fabiano Rosas @ 2024-09-17 19:29 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, Peter Maydell, qemu-stable

Peter Xu <peterx@redhat.com> writes:

> On Tue, Sep 17, 2024 at 03:58:02PM -0300, Fabiano Rosas wrote:
>> Fix a segmentation fault in multifd when rb->receivedmap is cleared
>> too early.
>>
>> After commit 5ef7e26bdb ("migration/multifd: solve zero page causing
>> multiple page faults"), multifd started using the rb->receivedmap
>> bitmap, which belongs to ram.c and is initialized and *freed* from the
>> ram SaveVMHandlers.
>>
>> Multifd threads are live until migration_incoming_state_destroy(),
>> which is called after qemu_loadvm_state_cleanup(), leading to a crash
>> when accessing rb->receivedmap.
>>
>>   process_incoming_migration_co()        ...
>>     qemu_loadvm_state()                  multifd_nocomp_recv()
>>       qemu_loadvm_state_cleanup()          ramblock_recv_bitmap_set_offset()
>>         rb->receivedmap = NULL               set_bit_atomic(..., rb->receivedmap)
>>   ...
>>     migration_incoming_state_destroy()
>>       multifd_recv_cleanup()
>>         multifd_recv_terminate_threads(NULL)
>>
>> Move the loadvm cleanup into migration_incoming_state_destroy(), after
>> multifd_recv_cleanup(), to ensure multifd threads have already exited
>> when rb->receivedmap is cleared.
>>
>> Adjust the postcopy listen thread comment to indicate that we still
>> want to skip the cpu synchronization.
>>
>> CC: qemu-stable@nongnu.org
>> Fixes: 5ef7e26bdb ("migration/multifd: solve zero page causing multiple page faults")
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
>
> One trivial question below..
>
>> ---
>>  migration/migration.c | 1 +
>>  migration/savevm.c    | 6 ++++--
>>  2 files changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 3dea06d577..b190a574b1 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -378,6 +378,7 @@ void migration_incoming_state_destroy(void)
>>      struct MigrationIncomingState *mis = migration_incoming_get_current();
>>
>>      multifd_recv_cleanup();
>
> Would you mind if I add a comment, squashed in here when I queue the patch?
>
>     /*
>      * RAM state cleanup needs to happen after multifd cleanup, because
>      * multifd threads can use some of its state (receivedmap).
>      */

Yeah, that's ok.

>> +    qemu_loadvm_state_cleanup();
>>
>>      if (mis->to_src_file) {
>>          /* Tell source that we are done */
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index d0759694fd..7e1e27182a 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -2979,7 +2979,10 @@ int qemu_loadvm_state(QEMUFile *f)
>>      trace_qemu_loadvm_state_post_main(ret);
>>
>>      if (mis->have_listen_thread) {
>> -        /* Listen thread still going, can't clean up yet */
>> +        /*
>> +         * Postcopy listen thread still going, don't synchronize the
>> +         * cpus yet.
>> +         */
>>          return ret;
>>      }
>>
>> @@ -3022,7 +3025,6 @@ int qemu_loadvm_state(QEMUFile *f)
>>          }
>>      }
>>
>> -    qemu_loadvm_state_cleanup();
>>      cpu_synchronize_all_post_init();
>>
>>      return ret;
>> -- 
>> 2.35.3

^ permalink raw reply	[flat|nested] 12+ messages in thread
end of thread, other threads:[~2024-10-08 21:37 UTC | newest]

Thread overview: 12+ messages
2024-09-13 22:05 [PATCH 0/2] migration/multifd: Fix rb->receivedmap cleanup race Fabiano Rosas
2024-09-13 22:05 ` [PATCH 1/2] migration/savevm: Remove extra load cleanup calls Fabiano Rosas
2024-09-17 16:42   ` Peter Xu
2024-09-17 17:17     ` Fabiano Rosas
2024-09-13 22:05 ` [PATCH 2/2] migration/multifd: Fix rb->receivedmap cleanup race Fabiano Rosas
2024-09-17 17:02   ` Peter Xu
2024-09-17 17:41     ` Fabiano Rosas
2024-09-20 18:55   ` Elena Ufimtseva
2024-10-08 21:36     ` Fabiano Rosas
  -- strict thread matches above, loose matches on Subject: below --
2024-09-17 18:58 [PATCH 0/2] migration/multifd: Fix rb->receivedmap cleanup race Fabiano Rosas
2024-09-17 18:58 ` [PATCH 2/2] " Fabiano Rosas
2024-09-17 19:20   ` Peter Xu
2024-09-17 19:29     ` Fabiano Rosas