* [PATCH] migrate/multifd: fix coredump when the multifd thread cleanup

From: z00619469 via @ 2023-06-21 8:18 UTC
To: qemu-devel
Cc: chenyuhui5, xuyinghua3, liheng.liheng, renxuming, pengyi.pengyi, yubihong, zhengchuan, huhao33

From: c00454449 <chenyuhui5@huawei.com>

There is a coredump while trying to destroy a mutex when
p->running is false but p->mutex is still locked.
Make sure all mutexes have been released before destroying them.

Signed-off-by: c00454449 <chenyuhui5@huawei.com>
---
 migration/multifd.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index b7ad7002e0..7dcdb2d3a0 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -523,9 +523,7 @@ void multifd_save_cleanup(void)
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDSendParams *p = &multifd_send_state->params[i];
 
-        if (p->running) {
-            qemu_thread_join(&p->thread);
-        }
+        qemu_thread_join(&p->thread);
     }
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDSendParams *p = &multifd_send_state->params[i];
@@ -1040,8 +1038,8 @@ int multifd_load_cleanup(Error **errp)
              * however try to wakeup it without harm in cleanup phase.
              */
             qemu_sem_post(&p->sem_sync);
-            qemu_thread_join(&p->thread);
         }
+        qemu_thread_join(&p->thread);
     }
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDRecvParams *p = &multifd_recv_state->params[i];
-- 
2.21.0.windows.1
* Re: [PATCH] migrate/multifd: fix coredump when the multifd thread cleanup

From: Peter Xu @ 2023-06-21 13:24 UTC
To: z00619469
Cc: qemu-devel, chenyuhui5, xuyinghua3, liheng.liheng, renxuming, pengyi.pengyi, yubihong, zhengchuan, huhao33, Juan Quintela

On Wed, Jun 21, 2023 at 04:18:26PM +0800, z00619469 via wrote:
> From: c00454449 <chenyuhui5@huawei.com>
>
> There is a coredump while trying to destroy a mutex when
> p->running is false but p->mutex is still locked.
> Make sure all mutexes have been released before destroying them.

It'll be nice to add a backtrace of the coredump here, and also to copy
the maintainer (Juan Quintela, copied now).

>
> Signed-off-by: c00454449 <chenyuhui5@huawei.com>
> ---
>  migration/multifd.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index b7ad7002e0..7dcdb2d3a0 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -523,9 +523,7 @@ void multifd_save_cleanup(void)
>      for (i = 0; i < migrate_multifd_channels(); i++) {
>          MultiFDSendParams *p = &multifd_send_state->params[i];
>
> -        if (p->running) {
> -            qemu_thread_join(&p->thread);
> -        }
> +        qemu_thread_join(&p->thread);

I'm not sure whether this will always work, e.g. when migration fails early
before creating multifd threads?

>      }
>      for (i = 0; i < migrate_multifd_channels(); i++) {
>          MultiFDSendParams *p = &multifd_send_state->params[i];
> @@ -1040,8 +1038,8 @@ int multifd_load_cleanup(Error **errp)
>               * however try to wakeup it without harm in cleanup phase.
>               */
>              qemu_sem_post(&p->sem_sync);
> -            qemu_thread_join(&p->thread);
>          }
> +        qemu_thread_join(&p->thread);
>      }
>      for (i = 0; i < migrate_multifd_channels(); i++) {
>          MultiFDRecvParams *p = &multifd_recv_state->params[i];
> -- 
> 2.21.0.windows.1
>

-- 
Peter Xu
* Re: [PATCH] migrate/multifd: fix coredump when the multifd thread cleanup

From: Fabiano Rosas @ 2023-06-21 14:22 UTC
To: z00619469, qemu-devel
Cc: chenyuhui5, xuyinghua3, liheng.liheng, renxuming, pengyi.pengyi, yubihong, zhengchuan, huhao33

z00619469 via <qemu-devel@nongnu.org> writes:

> From: c00454449 <chenyuhui5@huawei.com>
>
> There is a coredump while trying to destroy a mutex when
> p->running is false but p->mutex is still locked.
> Make sure all mutexes have been released before destroying them.
>
> Signed-off-by: c00454449 <chenyuhui5@huawei.com>
> ---
>  migration/multifd.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index b7ad7002e0..7dcdb2d3a0 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -523,9 +523,7 @@ void multifd_save_cleanup(void)
>      for (i = 0; i < migrate_multifd_channels(); i++) {
>          MultiFDSendParams *p = &multifd_send_state->params[i];
>
> -        if (p->running) {

The need for this flag is dubious IMO. Commit 10351fbad1
("migration/multifd: Join all multifd threads in order to avoid leaks")
already moved the other join outside of it. If we figure out another way
to deal with the sem_sync lockup we could probably remove this
altogether.
* Re: [PATCH] migrate/multifd: fix coredump when the multifd thread cleanup

From: chenyuhui (A) via @ 2023-06-26 13:16 UTC
To: Fabiano Rosas, peterx, qemu-devel
Cc: xuyinghua3, liheng.liheng, renxuming, pengyi.pengyi, yubihong, zhengchuan, huhao33, zhangjianguo18, caozhongwang1

On 2023/6/21 22:22, Fabiano Rosas wrote:
> Jianguo Zhang via <qemu-devel@nongnu.org> writes:
>
>> From: Yuhui Chen <chenyuhui5@huawei.com>
>>
>> There is a coredump while trying to destroy a mutex when
>> p->running is false but p->mutex is still locked.
>> Make sure all mutexes have been released before destroying them.
>>
>> Signed-off-by: Yuhui Chen <chenyuhui5@huawei.com>
>> ---
>>  migration/multifd.c | 6 ++----
>>  1 file changed, 2 insertions(+), 4 deletions(-)
>>
>> diff --git a/migration/multifd.c b/migration/multifd.c
>> index b7ad7002e0..7dcdb2d3a0 100644
>> --- a/migration/multifd.c
>> +++ b/migration/multifd.c
>> @@ -523,9 +523,7 @@ void multifd_save_cleanup(void)
>>      for (i = 0; i < migrate_multifd_channels(); i++) {
>>          MultiFDSendParams *p = &multifd_send_state->params[i];
>>
>> -        if (p->running) {
>
> The need for this flag is dubious IMO. Commit 10351fbad1
> ("migration/multifd: Join all multifd threads in order to avoid leaks")
> already moved the other join outside of it. If we figure out another way
> to deal with the sem_sync lockup we could probably remove this
> altogether.

I've seen this commit 10351fbad1, and it seems to have the same
problem in function multifd_save_cleanup.

So my patch may only need to modify multifd_save_cleanup.

__________________________________________________________________

On 2023/6/21 21:24, Peter Xu wrote:
> On Wed, Jun 21, 2023 at 04:18:26PM +0800, Jianguo Zhang via wrote:
>> From: Yuhui Chen <chenyuhui5@huawei.com>
>>
>> There is a coredump while trying to destroy a mutex when
>> p->running is false but p->mutex is still locked.
>> Make sure all mutexes have been released before destroying them.
>
> It'll be nice to add a backtrace of the coredump here, and also to copy
> the maintainer (Juan Quintela, copied now).
>

The attachment is the coredump, and my code is based on
https://github.com/qemu/qemu.git tag v6.2.0.

How reproducible:
1. Add sleep time to produce p->running is false but p->mutex is
   not unlocked (apply the following debug patch).

From: Yuhui Chen <chenyuhui5@huawei.com>
Date: Mon, 26 Jun 2023 14:24:35 +0800
Subject: [DEBUG][PATCH] Add sleep time to produce p->running is false but p->mutex is not unlocked.

---
 migration/multifd.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/migration/multifd.c b/migration/multifd.c
index 7c9deb1921..09a7b0748a 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -538,6 +538,7 @@ void multifd_save_cleanup(void)
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDSendParams *p = &multifd_send_state->params[i];
 
+        sleep(2);
         if (p->running) {
             qemu_thread_join(&p->thread);
         }
@@ -719,6 +720,7 @@ out:
 
     qemu_mutex_lock(&p->mutex);
     p->running = false;
+    sleep(20);
     qemu_mutex_unlock(&p->mutex);
 
     rcu_unregister_thread();
-- 
2.21.0.windows.1

2. Start a VM, then do migration with --parallel-connections.

>>
>> Signed-off-by: Yuhui Chen <chenyuhui5@huawei.com>
>> ---
>>  migration/multifd.c | 6 ++----
>>  1 file changed, 2 insertions(+), 4 deletions(-)
>>
>> diff --git a/migration/multifd.c b/migration/multifd.c
>> index b7ad7002e0..7dcdb2d3a0 100644
>> --- a/migration/multifd.c
>> +++ b/migration/multifd.c
>> @@ -523,9 +523,7 @@ void multifd_save_cleanup(void)
>>      for (i = 0; i < migrate_multifd_channels(); i++) {
>>          MultiFDSendParams *p = &multifd_send_state->params[i];
>>
>> -        if (p->running) {
>> -            qemu_thread_join(&p->thread);
>> -        }
>> +        qemu_thread_join(&p->thread);
>
> I'm not sure whether this will always work, e.g. when migration fails early
> before creating multifd threads?
>

Theoretically, all threads are created here (otherwise it would already
have thrown an exception in the preceding function
multifd_recv_terminate_threads), so they all need to join regardless of
the p->running state.

>>      }
>>      for (i = 0; i < migrate_multifd_channels(); i++) {
>>          MultiFDSendParams *p = &multifd_send_state->params[i];
>> @@ -1040,8 +1038,8 @@ int multifd_load_cleanup(Error **errp)
>>               * however try to wakeup it without harm in cleanup phase.
>>               */
>>              qemu_sem_post(&p->sem_sync);
>> -            qemu_thread_join(&p->thread);
>>          }
>> +        qemu_thread_join(&p->thread);
>>      }
>>      for (i = 0; i < migrate_multifd_channels(); i++) {
>>          MultiFDRecvParams *p = &multifd_recv_state->params[i];
>> -- 
>> 2.21.0.windows.1
>>
>

[-- Attachment: core.zip, application/x-zip-compressed, 1868017 bytes --]
* Re: [PATCH] migrate/multifd: fix coredump when the multifd thread cleanup

From: chenyuhui (A) via @ 2023-06-27 1:11 UTC
To: Fabiano Rosas, peterx, qemu-devel
Cc: xuyinghua3, liheng.liheng, renxuming, pengyi.pengyi, yubihong, zhengchuan, huhao33, zhangjianguo18, caozhongwang1

On 2023/6/26 21:16, chenyuhui (A) wrote:
>
> On 2023/6/21 22:22, Fabiano Rosas wrote:
>> Jianguo Zhang via <qemu-devel@nongnu.org> writes:
>>
>>> From: Yuhui Chen <chenyuhui5@huawei.com>
>>>
>>> There is a coredump while trying to destroy a mutex when
>>> p->running is false but p->mutex is still locked.
>>> Make sure all mutexes have been released before destroying them.
>>>
>>> Signed-off-by: Yuhui Chen <chenyuhui5@huawei.com>
>>> ---
>>>  migration/multifd.c | 6 ++----
>>>  1 file changed, 2 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/migration/multifd.c b/migration/multifd.c
>>> index b7ad7002e0..7dcdb2d3a0 100644
>>> --- a/migration/multifd.c
>>> +++ b/migration/multifd.c
>>> @@ -523,9 +523,7 @@ void multifd_save_cleanup(void)
>>>      for (i = 0; i < migrate_multifd_channels(); i++) {
>>>          MultiFDSendParams *p = &multifd_send_state->params[i];
>>>
>>> -        if (p->running) {
>>
>> The need for this flag is dubious IMO. Commit 10351fbad1
>> ("migration/multifd: Join all multifd threads in order to avoid leaks")
>> already moved the other join outside of it. If we figure out another way
>> to deal with the sem_sync lockup we could probably remove this
>> altogether.
>
> I've seen this commit 10351fbad1, and it seems to have the same
> problem in function multifd_save_cleanup.
>
> So my patch may only need to modify multifd_save_cleanup.
>
> __________________________________________________________________
>
> On 2023/6/21 21:24, Peter Xu wrote:
>> On Wed, Jun 21, 2023 at 04:18:26PM +0800, Jianguo Zhang via wrote:
>>> From: Yuhui Chen <chenyuhui5@huawei.com>
>>>
>>> There is a coredump while trying to destroy a mutex when
>>> p->running is false but p->mutex is still locked.
>>> Make sure all mutexes have been released before destroying them.
>>
>> It'll be nice to add a backtrace of the coredump here, and also to copy
>> the maintainer (Juan Quintela, copied now).
>
> The following is the coredump, and my code is based on
> https://github.com/qemu/qemu.git tag v6.2.0.

(gdb) bt
#0  0x0000ffffabe3b2b8 in () at /usr/lib64/libc.so.6
#1  0x0000ffffabdf6d7c in raise () at /usr/lib64/libc.so.6
#2  0x0000ffffabde4d2c in abort () at /usr/lib64/libc.so.6
#3  0x0000aaaac67fcc10 in error_exit (err=<optimized out>, msg=msg@entry=0xaaaac6dc52b8 <__func__.33> "qemu_mutex_destroy") at ../util/qemu-thread-posix.c:38
#4  0x0000aaaac67fce38 in qemu_mutex_destroy (mutex=mutex@entry=0xaaaafa1a4250) at ../util/qemu-thread-posix.c:71
#5  0x0000aaaac6055688 in multifd_save_cleanup () at ../migration/multifd.c:555
#6  0x0000aaaac6050198 in migrate_fd_cleanup (s=s@entry=0xaaaaf7518800) at ../migration/migration.c:1808
#7  0x0000aaaac6050384 in migrate_fd_cleanup_bh (opaque=0xaaaaf7518800) at ../migration/migration.c:1850
#8  0x0000aaaac680d790 in aio_bh_call (bh=0xffffa0004c40) at ../util/async.c:141
#9  aio_bh_poll (ctx=ctx@entry=0xaaaaf73285a0) at ../util/async.c:169
#10 0x0000aaaac67f9e18 in aio_dispatch (ctx=0xaaaaf73285a0) at ../util/aio-posix.c:381
#11 0x0000aaaac680d414 in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at ../util/async.c:311
#12 0x0000ffffac44cf88 in g_main_context_dispatch () at /usr/lib64/libglib-2.0.so.0
#13 0x0000aaaac6819214 in glib_pollfds_poll () at ../util/main-loop.c:232
#14 os_host_main_loop_wait (timeout=735000000) at ../util/main-loop.c:255
#15 main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:531
#16 0x0000aaaac65005cc in qemu_main_loop () at ../softmmu/runstate.c:726
#17 0x0000aaaac5fe2030 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at ../softmmu/main.c:50
(gdb) q

> How reproducible:
> 1. Add sleep time to produce p->running is false but p->mutex is
>    not unlocked (apply the following debug patch).
>
> From: Yuhui Chen <chenyuhui5@huawei.com>
> Date: Mon, 26 Jun 2023 14:24:35 +0800
> Subject: [DEBUG][PATCH] Add sleep time to produce p->running is false but p->mutex is not unlocked.
>
> ---
>  migration/multifd.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 7c9deb1921..09a7b0748a 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -538,6 +538,7 @@ void multifd_save_cleanup(void)
>      for (i = 0; i < migrate_multifd_channels(); i++) {
>          MultiFDSendParams *p = &multifd_send_state->params[i];
>
> +        sleep(2);
>          if (p->running) {
>              qemu_thread_join(&p->thread);
>          }
> @@ -719,6 +720,7 @@ out:
>
>      qemu_mutex_lock(&p->mutex);
>      p->running = false;
> +    sleep(20);
>      qemu_mutex_unlock(&p->mutex);
>
>      rcu_unregister_thread();
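
With the two sleeps applied, the interleaving that triggers the abort is
roughly the following (a schematic timeline only, restating the debug
patch and the backtrace above, not code from the tree):

    // multifd_send_thread()          main thread: multifd_save_cleanup()
    // ---------------------          -----------------------------------
    // qemu_mutex_lock(&p->mutex);
    // p->running = false;
    // sleep(20);                     sleep(2);
    //                                if (p->running)  -> false, join skipped
    //                                ...
    //                                qemu_mutex_destroy(&p->mutex);
    //                                -> the mutex is still held by the send
    //                                   thread, so the destroy fails and
    //                                   error_exit() aborts QEMU (frames
    //                                   #3-#5 in the backtrace above)
    // qemu_mutex_unlock(&p->mutex);  too late, QEMU already aborted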
* Re: [PATCH] migrate/multifd: fix coredump when the multifd thread cleanup

From: chenyuhui (A) via @ 2023-07-25 8:43 UTC
To: Fabiano Rosas, peterx, qemu-devel
Cc: xuyinghua3, liheng.liheng, renxuming, pengyi.pengyi, yubihong, zhengchuan, huhao33, caozhongwang1, caiqingmu

@Peter Xu @Fabiano Rosas
Kindly ping on this.

On 2023/6/27 9:11, chenyuhui (A) wrote:
>
> On 2023/6/26 21:16, chenyuhui (A) wrote:
>>
>> On 2023/6/21 22:22, Fabiano Rosas wrote:
>>> Jianguo Zhang via <qemu-devel@nongnu.org> writes:
>>>
>>>> From: Yuhui Chen <chenyuhui5@huawei.com>
>>>>
>>>> There is a coredump while trying to destroy a mutex when
>>>> p->running is false but p->mutex is still locked.
>>>> Make sure all mutexes have been released before destroying them.
>>>>
>>>> Signed-off-by: Yuhui Chen <chenyuhui5@huawei.com>
>>>> ---
>>>>  migration/multifd.c | 6 ++----
>>>>  1 file changed, 2 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/migration/multifd.c b/migration/multifd.c
>>>> index b7ad7002e0..7dcdb2d3a0 100644
>>>> --- a/migration/multifd.c
>>>> +++ b/migration/multifd.c
>>>> @@ -523,9 +523,7 @@ void multifd_save_cleanup(void)
>>>>      for (i = 0; i < migrate_multifd_channels(); i++) {
>>>>          MultiFDSendParams *p = &multifd_send_state->params[i];
>>>>
>>>> -        if (p->running) {
>>>
>>> The need for this flag is dubious IMO. Commit 10351fbad1
>>> ("migration/multifd: Join all multifd threads in order to avoid leaks")
>>> already moved the other join outside of it. If we figure out another way
>>> to deal with the sem_sync lockup we could probably remove this
>>> altogether.
>>
>> I've seen this commit 10351fbad1, and it seems to have the same
>> problem in function multifd_save_cleanup.
>>
>> So my patch may only need to modify multifd_save_cleanup.
>>
>> __________________________________________________________________
>>
>> On 2023/6/21 21:24, Peter Xu wrote:
>>> On Wed, Jun 21, 2023 at 04:18:26PM +0800, Jianguo Zhang via wrote:
>>>> From: Yuhui Chen <chenyuhui5@huawei.com>
>>>>
>>>> There is a coredump while trying to destroy a mutex when
>>>> p->running is false but p->mutex is still locked.
>>>> Make sure all mutexes have been released before destroying them.
>>>
>>> It'll be nice to add a backtrace of the coredump here, and also to copy
>>> the maintainer (Juan Quintela, copied now).
>>
>> The following is the coredump, and my code is based on
>> https://github.com/qemu/qemu.git tag v6.2.0.
>
> (gdb) bt
> #0  0x0000ffffabe3b2b8 in () at /usr/lib64/libc.so.6
> #1  0x0000ffffabdf6d7c in raise () at /usr/lib64/libc.so.6
> #2  0x0000ffffabde4d2c in abort () at /usr/lib64/libc.so.6
> #3  0x0000aaaac67fcc10 in error_exit (err=<optimized out>, msg=msg@entry=0xaaaac6dc52b8 <__func__.33> "qemu_mutex_destroy") at ../util/qemu-thread-posix.c:38
> #4  0x0000aaaac67fce38 in qemu_mutex_destroy (mutex=mutex@entry=0xaaaafa1a4250) at ../util/qemu-thread-posix.c:71
> #5  0x0000aaaac6055688 in multifd_save_cleanup () at ../migration/multifd.c:555
> #6  0x0000aaaac6050198 in migrate_fd_cleanup (s=s@entry=0xaaaaf7518800) at ../migration/migration.c:1808
> #7  0x0000aaaac6050384 in migrate_fd_cleanup_bh (opaque=0xaaaaf7518800) at ../migration/migration.c:1850
> #8  0x0000aaaac680d790 in aio_bh_call (bh=0xffffa0004c40) at ../util/async.c:141
> #9  aio_bh_poll (ctx=ctx@entry=0xaaaaf73285a0) at ../util/async.c:169
> #10 0x0000aaaac67f9e18 in aio_dispatch (ctx=0xaaaaf73285a0) at ../util/aio-posix.c:381
> #11 0x0000aaaac680d414 in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at ../util/async.c:311
> #12 0x0000ffffac44cf88 in g_main_context_dispatch () at /usr/lib64/libglib-2.0.so.0
> #13 0x0000aaaac6819214 in glib_pollfds_poll () at ../util/main-loop.c:232
> #14 os_host_main_loop_wait (timeout=735000000) at ../util/main-loop.c:255
> #15 main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:531
> #16 0x0000aaaac65005cc in qemu_main_loop () at ../softmmu/runstate.c:726
> #17 0x0000aaaac5fe2030 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at ../softmmu/main.c:50
> (gdb) q
>
>> How reproducible:
>> 1. Add sleep time to produce p->running is false but p->mutex is
>>    not unlocked (apply the following patch).
>> 2. Do migration with --parallel-connections.
>>
>> From: Yuhui Chen <chenyuhui5@huawei.com>
>> Date: Mon, 26 Jun 2023 14:24:35 +0800
>> Subject: [DEBUG][PATCH] Add sleep time to produce p->running is false but p->mutex is not unlocked.
>>
>> ---
>>  migration/multifd.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/migration/multifd.c b/migration/multifd.c
>> index 7c9deb1921..09a7b0748a 100644
>> --- a/migration/multifd.c
>> +++ b/migration/multifd.c
>> @@ -538,6 +538,7 @@ void multifd_save_cleanup(void)
>>      for (i = 0; i < migrate_multifd_channels(); i++) {
>>          MultiFDSendParams *p = &multifd_send_state->params[i];
>>
>> +        sleep(2);
>>          if (p->running) {
>>              qemu_thread_join(&p->thread);
>>          }
>> @@ -719,6 +720,7 @@ out:
>>
>>      qemu_mutex_lock(&p->mutex);
>>      p->running = false;
>> +    sleep(20);
>>      qemu_mutex_unlock(&p->mutex);
>>
>>      rcu_unregister_thread();
* Re: [PATCH] migrate/multifd: fix coredump when the multifd thread cleanup

From: Peter Xu @ 2023-07-25 16:53 UTC
To: chenyuhui (A)
Cc: Fabiano Rosas, qemu-devel, xuyinghua3, liheng.liheng, renxuming, pengyi.pengyi, yubihong, zhengchuan, huhao33, caozhongwang1, caiqingmu, Juan Quintela

On Tue, Jul 25, 2023 at 04:43:28PM +0800, chenyuhui (A) wrote:
> @Peter Xu @Fabiano Rosas
> Kindly ping on this.

Ah I see what's missing - please copy the maintainer (Juan) for any
migration patches, especially multifd ones. I'm doing that for this one,
but I'd suggest you repost with the whole patch and this information put
into the commit message.

Thanks.

>
> On 2023/6/27 9:11, chenyuhui (A) wrote:
> >
> > On 2023/6/26 21:16, chenyuhui (A) wrote:
> >>
> >> On 2023/6/21 22:22, Fabiano Rosas wrote:
> >>> Jianguo Zhang via <qemu-devel@nongnu.org> writes:
> >>>
> >>>> From: Yuhui Chen <chenyuhui5@huawei.com>
> >>>>
> >>>> There is a coredump while trying to destroy a mutex when
> >>>> p->running is false but p->mutex is still locked.
> >>>> Make sure all mutexes have been released before destroying them.
> >>>>
> >>>> Signed-off-by: Yuhui Chen <chenyuhui5@huawei.com>
> >>>> ---
> >>>>  migration/multifd.c | 6 ++----
> >>>>  1 file changed, 2 insertions(+), 4 deletions(-)
> >>>>
> >>>> diff --git a/migration/multifd.c b/migration/multifd.c
> >>>> index b7ad7002e0..7dcdb2d3a0 100644
> >>>> --- a/migration/multifd.c
> >>>> +++ b/migration/multifd.c
> >>>> @@ -523,9 +523,7 @@ void multifd_save_cleanup(void)
> >>>>      for (i = 0; i < migrate_multifd_channels(); i++) {
> >>>>          MultiFDSendParams *p = &multifd_send_state->params[i];
> >>>>
> >>>> -        if (p->running) {
> >>>
> >>> The need for this flag is dubious IMO. Commit 10351fbad1
> >>> ("migration/multifd: Join all multifd threads in order to avoid leaks")
> >>> already moved the other join outside of it. If we figure out another way
> >>> to deal with the sem_sync lockup we could probably remove this
> >>> altogether.
> >>
> >> I've seen this commit 10351fbad1, and it seems to have the same
> >> problem in function multifd_save_cleanup.
> >>
> >> So my patch may only need to modify multifd_save_cleanup.
> >>
> >> __________________________________________________________________
> >>
> >> On 2023/6/21 21:24, Peter Xu wrote:
> >>> On Wed, Jun 21, 2023 at 04:18:26PM +0800, Jianguo Zhang via wrote:
> >>>> From: Yuhui Chen <chenyuhui5@huawei.com>
> >>>>
> >>>> There is a coredump while trying to destroy a mutex when
> >>>> p->running is false but p->mutex is still locked.
> >>>> Make sure all mutexes have been released before destroying them.
> >>>
> >>> It'll be nice to add a backtrace of the coredump here, and also to copy
> >>> the maintainer (Juan Quintela, copied now).
> >>
> >> The following is the coredump, and my code is based on
> >> https://github.com/qemu/qemu.git tag v6.2.0.
> >
> > (gdb) bt
> > #0  0x0000ffffabe3b2b8 in () at /usr/lib64/libc.so.6
> > #1  0x0000ffffabdf6d7c in raise () at /usr/lib64/libc.so.6
> > #2  0x0000ffffabde4d2c in abort () at /usr/lib64/libc.so.6
> > #3  0x0000aaaac67fcc10 in error_exit (err=<optimized out>, msg=msg@entry=0xaaaac6dc52b8 <__func__.33> "qemu_mutex_destroy") at ../util/qemu-thread-posix.c:38
> > #4  0x0000aaaac67fce38 in qemu_mutex_destroy (mutex=mutex@entry=0xaaaafa1a4250) at ../util/qemu-thread-posix.c:71
> > #5  0x0000aaaac6055688 in multifd_save_cleanup () at ../migration/multifd.c:555
> > #6  0x0000aaaac6050198 in migrate_fd_cleanup (s=s@entry=0xaaaaf7518800) at ../migration/migration.c:1808
> > #7  0x0000aaaac6050384 in migrate_fd_cleanup_bh (opaque=0xaaaaf7518800) at ../migration/migration.c:1850
> > #8  0x0000aaaac680d790 in aio_bh_call (bh=0xffffa0004c40) at ../util/async.c:141
> > #9  aio_bh_poll (ctx=ctx@entry=0xaaaaf73285a0) at ../util/async.c:169
> > #10 0x0000aaaac67f9e18 in aio_dispatch (ctx=0xaaaaf73285a0) at ../util/aio-posix.c:381
> > #11 0x0000aaaac680d414 in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at ../util/async.c:311
> > #12 0x0000ffffac44cf88 in g_main_context_dispatch () at /usr/lib64/libglib-2.0.so.0
> > #13 0x0000aaaac6819214 in glib_pollfds_poll () at ../util/main-loop.c:232
> > #14 os_host_main_loop_wait (timeout=735000000) at ../util/main-loop.c:255
> > #15 main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:531
> > #16 0x0000aaaac65005cc in qemu_main_loop () at ../softmmu/runstate.c:726
> > #17 0x0000aaaac5fe2030 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at ../softmmu/main.c:50
> > (gdb) q
> >
> >> How reproducible:
> >> 1. Add sleep time to produce p->running is false but p->mutex is
> >>    not unlocked (apply the following patch).
> >> 2. Do migration with --parallel-connections.
> >>
> >> From: Yuhui Chen <chenyuhui5@huawei.com>
> >> Date: Mon, 26 Jun 2023 14:24:35 +0800
> >> Subject: [DEBUG][PATCH] Add sleep time to produce p->running is false but p->mutex is not unlocked.
> >>
> >> ---
> >>  migration/multifd.c | 2 ++
> >>  1 file changed, 2 insertions(+)
> >>
> >> diff --git a/migration/multifd.c b/migration/multifd.c
> >> index 7c9deb1921..09a7b0748a 100644
> >> --- a/migration/multifd.c
> >> +++ b/migration/multifd.c
> >> @@ -538,6 +538,7 @@ void multifd_save_cleanup(void)
> >>      for (i = 0; i < migrate_multifd_channels(); i++) {
> >>          MultiFDSendParams *p = &multifd_send_state->params[i];
> >>
> >> +        sleep(2);
> >>          if (p->running) {
> >>              qemu_thread_join(&p->thread);
> >>          }
> >> @@ -719,6 +720,7 @@ out:
> >>
> >>      qemu_mutex_lock(&p->mutex);
> >>      p->running = false;
> >> +    sleep(20);
> >>      qemu_mutex_unlock(&p->mutex);
> >>
> >>      rcu_unregister_thread();

-- 
Peter Xu
* Re: [PATCH] migrate/multifd: fix coredump when the multifd thread cleanup

From: chenyuhui (A) via @ 2023-07-27 1:46 UTC
To: Juan Quintela, Fabiano Rosas, qemu-devel
Cc: xuyinghua3, liheng.liheng, renxuming, pengyi.pengyi, yubihong, zhengchuan, huhao33, caozhongwang1, caiqingmu, Peter Xu

On 2023/7/26 0:53, Peter Xu wrote:
> On Tue, Jul 25, 2023 at 04:43:28PM +0800, chenyuhui (A) wrote:
>> @Peter Xu @Fabiano Rosas
>> Kindly ping on this.
>
> Ah I see what's missing - please copy the maintainer (Juan) for any
> migration patches, especially multifd ones. I'm doing that for this one,
> but I'd suggest you repost with the whole patch and this information put
> into the commit message.
>
> Thanks.
>

Hi, Juan
This is a patch for migration, please take a look.

From: Yuhui Chen <chenyuhui5@huawei.com>
Date: Tue, 25 Jul 2023 10:45:48 +0800
Subject: [PATCH] migrate/multifd: fix coredump when the multifd thread cleanup

There is a coredump while trying to destroy a mutex when
p->running is false but p->mutex is still locked.
Make sure all mutexes have been released before destroying them.

Signed-off-by: Yuhui Chen <chenyuhui5@huawei.com>
---
 migration/multifd.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 3387d8277f..3a085bf3ec 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -521,9 +521,7 @@ void multifd_save_cleanup(void)
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDSendParams *p = &multifd_send_state->params[i];
 
-        if (p->running) {
-            qemu_thread_join(&p->thread);
-        }
+        qemu_thread_join(&p->thread);
     }
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDSendParams *p = &multifd_send_state->params[i];
-- 
2.21.0.windows.1

>> On 2023/6/27 9:11, chenyuhui (A) wrote:
>>>
>>> On 2023/6/26 21:16, chenyuhui (A) wrote:
>>>>
>>>> On 2023/6/21 22:22, Fabiano Rosas wrote:
>>>>> Jianguo Zhang via <qemu-devel@nongnu.org> writes:
>>>>>
>>>>>> From: Yuhui Chen <chenyuhui5@huawei.com>
>>>>>>
>>>>>> There is a coredump while trying to destroy a mutex when
>>>>>> p->running is false but p->mutex is still locked.
>>>>>> Make sure all mutexes have been released before destroying them.
>>>>>>
>>>>>> Signed-off-by: Yuhui Chen <chenyuhui5@huawei.com>
>>>>>> ---
>>>>>>  migration/multifd.c | 6 ++----
>>>>>>  1 file changed, 2 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/migration/multifd.c b/migration/multifd.c
>>>>>> index b7ad7002e0..7dcdb2d3a0 100644
>>>>>> --- a/migration/multifd.c
>>>>>> +++ b/migration/multifd.c
>>>>>> @@ -523,9 +523,7 @@ void multifd_save_cleanup(void)
>>>>>>      for (i = 0; i < migrate_multifd_channels(); i++) {
>>>>>>          MultiFDSendParams *p = &multifd_send_state->params[i];
>>>>>>
>>>>>> -        if (p->running) {
>>>>>
>>>>> The need for this flag is dubious IMO. Commit 10351fbad1
>>>>> ("migration/multifd: Join all multifd threads in order to avoid leaks")
>>>>> already moved the other join outside of it. If we figure out another way
>>>>> to deal with the sem_sync lockup we could probably remove this
>>>>> altogether.
>>>>
>>>> I've seen this commit 10351fbad1, and it seems to have the same
>>>> problem in function multifd_save_cleanup.
>>>>
>>>> So my patch may only need to modify multifd_save_cleanup.
>>>>
>>>> __________________________________________________________________
>>>>
>>>> On 2023/6/21 21:24, Peter Xu wrote:
>>>>> On Wed, Jun 21, 2023 at 04:18:26PM +0800, Jianguo Zhang via wrote:
>>>>>> From: Yuhui Chen <chenyuhui5@huawei.com>
>>>>>>
>>>>>> There is a coredump while trying to destroy a mutex when
>>>>>> p->running is false but p->mutex is still locked.
>>>>>> Make sure all mutexes have been released before destroying them.
>>>>>
>>>>> It'll be nice to add a backtrace of the coredump here, and also to copy
>>>>> the maintainer (Juan Quintela, copied now).
>>>>
>>>> The following is the coredump, and my code is based on
>>>> https://github.com/qemu/qemu.git tag v6.2.0.
>>>
>>> (gdb) bt
>>> #0  0x0000ffffabe3b2b8 in () at /usr/lib64/libc.so.6
>>> #1  0x0000ffffabdf6d7c in raise () at /usr/lib64/libc.so.6
>>> #2  0x0000ffffabde4d2c in abort () at /usr/lib64/libc.so.6
>>> #3  0x0000aaaac67fcc10 in error_exit (err=<optimized out>, msg=msg@entry=0xaaaac6dc52b8 <__func__.33> "qemu_mutex_destroy") at ../util/qemu-thread-posix.c:38
>>> #4  0x0000aaaac67fce38 in qemu_mutex_destroy (mutex=mutex@entry=0xaaaafa1a4250) at ../util/qemu-thread-posix.c:71
>>> #5  0x0000aaaac6055688 in multifd_save_cleanup () at ../migration/multifd.c:555
>>> #6  0x0000aaaac6050198 in migrate_fd_cleanup (s=s@entry=0xaaaaf7518800) at ../migration/migration.c:1808
>>> #7  0x0000aaaac6050384 in migrate_fd_cleanup_bh (opaque=0xaaaaf7518800) at ../migration/migration.c:1850
>>> #8  0x0000aaaac680d790 in aio_bh_call (bh=0xffffa0004c40) at ../util/async.c:141
>>> #9  aio_bh_poll (ctx=ctx@entry=0xaaaaf73285a0) at ../util/async.c:169
>>> #10 0x0000aaaac67f9e18 in aio_dispatch (ctx=0xaaaaf73285a0) at ../util/aio-posix.c:381
>>> #11 0x0000aaaac680d414 in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at ../util/async.c:311
>>> #12 0x0000ffffac44cf88 in g_main_context_dispatch () at /usr/lib64/libglib-2.0.so.0
>>> #13 0x0000aaaac6819214 in glib_pollfds_poll () at ../util/main-loop.c:232
>>> #14 os_host_main_loop_wait (timeout=735000000) at ../util/main-loop.c:255
>>> #15 main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:531
>>> #16 0x0000aaaac65005cc in qemu_main_loop () at ../softmmu/runstate.c:726
>>> #17 0x0000aaaac5fe2030 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at ../softmmu/main.c:50
>>> (gdb) q
>>>
>>>> How reproducible:
>>>> 1. Add sleep time to produce p->running is false but p->mutex is
>>>>    not unlocked (apply the following patch).
>>>> 2. Do migration with --parallel-connections.
>>>>
>>>> From: Yuhui Chen <chenyuhui5@huawei.com>
>>>> Date: Mon, 26 Jun 2023 14:24:35 +0800
>>>> Subject: [DEBUG][PATCH] Add sleep time to produce p->running is false but p->mutex is not unlocked.
>>>>
>>>> ---
>>>>  migration/multifd.c | 2 ++
>>>>  1 file changed, 2 insertions(+)
>>>>
>>>> diff --git a/migration/multifd.c b/migration/multifd.c
>>>> index 7c9deb1921..09a7b0748a 100644
>>>> --- a/migration/multifd.c
>>>> +++ b/migration/multifd.c
>>>> @@ -538,6 +538,7 @@ void multifd_save_cleanup(void)
>>>>      for (i = 0; i < migrate_multifd_channels(); i++) {
>>>>          MultiFDSendParams *p = &multifd_send_state->params[i];
>>>>
>>>> +        sleep(2);
>>>>          if (p->running) {
>>>>              qemu_thread_join(&p->thread);
>>>>          }
>>>> @@ -719,6 +720,7 @@ out:
>>>>
>>>>      qemu_mutex_lock(&p->mutex);
>>>>      p->running = false;
>>>> +    sleep(20);
>>>>      qemu_mutex_unlock(&p->mutex);
>>>>
>>>>      rcu_unregister_thread();
>>
>
* Re: [PATCH] migrate/multifd: fix coredump when the multifd thread cleanup

From: Fabiano Rosas @ 2023-08-14 14:55 UTC
To: chenyuhui (A), Juan Quintela, qemu-devel
Cc: xuyinghua3, liheng.liheng, renxuming, pengyi.pengyi, yubihong, zhengchuan, huhao33, caozhongwang1, caiqingmu, Peter Xu

"chenyuhui (A)" via <qemu-devel@nongnu.org> writes:

> On 2023/7/26 0:53, Peter Xu wrote:
>> On Tue, Jul 25, 2023 at 04:43:28PM +0800, chenyuhui (A) wrote:
>>> @Peter Xu @Fabiano Rosas
>>> Kindly ping on this.
>>
>> Ah I see what's missing - please copy the maintainer (Juan) for any
>> migration patches, especially multifd ones. I'm doing that for this one,
>> but I'd suggest you repost with the whole patch and this information put
>> into the commit message.
>>
>> Thanks.
>>
> Hi, Juan
> This is a patch for migration, please take a look.
>
> From: Yuhui Chen <chenyuhui5@huawei.com>
> Date: Tue, 25 Jul 2023 10:45:48 +0800
> Subject: [PATCH] migrate/multifd: fix coredump when the multifd thread cleanup
>
> There is a coredump while trying to destroy a mutex when
> p->running is false but p->mutex is still locked.
> Make sure all mutexes have been released before destroying them.
>
> Signed-off-by: Yuhui Chen <chenyuhui5@huawei.com>
> ---
>  migration/multifd.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 3387d8277f..3a085bf3ec 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -521,9 +521,7 @@ void multifd_save_cleanup(void)
>      for (i = 0; i < migrate_multifd_channels(); i++) {
>          MultiFDSendParams *p = &multifd_send_state->params[i];
>
> -        if (p->running) {
> -            qemu_thread_join(&p->thread);
> -        }
> +        qemu_thread_join(&p->thread);
>      }
>      for (i = 0; i < migrate_multifd_channels(); i++) {
>          MultiFDSendParams *p = &multifd_send_state->params[i];

Hi,

I took another look at this and noticed that we're not even holding the
lock in the other threads when accessing p->running. Also, the
save_cleanup() function can be called before qemu_thread_create(), which
means p->thread.thread would have a bogus value and qemu_thread_join()
would abort the QEMU process.

What we need here is to stop setting p->running from inside the thread
altogether. The flag is only effectively protecting the
qemu_thread_join() from being called before the thread's creation.

We can post to sem_sync regardless of the thread state because we only
destroy it a few lines down in the same function. And there's already
p->quit which is being accessed properly and sends the thread out of the
loop.

So I suggest this pattern:

    qemu_sem_post(&p->sem_sync);
    if (p->running) {
        qemu_thread_join(&p->thread);
        p->running = false;
    }

It would also be nice to move the p->running = true closer to the
qemu_thread_create() call at multifd_channel_connect().
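
Concretely, applied to the multifd_save_cleanup() from the repost above,
the result would look something like this (an untested sketch just to
illustrate the shape; the field names are taken from the diffs quoted in
this thread, and the rest of the per-channel teardown is elided):

    void multifd_save_cleanup(void)
    {
        int i;

        /* ... earlier shutdown steps unchanged ... */

        for (i = 0; i < migrate_multifd_channels(); i++) {
            MultiFDSendParams *p = &multifd_send_state->params[i];

            /*
             * Waking the thread is safe even if it never ran or has
             * already exited: sem_sync is only destroyed further down
             * in this same function.
             */
            qemu_sem_post(&p->sem_sync);

            /*
             * p->running would now be set only by the main thread, right
             * after qemu_thread_create() in multifd_channel_connect(), so
             * here it simply means "the thread was created and must be
             * joined".
             */
            if (p->running) {
                qemu_thread_join(&p->thread);
                p->running = false;
            }
        }
        for (i = 0; i < migrate_multifd_channels(); i++) {
            MultiFDSendParams *p = &multifd_send_state->params[i];

            /* All threads are joined; destroying p->mutex is now safe. */
            qemu_mutex_destroy(&p->mutex);
            /* ... rest of the per-channel teardown unchanged ... */
        }
    }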