* test-blockjob: intermittent CI failures in msys2-64bit job @ 2023-03-03 18:36 Peter Maydell 2023-03-06 11:16 ` Peter Maydell 0 siblings, 1 reply; 12+ messages in thread From: Peter Maydell @ 2023-03-03 18:36 UTC (permalink / raw) To: QEMU Developers, Qemu-block I've noticed that test-blockjob seems to fail intermittently on the msys2-64bit job: https://gitlab.com/qemu-project/qemu/-/jobs/3872508803 https://gitlab.com/qemu-project/qemu/-/jobs/3871061024 https://gitlab.com/qemu-project/qemu/-/jobs/3865312440 Sample output: | 53/89 ERROR:../tests/unit/test-blockjob.c:499:test_complete_in_standby: assertion failed: (job->status == JOB_STATUS_STANDBY) ERROR 53/89 qemu:unit / test-blockjob ERROR 0.08s exit status 3 I haven't looked at whether these jobs all failed on the same subtest or whether there is better info lurking in some log file in the pipeline's artefacts. thanks -- PMM ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: test-blockjob: intermittent CI failures in msys2-64bit job 2023-03-03 18:36 test-blockjob: intermittent CI failures in msys2-64bit job Peter Maydell @ 2023-03-06 11:16 ` Peter Maydell 2023-03-17 10:17 ` Peter Maydell 0 siblings, 1 reply; 12+ messages in thread From: Peter Maydell @ 2023-03-06 11:16 UTC (permalink / raw) To: QEMU Developers, Qemu-block On Fri, 3 Mar 2023 at 18:36, Peter Maydell <peter.maydell@linaro.org> wrote: > > I've noticed that test-blockjob seems to fail intermittently > on the msys2-64bit job: > > https://gitlab.com/qemu-project/qemu/-/jobs/3872508803 > https://gitlab.com/qemu-project/qemu/-/jobs/3871061024 > https://gitlab.com/qemu-project/qemu/-/jobs/3865312440 > > Sample output: > | 53/89 ERROR:../tests/unit/test-blockjob.c:499:test_complete_in_standby: > assertion failed: (job->status == JOB_STATUS_STANDBY) ERROR > 53/89 qemu:unit / test-blockjob ERROR 0.08s exit status 3 > > I haven't looked at whether these jobs all failed on the > same subtest or whether there is better info lurking in > some log file in the pipeline's artefacts. Here's an intermittent failure from my macos x86 machine: 172/621 qemu:unit / test-blockjob ERROR 0.26s killed by signal 6 SIGABRT 11:03:46 MALLOC_PERTURB_=176 G_TEST_SRCDIR=/Users/pm215/src/qemu-for-merges/tests/unit G_TEST_BUILDDIR=/Users/pm215/src/qemu-for-merges/build/all/tests/unit /Users/pm215/src/qemu-for-merges/build/all/tests/unit/test-blockjob --tap -k ----------------------------------- output ----------------------------------- stdout: # random seed: R02S8c79d6e1c01ce0b25475b2210a253242 1..9 # Start of blockjob tests ok 1 /blockjob/ids stderr: Assertion failed: (job->status == JOB_STATUS_STANDBY), function test_complete_in_standby, file ../../tests/unit/test-blockjob.c, line 499. 
TAP parsing error: Too few tests run (expected 9, got 1) (test program exited with status code -6) ------------------------------------------------------------------------------ thanks -- PMM ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: test-blockjob: intermittent CI failures in msys2-64bit job 2023-03-06 11:16 ` Peter Maydell @ 2023-03-17 10:17 ` Peter Maydell 2023-03-17 12:35 ` Thomas Huth 0 siblings, 1 reply; 12+ messages in thread From: Peter Maydell @ 2023-03-17 10:17 UTC (permalink / raw) To: QEMU Developers, Qemu-block On Mon, 6 Mar 2023 at 11:16, Peter Maydell <peter.maydell@linaro.org> wrote: > > On Fri, 3 Mar 2023 at 18:36, Peter Maydell <peter.maydell@linaro.org> wrote: > > > > I've noticed that test-blockjob seems to fail intermittently > > on the msys2-64bit job: > > > > https://gitlab.com/qemu-project/qemu/-/jobs/3872508803 > > https://gitlab.com/qemu-project/qemu/-/jobs/3871061024 > > https://gitlab.com/qemu-project/qemu/-/jobs/3865312440 > > > > Sample output: > > | 53/89 ERROR:../tests/unit/test-blockjob.c:499:test_complete_in_standby: > > assertion failed: (job->status == JOB_STATUS_STANDBY) ERROR > > 53/89 qemu:unit / test-blockjob ERROR 0.08s exit status 3 > Here's an intermittent failure from my macos x86 machine: > > 172/621 qemu:unit / test-blockjob > ERROR 0.26s killed by signal 6 SIGABRT And an intermittent on the freebsd 13 CI job: https://gitlab.com/qemu-project/qemu/-/jobs/3950667240 >>> MALLOC_PERTURB_=197 G_TEST_BUILDDIR=/tmp/cirrus-ci-build/build/tests/unit G_TEST_SRCDIR=/tmp/cirrus-ci-build/tests/unit /tmp/cirrus-ci-build/build/tests/unit/test-blockjob --tap -k ▶ 178/650 /blockjob/ids OK 178/650 qemu:unit / test-blockjob ERROR 0.31s killed by signal 6 SIGABRT ――――――――――――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――――――――――― stderr: Assertion failed: (job->status == JOB_STATUS_STANDBY), function test_complete_in_standby, file ../tests/unit/test-blockjob.c, line 499. TAP parsing error: Too few tests run (expected 9, got 1) (test program exited with status code -6) ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― Anybody in the block team looking at these, or shall we just disable this test entirely ? 
thanks -- PMM ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: test-blockjob: intermittent CI failures in msys2-64bit job 2023-03-17 10:17 ` Peter Maydell @ 2023-03-17 12:35 ` Thomas Huth 2023-04-21 10:13 ` Vladimir Sementsov-Ogievskiy 0 siblings, 1 reply; 12+ messages in thread From: Thomas Huth @ 2023-03-17 12:35 UTC (permalink / raw) To: Peter Maydell, John Snow, Alberto Garcia, Vladimir Sementsov-Ogievskiy Cc: QEMU Developers, Qemu-block On 17/03/2023 11.17, Peter Maydell wrote: > On Mon, 6 Mar 2023 at 11:16, Peter Maydell <peter.maydell@linaro.org> wrote: >> >> On Fri, 3 Mar 2023 at 18:36, Peter Maydell <peter.maydell@linaro.org> wrote: >>> >>> I've noticed that test-blockjob seems to fail intermittently >>> on the msys2-64bit job: >>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/3872508803 >>> https://gitlab.com/qemu-project/qemu/-/jobs/3871061024 >>> https://gitlab.com/qemu-project/qemu/-/jobs/3865312440 >>> >>> Sample output: >>> | 53/89 ERROR:../tests/unit/test-blockjob.c:499:test_complete_in_standby: >>> assertion failed: (job->status == JOB_STATUS_STANDBY) ERROR >>> 53/89 qemu:unit / test-blockjob ERROR 0.08s exit status 3 > >> Here's an intermittent failure from my macos x86 machine: >> >> 172/621 qemu:unit / test-blockjob >> ERROR 0.26s killed by signal 6 SIGABRT > > And an intermittent on the freebsd 13 CI job: > https://gitlab.com/qemu-project/qemu/-/jobs/3950667240 > >>>> MALLOC_PERTURB_=197 G_TEST_BUILDDIR=/tmp/cirrus-ci-build/build/tests/unit G_TEST_SRCDIR=/tmp/cirrus-ci-build/tests/unit /tmp/cirrus-ci-build/build/tests/unit/test-blockjob --tap -k > ▶ 178/650 /blockjob/ids > OK > 178/650 qemu:unit / test-blockjob > ERROR 0.31s killed by signal 6 SIGABRT > ――――――――――――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――――――――――― > stderr: > Assertion failed: (job->status == JOB_STATUS_STANDBY), function > test_complete_in_standby, file ../tests/unit/test-blockjob.c, line > 499. 
> > > TAP parsing error: Too few tests run (expected 9, got 1) > (test program exited with status code -6) > ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― > > Anybody in the block team looking at these, or shall we just > disable this test entirely ? I ran into this issue today, too: https://gitlab.com/thuth/qemu/-/jobs/3954367101 ... if nobody is interested in fixing this test, I think we should disable it... Thomas ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: test-blockjob: intermittent CI failures in msys2-64bit job 2023-03-17 12:35 ` Thomas Huth @ 2023-04-21 10:13 ` Vladimir Sementsov-Ogievskiy 2023-04-24 13:36 ` Emanuele Giuseppe Esposito 0 siblings, 1 reply; 12+ messages in thread From: Vladimir Sementsov-Ogievskiy @ 2023-04-21 10:13 UTC (permalink / raw) To: Thomas Huth, Peter Maydell, John Snow, Alberto Garcia Cc: QEMU Developers, Qemu-block, Emanuele Giuseppe Esposito, Kevin Wolf On 17.03.23 15:35, Thomas Huth wrote: > On 17/03/2023 11.17, Peter Maydell wrote: >> On Mon, 6 Mar 2023 at 11:16, Peter Maydell <peter.maydell@linaro.org> wrote: >>> >>> On Fri, 3 Mar 2023 at 18:36, Peter Maydell <peter.maydell@linaro.org> wrote: >>>> >>>> I've noticed that test-blockjob seems to fail intermittently >>>> on the msys2-64bit job: >>>> >>>> https://gitlab.com/qemu-project/qemu/-/jobs/3872508803 >>>> https://gitlab.com/qemu-project/qemu/-/jobs/3871061024 >>>> https://gitlab.com/qemu-project/qemu/-/jobs/3865312440 >>>> >>>> Sample output: >>>> | 53/89 ERROR:../tests/unit/test-blockjob.c:499:test_complete_in_standby: >>>> assertion failed: (job->status == JOB_STATUS_STANDBY) ERROR >>>> 53/89 qemu:unit / test-blockjob ERROR 0.08s exit status 3 >> >>> Here's an intermittent failure from my macos x86 machine: >>> >>> 172/621 qemu:unit / test-blockjob >>> ERROR 0.26s killed by signal 6 SIGABRT >> >> And an intermittent on the freebsd 13 CI job: >> https://gitlab.com/qemu-project/qemu/-/jobs/3950667240 >> >>>>> MALLOC_PERTURB_=197 G_TEST_BUILDDIR=/tmp/cirrus-ci-build/build/tests/unit G_TEST_SRCDIR=/tmp/cirrus-ci-build/tests/unit /tmp/cirrus-ci-build/build/tests/unit/test-blockjob --tap -k >> ▶ 178/650 /blockjob/ids >> OK >> 178/650 qemu:unit / test-blockjob >> ERROR 0.31s killed by signal 6 SIGABRT >> ――――――――――――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――――――――――― >> stderr: >> Assertion failed: (job->status == JOB_STATUS_STANDBY), function >> test_complete_in_standby, file ../tests/unit/test-blockjob.c, line >> 
499. >> >> >> TAP parsing error: Too few tests run (expected 9, got 1) >> (test program exited with status code -6) >> ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― >> >> Anybody in the block team looking at these, or shall we just >> disable this test entirely ? > > I ran into this issue today, too: > > https://gitlab.com/thuth/qemu/-/jobs/3954367101 > > ... if nobody is interested in fixing this test, I think we should disable it... > > Thomas > I'm looking at this now, and it seems it has been broken since 6f592e5aca1a27fe1c1f6 "job.c: enable job lock/unlock and remove Aiocontext locks", which ends the critical section with a new aio_context_release() call right after bdrv_drain_all_end(), so it is no surprise that the job may resume at that moment and the following assertion fires. Emanuele, could you please look at it? -- Best regards, Vladimir ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: test-blockjob: intermittent CI failures in msys2-64bit job 2023-04-21 10:13 ` Vladimir Sementsov-Ogievskiy @ 2023-04-24 13:36 ` Emanuele Giuseppe Esposito 2023-04-24 18:32 ` Vladimir Sementsov-Ogievskiy 0 siblings, 1 reply; 12+ messages in thread From: Emanuele Giuseppe Esposito @ 2023-04-24 13:36 UTC (permalink / raw) To: Vladimir Sementsov-Ogievskiy, Thomas Huth, Peter Maydell, John Snow, Kevin Wolf, Alberto Garcia Cc: QEMU Developers, Qemu-block Am 21/04/2023 um 12:13 schrieb Vladimir Sementsov-Ogievskiy: > On 17.03.23 15:35, Thomas Huth wrote: >> On 17/03/2023 11.17, Peter Maydell wrote: >>> On Mon, 6 Mar 2023 at 11:16, Peter Maydell <peter.maydell@linaro.org> >>> wrote: >>>> >>>> On Fri, 3 Mar 2023 at 18:36, Peter Maydell >>>> <peter.maydell@linaro.org> wrote: >>>>> >>>>> I've noticed that test-blockjob seems to fail intermittently >>>>> on the msys2-64bit job: >>>>> >>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3872508803 >>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3871061024 >>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3865312440 >>>>> >>>>> Sample output: >>>>> | 53/89 >>>>> ERROR:../tests/unit/test-blockjob.c:499:test_complete_in_standby: >>>>> assertion failed: (job->status == JOB_STATUS_STANDBY) ERROR >>>>> 53/89 qemu:unit / test-blockjob ERROR 0.08s exit status 3 >>> >>>> Here's an intermittent failure from my macos x86 machine: >>>> >>>> 172/621 qemu:unit / test-blockjob >>>> ERROR 0.26s killed by signal 6 SIGABRT >>> >>> And an intermittent on the freebsd 13 CI job: >>> https://gitlab.com/qemu-project/qemu/-/jobs/3950667240 >>> >>>>>> MALLOC_PERTURB_=197 >>>>>> G_TEST_BUILDDIR=/tmp/cirrus-ci-build/build/tests/unit >>>>>> G_TEST_SRCDIR=/tmp/cirrus-ci-build/tests/unit >>>>>> /tmp/cirrus-ci-build/build/tests/unit/test-blockjob --tap -k >>> ▶ 178/650 /blockjob/ids >>> OK >>> 178/650 qemu:unit / test-blockjob >>> ERROR 0.31s killed by signal 6 SIGABRT >>> ――――――――――――――――――――――――――――――――――――― ✀ >>> ――――――――――――――――――――――――――――――――――――― 
>>> stderr: >>> Assertion failed: (job->status == JOB_STATUS_STANDBY), function >>> test_complete_in_standby, file ../tests/unit/test-blockjob.c, line >>> 499. >>> >>> >>> TAP parsing error: Too few tests run (expected 9, got 1) >>> (test program exited with status code -6) >>> ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― >>> >>> Anybody in the block team looking at these, or shall we just >>> disable this test entirely ? >> >> I ran into this issue today, too: >> >> https://gitlab.com/thuth/qemu/-/jobs/3954367101 >> >> ... if nobody is interested in fixing this test, I think we should >> disable it... >> >> Thomas >> > > I'm looking at this now, and seems that it's broken since > 6f592e5aca1a27fe1c1f6 "job.c: enable job lock/unlock and remove > Aiocontext locks", as it stops critical section by new > aio_context_release() call exactly after bdrv_drain_all_and(), so it's > not a surprise that job may start at that moment and the following > assertion fires. > > Emanuele could please look at it? > Well if I understood correctly, the only thing that was preventing the job from continuing was the aiocontext lock held. The failing assertion even mentions that: /* Lock the IO thread to prevent the job from being run */ [...] /* But the job cannot run, so it will remain on standby */ assert(job->status == JOB_STATUS_STANDBY); Essentially bdrv_drain_all_end() would wake up the job coroutine, but it would anyways block somewhere after since the aiocontext lock was taken by the test. Now that we got rid of aiocontext lock in job code, nothing prevents the test from resuming. Mixing job lock and aiocontext acquire/release is not a good idea, and it would anyways block job_resume() called by bdrv_drain_all_end(), so the test itself would deadlock. So unless @Kevin has a better idea, this seems to be just an "hack" to test stuff that is not possible to do now anymore. 
What I would suggest is to get rid of that test, or at least of that assert function. I unfortunately cannot reproduce the failure, but I think the remaining functions called by the test should run as expected. Thank you, Emanuele ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: test-blockjob: intermittent CI failures in msys2-64bit job 2023-04-24 13:36 ` Emanuele Giuseppe Esposito @ 2023-04-24 18:32 ` Vladimir Sementsov-Ogievskiy 2023-04-25 16:48 ` Hanna Czenczek 0 siblings, 1 reply; 12+ messages in thread From: Vladimir Sementsov-Ogievskiy @ 2023-04-24 18:32 UTC (permalink / raw) To: Emanuele Giuseppe Esposito, Thomas Huth, Peter Maydell, John Snow, Kevin Wolf, Alberto Garcia Cc: QEMU Developers, Qemu-block, Hanna Reitz On 24.04.23 16:36, Emanuele Giuseppe Esposito wrote: > > > Am 21/04/2023 um 12:13 schrieb Vladimir Sementsov-Ogievskiy: >> On 17.03.23 15:35, Thomas Huth wrote: >>> On 17/03/2023 11.17, Peter Maydell wrote: >>>> On Mon, 6 Mar 2023 at 11:16, Peter Maydell <peter.maydell@linaro.org> >>>> wrote: >>>>> >>>>> On Fri, 3 Mar 2023 at 18:36, Peter Maydell >>>>> <peter.maydell@linaro.org> wrote: >>>>>> >>>>>> I've noticed that test-blockjob seems to fail intermittently >>>>>> on the msys2-64bit job: >>>>>> >>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3872508803 >>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3871061024 >>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3865312440 >>>>>> >>>>>> Sample output: >>>>>> | 53/89 >>>>>> ERROR:../tests/unit/test-blockjob.c:499:test_complete_in_standby: >>>>>> assertion failed: (job->status == JOB_STATUS_STANDBY) ERROR >>>>>> 53/89 qemu:unit / test-blockjob ERROR 0.08s exit status 3 >>>> >>>>> Here's an intermittent failure from my macos x86 machine: >>>>> >>>>> 172/621 qemu:unit / test-blockjob >>>>> ERROR 0.26s killed by signal 6 SIGABRT >>>> >>>> And an intermittent on the freebsd 13 CI job: >>>> https://gitlab.com/qemu-project/qemu/-/jobs/3950667240 >>>> >>>>>>> MALLOC_PERTURB_=197 >>>>>>> G_TEST_BUILDDIR=/tmp/cirrus-ci-build/build/tests/unit >>>>>>> G_TEST_SRCDIR=/tmp/cirrus-ci-build/tests/unit >>>>>>> /tmp/cirrus-ci-build/build/tests/unit/test-blockjob --tap -k >>>> ▶ 178/650 /blockjob/ids >>>> OK >>>> 178/650 qemu:unit / test-blockjob >>>> ERROR 0.31s killed by signal 
6 SIGABRT >>>> ――――――――――――――――――――――――――――――――――――― ✀ >>>> ――――――――――――――――――――――――――――――――――――― >>>> stderr: >>>> Assertion failed: (job->status == JOB_STATUS_STANDBY), function >>>> test_complete_in_standby, file ../tests/unit/test-blockjob.c, line >>>> 499. >>>> >>>> >>>> TAP parsing error: Too few tests run (expected 9, got 1) >>>> (test program exited with status code -6) >>>> ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― >>>> >>>> Anybody in the block team looking at these, or shall we just >>>> disable this test entirely ? >>> >>> I ran into this issue today, too: >>> >>> https://gitlab.com/thuth/qemu/-/jobs/3954367101 >>> >>> ... if nobody is interested in fixing this test, I think we should >>> disable it... >>> >>> Thomas >>> >> >> I'm looking at this now, and seems that it's broken since >> 6f592e5aca1a27fe1c1f6 "job.c: enable job lock/unlock and remove >> Aiocontext locks", as it stops critical section by new >> aio_context_release() call exactly after bdrv_drain_all_and(), so it's >> not a surprise that job may start at that moment and the following >> assertion fires. >> >> Emanuele could please look at it? >> > Well if I understood correctly, the only thing that was preventing the > job from continuing was the aiocontext lock held. > > The failing assertion even mentions that: > /* Lock the IO thread to prevent the job from being run */ > [...] > /* But the job cannot run, so it will remain on standby */ > assert(job->status == JOB_STATUS_STANDBY); > > Essentially bdrv_drain_all_end() would wake up the job coroutine, but it > would anyways block somewhere after since the aiocontext lock was taken > by the test. > > Now that we got rid of aiocontext lock in job code, nothing prevents the > test from resuming. > Mixing job lock and aiocontext acquire/release is not a good idea, and > it would anyways block job_resume() called by bdrv_drain_all_end(), so > the test itself would deadlock. 
> > So unless @Kevin has a better idea, this seems to be just an "hack" to > test stuff that is not possible to do now anymore. What I would suggest > is to get rid of that test, or at least of that assert function. I > unfortunately cannot reproduce the failure, but I think the remaining > functions called by the test should run as expected. >

Thanks! I agree. Probably, alternatively, we could just expand the drained section, like:

@@ -488,12 +488,6 @@ static void test_complete_in_standby(void)
     bdrv_drain_all_begin();
     assert_job_status_is(job, JOB_STATUS_STANDBY);
 
-    /* Lock the IO thread to prevent the job from being run */
-    aio_context_acquire(ctx);
-    /* This will schedule the job to resume it */
-    bdrv_drain_all_end();
-    aio_context_release(ctx);
-
     WITH_JOB_LOCK_GUARD() {
         /* But the job cannot run, so it will remain on standby */
         assert(job->status == JOB_STATUS_STANDBY);
@@ -511,6 +505,7 @@ static void test_complete_in_standby(void)
         job_dismiss_locked(&job, &error_abort);
     }
 
+    bdrv_drain_all_end();
     aio_context_acquire(ctx);
     destroy_blk(blk);
     aio_context_release(ctx);

But it seems the test specifically wanted to check a job that is still in STANDBY exactly after a drained section...

[cc: Hanna]

Hanna, it was your test (added in c2c731a4d35062295cd3260e66b3754588a2fad4, two years ago). Do you remember whether it was important to catch the STANDBY job *after* a drained section?

--
Best regards,
Vladimir

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: test-blockjob: intermittent CI failures in msys2-64bit job 2023-04-24 18:32 ` Vladimir Sementsov-Ogievskiy @ 2023-04-25 16:48 ` Hanna Czenczek 2023-04-25 16:58 ` Vladimir Sementsov-Ogievskiy 2023-04-26 7:38 ` Emanuele Giuseppe Esposito 0 siblings, 2 replies; 12+ messages in thread From: Hanna Czenczek @ 2023-04-25 16:48 UTC (permalink / raw) To: Vladimir Sementsov-Ogievskiy, Emanuele Giuseppe Esposito, Thomas Huth, Peter Maydell, John Snow, Kevin Wolf, Alberto Garcia Cc: QEMU Developers, Qemu-block On 24.04.23 20:32, Vladimir Sementsov-Ogievskiy wrote: > On 24.04.23 16:36, Emanuele Giuseppe Esposito wrote: >> >> >> Am 21/04/2023 um 12:13 schrieb Vladimir Sementsov-Ogievskiy: >>> On 17.03.23 15:35, Thomas Huth wrote: >>>> On 17/03/2023 11.17, Peter Maydell wrote: >>>>> On Mon, 6 Mar 2023 at 11:16, Peter Maydell <peter.maydell@linaro.org> >>>>> wrote: >>>>>> >>>>>> On Fri, 3 Mar 2023 at 18:36, Peter Maydell >>>>>> <peter.maydell@linaro.org> wrote: >>>>>>> >>>>>>> I've noticed that test-blockjob seems to fail intermittently >>>>>>> on the msys2-64bit job: >>>>>>> >>>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3872508803 >>>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3871061024 >>>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3865312440 >>>>>>> >>>>>>> Sample output: >>>>>>> | 53/89 >>>>>>> ERROR:../tests/unit/test-blockjob.c:499:test_complete_in_standby: >>>>>>> assertion failed: (job->status == JOB_STATUS_STANDBY) ERROR >>>>>>> 53/89 qemu:unit / test-blockjob ERROR 0.08s exit status 3 >>>>> >>>>>> Here's an intermittent failure from my macos x86 machine: >>>>>> >>>>>> 172/621 qemu:unit / test-blockjob >>>>>> ERROR 0.26s killed by signal 6 SIGABRT >>>>> >>>>> And an intermittent on the freebsd 13 CI job: >>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3950667240 >>>>> >>>>>>>> MALLOC_PERTURB_=197 >>>>>>>> G_TEST_BUILDDIR=/tmp/cirrus-ci-build/build/tests/unit >>>>>>>> G_TEST_SRCDIR=/tmp/cirrus-ci-build/tests/unit >>>>>>>> 
/tmp/cirrus-ci-build/build/tests/unit/test-blockjob --tap -k >>>>> ▶ 178/650 /blockjob/ids >>>>> OK >>>>> 178/650 qemu:unit / test-blockjob >>>>> ERROR 0.31s killed by signal 6 SIGABRT >>>>> ――――――――――――――――――――――――――――――――――――― ✀ >>>>> ――――――――――――――――――――――――――――――――――――― >>>>> stderr: >>>>> Assertion failed: (job->status == JOB_STATUS_STANDBY), function >>>>> test_complete_in_standby, file ../tests/unit/test-blockjob.c, line >>>>> 499. >>>>> >>>>> >>>>> TAP parsing error: Too few tests run (expected 9, got 1) >>>>> (test program exited with status code -6) >>>>> ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― >>>>> >>>>> >>>>> Anybody in the block team looking at these, or shall we just >>>>> disable this test entirely ? >>>> >>>> I ran into this issue today, too: >>>> >>>> https://gitlab.com/thuth/qemu/-/jobs/3954367101 >>>> >>>> ... if nobody is interested in fixing this test, I think we should >>>> disable it... >>>> >>>> Thomas >>>> >>> >>> I'm looking at this now, and seems that it's broken since >>> 6f592e5aca1a27fe1c1f6 "job.c: enable job lock/unlock and remove >>> Aiocontext locks", as it stops critical section by new >>> aio_context_release() call exactly after bdrv_drain_all_and(), so it's >>> not a surprise that job may start at that moment and the following >>> assertion fires. >>> >>> Emanuele could please look at it? >>> >> Well if I understood correctly, the only thing that was preventing the >> job from continuing was the aiocontext lock held. >> >> The failing assertion even mentions that: >> /* Lock the IO thread to prevent the job from being run */ >> [...] >> /* But the job cannot run, so it will remain on standby */ >> assert(job->status == JOB_STATUS_STANDBY); >> >> Essentially bdrv_drain_all_end() would wake up the job coroutine, but it >> would anyways block somewhere after since the aiocontext lock was taken >> by the test. 
>> >> Now that we got rid of aiocontext lock in job code, nothing prevents the >> test from resuming. >> Mixing job lock and aiocontext acquire/release is not a good idea, and >> it would anyways block job_resume() called by bdrv_drain_all_end(), so >> the test itself would deadlock. >> >> So unless @Kevin has a better idea, this seems to be just an "hack" to >> test stuff that is not possible to do now anymore. What I would suggest >> is to get rid of that test, or at least of that assert function. I >> unfortunately cannot reproduce the failure, but I think the remaining >> functions called by the test should run as expected. >> > > Thanks! I agree. Probably, alternatively we could just expand the > drained section, like > > @@ -488,12 +488,6 @@ static void test_complete_in_standby(void) > bdrv_drain_all_begin(); > assert_job_status_is(job, JOB_STATUS_STANDBY); > > - /* Lock the IO thread to prevent the job from being run */ > - aio_context_acquire(ctx); > - /* This will schedule the job to resume it */ > - bdrv_drain_all_end(); > - aio_context_release(ctx); > - > WITH_JOB_LOCK_GUARD() { > /* But the job cannot run, so it will remain on standby */ > assert(job->status == JOB_STATUS_STANDBY); > @@ -511,6 +505,7 @@ static void test_complete_in_standby(void) > job_dismiss_locked(&job, &error_abort); > } > > + bdrv_drain_all_end(); > aio_context_acquire(ctx); > destroy_blk(blk); > aio_context_release(ctx); > > > But, seems that test wanted to specifically test job, that still in > STANDBY exactly after drained section... > > [cc: Hanna] > > Hanna, it was your test (added in > c2c731a4d35062295cd3260e66b3754588a2fad4, two years ago). Don't you > remember was important to catch STANDBY job *after* a drained section? I’m not quite sure, but I think the idea was that we basically try to get as close to something that might come in over QMP. Over QMP, you can’t issue a job-complete while keeping everything drained, so I wouldn’t just extend the drained section. 
Getting rid of the assert function also seems pointless. If we want to test whether job-complete works on jobs in standby, we must put the job in standby, and verify this. We can get rid of the test, of course, but it is a regression test, so it isn’t like it was added just for fun. Then again, it’s now already effectively commented out via an environment variable, so it doesn’t seem like a loss in practice to fully drop it.

Anyway – the thing I wonder about is: if this is to test whether jobs in standby can be completed, why don’t we just pause the job instead of going through the context lock hassle? I.e. just put a job_pause() right after bdrv_drain_all_begin().

If I’m not mistaken, reproducing the bug in the test is really simple: add a sleep(1) right before WITH_JOB_LOCK_GUARD(); and doing that works just fine as long as you have a job_pause() in the drained section. (And drop the aio_context_acquire()/release() calls, because they don’t do anything anymore.)

Hanna

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: test-blockjob: intermittent CI failures in msys2-64bit job 2023-04-25 16:48 ` Hanna Czenczek @ 2023-04-25 16:58 ` Vladimir Sementsov-Ogievskiy 2023-04-26 7:38 ` Emanuele Giuseppe Esposito 1 sibling, 0 replies; 12+ messages in thread From: Vladimir Sementsov-Ogievskiy @ 2023-04-25 16:58 UTC (permalink / raw) To: Hanna Czenczek, Emanuele Giuseppe Esposito, Thomas Huth, Peter Maydell, John Snow, Kevin Wolf, Alberto Garcia Cc: QEMU Developers, Qemu-block On 25.04.23 19:48, Hanna Czenczek wrote: > On 24.04.23 20:32, Vladimir Sementsov-Ogievskiy wrote: >> On 24.04.23 16:36, Emanuele Giuseppe Esposito wrote: >>> >>> >>> Am 21/04/2023 um 12:13 schrieb Vladimir Sementsov-Ogievskiy: >>>> On 17.03.23 15:35, Thomas Huth wrote: >>>>> On 17/03/2023 11.17, Peter Maydell wrote: >>>>>> On Mon, 6 Mar 2023 at 11:16, Peter Maydell <peter.maydell@linaro.org> >>>>>> wrote: >>>>>>> >>>>>>> On Fri, 3 Mar 2023 at 18:36, Peter Maydell >>>>>>> <peter.maydell@linaro.org> wrote: >>>>>>>> >>>>>>>> I've noticed that test-blockjob seems to fail intermittently >>>>>>>> on the msys2-64bit job: >>>>>>>> >>>>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3872508803 >>>>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3871061024 >>>>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3865312440 >>>>>>>> >>>>>>>> Sample output: >>>>>>>> | 53/89 >>>>>>>> ERROR:../tests/unit/test-blockjob.c:499:test_complete_in_standby: >>>>>>>> assertion failed: (job->status == JOB_STATUS_STANDBY) ERROR >>>>>>>> 53/89 qemu:unit / test-blockjob ERROR 0.08s exit status 3 >>>>>> >>>>>>> Here's an intermittent failure from my macos x86 machine: >>>>>>> >>>>>>> 172/621 qemu:unit / test-blockjob >>>>>>> ERROR 0.26s killed by signal 6 SIGABRT >>>>>> >>>>>> And an intermittent on the freebsd 13 CI job: >>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3950667240 >>>>>> >>>>>>>>> MALLOC_PERTURB_=197 >>>>>>>>> G_TEST_BUILDDIR=/tmp/cirrus-ci-build/build/tests/unit >>>>>>>>> G_TEST_SRCDIR=/tmp/cirrus-ci-build/tests/unit 
>>>>>>>>> /tmp/cirrus-ci-build/build/tests/unit/test-blockjob --tap -k >>>>>> ▶ 178/650 /blockjob/ids >>>>>> OK >>>>>> 178/650 qemu:unit / test-blockjob >>>>>> ERROR 0.31s killed by signal 6 SIGABRT >>>>>> ――――――――――――――――――――――――――――――――――――― ✀ >>>>>> ――――――――――――――――――――――――――――――――――――― >>>>>> stderr: >>>>>> Assertion failed: (job->status == JOB_STATUS_STANDBY), function >>>>>> test_complete_in_standby, file ../tests/unit/test-blockjob.c, line >>>>>> 499. >>>>>> >>>>>> >>>>>> TAP parsing error: Too few tests run (expected 9, got 1) >>>>>> (test program exited with status code -6) >>>>>> ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― >>>>>> >>>>>> Anybody in the block team looking at these, or shall we just >>>>>> disable this test entirely ? >>>>> >>>>> I ran into this issue today, too: >>>>> >>>>> https://gitlab.com/thuth/qemu/-/jobs/3954367101 >>>>> >>>>> ... if nobody is interested in fixing this test, I think we should >>>>> disable it... >>>>> >>>>> Thomas >>>>> >>>> >>>> I'm looking at this now, and seems that it's broken since >>>> 6f592e5aca1a27fe1c1f6 "job.c: enable job lock/unlock and remove >>>> Aiocontext locks", as it stops critical section by new >>>> aio_context_release() call exactly after bdrv_drain_all_and(), so it's >>>> not a surprise that job may start at that moment and the following >>>> assertion fires. >>>> >>>> Emanuele could please look at it? >>>> >>> Well if I understood correctly, the only thing that was preventing the >>> job from continuing was the aiocontext lock held. >>> >>> The failing assertion even mentions that: >>> /* Lock the IO thread to prevent the job from being run */ >>> [...] >>> /* But the job cannot run, so it will remain on standby */ >>> assert(job->status == JOB_STATUS_STANDBY); >>> >>> Essentially bdrv_drain_all_end() would wake up the job coroutine, but it >>> would anyways block somewhere after since the aiocontext lock was taken >>> by the test. 
>>> >>> Now that we got rid of aiocontext lock in job code, nothing prevents the >>> test from resuming. >>> Mixing job lock and aiocontext acquire/release is not a good idea, and >>> it would anyways block job_resume() called by bdrv_drain_all_end(), so >>> the test itself would deadlock. >>> >>> So unless @Kevin has a better idea, this seems to be just an "hack" to >>> test stuff that is not possible to do now anymore. What I would suggest >>> is to get rid of that test, or at least of that assert function. I >>> unfortunately cannot reproduce the failure, but I think the remaining >>> functions called by the test should run as expected. >>> >> >> Thanks! I agree. Probably, alternatively we could just expand the drained section, like >> >> @@ -488,12 +488,6 @@ static void test_complete_in_standby(void) >> bdrv_drain_all_begin(); >> assert_job_status_is(job, JOB_STATUS_STANDBY); >> >> - /* Lock the IO thread to prevent the job from being run */ >> - aio_context_acquire(ctx); >> - /* This will schedule the job to resume it */ >> - bdrv_drain_all_end(); >> - aio_context_release(ctx); >> - >> WITH_JOB_LOCK_GUARD() { >> /* But the job cannot run, so it will remain on standby */ >> assert(job->status == JOB_STATUS_STANDBY); >> @@ -511,6 +505,7 @@ static void test_complete_in_standby(void) >> job_dismiss_locked(&job, &error_abort); >> } >> >> + bdrv_drain_all_end(); >> aio_context_acquire(ctx); >> destroy_blk(blk); >> aio_context_release(ctx); >> >> >> But, seems that test wanted to specifically test job, that still in STANDBY exactly after drained section... >> >> [cc: Hanna] >> >> Hanna, it was your test (added in c2c731a4d35062295cd3260e66b3754588a2fad4, two years ago). Don't you remember was important to catch STANDBY job *after* a drained section? > > I’m not quite sure, but I think the idea was that we basically try to get as close to something that might come in over QMP. 
Over QMP, you can’t issue a job-complete while keeping everything drained, so I wouldn’t just extend the drained section.
>
> Getting rid of the assert function also seems pointless. If we want to
> test whether job-complete works on jobs in standby, we must put the job
> in standby, and verify this. We can get rid of the test, of course, but
> it is a regression test, so it isn’t like it was added just for fun.
> Then again, it’s now already effectively commented out via an environment
> variable, so it doesn’t seem like a loss in practice to fully drop it.
>
> Anyway – the thing I wonder about is, if this is to test whether jobs in
> standby can be completed… Why don’t we just pause the job instead of
> going through the context lock hassle? I.e. just put a job_pause()
> right after bdrv_drain_all_begin().
>
> If I’m not mistaken, reproducing the bug in the test seems really simple
> by adding a sleep(1) right before WITH_JOB_LOCK_GUARD(); and doing that
> works just fine as long as you have a job_pause() in the drained section.
> (And dropping the aio_context_acquire()/release() calls, because they
> don’t do anything anymore.)
>

Sounds good. Could you send a patch?

--
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 12+ messages in thread
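[Editorial note: the pause/resume bookkeeping being debated in this thread can be sketched with a small toy model. This is purely illustrative: the types and function names below (ModelJob, model_drain_all_begin(), etc.) are invented for this sketch and are not the real QEMU APIs, and where real QEMU only *schedules* the resume asynchronously (hence the intermittent failures), the model resumes synchronously, which makes the breakage deterministic.]

```c
/* Toy model of the job pause-count bookkeeping discussed above.
 * All names are invented for illustration; not the real QEMU APIs. */

typedef enum { MODEL_RUNNING, MODEL_STANDBY } ModelStatus;

typedef struct {
    int pause_count;
    ModelStatus status;
} ModelJob;

/* Draining pauses the job: the counter goes up and the job parks in
 * STANDBY. */
void model_drain_all_begin(ModelJob *job)
{
    job->pause_count++;
    job->status = MODEL_STANDBY;
}

/* Ending the drain decrements the counter and resumes the job once the
 * counter reaches zero.  In real QEMU this resume is merely scheduled,
 * so the old test failed only intermittently; here it is synchronous. */
void model_drain_all_end(ModelJob *job)
{
    if (--job->pause_count == 0) {
        job->status = MODEL_RUNNING;
    }
}

/* An explicit pause keeps the counter unbalanced. */
void model_job_pause(ModelJob *job)
{
    job->pause_count++;
}

/* Old test sequence: with the AioContext lock no longer blocking the
 * resume, the job is RUNNING when the assertion expects STANDBY. */
ModelStatus model_old_sequence(void)
{
    ModelJob job = { 0, MODEL_RUNNING };
    model_drain_all_begin(&job);
    model_drain_all_end(&job);
    return job.status;
}

/* Sequence with the extra job_pause() suggested above: the counter
 * stays above zero, so the job remains in STANDBY after the drain. */
ModelStatus model_fixed_sequence(void)
{
    ModelJob job = { 0, MODEL_RUNNING };
    model_drain_all_begin(&job);
    model_job_pause(&job);
    model_drain_all_end(&job);
    return job.status;
}
```

With the extra pause in place, the completion under WITH_JOB_LOCK_GUARD() can then proceed on a job that is genuinely still in standby.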
* Re: test-blockjob: intermittent CI failures in msys2-64bit job 2023-04-25 16:48 ` Hanna Czenczek 2023-04-25 16:58 ` Vladimir Sementsov-Ogievskiy @ 2023-04-26 7:38 ` Emanuele Giuseppe Esposito 2023-04-26 8:03 ` Hanna Czenczek 1 sibling, 1 reply; 12+ messages in thread From: Emanuele Giuseppe Esposito @ 2023-04-26 7:38 UTC (permalink / raw) To: Hanna Czenczek, Vladimir Sementsov-Ogievskiy, Thomas Huth, Peter Maydell, John Snow, Kevin Wolf, Alberto Garcia Cc: QEMU Developers, Qemu-block

Am 25/04/2023 um 18:48 schrieb Hanna Czenczek:
> On 24.04.23 20:32, Vladimir Sementsov-Ogievskiy wrote:
>> [...]
>
> Anyway – the thing I wonder about is, if this is to test whether jobs in
> standby can be completed… Why don’t we just pause the job instead of
> going through the context lock hassle? I.e. just put a job_pause()
> right after bdrv_drain_all_begin().
>
> If I’m not mistaken, reproducing the bug in the test seems really simple
> by adding a sleep(1) right before WITH_JOB_LOCK_GUARD(); and doing that
> works just fine as long as you have a job_pause() in the drained section.
> (And dropping the aio_context_acquire()/release() calls, because they
> don’t do anything anymore.)
>
Why sleep(1)?
I have the patch ready:

diff --git a/tests/unit/test-blockjob.c b/tests/unit/test-blockjob.c
index a130f6fefb..8054b72afa 100644
--- a/tests/unit/test-blockjob.c
+++ b/tests/unit/test-blockjob.c
@@ -488,11 +488,15 @@ static void test_complete_in_standby(void)
     bdrv_drain_all_begin();
     assert_job_status_is(job, JOB_STATUS_STANDBY);
 
+    /*
+     * Increase pause_count so that the counter is
+     * unbalanced and job won't resume
+     */
+    job_pause(job);
+
     /* Lock the IO thread to prevent the job from being run */
-    aio_context_acquire(ctx);
     /* This will schedule the job to resume it */
     bdrv_drain_all_end();
-    aio_context_release(ctx);
 
     WITH_JOB_LOCK_GUARD() {
         /* But the job cannot run, so it will remain on standby */

But I don't get why we should sleep additionally.

Emanuele

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* Re: test-blockjob: intermittent CI failures in msys2-64bit job 2023-04-26 7:38 ` Emanuele Giuseppe Esposito @ 2023-04-26 8:03 ` Hanna Czenczek 2023-04-26 8:17 ` Emanuele Giuseppe Esposito 0 siblings, 1 reply; 12+ messages in thread From: Hanna Czenczek @ 2023-04-26 8:03 UTC (permalink / raw) To: Emanuele Giuseppe Esposito, Vladimir Sementsov-Ogievskiy, Thomas Huth, Peter Maydell, John Snow, Kevin Wolf, Alberto Garcia Cc: QEMU Developers, Qemu-block

On 26.04.23 09:38, Emanuele Giuseppe Esposito wrote:
> [...]
>
> Why sleep(1)? I have the patch ready:
>
> [...]
>
> But I don't get why we should sleep additionally.

Yep, that’s the patch. The sleep was just a note on how to get it to reproduce 100 %.

Hanna

^ permalink raw reply	[flat|nested] 12+ messages in thread
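[Editorial note: the role of the sleep can be made concrete with a small threaded sketch. Again, every name below is invented for illustration, not a real QEMU API: the worker thread stands in for the resume that bdrv_drain_all_end() merely schedules, and joining it stands in for the sleep(1) that guarantees the scheduled resume has run before the assertion (a join is deterministic where a real sleep only makes the race overwhelmingly likely).]

```c
#include <pthread.h>

/* Threaded sketch of the race window: ending the drain only
 * *schedules* the resume, which then runs in another context. */

enum { RACE_RUNNING, RACE_STANDBY };

typedef struct {
    pthread_mutex_t lock;   /* stands in for the job lock */
    int pause_count;
    int status;
} RaceJob;

/* Models the scheduled job_resume() that runs after the drain ends. */
static void *resume_cb(void *opaque)
{
    RaceJob *job = opaque;
    pthread_mutex_lock(&job->lock);
    if (--job->pause_count == 0) {
        job->status = RACE_RUNNING;
    }
    pthread_mutex_unlock(&job->lock);
    return NULL;
}

/* Returns the status the test's assertion would observe.  'extra_pause'
 * models the job_pause() fix; the join models the sleep(1) that lets
 * the scheduled resume run before the assertion. */
int status_after_drain_end(int extra_pause)
{
    /* drain_begin already happened: count 1, job parked in STANDBY */
    RaceJob job = { PTHREAD_MUTEX_INITIALIZER, 1, RACE_STANDBY };
    pthread_t resumer;
    int status;

    if (extra_pause) {
        pthread_mutex_lock(&job.lock);
        job.pause_count++;          /* job_pause(): counter now 2 */
        pthread_mutex_unlock(&job.lock);
    }

    /* "bdrv_drain_all_end()": schedule the resume elsewhere */
    pthread_create(&resumer, NULL, resume_cb, &job);

    /* "sleep(1)": make sure the scheduled resume has run */
    pthread_join(resumer, NULL);

    pthread_mutex_lock(&job.lock);
    status = job.status;
    pthread_mutex_unlock(&job.lock);
    return status;
}
```

Without the extra pause the observed status is RUNNING (the old, now-failing assertion); with it, the job stays in STANDBY no matter how late the check runs.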
* Re: test-blockjob: intermittent CI failures in msys2-64bit job 2023-04-26 8:03 ` Hanna Czenczek @ 2023-04-26 8:17 ` Emanuele Giuseppe Esposito 0 siblings, 0 replies; 12+ messages in thread From: Emanuele Giuseppe Esposito @ 2023-04-26 8:17 UTC (permalink / raw) To: Hanna Czenczek, Vladimir Sementsov-Ogievskiy, Thomas Huth, Peter Maydell, John Snow, Kevin Wolf, Alberto Garcia Cc: QEMU Developers, Qemu-block

Am 26/04/2023 um 10:03 schrieb Hanna Czenczek:
> On 26.04.23 09:38, Emanuele Giuseppe Esposito wrote:
>> [...]
>>
>> But I don't get why we should sleep additionally.
>
> Yep, that’s the patch. The sleep was just a note on how to get it to
> reproduce 100 %.
>
> Hanna
>

Patch sent, thanks!
https://patchew.org/QEMU/20230426081601.2567990-1-eesposit@redhat.com/

Thank you,
Emanuele

^ permalink raw reply	[flat|nested] 12+ messages in thread
end of thread, other threads:[~2023-04-26 8:18 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-03-03 18:36 test-blockjob: intermittent CI failures in msys2-64bit job Peter Maydell 2023-03-06 11:16 ` Peter Maydell 2023-03-17 10:17 ` Peter Maydell 2023-03-17 12:35 ` Thomas Huth 2023-04-21 10:13 ` Vladimir Sementsov-Ogievskiy 2023-04-24 13:36 ` Emanuele Giuseppe Esposito 2023-04-24 18:32 ` Vladimir Sementsov-Ogievskiy 2023-04-25 16:48 ` Hanna Czenczek 2023-04-25 16:58 ` Vladimir Sementsov-Ogievskiy 2023-04-26 7:38 ` Emanuele Giuseppe Esposito 2023-04-26 8:03 ` Hanna Czenczek 2023-04-26 8:17 ` Emanuele Giuseppe Esposito
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).