From: Emanuele Giuseppe Esposito <eesposit@redhat.com>
To: Hanna Czenczek <hreitz@redhat.com>,
Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>,
Thomas Huth <thuth@redhat.com>,
Peter Maydell <peter.maydell@linaro.org>,
John Snow <jsnow@redhat.com>, Kevin Wolf <kwolf@redhat.com>,
Alberto Garcia <berto@igalia.com>
Cc: QEMU Developers <qemu-devel@nongnu.org>,
Qemu-block <qemu-block@nongnu.org>
Subject: Re: test-blockjob: intermittent CI failures in msys2-64bit job
Date: Wed, 26 Apr 2023 09:38:56 +0200 [thread overview]
Message-ID: <817ea28a-3fe8-d92e-5404-6bafc33fa67c@redhat.com> (raw)
In-Reply-To: <78ad8ed0-e68a-5f4e-9627-454f0bd007fb@redhat.com>
Am 25/04/2023 um 18:48 schrieb Hanna Czenczek:
> On 24.04.23 20:32, Vladimir Sementsov-Ogievskiy wrote:
>> On 24.04.23 16:36, Emanuele Giuseppe Esposito wrote:
>>>
>>>
>>> Am 21/04/2023 um 12:13 schrieb Vladimir Sementsov-Ogievskiy:
>>>> On 17.03.23 15:35, Thomas Huth wrote:
>>>>> On 17/03/2023 11.17, Peter Maydell wrote:
>>>>>> On Mon, 6 Mar 2023 at 11:16, Peter Maydell <peter.maydell@linaro.org>
>>>>>> wrote:
>>>>>>>
>>>>>>> On Fri, 3 Mar 2023 at 18:36, Peter Maydell
>>>>>>> <peter.maydell@linaro.org> wrote:
>>>>>>>>
>>>>>>>> I've noticed that test-blockjob seems to fail intermittently
>>>>>>>> on the msys2-64bit job:
>>>>>>>>
>>>>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3872508803
>>>>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3871061024
>>>>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3865312440
>>>>>>>>
>>>>>>>> Sample output:
>>>>>>>> | 53/89
>>>>>>>> ERROR:../tests/unit/test-blockjob.c:499:test_complete_in_standby:
>>>>>>>> assertion failed: (job->status == JOB_STATUS_STANDBY) ERROR
>>>>>>>> 53/89 qemu:unit / test-blockjob ERROR 0.08s exit status 3
>>>>>>
>>>>>>> Here's an intermittent failure from my macos x86 machine:
>>>>>>>
>>>>>>> 172/621 qemu:unit / test-blockjob
>>>>>>> ERROR 0.26s killed by signal 6 SIGABRT
>>>>>>
>>>>>> And an intermittent on the freebsd 13 CI job:
>>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3950667240
>>>>>>
>>>>>>>>> MALLOC_PERTURB_=197
>>>>>>>>> G_TEST_BUILDDIR=/tmp/cirrus-ci-build/build/tests/unit
>>>>>>>>> G_TEST_SRCDIR=/tmp/cirrus-ci-build/tests/unit
>>>>>>>>> /tmp/cirrus-ci-build/build/tests/unit/test-blockjob --tap -k
>>>>>> ▶ 178/650 /blockjob/ids
>>>>>> OK
>>>>>> 178/650 qemu:unit / test-blockjob
>>>>>> ERROR 0.31s killed by signal 6 SIGABRT
>>>>>> ――――――――――――――――――――――――――――――――――――― ✀
>>>>>> ―――――――――――――――――――――――――――――――――――――
>>>>>> stderr:
>>>>>> Assertion failed: (job->status == JOB_STATUS_STANDBY), function
>>>>>> test_complete_in_standby, file ../tests/unit/test-blockjob.c, line
>>>>>> 499.
>>>>>>
>>>>>>
>>>>>> TAP parsing error: Too few tests run (expected 9, got 1)
>>>>>> (test program exited with status code -6)
>>>>>> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
>>>>>>
>>>>>> Anybody in the block team looking at these, or shall we just
>>>>>> disable this test entirely ?
>>>>>
>>>>> I ran into this issue today, too:
>>>>>
>>>>> https://gitlab.com/thuth/qemu/-/jobs/3954367101
>>>>>
>>>>> ... if nobody is interested in fixing this test, I think we should
>>>>> disable it...
>>>>>
>>>>> Thomas
>>>>>
>>>>
>>>> I'm looking at this now, and seems that it's broken since
>>>> 6f592e5aca1a27fe1c1f6 "job.c: enable job lock/unlock and remove
>>>> Aiocontext locks", as it stops critical section by new
>>>> aio_context_release() call exactly after bdrv_drain_all_and(), so it's
>>>> not a surprise that job may start at that moment and the following
>>>> assertion fires.
>>>>
>>>> Emanuele could please look at it?
>>>>
>>> Well if I understood correctly, the only thing that was preventing the
>>> job from continuing was the aiocontext lock held.
>>>
>>> The failing assertion even mentions that:
>>> /* Lock the IO thread to prevent the job from being run */
>>> [...]
>>> /* But the job cannot run, so it will remain on standby */
>>> assert(job->status == JOB_STATUS_STANDBY);
>>>
>>> Essentially bdrv_drain_all_end() would wake up the job coroutine, but it
>>> would anyways block somewhere after since the aiocontext lock was taken
>>> by the test.
>>>
>>> Now that we got rid of aiocontext lock in job code, nothing prevents the
>>> test from resuming.
>>> Mixing job lock and aiocontext acquire/release is not a good idea, and
>>> it would anyways block job_resume() called by bdrv_drain_all_end(), so
>>> the test itself would deadlock.
>>>
>>> So unless @Kevin has a better idea, this seems to be just an "hack" to
>>> test stuff that is not possible to do now anymore. What I would suggest
>>> is to get rid of that test, or at least of that assert function. I
>>> unfortunately cannot reproduce the failure, but I think the remaining
>>> functions called by the test should run as expected.
>>>
>>
>> Thanks! I agree. Probably, alternatively we could just expand the
>> drained section, like
>>
>> @@ -488,12 +488,6 @@ static void test_complete_in_standby(void)
>> bdrv_drain_all_begin();
>> assert_job_status_is(job, JOB_STATUS_STANDBY);
>>
>> - /* Lock the IO thread to prevent the job from being run */
>> - aio_context_acquire(ctx);
>> - /* This will schedule the job to resume it */
>> - bdrv_drain_all_end();
>> - aio_context_release(ctx);
>> -
>> WITH_JOB_LOCK_GUARD() {
>> /* But the job cannot run, so it will remain on standby */
>> assert(job->status == JOB_STATUS_STANDBY);
>> @@ -511,6 +505,7 @@ static void test_complete_in_standby(void)
>> job_dismiss_locked(&job, &error_abort);
>> }
>>
>> + bdrv_drain_all_end();
>> aio_context_acquire(ctx);
>> destroy_blk(blk);
>> aio_context_release(ctx);
>>
>>
>> But, seems that test wanted to specifically test job, that still in
>> STANDBY exactly after drained section...
>>
>> [cc: Hanna]
>>
>> Hanna, it was your test (added in
>> c2c731a4d35062295cd3260e66b3754588a2fad4, two years ago). Don't you
>> remember was important to catch STANDBY job *after* a drained section?
>
> I’m not quite sure, but I think the idea was that we basically try to
> get as close to something that might come in over QMP. Over QMP, you
> can’t issue a job-complete while keeping everything drained, so I
> wouldn’t just extend the drained section.
>
> Getting rid of the assert function also seems pointless. If we want to
> test whether job-complete works on tests in standby, we must put the
> test in standby, and verify this. We can get rid of the test, of
> course, but it is a regression test, so it isn’t like it was added just
> for fun. Then again, it’s now already effectively commented out via
> environment variable, so it doesn’t seem like a loss in practice to to
> fully drop it.
>
> Anyway – the thing I wonder about is, if this is to test whether jobs in
> standby can be completed… Why don’t we just pause the job instead of
> going through the context lock hassle? I.e. just put a job_pause()
> right after bdrv_drain_all_begin().
>
> If I’m not mistaken, reproducing the bug in the test seems really simple
> by adding a sleep(1) right before WITH_JOB_LOCK_GUARD(); and doing that
> works just fine if only you have a job_pause() in the drained section.
> (And dropping the aio_context_acquire()/release() calls, because they
> don’t do anything anymore.)
>
Why sleep(1)? I have the patch ready:
diff --git a/tests/unit/test-blockjob.c b/tests/unit/test-blockjob.c
index a130f6fefb..8054b72afa 100644
--- a/tests/unit/test-blockjob.c
+++ b/tests/unit/test-blockjob.c
@@ -488,11 +488,15 @@ static void test_complete_in_standby(void)
bdrv_drain_all_begin();
assert_job_status_is(job, JOB_STATUS_STANDBY);
+ /*
+ * Increase pause_count so that the counter is
+ * unbalanced and job won't resume
+ */
+ job_pause(job);
+
/* Lock the IO thread to prevent the job from being run */
- aio_context_acquire(ctx);
/* This will schedule the job to resume it */
bdrv_drain_all_end();
- aio_context_release(ctx);
WITH_JOB_LOCK_GUARD() {
/* But the job cannot run, so it will remain on standby */
But I don't get why we should sleep additionally.
Emanuele
next prev parent reply other threads:[~2023-04-26 7:39 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-03-03 18:36 test-blockjob: intermittent CI failures in msys2-64bit job Peter Maydell
2023-03-06 11:16 ` Peter Maydell
2023-03-17 10:17 ` Peter Maydell
2023-03-17 12:35 ` Thomas Huth
2023-04-21 10:13 ` Vladimir Sementsov-Ogievskiy
2023-04-24 13:36 ` Emanuele Giuseppe Esposito
2023-04-24 18:32 ` Vladimir Sementsov-Ogievskiy
2023-04-25 16:48 ` Hanna Czenczek
2023-04-25 16:58 ` Vladimir Sementsov-Ogievskiy
2023-04-26 7:38 ` Emanuele Giuseppe Esposito [this message]
2023-04-26 8:03 ` Hanna Czenczek
2023-04-26 8:17 ` Emanuele Giuseppe Esposito
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=817ea28a-3fe8-d92e-5404-6bafc33fa67c@redhat.com \
--to=eesposit@redhat.com \
--cc=berto@igalia.com \
--cc=hreitz@redhat.com \
--cc=jsnow@redhat.com \
--cc=kwolf@redhat.com \
--cc=peter.maydell@linaro.org \
--cc=qemu-block@nongnu.org \
--cc=qemu-devel@nongnu.org \
--cc=thuth@redhat.com \
--cc=vsementsov@yandex-team.ru \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).