Re: test-blockjob: intermittent CI failures in msys2-64bit job

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Hanna Czenczek <hreitz@redhat.com>
To: Emanuele Giuseppe Esposito <eesposit@redhat.com>,
	Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>,
	Thomas Huth <thuth@redhat.com>,
	Peter Maydell <peter.maydell@linaro.org>,
	John Snow <jsnow@redhat.com>, Kevin Wolf <kwolf@redhat.com>,
	Alberto Garcia <berto@igalia.com>
Cc: QEMU Developers <qemu-devel@nongnu.org>,
	Qemu-block <qemu-block@nongnu.org>
Subject: Re: test-blockjob: intermittent CI failures in msys2-64bit job
Date: Wed, 26 Apr 2023 10:03:54 +0200	[thread overview]
Message-ID: <260c2fc7-df11-1de4-c0d3-fa5697d6a125@redhat.com> (raw)
In-Reply-To: <817ea28a-3fe8-d92e-5404-6bafc33fa67c@redhat.com>

On 26.04.23 09:38, Emanuele Giuseppe Esposito wrote:
>
> Am 25/04/2023 um 18:48 schrieb Hanna Czenczek:
>> On 24.04.23 20:32, Vladimir Sementsov-Ogievskiy wrote:
>>> On 24.04.23 16:36, Emanuele Giuseppe Esposito wrote:
>>>>
>>>> Am 21/04/2023 um 12:13 schrieb Vladimir Sementsov-Ogievskiy:
>>>>> On 17.03.23 15:35, Thomas Huth wrote:
>>>>>> On 17/03/2023 11.17, Peter Maydell wrote:
>>>>>>> On Mon, 6 Mar 2023 at 11:16, Peter Maydell <peter.maydell@linaro.org>
>>>>>>> wrote:
>>>>>>>> On Fri, 3 Mar 2023 at 18:36, Peter Maydell
>>>>>>>> <peter.maydell@linaro.org> wrote:
>>>>>>>>> I've noticed that test-blockjob seems to fail intermittently
>>>>>>>>> on the msys2-64bit job:
>>>>>>>>>
>>>>>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3872508803
>>>>>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3871061024
>>>>>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3865312440
>>>>>>>>>
>>>>>>>>> Sample output:
>>>>>>>>> | 53/89
>>>>>>>>> ERROR:../tests/unit/test-blockjob.c:499:test_complete_in_standby:
>>>>>>>>> assertion failed: (job->status == JOB_STATUS_STANDBY) ERROR
>>>>>>>>> 53/89 qemu:unit / test-blockjob ERROR 0.08s exit status 3
>>>>>>>> Here's an intermittent failure from my macos x86 machine:
>>>>>>>>
>>>>>>>> 172/621 qemu:unit / test-blockjob
>>>>>>>>               ERROR           0.26s   killed by signal 6 SIGABRT
>>>>>>> And an intermittent on the freebsd 13 CI job:
>>>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3950667240
>>>>>>>
>>>>>>>>>> MALLOC_PERTURB_=197
>>>>>>>>>> G_TEST_BUILDDIR=/tmp/cirrus-ci-build/build/tests/unit
>>>>>>>>>> G_TEST_SRCDIR=/tmp/cirrus-ci-build/tests/unit
>>>>>>>>>> /tmp/cirrus-ci-build/build/tests/unit/test-blockjob --tap -k
>>>>>>> ▶ 178/650 /blockjob/ids
>>>>>>>               OK
>>>>>>> 178/650 qemu:unit / test-blockjob
>>>>>>>               ERROR           0.31s   killed by signal 6 SIGABRT
>>>>>>> ――――――――――――――――――――――――――――――――――――― ✀
>>>>>>> ―――――――――――――――――――――――――――――――――――――
>>>>>>> stderr:
>>>>>>> Assertion failed: (job->status == JOB_STATUS_STANDBY), function
>>>>>>> test_complete_in_standby, file ../tests/unit/test-blockjob.c, line
>>>>>>> 499.
>>>>>>>
>>>>>>>
>>>>>>> TAP parsing error: Too few tests run (expected 9, got 1)
>>>>>>> (test program exited with status code -6)
>>>>>>> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
>>>>>>>
>>>>>>> Anybody in the block team looking at these, or shall we just
>>>>>>> disable this test entirely ?
>>>>>> I ran into this issue today, too:
>>>>>>
>>>>>>     https://gitlab.com/thuth/qemu/-/jobs/3954367101
>>>>>>
>>>>>> ... if nobody is interested in fixing this test, I think we should
>>>>>> disable it...
>>>>>>
>>>>>>     Thomas
>>>>>>
>>>>> I'm looking at this now, and seems that it's broken since
>>>>> 6f592e5aca1a27fe1c1f6 "job.c: enable job lock/unlock and remove
>>>>> Aiocontext locks", as it stops critical section by new
>>>>> aio_context_release() call exactly after bdrv_drain_all_and(), so it's
>>>>> not a surprise that job may start at that moment and the following
>>>>> assertion fires.
>>>>>
>>>>> Emanuele could please look at it?
>>>>>
>>>> Well if I understood correctly, the only thing that was preventing the
>>>> job from continuing was the aiocontext lock held.
>>>>
>>>> The failing assertion even mentions that:
>>>> /* Lock the IO thread to prevent the job from being run */
>>>> [...]
>>>> /* But the job cannot run, so it will remain on standby */
>>>> assert(job->status == JOB_STATUS_STANDBY);
>>>>
>>>> Essentially bdrv_drain_all_end() would wake up the job coroutine, but it
>>>> would anyways block somewhere after since the aiocontext lock was taken
>>>> by the test.
>>>>
>>>> Now that we got rid of aiocontext lock in job code, nothing prevents the
>>>> test from resuming.
>>>> Mixing job lock and aiocontext acquire/release is not a good idea, and
>>>> it would anyways block job_resume() called by bdrv_drain_all_end(), so
>>>> the test itself would deadlock.
>>>>
>>>> So unless @Kevin has a better idea, this seems to be just an "hack" to
>>>> test stuff that is not possible to do now anymore. What I would suggest
>>>> is to get rid of that test, or at least of that assert function. I
>>>> unfortunately cannot reproduce the failure, but I think the remaining
>>>> functions called by the test should run as expected.
>>>>
>>> Thanks! I agree. Probably, alternatively we could just expand the
>>> drained section, like
>>>
>>> @@ -488,12 +488,6 @@ static void test_complete_in_standby(void)
>>>       bdrv_drain_all_begin();
>>>       assert_job_status_is(job, JOB_STATUS_STANDBY);
>>>
>>> -    /* Lock the IO thread to prevent the job from being run */
>>> -    aio_context_acquire(ctx);
>>> -    /* This will schedule the job to resume it */
>>> -    bdrv_drain_all_end();
>>> -    aio_context_release(ctx);
>>> -
>>>       WITH_JOB_LOCK_GUARD() {
>>>           /* But the job cannot run, so it will remain on standby */
>>>           assert(job->status == JOB_STATUS_STANDBY);
>>> @@ -511,6 +505,7 @@ static void test_complete_in_standby(void)
>>>           job_dismiss_locked(&job, &error_abort);
>>>       }
>>>
>>> +    bdrv_drain_all_end();
>>>       aio_context_acquire(ctx);
>>>       destroy_blk(blk);
>>>       aio_context_release(ctx);
>>>
>>>
>>> But, seems that test wanted to specifically test job, that still in
>>> STANDBY exactly after drained section...
>>>
>>> [cc: Hanna]
>>>
>>> Hanna, it was your test (added in
>>> c2c731a4d35062295cd3260e66b3754588a2fad4, two years ago). Don't you
>>> remember was important to catch STANDBY job *after* a drained section?
>> I’m not quite sure, but I think the idea was that we basically try to
>> get as close to something that might come in over QMP.  Over QMP, you
>> can’t issue a job-complete while keeping everything drained, so I
>> wouldn’t just extend the drained section.
>>
>> Getting rid of the assert function also seems pointless.  If we want to
>> test whether job-complete works on tests in standby, we must put the
>> test in standby, and verify this.  We can get rid of the test, of
>> course, but it is a regression test, so it isn’t like it was added just
>> for fun.  Then again, it’s now already effectively commented out via
>> environment variable, so it doesn’t seem like a loss in practice to to
>> fully drop it.
>>
>> Anyway – the thing I wonder about is, if this is to test whether jobs in
>> standby can be completed…  Why don’t we just pause the job instead of
>> going through the context lock hassle?  I.e. just put a job_pause()
>> right after bdrv_drain_all_begin().
>>
>> If I’m not mistaken, reproducing the bug in the test seems really simple
>> by adding a sleep(1) right before WITH_JOB_LOCK_GUARD(); and doing that
>> works just fine if only you have a job_pause() in the drained section.
>> (And dropping the aio_context_acquire()/release() calls, because they
>> don’t do anything anymore.)
>>
> Why sleep(1)? I have the patch ready:
>
> diff --git a/tests/unit/test-blockjob.c b/tests/unit/test-blockjob.c
> index a130f6fefb..8054b72afa 100644
> --- a/tests/unit/test-blockjob.c
> +++ b/tests/unit/test-blockjob.c
> @@ -488,11 +488,15 @@ static void test_complete_in_standby(void)
>       bdrv_drain_all_begin();
>       assert_job_status_is(job, JOB_STATUS_STANDBY);
>
> +    /*
> +     * Increase pause_count so that the counter is
> +     * unbalanced and job won't resume
> +     */
> +    job_pause(job);
> +
>       /* Lock the IO thread to prevent the job from being run */
> -    aio_context_acquire(ctx);
>       /* This will schedule the job to resume it */
>       bdrv_drain_all_end();
> -    aio_context_release(ctx);
>
>       WITH_JOB_LOCK_GUARD() {
>           /* But the job cannot run, so it will remain on standby */
>
> But I don't get why we should sleep additionally.

Yep, that’s the patch.  The sleep was just a note on how to get it to 
reproduce 100 %.

Hanna

next prev parent reply	other threads:[~2023-04-26  8:04 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-03 18:36 test-blockjob: intermittent CI failures in msys2-64bit job Peter Maydell
2023-03-06 11:16 ` Peter Maydell
2023-03-17 10:17   ` Peter Maydell
2023-03-17 12:35     ` Thomas Huth
2023-04-21 10:13       ` Vladimir Sementsov-Ogievskiy
2023-04-24 13:36         ` Emanuele Giuseppe Esposito
2023-04-24 18:32           ` Vladimir Sementsov-Ogievskiy
2023-04-25 16:48             ` Hanna Czenczek
2023-04-25 16:58               ` Vladimir Sementsov-Ogievskiy
2023-04-26  7:38               ` Emanuele Giuseppe Esposito
2023-04-26  8:03                 ` Hanna Czenczek [this message]
2023-04-26  8:17                   ` Emanuele Giuseppe Esposito

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=260c2fc7-df11-1de4-c0d3-fa5697d6a125@redhat.com \
    --to=hreitz@redhat.com \
    --cc=berto@igalia.com \
    --cc=eesposit@redhat.com \
    --cc=jsnow@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=peter.maydell@linaro.org \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=thuth@redhat.com \
    --cc=vsementsov@yandex-team.ru \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).