Re: test-blockjob: intermittent CI failures in msys2-64bit job

From: Hanna Czenczek <hreitz@redhat.com>
To: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>,
	Emanuele Giuseppe Esposito <eesposit@redhat.com>,
	Thomas Huth <thuth@redhat.com>,
	Peter Maydell <peter.maydell@linaro.org>,
	John Snow <jsnow@redhat.com>, Kevin Wolf <kwolf@redhat.com>,
	Alberto Garcia <berto@igalia.com>
Cc: QEMU Developers <qemu-devel@nongnu.org>,
	Qemu-block <qemu-block@nongnu.org>
Subject: Re: test-blockjob: intermittent CI failures in msys2-64bit job
Date: Tue, 25 Apr 2023 18:48:08 +0200
Message-ID: <78ad8ed0-e68a-5f4e-9627-454f0bd007fb@redhat.com>
In-Reply-To: <f3f2e92c-d357-4fb9-e765-7f89dc895247@yandex-team.ru>

On 24.04.23 20:32, Vladimir Sementsov-Ogievskiy wrote:
> On 24.04.23 16:36, Emanuele Giuseppe Esposito wrote:
>>
>>
>> Am 21/04/2023 um 12:13 schrieb Vladimir Sementsov-Ogievskiy:
>>> On 17.03.23 15:35, Thomas Huth wrote:
>>>> On 17/03/2023 11.17, Peter Maydell wrote:
>>>>> On Mon, 6 Mar 2023 at 11:16, Peter Maydell <peter.maydell@linaro.org>
>>>>> wrote:
>>>>>>
>>>>>> On Fri, 3 Mar 2023 at 18:36, Peter Maydell
>>>>>> <peter.maydell@linaro.org> wrote:
>>>>>>>
>>>>>>> I've noticed that test-blockjob seems to fail intermittently
>>>>>>> on the msys2-64bit job:
>>>>>>>
>>>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3872508803
>>>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3871061024
>>>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3865312440
>>>>>>>
>>>>>>> Sample output:
>>>>>>> | 53/89
>>>>>>> ERROR:../tests/unit/test-blockjob.c:499:test_complete_in_standby:
>>>>>>> assertion failed: (job->status == JOB_STATUS_STANDBY) ERROR
>>>>>>> 53/89 qemu:unit / test-blockjob ERROR 0.08s exit status 3
>>>>>
>>>>>> Here's an intermittent failure from my macos x86 machine:
>>>>>>
>>>>>> 172/621 qemu:unit / test-blockjob
>>>>>>              ERROR           0.26s   killed by signal 6 SIGABRT
>>>>>
>>>>> And an intermittent on the freebsd 13 CI job:
>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3950667240
>>>>>
>>>>>>>> MALLOC_PERTURB_=197
>>>>>>>> G_TEST_BUILDDIR=/tmp/cirrus-ci-build/build/tests/unit
>>>>>>>> G_TEST_SRCDIR=/tmp/cirrus-ci-build/tests/unit
>>>>>>>> /tmp/cirrus-ci-build/build/tests/unit/test-blockjob --tap -k
>>>>> ▶ 178/650 /blockjob/ids
>>>>>              OK
>>>>> 178/650 qemu:unit / test-blockjob
>>>>>              ERROR           0.31s   killed by signal 6 SIGABRT
>>>>> ――――――――――――――――――――――――――――――――――――― ✀
>>>>> ―――――――――――――――――――――――――――――――――――――
>>>>> stderr:
>>>>> Assertion failed: (job->status == JOB_STATUS_STANDBY), function
>>>>> test_complete_in_standby, file ../tests/unit/test-blockjob.c, line
>>>>> 499.
>>>>>
>>>>>
>>>>> TAP parsing error: Too few tests run (expected 9, got 1)
>>>>> (test program exited with status code -6)
>>>>> ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― 
>>>>>
>>>>>
>>>>> Anybody in the block team looking at these, or shall we just
>>>>> disable this test entirely ?
>>>>
>>>> I ran into this issue today, too:
>>>>
>>>>    https://gitlab.com/thuth/qemu/-/jobs/3954367101
>>>>
>>>> ... if nobody is interested in fixing this test, I think we should
>>>> disable it...
>>>>
>>>>    Thomas
>>>>
>>>
>>> I'm looking at this now, and it seems that it has been broken since
>>> 6f592e5aca1a27fe1c1f6 "job.c: enable job lock/unlock and remove
>>> Aiocontext locks", as the new aio_context_release() call ends the
>>> critical section right after bdrv_drain_all_end(), so it's no
>>> surprise that the job may start at that moment and the following
>>> assertion fires.
>>>
>>> Emanuele, could you please look at it?
>>>
>> Well if I understood correctly, the only thing that was preventing the
>> job from continuing was the aiocontext lock held.
>>
>> The failing assertion even mentions that:
>> /* Lock the IO thread to prevent the job from being run */
>> [...]
>> /* But the job cannot run, so it will remain on standby */
>> assert(job->status == JOB_STATUS_STANDBY);
>>
>> Essentially bdrv_drain_all_end() would wake up the job coroutine, but it
>> would anyways block somewhere after since the aiocontext lock was taken
>> by the test.
>>
>> Now that we got rid of the aiocontext lock in the job code, nothing
>> prevents the job from resuming.
>> Mixing job lock and aiocontext acquire/release is not a good idea, and
>> it would anyways block job_resume() called by bdrv_drain_all_end(), so
>> the test itself would deadlock.
>>
>> So unless @Kevin has a better idea, this seems to be just a "hack" to
>> test something that is no longer possible. What I would suggest
>> is to get rid of that test, or at least of that assert function. I
>> unfortunately cannot reproduce the failure, but I think the remaining
>> functions called by the test should run as expected.
>>
>
> Thanks! I agree. Alternatively, we could probably just extend the
> drained section, like this:
>
> @@ -488,12 +488,6 @@ static void test_complete_in_standby(void)
>      bdrv_drain_all_begin();
>      assert_job_status_is(job, JOB_STATUS_STANDBY);
>
> -    /* Lock the IO thread to prevent the job from being run */
> -    aio_context_acquire(ctx);
> -    /* This will schedule the job to resume it */
> -    bdrv_drain_all_end();
> -    aio_context_release(ctx);
> -
>      WITH_JOB_LOCK_GUARD() {
>          /* But the job cannot run, so it will remain on standby */
>          assert(job->status == JOB_STATUS_STANDBY);
> @@ -511,6 +505,7 @@ static void test_complete_in_standby(void)
>          job_dismiss_locked(&job, &error_abort);
>      }
>
> +    bdrv_drain_all_end();
>      aio_context_acquire(ctx);
>      destroy_blk(blk);
>      aio_context_release(ctx);
>
>
> But it seems the test specifically wanted to check a job that is still
> in STANDBY right after the drained section...
>
> [cc: Hanna]
>
> Hanna, it was your test (added in
> c2c731a4d35062295cd3260e66b3754588a2fad4, two years ago). Do you
> remember whether it was important to catch a STANDBY job *after* a
> drained section?

I’m not quite sure, but I think the idea was that we basically try to 
get as close to something that might come in over QMP.  Over QMP, you 
can’t issue a job-complete while keeping everything drained, so I 
wouldn’t just extend the drained section.

Getting rid of the assert function also seems pointless.  If we want to 
test whether job-complete works on jobs in standby, we must put the 
job in standby, and verify this.  We can get rid of the test, of 
course, but it is a regression test, so it isn’t like it was added just 
for fun.  Then again, it’s now already effectively commented out via 
environment variable, so it doesn’t seem like a loss in practice to 
fully drop it.

Anyway – the thing I wonder about is, if this is to test whether jobs in 
standby can be completed…  Why don’t we just pause the job instead of 
going through the context lock hassle?  I.e. just put a job_pause() 
right after bdrv_drain_all_begin().

If I’m not mistaken, reproducing the bug in the test seems really simple 
by adding a sleep(1) right before WITH_JOB_LOCK_GUARD(); and doing that 
works just fine if only you have a job_pause() in the drained section.  
(And dropping the aio_context_acquire()/release() calls, because they 
don’t do anything anymore.)
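
To illustrate why an explicit pause should close the race, here is a toy
model in plain C.  It is only a sketch and not QEMU code: ToyJob and all
toy_* names are made up for this example, and it assumes that a drained
section and job_pause() both act on one shared pause counter, with the
job leaving standby only when that counter drops back to zero.  It also
models the worst case, in which the job runs the moment nothing holds it
paused, which is what the sleep(1) provokes in the real test.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: every name here is hypothetical, not QEMU's API. */
typedef enum { TOY_READY, TOY_STANDBY } ToyStatus;

typedef struct {
    int pause_count;   /* number of outstanding pause requests */
    ToyStatus status;  /* TOY_STANDBY while paused, TOY_READY otherwise */
} ToyJob;

/* Register one more pause request and put the job on standby. */
static void toy_pause(ToyJob *job)
{
    job->pause_count++;
    job->status = TOY_STANDBY;
}

/* Drop one pause request; the job may run again once none are left. */
static void toy_resume(ToyJob *job)
{
    assert(job->pause_count > 0);
    if (--job->pause_count == 0) {
        job->status = TOY_READY;
    }
}

/* In this model a drained section is just a pause/resume pair. */
static void toy_drain_begin(ToyJob *job) { toy_pause(job); }
static void toy_drain_end(ToyJob *job)   { toy_resume(job); }

/* Returns true if the job is still on standby after the drained section. */
static bool standby_after_drain(bool extra_pause)
{
    ToyJob job = { .pause_count = 0, .status = TOY_READY };

    toy_drain_begin(&job);
    if (extra_pause) {
        toy_pause(&job);   /* stands in for job_pause() inside the drain */
    }
    toy_drain_end(&job);

    return job.status == TOY_STANDBY;
}
```

In this model, standby_after_drain(false) is the current test: once the
drained section ends, nothing keeps the job paused.  With
standby_after_drain(true), the extra pause request outlives the drained
section, so the standby assertion would hold no matter how long the test
waits before checking it.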

Hanna

Thread overview (12+ messages):
2023-03-03 18:36 test-blockjob: intermittent CI failures in msys2-64bit job Peter Maydell
2023-03-06 11:16 ` Peter Maydell
2023-03-17 10:17   ` Peter Maydell
2023-03-17 12:35     ` Thomas Huth
2023-04-21 10:13       ` Vladimir Sementsov-Ogievskiy
2023-04-24 13:36         ` Emanuele Giuseppe Esposito
2023-04-24 18:32           ` Vladimir Sementsov-Ogievskiy
2023-04-25 16:48             ` Hanna Czenczek [this message]
2023-04-25 16:58               ` Vladimir Sementsov-Ogievskiy
2023-04-26  7:38               ` Emanuele Giuseppe Esposito
2023-04-26  8:03                 ` Hanna Czenczek
2023-04-26  8:17                   ` Emanuele Giuseppe Esposito
