QEMU-Devel Archive on lore.kernel.org
From: Fabiano Rosas <farosas@suse.de>
To: "Pierrick Bouvier" <pierrick.bouvier@oss.qualcomm.com>,
	"Daniel P. Berrangé" <berrange@redhat.com>
Cc: qemu-devel@nongnu.org, "Hanna Reitz" <hreitz@redhat.com>,
	"Alex Bennée" <alex.bennee@linaro.org>,
	qemu-block@nongnu.org, "Cleber Rosa" <crosa@redhat.com>,
	"Kevin Wolf" <kwolf@redhat.com>, "John Snow" <jsnow@redhat.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Philippe Mathieu-Daudé" <philmd@linaro.org>,
	"Thomas Huth" <thuth@redhat.com>
Subject: Re: [PATCH 14/16] tests: add QEMU_TEST_IO_SKIP for skipping I/O tests
Date: Wed, 13 May 2026 11:11:47 -0300
Message-ID: <87v7crwg9o.fsf@suse.de>
In-Reply-To: <dd212ed3-8cde-4397-9585-9eb5439fe44e@oss.qualcomm.com>

Pierrick Bouvier <pierrick.bouvier@oss.qualcomm.com> writes:

> On 5/12/2026 12:00 PM, Daniel P. Berrangé wrote:
>> On Tue, May 12, 2026 at 11:52:45AM -0700, Pierrick Bouvier wrote:
>>> On 5/12/2026 11:46 AM, Daniel P. Berrangé wrote:
>>>> On Tue, May 12, 2026 at 10:53:05AM -0700, Pierrick Bouvier wrote:
>>>>> On 5/12/2026 10:24 AM, Daniel P. Berrangé wrote:
>>>>>> On Tue, May 12, 2026 at 10:09:53AM -0700, Pierrick Bouvier wrote:
>>>>>>> On 5/12/2026 9:53 AM, Daniel P. Berrangé wrote:
>>>>>>>> On Tue, May 12, 2026 at 09:47:12AM -0700, Pierrick Bouvier wrote:
>>>>>>>>> On 5/12/2026 9:36 AM, Daniel P. Berrangé wrote:
>>>>>>>>>> On Tue, May 12, 2026 at 09:19:45AM -0700, Pierrick Bouvier wrote:
>>>>>>>>>>> On 5/12/2026 9:06 AM, Daniel P. Berrangé wrote:
>>>>>>>>>>>> On Tue, May 12, 2026 at 08:56:54AM -0700, Pierrick Bouvier wrote:
>>>>>>>>>>>>> On 4/24/2026 8:42 AM, Daniel P. Berrangé wrote:
>>>>>>>>>>>>>> The nature of block I/O tests is such that there can be unexpected false
>>>>>>>>>>>>>> positive failures in certain scenarios that have not been encountered
>>>>>>>>>>>>>> before, and sometimes non-deterministic failures that are hard to
>>>>>>>>>>>>>> reproduce.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Before enabling the I/O tests as gating jobs in CI, there needs to be a
>>>>>>>>>>>>>> mechanism to dynamically mark tests as skipped, without having to commit
>>>>>>>>>>>>>> code changes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This introduces the QEMU_TEST_IO_SKIP environment variable that is set
>>>>>>>>>>>>>> to a list of FORMAT-OR-PROTOCOL:NAME pairs. The intent is that this
>>>>>>>>>>>>>> variable can be set as a GitLab CI pipeline variable to temporarily
>>>>>>>>>>>>>> disable a test while problems are being debugged.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Reviewed-by: Thomas Huth <thuth@redhat.com>
>>>>>>>>>>>>>> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>  docs/devel/testing/main.rst      |  7 +++++++
>>>>>>>>>>>>>>  tests/qemu-iotests/testrunner.py | 16 ++++++++++++++++
>>>>>>>>>>>>>>  2 files changed, 23 insertions(+)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/docs/devel/testing/main.rst b/docs/devel/testing/main.rst
>>>>>>>>>>>>>> index 797111009a..f779a64415 100644
>>>>>>>>>>>>>> --- a/docs/devel/testing/main.rst
>>>>>>>>>>>>>> +++ b/docs/devel/testing/main.rst
>>>>>>>>>>>>>> @@ -284,6 +284,13 @@ that are specific to certain cache mode.
>>>>>>>>>>>>>>  More options are supported by the ``./check`` script, run ``./check -h`` for
>>>>>>>>>>>>>>  help.
>>>>>>>>>>>>>>  
>>>>>>>>>>>>>> +If a test program is known to be broken, it can be disabled by setting
>>>>>>>>>>>>>> +the ``QEMU_TEST_IO_SKIP`` environment variable with a list of tests to
>>>>>>>>>>>>>> +be skipped. The values are of the form FORMAT-OR-PROTOCOL:NAME, the
>>>>>>>>>>>>>> +leading component can be omitted to skip the test for all formats and
>>>>>>>>>>>>>> +protocols. For example ``export QEMU_TEST_IO_SKIP="luks:149 185 iov-padding"``
>>>>>>>>>>>>>> +will skip ``149`` for LUKS only, and ``185`` and ``iov-padding`` for all.
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>  Writing a new test case
>>>>>>>>>>>>>>  ~~~~~~~~~~~~~~~~~~~~~~~
>>>>>>>>>>>>>>  
>>>>>>>>>>>>>> diff --git a/tests/qemu-iotests/testrunner.py b/tests/qemu-iotests/testrunner.py
>>>>>>>>>>>>>> index dbe2dddc32..ecb5d4529f 100644
>>>>>>>>>>>>>> --- a/tests/qemu-iotests/testrunner.py
>>>>>>>>>>>>>> +++ b/tests/qemu-iotests/testrunner.py
>>>>>>>>>>>>>> @@ -145,6 +145,18 @@ def __init__(self, env: TestEnv, tap: bool = False,
>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>          self._stack: contextlib.ExitStack
>>>>>>>>>>>>>>  
>>>>>>>>>>>>>> +        self.skip = {}
>>>>>>>>>>>>>> +        for rule in os.environ.get("QEMU_TEST_IO_SKIP", "").split(" "):
>>>>>>>>>>>>>> +            rule = rule.strip()
>>>>>>>>>>>>>> +            if rule == "":
>>>>>>>>>>>>>> +                continue
>>>>>>>>>>>>>> +            if ":" in rule:
>>>>>>>>>>>>>> +                fmt, name = rule.split(":")
>>>>>>>>>>>>>> +                if fmt in ("", env.imgfmt, env.imgproto):
>>>>>>>>>>>>>> +                    self.skip[name] = True
>>>>>>>>>>>>>> +            else:
>>>>>>>>>>>>>> +                self.skip[rule] = True
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>      def __enter__(self) -> 'TestRunner':
>>>>>>>>>>>>>>          self._stack = contextlib.ExitStack()
>>>>>>>>>>>>>>          self._stack.enter_context(self.env)
>>>>>>>>>>>>>> @@ -251,6 +263,10 @@ def do_run_test(self, test: str) -> TestResult:
>>>>>>>>>>>>>>                                description='No qualified output '
>>>>>>>>>>>>>>                                            f'(expected {f_reference})')
>>>>>>>>>>>>>>  
>>>>>>>>>>>>>> +        if f_test.name in self.skip:
>>>>>>>>>>>>>> +            return TestResult(status='not run',
>>>>>>>>>>>>>> +                              description='Listed in QEMU_TEST_IO_SKIP')
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>          args = [str(f_test.resolve())]
>>>>>>>>>>>>>>          env = self.env.prepare_subprocess(args)
>>>>>>>>>>>>>>  
>>>>>>>>>>>>>
>>>>>>>>>>>>> Why not simply remove the broken tests, and create issues to add them
>>>>>>>>>>>>> again in the future?
>>>>>>>>>>>>
>>>>>>>>>>>> In theory that's what our policy today is, but in practice it is
>>>>>>>>>>>> too much of a burden on the release co-ordinator, to expect them
>>>>>>>>>>>> to create such a patch themselves, or wait on a subsys maintainer
>>>>>>>>>>>> to do it for them.
>>>>>>>>>>>>
>>>>>>>>>>>> They end up just ignoring brokenness in CI which is a bad practice,
>>>>>>>>>>>> and will prevent us ever making CI truly gating or switching to
>>>>>>>>>>>> using MRs for pull requests. This gives us a super-fast way to skip
>>>>>>>>>>>> flaky tests, while the subsystem maintainers figure out the right
>>>>>>>>>>>> permanent answer.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I disagree on this one, merging a single patch doing a git rm, and a git
>>>>>>>>>>> revert later is not more expensive than merging a change modifying a
>>>>>>>>>>> variable in a yaml file.
>>>>>>>>>>
>>>>>>>>>> Any code changes like that need to be sent back to the subsystem
>>>>>>>>>> maintainer to be acked. IMHO the release manager should not be
>>>>>>>>>> unilaterally deleting tests without peer review.  So that's
>>>>>>>>>> got a non-negligible turn around time, during which CI is broken.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I accept the argument, but it seems like a workaround for a human
>>>>>>>>> process, more than a proper solution to the problem.
>>>>>>>>>
>>>>>>>>> It would be better to have a proper policy for build/test fixes, instead
>>>>>>>>> of implementing local overrides to this.
>>>>>>>>>
>>>>>>>>>> Setting an env variable to skip a problematic test is something
>>>>>>>>>> reasonable to do with zero oversight.
>>>>>>>>>>
>>>>>>>>>>> The issue with this approach is that people running tests locally will
>>>>>>>>>>> not see which tests are skipped, and will see false positives. So you
>>>>>>>>>>> just keep CI green, but not the test base itself.
>>>>>>>>>>
>>>>>>>>>> I would still expect the release manager to file a bug about any
>>>>>>>>>> flaky test they disable via the env var, and the subsystem maintainer
>>>>>>>>>> should still be fixing it or disabling it such that tests won't fail
>>>>>>>>>> more broadly, or deciding to remove it if terminally broken.
>>>>>>>>>>
>>>>>>>>>> We're just decoupling the process so that there is an immediate
>>>>>>>>>> workaround possible. It can also be used by people working in
>>>>>>>>>> their forks - often I've been testing stuff in my fork, but
>>>>>>>>>> see spurious failures because git master has a non-deterministic
>>>>>>>>>> test failure merged. I would like to easily skip those in my fork
>>>>>>>>>> too, without adding extra commits to my working branches, as that
>>>>>>>>>> would require the same commit to be duped into several in-progress
>>>>>>>>>> branches, vs setting the env var once.
>>>>>>>>>>
>>>>>>>>>>> The risk I see is that some tests will stay forever in this skip
>>>>>>>>>>> variable, so it will be dead code for CI, but still alive and failing
>>>>>>>>>>> for people running tests manually who hit the regression.
>>>>>>>>>>
>>>>>>>>>> Again, there should be a bug filed for any flaky test. Anyone can
>>>>>>>>>> do this, if they see it locally or in their fork CI, or in staging
>>>>>>>>>> CI. If no one can see an obvious fix, then anyone can also propose
>>>>>>>>>> to disable the test.
>>>>>>>>>>
>>>>>>>>>>> If you still want an alternative to removing test, implementing a
>>>>>>>>>>> skip_list in tests/qemu-iotests/meson.build is better than an env var
>>>>>>>>>>> IMHO, and achieves the exact same effect, for CI and for users.
>>>>>>>>>>>
>>>>>>>>>>> What do you think?
>>>>>>>>>>
>>>>>>>>>> IMHO there needs to be a way to skip flaky tests which does not
>>>>>>>>>> require code changes as the only available option. Code changes
>>>>>>>>>> are the permanent fix, env var is the immediate workaround.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'm not sure all this answers my question about how to ensure users
>>>>>>>>> who run tests and the CI both see the same skip list.
>>>>>>>>>
>>>>>>>>> I don't mind having an env var, a black list in meson or any other
>>>>>>>>> solution, but having different results on a dev machine and in CI is not
>>>>>>>>> a good design. So whatever the solution is, the CI yaml file is not the
>>>>>>>>> proper place to store this information.
>>>>>>>>
>>>>>>>> AFAICT the test 185 that is being skipped in the CI yaml file only
>>>>>>>> fails when run under gitlab. I've never seen a failure running it
>>>>>>>> locally.
>>>>>>>>
>>>>>>>> If it failed locally too, then I'd agree that it should not be
>>>>>>>> skipped in the CI yaml, but universally skipped in all scenarios.
>>>>>>>>
>>>>>>>
>>>>>>> If I get all this correctly, we add a generic mechanic to be able to
>>>>>>> gate CI with block tests just because there is a single test failing
>>>>>>> with a single driver. Is that the right approach?
>>>>>>
>>>>>> The env variable is the generic mechanism.
>>>>>>
>>>>>> The yaml file exclusion for 185 is the special case, but we get
>>>>>> that basically for free with the former.
>>>>>>
>>>>>>> In the future, do we expect to merge code breaking tests?
>>>>>>
>>>>>> Yes. We will certainly merge more non-deterministic tests. We've seen
>>>>>> this over & over again. Something passes CI initially but after a
>>>>>> number of CI pipelines turns out to be flaky
>>>>>>
>>>>>
>>>>> Then we can mark them as flaky in tests/qemu-iotests/meson.build.
>>>>
>>>> That is a long term solution. It does not address the immediate
>>>> time critical goal to have the ability to fix a broken CI pipeline
>>>> immediately by skipping the test without waiting for code changes.
>>>>
>>>>> It seems like you ignore the point that there is a problem between
>>>>> setting something in CI only vs making something that works for all
>>>>> users. I'm not against an env var, I just don't see how it answers this
>>>>> need.
>>>>
>>>> Again, I'm not saying that we fix this only for CI. The env var is
>>>> to allow broken jobs to be immediately skipped, while waiting for
>>>> code changes to permanently skip/fix the tests. The latter
>>>> addresses it for every scenario.
>>>>
>>>
>>> I might have missed where we have a default value for this env var, out
>>> of yaml file, that makes it apply the exact same set of skip tests for
>>> CI, and for users running tests manually.
>>>
>>> Where is this default applied for both CI and users?
>>>
>>> I understand it's not needed for test 185 which fails only in GitLab,
>>> but as you mentioned, we'll probably have non deterministic tests in the
>>> future, so we need to consider this.
>> 
>> I was considering any change in meson.build to permanently skip a
>> test would be independent of the env var handling, and outside the
>> scope of this series since there's no need for it here.
>>
>
> Where is the default value for this env var applied for both CI and
> users? yaml is for CI only.
>

Hi all,

I think Pierrick's concern is valid, but I see it slightly differently:

In the case of an environment variable, it's possible to set it without
any code change (even to .yaml files), by passing
"-o ci.variable=<name>=<val>" at git push time, or by setting it directly
in GitLab's web interface under "New Pipeline". So it allows anyone
wanting to "just get work done" to skip broken tests for the duration of
a pipeline.

It also works when running locally, to skip some tests for a single
local run.

All temporary changes, of course.
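
To make the semantics concrete, here is a minimal sketch of how a
QEMU_TEST_IO_SKIP value is matched against the current format/protocol,
following the rules described in the patch quoted above. It is an
illustration only, not the code from the series: the function names
parse_skip_rules and skip_set_from_env are made up for this example, and
unlike the patch it splits on any whitespace and only on the first colon.

```python
import os
from typing import Set


def parse_skip_rules(value: str, imgfmt: str, imgproto: str) -> Set[str]:
    """Return the test names to skip for the given format and protocol.

    Hypothetical helper: rules are whitespace-separated and have the form
    FORMAT-OR-PROTOCOL:NAME, or just NAME to skip for every driver.
    """
    skip: Set[str] = set()
    for rule in value.split():
        # partition() splits on the first colon only
        driver, sep, name = rule.partition(":")
        if not sep:
            # No driver prefix: skip this test unconditionally
            skip.add(rule)
        elif driver in ("", imgfmt, imgproto):
            # Prefix is empty or matches the driver under test
            skip.add(name)
    return skip


def skip_set_from_env(imgfmt: str, imgproto: str) -> Set[str]:
    """Read the rules from the environment; an unset variable skips nothing."""
    return parse_skip_rules(os.environ.get("QEMU_TEST_IO_SKIP", ""),
                            imgfmt, imgproto)
```

With QEMU_TEST_IO_SKIP="luks:149 185 iov-padding", a luks run would skip
149, 185 and iov-padding, while a qcow2 run would skip only 185 and
iov-padding.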

For the case of a flaky test that will need to be disabled for longer
than a single run, could we standardize on a list committed to
meson.build, so it works for both CI and local?  I think so, and I think
it's a good idea. Having recently taken up maintainership of QTests I
sometimes stumble upon disabled tests that have been there for years and
that everyone forgot about. Same with the migration subsystem. Having a
single place to check from time to time would be helpful.

Migration, by the way, has been using the env variable approach in qtest
like this:

    /*
     * Our CI system has problems with shared memory.
     * Don't run this test until we find a workaround.
     */
    if (getenv("QEMU_TEST_FLAKY_TESTS")) {

Which is a case in which a single list committed in meson.build wouldn't
help. We actually do want to skip this only for CI.
("problems with shared memory" here means that CI environments simply
don't have enough shared memory space available to run all the tests
that make use of it)

A further complication is that in the migration case above, that skip
does not apply to the entire migration-test, it applies to a sub-test
(./migration-test -p /x86/migration/mode/reboot). Constructs in
meson.build would not be able to skip this, they can only skip the
entire migration-test. There is an -x option that gtester accepts, so
we'd need to write some meson-fu to invoke "./migration-test -x
/x86/migration/mode/reboot -x ..." if any subtest is to be skipped (note
that this is not migration-specific, all qtests that invoke
g_test_add_data* more than once are like this).
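
As a rough sketch of what such glue would have to produce, the snippet
below expands a list of sub-test paths into the command line described
above. The helper name build_qtest_argv is hypothetical, and the "-x" flag
is taken as-is from the paragraph above rather than checked against the
GLib test runner, so it is kept as a parameter.

```python
from typing import Iterable, List


def build_qtest_argv(binary: str,
                     skipped_subtests: Iterable[str],
                     skip_flag: str = "-x") -> List[str]:
    """Build an argv that skips individual sub-tests of one test binary.

    Hypothetical helper; skip_flag defaults to the option named in the
    message and should be adjusted to whatever the runner really accepts.
    """
    argv = [binary]
    for path in skipped_subtests:
        # One flag/path pair per skipped sub-test
        argv += [skip_flag, path]
    return argv
```

For example, build_qtest_argv("./migration-test",
["/x86/migration/mode/reboot"]) yields the three-element command line
"./migration-test -x /x86/migration/mode/reboot".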

Maybe we could work on a proposal to improve this in a generic way for
all test frameworks.

PS: I'd even say there's a lot we could make common between
frameworks. A bunch of what happens around running a test is generic
programming, not tied to a specific programming language or accelerator
(in the case of qtest).


>> 
>> With regards,
>> Daniel


Thread overview: 57+ messages
2026-04-24 15:41 [PATCH 00/16] tests: do more testing of block drivers Daniel P. Berrangé
2026-04-24 15:41 ` [PATCH 01/16] python: bump qemu.qmp to v0.0.6 Daniel P. Berrangé
2026-05-12 15:37   ` Pierrick Bouvier
2026-04-24 15:41 ` [PATCH 02/16] gitlab: ensure all meson jobs capture build/meson-logs by default Daniel P. Berrangé
2026-05-12 15:38   ` Pierrick Bouvier
2026-04-24 15:41 ` [PATCH 03/16] tests: print reason when I/O test is skipped in TAP mode Daniel P. Berrangé
2026-05-12 15:38   ` Pierrick Bouvier
2026-04-24 15:41 ` [PATCH 04/16] tests: remove redundant meson suite for iotests Daniel P. Berrangé
2026-05-12 15:42   ` Pierrick Bouvier
2026-04-24 15:41 ` [PATCH 05/16] tests: ensure all qcow2 I/O tests are able to be run via make Daniel P. Berrangé
2026-04-24 15:41 ` [PATCH 06/16] scripts/mtest2make: ensure output has stable sorting Daniel P. Berrangé
2026-04-24 15:41 ` [PATCH 07/16] scripts/mtest2make: support optional tests grouping Daniel P. Berrangé
2026-05-12 15:45   ` Pierrick Bouvier
2026-05-13 10:08     ` Daniel P. Berrangé
2026-05-13 15:49       ` Pierrick Bouvier
2026-05-13 17:15         ` Daniel P. Berrangé
2026-05-13 17:23           ` Pierrick Bouvier
2026-05-13 17:26             ` Daniel P. Berrangé
2026-05-13 17:32               ` Pierrick Bouvier
2026-04-24 15:41 ` [PATCH 08/16] tests: add a meson suite / make target per block I/O tests format Daniel P. Berrangé
2026-05-12 15:46   ` Pierrick Bouvier
2026-04-24 15:41 ` [PATCH 09/16] docs/devel/testing: expand documentation for 'make check-block' Daniel P. Berrangé
2026-05-12 15:47   ` Pierrick Bouvier
2026-04-24 15:41 ` [PATCH 10/16] tests: add nbd and luks to the I/O test suites Daniel P. Berrangé
2026-05-12 15:47   ` Pierrick Bouvier
2026-04-24 15:41 ` [PATCH 11/16] tests: use 'driver' as collective term for either format or protocol Daniel P. Berrangé
2026-05-12 15:52   ` Pierrick Bouvier
2026-04-24 15:42 ` [PATCH 12/16] tests: validate dmsetup result in test 128 Daniel P. Berrangé
2026-05-12 15:53   ` Pierrick Bouvier
2026-05-13 10:11     ` Daniel P. Berrangé
2026-05-13 15:51       ` Pierrick Bouvier
2026-04-24 15:42 ` [PATCH 13/16] tests: fix check for sudo access in LUKS I/O test Daniel P. Berrangé
2026-05-12 15:54   ` Pierrick Bouvier
2026-04-24 15:42 ` [PATCH 14/16] tests: add QEMU_TEST_IO_SKIP for skipping I/O tests Daniel P. Berrangé
2026-05-12 15:56   ` Pierrick Bouvier
2026-05-12 16:06     ` Daniel P. Berrangé
2026-05-12 16:19       ` Pierrick Bouvier
2026-05-12 16:36         ` Daniel P. Berrangé
2026-05-12 16:47           ` Pierrick Bouvier
2026-05-12 16:53             ` Daniel P. Berrangé
2026-05-12 17:09               ` Pierrick Bouvier
2026-05-12 17:24                 ` Daniel P. Berrangé
2026-05-12 17:53                   ` Pierrick Bouvier
2026-05-12 18:46                     ` Daniel P. Berrangé
2026-05-12 18:52                       ` Pierrick Bouvier
2026-05-12 19:00                         ` Daniel P. Berrangé
2026-05-12 19:12                           ` Pierrick Bouvier
2026-05-13 14:11                             ` Fabiano Rosas [this message]
2026-05-13 14:58                               ` Daniel P. Berrangé
2026-05-13  6:18           ` Thomas Huth
2026-05-13 15:53             ` Pierrick Bouvier
2026-04-24 15:42 ` [PATCH 15/16] gitlab: add jobs for thorough block tests Daniel P. Berrangé
2026-05-12 15:59   ` Pierrick Bouvier
2026-04-24 15:42 ` [PATCH 16/16] gitlab: remove I/O tests from build-tcg-disabled job Daniel P. Berrangé
2026-04-25  6:53   ` Thomas Huth
2026-05-12 15:47   ` Pierrick Bouvier
2026-05-12 13:53 ` [PATCH 00/16] tests: do more testing of block drivers Daniel P. Berrangé
