qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: "Daniel P. Berrangé" <berrange@redhat.com>
To: Fabiano Rosas <farosas@suse.de>
Cc: qemu-devel@nongnu.org, "Peter Xu" <peterx@redhat.com>,
	"Peter Maydell" <peter.maydell@linaro.org>,
	"Alex Bennée" <alex.bennee@linaro.org>,
	"Philippe Mathieu-Daudé" <philmd@linaro.org>,
	"Thomas Huth" <thuth@redhat.com>,
	"Wainer dos Santos Moschetta" <wainersm@redhat.com>
Subject: Re: [PATCH 4/4] ci: Add check-migration-quick to the clang job
Date: Fri, 18 Oct 2024 10:01:31 +0100	[thread overview]
Message-ID: <ZxIj694WqXwwMRIY@redhat.com> (raw)
In-Reply-To: <87r08e3d74.fsf@suse.de>

On Thu, Oct 17, 2024 at 01:29:35PM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Thu, Oct 17, 2024 at 11:32:11AM -0300, Fabiano Rosas wrote:
> >> Recent changes to how we invoke the migration tests have
> >> (intentionally) caused them to not be part of the check-qtest target
> >> anymore. Add the check-migration-quick target so we don't lose
> >> migration code testing in this job.
> >
> > But 'check-migration-quick' is only the subset of migration tests,
> > 'check-migration' is all of the migration tests. So surely this is
> > a massive regressions in covage in CI pipelines.
> 
> I'm not sure it is. There are tests there already for all the major
> parts of the code: precopy, postcopy, multifd, socket. Besides, we can
> tweak migration-quick to cover spots where we think we're losing
> coverage.

Each of the tests in migration-test  were added for a good reason,
generally to address testing gaps where we had functional regressions
in the past. I don't think its a good idea to stop running such tests
in CI as gating on new contributions. Any time we've had optional
tests in QEMU, we've seen repeated regressions in the area in question.

> Since our CI offers nothing in terms of reproducibility or
> debuggability, I don't think it's productive to have an increasing
> amount of tests running in CI if that means we'll be dealing with
> timeouts and intermittent crashes constantly.

Test reliability is a different thing. If a particular test is
flaky, it needs to either be fixed or disabled. Splitting into
a fast & slow grouping doesn't address reliability, just hides
the problem from view.

> > Experience shows us that relying on humans to run tests periodically
> > doesn't work well, and they'll slowly bit rot. Migration maintainers
> > don't have a way to run this as gating test for every pull request
> > that merges, and pull requests coming from non-migration maintainers
> > can still break migration code.
> 
> Right, but migration code would still be tested with migration-quick
> which is executed at every make check. Do we really need the full set in
> every pull request? We must draw a line somewhere, otherwise make check
> will just balloon in duration.

Again, the tests all exist because migration code is incredibly
complicated, with a lot of permutations, with a history of being
very bug / regression prone. With that in mind, it is unavoidable
that we're going to have a significant testing overhead for
migration code.

Looking at its execution time right now, I'd say migration test
is pretty good, considering the permutations we have to target.

It gets a bad reputation because historically it has been as
much as x20 slower than it is today, and has also struggled
with reliability. The latter is a reflection of the complexity
of migration and and IMHO actually justifies greater testing,
as long as we put in time to address bugs.

Also we've got one single test program, covering an entire
subsystem in one go, rather than lots of short individual
test programs, so migration unfairly gets blamed for being
slow, when it simply covers alot of functionality in one
program.

> > Any tests in tree need to be exercised by CI as the minimum bar
> > to prevent bit rot from merges.
> >
> 
> No disagreement here. But then I'm going to need advice on what to do
> when other maintainers ask us to stop writing migration tests because
> they take too long. I cannot send contributors away nor merge code
> without tests.

In general, I think it is unreasonable for other maintainers to
tell us to stop adding test coverage for migration, and would
push back against such a request. 

We should, however, continue to optimize how we add further test
coverage, where practical, overload testing of multiple features
onto a single test case helps.

We've already massively optimized the migration-test compared to
its historical behaviour.

A potentially bigger win could be seen if we change how we exercise
the migration functionality. Since we had the migration qtest that
runs a full migration operation, we've tended to expand testing by
adding new qtest functions. ie we've added a functional test for
everything we want covered. This is nice & simple, but also expensive.
We've ignored unit testing, which I think is a mistake.

If i look at the test list:

# /x86_64/migration/bad_dest
# /x86_64/migration/analyze-script
# /x86_64/migration/validate_uuid
# /x86_64/migration/validate_uuid_error
# /x86_64/migration/validate_uuid_src_not_set
# /x86_64/migration/validate_uuid_dst_not_set
# /x86_64/migration/dirty_ring
# /x86_64/migration/precopy/file
# /x86_64/migration/precopy/unix/plain
# /x86_64/migration/precopy/unix/suspend/live
# /x86_64/migration/precopy/unix/suspend/notlive
# /x86_64/migration/precopy/unix/tls/psk
# /x86_64/migration/precopy/unix/tls/x509/default-host
# /x86_64/migration/precopy/unix/tls/x509/override-host
# /x86_64/migration/precopy/file/offset
# /x86_64/migration/precopy/file/mapped-ram
# /x86_64/migration/precopy/file/offset/fdset
# /x86_64/migration/precopy/file/offset/bad
# /x86_64/migration/precopy/file/mapped-ram/live
# /x86_64/migration/precopy/tcp/plain
# /x86_64/migration/precopy/tcp/plain/switchover-ack
# /x86_64/migration/precopy/tcp/tls/psk/match
# /x86_64/migration/precopy/tcp/tls/psk/mismatch
# /x86_64/migration/precopy/tcp/tls/x509/default-host
# /x86_64/migration/precopy/tcp/tls/x509/override-host
# /x86_64/migration/precopy/tcp/tls/x509/mismatch-host
# /x86_64/migration/precopy/tcp/tls/x509/friendly-client
# /x86_64/migration/precopy/tcp/tls/x509/hostile-client
# /x86_64/migration/precopy/tcp/tls/x509/allow-anon-client
# /x86_64/migration/precopy/tcp/tls/x509/reject-anon-client
# /x86_64/migration/precopy/fd/tcp
# /x86_64/migration/precopy/fd/file
# /x86_64/migration/multifd/file/mapped-ram
# /x86_64/migration/multifd/file/mapped-ram/live
# /x86_64/migration/multifd/file/mapped-ram/dio
# /x86_64/migration/multifd/file/mapped-ram/fdset
# /x86_64/migration/multifd/file/mapped-ram/fdset/dio
# /x86_64/migration/multifd/tcp/uri/plain/none
# /x86_64/migration/multifd/tcp/channels/plain/none
# /x86_64/migration/multifd/tcp/plain/cancel
# /x86_64/migration/multifd/tcp/plain/zlib
# /x86_64/migration/multifd/tcp/plain/zstd
# /x86_64/migration/multifd/tcp/plain/zero-page/legacy
# /x86_64/migration/multifd/tcp/plain/zero-page/none
# /x86_64/migration/multifd/tcp/tls/psk/match
# /x86_64/migration/multifd/tcp/tls/psk/mismatch
# /x86_64/migration/multifd/tcp/tls/x509/default-host
# /x86_64/migration/multifd/tcp/tls/x509/override-host
# /x86_64/migration/multifd/tcp/tls/x509/mismatch-host
# /x86_64/migration/multifd/tcp/tls/x509/allow-anon-client
# /x86_64/migration/multifd/tcp/tls/x509/reject-anon-client
# /x86_64/migration/validate_uri/channels/both_set
# /x86_64/migration/validate_uri/channels/none_set

Individually none of those is very slow on its own - 10 are in
the 2-3 second range,  35 are 1-2 secs,  and 6 are less than
1 second.

A very large portion of those are validating different ways to
establish migration. Hardly any of them actually need to run
a migration to completion. Even without running to completion
though, we have the overheads of spawning 2 QEMUs.

This feels like something that should be amenable to unit testing.
Might need a little re-factoring of migration code to make it
easier to run a subset of its logic in isolation, but that'd
probably be a win anyway, as such work usually makes code cleaner.

> Do we need nightly CI runs? Unit tests? Bear in mind there's a resource
> allocation issue there. Addressing problems with timeouts/races in our
> CI is not something any random person can do.

In terms of running time, I think migration-test is acceptable as it is
to run in 'make check' by default and doesn't justify dropping test
coverage. We should still look to optimize & move to unit testing more
code, and any reliability issues are something that needs to be addressed
too.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



  reply	other threads:[~2024-10-18  9:02 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-17 14:32 [PATCH 0/4] tests/qtest: Move the bulk of migration tests into a separate target Fabiano Rosas
2024-10-17 14:32 ` [PATCH 1/4] tests/qtest: Add check-migration Fabiano Rosas
2024-10-17 15:09   ` Daniel P. Berrangé
2024-10-17 14:32 ` [PATCH 2/4] docs: Add migration tests documentation Fabiano Rosas
2024-10-17 14:32 ` [PATCH 3/4] tests/qtest/migration: Move tests into g_test_slow() Fabiano Rosas
2024-10-17 14:32 ` [PATCH 4/4] ci: Add check-migration-quick to the clang job Fabiano Rosas
2024-10-17 14:57   ` Daniel P. Berrangé
2024-10-17 16:29     ` Fabiano Rosas
2024-10-18  9:01       ` Daniel P. Berrangé [this message]
2024-10-18  9:46         ` Peter Maydell
2024-10-18 10:00           ` Daniel P. Berrangé
2024-10-18 10:09             ` Peter Maydell
2024-10-18 10:38             ` Alex Bennée
2024-10-18 13:51             ` Fabiano Rosas
2024-10-18 13:54               ` Daniel P. Berrangé
2024-10-18 14:28                 ` Fabiano Rosas
2024-10-18 14:33                   ` Daniel P. Berrangé
2024-10-18 13:51         ` Fabiano Rosas
2024-10-18 14:21           ` Daniel P. Berrangé
2024-10-18 14:47             ` Fabiano Rosas
2024-10-18 15:25         ` Peter Maydell
2024-10-18 16:12           ` Daniel P. Berrangé
2024-10-21 14:55           ` Daniel P. Berrangé

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZxIj694WqXwwMRIY@redhat.com \
    --to=berrange@redhat.com \
    --cc=alex.bennee@linaro.org \
    --cc=farosas@suse.de \
    --cc=peter.maydell@linaro.org \
    --cc=peterx@redhat.com \
    --cc=philmd@linaro.org \
    --cc=qemu-devel@nongnu.org \
    --cc=thuth@redhat.com \
    --cc=wainersm@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).