From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: "Daniel P. Berrangé" <berrange@redhat.com>
Cc: Thomas Huth <thuth@redhat.com>,
qemu-devel@nongnu.org, Paolo Bonzini <pbonzini@redhat.com>,
Laurent Vivier <lvivier@redhat.com>,
Juan Quintela <quintela@redhat.com>,
Cornelia Huck <cohuck@redhat.com>,
qemu-s390x@nongnu.org
Subject: Re: [RFC PATCH 5/5] tests: stop skipping migration test on s390x/ppc64
Date: Tue, 5 Jul 2022 09:38:46 +0100
Message-ID: <YsP4lpXU6GpE4Hs4@work-vm>
In-Reply-To: <YsPxp7386xTTWTrv@redhat.com>
* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Tue, Jul 05, 2022 at 10:06:58AM +0200, Thomas Huth wrote:
> > On 28/06/2022 12.54, Daniel P. Berrangé wrote:
> > > There have been checks put into the migration test which skip it in a
> > > few scenarios
> > >
> > > * ppc64 TCG
> > > * ppc64 KVM with kvm-pr
> > > * s390x TCG
> > >
> > > In the original commits there are references to unexplained hangs in
> > > the test. There is no record of details of where it was hanging, but
> > > it is suspected that these were all a result of the max downtime limit
> > > being set at too low a value to guarantee convergence.
> > >
> > > Since a previous commit bumped the value from 1 second to 30 seconds,
> > > it is believed that hangs due to non-convergence should be eliminated
> > > and thus worth trying to remove the skipped scenarios.
> > >
> > > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> > > ---
> > > tests/qtest/migration-test.c | 21 ---------------------
> > > 1 file changed, 21 deletions(-)
> >
> > I just gave this a try, and it's failing on my x86 laptop with the ppc64 target:
> >
> > /ppc64/migration/auto_converge: qemu-system-ppc64: warning: TCG doesn't
> > support requested feature, cap-cfpc=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-sbbc=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-ibs=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-ccf-assist=on
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-cfpc=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-sbbc=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-ibs=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-ccf-assist=on
> > Memory content inconsistency at df6000 first_byte = 98 last_byte = 98
> > current = 2 hit_edge = 0
98->2 is a strangely large gap, and just one page.
> > Memory content inconsistency at 4e51000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
Yeh that's broken; the way I think about this is you've got a loop
and the guest is following the loop incrementing one page at a time;
if you stop the world you should see one 'edge' where the incrementer
has currently incremented the previous page but hasn't done the current
page yet. e.g. in this case the 'start' of the memory is 98, and we
were seeing 97, so we've run past that 'edge' at some point earlier.
Now we've hit 96, which should be impossible, because all of the 96's
should have incremented out before there was ever a 98 in the loop.
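To make that rule concrete, here is a rough sketch of the 'one edge' check
in C. It is not the real check_guests_ram() from migration-test.c; the
function name and the idea of passing the sampled bytes in as an array are
assumptions made purely for illustration. It walks the per-page counter
bytes in order and allows a single step down by one (mod 256); anything
else counts as a bad page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the real check_guests_ram()): page_bytes[i]
 * is the counter byte sampled from page i while the guest is stopped.
 * Returns how many pages break the "at most one edge" rule.
 */
static size_t count_inconsistent_pages(const uint8_t *page_bytes, size_t n)
{
    assert(page_bytes != NULL || n == 0);
    if (n == 0) {
        return 0;
    }

    uint8_t last_byte = page_bytes[0];
    bool hit_edge = false;
    size_t bad = 0;

    for (size_t i = 1; i < n; i++) {
        uint8_t b = page_bytes[i];

        if (b == last_byte) {
            continue; /* same generation as the previous page */
        }
        if (!hit_edge && (uint8_t)(b + 1) == last_byte) {
            /*
             * The single allowed edge: the incrementer has bumped the
             * earlier pages but has not reached this one yet.
             */
            hit_edge = true;
            last_byte = b;
        } else {
            /* A second edge, or a jump of more than one: inconsistent. */
            bad++;
        }
    }
    return bad;
}
```

With the values from the report, a run of pages reading 98, 97, 96 gives
one legal edge (98 to 97) and then a bad page at 96, since the edge has
already been used up.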
> > Memory content inconsistency at 4e52000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e53000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e54000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e55000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e56000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e57000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e58000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e59000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > and in another 5542 pages**
> > ERROR:../../devel/qemu/tests/qtest/migration-test.c:280:check_guests_ram:
> > assertion failed: (bad == 0)
> > Aborted (core dumped)
> >
> > So I guess this workaround was about a different issue and we should drop
> > this patch.
>
> Yeah, at the very least it needs further investigation.
>
> It is a little worrying though that we get such failures as it smells
> like a genuine bug that we've been missing from having tests disabled.
Yeh I suspect it's a TCG bug not updating the 'changed' flag on the page
*after* writing the data. I believe we've seen a case on ARM.
Dave
>
> With regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK