From: Claudio Fontana <cfontana@suse.de>
To: Thomas Huth <thuth@redhat.com>,
Paolo Bonzini <pbonzini@redhat.com>,
Juan Quintela <quintela@redhat.com>,
"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
Kevin Wolf <kwolf@redhat.com>, Max Reitz <mreitz@redhat.com>
Cc: "Jason J. Herne" <jjherne@linux.ibm.com>,
Fam Zheng <fam@euphon.net>, Liang Yan <lyan@suse.com>,
Peter Maydell <peter.maydell@linaro.org>,
Cornelia Huck <cohuck@redhat.com>,
qemu-devel <qemu-devel@nongnu.org>,
Bruce Rogers <brogers@suse.com>,
Stefan Hajnoczi <stefanha@redhat.com>
Subject: Re: migration: broken snapshot saves appear on s390 when small fields in migration stream removed
Date: Mon, 20 Jul 2020 20:24:11 +0200
Message-ID: <13728e69-75a5-2edc-9ed3-6e08d94c722d@suse.de>
In-Reply-To: <1db6d502-73d1-5e3d-10d1-796d80ab8f07@suse.de>
I have now been able to reproduce this on X86 as well.
It happens much more rarely, about once every 10 times.
I will sort out the data and try to make it even more reproducible, then post my findings in detail.
Overall I proceeded as follows:
1) hooked the savevm code to skip all fields with the exception of "s390-skeys", so only the s390-skeys section is actually saved (a minimal sketch of the idea follows below).
2) reimplemented "s390-skeys" in a common implementation in cpus.c, used on both x86 and s390, modeling the behaviour of save/load from hw/s390
3) ran ./check -qcow2 267 on both x86 and s390.
In the case of s390, the failure seems to be reproducible 100% of the time.
On X86, as mentioned, it fails about 10% of the time.
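
For reference, the hook in step 1 is essentially a filter over the registered savevm sections. Below is a minimal standalone model of the idea (the struct and function names are invented for this sketch; in QEMU the real section list lives in migration/savevm.c):

--------------------------------------------cut-------------------------------------------
/*
 * Standalone model of the step 1 hook (illustrative only, not the actual
 * patch): walk the list of registered sections and skip everything whose
 * name is not "s390-skeys".
 */
#include <stdio.h>
#include <string.h>

typedef struct Section {
    const char *idstr;          /* section name, e.g. "ram", "s390-skeys" */
} Section;

static void save_section(const Section *se)
{
    printf("saving section %s\n", se->idstr);   /* stands in for the real vmsave */
}

int main(void)
{
    Section sections[] = { {"ram"}, {"timer"}, {"icount"}, {"s390-skeys"} };
    size_t n = sizeof(sections) / sizeof(sections[0]);

    for (size_t i = 0; i < n; i++) {
        if (strcmp(sections[i].idstr, "s390-skeys") != 0) {
            continue;           /* the hook: drop every other section */
        }
        save_section(&sections[i]);
    }
    return 0;
}
--------------------------------------------cut-------------------------------------------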
Ciao,
Claudio
On 7/16/20 2:58 PM, Claudio Fontana wrote:
> Small update on this,
>
> On 7/15/20 1:10 PM, Claudio Fontana wrote:
>> Hi Thomas,
>>
>> On 7/14/20 4:35 PM, Thomas Huth wrote:
>>> On 14/07/2020 16.29, Claudio Fontana wrote:
>>>> Hello,
>>>>
>>>> I have made some tiny progress in narrowing down this issue; it may be a qcow2 issue, though this is still unclear,
>>>> so I am involving Kevin Wolf and Max Reitz.
>>>>
>>>>
>>>> The reproducer again:
>>>>
>>>>> --------------------------------------------cut-------------------------------------------
>>>>> diff --git a/cpus.c b/cpus.c
>>>>> index 41d1c5099f..443b88697a 100644
>>>>> --- a/cpus.c
>>>>> +++ b/cpus.c
>>>>> @@ -643,7 +643,7 @@ static void qemu_account_warp_timer(void)
>>>>>
>>>>> static bool icount_state_needed(void *opaque)
>>>>> {
>>>>> - return use_icount;
>>>>> + return 0;
>>>>> }
>>>>>
>>>>> static bool warp_timer_state_needed(void *opaque)
>>>>> --------------------------------------------cut-------------------------------------------
>>>>
>>>> This issue for now appears on s390 only:
>>>>
>>>> On s390 hardware, test 267 fails (both kvm and tcg) in the qcow2 backing file part, with broken migration stream data in the s390-skeys vmsave (old style).
>>> [...]
>>>> If someone has a good idea, let me know - first attempts to reproduce on x86 failed, but maybe more work could get there.
>>>
>>
>> small update: in the GOOD case (enough padding added) a qcow_merge() is triggered for the last write of 16202 bytes.
>> In the BAD case (not enough padding added) a qcow_merge() is not triggered for the last write of 16201 bytes.
>>
>> Note: manually flushing with qemu_fflush in the s390-skeys vmsave also works (this may have gotten lost in the noise).
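
To make the flush note above concrete: qemu_fflush() (migration/qemu-file.c) pushes the buffered stream data out immediately, instead of waiting for the internal buffer to fill up or for the file to be closed. The toy buffered writer below shows the same idea; it is a generic illustration, not QEMUFile code:

--------------------------------------------cut-------------------------------------------
/*
 * Toy buffered writer, only to illustrate the flush note above; this is
 * not QEMUFile code.  Data sits in an internal buffer until the buffer
 * fills up or flush() is called explicitly.
 */
#include <stdio.h>
#include <string.h>

#define BUF_SIZE 32768

static unsigned char buf[BUF_SIZE];
static size_t buf_used;

static void flush(FILE *out)
{
    if (buf_used) {
        fwrite(buf, 1, buf_used, out);   /* one backend write per flush */
        buf_used = 0;
    }
}

static void buffered_write(FILE *out, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len) {
        size_t chunk = BUF_SIZE - buf_used;
        if (chunk > len) {
            chunk = len;
        }
        memcpy(buf + buf_used, p, chunk);
        buf_used += chunk;
        p += chunk;
        len -= chunk;
        if (buf_used == BUF_SIZE) {
            flush(out);                  /* implicit flush on a full buffer */
        }
    }
}

int main(void)
{
    unsigned char payload[16202] = { 0 };
    FILE *out = fopen("toy-stream.bin", "wb");

    if (!out) {
        return 1;
    }
    buffered_write(out, payload, sizeof(payload));
    flush(out);                          /* the "manual flush" experiment */
    fclose(out);
    return 0;
}
--------------------------------------------cut-------------------------------------------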
>>
>>
>>> Two questions:
>>>
>>> 1) Can you also reproduce the issue manually, without running iotest
>>> 267? ... I tried, but so far I failed.
>>
>> Thanks for the suggestion, will try.
>
> I am currently trying to manually reproduce an environment similar to that of the test;
> at the moment I am not able to reproduce the issue by hand.
>
> Not very familiar with s390,
> I've been running with
>
> export QEMU=/home/cfontana/qemu-build/s390x-softmmu/qemu-system-s390x
>
> $QEMU -nographic -monitor stdio -nodefaults -no-shutdown FILENAME
>
> where FILENAME is the qcow2 produced by the test.
>
> let me know if you have a suggestion on how to set up something simple properly.
>
>
>>
>>>
>>> 2) Since all the information so far sounds like the problem could be
>>> elsewhere in the code, and the skeys just catch it by accident ... have
>>> you tried running with valgrind? Maybe it catches something useful?
>>
>> Nothing yet, but will fiddle with the options a bit more.
>
> Only thing I have seen so far:
>
>
> +==33321==
> +==33321== Warning: client switching stacks? SP change: 0x1ffeffe5e8 --> 0x5d9cf60
> +==33321== to suppress, use: --max-stackframe=137324009096 or greater
> +==33321== Warning: client switching stacks? SP change: 0x5d9cd18 --> 0x1ffeffe5e8
> +==33321== to suppress, use: --max-stackframe=137324009680 or greater
> +==33321== Warning: client switching stacks? SP change: 0x1ffeffe8b8 --> 0x5d9ce58
> +==33321== to suppress, use: --max-stackframe=137324010080 or greater
> +==33321== further instances of this message will not be shown.
> +==33321== Thread 4:
> +==33321== Conditional jump or move depends on uninitialised value(s)
> +==33321== at 0x3AEC70: process_queued_cpu_work (cpus-common.c:331)
> +==33321== by 0x2753E1: qemu_wait_io_event_common (cpus.c:1213)
> +==33321== by 0x2755CD: qemu_wait_io_event (cpus.c:1253)
> +==33321== by 0x27596D: qemu_dummy_cpu_thread_fn (cpus.c:1337)
> +==33321== by 0x725C87: qemu_thread_start (qemu-thread-posix.c:521)
> +==33321== by 0x4D504E9: start_thread (in /lib64/libpthread-2.26.so)
> +==33321== by 0x4E72BBD: ??? (in /lib64/libc-2.26.so)
> +==33321==
> +==33321== Conditional jump or move depends on uninitialised value(s)
> +==33321== at 0x3AEC74: process_queued_cpu_work (cpus-common.c:331)
> +==33321== by 0x2753E1: qemu_wait_io_event_common (cpus.c:1213)
> +==33321== by 0x2755CD: qemu_wait_io_event (cpus.c:1253)
> +==33321== by 0x27596D: qemu_dummy_cpu_thread_fn (cpus.c:1337)
> +==33321== by 0x725C87: qemu_thread_start (qemu-thread-posix.c:521)
> +==33321== by 0x4D504E9: start_thread (in /lib64/libpthread-2.26.so)
> +==33321== by 0x4E72BBD: ??? (in /lib64/libc-2.26.so)
> +==33321==
> +==33321==
> +==33321== HEAP SUMMARY:
> +==33321== in use at exit: 2,138,442 bytes in 13,935 blocks
> +==33321== total heap usage: 19,089 allocs, 5,154 frees, 5,187,670 bytes allocated
> +==33321==
> +==33321== LEAK SUMMARY:
> +==33321== definitely lost: 0 bytes in 0 blocks
> +==33321== indirectly lost: 0 bytes in 0 blocks
> +==33321== possibly lost: 7,150 bytes in 111 blocks
> +==33321== still reachable: 2,131,292 bytes in 13,824 blocks
> +==33321== suppressed: 0 bytes in 0 blocks
> +==33321== Rerun with --leak-check=full to see details of leaked memory
>
>
>>
>>>
>>> Thomas
>>>
>>
>> Ciao,
>>
>> Claudio
>>
>>
>
> The more interesting update, I think, is what follows.
>
> I was able to "fix" the problem shown by the reproducer:
>
> @@ -643,7 +643,7 @@ static void qemu_account_warp_timer(void)
>
> static bool icount_state_needed(void *opaque)
> {
> - return use_icount;
> + return 0;
> }
>
> by just slowing down qcow2_co_pwritev_task_entry with some tight loops,
> without changing any fields between runs (other than the reproducer icount field removal).
>
> I tried to insert the same slowdown just in savevm.c at the end of save_snapshot, but that does not work; it needs to be in the coroutine.
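
(For the record, the "tight loops" above were nothing more elaborate than a busy-wait of roughly this shape dropped into the coroutine; the actual iteration count and exact placement are not shown in this thread, so treat them as placeholders:)

--------------------------------------------cut-------------------------------------------
/* Busy-wait used as an artificial slowdown; volatile keeps the compiler
 * from optimising the loop away.  The count is arbitrary. */
void slow_down(void)
{
    for (volatile unsigned long i = 0; i < 100000000UL; i++) {
        /* burn time */
    }
}
--------------------------------------------cut-------------------------------------------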
>
> Thanks,
>
> Claudio
>
>