From: Claudio Fontana <cfontana@suse.de>
To: Thomas Huth <thuth@redhat.com>,
Paolo Bonzini <pbonzini@redhat.com>,
Juan Quintela <quintela@redhat.com>,
"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
Kevin Wolf <kwolf@redhat.com>, Max Reitz <mreitz@redhat.com>
Cc: "Jason J. Herne" <jjherne@linux.ibm.com>,
Fam Zheng <fam@euphon.net>, Liang Yan <lyan@suse.com>,
Peter Maydell <peter.maydell@linaro.org>,
Cornelia Huck <cohuck@redhat.com>,
qemu-devel <qemu-devel@nongnu.org>,
Bruce Rogers <brogers@suse.com>,
Stefan Hajnoczi <stefanha@redhat.com>
Subject: Re: migration: broken snapshot saves appear on s390 when small fields in migration stream removed
Date: Mon, 20 Jul 2020 20:24:11 +0200
Message-ID: <13728e69-75a5-2edc-9ed3-6e08d94c722d@suse.de>
In-Reply-To: <1db6d502-73d1-5e3d-10d1-796d80ab8f07@suse.de>
I have now been able to reproduce this on X86 as well.
It happens much more rarely, about once every 10 times.
I will sort out the data and try to make it even more reproducible, then post my findings in detail.
Overall I proceeded as follows:
1) hooked the savevm code to skip all fields with the exception of "s390-skeys", so only the s390-skeys section is actually saved (a minimal sketch of the idea follows below).
2) reimplemented "s390-skeys" in a common implementation in cpus.c, used on both x86 and s390, modeling the behaviour of save/load from hw/s390
3) ran ./check -qcow2 267 on both x86 and s390.
In the case of s390, the failure seems to be reproducible 100% of the time.
On X86, as mentioned, it fails about 10% of the time.
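
For reference, the hook in step 1 is essentially a filter over the registered savevm sections. Below is a minimal standalone model of the idea (the struct and function names are invented for this sketch; in QEMU the real section list lives in migration/savevm.c):

--------------------------------------------cut-------------------------------------------
/*
 * Standalone model of the step 1 hook (illustrative only, not the actual
 * patch): walk the list of registered sections and skip everything whose
 * name is not "s390-skeys".
 */
#include <stdio.h>
#include <string.h>

typedef struct Section {
    const char *idstr;          /* section name, e.g. "ram", "s390-skeys" */
} Section;

static void save_section(const Section *se)
{
    printf("saving section %s\n", se->idstr);   /* stands in for the real vmsave */
}

int main(void)
{
    Section sections[] = { {"ram"}, {"timer"}, {"icount"}, {"s390-skeys"} };
    size_t n = sizeof(sections) / sizeof(sections[0]);

    for (size_t i = 0; i < n; i++) {
        if (strcmp(sections[i].idstr, "s390-skeys") != 0) {
            continue;           /* the hook: drop every other section */
        }
        save_section(&sections[i]);
    }
    return 0;
}
--------------------------------------------cut-------------------------------------------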
Ciao,
Claudio
On 7/16/20 2:58 PM, Claudio Fontana wrote:
> Small update on this,
>
> On 7/15/20 1:10 PM, Claudio Fontana wrote:
>> Hi Thomas,
>>
>> On 7/14/20 4:35 PM, Thomas Huth wrote:
>>> On 14/07/2020 16.29, Claudio Fontana wrote:
>>>> Hello,
>>>>
>>>> I have made some tiny progress in narrowing down this issue; it may be a qcow2 issue, though this is still unclear,
>>>> so I am involving Kevin Wolf and Max Reitz.
>>>>
>>>>
>>>> The reproducer again:
>>>>
>>>>> --------------------------------------------cut-------------------------------------------
>>>>> diff --git a/cpus.c b/cpus.c
>>>>> index 41d1c5099f..443b88697a 100644
>>>>> --- a/cpus.c
>>>>> +++ b/cpus.c
>>>>> @@ -643,7 +643,7 @@ static void qemu_account_warp_timer(void)
>>>>>
>>>>> static bool icount_state_needed(void *opaque)
>>>>> {
>>>>> - return use_icount;
>>>>> + return 0;
>>>>> }
>>>>>
>>>>> static bool warp_timer_state_needed(void *opaque)
>>>>> --------------------------------------------cut-------------------------------------------
>>>>
>>>> This issue for now appears on s390 only:
>>>>
>>>> On s390 hardware, test 267 fails (both kvm and tcg) in the qcow2 backing file part, with broken migration stream data in the s390-skeys vmsave (old style).
>>> [...]
>>>> If someone has a good idea, let me know - first attempts to reproduce on x86 failed, but maybe more work could get there.
>>>
>>
>> small update: in the GOOD case (enough padding added) a qcow_merge() is triggered for the last write of 16202 bytes.
>> In the BAD case (not enough padding added) a qcow_merge() is not triggered for the last write of 16201 bytes.
>>
>> Note: manually flushing with qemu_fflush in the s390-skeys vmsave also works (this may have gotten lost in the noise).
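
To make the flush note above concrete: qemu_fflush() (migration/qemu-file.c) pushes the buffered stream data out immediately, instead of waiting for the internal buffer to fill up or for the file to be closed. The toy buffered writer below shows the same idea; it is a generic illustration, not QEMUFile code:

--------------------------------------------cut-------------------------------------------
/*
 * Toy buffered writer, only to illustrate the flush note above; this is
 * not QEMUFile code.  Data sits in an internal buffer until the buffer
 * fills up or flush() is called explicitly.
 */
#include <stdio.h>
#include <string.h>

#define BUF_SIZE 32768

static unsigned char buf[BUF_SIZE];
static size_t buf_used;

static void flush(FILE *out)
{
    if (buf_used) {
        fwrite(buf, 1, buf_used, out);   /* one backend write per flush */
        buf_used = 0;
    }
}

static void buffered_write(FILE *out, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len) {
        size_t chunk = BUF_SIZE - buf_used;
        if (chunk > len) {
            chunk = len;
        }
        memcpy(buf + buf_used, p, chunk);
        buf_used += chunk;
        p += chunk;
        len -= chunk;
        if (buf_used == BUF_SIZE) {
            flush(out);                  /* implicit flush on a full buffer */
        }
    }
}

int main(void)
{
    unsigned char payload[16202] = { 0 };
    FILE *out = fopen("toy-stream.bin", "wb");

    if (!out) {
        return 1;
    }
    buffered_write(out, payload, sizeof(payload));
    flush(out);                          /* the "manual flush" experiment */
    fclose(out);
    return 0;
}
--------------------------------------------cut-------------------------------------------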
>>
>>
>>> Two questions:
>>>
>>> 1) Can you also reproduce the issue manually, without running iotest
>>> 267? ... I tried, but so far I failed.
>>
>> Thanks for the suggestion, will try.
>
> I am currently trying to manually reproduce an environment similar to that of the test;
> at the moment I am not able to reproduce the issue by hand.
>
> Not very familiar with s390,
> I've been running with
>
> export QEMU=/home/cfontana/qemu-build/s390x-softmmu/qemu-system-s390x
>
> $QEMU -nographic -monitor stdio -nodefaults -no-shutdown FILENAME
>
> where FILENAME is the qcow2 produced by the test.
>
> let me know if you have a suggestion on how to set up something simple properly.
>
>
>>
>>>
>>> 2) Since all the information so far sounds like the problem could be
>>> elsewhere in the code, and the skeys just catch it by accident ... have
>>> you tried running with valgrind? Maybe it catches something useful?
>>
>> Nothing yet, but will fiddle with the options a bit more.
>
> Only thing I have seen so far:
>
>
> +==33321==
> +==33321== Warning: client switching stacks? SP change: 0x1ffeffe5e8 --> 0x5d9cf60
> +==33321== to suppress, use: --max-stackframe=137324009096 or greater
> +==33321== Warning: client switching stacks? SP change: 0x5d9cd18 --> 0x1ffeffe5e8
> +==33321== to suppress, use: --max-stackframe=137324009680 or greater
> +==33321== Warning: client switching stacks? SP change: 0x1ffeffe8b8 --> 0x5d9ce58
> +==33321== to suppress, use: --max-stackframe=137324010080 or greater
> +==33321== further instances of this message will not be shown.
> +==33321== Thread 4:
> +==33321== Conditional jump or move depends on uninitialised value(s)
> +==33321== at 0x3AEC70: process_queued_cpu_work (cpus-common.c:331)
> +==33321== by 0x2753E1: qemu_wait_io_event_common (cpus.c:1213)
> +==33321== by 0x2755CD: qemu_wait_io_event (cpus.c:1253)
> +==33321== by 0x27596D: qemu_dummy_cpu_thread_fn (cpus.c:1337)
> +==33321== by 0x725C87: qemu_thread_start (qemu-thread-posix.c:521)
> +==33321== by 0x4D504E9: start_thread (in /lib64/libpthread-2.26.so)
> +==33321== by 0x4E72BBD: ??? (in /lib64/libc-2.26.so)
> +==33321==
> +==33321== Conditional jump or move depends on uninitialised value(s)
> +==33321== at 0x3AEC74: process_queued_cpu_work (cpus-common.c:331)
> +==33321== by 0x2753E1: qemu_wait_io_event_common (cpus.c:1213)
> +==33321== by 0x2755CD: qemu_wait_io_event (cpus.c:1253)
> +==33321== by 0x27596D: qemu_dummy_cpu_thread_fn (cpus.c:1337)
> +==33321== by 0x725C87: qemu_thread_start (qemu-thread-posix.c:521)
> +==33321== by 0x4D504E9: start_thread (in /lib64/libpthread-2.26.so)
> +==33321== by 0x4E72BBD: ??? (in /lib64/libc-2.26.so)
> +==33321==
> +==33321==
> +==33321== HEAP SUMMARY:
> +==33321== in use at exit: 2,138,442 bytes in 13,935 blocks
> +==33321== total heap usage: 19,089 allocs, 5,154 frees, 5,187,670 bytes allocated
> +==33321==
> +==33321== LEAK SUMMARY:
> +==33321== definitely lost: 0 bytes in 0 blocks
> +==33321== indirectly lost: 0 bytes in 0 blocks
> +==33321== possibly lost: 7,150 bytes in 111 blocks
> +==33321== still reachable: 2,131,292 bytes in 13,824 blocks
> +==33321== suppressed: 0 bytes in 0 blocks
> +==33321== Rerun with --leak-check=full to see details of leaked memory
>
>
>>
>>>
>>> Thomas
>>>
>>
>> Ciao,
>>
>> Claudio
>>
>>
>
> The more interesting update, I think, is what follows.
>
> I was able to "fix" the problem shown by the reproducer:
>
> @@ -643,7 +643,7 @@ static void qemu_account_warp_timer(void)
>
> static bool icount_state_needed(void *opaque)
> {
> - return use_icount;
> + return 0;
> }
>
> by just slowing down qcow2_co_pwritev_task_entry with some tight loops,
> without changing any fields between runs (other than the reproducer icount field removal).
>
> I tried to insert the same slowdown just in savevm.c at the end of save_snapshot, but that does not work; it needs to be in the coroutine.
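
(For the record, the "tight loops" above were nothing more elaborate than a busy-wait of roughly this shape dropped into the coroutine; the actual iteration count and exact placement are not shown in this thread, so treat them as placeholders:)

--------------------------------------------cut-------------------------------------------
/* Busy-wait used as an artificial slowdown; volatile keeps the compiler
 * from optimising the loop away.  The count is arbitrary. */
void slow_down(void)
{
    for (volatile unsigned long i = 0; i < 100000000UL; i++) {
        /* burn time */
    }
}
--------------------------------------------cut-------------------------------------------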
>
> Thanks,
>
> Claudio
>
>