qemu-devel.nongnu.org archive mirror
From: Thomas Huth <thuth@redhat.com>
To: "Ilya Leoshkevich" <iii@linux.ibm.com>,
	"Alex Bennée" <alex.bennee@linaro.org>
Cc: qemu-s390x@nongnu.org, qemu-devel@nongnu.org,
	Pavel Dovgalyuk <pavel.dovgalyuk@ispras.ru>
Subject: Re: [PATCH RFC] target/s390x: Fix infinite loop during replay
Date: Thu, 4 Dec 2025 10:37:13 +0100
Message-ID: <01700ae1-b5ef-43f2-af2b-2eb648c5a147@redhat.com>
In-Reply-To: <20251201215514.1751994-1-iii@linux.ibm.com>

On 01/12/2025 22.49, Ilya Leoshkevich wrote:
> Hi,
> 
> Here is my attempt to fix [1] based on the discussion in [2].
> 
> I'm sending this as an RFC, because I have definitely misunderstood a
> thing or two about record-replay, missed some timer bookkeeping
> intricacies, and haven't split arch-dependent and independent parts
> into different patches.
> 
> This survives "make check" and "make check-tcg" with the test from [2],
> both with and without extra load in background.
> 
> Please let me know what you think about the approach.
> 
> Best regards,
> Ilya
> 
> [1] https://lore.kernel.org/qemu-devel/a0accce9-6042-4a7b-a7c7-218212818891@redhat.com/
> [2] https://lore.kernel.org/qemu-devel/20251128133949.181828-1-thuth@redhat.com/
> 
> ---
> 
> Replaying even trivial s390x kernels hangs, because:
> 
> - cpu_post_load() fires the TOD timer immediately.
> 
> - s390_tod_load() schedules work for firing the TOD timer.
> 
> - If rr loop sees work and then timer, we get one timer expiration.
> 
> - If rr loop sees timer and then work, we get two timer expirations.
> 
> - Record and replay may diverge due to this race.
> 
> - In this particular case divergence makes replay loop spin: it sees that
>    TOD timer has expired, but cannot invoke its callback, because there
>    is no recorded CHECKPOINT_CLOCK_VIRTUAL.
> 
> - The order in which rr loop sees work and timer depends on whether
>    and when rr loop wakes up during load_snapshot().
> 
> - rr loop may wake up after the main thread kicks the CPU and drops
>    the BQL, which may happen if it calls, e.g., qemu_cond_wait_bql().
> 
> Firing TOD timer twice is duplicate work, but it was introduced
> intentionally in commit 7c12f710bad6 ("s390x/tcg: rearm the CKC timer
> during migration") in order to avoid dependency on migration order.
> 
> The key culprits here are timers that are armed already expired. They
> break the ordering between timers and CPU work, because they are not
> constrained by instruction execution, thus introducing non-determinism
> and record-replay divergence.
> 
> Fix by converting such timer callbacks to CPU work. Also add TOD clock
> updates to the save path, mirroring the load path, in order to have the
> same CHECKPOINT_CLOCK_VIRTUAL during recording and replaying.
> 
> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
> ---
>   hw/s390x/tod.c           |  5 +++++
>   stubs/async-run-on-cpu.c |  7 +++++++
>   stubs/cpus-queue.c       |  4 ++++
>   stubs/meson.build        |  2 ++
>   target/s390x/machine.c   |  4 ++++
>   util/qemu-timer.c        | 30 ++++++++++++++++++++++++++++++
>   6 files changed, 52 insertions(+)
>   create mode 100644 stubs/async-run-on-cpu.c
>   create mode 100644 stubs/cpus-queue.c

Thanks, this indeed fixes the test for me, so:

Tested-by: Thomas Huth <thuth@redhat.com>
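
For readers following the thread, the ordering race described in the
quoted commit message can be modelled with a small C sketch. This is a toy
model and not QEMU code: every name in it (toy_state, fire_timer, run_work,
expirations_racy, expirations_as_work) is hypothetical, and the "fixed"
variant only illustrates the idea of routing an already-expired timer
through the same ordered work queue. It does not claim to reproduce the
actual patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Which event source the loop happens to notice first (hypothetical). */
enum first_seen { WORK_FIRST, TIMER_FIRST };

struct toy_state {
    bool timer_expired; /* timer armed with a deadline already in the past */
    bool work_queued;   /* queued CPU work that also fires the timer */
    int expirations;    /* number of times the timer callback ran */
};

/* Timer callback: consumes the pending expiration and counts it. */
static void fire_timer(struct toy_state *s)
{
    s->timer_expired = false;
    s->expirations++;
}

/* Queued CPU work: in this model it fires the timer callback itself. */
static void run_work(struct toy_state *s)
{
    assert(s->work_queued);
    s->work_queued = false;
    fire_timer(s);
}

/*
 * Racy model: timer and work are two independent sources, so the number
 * of expirations depends on which one the loop sees first.
 */
int expirations_racy(enum first_seen order)
{
    struct toy_state s = { .timer_expired = true, .work_queued = true };

    if (order == WORK_FIRST) {
        run_work(&s);
        if (s.timer_expired) {
            fire_timer(&s);
        }
    } else {
        if (s.timer_expired) {
            fire_timer(&s);
        }
        if (s.work_queued) {
            run_work(&s);
        }
    }
    return s.expirations;
}

/*
 * Sketch of the idea behind the fix: the already-expired timer is queued
 * as work, so both callbacks sit in one FIFO and the wake-up order of the
 * loop no longer matters.
 */
int expirations_as_work(enum first_seen order)
{
    void (*queue[2])(struct toy_state *) = { fire_timer, run_work };
    struct toy_state s = { .timer_expired = false, .work_queued = true };

    (void)order; /* intentionally unused: the FIFO order is fixed */
    for (int i = 0; i < 2; i++) {
        queue[i](&s);
    }
    return s.expirations;
}
```

In this model expirations_racy() returns a different count for WORK_FIRST
and TIMER_FIRST, which is the kind of record/replay divergence the commit
message describes, while expirations_as_work() returns the same count for
both.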



Thread overview: 2+ messages
2025-12-01 21:49 [PATCH RFC] target/s390x: Fix infinite loop during replay Ilya Leoshkevich
2025-12-04  9:37 ` Thomas Huth [this message]
