* [PATCH RFC] target/s390x: Fix infinite loop during replay
@ 2025-12-01 21:49 Ilya Leoshkevich
2025-12-04 9:37 ` Thomas Huth
0 siblings, 1 reply; 2+ messages in thread
From: Ilya Leoshkevich @ 2025-12-01 21:49 UTC (permalink / raw)
To: Alex Bennée, Thomas Huth
Cc: qemu-s390x, qemu-devel, Pavel Dovgalyuk, Ilya Leoshkevich
Hi,
Here is my attempt to fix [1] based on the discussion in [2].
I'm sending this as an RFC, because I have definitely misunderstood a
thing or two about record-replay, missed some timer bookkeeping
intricacies, and haven't split arch-dependent and independent parts
into different patches.
This survives "make check" and "make check-tcg" with the test from [2],
both with and without extra load in background.
Please let me know what you think about the approach.
Best regards,
Ilya
[1] https://lore.kernel.org/qemu-devel/a0accce9-6042-4a7b-a7c7-218212818891@redhat.com/
[2] https://lore.kernel.org/qemu-devel/20251128133949.181828-1-thuth@redhat.com/
---
Replaying even trivial s390x kernels hangs, because:
- cpu_post_load() fires the TOD timer immediately.
- s390_tod_load() schedules work for firing the TOD timer.
- If rr loop sees work and then timer, we get one timer expiration.
- If rr loop sees timer and then work, we get two timer expirations.
- Record and replay may diverge due to this race.
- In this particular case divergence makes replay loop spin: it sees that
TOD timer has expired, but cannot invoke its callback, because there
is no recorded CHECKPOINT_CLOCK_VIRTUAL.
- The order in which rr loop sees work and timer depends on whether
and when rr loop wakes up during load_snapshot().
- rr loop may wake up after the main thread kicks the CPU and drops
the BQL, which may happen if it calls, e.g., qemu_cond_wait_bql().
Firing TOD timer twice is duplicate work, but it was introduced
intentionally in commit 7c12f710bad6 ("s390x/tcg: rearm the CKC timer
during migration") in order to avoid dependency on migration order.
The key culprits here are timers that are armed ready expired. They
break the ordering between timers and CPU work, because they are not
constrained by instruction execution, thus introducing non-determinism
and record-replay divergence.
Fix by converting such timer callbacks to CPU work. Also add TOD clock
updates to the save path, mirroring the load path, in order to have the
same CHECKPOINT_CLOCK_VIRTUAL during recording and replaying.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
---
hw/s390x/tod.c | 5 +++++
stubs/async-run-on-cpu.c | 7 +++++++
stubs/cpus-queue.c | 4 ++++
stubs/meson.build | 2 ++
target/s390x/machine.c | 4 ++++
util/qemu-timer.c | 30 ++++++++++++++++++++++++++++++
6 files changed, 52 insertions(+)
create mode 100644 stubs/async-run-on-cpu.c
create mode 100644 stubs/cpus-queue.c
diff --git a/hw/s390x/tod.c b/hw/s390x/tod.c
index 3f913cc88ab..81bce90c030 100644
--- a/hw/s390x/tod.c
+++ b/hw/s390x/tod.c
@@ -72,6 +72,11 @@ static void s390_tod_save(QEMUFile *f, void *opaque)
qemu_put_byte(f, S390_TOD_CLOCK_VALUE_PRESENT);
qemu_put_byte(f, tod.high);
qemu_put_be64(f, tod.low);
+
+ tdc->set(td, &tod, &err);
+ if (err) {
+ warn_report_err(err);
+ }
}
static int s390_tod_load(QEMUFile *f, void *opaque, int version_id)
diff --git a/stubs/async-run-on-cpu.c b/stubs/async-run-on-cpu.c
new file mode 100644
index 00000000000..adf1867ad21
--- /dev/null
+++ b/stubs/async-run-on-cpu.c
@@ -0,0 +1,7 @@
+#include "qemu/osdep.h"
+#include "hw/core/cpu.h"
+
+void async_run_on_cpu(CPUState *cpu, run_on_cpu_func func, run_on_cpu_data data)
+{
+ abort();
+}
diff --git a/stubs/cpus-queue.c b/stubs/cpus-queue.c
new file mode 100644
index 00000000000..fd678f42969
--- /dev/null
+++ b/stubs/cpus-queue.c
@@ -0,0 +1,4 @@
+#include "qemu/osdep.h"
+#include "hw/core/cpu.h"
+
+CPUTailQ cpus_queue = QTAILQ_HEAD_INITIALIZER(cpus_queue);
diff --git a/stubs/meson.build b/stubs/meson.build
index 0b2778c568e..d3b551f4def 100644
--- a/stubs/meson.build
+++ b/stubs/meson.build
@@ -41,6 +41,8 @@ if have_block or have_ga
stub_ss.add(files('monitor-internal.c'))
stub_ss.add(files('qmp-command-available.c'))
stub_ss.add(files('qmp-quit.c'))
+ stub_ss.add(files('async-run-on-cpu.c'))
+ stub_ss.add(files('cpus-queue.c'))
endif
if have_block or have_user
diff --git a/target/s390x/machine.c b/target/s390x/machine.c
index 3bea6103ffb..f714834a98a 100644
--- a/target/s390x/machine.c
+++ b/target/s390x/machine.c
@@ -52,6 +52,10 @@ static int cpu_pre_save(void *opaque)
kvm_s390_vcpu_interrupt_pre_save(cpu);
}
+ if (tcg_enabled()) {
+ tcg_s390_tod_updated(CPU(cpu), RUN_ON_CPU_NULL);
+ }
+
return 0;
}
diff --git a/util/qemu-timer.c b/util/qemu-timer.c
index 2a6be4c7f95..d93a020064f 100644
--- a/util/qemu-timer.c
+++ b/util/qemu-timer.c
@@ -30,6 +30,7 @@
#include "exec/icount.h"
#include "system/replay.h"
#include "system/cpus.h"
+#include "hw/core/cpu.h"
#ifdef CONFIG_POSIX
#include <pthread.h>
@@ -387,11 +388,40 @@ static void timer_del_locked(QEMUTimerList *timer_list, QEMUTimer *ts)
}
}
+static void timer_fire(CPUState *cpu, run_on_cpu_data data)
+{
+ QEMUTimer *t = data.host_ptr;
+
+ t->cb(t->opaque);
+}
+
static bool timer_mod_ns_locked(QEMUTimerList *timer_list,
QEMUTimer *ts, int64_t expire_time)
{
QEMUTimer **pt, *t;
+ /*
+ * Normally during record-replay virtual clock timers and CPU work are
+ * deterministically ordered. This is because the virtual clock can be
+ * advanced only by instructions running on a CPU.
+ *
+ * A notable exception are timers that are armed already expired. Their
+ * expiration is not constrained by instruction execution, and, therefore,
+ * their ordering relative to CPU work is affected by what the
+ * record-replay thread is doing when they are armed. This introduces
+ * non-determinism.
+ *
+ * Convert such timers to CPU work in order to avoid it.
+ */
+ if (replay_mode != REPLAY_MODE_NONE &&
+ timer_list->clock->type == QEMU_CLOCK_VIRTUAL &&
+ !(ts->attributes & QEMU_TIMER_ATTR_EXTERNAL) &&
+ expire_time <= qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL)) {
+ async_run_on_cpu(first_cpu, timer_fire,
+ RUN_ON_CPU_HOST_PTR(ts));
+ return false;
+ }
+
/* add the timer in the sorted list */
pt = &timer_list->active_timers;
for (;;) {
--
2.51.1
^ permalink raw reply related [flat|nested] 2+ messages in thread
* Re: [PATCH RFC] target/s390x: Fix infinite loop during replay
2025-12-01 21:49 [PATCH RFC] target/s390x: Fix infinite loop during replay Ilya Leoshkevich
@ 2025-12-04 9:37 ` Thomas Huth
0 siblings, 0 replies; 2+ messages in thread
From: Thomas Huth @ 2025-12-04 9:37 UTC (permalink / raw)
To: Ilya Leoshkevich, Alex Bennée
Cc: qemu-s390x, qemu-devel, Pavel Dovgalyuk
On 01/12/2025 22.49, Ilya Leoshkevich wrote:
> Hi,
>
> Here is my attempt to fix [1] based on the discussion in [2].
>
> I'm sending this as an RFC, because I have definitely misunderstood a
> thing or two about record-replay, missed some timer bookkeeping
> intricacies, and haven't split arch-dependent and independent parts
> into different patches.
>
> This survives "make check" and "make check-tcg" with the test from [2],
> both with and without extra load in background.
>
> Please let me know what you think about the approach.
>
> Best regards,
> Ilya
>
> [1] https://lore.kernel.org/qemu-devel/a0accce9-6042-4a7b-a7c7-218212818891@redhat.com/
> [2] https://lore.kernel.org/qemu-devel/20251128133949.181828-1-thuth@redhat.com/
>
> ---
>
> Replaying even trivial s390x kernels hangs, because:
>
> - cpu_post_load() fires the TOD timer immediately.
>
> - s390_tod_load() schedules work for firing the TOD timer.
>
> - If rr loop sees work and then timer, we get one timer expiration.
>
> - If rr loop sees timer and then work, we get two timer expirations.
>
> - Record and replay may diverge due to this race.
>
> - In this particular case divergence makes replay loop spin: it sees that
> TOD timer has expired, but cannot invoke its callback, because there
> is no recorded CHECKPOINT_CLOCK_VIRTUAL.
>
> - The order in which rr loop sees work and timer depends on whether
> and when rr loop wakes up during load_snapshot().
>
> - rr loop may wake up after the main thread kicks the CPU and drops
> the BQL, which may happen if it calls, e.g., qemu_cond_wait_bql().
>
> Firing TOD timer twice is duplicate work, but it was introduced
> intentionally in commit 7c12f710bad6 ("s390x/tcg: rearm the CKC timer
> during migration") in order to avoid dependency on migration order.
>
> The key culprits here are timers that are armed ready expired. They
> break the ordering between timers and CPU work, because they are not
> constrained by instruction execution, thus introducing non-determinism
> and record-replay divergence.
>
> Fix by converting such timer callbacks to CPU work. Also add TOD clock
> updates to the save path, mirroring the load path, in order to have the
> same CHECKPOINT_CLOCK_VIRTUAL during recording and replaying.
>
> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
> ---
> hw/s390x/tod.c | 5 +++++
> stubs/async-run-on-cpu.c | 7 +++++++
> stubs/cpus-queue.c | 4 ++++
> stubs/meson.build | 2 ++
> target/s390x/machine.c | 4 ++++
> util/qemu-timer.c | 30 ++++++++++++++++++++++++++++++
> 6 files changed, 52 insertions(+)
> create mode 100644 stubs/async-run-on-cpu.c
> create mode 100644 stubs/cpus-queue.c
Thanks, this indeed fixes the test for me, so:
Tested-by: Thomas Huth <thuth@redhat.com>
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2025-12-04 9:39 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-12-01 21:49 [PATCH RFC] target/s390x: Fix infinite loop during replay Ilya Leoshkevich
2025-12-04 9:37 ` Thomas Huth
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).