From: Thomas Huth <thuth@redhat.com>
To: Paolo Bonzini <pbonzini@redhat.com>,
qemu-devel@nongnu.org,
Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Cc: Dmitry Osipenko <dmitry.osipenko@collabora.com>,
Richard Henderson <richard.henderson@linaro.org>
Subject: Regression with the "replay" test on target alpha (was: [PULL 03/18] rcu: Unify force quiescent state)
Date: Mon, 3 Nov 2025 14:59:41 +0100
Message-ID: <cb41dc20-5a87-42b6-8819-08f5a1ee4303@redhat.com>
In-Reply-To: <20251028173430.2180057-4-pbonzini@redhat.com>
On 28/10/2025 18.34, Paolo Bonzini wrote:
> From: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
>
> Borrow the concept of force quiescent state from Linux to ensure readers
> remain fast during normal operation and to avoid stalls.
Hi Akihiko,
it looks like this commit has introduced a regression in the "replay"
functional test on the alpha target.
When I run something like:
pyvenv/bin/meson test --no-rebuild -t 1 --setup thorough \
--num-processes 1 --repeat 10 func-alpha-replay
in the build folder, approx. half of the test runs are failing for me now.
I bisected the issue to this patch: when I rebuild qemu-system-alpha
with the commit right before this change, the above test runs work
fine, so I'm pretty sure the problem was introduced by this commit.
Could you please have a look?
Thanks,
Thomas
> Background
> ==========
>
> The previous implementation had four steps to begin reclamation.
>
> 1. call_rcu_thread() would wait for the first callback.
>
> 2. call_rcu_thread() would periodically poll until a decent number of
> callbacks piled up or it timed out.
>
> 3. synchronize_rcu() would start a grace period (GP).
>
> 4. wait_for_readers() would wait for the GP to end. It would also
> trigger the force_rcu notifier to break busy loops in a read-side
> critical section if drain_call_rcu() had been called.
>
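For reference, the polling heuristic that steps 1-2 implemented in call_rcu_thread() before this patch can be sketched as a small predicate (the function name `old_keep_polling` is hypothetical, chosen for illustration):

```c
#include <assert.h>
#include <stdbool.h>

#define RCU_CALL_MIN_SIZE 30

/* Sketch of the pre-patch heuristic: keep polling while no callback
 * has arrived yet, or while fewer than RCU_CALL_MIN_SIZE callbacks
 * piled up and we have slept at most 5 times. */
static bool old_keep_polling(int n, int tries)
{
    return n == 0 || (n < RCU_CALL_MIN_SIZE && tries <= 5);
}
```

Only once this predicate turned false would synchronize_rcu() be called, which is exactly the GP delay the commit message describes.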
> Problem
> =======
>
> The separation of waiting logic across these steps led to suboptimal
> behavior:
>
> The GP was delayed until call_rcu_thread() stopped polling.
>
> force_rcu was not consistently triggered when call_rcu_thread() detected
> a high number of pending callbacks or a timeout. This inconsistency
> sometimes led to stalls, as reported in a virtio-gpu issue where memory
> unmapping was blocked[1].
>
> wait_for_readers() imposed unnecessary overhead in non-urgent cases by
> unconditionally executing qatomic_set(&index->waiting, true) and
> qemu_event_reset(&rcu_gp_event), which are necessary only for expedited
> synchronization.
>
> Solution
> ========
>
> Move the polling in call_rcu_thread() to wait_for_readers() to prevent
> the delay of the GP. Additionally, reorganize wait_for_readers() to
> distinguish between two states:
>
> Normal State: It relies exclusively on periodic polling to detect
> the end of the GP and maintains the read-side fast path.
>
> Force Quiescent State: Whenever expediting synchronization, it always
> triggers force_rcu and executes both qatomic_set(&index->waiting, true)
> and qemu_event_reset(&rcu_gp_event). This avoids stalls while confining
> the read-side overhead to this state.
>
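The decision between the two states, as implemented by the new `forced` check in wait_for_readers() below, boils down to a three-way condition that can be sketched like this (the function name `should_force` is hypothetical, for illustration):

```c
#include <assert.h>
#include <stdbool.h>

#define RCU_CALL_MIN_SIZE 30

/* Sketch of the new wait_for_readers() decision: enter the force
 * quiescent state once a decent number of callbacks piled up, the
 * periodic polling timed out (5 sleeps of 10 ms each), or a
 * drain_call_rcu() call is in progress. */
static bool should_force(int call_count, int sleeps, bool in_drain)
{
    return call_count >= RCU_CALL_MIN_SIZE || sleeps >= 5 || in_drain;
}
```

Until this predicate turns true the patch only sleeps and re-checks the registry; afterwards it notifies force_rcu and switches to event-based waiting.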
> This unified approach, inspired by the Linux RCU, ensures consistent and
> efficient RCU grace period handling and confirms resolution of the
> virtio-gpu issue.
>
> [1] https://lore.kernel.org/qemu-devel/20251014111234.3190346-9-alex.bennee@linaro.org/
>
> Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
> Link: https://lore.kernel.org/r/20251016-force-v1-1-919a82112498@rsg.ci.i.u-tokyo.ac.jp
> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
> util/rcu.c | 81 +++++++++++++++++++++++++++++++++++-------------------
> 1 file changed, 52 insertions(+), 29 deletions(-)
>
> diff --git a/util/rcu.c b/util/rcu.c
> index b703c86f15a..acac9446ea9 100644
> --- a/util/rcu.c
> +++ b/util/rcu.c
> @@ -43,10 +43,14 @@
> #define RCU_GP_LOCKED (1UL << 0)
> #define RCU_GP_CTR (1UL << 1)
>
> +
> +#define RCU_CALL_MIN_SIZE 30
> +
> unsigned long rcu_gp_ctr = RCU_GP_LOCKED;
>
> QemuEvent rcu_gp_event;
> static int in_drain_call_rcu;
> +static int rcu_call_count;
> static QemuMutex rcu_registry_lock;
> static QemuMutex rcu_sync_lock;
>
> @@ -76,15 +80,29 @@ static void wait_for_readers(void)
> {
> ThreadList qsreaders = QLIST_HEAD_INITIALIZER(qsreaders);
> struct rcu_reader_data *index, *tmp;
> + int sleeps = 0;
> + bool forced = false;
>
> for (;;) {
> - /* We want to be notified of changes made to rcu_gp_ongoing
> - * while we walk the list.
> + /*
> + * Force the grace period to end and wait for it if any of the
> + * following heuristical conditions are satisfied:
> + * - A decent number of callbacks piled up.
> + * - It timed out.
> + * - It is in a drain_call_rcu() call.
> + *
> + * Otherwise, periodically poll the grace period, hoping it ends
> + * promptly.
> */
> - qemu_event_reset(&rcu_gp_event);
> + if (!forced &&
> + (qatomic_read(&rcu_call_count) >= RCU_CALL_MIN_SIZE ||
> + sleeps >= 5 || qatomic_read(&in_drain_call_rcu))) {
> + forced = true;
>
> - QLIST_FOREACH(index, &registry, node) {
> - qatomic_set(&index->waiting, true);
> + QLIST_FOREACH(index, &registry, node) {
> + notifier_list_notify(&index->force_rcu, NULL);
> + qatomic_set(&index->waiting, true);
> + }
> }
>
> /* Here, order the stores to index->waiting before the loads of
> @@ -106,8 +124,6 @@ static void wait_for_readers(void)
> * get some extra futex wakeups.
> */
> qatomic_set(&index->waiting, false);
> - } else if (qatomic_read(&in_drain_call_rcu)) {
> - notifier_list_notify(&index->force_rcu, NULL);
> }
> }
>
> @@ -115,7 +131,8 @@ static void wait_for_readers(void)
> break;
> }
>
> - /* Wait for one thread to report a quiescent state and try again.
> + /*
> + * Sleep for a while and try again.
> * Release rcu_registry_lock, so rcu_(un)register_thread() doesn't
> * wait too much time.
> *
> @@ -133,7 +150,20 @@ static void wait_for_readers(void)
> * rcu_registry_lock is released.
> */
> qemu_mutex_unlock(&rcu_registry_lock);
> - qemu_event_wait(&rcu_gp_event);
> +
> + if (forced) {
> + qemu_event_wait(&rcu_gp_event);
> +
> + /*
> + * We want to be notified of changes made to rcu_gp_ongoing
> + * while we walk the list.
> + */
> + qemu_event_reset(&rcu_gp_event);
> + } else {
> + g_usleep(10000);
> + sleeps++;
> + }
> +
> qemu_mutex_lock(&rcu_registry_lock);
> }
>
> @@ -173,15 +203,11 @@ void synchronize_rcu(void)
> }
> }
>
> -
> -#define RCU_CALL_MIN_SIZE 30
> -
> /* Multi-producer, single-consumer queue based on urcu/static/wfqueue.h
> * from liburcu. Note that head is only used by the consumer.
> */
> static struct rcu_head dummy;
> static struct rcu_head *head = &dummy, **tail = &dummy.next;
> -static int rcu_call_count;
> static QemuEvent rcu_call_ready_event;
>
> static void enqueue(struct rcu_head *node)
> @@ -259,30 +285,27 @@ static void *call_rcu_thread(void *opaque)
> rcu_register_thread();
>
> for (;;) {
> - int tries = 0;
> - int n = qatomic_read(&rcu_call_count);
> + int n;
>
> - /* Heuristically wait for a decent number of callbacks to pile up.
> + /*
> * Fetch rcu_call_count now, we only must process elements that were
> * added before synchronize_rcu() starts.
> */
> - while (n == 0 || (n < RCU_CALL_MIN_SIZE && ++tries <= 5)) {
> - g_usleep(10000);
> - if (n == 0) {
> - qemu_event_reset(&rcu_call_ready_event);
> - n = qatomic_read(&rcu_call_count);
> - if (n == 0) {
> -#if defined(CONFIG_MALLOC_TRIM)
> - malloc_trim(4 * 1024 * 1024);
> -#endif
> - qemu_event_wait(&rcu_call_ready_event);
> - }
> - }
> + for (;;) {
> + qemu_event_reset(&rcu_call_ready_event);
> n = qatomic_read(&rcu_call_count);
> + if (n) {
> + break;
> + }
> +
> +#if defined(CONFIG_MALLOC_TRIM)
> + malloc_trim(4 * 1024 * 1024);
> +#endif
> + qemu_event_wait(&rcu_call_ready_event);
> }
>
> - qatomic_sub(&rcu_call_count, n);
> synchronize_rcu();
> + qatomic_sub(&rcu_call_count, n);
> bql_lock();
> while (n > 0) {
> node = try_dequeue();