* [PATCH 1/5] futex: Add qemu_futex_timedwait()
2025-10-29 6:12 [PATCH 0/5] virtio-gpu: Force RCU when unmapping blob Akihiko Odaki
@ 2025-10-29 6:12 ` Akihiko Odaki
2025-10-30 16:13 ` Alex Bennée
2025-10-29 6:12 ` [PATCH 2/5] qemu-thread: Add qemu_event_timedwait() Akihiko Odaki
` (6 subsequent siblings)
7 siblings, 1 reply; 29+ messages in thread
From: Akihiko Odaki @ 2025-10-29 6:12 UTC (permalink / raw)
To: qemu-devel, Dmitry Osipenko
Cc: Paolo Bonzini, Michael S. Tsirkin, Alex Bennée,
Akihiko Odaki
qemu_futex_timedwait() is equivalent to qemu_futex_wait(), except it has
an absolute timeout.
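For illustration only (not part of this patch): a caller computes an absolute
deadline from the current time. "word" below is a hypothetical futex word, and
the deadline is assumed to be on the same clock as get_clock():

    /* Hypothetical usage sketch; "word" is a made-up futex word. */
    static unsigned word;

    static bool wait_for_word_change(unsigned expected, int64_t timeout_ns)
    {
        int64_t deadline = get_clock() + timeout_ns;   /* absolute deadline */

        while (qatomic_read(&word) == expected) {
            if (!qemu_futex_timedwait(&word, expected, deadline)) {
                return false;                          /* deadline expired */
            }
        }
        return true;
    }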
Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
---
include/qemu/futex.h | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
diff --git a/include/qemu/futex.h b/include/qemu/futex.h
index 607613eec835..e5e56603d117 100644
--- a/include/qemu/futex.h
+++ b/include/qemu/futex.h
@@ -24,6 +24,7 @@
#ifndef QEMU_FUTEX_H
#define QEMU_FUTEX_H
+#include "qemu/timer.h"
#define HAVE_FUTEX
#ifdef CONFIG_LINUX
@@ -42,18 +43,28 @@ static inline void qemu_futex_wake_single(void *f)
qemu_futex(f, FUTEX_WAKE, 1, NULL, NULL, 0);
}
-static inline void qemu_futex_wait(void *f, unsigned val)
+static inline bool qemu_futex_timedwait(void *f, unsigned val, int64_t ns)
{
- while (qemu_futex(f, FUTEX_WAIT, (int) val, NULL, NULL, 0)) {
+ struct timespec ts;
+ uint32_t bitset = FUTEX_BITSET_MATCH_ANY;
+
+ ts.tv_sec = ns / NANOSECONDS_PER_SECOND;
+ ts.tv_nsec = ns % NANOSECONDS_PER_SECOND;
+
+ while (qemu_futex(f, FUTEX_WAIT_BITSET, (int) val, &ts, NULL, bitset)) {
switch (errno) {
case EWOULDBLOCK:
- return;
+ return true;
case EINTR:
break; /* get out of switch and retry */
+ case ETIMEDOUT:
+ return false;
default:
abort();
}
}
+
+ return true;
}
#elif defined(CONFIG_WIN32)
#include <synchapi.h>
@@ -68,12 +79,20 @@ static inline void qemu_futex_wake_single(void *f)
WakeByAddressSingle(f);
}
-static inline void qemu_futex_wait(void *f, unsigned val)
+static inline bool qemu_futex_timedwait(void *f, unsigned val, int64_t ns)
{
- WaitOnAddress(f, &val, sizeof(val), INFINITE);
+ uint32_t duration = MIN((ns - get_clock()) / SCALE_MS, UINT32_MAX);
+ return WaitOnAddress(f, &val, sizeof(val), duration);
}
#else
#undef HAVE_FUTEX
#endif
+#ifdef HAVE_FUTEX
+static inline void qemu_futex_wait(void *f, unsigned val)
+{
+ qemu_futex_timedwait(f, val, INT64_MAX);
+}
+#endif
+
#endif /* QEMU_FUTEX_H */
--
2.51.0
^ permalink raw reply related	[flat|nested] 29+ messages in thread
* Re: [PATCH 1/5] futex: Add qemu_futex_timedwait()
2025-10-29 6:12 ` [PATCH 1/5] futex: Add qemu_futex_timedwait() Akihiko Odaki
@ 2025-10-30 16:13 ` Alex Bennée
0 siblings, 0 replies; 29+ messages in thread
From: Alex Bennée @ 2025-10-30 16:13 UTC (permalink / raw)
To: Akihiko Odaki
Cc: qemu-devel, Dmitry Osipenko, Paolo Bonzini, Michael S. Tsirkin
Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> writes:
> qemu_futex_timedwait() is equivalent to qemu_futex_wait(), except it has
> an absolute timeout.
>
> Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Hmm, this broke my compile; I guess it's a name clash:
FAILED: tests/qtest/stm32l4x5_usart-test.p/stm32l4x5_usart-test.c.o
cc -m64 -Itests/qtest/stm32l4x5_usart-test.p -Itests/qtest -I../../tests/qtest -I. -Iqapi -Itrace -Iui -Iui/shader -I/usr/include/glib-2.0 -I/usr/lib/x86_64-linux-gnu/glib-2.0/include -I/usr/include/sysprof-6 -fdiagnostics-color=auto -Wall -Winvalid-pch -Werror -std=gnu11 -O2 -g -fstack-protector-strong -Wempty-body -Wendif-labels -Wexpansion-to-defined -Wformat-security -Wformat-y2k -Wignored-qualifiers -Wimplicit-fallthrough=2 -Winit-self -Wmissing-format-attribute -Wmissing-prototypes -Wnested-externs -Wold-style-declaration -Wold-style-definition -Wredundant-decls -Wshadow=local -Wstrict-prototypes -Wtype-limits -Wundef -Wvla -Wwrite-strings -Wno-missing-include-dirs -Wno-psabi -Wno-shift-negative-value -isystem /home/alex/lsrc/qemu.git/linux-headers -isystem linux-headers -iquote . -iquote /home/alex/lsrc/qemu.git -iquote /home/alex/lsrc/qemu.git/include -iquote /home/alex/lsrc/qemu.git/host/include/x86_64 -iquote /home/alex/lsrc/qemu.git/host/include/generic -iquote /home/alex/lsrc/qemu.git/tcg/i386 -pthread -mcx16 -msse2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -fno-strict-aliasing -fno-common -fwrapv -ftrivial-auto-var-init=zero -fzero-call-used-regs=used-gpr -fPIE -MD -MQ tests/qtest/stm32l4x5_usart-test.p/stm32l4x5_usart-test.c.o -MF tests/qtest/stm32l4x5_usart-test.p/stm32l4x5_usart-test.c.o.d -o tests/qtest/stm32l4x5_usart-test.p/stm32l4x5_usart-test.c.o -c ../../tests/qtest/stm32l4x5_usart-test.c
../../tests/qtest/stm32l4x5_usart-test.c:114:13: error: conflicting types for ‘init_clocks’; have ‘void(QTestState *)’
114 | static void init_clocks(QTestState *qts)
| ^~~~~~~~~~~
In file included from /home/alex/lsrc/qemu.git/include/qemu/futex.h:27,
from /home/alex/lsrc/qemu.git/include/qemu/thread.h:6,
from /home/alex/lsrc/qemu.git/include/qemu/rcu.h:27,
from /home/alex/lsrc/qemu.git/include/hw/qdev-core.h:7,
from /home/alex/lsrc/qemu.git/include/hw/sysbus.h:6,
from /home/alex/lsrc/qemu.git/include/hw/misc/stm32l4x5_rcc.h:21,
from /home/alex/lsrc/qemu.git/include/hw/misc/stm32l4x5_rcc_internals.h:22,
from ../../tests/qtest/stm32l4x5_usart-test.c:13:
/home/alex/lsrc/qemu.git/include/qemu/timer.h:793:6: note: previous declaration of ‘init_clocks’ with type ‘void(void (*)(void *, QEMUClockType))’
793 | void init_clocks(QEMUTimerListNotifyCB *notify_cb);
| ^~~~~~~~~~~
ninja: build stopped: cannot make progress due to previous errors.
make: *** [Makefile:168: run-ninja] Error 1
make: Target 'all' not remade because of errors.
make: Leaving directory '/home/alex/lsrc/qemu.git/builds/all'
> ---
> include/qemu/futex.h | 29 ++++++++++++++++++++++++-----
> 1 file changed, 24 insertions(+), 5 deletions(-)
>
> diff --git a/include/qemu/futex.h b/include/qemu/futex.h
> index 607613eec835..e5e56603d117 100644
> --- a/include/qemu/futex.h
> +++ b/include/qemu/futex.h
> @@ -24,6 +24,7 @@
> #ifndef QEMU_FUTEX_H
> #define QEMU_FUTEX_H
>
> +#include "qemu/timer.h"
> #define HAVE_FUTEX
>
> #ifdef CONFIG_LINUX
> @@ -42,18 +43,28 @@ static inline void qemu_futex_wake_single(void *f)
> qemu_futex(f, FUTEX_WAKE, 1, NULL, NULL, 0);
> }
>
> -static inline void qemu_futex_wait(void *f, unsigned val)
> +static inline bool qemu_futex_timedwait(void *f, unsigned val, int64_t ns)
> {
> - while (qemu_futex(f, FUTEX_WAIT, (int) val, NULL, NULL, 0)) {
> + struct timespec ts;
> + uint32_t bitset = FUTEX_BITSET_MATCH_ANY;
> +
> + ts.tv_sec = ns / NANOSECONDS_PER_SECOND;
> + ts.tv_nsec = ns % NANOSECONDS_PER_SECOND;
> +
> + while (qemu_futex(f, FUTEX_WAIT_BITSET, (int) val, &ts, NULL, bitset)) {
> switch (errno) {
> case EWOULDBLOCK:
> - return;
> + return true;
> case EINTR:
> break; /* get out of switch and retry */
> + case ETIMEDOUT:
> + return false;
> default:
> abort();
> }
> }
> +
> + return true;
> }
> #elif defined(CONFIG_WIN32)
> #include <synchapi.h>
> @@ -68,12 +79,20 @@ static inline void qemu_futex_wake_single(void *f)
> WakeByAddressSingle(f);
> }
>
> -static inline void qemu_futex_wait(void *f, unsigned val)
> +static inline bool qemu_futex_timedwait(void *f, unsigned val, int64_t ns)
> {
> - WaitOnAddress(f, &val, sizeof(val), INFINITE);
> + uint32_t duration = MIN((ns - get_clock()) / SCALE_MS, UINT32_MAX);
> + return WaitOnAddress(f, &val, sizeof(val), duration);
> }
> #else
> #undef HAVE_FUTEX
> #endif
>
> +#ifdef HAVE_FUTEX
> +static inline void qemu_futex_wait(void *f, unsigned val)
> +{
> + qemu_futex_timedwait(f, val, INT64_MAX);
> +}
> +#endif
> +
> #endif /* QEMU_FUTEX_H */
--
Alex Bennée
Virtualisation Tech Lead @ Linaro
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 2/5] qemu-thread: Add qemu_event_timedwait()
2025-10-29 6:12 [PATCH 0/5] virtio-gpu: Force RCU when unmapping blob Akihiko Odaki
2025-10-29 6:12 ` [PATCH 1/5] futex: Add qemu_futex_timedwait() Akihiko Odaki
@ 2025-10-29 6:12 ` Akihiko Odaki
2025-10-29 6:12 ` [PATCH 3/5] rcu: Use call_rcu() in synchronize_rcu() Akihiko Odaki
` (5 subsequent siblings)
7 siblings, 0 replies; 29+ messages in thread
From: Akihiko Odaki @ 2025-10-29 6:12 UTC (permalink / raw)
To: qemu-devel, Dmitry Osipenko
Cc: Paolo Bonzini, Michael S. Tsirkin, Alex Bennée,
Akihiko Odaki
qemu_event_timedwait() is equivalent to qemu_event_wait(), except it has
a relative timeout.
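For illustration only (not part of this patch): "ready" below is a hypothetical
QemuEvent, and the timeout is relative, in milliseconds:

    /* Hypothetical usage sketch; assume qemu_event_init(&ready, false) ran. */
    static QemuEvent ready;

    static bool wait_for_ready(void)
    {
        /* Returns false if 10 ms pass without qemu_event_set(&ready). */
        return qemu_event_timedwait(&ready, 10);
    }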
Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
---
include/qemu/thread-posix.h | 11 +++++++++++
include/qemu/thread.h | 8 +++++++-
util/event.c | 34 +++++++++++++++++++++++++++++-----
util/qemu-thread-posix.c | 11 +----------
4 files changed, 48 insertions(+), 16 deletions(-)
diff --git a/include/qemu/thread-posix.h b/include/qemu/thread-posix.h
index 758808b705e4..11193b1580f8 100644
--- a/include/qemu/thread-posix.h
+++ b/include/qemu/thread-posix.h
@@ -36,4 +36,15 @@ struct QemuThread {
pthread_t thread;
};
+static inline clockid_t qemu_timedwait_clockid(void)
+{
+#ifdef CONFIG_PTHREAD_CONDATTR_SETCLOCK
+ return CLOCK_MONOTONIC;
+#else
+ return CLOCK_REALTIME;
+#endif
+}
+
+void compute_abs_deadline(struct timespec *ts, int ms);
+
#endif
diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index f0302ed01fdb..3030458bb666 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -201,7 +201,13 @@ void qemu_sem_destroy(QemuSemaphore *sem);
void qemu_event_init(QemuEvent *ev, bool init);
void qemu_event_set(QemuEvent *ev);
void qemu_event_reset(QemuEvent *ev);
-void qemu_event_wait(QemuEvent *ev);
+bool qemu_event_timedwait(QemuEvent *ev, int ms);
+
+static inline void qemu_event_wait(QemuEvent *ev)
+{
+ qemu_event_timedwait(ev, INT_MAX);
+}
+
void qemu_event_destroy(QemuEvent *ev);
void qemu_thread_create(QemuThread *thread, const char *name,
diff --git a/util/event.c b/util/event.c
index 5a8141cd0e46..e22cc08a629b 100644
--- a/util/event.c
+++ b/util/event.c
@@ -33,7 +33,15 @@ void qemu_event_init(QemuEvent *ev, bool init)
{
#ifndef HAVE_FUTEX
pthread_mutex_init(&ev->lock, NULL);
+#ifdef CONFIG_PTHREAD_CONDATTR_SETCLOCK
+ pthread_condattr_t attr;
+ pthread_condattr_init(&attr);
+ pthread_condattr_setclock(&attr, qemu_timedwait_clockid());
+ pthread_cond_init(&ev->cond, &attr);
+ pthread_condattr_destroy(&attr);
+#else
pthread_cond_init(&ev->cond, NULL);
+#endif
#endif
ev->value = (init ? EV_SET : EV_FREE);
@@ -121,15 +129,17 @@ void qemu_event_reset(QemuEvent *ev)
#endif
}
-void qemu_event_wait(QemuEvent *ev)
+bool qemu_event_timedwait(QemuEvent *ev, int ms)
{
assert(ev->initialized);
#ifdef HAVE_FUTEX
+ int64_t deadline = get_clock() + (int64_t)ms * SCALE_MS;
+
while (true) {
/*
- * qemu_event_wait must synchronize with qemu_event_set even if it does
- * not go down the slow path, so this load-acquire is needed that
+ * qemu_event_timedwait must synchronize with qemu_event_set even if it
+ * does not go down the slow path, so this load-acquire is needed that
* synchronizes with the first memory barrier in qemu_event_set().
*/
unsigned value = qatomic_load_acquire(&ev->value);
@@ -159,13 +169,27 @@ void qemu_event_wait(QemuEvent *ev)
* a smp_mb() pairing with the second barrier of qemu_event_set().
* The barrier is inside the FUTEX_WAIT system call.
*/
- qemu_futex_wait(ev, EV_BUSY);
+ if (!qemu_futex_timedwait(ev, EV_BUSY, deadline)) {
+ return false;
+ }
}
+
+ return true;
#else
+ bool failed;
+ struct timespec ts;
+
+ compute_abs_deadline(&ts, ms);
+
pthread_mutex_lock(&ev->lock);
while (qatomic_read(&ev->value) != EV_SET) {
- pthread_cond_wait(&ev->cond, &ev->lock);
+ failed = pthread_cond_timedwait(&ev->cond, &ev->lock, &ts);
+ if (failed) {
+ break;
+ }
}
pthread_mutex_unlock(&ev->lock);
+
+ return !failed;
#endif
}
diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
index ba725444ba63..f649bfa00015 100644
--- a/util/qemu-thread-posix.c
+++ b/util/qemu-thread-posix.c
@@ -44,16 +44,7 @@ static void error_exit(int err, const char *msg)
abort();
}
-static inline clockid_t qemu_timedwait_clockid(void)
-{
-#ifdef CONFIG_PTHREAD_CONDATTR_SETCLOCK
- return CLOCK_MONOTONIC;
-#else
- return CLOCK_REALTIME;
-#endif
-}
-
-static void compute_abs_deadline(struct timespec *ts, int ms)
+void compute_abs_deadline(struct timespec *ts, int ms)
{
clock_gettime(qemu_timedwait_clockid(), ts);
ts->tv_nsec += (ms % 1000) * 1000000;
--
2.51.0
^ permalink raw reply related	[flat|nested] 29+ messages in thread
* [PATCH 3/5] rcu: Use call_rcu() in synchronize_rcu()
2025-10-29 6:12 [PATCH 0/5] virtio-gpu: Force RCU when unmapping blob Akihiko Odaki
2025-10-29 6:12 ` [PATCH 1/5] futex: Add qemu_futex_timedwait() Akihiko Odaki
2025-10-29 6:12 ` [PATCH 2/5] qemu-thread: Add qemu_event_timedwait() Akihiko Odaki
@ 2025-10-29 6:12 ` Akihiko Odaki
2025-10-29 6:12 ` [PATCH 4/5] rcu: Wake the RCU thread when draining Akihiko Odaki
` (4 subsequent siblings)
7 siblings, 0 replies; 29+ messages in thread
From: Akihiko Odaki @ 2025-10-29 6:12 UTC (permalink / raw)
To: qemu-devel, Dmitry Osipenko
Cc: Paolo Bonzini, Michael S. Tsirkin, Alex Bennée,
Akihiko Odaki
Previously, synchronize_rcu() was a single-threaded implementation protected
with a mutex. It was used only in the RCU thread and in tests; real users
instead used call_rcu(), which relies on the RCU thread.
The usage of synchronize_rcu() in tests did not accurately represent real
use cases: it took the mutex, which never happened in real use cases, and it
did not exercise the logic in the RCU thread.
Add a new implementation of synchronize_rcu() which uses call_rcu() so that
tests represent real use cases. The old synchronize_rcu() is renamed to
enter_qs() and is now used only in the RCU thread, making the mutex
unnecessary.
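For illustration only (not part of this patch), a hypothetical updater; the
caller-visible semantics of synchronize_rcu() are unchanged, only the waiting
now goes through call_rcu() and the RCU thread:

    /* Hypothetical example; Foo and global_foo are made-up names. */
    typedef struct Foo Foo;
    static Foo *global_foo;

    static void replace_foo(Foo *new_foo)
    {
        Foo *old = qatomic_xchg(&global_foo, new_foo);

        synchronize_rcu();   /* wait until no reader can still hold 'old' */
        g_free(old);
    }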
Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
---
util/rcu.c | 51 +++++++++++++++++++++++++++------------------------
1 file changed, 27 insertions(+), 24 deletions(-)
diff --git a/util/rcu.c b/util/rcu.c
index acac9446ea98..3c4af9d213c8 100644
--- a/util/rcu.c
+++ b/util/rcu.c
@@ -38,7 +38,7 @@
/*
* Global grace period counter. Bit 0 is always one in rcu_gp_ctr.
- * Bits 1 and above are defined in synchronize_rcu.
+ * Bits 1 and above are defined in enter_qs().
*/
#define RCU_GP_LOCKED (1UL << 0)
#define RCU_GP_CTR (1UL << 1)
@@ -52,7 +52,6 @@ QemuEvent rcu_gp_event;
static int in_drain_call_rcu;
static int rcu_call_count;
static QemuMutex rcu_registry_lock;
-static QemuMutex rcu_sync_lock;
/*
* Check whether a quiescent state was crossed between the beginning of
@@ -111,7 +110,7 @@ static void wait_for_readers(void)
*
* If this is the last iteration, this barrier also prevents
* frees from seeping upwards, and orders the two wait phases
- * on architectures with 32-bit longs; see synchronize_rcu().
+ * on architectures with 32-bit longs; see enter_qs().
*/
smp_mb_global();
@@ -137,9 +136,9 @@ static void wait_for_readers(void)
* wait too much time.
*
* rcu_register_thread() may add nodes to &registry; it will not
- * wake up synchronize_rcu, but that is okay because at least another
+ * wake up enter_qs(), but that is okay because at least another
* thread must exit its RCU read-side critical section before
- * synchronize_rcu is done. The next iteration of the loop will
+ * enter_qs() is done. The next iteration of the loop will
* move the new thread's rcu_reader from &registry to &qsreaders,
* because rcu_gp_ongoing() will return false.
*
@@ -171,10 +170,8 @@ static void wait_for_readers(void)
QLIST_SWAP(&registry, &qsreaders, node);
}
-void synchronize_rcu(void)
+static void enter_qs(void)
{
- QEMU_LOCK_GUARD(&rcu_sync_lock);
-
/* Write RCU-protected pointers before reading p_rcu_reader->ctr.
* Pairs with smp_mb_placeholder() in rcu_read_lock().
*
@@ -289,7 +286,7 @@ static void *call_rcu_thread(void *opaque)
/*
* Fetch rcu_call_count now, we only must process elements that were
- * added before synchronize_rcu() starts.
+ * added before enter_qs() starts.
*/
for (;;) {
qemu_event_reset(&rcu_call_ready_event);
@@ -304,7 +301,7 @@ static void *call_rcu_thread(void *opaque)
qemu_event_wait(&rcu_call_ready_event);
}
- synchronize_rcu();
+ enter_qs();
qatomic_sub(&rcu_call_count, n);
bql_lock();
while (n > 0) {
@@ -337,15 +334,24 @@ void call_rcu1(struct rcu_head *node, void (*func)(struct rcu_head *node))
}
-struct rcu_drain {
+typedef struct Sync {
struct rcu_head rcu;
- QemuEvent drain_complete_event;
-};
+ QemuEvent complete_event;
+} Sync;
-static void drain_rcu_callback(struct rcu_head *node)
+static void sync_rcu_callback(Sync *sync)
{
- struct rcu_drain *event = (struct rcu_drain *)node;
- qemu_event_set(&event->drain_complete_event);
+ qemu_event_set(&sync->complete_event);
+}
+
+void synchronize_rcu(void)
+{
+ Sync sync;
+
+ qemu_event_init(&sync.complete_event, false);
+ call_rcu(&sync, sync_rcu_callback, rcu);
+ qemu_event_wait(&sync.complete_event);
+ qemu_event_destroy(&sync.complete_event);
}
/*
@@ -359,11 +365,11 @@ static void drain_rcu_callback(struct rcu_head *node)
void drain_call_rcu(void)
{
- struct rcu_drain rcu_drain;
+ Sync sync;
bool locked = bql_locked();
- memset(&rcu_drain, 0, sizeof(struct rcu_drain));
- qemu_event_init(&rcu_drain.drain_complete_event, false);
+ memset(&sync, 0, sizeof(sync));
+ qemu_event_init(&sync.complete_event, false);
if (locked) {
bql_unlock();
@@ -383,8 +389,8 @@ void drain_call_rcu(void)
*/
qatomic_inc(&in_drain_call_rcu);
- call_rcu1(&rcu_drain.rcu, drain_rcu_callback);
- qemu_event_wait(&rcu_drain.drain_complete_event);
+ call_rcu(&sync, sync_rcu_callback, rcu);
+ qemu_event_wait(&sync.complete_event);
qatomic_dec(&in_drain_call_rcu);
if (locked) {
@@ -427,7 +433,6 @@ static void rcu_init_complete(void)
QemuThread thread;
qemu_mutex_init(&rcu_registry_lock);
- qemu_mutex_init(&rcu_sync_lock);
qemu_event_init(&rcu_gp_event, true);
qemu_event_init(&rcu_call_ready_event, false);
@@ -460,7 +465,6 @@ static void rcu_init_lock(void)
return;
}
- qemu_mutex_lock(&rcu_sync_lock);
qemu_mutex_lock(&rcu_registry_lock);
}
@@ -471,7 +475,6 @@ static void rcu_init_unlock(void)
}
qemu_mutex_unlock(&rcu_registry_lock);
- qemu_mutex_unlock(&rcu_sync_lock);
}
static void rcu_init_child(void)
--
2.51.0
^ permalink raw reply related	[flat|nested] 29+ messages in thread
* [PATCH 4/5] rcu: Wake the RCU thread when draining
2025-10-29 6:12 [PATCH 0/5] virtio-gpu: Force RCU when unmapping blob Akihiko Odaki
` (2 preceding siblings ...)
2025-10-29 6:12 ` [PATCH 3/5] rcu: Use call_rcu() in synchronize_rcu() Akihiko Odaki
@ 2025-10-29 6:12 ` Akihiko Odaki
2025-10-29 18:22 ` Peter Xu
2025-10-29 6:12 ` [PATCH 5/5] virtio-gpu: Force RCU when unmapping blob Akihiko Odaki
` (3 subsequent siblings)
7 siblings, 1 reply; 29+ messages in thread
From: Akihiko Odaki @ 2025-10-29 6:12 UTC (permalink / raw)
To: qemu-devel, Dmitry Osipenko
Cc: Paolo Bonzini, Michael S. Tsirkin, Alex Bennée,
Akihiko Odaki
drain_call_rcu() triggers the force quiescent state, but it can be
delayed if the RCU thread is sleeping. Ensure the force quiescent state
is immediately triggered by waking the RCU thread up.
The logic to trigger the force quiescent state is decoupled as
force_rcu() so that it can be used independently.
Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
---
include/qemu/rcu.h | 1 +
util/rcu.c | 106 ++++++++++++++++++++++++++++++++---------------------
2 files changed, 65 insertions(+), 42 deletions(-)
diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h
index 020dbe4d8b77..d6aa4e5854d3 100644
--- a/include/qemu/rcu.h
+++ b/include/qemu/rcu.h
@@ -118,6 +118,7 @@ static inline void rcu_read_unlock(void)
}
}
+void force_rcu(void);
void synchronize_rcu(void);
/*
diff --git a/util/rcu.c b/util/rcu.c
index 3c4af9d213c8..85f9333f5dff 100644
--- a/util/rcu.c
+++ b/util/rcu.c
@@ -49,10 +49,13 @@
unsigned long rcu_gp_ctr = RCU_GP_LOCKED;
QemuEvent rcu_gp_event;
-static int in_drain_call_rcu;
+static bool forced;
static int rcu_call_count;
static QemuMutex rcu_registry_lock;
+/* Set when the forced variable is set or rcu_call_count becomes non-zero. */
+static QemuEvent sync_event;
+
/*
* Check whether a quiescent state was crossed between the beginning of
* update_counter_and_wait and now.
@@ -74,36 +77,21 @@ QEMU_DEFINE_CO_TLS(struct rcu_reader_data, rcu_reader)
typedef QLIST_HEAD(, rcu_reader_data) ThreadList;
static ThreadList registry = QLIST_HEAD_INITIALIZER(registry);
+void force_rcu(void)
+{
+ qatomic_set(&forced, true);
+ qemu_event_set(&sync_event);
+}
+
/* Wait for previous parity/grace period to be empty of readers. */
-static void wait_for_readers(void)
+static void wait_for_readers(bool sleep)
{
ThreadList qsreaders = QLIST_HEAD_INITIALIZER(qsreaders);
struct rcu_reader_data *index, *tmp;
- int sleeps = 0;
- bool forced = false;
+ int sleeps = sleep ? 5 : 0;
+ bool waiting = false;
for (;;) {
- /*
- * Force the grace period to end and wait for it if any of the
- * following heuristical conditions are satisfied:
- * - A decent number of callbacks piled up.
- * - It timed out.
- * - It is in a drain_call_rcu() call.
- *
- * Otherwise, periodically poll the grace period, hoping it ends
- * promptly.
- */
- if (!forced &&
- (qatomic_read(&rcu_call_count) >= RCU_CALL_MIN_SIZE ||
- sleeps >= 5 || qatomic_read(&in_drain_call_rcu))) {
- forced = true;
-
- QLIST_FOREACH(index, &registry, node) {
- notifier_list_notify(&index->force_rcu, NULL);
- qatomic_set(&index->waiting, true);
- }
- }
-
/* Here, order the stores to index->waiting before the loads of
* index->ctr. Pairs with smp_mb_placeholder() in rcu_read_unlock(),
* ensuring that the loads of index->ctr are sequentially consistent.
@@ -150,7 +138,8 @@ static void wait_for_readers(void)
*/
qemu_mutex_unlock(&rcu_registry_lock);
- if (forced) {
+ if (waiting) {
+ /* Wait for the forced quiescent state. */
qemu_event_wait(&rcu_gp_event);
/*
@@ -158,9 +147,25 @@ static void wait_for_readers(void)
* while we walk the list.
*/
qemu_event_reset(&rcu_gp_event);
+ } else if (qatomic_read(&rcu_call_count) >= RCU_CALL_MIN_SIZE ||
+ !sleeps || qemu_event_timedwait(&sync_event, 10)) {
+ /*
+ * Now one of the following heuristical conditions is satisfied:
+ * - A decent number of callbacks piled up.
+ * - It timed out.
+ * - force_rcu() was called.
+ *
+ * Force a quiescent state.
+ */
+ waiting = true;
+
+ QLIST_FOREACH(index, &registry, node) {
+ notifier_list_notify(&index->force_rcu, NULL);
+ qatomic_set(&index->waiting, true);
+ }
} else {
- g_usleep(10000);
- sleeps++;
+ /* Try again. */
+ sleeps--;
}
qemu_mutex_lock(&rcu_registry_lock);
@@ -170,7 +175,7 @@ static void wait_for_readers(void)
QLIST_SWAP(&registry, &qsreaders, node);
}
-static void enter_qs(void)
+static void enter_qs(bool sleep)
{
/* Write RCU-protected pointers before reading p_rcu_reader->ctr.
* Pairs with smp_mb_placeholder() in rcu_read_lock().
@@ -189,14 +194,14 @@ static void enter_qs(void)
* Switch parity: 0 -> 1, 1 -> 0.
*/
qatomic_set(&rcu_gp_ctr, rcu_gp_ctr ^ RCU_GP_CTR);
- wait_for_readers();
+ wait_for_readers(sleep);
qatomic_set(&rcu_gp_ctr, rcu_gp_ctr ^ RCU_GP_CTR);
} else {
/* Increment current grace period. */
qatomic_set(&rcu_gp_ctr, rcu_gp_ctr + RCU_GP_CTR);
}
- wait_for_readers();
+ wait_for_readers(sleep);
}
}
@@ -205,7 +210,6 @@ static void enter_qs(void)
*/
static struct rcu_head dummy;
static struct rcu_head *head = &dummy, **tail = &dummy.next;
-static QemuEvent rcu_call_ready_event;
static void enqueue(struct rcu_head *node)
{
@@ -282,6 +286,7 @@ static void *call_rcu_thread(void *opaque)
rcu_register_thread();
for (;;) {
+ bool sleep = true;
int n;
/*
@@ -289,7 +294,7 @@ static void *call_rcu_thread(void *opaque)
* added before enter_qs() starts.
*/
for (;;) {
- qemu_event_reset(&rcu_call_ready_event);
+ qemu_event_reset(&sync_event);
n = qatomic_read(&rcu_call_count);
if (n) {
break;
@@ -298,20 +303,36 @@ static void *call_rcu_thread(void *opaque)
#if defined(CONFIG_MALLOC_TRIM)
malloc_trim(4 * 1024 * 1024);
#endif
- qemu_event_wait(&rcu_call_ready_event);
+ qemu_event_wait(&sync_event);
+ }
+
+ /*
+ * Ensure that an event for a rcu_call_count change will not interrupt
+ * wait_for_readers().
+ */
+ qemu_event_reset(&sync_event);
+
+ /*
+ * Ensure that the forced variable has not been set after fetching
+ * rcu_call_count; otherwise we may get confused by a force quiescent
+ * state request for an element later than n.
+ */
+ while (qatomic_xchg(&forced, false)) {
+ sleep = false;
+ n = qatomic_read(&rcu_call_count);
}
- enter_qs();
+ enter_qs(sleep);
qatomic_sub(&rcu_call_count, n);
bql_lock();
while (n > 0) {
node = try_dequeue();
while (!node) {
bql_unlock();
- qemu_event_reset(&rcu_call_ready_event);
+ qemu_event_reset(&sync_event);
node = try_dequeue();
if (!node) {
- qemu_event_wait(&rcu_call_ready_event);
+ qemu_event_wait(&sync_event);
node = try_dequeue();
}
bql_lock();
@@ -329,8 +350,10 @@ void call_rcu1(struct rcu_head *node, void (*func)(struct rcu_head *node))
{
node->func = func;
enqueue(node);
- qatomic_inc(&rcu_call_count);
- qemu_event_set(&rcu_call_ready_event);
+
+ if (!qatomic_fetch_inc(&rcu_call_count)) {
+ qemu_event_set(&sync_event);
+ }
}
@@ -388,10 +411,9 @@ void drain_call_rcu(void)
* assumed.
*/
- qatomic_inc(&in_drain_call_rcu);
call_rcu(&sync, sync_rcu_callback, rcu);
+ force_rcu();
qemu_event_wait(&sync.complete_event);
- qatomic_dec(&in_drain_call_rcu);
if (locked) {
bql_lock();
@@ -435,7 +457,7 @@ static void rcu_init_complete(void)
qemu_mutex_init(&rcu_registry_lock);
qemu_event_init(&rcu_gp_event, true);
- qemu_event_init(&rcu_call_ready_event, false);
+ qemu_event_init(&sync_event, false);
/* The caller is assumed to have BQL, so the call_rcu thread
* must have been quiescent even after forking, just recreate it.
--
2.51.0
^ permalink raw reply related	[flat|nested] 29+ messages in thread
* Re: [PATCH 4/5] rcu: Wake the RCU thread when draining
2025-10-29 6:12 ` [PATCH 4/5] rcu: Wake the RCU thread when draining Akihiko Odaki
@ 2025-10-29 18:22 ` Peter Xu
2025-11-03 9:45 ` Akihiko Odaki
0 siblings, 1 reply; 29+ messages in thread
From: Peter Xu @ 2025-10-29 18:22 UTC (permalink / raw)
To: Akihiko Odaki
Cc: qemu-devel, Dmitry Osipenko, Paolo Bonzini, Michael S. Tsirkin,
Alex Bennée
On Wed, Oct 29, 2025 at 03:12:48PM +0900, Akihiko Odaki wrote:
> drain_call_rcu() triggers the force quiescent state, but it can be
> delayed if the RCU thread is sleeping. Ensure the force quiescent state
> is immediately triggered by waking the RCU thread up.
>
> The logic to trigger the force quiescent state is decoupled as
> force_rcu() so that it can be used independently.
>
> Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
> ---
> include/qemu/rcu.h | 1 +
> util/rcu.c | 106 ++++++++++++++++++++++++++++++++---------------------
> 2 files changed, 65 insertions(+), 42 deletions(-)
>
> diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h
> index 020dbe4d8b77..d6aa4e5854d3 100644
> --- a/include/qemu/rcu.h
> +++ b/include/qemu/rcu.h
> @@ -118,6 +118,7 @@ static inline void rcu_read_unlock(void)
> }
> }
>
> +void force_rcu(void);
> void synchronize_rcu(void);
>
> /*
> diff --git a/util/rcu.c b/util/rcu.c
> index 3c4af9d213c8..85f9333f5dff 100644
> --- a/util/rcu.c
> +++ b/util/rcu.c
> @@ -49,10 +49,13 @@
> unsigned long rcu_gp_ctr = RCU_GP_LOCKED;
>
> QemuEvent rcu_gp_event;
> -static int in_drain_call_rcu;
> +static bool forced;
> static int rcu_call_count;
> static QemuMutex rcu_registry_lock;
>
> +/* Set when the forced variable is set or rcu_call_count becomes non-zero. */
> +static QemuEvent sync_event;
> +
> /*
> * Check whether a quiescent state was crossed between the beginning of
> * update_counter_and_wait and now.
> @@ -74,36 +77,21 @@ QEMU_DEFINE_CO_TLS(struct rcu_reader_data, rcu_reader)
> typedef QLIST_HEAD(, rcu_reader_data) ThreadList;
> static ThreadList registry = QLIST_HEAD_INITIALIZER(registry);
>
> +void force_rcu(void)
> +{
> + qatomic_set(&forced, true);
> + qemu_event_set(&sync_event);
> +}
> +
> /* Wait for previous parity/grace period to be empty of readers. */
> -static void wait_for_readers(void)
> +static void wait_for_readers(bool sleep)
> {
> ThreadList qsreaders = QLIST_HEAD_INITIALIZER(qsreaders);
> struct rcu_reader_data *index, *tmp;
> - int sleeps = 0;
> - bool forced = false;
> + int sleeps = sleep ? 5 : 0;
> + bool waiting = false;
>
> for (;;) {
> - /*
> - * Force the grace period to end and wait for it if any of the
> - * following heuristical conditions are satisfied:
> - * - A decent number of callbacks piled up.
> - * - It timed out.
> - * - It is in a drain_call_rcu() call.
> - *
> - * Otherwise, periodically poll the grace period, hoping it ends
> - * promptly.
> - */
> - if (!forced &&
> - (qatomic_read(&rcu_call_count) >= RCU_CALL_MIN_SIZE ||
> - sleeps >= 5 || qatomic_read(&in_drain_call_rcu))) {
> - forced = true;
> -
> - QLIST_FOREACH(index, &registry, node) {
> - notifier_list_notify(&index->force_rcu, NULL);
> - qatomic_set(&index->waiting, true);
> - }
> - }
IIUC this is the part that sets index->waiting whenever necessary, which then
guarantees that the wait(rcu_gp_event) will be notified in rcu unlock.
Now that this chunk is removed, could it happen that waiting=true and yet the
wait(rcu_gp_event) waits for longer than it should (as nobody will wake it
up if all threads have waiting=false)?
The other thing is, right below here there's the code and comment:
/* Here, order the stores to index->waiting before the loads of
* index->ctr. Pairs with smp_mb_placeholder() in rcu_read_unlock(),
* ensuring that the loads of index->ctr are sequentially consistent.
*
* If this is the last iteration, this barrier also prevents
* frees from seeping upwards, and orders the two wait phases
* on architectures with 32-bit longs; see enter_qs().
*/
smp_mb_global();
IIUC it explains that the smp_mb_global() orders the updates of waiting against
the reads of index->ctr. That doesn't look applicable anymore. That said,
I think we do still need some barrier to make sure we read
index->ctr only after we update the global rcu_gp_ctr (done before
calling wait_for_readers()). I'm not sure whether that means the mb is needed,
but maybe at least the comment is outdated if so.
> -
> /* Here, order the stores to index->waiting before the loads of
> * index->ctr. Pairs with smp_mb_placeholder() in rcu_read_unlock(),
> * ensuring that the loads of index->ctr are sequentially consistent.
> @@ -150,7 +138,8 @@ static void wait_for_readers(void)
> */
> qemu_mutex_unlock(&rcu_registry_lock);
>
> - if (forced) {
> + if (waiting) {
> + /* Wait for the forced quiescent state. */
> qemu_event_wait(&rcu_gp_event);
>
> /*
> @@ -158,9 +147,25 @@ static void wait_for_readers(void)
> * while we walk the list.
> */
> qemu_event_reset(&rcu_gp_event);
> + } else if (qatomic_read(&rcu_call_count) >= RCU_CALL_MIN_SIZE ||
> + !sleeps || qemu_event_timedwait(&sync_event, 10)) {
> + /*
> + * Now one of the following heuristical conditions is satisfied:
> + * - A decent number of callbacks piled up.
> + * - It timed out.
> + * - force_rcu() was called.
> + *
> + * Force a quiescent state.
> + */
> + waiting = true;
> +
> + QLIST_FOREACH(index, &registry, node) {
> + notifier_list_notify(&index->force_rcu, NULL);
> + qatomic_set(&index->waiting, true);
> + }
> } else {
> - g_usleep(10000);
> - sleeps++;
> + /* Try again. */
> + sleeps--;
> }
>
> qemu_mutex_lock(&rcu_registry_lock);
> @@ -170,7 +175,7 @@ static void wait_for_readers(void)
> QLIST_SWAP(&registry, &qsreaders, node);
> }
>
> -static void enter_qs(void)
> +static void enter_qs(bool sleep)
> {
> /* Write RCU-protected pointers before reading p_rcu_reader->ctr.
> * Pairs with smp_mb_placeholder() in rcu_read_lock().
> @@ -189,14 +194,14 @@ static void enter_qs(void)
> * Switch parity: 0 -> 1, 1 -> 0.
> */
> qatomic_set(&rcu_gp_ctr, rcu_gp_ctr ^ RCU_GP_CTR);
> - wait_for_readers();
> + wait_for_readers(sleep);
> qatomic_set(&rcu_gp_ctr, rcu_gp_ctr ^ RCU_GP_CTR);
> } else {
> /* Increment current grace period. */
> qatomic_set(&rcu_gp_ctr, rcu_gp_ctr + RCU_GP_CTR);
> }
>
> - wait_for_readers();
> + wait_for_readers(sleep);
> }
> }
>
> @@ -205,7 +210,6 @@ static void enter_qs(void)
> */
> static struct rcu_head dummy;
> static struct rcu_head *head = &dummy, **tail = &dummy.next;
> -static QemuEvent rcu_call_ready_event;
>
> static void enqueue(struct rcu_head *node)
> {
> @@ -282,6 +286,7 @@ static void *call_rcu_thread(void *opaque)
> rcu_register_thread();
>
> for (;;) {
> + bool sleep = true;
> int n;
>
> /*
> @@ -289,7 +294,7 @@ static void *call_rcu_thread(void *opaque)
> * added before enter_qs() starts.
> */
> for (;;) {
> - qemu_event_reset(&rcu_call_ready_event);
> + qemu_event_reset(&sync_event);
> n = qatomic_read(&rcu_call_count);
> if (n) {
> break;
> @@ -298,20 +303,36 @@ static void *call_rcu_thread(void *opaque)
> #if defined(CONFIG_MALLOC_TRIM)
> malloc_trim(4 * 1024 * 1024);
> #endif
> - qemu_event_wait(&rcu_call_ready_event);
> + qemu_event_wait(&sync_event);
> + }
> +
> + /*
> + * Ensure that an event for a rcu_call_count change will not interrupt
> + * wait_for_readers().
> + */
> + qemu_event_reset(&sync_event);
> +
> + /*
> + * Ensure that the forced variable has not been set after fetching
> + * rcu_call_count; otherwise we may get confused by a force quiescent
> + * state request for an element later than n.
> + */
> + while (qatomic_xchg(&forced, false)) {
> + sleep = false;
> + n = qatomic_read(&rcu_call_count);
> }
This is pretty tricky, and I wonder if it would make the code easier to read
if we converted sync_event into a semaphore instead. As a sem, it
would account for every kick, whether from a call_rcu1() or an
enforced rcu flush, so we wouldn't need to reset it. Meanwhile, we wouldn't
worry about a slightly outdated "n" read, because the 2nd round of sem_wait()
would catch the new "n".
Instead, the worst case is that the rcu thread runs one more round without
seeing callbacks on the queue.
I'm not sure whether that would help simplify the code, and maybe also make it
less error-prone.
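(For clarity, a rough sketch of the semaphore variant being suggested here,
with hypothetical naming; this is not code from this series:)

    /* Rough sketch; "rcu_call_sem" is hypothetical and assumed initialised. */
    static QemuSemaphore rcu_call_sem;

    void call_rcu1(struct rcu_head *node, void (*func)(struct rcu_head *node))
    {
        node->func = func;
        enqueue(node);
        qatomic_inc(&rcu_call_count);
        qemu_sem_post(&rcu_call_sem);   /* every kick is counted, no reset needed */
    }

    /* ... and the waiting loop in call_rcu_thread() would become: */
    for (;;) {
        n = qatomic_read(&rcu_call_count);
        if (n) {
            break;
        }
        qemu_sem_wait(&rcu_call_sem);   /* extra posts only cause extra iterations */
    }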
>
> - enter_qs();
> + enter_qs(sleep);
> qatomic_sub(&rcu_call_count, n);
> bql_lock();
> while (n > 0) {
> node = try_dequeue();
> while (!node) {
I have a pure question here, not related to your changes: do you know when
this "if" will trigger? It seems to me that enqueue() should always happen
before the increment of rcu_call_count:
void call_rcu1(struct rcu_head *node, void (*func)(struct rcu_head *node))
{
node->func = func;
enqueue(node);
if (!qatomic_fetch_inc(&rcu_call_count)) {
qemu_event_set(&sync_event);
}
}
I believe qatomic_fetch_inc() is RMW, so it's a strong mb and ordering is
guaranteed. Then why can the node be NULL here, even though we're sure >= n
nodes have been enqueued?
Thanks,
> bql_unlock();
> - qemu_event_reset(&rcu_call_ready_event);
> + qemu_event_reset(&sync_event);
> node = try_dequeue();
> if (!node) {
> - qemu_event_wait(&rcu_call_ready_event);
> + qemu_event_wait(&sync_event);
> node = try_dequeue();
> }
> bql_lock();
> @@ -329,8 +350,10 @@ void call_rcu1(struct rcu_head *node, void (*func)(struct rcu_head *node))
> {
> node->func = func;
> enqueue(node);
> - qatomic_inc(&rcu_call_count);
> - qemu_event_set(&rcu_call_ready_event);
> +
> + if (!qatomic_fetch_inc(&rcu_call_count)) {
> + qemu_event_set(&sync_event);
> + }
> }
>
>
> @@ -388,10 +411,9 @@ void drain_call_rcu(void)
> * assumed.
> */
>
> - qatomic_inc(&in_drain_call_rcu);
> call_rcu(&sync, sync_rcu_callback, rcu);
> + force_rcu();
> qemu_event_wait(&sync.complete_event);
> - qatomic_dec(&in_drain_call_rcu);
>
> if (locked) {
> bql_lock();
> @@ -435,7 +457,7 @@ static void rcu_init_complete(void)
> qemu_mutex_init(&rcu_registry_lock);
> qemu_event_init(&rcu_gp_event, true);
>
> - qemu_event_init(&rcu_call_ready_event, false);
> + qemu_event_init(&sync_event, false);
>
> /* The caller is assumed to have BQL, so the call_rcu thread
> * must have been quiescent even after forking, just recreate it.
>
> --
> 2.51.0
>
>
--
Peter Xu
^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH 4/5] rcu: Wake the RCU thread when draining
2025-10-29 18:22 ` Peter Xu
@ 2025-11-03 9:45 ` Akihiko Odaki
2025-11-05 20:43 ` Peter Xu
0 siblings, 1 reply; 29+ messages in thread
From: Akihiko Odaki @ 2025-11-03 9:45 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Dmitry Osipenko, Paolo Bonzini, Michael S. Tsirkin,
Alex Bennée
On 2025/10/30 3:22, Peter Xu wrote:
> On Wed, Oct 29, 2025 at 03:12:48PM +0900, Akihiko Odaki wrote:
>> drain_call_rcu() triggers the force quiescent state, but it can be
>> delayed if the RCU thread is sleeping. Ensure the force quiescent state
>> is immediately triggered by waking the RCU thread up.
>>
>> The logic to trigger the force quiescent state is decoupled as
>> force_rcu() so that it can be used independently.
>>
>> Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
>> ---
>> include/qemu/rcu.h | 1 +
>> util/rcu.c | 106 ++++++++++++++++++++++++++++++++---------------------
>> 2 files changed, 65 insertions(+), 42 deletions(-)
>>
>> diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h
>> index 020dbe4d8b77..d6aa4e5854d3 100644
>> --- a/include/qemu/rcu.h
>> +++ b/include/qemu/rcu.h
>> @@ -118,6 +118,7 @@ static inline void rcu_read_unlock(void)
>> }
>> }
>>
>> +void force_rcu(void);
>> void synchronize_rcu(void);
>>
>> /*
>> diff --git a/util/rcu.c b/util/rcu.c
>> index 3c4af9d213c8..85f9333f5dff 100644
>> --- a/util/rcu.c
>> +++ b/util/rcu.c
>> @@ -49,10 +49,13 @@
>> unsigned long rcu_gp_ctr = RCU_GP_LOCKED;
>>
>> QemuEvent rcu_gp_event;
>> -static int in_drain_call_rcu;
>> +static bool forced;
>> static int rcu_call_count;
>> static QemuMutex rcu_registry_lock;
>>
>> +/* Set when the forced variable is set or rcu_call_count becomes non-zero. */
>> +static QemuEvent sync_event;
>> +
>> /*
>> * Check whether a quiescent state was crossed between the beginning of
>> * update_counter_and_wait and now.
>> @@ -74,36 +77,21 @@ QEMU_DEFINE_CO_TLS(struct rcu_reader_data, rcu_reader)
>> typedef QLIST_HEAD(, rcu_reader_data) ThreadList;
>> static ThreadList registry = QLIST_HEAD_INITIALIZER(registry);
>>
>> +void force_rcu(void)
>> +{
>> + qatomic_set(&forced, true);
>> + qemu_event_set(&sync_event);
>> +}
>> +
>> /* Wait for previous parity/grace period to be empty of readers. */
>> -static void wait_for_readers(void)
>> +static void wait_for_readers(bool sleep)
>> {
>> ThreadList qsreaders = QLIST_HEAD_INITIALIZER(qsreaders);
>> struct rcu_reader_data *index, *tmp;
>> - int sleeps = 0;
>> - bool forced = false;
>> + int sleeps = sleep ? 5 : 0;
>> + bool waiting = false;
>>
>> for (;;) {
>> - /*
>> - * Force the grace period to end and wait for it if any of the
>> - * following heuristical conditions are satisfied:
>> - * - A decent number of callbacks piled up.
>> - * - It timed out.
>> - * - It is in a drain_call_rcu() call.
>> - *
>> - * Otherwise, periodically poll the grace period, hoping it ends
>> - * promptly.
>> - */
>> - if (!forced &&
>> - (qatomic_read(&rcu_call_count) >= RCU_CALL_MIN_SIZE ||
>> - sleeps >= 5 || qatomic_read(&in_drain_call_rcu))) {
>> - forced = true;
>> -
>> - QLIST_FOREACH(index, &registry, node) {
>> - notifier_list_notify(&index->force_rcu, NULL);
>> - qatomic_set(&index->waiting, true);
>> - }
>> - }
>
> IIUC this is the part to set index->waiting first whenever necessary, then
> that'll guarantee the wait(rcu_gp_event) will be notified in rcu unlock.
>
> Now we removed this chunk, then could it happen if waiting=true and the
> wait(rcu_gp_event) may wait for more than it should (as nobody will wake it
> up if all threads have waiting=false)?
It is not removed; it is moved along with the assignment of the
waiting variable. So index->waiting is still set whenever the waiting
variable is set, and no hang will happen.
>
> The other thing is, right below here there's the code and comment:
>
> /* Here, order the stores to index->waiting before the loads of
> * index->ctr. Pairs with smp_mb_placeholder() in rcu_read_unlock(),
> * ensuring that the loads of index->ctr are sequentially consistent.
> *
> * If this is the last iteration, this barrier also prevents
> * frees from seeping upwards, and orders the two wait phases
> * on architectures with 32-bit longs; see enter_qs().
> */
> smp_mb_global();
>
> IIUC it explains the mb_global() to order the updates of waiting and the
> reads of index->ctr. It doesn't look like applicable anymore. Said that,
> I think we should indeed still need some barrier to make sure we read
> index->ctr at least to be after we update global gp_ctr (done before
> calling wait_for_readers()). I'm not sure if it means the mb is needed,
> however maybe at least the comment is outdated if so.
It is still applicable. The stores to index->waiting are still present
and need to be ordered before the loads of index->ctr.
>
>> -
>> /* Here, order the stores to index->waiting before the loads of
>> * index->ctr. Pairs with smp_mb_placeholder() in rcu_read_unlock(),
>> * ensuring that the loads of index->ctr are sequentially consistent.
>> @@ -150,7 +138,8 @@ static void wait_for_readers(void)
>> */
>> qemu_mutex_unlock(&rcu_registry_lock);
>>
>> - if (forced) {
>> + if (waiting) {
>> + /* Wait for the forced quiescent state. */
>> qemu_event_wait(&rcu_gp_event);
>>
>> /*
>> @@ -158,9 +147,25 @@ static void wait_for_readers(void)
>> * while we walk the list.
>> */
>> qemu_event_reset(&rcu_gp_event);
>> + } else if (qatomic_read(&rcu_call_count) >= RCU_CALL_MIN_SIZE ||
>> + !sleeps || qemu_event_timedwait(&sync_event, 10)) {
>> + /*
>> + * Now one of the following heuristical conditions is satisfied:
>> + * - A decent number of callbacks piled up.
>> + * - It timed out.
>> + * - force_rcu() was called.
>> + *
>> + * Force a quiescent state.
>> + */
>> + waiting = true;
>> +
>> + QLIST_FOREACH(index, &registry, node) {
>> + notifier_list_notify(&index->force_rcu, NULL);
>> + qatomic_set(&index->waiting, true);
>> + }
>> } else {
>> - g_usleep(10000);
>> - sleeps++;
>> + /* Try again. */
>> + sleeps--;
>> }
>>
>> qemu_mutex_lock(&rcu_registry_lock);
>> @@ -170,7 +175,7 @@ static void wait_for_readers(void)
>> QLIST_SWAP(&registry, &qsreaders, node);
>> }
>>
>> -static void enter_qs(void)
>> +static void enter_qs(bool sleep)
>> {
>> /* Write RCU-protected pointers before reading p_rcu_reader->ctr.
>> * Pairs with smp_mb_placeholder() in rcu_read_lock().
>> @@ -189,14 +194,14 @@ static void enter_qs(void)
>> * Switch parity: 0 -> 1, 1 -> 0.
>> */
>> qatomic_set(&rcu_gp_ctr, rcu_gp_ctr ^ RCU_GP_CTR);
>> - wait_for_readers();
>> + wait_for_readers(sleep);
>> qatomic_set(&rcu_gp_ctr, rcu_gp_ctr ^ RCU_GP_CTR);
>> } else {
>> /* Increment current grace period. */
>> qatomic_set(&rcu_gp_ctr, rcu_gp_ctr + RCU_GP_CTR);
>> }
>>
>> - wait_for_readers();
>> + wait_for_readers(sleep);
>> }
>> }
>>
>> @@ -205,7 +210,6 @@ static void enter_qs(void)
>> */
>> static struct rcu_head dummy;
>> static struct rcu_head *head = &dummy, **tail = &dummy.next;
>> -static QemuEvent rcu_call_ready_event;
>>
>> static void enqueue(struct rcu_head *node)
>> {
>> @@ -282,6 +286,7 @@ static void *call_rcu_thread(void *opaque)
>> rcu_register_thread();
>>
>> for (;;) {
>> + bool sleep = true;
>> int n;
>>
>> /*
>> @@ -289,7 +294,7 @@ static void *call_rcu_thread(void *opaque)
>> * added before enter_qs() starts.
>> */
>> for (;;) {
>> - qemu_event_reset(&rcu_call_ready_event);
>> + qemu_event_reset(&sync_event);
>> n = qatomic_read(&rcu_call_count);
>> if (n) {
>> break;
>> @@ -298,20 +303,36 @@ static void *call_rcu_thread(void *opaque)
>> #if defined(CONFIG_MALLOC_TRIM)
>> malloc_trim(4 * 1024 * 1024);
>> #endif
>> - qemu_event_wait(&rcu_call_ready_event);
>> + qemu_event_wait(&sync_event);
>> + }
>> +
>> + /*
>> + * Ensure that an event for a rcu_call_count change will not interrupt
>> + * wait_for_readers().
>> + */
>> + qemu_event_reset(&sync_event);
>> +
>> + /*
>> + * Ensure that the forced variable has not been set after fetching
>> + * rcu_call_count; otherwise we may get confused by a force quiescent
>> + * state request for an element later than n.
>> + */
>> + while (qatomic_xchg(&forced, false)) {
>> + sleep = false;
>> + n = qatomic_read(&rcu_call_count);
>> }
>
> This is pretty tricky, and I wonder if it will make the code easier to read
> if we convert the sync_event to be a semaphore instead. When as a sem, it
> will take account of whatever kick to it, either a call_rcu1() or an
> enforced rcu flush, so that we don't need to reset it. Meanwhile, we don't
> worry on an slightly outdated "n" read because the 2nd round of sem_wait()
> will catch that new "n".
>
> Instead, worst case is rcu thread runs one more round without seeing
> callbacks on the queue.
>
> I'm not sure if that could help simplying code, maybe also make it less
> error-prone.
A semaphore is not applicable here because it will not de-duplicate
concurrent kicks of the RCU thread.
>
>>
>> - enter_qs();
>> + enter_qs(sleep);
>> qatomic_sub(&rcu_call_count, n);
>> bql_lock();
>> while (n > 0) {
>> node = try_dequeue();
>> while (!node) {
>
> I have a pure question here not relevant to your changes.. do you know when
> this "if" will trigger? It seems to me the enqueue() should always happen
> before the increment of rcu_call_count:
>
> void call_rcu1(struct rcu_head *node, void (*func)(struct rcu_head *node))
> {
> node->func = func;
> enqueue(node);
>
> if (!qatomic_fetch_inc(&rcu_call_count)) {
> qemu_event_set(&sync_event);
> }
> }
>
> I believe qatomic_fetch_inc() is RMW so it's strong mb and order
> guaranteed. Then here why the node can be null even if we're sure >=n have
> been enqueued?
Indeed, enqueue() always happens before the increment of rcu_call_count
performed by the same thread.
The node can still be NULL when there are two concurrent call_rcu1()
executions. In the following example, rcu_call_count will be greater
than the number of visible nodes after (A) and before (B):
Thread T                          Thread U
call_rcu1(O)
  enqueue(O)
    Load N from tail
    tail = O->next
                                  call_rcu1(P)
                                    enqueue(P)
                                      Load O->next from tail
                                      tail = P
                                      O->next = P
                                    rcu_call_count++ (A)
    N->next = O (B)
  rcu_call_count++
Regards,
Akihiko Odaki
>
> Thanks,
>
>> bql_unlock();
>> - qemu_event_reset(&rcu_call_ready_event);
>> + qemu_event_reset(&sync_event);
>> node = try_dequeue();
>> if (!node) {
>> - qemu_event_wait(&rcu_call_ready_event);
>> + qemu_event_wait(&sync_event);
>> node = try_dequeue();
>> }
>> bql_lock();
>> @@ -329,8 +350,10 @@ void call_rcu1(struct rcu_head *node, void (*func)(struct rcu_head *node))
>> {
>> node->func = func;
>> enqueue(node);
>> - qatomic_inc(&rcu_call_count);
>> - qemu_event_set(&rcu_call_ready_event);
>> +
>> + if (!qatomic_fetch_inc(&rcu_call_count)) {
>> + qemu_event_set(&sync_event);
>> + }
>> }
>>
>>
>> @@ -388,10 +411,9 @@ void drain_call_rcu(void)
>> * assumed.
>> */
>>
>> - qatomic_inc(&in_drain_call_rcu);
>> call_rcu(&sync, sync_rcu_callback, rcu);
>> + force_rcu();
>> qemu_event_wait(&sync.complete_event);
>> - qatomic_dec(&in_drain_call_rcu);
>>
>> if (locked) {
>> bql_lock();
>> @@ -435,7 +457,7 @@ static void rcu_init_complete(void)
>> qemu_mutex_init(&rcu_registry_lock);
>> qemu_event_init(&rcu_gp_event, true);
>>
>> - qemu_event_init(&rcu_call_ready_event, false);
>> + qemu_event_init(&sync_event, false);
>>
>> /* The caller is assumed to have BQL, so the call_rcu thread
>> * must have been quiescent even after forking, just recreate it.
>>
>> --
>> 2.51.0
>>
>>
>
^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH 4/5] rcu: Wake the RCU thread when draining
2025-11-03 9:45 ` Akihiko Odaki
@ 2025-11-05 20:43 ` Peter Xu
2025-11-06 1:40 ` Akihiko Odaki
0 siblings, 1 reply; 29+ messages in thread
From: Peter Xu @ 2025-11-05 20:43 UTC (permalink / raw)
To: Akihiko Odaki
Cc: qemu-devel, Dmitry Osipenko, Paolo Bonzini, Michael S. Tsirkin,
Alex Bennée
On Mon, Nov 03, 2025 at 06:45:30PM +0900, Akihiko Odaki wrote:
> On 2025/10/30 3:22, Peter Xu wrote:
> > On Wed, Oct 29, 2025 at 03:12:48PM +0900, Akihiko Odaki wrote:
> > > drain_call_rcu() triggers the force quiescent state, but it can be
> > > delayed if the RCU thread is sleeping. Ensure the force quiescent state
> > > is immediately triggered by waking the RCU thread up.
> > >
> > > The logic to trigger the force quiescent state is decoupled as
> > > force_rcu() so that it can be used independently.
> > >
> > > Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
> > > ---
> > > include/qemu/rcu.h | 1 +
> > > util/rcu.c | 106 ++++++++++++++++++++++++++++++++---------------------
> > > 2 files changed, 65 insertions(+), 42 deletions(-)
> > >
> > > diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h
> > > index 020dbe4d8b77..d6aa4e5854d3 100644
> > > --- a/include/qemu/rcu.h
> > > +++ b/include/qemu/rcu.h
> > > @@ -118,6 +118,7 @@ static inline void rcu_read_unlock(void)
> > > }
> > > }
> > > +void force_rcu(void);
> > > void synchronize_rcu(void);
> > > /*
> > > diff --git a/util/rcu.c b/util/rcu.c
> > > index 3c4af9d213c8..85f9333f5dff 100644
> > > --- a/util/rcu.c
> > > +++ b/util/rcu.c
> > > @@ -49,10 +49,13 @@
> > > unsigned long rcu_gp_ctr = RCU_GP_LOCKED;
> > > QemuEvent rcu_gp_event;
> > > -static int in_drain_call_rcu;
> > > +static bool forced;
> > > static int rcu_call_count;
> > > static QemuMutex rcu_registry_lock;
> > > +/* Set when the forced variable is set or rcu_call_count becomes non-zero. */
> > > +static QemuEvent sync_event;
> > > +
> > > /*
> > > * Check whether a quiescent state was crossed between the beginning of
> > > * update_counter_and_wait and now.
> > > @@ -74,36 +77,21 @@ QEMU_DEFINE_CO_TLS(struct rcu_reader_data, rcu_reader)
> > > typedef QLIST_HEAD(, rcu_reader_data) ThreadList;
> > > static ThreadList registry = QLIST_HEAD_INITIALIZER(registry);
> > > +void force_rcu(void)
> > > +{
> > > + qatomic_set(&forced, true);
> > > + qemu_event_set(&sync_event);
> > > +}
> > > +
> > > /* Wait for previous parity/grace period to be empty of readers. */
> > > -static void wait_for_readers(void)
> > > +static void wait_for_readers(bool sleep)
> > > {
> > > ThreadList qsreaders = QLIST_HEAD_INITIALIZER(qsreaders);
> > > struct rcu_reader_data *index, *tmp;
> > > - int sleeps = 0;
> > > - bool forced = false;
> > > + int sleeps = sleep ? 5 : 0;
> > > + bool waiting = false;
> > > for (;;) {
> > > - /*
> > > - * Force the grace period to end and wait for it if any of the
> > > - * following heuristical conditions are satisfied:
> > > - * - A decent number of callbacks piled up.
> > > - * - It timed out.
> > > - * - It is in a drain_call_rcu() call.
> > > - *
> > > - * Otherwise, periodically poll the grace period, hoping it ends
> > > - * promptly.
> > > - */
> > > - if (!forced &&
> > > - (qatomic_read(&rcu_call_count) >= RCU_CALL_MIN_SIZE ||
> > > - sleeps >= 5 || qatomic_read(&in_drain_call_rcu))) {
> > > - forced = true;
> > > -
> > > - QLIST_FOREACH(index, &registry, node) {
> > > - notifier_list_notify(&index->force_rcu, NULL);
> > > - qatomic_set(&index->waiting, true);
> > > - }
> > > - }
> >
> > IIUC this is the part to set index->waiting first whenever necessary, then
> > that'll guarantee the wait(rcu_gp_event) will be notified in rcu unlock.
> >
> > Now we removed this chunk, then could it happen if waiting=true and the
> > wait(rcu_gp_event) may wait for more than it should (as nobody will wake it
> > up if all threads have waiting=false)?
>
> It is not removed, but it is moved along with the assignment of the waiting
> variable. So index->waiting is still set whenever the waiting variable is
> set and no hang up will happen.
Ah, I noticed that "waiting" is actually the global variable, and I think
when I read it last time I somehow misread it as "sleep" (which was
passed down from the caller).
if (waiting) { <--------------------
qemu_event_wait(&rcu_gp_event);
qemu_event_reset(&rcu_gp_event);
} else if (qatomic_read(&rcu_call_count) >= RCU_CALL_MIN_SIZE ||
Yeah, I think it's non-issue. Please ignore my question.
>
> >
> > The other thing is, right below here there's the code and comment:
> >
> > /* Here, order the stores to index->waiting before the loads of
> > * index->ctr. Pairs with smp_mb_placeholder() in rcu_read_unlock(),
> > * ensuring that the loads of index->ctr are sequentially consistent.
> > *
> > * If this is the last iteration, this barrier also prevents
> > * frees from seeping upwards, and orders the two wait phases
> > * on architectures with 32-bit longs; see enter_qs().
> > */
> > smp_mb_global();
> >
> > IIUC it explains the mb_global() to order the updates of waiting and the
> > reads of index->ctr. It doesn't look like applicable anymore. Said that,
> > I think we should indeed still need some barrier to make sure we read
> > index->ctr at least to be after we update global gp_ctr (done before
> > calling wait_for_readers()). I'm not sure if it means the mb is needed,
> > however maybe at least the comment is outdated if so.
>
> It is still applicable. The stores to index->waiting is still present and
> needs to be ordered before the loads of index->ctr.
>
> >
> > > -
> > > /* Here, order the stores to index->waiting before the loads of
> > > * index->ctr. Pairs with smp_mb_placeholder() in rcu_read_unlock(),
> > > * ensuring that the loads of index->ctr are sequentially consistent.
> > > @@ -150,7 +138,8 @@ static void wait_for_readers(void)
> > > */
> > > qemu_mutex_unlock(&rcu_registry_lock);
> > > - if (forced) {
> > > + if (waiting) {
> > > + /* Wait for the forced quiescent state. */
> > > qemu_event_wait(&rcu_gp_event);
> > > /*
> > > @@ -158,9 +147,25 @@ static void wait_for_readers(void)
> > > * while we walk the list.
> > > */
> > > qemu_event_reset(&rcu_gp_event);
> > > + } else if (qatomic_read(&rcu_call_count) >= RCU_CALL_MIN_SIZE ||
> > > + !sleeps || qemu_event_timedwait(&sync_event, 10)) {
> > > + /*
> > > + * Now one of the following heuristical conditions is satisfied:
> > > + * - A decent number of callbacks piled up.
> > > + * - It timed out.
> > > + * - force_rcu() was called.
> > > + *
> > > + * Force a quiescent state.
> > > + */
> > > + waiting = true;
> > > +
> > > + QLIST_FOREACH(index, &registry, node) {
> > > + notifier_list_notify(&index->force_rcu, NULL);
> > > + qatomic_set(&index->waiting, true);
> > > + }
> > > } else {
> > > - g_usleep(10000);
> > > - sleeps++;
> > > + /* Try again. */
> > > + sleeps--;
> > > }
> > > qemu_mutex_lock(&rcu_registry_lock);
> > > @@ -170,7 +175,7 @@ static void wait_for_readers(void)
> > > QLIST_SWAP(&registry, &qsreaders, node);
> > > }
> > > -static void enter_qs(void)
> > > +static void enter_qs(bool sleep)
> > > {
> > > /* Write RCU-protected pointers before reading p_rcu_reader->ctr.
> > > * Pairs with smp_mb_placeholder() in rcu_read_lock().
> > > @@ -189,14 +194,14 @@ static void enter_qs(void)
> > > * Switch parity: 0 -> 1, 1 -> 0.
> > > */
> > > qatomic_set(&rcu_gp_ctr, rcu_gp_ctr ^ RCU_GP_CTR);
> > > - wait_for_readers();
> > > + wait_for_readers(sleep);
> > > qatomic_set(&rcu_gp_ctr, rcu_gp_ctr ^ RCU_GP_CTR);
> > > } else {
> > > /* Increment current grace period. */
> > > qatomic_set(&rcu_gp_ctr, rcu_gp_ctr + RCU_GP_CTR);
> > > }
> > > - wait_for_readers();
> > > + wait_for_readers(sleep);
> > > }
> > > }
> > > @@ -205,7 +210,6 @@ static void enter_qs(void)
> > > */
> > > static struct rcu_head dummy;
> > > static struct rcu_head *head = &dummy, **tail = &dummy.next;
> > > -static QemuEvent rcu_call_ready_event;
> > > static void enqueue(struct rcu_head *node)
> > > {
> > > @@ -282,6 +286,7 @@ static void *call_rcu_thread(void *opaque)
> > > rcu_register_thread();
> > > for (;;) {
> > > + bool sleep = true;
> > > int n;
> > > /*
> > > @@ -289,7 +294,7 @@ static void *call_rcu_thread(void *opaque)
> > > * added before enter_qs() starts.
> > > */
> > > for (;;) {
> > > - qemu_event_reset(&rcu_call_ready_event);
> > > + qemu_event_reset(&sync_event);
> > > n = qatomic_read(&rcu_call_count);
> > > if (n) {
> > > break;
> > > @@ -298,20 +303,36 @@ static void *call_rcu_thread(void *opaque)
> > > #if defined(CONFIG_MALLOC_TRIM)
> > > malloc_trim(4 * 1024 * 1024);
> > > #endif
> > > - qemu_event_wait(&rcu_call_ready_event);
> > > + qemu_event_wait(&sync_event);
> > > + }
> > > +
> > > + /*
> > > + * Ensure that an event for a rcu_call_count change will not interrupt
> > > + * wait_for_readers().
> > > + */
> > > + qemu_event_reset(&sync_event);
> > > +
> > > + /*
> > > + * Ensure that the forced variable has not been set after fetching
> > > + * rcu_call_count; otherwise we may get confused by a force quiescent
> > > + * state request for an element later than n.
> > > + */
> > > + while (qatomic_xchg(&forced, false)) {
> > > + sleep = false;
> > > + n = qatomic_read(&rcu_call_count);
> > > }
> >
> > This is pretty tricky, and I wonder if it will make the code easier to read
> > if we convert the sync_event to be a semaphore instead. When as a sem, it
> > will take account of whatever kick to it, either a call_rcu1() or an
> > enforced rcu flush, so that we don't need to reset it. Meanwhile, we don't
> > worry on an slightly outdated "n" read because the 2nd round of sem_wait()
> > will catch that new "n".
> >
> > Instead, worst case is rcu thread runs one more round without seeing
> > callbacks on the queue.
> >
> > I'm not sure if that could help simplying code, maybe also make it less
> > error-prone.
>
> Semaphore is not applicable here because it will not de-duplicate concurrent
> kicks of RCU threads.
Why are concurrent kicks of the rcu thread a problem? QemuSemaphore is
thread-safe itself, and IIUC the worst case is that call_rcu_thread()
loops a few more rounds reading "n", which looks safe. No?
>
> >
> > > - enter_qs();
> > > + enter_qs(sleep);
> > > qatomic_sub(&rcu_call_count, n);
> > > bql_lock();
> > > while (n > 0) {
> > > node = try_dequeue();
> > > while (!node) {
> >
> > I have a pure question here not relevant to your changes.. do you know when
> > this "if" will trigger? It seems to me the enqueue() should always happen
> > before the increment of rcu_call_count:
> >
> > void call_rcu1(struct rcu_head *node, void (*func)(struct rcu_head *node))
> > {
> > node->func = func;
> > enqueue(node);
> >
> > if (!qatomic_fetch_inc(&rcu_call_count)) {
> > qemu_event_set(&sync_event);
> > }
> > }>
> > I believe qatomic_fetch_inc() is RMW so it's strong mb and order
> > guaranteed. Then here why the node can be null even if we're sure >=n have
> > been enqueued?
>
> Indeed, enqueue() always happens before the increment of rcu_call_count
> performed by the same thread.
>
> The node can still be NULL when there are two concurrent call_rcu1()
> executions. In the following example, rcu_call_count will be greater than
> the number of visible nodes after (A) and before (B):
>
> Thread T Thread U
> call_rcu1(O)
> enqueue(O)
> Load N from tail
> tail = O->next
> call_rcu1(P)
> enqueue(P)
> Load O->next from tail
> tail = P
> O->next = P
> rcu_call_count++ (A)
> N->next = O (B)
> rcu_call_count++
Thanks, yeah, it makes sense. If you think it's worthwhile, maybe we could add a
comment after the first try_dequeue().
--
Peter Xu
^ permalink raw reply [flat|nested] 29+ messages in thread

* Re: [PATCH 4/5] rcu: Wake the RCU thread when draining
2025-11-05 20:43 ` Peter Xu
@ 2025-11-06 1:40 ` Akihiko Odaki
2025-11-06 21:52 ` Peter Xu
0 siblings, 1 reply; 29+ messages in thread
From: Akihiko Odaki @ 2025-11-06 1:40 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Dmitry Osipenko, Paolo Bonzini, Michael S. Tsirkin,
Alex Bennée
On 2025/11/06 5:43, Peter Xu wrote:
> On Mon, Nov 03, 2025 at 06:45:30PM +0900, Akihiko Odaki wrote:
>> On 2025/10/30 3:22, Peter Xu wrote:
>>> On Wed, Oct 29, 2025 at 03:12:48PM +0900, Akihiko Odaki wrote:
>>>> drain_call_rcu() triggers the force quiescent state, but it can be
>>>> delayed if the RCU thread is sleeping. Ensure the force quiescent state
>>>> is immediately triggered by waking the RCU thread up.
>>>>
>>>> The logic to trigger the force quiescent state is decoupled as
>>>> force_rcu() so that it can be used independently.
>>>>
>>>> Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
>>>> ---
>>>> include/qemu/rcu.h | 1 +
>>>> util/rcu.c | 106 ++++++++++++++++++++++++++++++++---------------------
>>>> 2 files changed, 65 insertions(+), 42 deletions(-)
>>>>
>>>> diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h
>>>> index 020dbe4d8b77..d6aa4e5854d3 100644
>>>> --- a/include/qemu/rcu.h
>>>> +++ b/include/qemu/rcu.h
>>>> @@ -118,6 +118,7 @@ static inline void rcu_read_unlock(void)
>>>> }
>>>> }
>>>> +void force_rcu(void);
>>>> void synchronize_rcu(void);
>>>> /*
>>>> diff --git a/util/rcu.c b/util/rcu.c
>>>> index 3c4af9d213c8..85f9333f5dff 100644
>>>> --- a/util/rcu.c
>>>> +++ b/util/rcu.c
>>>> @@ -49,10 +49,13 @@
>>>> unsigned long rcu_gp_ctr = RCU_GP_LOCKED;
>>>> QemuEvent rcu_gp_event;
>>>> -static int in_drain_call_rcu;
>>>> +static bool forced;
>>>> static int rcu_call_count;
>>>> static QemuMutex rcu_registry_lock;
>>>> +/* Set when the forced variable is set or rcu_call_count becomes non-zero. */
>>>> +static QemuEvent sync_event;
>>>> +
>>>> /*
>>>> * Check whether a quiescent state was crossed between the beginning of
>>>> * update_counter_and_wait and now.
>>>> @@ -74,36 +77,21 @@ QEMU_DEFINE_CO_TLS(struct rcu_reader_data, rcu_reader)
>>>> typedef QLIST_HEAD(, rcu_reader_data) ThreadList;
>>>> static ThreadList registry = QLIST_HEAD_INITIALIZER(registry);
>>>> +void force_rcu(void)
>>>> +{
>>>> + qatomic_set(&forced, true);
>>>> + qemu_event_set(&sync_event);
>>>> +}
>>>> +
>>>> /* Wait for previous parity/grace period to be empty of readers. */
>>>> -static void wait_for_readers(void)
>>>> +static void wait_for_readers(bool sleep)
>>>> {
>>>> ThreadList qsreaders = QLIST_HEAD_INITIALIZER(qsreaders);
>>>> struct rcu_reader_data *index, *tmp;
>>>> - int sleeps = 0;
>>>> - bool forced = false;
>>>> + int sleeps = sleep ? 5 : 0;
>>>> + bool waiting = false;
>>>> for (;;) {
>>>> - /*
>>>> - * Force the grace period to end and wait for it if any of the
>>>> - * following heuristical conditions are satisfied:
>>>> - * - A decent number of callbacks piled up.
>>>> - * - It timed out.
>>>> - * - It is in a drain_call_rcu() call.
>>>> - *
>>>> - * Otherwise, periodically poll the grace period, hoping it ends
>>>> - * promptly.
>>>> - */
>>>> - if (!forced &&
>>>> - (qatomic_read(&rcu_call_count) >= RCU_CALL_MIN_SIZE ||
>>>> - sleeps >= 5 || qatomic_read(&in_drain_call_rcu))) {
>>>> - forced = true;
>>>> -
>>>> - QLIST_FOREACH(index, ®istry, node) {
>>>> - notifier_list_notify(&index->force_rcu, NULL);
>>>> - qatomic_set(&index->waiting, true);
>>>> - }
>>>> - }
>>>
>>> IIUC this is the part to set index->waiting first whenever necessary, then
>>> that'll guarantee the wait(rcu_gp_event) will be notified in rcu unlock.
>>>
>>> Now we removed this chunk, then could it happen if waiting=true and the
>>> wait(rcu_gp_event) may wait for more than it should (as nobody will wake it
>>> up if all threads have waiting=false)?
>>
>> It is not removed, but it is moved along with the assignment of the waiting
>> variable. So index->waiting is still set whenever the waiting variable is
>> set and no hang up will happen.
>
> Ah, I noticed the "waiting" is actually the global variable and I think
> when I read it last time I somehow misread it with "sleep" (which was
> passed down from the caller).
waiting is local. There are several variables here; some are global and
others are local, so please be careful.
>
> if (waiting) { <--------------------
> qemu_event_wait(&rcu_gp_event);
> qemu_event_reset(&rcu_gp_event);
> } else if (qatomic_read(&rcu_call_count) >= RCU_CALL_MIN_SIZE ||
>
> Yeah, I think it's non-issue. Please ignore my question.
>
>>
>>>
>>> The other thing is, right below here there's the code and comment:
>>>
>>> /* Here, order the stores to index->waiting before the loads of
>>> * index->ctr. Pairs with smp_mb_placeholder() in rcu_read_unlock(),
>>> * ensuring that the loads of index->ctr are sequentially consistent.
>>> *
>>> * If this is the last iteration, this barrier also prevents
>>> * frees from seeping upwards, and orders the two wait phases
>>> * on architectures with 32-bit longs; see enter_qs().
>>> */
>>> smp_mb_global();
>>>
>>> IIUC it explains the mb_global() to order the updates of waiting and the
>>> reads of index->ctr. It doesn't look like applicable anymore. Said that,
>>> I think we should indeed still need some barrier to make sure we read
>>> index->ctr at least to be after we update global gp_ctr (done before
>>> calling wait_for_readers()). I'm not sure if it means the mb is needed,
>>> however maybe at least the comment is outdated if so.
>>
>> It is still applicable. The stores to index->waiting is still present and
>> needs to be ordered before the loads of index->ctr.
>>
>>>
>>>> -
>>>> /* Here, order the stores to index->waiting before the loads of
>>>> * index->ctr. Pairs with smp_mb_placeholder() in rcu_read_unlock(),
>>>> * ensuring that the loads of index->ctr are sequentially consistent.
>>>> @@ -150,7 +138,8 @@ static void wait_for_readers(void)
>>>> */
>>>> qemu_mutex_unlock(&rcu_registry_lock);
>>>> - if (forced) {
>>>> + if (waiting) {
>>>> + /* Wait for the forced quiescent state. */
>>>> qemu_event_wait(&rcu_gp_event);
>>>> /*
>>>> @@ -158,9 +147,25 @@ static void wait_for_readers(void)
>>>> * while we walk the list.
>>>> */
>>>> qemu_event_reset(&rcu_gp_event);
>>>> + } else if (qatomic_read(&rcu_call_count) >= RCU_CALL_MIN_SIZE ||
>>>> + !sleeps || qemu_event_timedwait(&sync_event, 10)) {
>>>> + /*
>>>> + * Now one of the following heuristical conditions is satisfied:
>>>> + * - A decent number of callbacks piled up.
>>>> + * - It timed out.
>>>> + * - force_rcu() was called.
>>>> + *
>>>> + * Force a quiescent state.
>>>> + */
>>>> + waiting = true;
>>>> +
> > > > + QLIST_FOREACH(index, &registry, node) {
>>>> + notifier_list_notify(&index->force_rcu, NULL);
>>>> + qatomic_set(&index->waiting, true);
>>>> + }
>>>> } else {
>>>> - g_usleep(10000);
>>>> - sleeps++;
>>>> + /* Try again. */
>>>> + sleeps--;
>>>> }
>>>> qemu_mutex_lock(&rcu_registry_lock);
>>>> @@ -170,7 +175,7 @@ static void wait_for_readers(void)
> > > > QLIST_SWAP(&registry, &qsreaders, node);
>>>> }
>>>> -static void enter_qs(void)
>>>> +static void enter_qs(bool sleep)
>>>> {
>>>> /* Write RCU-protected pointers before reading p_rcu_reader->ctr.
>>>> * Pairs with smp_mb_placeholder() in rcu_read_lock().
>>>> @@ -189,14 +194,14 @@ static void enter_qs(void)
>>>> * Switch parity: 0 -> 1, 1 -> 0.
>>>> */
>>>> qatomic_set(&rcu_gp_ctr, rcu_gp_ctr ^ RCU_GP_CTR);
>>>> - wait_for_readers();
>>>> + wait_for_readers(sleep);
>>>> qatomic_set(&rcu_gp_ctr, rcu_gp_ctr ^ RCU_GP_CTR);
>>>> } else {
>>>> /* Increment current grace period. */
>>>> qatomic_set(&rcu_gp_ctr, rcu_gp_ctr + RCU_GP_CTR);
>>>> }
>>>> - wait_for_readers();
>>>> + wait_for_readers(sleep);
>>>> }
>>>> }
>>>> @@ -205,7 +210,6 @@ static void enter_qs(void)
>>>> */
>>>> static struct rcu_head dummy;
>>>> static struct rcu_head *head = &dummy, **tail = &dummy.next;
>>>> -static QemuEvent rcu_call_ready_event;
>>>> static void enqueue(struct rcu_head *node)
>>>> {
>>>> @@ -282,6 +286,7 @@ static void *call_rcu_thread(void *opaque)
>>>> rcu_register_thread();
>>>> for (;;) {
>>>> + bool sleep = true;
>>>> int n;
>>>> /*
>>>> @@ -289,7 +294,7 @@ static void *call_rcu_thread(void *opaque)
>>>> * added before enter_qs() starts.
>>>> */
>>>> for (;;) {
>>>> - qemu_event_reset(&rcu_call_ready_event);
>>>> + qemu_event_reset(&sync_event);
>>>> n = qatomic_read(&rcu_call_count);
>>>> if (n) {
>>>> break;
>>>> @@ -298,20 +303,36 @@ static void *call_rcu_thread(void *opaque)
>>>> #if defined(CONFIG_MALLOC_TRIM)
>>>> malloc_trim(4 * 1024 * 1024);
>>>> #endif
>>>> - qemu_event_wait(&rcu_call_ready_event);
>>>> + qemu_event_wait(&sync_event);
>>>> + }
>>>> +
>>>> + /*
>>>> + * Ensure that an event for a rcu_call_count change will not interrupt
>>>> + * wait_for_readers().
>>>> + */
>>>> + qemu_event_reset(&sync_event);
>>>> +
>>>> + /*
>>>> + * Ensure that the forced variable has not been set after fetching
>>>> + * rcu_call_count; otherwise we may get confused by a force quiescent
>>>> + * state request for an element later than n.
>>>> + */
>>>> + while (qatomic_xchg(&forced, false)) {
>>>> + sleep = false;
>>>> + n = qatomic_read(&rcu_call_count);
>>>> }
>>>
>>> This is pretty tricky, and I wonder if it will make the code easier to read
>>> if we convert the sync_event to be a semaphore instead. When as a sem, it
>>> will take account of whatever kick to it, either a call_rcu1() or an
>>> enforced rcu flush, so that we don't need to reset it. Meanwhile, we don't
>>> worry on an slightly outdated "n" read because the 2nd round of sem_wait()
>>> will catch that new "n".
>>>
>>> Instead, worst case is rcu thread runs one more round without seeing
>>> callbacks on the queue.
>>>
>>> I'm not sure if that could help simplying code, maybe also make it less
>>> error-prone.
>>
>> Semaphore is not applicable here because it will not de-duplicate concurrent
>> kicks of RCU threads.
>
> Why concurrent kicks of rcu threads is a problem? QemuSemaphore is
> thread-safe itself, meanwhile IIUC it only still causes call_rcu_thread()
> loops some more rounds reading "n", which looks all safe. No?
It is safe, but it incurs overhead and is confusing. QemuEvent represents the
boolean semantics better.
I also have difficulty understanding how converting sync_event to a
semaphore simplifies the code. Perhaps some (pseudo)code showing how the
code would look may be useful.
>
>>
>>>
>>>> - enter_qs();
>>>> + enter_qs(sleep);
>>>> qatomic_sub(&rcu_call_count, n);
>>>> bql_lock();
>>>> while (n > 0) {
>>>> node = try_dequeue();
>>>> while (!node) {
>>>
>>> I have a pure question here not relevant to your changes.. do you know when
>>> this "if" will trigger? It seems to me the enqueue() should always happen
>>> before the increment of rcu_call_count:
>>>
>>> void call_rcu1(struct rcu_head *node, void (*func)(struct rcu_head *node))
>>> {
>>> node->func = func;
>>> enqueue(node);
>>>
>>> if (!qatomic_fetch_inc(&rcu_call_count)) {
>>> qemu_event_set(&sync_event);
>>> }
>>> }>
>>> I believe qatomic_fetch_inc() is RMW so it's strong mb and order
>>> guaranteed. Then here why the node can be null even if we're sure >=n have
>>> been enqueued?
>>
>> Indeed, enqueue() always happens before the increment of rcu_call_count
>> performed by the same thread.
>>
>> The node can still be NULL when there are two concurrent call_rcu1()
>> executions. In the following example, rcu_call_count will be greater than
>> the number of visible nodes after (A) and before (B):
>>
>> Thread T Thread U
>> call_rcu1(O)
>> enqueue(O)
>> Load N from tail
>> tail = O->next
>> call_rcu1(P)
>> enqueue(P)
>> Load O->next from tail
>> tail = P
>> O->next = P
>> rcu_call_count++ (A)
>> N->next = O (B)
>> rcu_call_count++
>
> Thanks, yeah it makes sense. If you think worthwhile, maybe we could add a
> comment after the first try_dequeue().
>
try_dequeue() and enqueue() do have comments that say enqueue() is
asynchronous and explain why, but perhaps they can be expanded by adding
the example I showed.
Besides, enqueue() may be renamed to something like enqueue_async() so that
it is clear that enqueuing does not happen immediately when you look at
call_rcu1(). If you wonder why enqueue_async() is asynchronous, you can
look into enqueue_async() and find the explanatory comments.
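For illustration only, here is a rough sketch of how such an expanded comment
might read in util/rcu.c; the wording is mine and refers to the T/U
interleaving shown above, it is not text from the existing code or the patch:

/*
 * Note that enqueue() completes asynchronously with respect to readers of
 * the list: a node becomes reachable from the head only once its
 * predecessor's ->next pointer has been published.  In the T/U interleaving
 * above, right after (A) rcu_call_count already counts U's node P, yet
 * neither O nor P is reachable until T publishes N->next = O at (B).
 * try_dequeue() may therefore return NULL even though rcu_call_count
 * guarantees that enough nodes have been submitted; the caller must wait
 * for the producer to finish publishing and retry.
 */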
Regards,
Akihiko Odaki
^ permalink raw reply [flat|nested] 29+ messages in thread

* Re: [PATCH 4/5] rcu: Wake the RCU thread when draining
2025-11-06 1:40 ` Akihiko Odaki
@ 2025-11-06 21:52 ` Peter Xu
2025-11-07 1:47 ` Akihiko Odaki
[not found] ` <1b318ad8-48b3-4968-86ca-c62aef3b3bd4@rsg.ci.i.u-tokyo.ac.jp>
0 siblings, 2 replies; 29+ messages in thread
From: Peter Xu @ 2025-11-06 21:52 UTC (permalink / raw)
To: Akihiko Odaki
Cc: qemu-devel, Dmitry Osipenko, Paolo Bonzini, Michael S. Tsirkin,
Alex Bennée
On Thu, Nov 06, 2025 at 10:40:52AM +0900, Akihiko Odaki wrote:
> > > > > + /*
> > > > > + * Ensure that the forced variable has not been set after fetching
> > > > > + * rcu_call_count; otherwise we may get confused by a force quiescent
> > > > > + * state request for an element later than n.
> > > > > + */
> > > > > + while (qatomic_xchg(&forced, false)) {
> > > > > + sleep = false;
> > > > > + n = qatomic_read(&rcu_call_count);
> > > > > }
> > > >
> > > > This is pretty tricky, and I wonder if it will make the code easier to read
> > > > if we convert the sync_event to be a semaphore instead. When as a sem, it
> > > > will take account of whatever kick to it, either a call_rcu1() or an
> > > > enforced rcu flush, so that we don't need to reset it. Meanwhile, we don't
> > > > worry on an slightly outdated "n" read because the 2nd round of sem_wait()
> > > > will catch that new "n".
> > > >
> > > > Instead, worst case is rcu thread runs one more round without seeing
> > > > callbacks on the queue.
> > > >
> > > > I'm not sure if that could help simplying code, maybe also make it less
> > > > error-prone.
> > >
> > > Semaphore is not applicable here because it will not de-duplicate concurrent
> > > kicks of RCU threads.
> >
> > Why concurrent kicks of rcu threads is a problem? QemuSemaphore is
> > thread-safe itself, meanwhile IIUC it only still causes call_rcu_thread()
> > loops some more rounds reading "n", which looks all safe. No?
>
> It is safe but incurs overheads and confusing. QemuEvent represents the
> boolean semantics better.
>
> I also have difficulty to understand how converting sync_event to a
> semaphore simplifies the code. Perhaps some (pseudo)code to show how the
> code will look like may be useful.
I prepared a patch on top of your current patchset to show what I meant. I
also added comments and some test results showing why I think the sem
overhead should be small enough to be fine.
In short, I tested a VM with 8 vCPUs and 4G mem, booting Linux and properly
powering it off, and I only saw <1000 call_rcu1() calls in total. That should
be the upper bound of the sem overhead from looping in the rcu thread.
It's in patch format, but please still treat it as a comment for discussion
with you; attaching it is just easier for me.
===8<===
From 71f15ed19050a973088352a8d71b6cc6b7b5f7cf Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Thu, 6 Nov 2025 16:03:00 -0500
Subject: [PATCH] rcu: Make sync_event a semaphore
It could simplify all the reset logic; especially after enforced rcu is
introduced, we'll also need a tweak to re-read "n", which can be avoided too
with a sem.
However, the sem can introduce an overhead with high-frequency rcu frees.
This patch is drafted with the assumption that rcu frees are at least very
rare in QEMU, hence it's not a problem.
When I tested with this command:
qemu-system-x86_64 -M q35,kernel-irqchip=split,suppress-vmdesc=on -smp 8 \
-m 4G -msg timestamp=on -name peter-vm,debug-threads=on -cpu Nehalem \
-accel kvm -qmp unix:/tmp/peter.sock,server,nowait -nographic \
-monitor telnet::6666,server,nowait -netdev user,id=net0,hostfwd=tcp::5555-:22
-device e1000,netdev=net0 -device virtio-balloon $DISK
I booted a pre-installed Linux, logged in and powered off, and waited until
the VM completely shut down. I captured less than 1000 call_rcu1() calls in
total. It means that for the whole lifetime of such a VM, the max overhead of
the call_rcu_thread() loop reading rcu_call_count will be 1000 loops.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
util/rcu.c | 36 ++++++++----------------------------
1 file changed, 8 insertions(+), 28 deletions(-)
diff --git a/util/rcu.c b/util/rcu.c
index 85f9333f5d..dfe031a5c9 100644
--- a/util/rcu.c
+++ b/util/rcu.c
@@ -54,7 +54,7 @@ static int rcu_call_count;
static QemuMutex rcu_registry_lock;
/* Set when the forced variable is set or rcu_call_count becomes non-zero. */
-static QemuEvent sync_event;
+static QemuSemaphore sync_event;
/*
* Check whether a quiescent state was crossed between the beginning of
@@ -80,7 +80,7 @@ static ThreadList registry = QLIST_HEAD_INITIALIZER(registry);
void force_rcu(void)
{
qatomic_set(&forced, true);
- qemu_event_set(&sync_event);
+ qemu_sem_post(&sync_event);
}
/* Wait for previous parity/grace period to be empty of readers. */
@@ -148,7 +148,7 @@ static void wait_for_readers(bool sleep)
*/
qemu_event_reset(&rcu_gp_event);
} else if (qatomic_read(&rcu_call_count) >= RCU_CALL_MIN_SIZE ||
- !sleeps || qemu_event_timedwait(&sync_event, 10)) {
+ !sleeps || qemu_sem_timedwait(&sync_event, 10)) {
/*
* Now one of the following heuristical conditions is satisfied:
* - A decent number of callbacks piled up.
@@ -286,7 +286,6 @@ static void *call_rcu_thread(void *opaque)
rcu_register_thread();
for (;;) {
- bool sleep = true;
int n;
/*
@@ -294,7 +293,6 @@ static void *call_rcu_thread(void *opaque)
* added before enter_qs() starts.
*/
for (;;) {
- qemu_event_reset(&sync_event);
n = qatomic_read(&rcu_call_count);
if (n) {
break;
@@ -303,36 +301,19 @@ static void *call_rcu_thread(void *opaque)
#if defined(CONFIG_MALLOC_TRIM)
malloc_trim(4 * 1024 * 1024);
#endif
- qemu_event_wait(&sync_event);
+ qemu_sem_wait(&sync_event);
}
- /*
- * Ensure that an event for a rcu_call_count change will not interrupt
- * wait_for_readers().
- */
- qemu_event_reset(&sync_event);
-
- /*
- * Ensure that the forced variable has not been set after fetching
- * rcu_call_count; otherwise we may get confused by a force quiescent
- * state request for an element later than n.
- */
- while (qatomic_xchg(&forced, false)) {
- sleep = false;
- n = qatomic_read(&rcu_call_count);
- }
-
- enter_qs(sleep);
+ enter_qs(!qatomic_xchg(&forced, false));
qatomic_sub(&rcu_call_count, n);
bql_lock();
while (n > 0) {
node = try_dequeue();
while (!node) {
bql_unlock();
- qemu_event_reset(&sync_event);
node = try_dequeue();
if (!node) {
- qemu_event_wait(&sync_event);
+ qemu_sem_wait(&sync_event);
node = try_dequeue();
}
bql_lock();
@@ -352,7 +333,7 @@ void call_rcu1(struct rcu_head *node, void (*func)(struct rcu_head *node))
enqueue(node);
if (!qatomic_fetch_inc(&rcu_call_count)) {
- qemu_event_set(&sync_event);
+ qemu_sem_post(&sync_event);
}
}
@@ -456,8 +437,7 @@ static void rcu_init_complete(void)
qemu_mutex_init(&rcu_registry_lock);
qemu_event_init(&rcu_gp_event, true);
-
- qemu_event_init(&sync_event, false);
+ qemu_sem_init(&sync_event, 0);
/* The caller is assumed to have BQL, so the call_rcu thread
* must have been quiescent even after forking, just recreate it.
--
2.50.1
===8<===
While having a closer look, I found some other issues; I'll list them
all here.
1. I found that rcu_gp_event was initialized as "true". I'm not sure
whether it should be false. This has nothing to do with your series
either; it's only an observation of mine.
qemu_event_init(&rcu_gp_event, true);
2. It looks to me like your patch here checks the wrong retval of
qemu_event_timedwait()..
} else if (qatomic_read(&rcu_call_count) >= RCU_CALL_MIN_SIZE ||
!sleeps || qemu_event_timedwait(&sync_event, 10)) {
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
should it be "!qemu_event_timedwait(&sync_event, 10)"? Side note: maybe
we should clean up all the _timedwait() helpers for QEMU's eventfd, sem, cond,
... because they don't return the same retval.. but if you think the sem is
good, then we don't need eventfd's timedwait() in this series (your
initial two patches).
3. I suspect the malloc_trim() feature may be broken with your current
patchset, because now the loop looks like:

for (;;) {
    qemu_event_reset(&sync_event);
    n = qatomic_read(&rcu_call_count);
    if (n) {
        break;
    }
#if defined(CONFIG_MALLOC_TRIM)
    malloc_trim(4 * 1024 * 1024);
#endif
    qemu_event_wait(&sync_event);
}

I don't know if n can be zero here at all.. if you use the sem change
this might trigger, but it's not designed for it. When using a sem we may
want to periodically trim (a rough sketch follows below). But I'd like to
know what you think in general about the sem idea first (e.g. do we need to
be prepared for high-frequency rcu frees?).
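For what it's worth, here is a minimal, untested sketch of the "periodically
trim" idea on top of the sem variant; the 1000 ms interval is arbitrary, and
I'm assuming qemu_sem_timedwait() returns 0 when the semaphore was posted and
non-zero on timeout:

for (;;) {
    n = qatomic_read(&rcu_call_count);
    if (n) {
        break;
    }
    /*
     * Sleep with a timeout instead of indefinitely, and trim only on the
     * timeout path, so an idle rcu thread still releases memory
     * periodically even when nobody posts the semaphore.
     */
    if (qemu_sem_timedwait(&sync_event, 1000)) {
#if defined(CONFIG_MALLOC_TRIM)
        malloc_trim(4 * 1024 * 1024);
#endif
    }
}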
Thanks,
--
Peter Xu
^ permalink raw reply related [flat|nested] 29+ messages in thread

* Re: [PATCH 4/5] rcu: Wake the RCU thread when draining
2025-11-06 21:52 ` Peter Xu
@ 2025-11-07 1:47 ` Akihiko Odaki
2025-11-07 14:00 ` Peter Xu
[not found] ` <1b318ad8-48b3-4968-86ca-c62aef3b3bd4@rsg.ci.i.u-tokyo.ac.jp>
1 sibling, 1 reply; 29+ messages in thread
From: Akihiko Odaki @ 2025-11-07 1:47 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Dmitry Osipenko, Paolo Bonzini, Michael S. Tsirkin,
Alex Bennée
On 2025/11/07 6:52, Peter Xu wrote:
> On Thu, Nov 06, 2025 at 10:40:52AM +0900, Akihiko Odaki wrote:
>>>>>> + /*
>>>>>> + * Ensure that the forced variable has not been set after fetching
>>>>>> + * rcu_call_count; otherwise we may get confused by a force quiescent
>>>>>> + * state request for an element later than n.
>>>>>> + */
>>>>>> + while (qatomic_xchg(&forced, false)) {
>>>>>> + sleep = false;
>>>>>> + n = qatomic_read(&rcu_call_count);
>>>>>> }
>>>>>
>>>>> This is pretty tricky, and I wonder if it will make the code easier to read
>>>>> if we convert the sync_event to be a semaphore instead. When as a sem, it
>>>>> will take account of whatever kick to it, either a call_rcu1() or an
>>>>> enforced rcu flush, so that we don't need to reset it. Meanwhile, we don't
>>>>> worry on an slightly outdated "n" read because the 2nd round of sem_wait()
>>>>> will catch that new "n".
>>>>>
>>>>> Instead, worst case is rcu thread runs one more round without seeing
>>>>> callbacks on the queue.
>>>>>
>>>>> I'm not sure if that could help simplying code, maybe also make it less
>>>>> error-prone.
>>>>
>>>> Semaphore is not applicable here because it will not de-duplicate concurrent
>>>> kicks of RCU threads.
>>>
>>> Why concurrent kicks of rcu threads is a problem? QemuSemaphore is
>>> thread-safe itself, meanwhile IIUC it only still causes call_rcu_thread()
>>> loops some more rounds reading "n", which looks all safe. No?
>>
>> It is safe but incurs overheads and confusing. QemuEvent represents the
>> boolean semantics better.
>>
>> I also have difficulty to understand how converting sync_event to a
>> semaphore simplifies the code. Perhaps some (pseudo)code to show how the
>> code will look like may be useful.
>
> I prepared a patch on top of your current patchset to show what I meant. I
> also added comments and some test results showing why I think it might be
> fine that the sem overhead should be small.
>
> In short, I tested a VM with 8 vCPUs and 4G mem, booting Linux and properly
> poweroff, I only saw <1000 rcu_call1 users in total. That should be the
> max-bound of sem overhead on looping in rcu thread.
>
> It's in patch format but still treat it as a comment instead to discuss
> with you. Attaching it is just easier for me.
>
> ===8<===
> From 71f15ed19050a973088352a8d71b6cc6b7b5f7cf Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Thu, 6 Nov 2025 16:03:00 -0500
> Subject: [PATCH] rcu: Make sync_event a semaphore
>
> It could simply all reset logic, especially after enforced rcu is
> introduced we'll also need a tweak to re-read "n", which can be avoided too
> when with a sem.
>
> However, the sem can introduce an overhead in high frequecy rcu frees.
> This patch is drafted with the assumption that rcu free is at least very
> rare in QEMU, hence it's not a problem.
>
> When I tested with this command:
>
> qemu-system-x86_64 -M q35,kernel-irqchip=split,suppress-vmdesc=on -smp 8 \
> -m 4G -msg timestamp=on -name peter-vm,debug-threads=on -cpu Nehalem \
> -accel kvm -qmp unix:/tmp/peter.sock,server,nowait -nographic \
> -monitor telnet::6666,server,nowait -netdev user,id=net0,hostfwd=tcp::5555-:22
> -device e1000,netdev=net0 -device virtio-balloon $DISK
>
> I booted a pre-installed Linux, login and poweroff, wait until VM
> completely shutdowns. I captured less than 1000 rcu_free1() calls in
> summary. It means for the whole lifetime of such VM the max overhead of
> the call_rcu_thread() loop reading rcu_call_count will be 1000 loops.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> util/rcu.c | 36 ++++++++----------------------------
> 1 file changed, 8 insertions(+), 28 deletions(-)
>
> diff --git a/util/rcu.c b/util/rcu.c
> index 85f9333f5d..dfe031a5c9 100644
> --- a/util/rcu.c
> +++ b/util/rcu.c
> @@ -54,7 +54,7 @@ static int rcu_call_count;
> static QemuMutex rcu_registry_lock;
>
> /* Set when the forced variable is set or rcu_call_count becomes non-zero. */
> -static QemuEvent sync_event;
> +static QemuSemaphore sync_event;
>
> /*
> * Check whether a quiescent state was crossed between the beginning of
> @@ -80,7 +80,7 @@ static ThreadList registry = QLIST_HEAD_INITIALIZER(registry);
> void force_rcu(void)
> {
> qatomic_set(&forced, true);
> - qemu_event_set(&sync_event);
> + qemu_sem_post(&sync_event);
> }
>
> /* Wait for previous parity/grace period to be empty of readers. */
> @@ -148,7 +148,7 @@ static void wait_for_readers(bool sleep)
> */
> qemu_event_reset(&rcu_gp_event);
> } else if (qatomic_read(&rcu_call_count) >= RCU_CALL_MIN_SIZE ||
> - !sleeps || qemu_event_timedwait(&sync_event, 10)) {
> + !sleeps || qemu_sem_timedwait(&sync_event, 10)) {
> /*
> * Now one of the following heuristical conditions is satisfied:
> * - A decent number of callbacks piled up.
> @@ -286,7 +286,6 @@ static void *call_rcu_thread(void *opaque)
> rcu_register_thread();
>
> for (;;) {
> - bool sleep = true;
> int n;
>
> /*
> @@ -294,7 +293,6 @@ static void *call_rcu_thread(void *opaque)
> * added before enter_qs() starts.
> */
> for (;;) {
> - qemu_event_reset(&sync_event);
> n = qatomic_read(&rcu_call_count);
> if (n) {
> break;
> @@ -303,36 +301,19 @@ static void *call_rcu_thread(void *opaque)
> #if defined(CONFIG_MALLOC_TRIM)
> malloc_trim(4 * 1024 * 1024);
> #endif
> - qemu_event_wait(&sync_event);
> + qemu_sem_wait(&sync_event);
> }
>
> - /*
> - * Ensure that an event for a rcu_call_count change will not interrupt
> - * wait_for_readers().
> - */
> - qemu_event_reset(&sync_event);
> -
> - /*
> - * Ensure that the forced variable has not been set after fetching
> - * rcu_call_count; otherwise we may get confused by a force quiescent
> - * state request for an element later than n.
> - */
> - while (qatomic_xchg(&forced, false)) {
> - sleep = false;
> - n = qatomic_read(&rcu_call_count);
> - }
> -
> - enter_qs(sleep);
> + enter_qs(!qatomic_xchg(&forced, false));
This is not OK; the forced variable may be set after rcu_call_count is
fetched. In that case, we should avoid unsetting the force quiescent
state request for the elements later than "n" or refetch "n".
> qatomic_sub(&rcu_call_count, n);
> bql_lock();
> while (n > 0) {
> node = try_dequeue();
> while (!node) {
> bql_unlock();
> - qemu_event_reset(&sync_event);
> node = try_dequeue();
> if (!node) {
> - qemu_event_wait(&sync_event);
> + qemu_sem_wait(&sync_event);
> node = try_dequeue();
> }
> bql_lock();
> @@ -352,7 +333,7 @@ void call_rcu1(struct rcu_head *node, void (*func)(struct rcu_head *node))
> enqueue(node);
>
> if (!qatomic_fetch_inc(&rcu_call_count)) {
> - qemu_event_set(&sync_event);
> + qemu_sem_post(&sync_event);
> }
> }
>
> @@ -456,8 +437,7 @@ static void rcu_init_complete(void)
>
> qemu_mutex_init(&rcu_registry_lock);
> qemu_event_init(&rcu_gp_event, true);
> -
> - qemu_event_init(&sync_event, false);
> + qemu_sem_init(&sync_event, 0);
>
> /* The caller is assumed to have BQL, so the call_rcu thread
> * must have been quiescent even after forking, just recreate it.
^ permalink raw reply [flat|nested] 29+ messages in thread

* Re: [PATCH 4/5] rcu: Wake the RCU thread when draining
2025-11-07 1:47 ` Akihiko Odaki
@ 2025-11-07 14:00 ` Peter Xu
2025-11-08 1:47 ` Akihiko Odaki
0 siblings, 1 reply; 29+ messages in thread
From: Peter Xu @ 2025-11-07 14:00 UTC (permalink / raw)
To: Akihiko Odaki
Cc: qemu-devel, Dmitry Osipenko, Paolo Bonzini, Michael S. Tsirkin,
Alex Bennée
On Fri, Nov 07, 2025 at 10:47:35AM +0900, Akihiko Odaki wrote:
> On 2025/11/07 6:52, Peter Xu wrote:
> > On Thu, Nov 06, 2025 at 10:40:52AM +0900, Akihiko Odaki wrote:
> > > > > > > + /*
> > > > > > > + * Ensure that the forced variable has not been set after fetching
> > > > > > > + * rcu_call_count; otherwise we may get confused by a force quiescent
> > > > > > > + * state request for an element later than n.
> > > > > > > + */
> > > > > > > + while (qatomic_xchg(&forced, false)) {
> > > > > > > + sleep = false;
> > > > > > > + n = qatomic_read(&rcu_call_count);
> > > > > > > }
> > > > > >
> > > > > > This is pretty tricky, and I wonder if it will make the code easier to read
> > > > > > if we convert the sync_event to be a semaphore instead. When as a sem, it
> > > > > > will take account of whatever kick to it, either a call_rcu1() or an
> > > > > > enforced rcu flush, so that we don't need to reset it. Meanwhile, we don't
> > > > > > worry on an slightly outdated "n" read because the 2nd round of sem_wait()
> > > > > > will catch that new "n".
> > > > > >
> > > > > > Instead, worst case is rcu thread runs one more round without seeing
> > > > > > callbacks on the queue.
> > > > > >
> > > > > > I'm not sure if that could help simplying code, maybe also make it less
> > > > > > error-prone.
> > > > >
> > > > > Semaphore is not applicable here because it will not de-duplicate concurrent
> > > > > kicks of RCU threads.
> > > >
> > > > Why concurrent kicks of rcu threads is a problem? QemuSemaphore is
> > > > thread-safe itself, meanwhile IIUC it only still causes call_rcu_thread()
> > > > loops some more rounds reading "n", which looks all safe. No?
> > >
> > > It is safe but incurs overheads and confusing. QemuEvent represents the
> > > boolean semantics better.
> > >
> > > I also have difficulty to understand how converting sync_event to a
> > > semaphore simplifies the code. Perhaps some (pseudo)code to show how the
> > > code will look like may be useful.
> >
> > I prepared a patch on top of your current patchset to show what I meant. I
> > also added comments and some test results showing why I think it might be
> > fine that the sem overhead should be small.
> >
> > In short, I tested a VM with 8 vCPUs and 4G mem, booting Linux and properly
> > poweroff, I only saw <1000 rcu_call1 users in total. That should be the
> > max-bound of sem overhead on looping in rcu thread.
> >
> > It's in patch format but still treat it as a comment instead to discuss
> > with you. Attaching it is just easier for me.
> >
> > ===8<===
> > From 71f15ed19050a973088352a8d71b6cc6b7b5f7cf Mon Sep 17 00:00:00 2001
> > From: Peter Xu <peterx@redhat.com>
> > Date: Thu, 6 Nov 2025 16:03:00 -0500
> > Subject: [PATCH] rcu: Make sync_event a semaphore
> >
> > It could simply all reset logic, especially after enforced rcu is
> > introduced we'll also need a tweak to re-read "n", which can be avoided too
> > when with a sem.
> >
> > However, the sem can introduce an overhead in high frequecy rcu frees.
> > This patch is drafted with the assumption that rcu free is at least very
> > rare in QEMU, hence it's not a problem.
> >
> > When I tested with this command:
> >
> > qemu-system-x86_64 -M q35,kernel-irqchip=split,suppress-vmdesc=on -smp 8 \
> > -m 4G -msg timestamp=on -name peter-vm,debug-threads=on -cpu Nehalem \
> > -accel kvm -qmp unix:/tmp/peter.sock,server,nowait -nographic \
> > -monitor telnet::6666,server,nowait -netdev user,id=net0,hostfwd=tcp::5555-:22
> > -device e1000,netdev=net0 -device virtio-balloon $DISK
> >
> > I booted a pre-installed Linux, login and poweroff, wait until VM
> > completely shutdowns. I captured less than 1000 rcu_free1() calls in
> > summary. It means for the whole lifetime of such VM the max overhead of
> > the call_rcu_thread() loop reading rcu_call_count will be 1000 loops.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > util/rcu.c | 36 ++++++++----------------------------
> > 1 file changed, 8 insertions(+), 28 deletions(-)
> >
> > diff --git a/util/rcu.c b/util/rcu.c
> > index 85f9333f5d..dfe031a5c9 100644
> > --- a/util/rcu.c
> > +++ b/util/rcu.c
> > @@ -54,7 +54,7 @@ static int rcu_call_count;
> > static QemuMutex rcu_registry_lock;
> > /* Set when the forced variable is set or rcu_call_count becomes non-zero. */
> > -static QemuEvent sync_event;
> > +static QemuSemaphore sync_event;
> > /*
> > * Check whether a quiescent state was crossed between the beginning of
> > @@ -80,7 +80,7 @@ static ThreadList registry = QLIST_HEAD_INITIALIZER(registry);
> > void force_rcu(void)
> > {
> > qatomic_set(&forced, true);
> > - qemu_event_set(&sync_event);
> > + qemu_sem_post(&sync_event);
> > }
> > /* Wait for previous parity/grace period to be empty of readers. */
> > @@ -148,7 +148,7 @@ static void wait_for_readers(bool sleep)
> > */
> > qemu_event_reset(&rcu_gp_event);
> > } else if (qatomic_read(&rcu_call_count) >= RCU_CALL_MIN_SIZE ||
> > - !sleeps || qemu_event_timedwait(&sync_event, 10)) {
> > + !sleeps || qemu_sem_timedwait(&sync_event, 10)) {
> > /*
> > * Now one of the following heuristical conditions is satisfied:
> > * - A decent number of callbacks piled up.
> > @@ -286,7 +286,6 @@ static void *call_rcu_thread(void *opaque)
> > rcu_register_thread();
> > for (;;) {
> > - bool sleep = true;
> > int n;
> > /*
> > @@ -294,7 +293,6 @@ static void *call_rcu_thread(void *opaque)
> > * added before enter_qs() starts.
> > */
> > for (;;) {
> > - qemu_event_reset(&sync_event);
> > n = qatomic_read(&rcu_call_count);
> > if (n) {
> > break;
> > @@ -303,36 +301,19 @@ static void *call_rcu_thread(void *opaque)
> > #if defined(CONFIG_MALLOC_TRIM)
> > malloc_trim(4 * 1024 * 1024);
> > #endif
> > - qemu_event_wait(&sync_event);
> > + qemu_sem_wait(&sync_event);
> > }
> > - /*
> > - * Ensure that an event for a rcu_call_count change will not interrupt
> > - * wait_for_readers().
> > - */
> > - qemu_event_reset(&sync_event);
> > -
> > - /*
> > - * Ensure that the forced variable has not been set after fetching
> > - * rcu_call_count; otherwise we may get confused by a force quiescent
> > - * state request for an element later than n.
> > - */
> > - while (qatomic_xchg(&forced, false)) {
> > - sleep = false;
> > - n = qatomic_read(&rcu_call_count);
> > - }
> > -
> > - enter_qs(sleep);
> > + enter_qs(!qatomic_xchg(&forced, false));
>
> This is not OK; the forced variable may be set after rcu_call_count is
> fetched. In that case, we should avoid unsetting the force quiescent state
> request for the elements later than "n" or refetch "n".
Indeed I missed that part, but it should be trivial to fix, on top of my
previous patch:
===8<===
diff --git a/util/rcu.c b/util/rcu.c
index dfe031a5c9..aff98d9ee2 100644
--- a/util/rcu.c
+++ b/util/rcu.c
@@ -286,6 +286,7 @@ static void *call_rcu_thread(void *opaque)
rcu_register_thread();
for (;;) {
+ bool sleep;
int n;
/*
@@ -293,6 +294,7 @@ static void *call_rcu_thread(void *opaque)
* added before enter_qs() starts.
*/
for (;;) {
+ sleep = !qatomic_xchg(&forced, false);
n = qatomic_read(&rcu_call_count);
if (n) {
break;
@@ -304,7 +306,7 @@ static void *call_rcu_thread(void *opaque)
qemu_sem_wait(&sync_event);
}
- enter_qs(!qatomic_xchg(&forced, false));
+ enter_qs(sleep);
qatomic_sub(&rcu_call_count, n);
bql_lock();
while (n > 0) {
===8<===
The idea is still the same: using a semaphore can avoid explicit resets and a
lot of other ordering constraints, like when to read call_count, etc.
E.g. even before this series, we still need to reset at exactly the right
time to make sure we capture a set() correctly. With a sem, all these
issues are gone simply because we won't miss a post() when it's a counter,
not a boolean.
Also, would you please have a look at the other comments I left in the
same email (after the patch I attached)?
https://lore.kernel.org/qemu-devel/aQ0Ys09WtlSPoapm@x1.local/
You can search for "When I was having a closer look, I found some other issues".
Thanks,
--
Peter Xu
^ permalink raw reply related [flat|nested] 29+ messages in thread

* Re: [PATCH 4/5] rcu: Wake the RCU thread when draining
2025-11-07 14:00 ` Peter Xu
@ 2025-11-08 1:47 ` Akihiko Odaki
2025-11-13 17:03 ` Peter Xu
0 siblings, 1 reply; 29+ messages in thread
From: Akihiko Odaki @ 2025-11-08 1:47 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Dmitry Osipenko, Paolo Bonzini, Michael S. Tsirkin,
Alex Bennée
On 2025/11/07 23:00, Peter Xu wrote:
> On Fri, Nov 07, 2025 at 10:47:35AM +0900, Akihiko Odaki wrote:
>> On 2025/11/07 6:52, Peter Xu wrote:
>>> On Thu, Nov 06, 2025 at 10:40:52AM +0900, Akihiko Odaki wrote:
>>>>>>>> + /*
>>>>>>>> + * Ensure that the forced variable has not been set after fetching
>>>>>>>> + * rcu_call_count; otherwise we may get confused by a force quiescent
>>>>>>>> + * state request for an element later than n.
>>>>>>>> + */
>>>>>>>> + while (qatomic_xchg(&forced, false)) {
>>>>>>>> + sleep = false;
>>>>>>>> + n = qatomic_read(&rcu_call_count);
>>>>>>>> }
>>>>>>>
>>>>>>> This is pretty tricky, and I wonder if it will make the code easier to read
>>>>>>> if we convert the sync_event to be a semaphore instead. When as a sem, it
>>>>>>> will take account of whatever kick to it, either a call_rcu1() or an
>>>>>>> enforced rcu flush, so that we don't need to reset it. Meanwhile, we don't
>>>>>>> worry on an slightly outdated "n" read because the 2nd round of sem_wait()
>>>>>>> will catch that new "n".
>>>>>>>
>>>>>>> Instead, worst case is rcu thread runs one more round without seeing
>>>>>>> callbacks on the queue.
>>>>>>>
>>>>>>> I'm not sure if that could help simplying code, maybe also make it less
>>>>>>> error-prone.
>>>>>>
>>>>>> Semaphore is not applicable here because it will not de-duplicate concurrent
>>>>>> kicks of RCU threads.
>>>>>
>>>>> Why concurrent kicks of rcu threads is a problem? QemuSemaphore is
>>>>> thread-safe itself, meanwhile IIUC it only still causes call_rcu_thread()
>>>>> loops some more rounds reading "n", which looks all safe. No?
>>>>
>>>> It is safe but incurs overheads and confusing. QemuEvent represents the
>>>> boolean semantics better.
>>>>
>>>> I also have difficulty to understand how converting sync_event to a
>>>> semaphore simplifies the code. Perhaps some (pseudo)code to show how the
>>>> code will look like may be useful.
>>>
>>> I prepared a patch on top of your current patchset to show what I meant. I
>>> also added comments and some test results showing why I think it might be
>>> fine that the sem overhead should be small.
>>>
>>> In short, I tested a VM with 8 vCPUs and 4G mem, booting Linux and properly
>>> poweroff, I only saw <1000 rcu_call1 users in total. That should be the
>>> max-bound of sem overhead on looping in rcu thread.
>>>
>>> It's in patch format but still treat it as a comment instead to discuss
>>> with you. Attaching it is just easier for me.
>>>
>>> ===8<===
>>> From 71f15ed19050a973088352a8d71b6cc6b7b5f7cf Mon Sep 17 00:00:00 2001
>>> From: Peter Xu <peterx@redhat.com>
>>> Date: Thu, 6 Nov 2025 16:03:00 -0500
>>> Subject: [PATCH] rcu: Make sync_event a semaphore
>>>
>>> It could simply all reset logic, especially after enforced rcu is
>>> introduced we'll also need a tweak to re-read "n", which can be avoided too
>>> when with a sem.
>>>
>>> However, the sem can introduce an overhead in high frequecy rcu frees.
>>> This patch is drafted with the assumption that rcu free is at least very
>>> rare in QEMU, hence it's not a problem.
>>>
>>> When I tested with this command:
>>>
>>> qemu-system-x86_64 -M q35,kernel-irqchip=split,suppress-vmdesc=on -smp 8 \
>>> -m 4G -msg timestamp=on -name peter-vm,debug-threads=on -cpu Nehalem \
>>> -accel kvm -qmp unix:/tmp/peter.sock,server,nowait -nographic \
>>> -monitor telnet::6666,server,nowait -netdev user,id=net0,hostfwd=tcp::5555-:22
>>> -device e1000,netdev=net0 -device virtio-balloon $DISK
>>>
>>> I booted a pre-installed Linux, login and poweroff, wait until VM
>>> completely shutdowns. I captured less than 1000 rcu_free1() calls in
>>> summary. It means for the whole lifetime of such VM the max overhead of
>>> the call_rcu_thread() loop reading rcu_call_count will be 1000 loops.
>>>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>> util/rcu.c | 36 ++++++++----------------------------
>>> 1 file changed, 8 insertions(+), 28 deletions(-)
>>>
>>> diff --git a/util/rcu.c b/util/rcu.c
>>> index 85f9333f5d..dfe031a5c9 100644
>>> --- a/util/rcu.c
>>> +++ b/util/rcu.c
>>> @@ -54,7 +54,7 @@ static int rcu_call_count;
>>> static QemuMutex rcu_registry_lock;
>>> /* Set when the forced variable is set or rcu_call_count becomes non-zero. */
>>> -static QemuEvent sync_event;
>>> +static QemuSemaphore sync_event;
>>> /*
>>> * Check whether a quiescent state was crossed between the beginning of
>>> @@ -80,7 +80,7 @@ static ThreadList registry = QLIST_HEAD_INITIALIZER(registry);
>>> void force_rcu(void)
>>> {
>>> qatomic_set(&forced, true);
>>> - qemu_event_set(&sync_event);
>>> + qemu_sem_post(&sync_event);
>>> }
>>> /* Wait for previous parity/grace period to be empty of readers. */
>>> @@ -148,7 +148,7 @@ static void wait_for_readers(bool sleep)
>>> */
>>> qemu_event_reset(&rcu_gp_event);
>>> } else if (qatomic_read(&rcu_call_count) >= RCU_CALL_MIN_SIZE ||
>>> - !sleeps || qemu_event_timedwait(&sync_event, 10)) {
>>> + !sleeps || qemu_sem_timedwait(&sync_event, 10)) {
>>> /*
>>> * Now one of the following heuristical conditions is satisfied:
>>> * - A decent number of callbacks piled up.
>>> @@ -286,7 +286,6 @@ static void *call_rcu_thread(void *opaque)
>>> rcu_register_thread();
>>> for (;;) {
>>> - bool sleep = true;
>>> int n;
>>> /*
>>> @@ -294,7 +293,6 @@ static void *call_rcu_thread(void *opaque)
>>> * added before enter_qs() starts.
>>> */
>>> for (;;) {
>>> - qemu_event_reset(&sync_event);
>>> n = qatomic_read(&rcu_call_count);
>>> if (n) {
>>> break;
>>> @@ -303,36 +301,19 @@ static void *call_rcu_thread(void *opaque)
>>> #if defined(CONFIG_MALLOC_TRIM)
>>> malloc_trim(4 * 1024 * 1024);
>>> #endif
>>> - qemu_event_wait(&sync_event);
>>> + qemu_sem_wait(&sync_event);
>>> }
>>> - /*
>>> - * Ensure that an event for a rcu_call_count change will not interrupt
>>> - * wait_for_readers().
>>> - */
>>> - qemu_event_reset(&sync_event);
>>> -
>>> - /*
>>> - * Ensure that the forced variable has not been set after fetching
>>> - * rcu_call_count; otherwise we may get confused by a force quiescent
>>> - * state request for an element later than n.
>>> - */
>>> - while (qatomic_xchg(&forced, false)) {
>>> - sleep = false;
>>> - n = qatomic_read(&rcu_call_count);
>>> - }
>>> -
>>> - enter_qs(sleep);
>>> + enter_qs(!qatomic_xchg(&forced, false));
>>
>> This is not OK; the forced variable may be set after rcu_call_count is
>> fetched. In that case, we should avoid unsetting the force quiescent state
>> request for the elements later than "n" or refetch "n".
>
> Indeed I missed that part, but it should be trivial to fix, on top of my
> previous patch:
>
> ===8<===
> diff --git a/util/rcu.c b/util/rcu.c
> index dfe031a5c9..aff98d9ee2 100644
> --- a/util/rcu.c
> +++ b/util/rcu.c
> @@ -286,6 +286,7 @@ static void *call_rcu_thread(void *opaque)
> rcu_register_thread();
>
> for (;;) {
> + bool sleep;
> int n;
>
> /*
> @@ -293,6 +294,7 @@ static void *call_rcu_thread(void *opaque)
> * added before enter_qs() starts.
> */
> for (;;) {
> + sleep = !qatomic_xchg(&forced, false);
This doesn't work either; the following sequence may happen (assume
forced is false at the beginning):
qatomic_xchg(&forced, false)  |
                              | call_rcu1()
                              |   qatomic_fetch_inc(&rcu_call_count)
                              | force_rcu()
                              |   qatomic_set(&forced, true)
qatomic_read(&rcu_call_count)
We need to enter the force quiescent state for the node added with the
call_rcu1() call in this sequence, but this code doesn't.
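For reference, the ordering in my original patch is what closes this window;
condensed (same variable names as in the patch):

n = qatomic_read(&rcu_call_count);
/*
 * Re-check "forced" after fetching rcu_call_count: a force_rcu() that raced
 * with the fetch either disables sleeping for this pass or makes us re-read
 * a newer "n", so its request is never silently dropped.
 */
while (qatomic_xchg(&forced, false)) {
    sleep = false;
    n = qatomic_read(&rcu_call_count);
}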
> n = qatomic_read(&rcu_call_count);
> if (n) {
> break;
> @@ -304,7 +306,7 @@ static void *call_rcu_thread(void *opaque)
> qemu_sem_wait(&sync_event);
> }
>
> - enter_qs(!qatomic_xchg(&forced, false));
> + enter_qs(sleep);
> qatomic_sub(&rcu_call_count, n);
> bql_lock();
> while (n > 0) {
> ===8<===
>
> The idea is still the same, using semaphore can avoid explicit resets and a
> lot of other ordering constraints like reading call_count, etc.
>
> E.g. even before this series, we still need to properly reset at explicit
> time to make sure we can capture a set() correct. When with sem, all these
> issues are gone simply because we won't miss post() when it's a counter not
> boolean.
>
> Also, would you please also have a look at other comments I left in the
> same email (after the patch I attached)?
>
> https://lore.kernel.org/qemu-devel/aQ0Ys09WtlSPoapm@x1.local/
>
> Can search "When I was having a closer look, I found some other issues".
I have just replied to the email. My mailer ignored the part after "--".
Regards,
Akihiko Odaki
^ permalink raw reply [flat|nested] 29+ messages in thread

* Re: [PATCH 4/5] rcu: Wake the RCU thread when draining
2025-11-08 1:47 ` Akihiko Odaki
@ 2025-11-13 17:03 ` Peter Xu
2025-11-14 1:24 ` Akihiko Odaki
0 siblings, 1 reply; 29+ messages in thread
From: Peter Xu @ 2025-11-13 17:03 UTC (permalink / raw)
To: Akihiko Odaki
Cc: qemu-devel, Dmitry Osipenko, Paolo Bonzini, Michael S. Tsirkin,
Alex Bennée
On Sat, Nov 08, 2025 at 10:47:16AM +0900, Akihiko Odaki wrote:
> On 2025/11/07 23:00, Peter Xu wrote:
> > On Fri, Nov 07, 2025 at 10:47:35AM +0900, Akihiko Odaki wrote:
> > > On 2025/11/07 6:52, Peter Xu wrote:
> > > > On Thu, Nov 06, 2025 at 10:40:52AM +0900, Akihiko Odaki wrote:
> > > > > > > > > + /*
> > > > > > > > > + * Ensure that the forced variable has not been set after fetching
> > > > > > > > > + * rcu_call_count; otherwise we may get confused by a force quiescent
> > > > > > > > > + * state request for an element later than n.
> > > > > > > > > + */
> > > > > > > > > + while (qatomic_xchg(&forced, false)) {
> > > > > > > > > + sleep = false;
> > > > > > > > > + n = qatomic_read(&rcu_call_count);
> > > > > > > > > }
> > > > > > > >
> > > > > > > > This is pretty tricky, and I wonder if it will make the code easier to read
> > > > > > > > if we convert the sync_event to be a semaphore instead. When as a sem, it
> > > > > > > > will take account of whatever kick to it, either a call_rcu1() or an
> > > > > > > > enforced rcu flush, so that we don't need to reset it. Meanwhile, we don't
> > > > > > > > worry on an slightly outdated "n" read because the 2nd round of sem_wait()
> > > > > > > > will catch that new "n".
> > > > > > > >
> > > > > > > > Instead, worst case is rcu thread runs one more round without seeing
> > > > > > > > callbacks on the queue.
> > > > > > > >
> > > > > > > > I'm not sure if that could help simplying code, maybe also make it less
> > > > > > > > error-prone.
> > > > > > >
> > > > > > > Semaphore is not applicable here because it will not de-duplicate concurrent
> > > > > > > kicks of RCU threads.
> > > > > >
> > > > > > Why concurrent kicks of rcu threads is a problem? QemuSemaphore is
> > > > > > thread-safe itself, meanwhile IIUC it only still causes call_rcu_thread()
> > > > > > loops some more rounds reading "n", which looks all safe. No?
> > > > >
> > > > > It is safe but incurs overheads and confusing. QemuEvent represents the
> > > > > boolean semantics better.
> > > > >
> > > > > I also have difficulty to understand how converting sync_event to a
> > > > > semaphore simplifies the code. Perhaps some (pseudo)code to show how the
> > > > > code will look like may be useful.
> > > >
> > > > I prepared a patch on top of your current patchset to show what I meant. I
> > > > also added comments and some test results showing why I think it might be
> > > > fine that the sem overhead should be small.
> > > >
> > > > In short, I tested a VM with 8 vCPUs and 4G mem, booting Linux and properly
> > > > poweroff, I only saw <1000 rcu_call1 users in total. That should be the
> > > > max-bound of sem overhead on looping in rcu thread.
> > > >
> > > > It's in patch format but still treat it as a comment instead to discuss
> > > > with you. Attaching it is just easier for me.
> > > >
> > > > ===8<===
> > > > From 71f15ed19050a973088352a8d71b6cc6b7b5f7cf Mon Sep 17 00:00:00 2001
> > > > From: Peter Xu <peterx@redhat.com>
> > > > Date: Thu, 6 Nov 2025 16:03:00 -0500
> > > > Subject: [PATCH] rcu: Make sync_event a semaphore
> > > >
> > > > It could simply all reset logic, especially after enforced rcu is
> > > > introduced we'll also need a tweak to re-read "n", which can be avoided too
> > > > when with a sem.
> > > >
> > > > However, the sem can introduce an overhead in high frequecy rcu frees.
> > > > This patch is drafted with the assumption that rcu free is at least very
> > > > rare in QEMU, hence it's not a problem.
> > > >
> > > > When I tested with this command:
> > > >
> > > > qemu-system-x86_64 -M q35,kernel-irqchip=split,suppress-vmdesc=on -smp 8 \
> > > > -m 4G -msg timestamp=on -name peter-vm,debug-threads=on -cpu Nehalem \
> > > > -accel kvm -qmp unix:/tmp/peter.sock,server,nowait -nographic \
> > > > -monitor telnet::6666,server,nowait -netdev user,id=net0,hostfwd=tcp::5555-:22
> > > > -device e1000,netdev=net0 -device virtio-balloon $DISK
> > > >
> > > > I booted a pre-installed Linux, login and poweroff, wait until VM
> > > > completely shutdowns. I captured less than 1000 rcu_free1() calls in
> > > > summary. It means for the whole lifetime of such VM the max overhead of
> > > > the call_rcu_thread() loop reading rcu_call_count will be 1000 loops.
> > > >
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > ---
> > > > util/rcu.c | 36 ++++++++----------------------------
> > > > 1 file changed, 8 insertions(+), 28 deletions(-)
> > > >
> > > > diff --git a/util/rcu.c b/util/rcu.c
> > > > index 85f9333f5d..dfe031a5c9 100644
> > > > --- a/util/rcu.c
> > > > +++ b/util/rcu.c
> > > > @@ -54,7 +54,7 @@ static int rcu_call_count;
> > > > static QemuMutex rcu_registry_lock;
> > > > /* Set when the forced variable is set or rcu_call_count becomes non-zero. */
> > > > -static QemuEvent sync_event;
> > > > +static QemuSemaphore sync_event;
> > > > /*
> > > > * Check whether a quiescent state was crossed between the beginning of
> > > > @@ -80,7 +80,7 @@ static ThreadList registry = QLIST_HEAD_INITIALIZER(registry);
> > > > void force_rcu(void)
> > > > {
> > > > qatomic_set(&forced, true);
> > > > - qemu_event_set(&sync_event);
> > > > + qemu_sem_post(&sync_event);
> > > > }
> > > > /* Wait for previous parity/grace period to be empty of readers. */
> > > > @@ -148,7 +148,7 @@ static void wait_for_readers(bool sleep)
> > > > */
> > > > qemu_event_reset(&rcu_gp_event);
> > > > } else if (qatomic_read(&rcu_call_count) >= RCU_CALL_MIN_SIZE ||
> > > > - !sleeps || qemu_event_timedwait(&sync_event, 10)) {
> > > > + !sleeps || qemu_sem_timedwait(&sync_event, 10)) {
> > > > /*
> > > > * Now one of the following heuristical conditions is satisfied:
> > > > * - A decent number of callbacks piled up.
> > > > @@ -286,7 +286,6 @@ static void *call_rcu_thread(void *opaque)
> > > > rcu_register_thread();
> > > > for (;;) {
> > > > - bool sleep = true;
> > > > int n;
> > > > /*
> > > > @@ -294,7 +293,6 @@ static void *call_rcu_thread(void *opaque)
> > > > * added before enter_qs() starts.
> > > > */
> > > > for (;;) {
> > > > - qemu_event_reset(&sync_event);
> > > > n = qatomic_read(&rcu_call_count);
> > > > if (n) {
> > > > break;
> > > > @@ -303,36 +301,19 @@ static void *call_rcu_thread(void *opaque)
> > > > #if defined(CONFIG_MALLOC_TRIM)
> > > > malloc_trim(4 * 1024 * 1024);
> > > > #endif
> > > > - qemu_event_wait(&sync_event);
> > > > + qemu_sem_wait(&sync_event);
> > > > }
> > > > - /*
> > > > - * Ensure that an event for a rcu_call_count change will not interrupt
> > > > - * wait_for_readers().
> > > > - */
> > > > - qemu_event_reset(&sync_event);
> > > > -
> > > > - /*
> > > > - * Ensure that the forced variable has not been set after fetching
> > > > - * rcu_call_count; otherwise we may get confused by a force quiescent
> > > > - * state request for an element later than n.
> > > > - */
> > > > - while (qatomic_xchg(&forced, false)) {
> > > > - sleep = false;
> > > > - n = qatomic_read(&rcu_call_count);
> > > > - }
> > > > -
> > > > - enter_qs(sleep);
> > > > + enter_qs(!qatomic_xchg(&forced, false));
> > >
> > > This is not OK; the forced variable may be set after rcu_call_count is
> > > fetched. In that case, we should avoid unsetting the force quiescent state
> > > request for the elements later than "n" or refetch "n".
> >
> > Indeed I missed that part, but it should be trivial to fix, on top of my
> > previous patch:
> >
> > ===8<===
> > diff --git a/util/rcu.c b/util/rcu.c
> > index dfe031a5c9..aff98d9ee2 100644
> > --- a/util/rcu.c
> > +++ b/util/rcu.c
> > @@ -286,6 +286,7 @@ static void *call_rcu_thread(void *opaque)
> > rcu_register_thread();
> > for (;;) {
> > + bool sleep;
> > int n;
> > /*
> > @@ -293,6 +294,7 @@ static void *call_rcu_thread(void *opaque)
> > * added before enter_qs() starts.
> > */
> > for (;;) {
> > + sleep = !qatomic_xchg(&forced, false);
>
> This doesn't work either; the following sequence may happen (assume forced
> is false at beginning):
>
> qatomic_xchg(&forced, false) |
> | call_rcu1()
> | qatomic_fetch_inc(&rcu_call_count)
> | force_rcu()
> | qatomic_set(&forced, true) <---------- [1]
> qatomic_read(&rcu_call_count)
>
> We need to enter the force quiescent state for the node added with the
> call_rcu1() call in this sequence, but this code doesn't.
We don't necessarily need to identify "this sequence" or "the next sequence";
what we want to make sure is that when a forced rcu is triggered, the rcu
thread doesn't sleep until it has flushed the callback injected right before
force_rcu(), right?
IOW, at [1] above, after setting forced we will still post to the sem, so IIUC
what will happen is that both the normal call_rcu1() and force_rcu() will
post to the sem, hence the value of the sem can be 2. If I add the whole
process into the above picture, there must be a pre-existing call_rcu1() that
kicks the rcu thread first at [0] (otherwise the rcu thread would be asleep);
then, with the help of the 2nd post at [2], it should be guaranteed that the
rcu callback of force_rcu() is finally invoked without sleeping:
rcu thread other thread
---------- ------------
call_rcu1() |
qemu_sem_post(&sync_event)| <---------- [0]
|
... rcu thread waked up ...
|
qatomic_xchg(&forced, false) |
| call_rcu1() <---------- [a]
| qatomic_fetch_inc(&rcu_call_count)
| ... assume count>0 already, no post ...
| force_rcu()
| qatomic_set(&forced, true) <---------- [1]
| ... force_rcu() always post ...
| qemu_sem_post(&sync_event) <---------- [2]
qatomic_read(&rcu_call_count) |
enter_qs() |
wait_for_readers(sleep=N) |
... finish without explicit sleep in wait_for_readers(), due to [2] above...
(even if sleep=N)
... invoked the rcu callback injected at [a] ...
... the next sequence of rcu will see force=true, but it's fine to see
force=true (false positive in this case when [a] callback is
already flushed) ...
IOW, IIUC there are two things needed to make sure the rcu thread finishes
ASAP: one is setting force=true, the other is the event_set(sync_event).
IMHO it's the same here when using a sem, except that a sem can remember
more than one count, so it'll be able to identify concurrent call_rcu1()
and force_rcu().
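Just to illustrate (this is not code from the series, only the generic
semantics of the two primitives as I understand them -- a latch vs. a
counter):

    #include "qemu/osdep.h"
    #include "qemu/thread.h"

    static void latch_vs_counter(void)
    {
        QemuEvent ev;
        QemuSemaphore sem;

        qemu_event_init(&ev, false);
        qemu_sem_init(&sem, 0);

        /* Event: a latch.  Two sets collapse into one wakeup, and a reset
         * can discard a pending set, which is why the placement of resets
         * matters so much. */
        qemu_event_set(&ev);
        qemu_event_set(&ev);      /* no-op, already set */
        qemu_event_wait(&ev);     /* returns; stays set until reset */
        qemu_event_reset(&ev);

        /* Semaphore: a counter.  Both posts are remembered, so a kick from
         * call_rcu1() and a kick from force_rcu() stay distinguishable. */
        qemu_sem_post(&sem);
        qemu_sem_post(&sem);
        qemu_sem_wait(&sem);
        qemu_sem_wait(&sem);
    }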
Thanks,
>
> > n = qatomic_read(&rcu_call_count);
> > if (n) {
> > break;
> > @@ -304,7 +306,7 @@ static void *call_rcu_thread(void *opaque)
> > qemu_sem_wait(&sync_event);
> > }
> > - enter_qs(!qatomic_xchg(&forced, false));
> > + enter_qs(sleep);
> > qatomic_sub(&rcu_call_count, n);
> > bql_lock();
> > while (n > 0) {
> > ===8<===
> >
> > The idea is still the same, using semaphore can avoid explicit resets and a
> > lot of other ordering constraints like reading call_count, etc.
> >
> > E.g. even before this series, we still need to properly reset at explicit
> > time to make sure we can capture a set() correct. When with sem, all these
> > issues are gone simply because we won't miss post() when it's a counter not
> > boolean.
> >
> > Also, would you please also have a look at other comments I left in the
> > same email (after the patch I attached)?
> >
> > https://lore.kernel.org/qemu-devel/aQ0Ys09WtlSPoapm@x1.local/
> >
> > Can search "When I was having a closer look, I found some other issues".
>
> I have just replied to the email. My mailer ignored the part after "--".
>
> Regards,
> Akihiko Odaki
>
--
Peter Xu
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 4/5] rcu: Wake the RCU thread when draining
2025-11-13 17:03 ` Peter Xu
@ 2025-11-14 1:24 ` Akihiko Odaki
2025-11-14 15:30 ` Peter Xu
0 siblings, 1 reply; 29+ messages in thread
From: Akihiko Odaki @ 2025-11-14 1:24 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Dmitry Osipenko, Paolo Bonzini, Michael S. Tsirkin,
Alex Bennée
On 2025/11/14 2:03, Peter Xu wrote:
> On Sat, Nov 08, 2025 at 10:47:16AM +0900, Akihiko Odaki wrote:
>> On 2025/11/07 23:00, Peter Xu wrote:
>>> On Fri, Nov 07, 2025 at 10:47:35AM +0900, Akihiko Odaki wrote:
>>>> On 2025/11/07 6:52, Peter Xu wrote:
>>>>> On Thu, Nov 06, 2025 at 10:40:52AM +0900, Akihiko Odaki wrote:
>>>>>>>>>> + /*
>>>>>>>>>> + * Ensure that the forced variable has not been set after fetching
>>>>>>>>>> + * rcu_call_count; otherwise we may get confused by a force quiescent
>>>>>>>>>> + * state request for an element later than n.
>>>>>>>>>> + */
>>>>>>>>>> + while (qatomic_xchg(&forced, false)) {
>>>>>>>>>> + sleep = false;
>>>>>>>>>> + n = qatomic_read(&rcu_call_count);
>>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> This is pretty tricky, and I wonder if it will make the code easier to read
>>>>>>>>> if we convert the sync_event to be a semaphore instead. When as a sem, it
>>>>>>>>> will take account of whatever kick to it, either a call_rcu1() or an
>>>>>>>>> enforced rcu flush, so that we don't need to reset it. Meanwhile, we don't
>>>>>>>>> worry on an slightly outdated "n" read because the 2nd round of sem_wait()
>>>>>>>>> will catch that new "n".
>>>>>>>>>
>>>>>>>>> Instead, worst case is rcu thread runs one more round without seeing
>>>>>>>>> callbacks on the queue.
>>>>>>>>>
>>>>>>>>> I'm not sure if that could help simplify the code, maybe also make it less
>>>>>>>>> error-prone.
>>>>>>>>
>>>>>>>> Semaphore is not applicable here because it will not de-duplicate concurrent
>>>>>>>> kicks of RCU threads.
>>>>>>>
>>>>>>> Why concurrent kicks of rcu threads is a problem? QemuSemaphore is
>>>>>>> thread-safe itself, meanwhile IIUC it only still causes call_rcu_thread()
>>>>>>> loops some more rounds reading "n", which looks all safe. No?
>>>>>>
>>>>>> It is safe but incurs overheads and confusing. QemuEvent represents the
>>>>>> boolean semantics better.
>>>>>>
>>>>>> I also have difficulty to understand how converting sync_event to a
>>>>>> semaphore simplifies the code. Perhaps some (pseudo)code to show how the
>>>>>> code will look like may be useful.
>>>>>
>>>>> I prepared a patch on top of your current patchset to show what I meant. I
>>>>> also added comments and some test results showing why I think it might be
>>>>> fine that the sem overhead should be small.
>>>>>
>>>>> In short, I tested a VM with 8 vCPUs and 4G mem, booting Linux and properly
>>>>> poweroff, I only saw <1000 rcu_call1 users in total. That should be the
>>>>> max-bound of sem overhead on looping in rcu thread.
>>>>>
>>>>> It's in patch format but still treat it as a comment instead to discuss
>>>>> with you. Attaching it is just easier for me.
>>>>>
>>>>> ===8<===
>>>>> From 71f15ed19050a973088352a8d71b6cc6b7b5f7cf Mon Sep 17 00:00:00 2001
>>>>> From: Peter Xu <peterx@redhat.com>
>>>>> Date: Thu, 6 Nov 2025 16:03:00 -0500
>>>>> Subject: [PATCH] rcu: Make sync_event a semaphore
>>>>>
>>>>> It could simplify all the reset logic, especially after enforced rcu is
>>>>> introduced we'll also need a tweak to re-read "n", which can be avoided too
>>>>> when with a sem.
>>>>>
>>>>> However, the sem can introduce an overhead in high-frequency rcu frees.
>>>>> This patch is drafted with the assumption that rcu free is at least very
>>>>> rare in QEMU, hence it's not a problem.
>>>>>
>>>>> When I tested with this command:
>>>>>
>>>>> qemu-system-x86_64 -M q35,kernel-irqchip=split,suppress-vmdesc=on -smp 8 \
>>>>> -m 4G -msg timestamp=on -name peter-vm,debug-threads=on -cpu Nehalem \
>>>>> -accel kvm -qmp unix:/tmp/peter.sock,server,nowait -nographic \
>>>>> -monitor telnet::6666,server,nowait -netdev user,id=net0,hostfwd=tcp::5555-:22
>>>>> -device e1000,netdev=net0 -device virtio-balloon $DISK
>>>>>
>>>>> I booted a pre-installed Linux, login and poweroff, wait until VM
>>>>> completely shutdowns. I captured less than 1000 rcu_free1() calls in
>>>>> summary. It means for the whole lifetime of such VM the max overhead of
>>>>> the call_rcu_thread() loop reading rcu_call_count will be 1000 loops.
>>>>>
>>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>>> ---
>>>>> util/rcu.c | 36 ++++++++----------------------------
>>>>> 1 file changed, 8 insertions(+), 28 deletions(-)
>>>>>
>>>>> diff --git a/util/rcu.c b/util/rcu.c
>>>>> index 85f9333f5d..dfe031a5c9 100644
>>>>> --- a/util/rcu.c
>>>>> +++ b/util/rcu.c
>>>>> @@ -54,7 +54,7 @@ static int rcu_call_count;
>>>>> static QemuMutex rcu_registry_lock;
>>>>> /* Set when the forced variable is set or rcu_call_count becomes non-zero. */
>>>>> -static QemuEvent sync_event;
>>>>> +static QemuSemaphore sync_event;
>>>>> /*
>>>>> * Check whether a quiescent state was crossed between the beginning of
>>>>> @@ -80,7 +80,7 @@ static ThreadList registry = QLIST_HEAD_INITIALIZER(registry);
>>>>> void force_rcu(void)
>>>>> {
>>>>> qatomic_set(&forced, true);
>>>>> - qemu_event_set(&sync_event);
>>>>> + qemu_sem_post(&sync_event);
>>>>> }
>>>>> /* Wait for previous parity/grace period to be empty of readers. */
>>>>> @@ -148,7 +148,7 @@ static void wait_for_readers(bool sleep)
>>>>> */
>>>>> qemu_event_reset(&rcu_gp_event);
>>>>> } else if (qatomic_read(&rcu_call_count) >= RCU_CALL_MIN_SIZE ||
>>>>> - !sleeps || qemu_event_timedwait(&sync_event, 10)) {
>>>>> + !sleeps || qemu_sem_timedwait(&sync_event, 10)) {
>>>>> /*
>>>>> * Now one of the following heuristical conditions is satisfied:
>>>>> * - A decent number of callbacks piled up.
>>>>> @@ -286,7 +286,6 @@ static void *call_rcu_thread(void *opaque)
>>>>> rcu_register_thread();
>>>>> for (;;) {
>>>>> - bool sleep = true;
>>>>> int n;
>>>>> /*
>>>>> @@ -294,7 +293,6 @@ static void *call_rcu_thread(void *opaque)
>>>>> * added before enter_qs() starts.
>>>>> */
>>>>> for (;;) {
>>>>> - qemu_event_reset(&sync_event);
>>>>> n = qatomic_read(&rcu_call_count);
>>>>> if (n) {
>>>>> break;
>>>>> @@ -303,36 +301,19 @@ static void *call_rcu_thread(void *opaque)
>>>>> #if defined(CONFIG_MALLOC_TRIM)
>>>>> malloc_trim(4 * 1024 * 1024);
>>>>> #endif
>>>>> - qemu_event_wait(&sync_event);
>>>>> + qemu_sem_wait(&sync_event);
>>>>> }
>>>>> - /*
>>>>> - * Ensure that an event for a rcu_call_count change will not interrupt
>>>>> - * wait_for_readers().
>>>>> - */
>>>>> - qemu_event_reset(&sync_event);
>>>>> -
>>>>> - /*
>>>>> - * Ensure that the forced variable has not been set after fetching
>>>>> - * rcu_call_count; otherwise we may get confused by a force quiescent
>>>>> - * state request for an element later than n.
>>>>> - */
>>>>> - while (qatomic_xchg(&forced, false)) {
>>>>> - sleep = false;
>>>>> - n = qatomic_read(&rcu_call_count);
>>>>> - }
>>>>> -
>>>>> - enter_qs(sleep);
>>>>> + enter_qs(!qatomic_xchg(&forced, false));
>>>>
>>>> This is not OK; the forced variable may be set after rcu_call_count is
>>>> fetched. In that case, we should avoid unsetting the force quiescent state
>>>> request for the elements later than "n" or refetch "n".
>>>
>>> Indeed I missed that part, but it should be trivial to fix, on top of my
>>> previous patch:
>>>
>>> ===8<===
>>> diff --git a/util/rcu.c b/util/rcu.c
>>> index dfe031a5c9..aff98d9ee2 100644
>>> --- a/util/rcu.c
>>> +++ b/util/rcu.c
>>> @@ -286,6 +286,7 @@ static void *call_rcu_thread(void *opaque)
>>> rcu_register_thread();
>>> for (;;) {
>>> + bool sleep;
>>> int n;
>>> /*
>>> @@ -293,6 +294,7 @@ static void *call_rcu_thread(void *opaque)
>>> * added before enter_qs() starts.
>>> */
>>> for (;;) {
>>> + sleep = !qatomic_xchg(&forced, false);
>>
>> This doesn't work either; the following sequence may happen (assume forced
>> is false at beginning):
>>
>> qatomic_xchg(&forced, false) |
>> | call_rcu1()
>> | qatomic_fetch_inc(&rcu_call_count)
>> | force_rcu()
>> | qatomic_set(&forced, true) <---------- [1]
>> qatomic_read(&rcu_call_count)
>>
>> We need to enter the force quiescent state for the node added with the
>> call_rcu1() call in this sequence, but this code doesn't.
>
> We don't necessarily need to identify "this sequence" or "next sequence",
> but what we want to make sure when forced rcu triggered, rcu thread doesn't
> sleep until flushing the callback injected right before force_rcu(), right?
>
> IOW, at [1] above after setting forced, we still will post to sem, so IIUC
> what will happen is both the normal call_rcu1() and force_rcu() will start
> to post to the sem, hence the value of sem can be 2. If I add the whole
> process into above picture, there must be a pre-existing call_rcu1() that
> kicks the rcu thread first at [0] otherwise rcu thread should be asleep,
> then with the help of the 2nd post at [2] it should guarantee the rcu
> callback of force_rcu() be finally invoked without sleep:
>
> rcu thread other thread
> ---------- ------------
>
> call_rcu1() |
> qemu_sem_post(&sync_event)| <---------- [0]
> |
>
> ... rcu thread waked up ...
>
> |
> qatomic_xchg(&forced, false) |
> | call_rcu1() <---------- [a]
> | qatomic_fetch_inc(&rcu_call_count)
> | ... assume count>0 already, no post ...
> | force_rcu()
> | qatomic_set(&forced, true) <---------- [1]
> | ... force_rcu() always post ...
> | qemu_sem_post(&sync_event) <---------- [2]
> qatomic_read(&rcu_call_count) |
> enter_qs() |
> wait_for_readers(sleep=N) |
>
> ... finish without explicit sleep in wait_for_readers(), due to [2] above...
> (even if sleep=N)
>
> ... invoked the rcu callback injected at [a] ...
>
> ... the next sequence of rcu will see force=true, but it's fine to see
> force=true (false positive in this case when [a] callback is
> already flushed) ...
>
> IOW, IIUC there're two things needed to make sure the rcu thread finish
> asap, one is the set of force=true, the other is the event_set(sync_event).
>
> IMHO it's the same here when using sem, just that sem can remember more
> than 1 count so it'll be able to identify concurrent call_rcu1() and
> force_rcu().
I missed that you removed the qemu_event_reset(&sync_event) that follows
qatomic_read(&rcu_call_count). With that change, the force quiescent
state request will not be dismissed.
However, it creates another problem. Think of the following sequence:
call_rcu_thread() |
| call_rcu1()
| qatomic_fetch_inc(&rcu_call_count)
| qemu_sem_post(&sync_event)
|
qatomic_read(&rcu_call_count) |
enter_qs(false) |
wait_for_readers(false) |
qemu_sem_timedwait( |
&sync_event, 10) |
qemu_sem_timedwait() incorrectly interrupts the RCU thread and enters
the force quiescent state.
Regards,
Akihiko Odaki
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 4/5] rcu: Wake the RCU thread when draining
2025-11-14 1:24 ` Akihiko Odaki
@ 2025-11-14 15:30 ` Peter Xu
2025-11-15 1:58 ` Akihiko Odaki
0 siblings, 1 reply; 29+ messages in thread
From: Peter Xu @ 2025-11-14 15:30 UTC (permalink / raw)
To: Akihiko Odaki
Cc: qemu-devel, Dmitry Osipenko, Paolo Bonzini, Michael S. Tsirkin,
Alex Bennée
On Fri, Nov 14, 2025 at 10:24:40AM +0900, Akihiko Odaki wrote:
> However it creates another problem. Think of the following sequence:
>
> call_rcu_thread() |
> | call_rcu1()
> | qatomic_fetch_inc(&rcu_call_count)
> | qemu_sem_post(&sync_event)
> |
> qatomic_read(&rcu_call_count) |
> enter_qs(false) |
> wait_for_readers(false) |
> qemu_sem_timedwait( |
> &sync_event, 10) |
>
> qemu_sem_timedwait() incorrectly interrupts the RCU thread and enters the
> force quiescent state.
The first thing to mention is that, IIUC, the above can't happen: if
call_rcu_thread() has already woken up and reached enter_qs(), there must
have been, for example, a prior call_rcu1() that incremented
rcu_call_count and posted to sync_event, so rcu_call_count cannot be 0
any more in the call_rcu1() above, because the sub happens later:
call_rcu_thread:
... sees rcu_call_count > 0, quit loop ...
...
enter_qs(sleep);
qatomic_sub(&rcu_call_count, n); <-------------------------
...
That means the concurrent call_rcu1() above will not post to the sem,
because it only posts when it sees rcu_call_count == 0.
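For reference, a minimal sketch of the wakeup rule I'm relying on here
(simplified; not the exact code in the series, and enqueue() just stands
for the lock-free list push):

    /* Sketch: only the 0 -> 1 transition of rcu_call_count wakes the rcu
     * thread; later enqueuers assume it is already awake. */
    static void call_rcu1_sketch(struct rcu_head *node, RCUCBFunc *func)
    {
        node->func = func;
        enqueue(node);
        if (qatomic_fetch_inc(&rcu_call_count) == 0) {
            qemu_sem_post(&sync_event);
        }
    }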
Besides, IMHO replacing the event with a sem shouldn't change this kind of
behavior compared to using events, because any spot that can post()
concurrently can also do qemu_event_set() concurrently... after all, we
only have a few spots resetting the event in the original patch, and after
a reset a concurrent qemu_event_set() will re-activate it.
A sem does introduce possible false-positive events, but as I was trying
to justify, a generic VM boot and shutdown should need fewer than 1000 rcu
calls, so the worst case is that we loop reading rcu_call_count 1000
times... I also didn't measure how many of those call_rcu1() calls will
see N>0 and therefore not post() to the sem at all; I believe that will be
common. So the perf overhead should really be rare, IMHO.
Note that I was not asking you to replace this event with a sem before
merging. Frankly, I think your work is good enough to land.
However, I still want to discuss a sem replacement, not only because this
is the best place to discuss it (after all, force rcu introduces a genuine
second user of the event besides call_rcu1, so "being able to remember
more than 1" starts to make more sense than avoiding resets), but also
because it looks cleaner to remove some of the new code force rcu adds.
When reviewing, I also found it hard to justify that the qemu_event_reset()
calls are correct in all the places they appear. A sem would remove that
problem. So, at least to me, if you agree that using a sem is better, it
will make me more confident to ack the series (not that merging needs my
ack... but I do feel more uncertain when the event needs explicit resets).
OTOH, a false positive only causes some extra loops in the rcu thread,
which looks much safer even if it happens (though in reality, I think the
sem shouldn't go above 2).
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 4/5] rcu: Wake the RCU thread when draining
2025-11-14 15:30 ` Peter Xu
@ 2025-11-15 1:58 ` Akihiko Odaki
2025-11-15 2:59 ` Akihiko Odaki
2025-11-17 16:39 ` Peter Xu
0 siblings, 2 replies; 29+ messages in thread
From: Akihiko Odaki @ 2025-11-15 1:58 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Dmitry Osipenko, Paolo Bonzini, Michael S. Tsirkin,
Alex Bennée
On 2025/11/15 0:30, Peter Xu wrote:
> On Fri, Nov 14, 2025 at 10:24:40AM +0900, Akihiko Odaki wrote:
>> However it creates another problem. Think of the following sequence:
>>
>> call_rcu_thread() |
>> | call_rcu1()
>> | qatomic_fetch_inc(&rcu_call_count)
>> | qemu_sem_post(&sync_event)
>> |
>> qatomic_read(&rcu_call_count) |
>> enter_qs(false) |
>> wait_for_readers(false) |
>> qemu_sem_timedwait( |
>> &sync_event, 10) |
>>
>> qemu_sem_timedwait() incorrectly interrupts the RCU thread and enters the
>> force quiescent state.
>
> First thing to mention is, IIUC above can't happen because if
> call_rcu_thread() is already waked up and reaching enter_qs(), then there
> should have been, for example, a prior call_rcu1() that incremented
> rcu_call_count and posted to sync_event, hence rcu_call_count cannot be 0
> anymore in the call_rcu1() above, because the sub happens later:
>
> call_rcu_thread:
> ... sees rcu_call_count > 0, quit loop ...
> ...
> enter_qs(sleep);
> qatomic_sub(&rcu_call_count, n); <-------------------------
> ...
>
> That means the concurrent call_rcu1() above will not post sem anymore
> because it will only post it if rcu_call_count==0.
Below is an extended version of the sequence:
call_rcu_thread() |
qatomic_sub(&rcu_call_count, n) |
(sets rcu_call_count to 0) |
| call_rcu1()
| qatomic_fetch_inc(&rcu_call_count)
| qemu_sem_post(&sync_event)
qatomic_read(&rcu_call_count) |
enter_qs(false) |
wait_for_readers(false) |
qemu_sem_timedwait( |
&sync_event, 10) |
Note that there may not be a qemu_sem_wait(&sync_event) call between
qatomic_sub(&rcu_call_count, n) and qatomic_read(&rcu_call_count).
>
> Besides, IMHO replacing the event with sem shouldn't change similar
> behavior comparing to when using events. Because any spot that can post()
> concurrently can also does qemu_event_set() concurrently... after all, we
> only have a few spots resettting the event in the original patch, after the
> reset a concurrent qemu_event_set() will re-activate it.
The sequence I showed is handled correctly with a properly placed
qemu_event_reset():
call_rcu_thread() |
qatomic_sub(&rcu_call_count, n) |
(sets rcu_call_count to 0) |
| call_rcu1()
| qatomic_fetch_inc(&rcu_call_count)
| qemu_event_set(&sync_event)
qatomic_read(&rcu_call_count) |
qemu_event_reset(&sync_event) |
enter_qs(false) |
wait_for_readers(false) |
qemu_sem_timedwait( |
&sync_event, 10) |
Note that a concurrent qemu_event_set() after resetting the event can
only be triggered by force_rcu(), which is intended to interrupt
wait_for_readers().
>
> Sem does introduce possible false positives on events, but as I was trying
> to justify, a generic VM boots and shutdown should need less than 1000 rcu
> calls, so worst case is we loop read rcu_call_count 1000 times... I also
> didn't measure how many of such call_rcu1() will see N>0 so they'll not
> post() to sem at all, I believe it'll be common. So the perf overhead
> should really be rare, IMHO.
>
> Note that I was not requesting you to replace this event to sem before
> merging. Frankly, I think your work is good enough to land.
>
> However I still want to discuss a sem replacement, not only because this is
> the best place to discuss it (after all, force rcu starts to introduce a
> genuine second user of the event besides call_rcu1, hence the "being able
> to remember more than 1" will start to make more sense than avoid resets),
> it looks also cleaner to remove some new code force rcu adds.
How do "being able to remember more than 1" help?
Regards,
Akihiko Odaki
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 4/5] rcu: Wake the RCU thread when draining
2025-11-15 1:58 ` Akihiko Odaki
@ 2025-11-15 2:59 ` Akihiko Odaki
2025-11-17 16:42 ` Peter Xu
2025-11-17 16:39 ` Peter Xu
1 sibling, 1 reply; 29+ messages in thread
From: Akihiko Odaki @ 2025-11-15 2:59 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Dmitry Osipenko, Paolo Bonzini, Michael S. Tsirkin,
Alex Bennée
On 2025/11/15 10:58, Akihiko Odaki wrote:
> On 2025/11/15 0:30, Peter Xu wrote:
>> On Fri, Nov 14, 2025 at 10:24:40AM +0900, Akihiko Odaki wrote:
>>> However it creates another problem. Think of the following sequence:
>>>
>>> call_rcu_thread() |
>>> | call_rcu1()
>>> | qatomic_fetch_inc(&rcu_call_count)
>>> | qemu_sem_post(&sync_event)
>>> |
>>> qatomic_read(&rcu_call_count) |
>>> enter_qs(false) |
>>> wait_for_readers(false) |
>>> qemu_sem_timedwait( |
>>> &sync_event, 10) |
>>>
>>> qemu_sem_timedwait() incorrectly interrupts the RCU thread and enters
>>> the
>>> force quiescent state.
>>
>> First thing to mention is, IIUC above can't happen because if
>> call_rcu_thread() is already waked up and reaching enter_qs(), then there
>> should have been, for example, a prior call_rcu1() that incremented
>> rcu_call_count and posted to sync_event, hence rcu_call_count cannot be 0
>> anymore in the call_rcu1() above, because the sub happens later:
>>
>> call_rcu_thread:
>> ... sees rcu_call_count > 0, quit loop ...
>> ...
>> enter_qs(sleep);
>> qatomic_sub(&rcu_call_count, n); <-------------------------
>> ...
>>
>> That means the concurrent call_rcu1() above will not post sem anymore
>> because it will only post it if rcu_call_count==0.
> Below is an extended version of the sequence:
>
> call_rcu_thread() |
> qatomic_sub(&rcu_call_count, n) |
> (sets rcu_call_count to 0) |
> | call_rcu1()
> | qatomic_fetch_inc(&rcu_call_count)
> | qemu_sem_post(&sync_event)
> qatomic_read(&rcu_call_count) |
> enter_qs(false) |
> wait_for_readers(false) |
> qemu_sem_timedwait( |
> &sync_event, 10) |
>
> Note that there may not be a qemu_sem_wait(&sync_event) call between
> qatomic_sub(&rcu_call_count, n) and qatomic_read(&rcu_call_count).
>
>>
>> Besides, IMHO replacing the event with sem shouldn't change similar
>> behavior comparing to when using events. Because any spot that can
>> post()
>> concurrently can also does qemu_event_set() concurrently... after all, we
>> only have a few spots resettting the event in the original patch,
>> after the
>> reset a concurrent qemu_event_set() will re-activate it.
>
> The sequence I showed is handled properly with properly placed
> qemu_event_reset():
>
> call_rcu_thread() |
> qatomic_sub(&rcu_call_count, n) |
> (sets rcu_call_count to 0) |
> | call_rcu1()
> | qatomic_fetch_inc(&rcu_call_count)
> | qemu_event_set(&sync_event)
> qatomic_read(&rcu_call_count) |
> qemu_event_reset(&sync_event) |
> enter_qs(false) |
> wait_for_readers(false) |
> qemu_sem_timedwait( |
> &sync_event, 10) |
>
> Note that a concurrent qemu_event_set() after resetting the event can
> only triggered with force_rcu(), which is intended to interrupt
> wait_for_readers().
This is wrong. The following can still happen:
call_rcu_thread() |
qatomic_sub(&rcu_call_count, n) |
(sets rcu_call_count to 0) |
| call_rcu1()
| qatomic_fetch_inc(&rcu_call_count)
qatomic_read(&rcu_call_count) |
qemu_event_reset(&sync_event) |
| qemu_event_set(&sync_event)
enter_qs(false) |
wait_for_readers(false) |
qemu_sem_timedwait( |
&sync_event, 10) |
I'll fix it in the next version.
>
>>
>> Sem does introduce possible false positives on events, but as I was
>> trying
>> to justify, a generic VM boots and shutdown should need less than 1000
>> rcu
>> calls, so worst case is we loop read rcu_call_count 1000 times... I also
>> didn't measure how many of such call_rcu1() will see N>0 so they'll not
>> post() to sem at all, I believe it'll be common. So the perf overhead
>> should really be rare, IMHO.
>>
>> Note that I was not requesting you to replace this event to sem before
>> merging. Frankly, I think your work is good enough to land.
>>
>> However I still want to discuss a sem replacement, not only because
>> this is
>> the best place to discuss it (after all, force rcu starts to introduce a
>> genuine second user of the event besides call_rcu1, hence the "being able
>> to remember more than 1" will start to make more sense than avoid
>> resets),
>> it looks also cleaner to remove some new code force rcu adds.
>
> How do "being able to remember more than 1" help?
>
> Regards,
> Akihiko Odaki
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 4/5] rcu: Wake the RCU thread when draining
2025-11-15 2:59 ` Akihiko Odaki
@ 2025-11-17 16:42 ` Peter Xu
2025-11-17 22:53 ` Akihiko Odaki
0 siblings, 1 reply; 29+ messages in thread
From: Peter Xu @ 2025-11-17 16:42 UTC (permalink / raw)
To: Akihiko Odaki
Cc: qemu-devel, Dmitry Osipenko, Paolo Bonzini, Michael S. Tsirkin,
Alex Bennée
On Sat, Nov 15, 2025 at 11:59:01AM +0900, Akihiko Odaki wrote:
> This is wrong. The following can still happen:
>
> call_rcu_thread() |
> qatomic_sub(&rcu_call_count, n) |
> (sets rcu_call_count to 0) |
> | call_rcu1()
> | qatomic_fetch_inc(&rcu_call_count)
> qatomic_read(&rcu_call_count) |
> qemu_event_reset(&sync_event) |
> | qemu_event_set(&sync_event)
> enter_qs(false) |
> wait_for_readers(false) |
> qemu_sem_timedwait( |
> &sync_event, 10) |
>
> I'll fix it with the next version.
Please take this as an example of why I think these orderings are very hard
to get 100% right. Consider someone else who is less familiar with the rcu
code: they may mess up some of the orderings without it being noticed.
That's also why I personally liked a sem: the important thing here is not
missing an event, which a sem always guarantees since it has no resetting
at all. Then we can tolerate false-positive events as long as we keep them
to a minimum.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 4/5] rcu: Wake the RCU thread when draining
2025-11-17 16:42 ` Peter Xu
@ 2025-11-17 22:53 ` Akihiko Odaki
0 siblings, 0 replies; 29+ messages in thread
From: Akihiko Odaki @ 2025-11-17 22:53 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Dmitry Osipenko, Paolo Bonzini, Michael S. Tsirkin,
Alex Bennée
On 2025/11/18 1:42, Peter Xu wrote:
> On Sat, Nov 15, 2025 at 11:59:01AM +0900, Akihiko Odaki wrote:
>> This is wrong. The following can still happen:
>>
>> call_rcu_thread() |
>> qatomic_sub(&rcu_call_count, n) |
>> (sets rcu_call_count to 0) |
>> | call_rcu1()
>> | qatomic_fetch_inc(&rcu_call_count)
>> qatomic_read(&rcu_call_count) |
>> qemu_event_reset(&sync_event) |
>> | qemu_event_set(&sync_event)
>> enter_qs(false) |
>> wait_for_readers(false) |
>> qemu_sem_timedwait( |
>> &sync_event, 10) |
>>
>> I'll fix it with the next version.
>
> Please take this as an example of why I think these orderings are very hard
> to make 100% accurate. Consider when someone else who is less familiar
> with the rcu code and may mess up with some of the orderings without being
> noticed.
>
> That's also why I personally liked a sem because the important thing here
> is not missing an event, which sem always guarantees as it doesn't have
> resetting at all. Then we can be open to false positives on events as long
> as keeping it as minimum as possible.
It doesn't matter whether it is a semaphore or an event in this case. You
can simply drop qemu_event_reset(&sync_event) here and the situation will
be the same for both the event and the semaphore.
Regards,
Akihiko Odaki
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 4/5] rcu: Wake the RCU thread when draining
2025-11-15 1:58 ` Akihiko Odaki
2025-11-15 2:59 ` Akihiko Odaki
@ 2025-11-17 16:39 ` Peter Xu
1 sibling, 0 replies; 29+ messages in thread
From: Peter Xu @ 2025-11-17 16:39 UTC (permalink / raw)
To: Akihiko Odaki
Cc: qemu-devel, Dmitry Osipenko, Paolo Bonzini, Michael S. Tsirkin,
Alex Bennée
On Sat, Nov 15, 2025 at 10:58:11AM +0900, Akihiko Odaki wrote:
> On 2025/11/15 0:30, Peter Xu wrote:
> > On Fri, Nov 14, 2025 at 10:24:40AM +0900, Akihiko Odaki wrote:
> > > However it creates another problem. Think of the following sequence:
> > >
> > > call_rcu_thread() |
> > > | call_rcu1()
> > > | qatomic_fetch_inc(&rcu_call_count)
> > > | qemu_sem_post(&sync_event)
> > > |
> > > qatomic_read(&rcu_call_count) |
> > > enter_qs(false) |
> > > wait_for_readers(false) |
> > > qemu_sem_timedwait( |
> > > &sync_event, 10) |
> > >
> > > qemu_sem_timedwait() incorrectly interrupts the RCU thread and enters the
> > > force quiescent state.
> >
> > First thing to mention is, IIUC above can't happen because if
> > call_rcu_thread() is already waked up and reaching enter_qs(), then there
> > should have been, for example, a prior call_rcu1() that incremented
> > rcu_call_count and posted to sync_event, hence rcu_call_count cannot be 0
> > anymore in the call_rcu1() above, because the sub happens later:
> >
> > call_rcu_thread:
> > ... sees rcu_call_count > 0, quit loop ...
> > ...
> > enter_qs(sleep);
> > qatomic_sub(&rcu_call_count, n); <-------------------------
> > ...
> >
> > That means the concurrent call_rcu1() above will not post sem anymore
> > because it will only post it if rcu_call_count==0.
> Below is an extended version of the sequence:
>
> call_rcu_thread() |
> qatomic_sub(&rcu_call_count, n) |
> (sets rcu_call_count to 0) |
> | call_rcu1()
> | qatomic_fetch_inc(&rcu_call_count)
> | qemu_sem_post(&sync_event)
> qatomic_read(&rcu_call_count) |
> enter_qs(false) |
> wait_for_readers(false) |
> qemu_sem_timedwait( |
> &sync_event, 10) |
>
> Note that there may not be a qemu_sem_wait(&sync_event) call between
> qatomic_sub(&rcu_call_count, n) and qatomic_read(&rcu_call_count).
Yes, with a semaphore this may happen.
It may also happen that the sem keeps accumulating like this (for example,
wait_for_readers() may keep seeing more registered callbacks than
RCU_CALL_MIN_SIZE), so it can even go above 2. Then, when the RCU callback
storm finishes, call_rcu_thread() may see that >2 sem counter and loop
over reading rcu_call_count a few times, seeing zeros.
I think it's fine, because the rcu thread only runs slightly faster on
recycling in extremely rare cases like this, or does a few more loops,
also only in extreme corner cases (as I mentioned before, I was trying to
ask whether high-frequency RCU frees matter to us; that's why I was trying
to count them over a VM's lifespan). Hence IMHO it's not an error either,
but a benign false positive.
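To make the "a few more loops" concrete, this is roughly the inner loop of
call_rcu_thread() from the patch quoted earlier (trimmed; malloc_trim etc.
omitted):

    for (;;) {
        n = qatomic_read(&rcu_call_count);
        if (n) {
            break;
        }
        /* A leftover post from an increment we already consumed is
         * absorbed here: the wait returns once, the counter still reads
         * zero, and we simply go back to waiting. */
        qemu_sem_wait(&sync_event);
    }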
>
> >
> > Besides, IMHO replacing the event with sem shouldn't change similar
> > behavior comparing to when using events. Because any spot that can post()
> > concurrently can also does qemu_event_set() concurrently... after all, we
> > only have a few spots resettting the event in the original patch, after the
> > reset a concurrent qemu_event_set() will re-activate it.
>
> The sequence I showed is handled properly with properly placed
> qemu_event_reset():
>
> call_rcu_thread() |
> qatomic_sub(&rcu_call_count, n) |
> (sets rcu_call_count to 0) |
> | call_rcu1()
> | qatomic_fetch_inc(&rcu_call_count)
> | qemu_event_set(&sync_event)
> qatomic_read(&rcu_call_count) |
> qemu_event_reset(&sync_event) |
> enter_qs(false) |
> wait_for_readers(false) |
> qemu_sem_timedwait( |
> &sync_event, 10) |
>
> Note that a concurrent qemu_event_set() after resetting the event can only
> triggered with force_rcu(), which is intended to interrupt
> wait_for_readers().
I see your point here. I agree: I can't think of any way to trigger false
positives the way a sem can.
>
> >
> > Sem does introduce possible false positives on events, but as I was trying
> > to justify, a generic VM boots and shutdown should need less than 1000 rcu
> > calls, so worst case is we loop read rcu_call_count 1000 times... I also
> > didn't measure how many of such call_rcu1() will see N>0 so they'll not
> > post() to sem at all, I believe it'll be common. So the perf overhead
> > should really be rare, IMHO.
> >
> > Note that I was not requesting you to replace this event to sem before
> > merging. Frankly, I think your work is good enough to land.
> >
> > However I still want to discuss a sem replacement, not only because this is
> > the best place to discuss it (after all, force rcu starts to introduce a
> > genuine second user of the event besides call_rcu1, hence the "being able
> > to remember more than 1" will start to make more sense than avoid resets),
> > it looks also cleaner to remove some new code force rcu adds.
>
> How do "being able to remember more than 1" help?
It helps to avoid resets and makes the code easier to follow.
The current rcu code, especially after your series lands, will need (1)
quite a few deliberate event resets, and (2) ordering constraints between
any of: event waits, event resets, reading the "force" var, reading the
"rcu_call_count" var, etc. These things need very careful attention;
missing or getting wrong any one spot may introduce hard-to-debug
functional issues.
The sem proposal was mostly about trying to remove almost all of those
complexities.
But I agree the drawback is that a sem may introduce false positives like
the above, so it's not an ideal proposal either. It's just that, as
mentioned, it should be all safe and fine IMHO, unless we care about
high-frequency rcu frees.
As I said, I don't have a strong opinion. I'll leave that to you to decide.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 29+ messages in thread
[parent not found: <1b318ad8-48b3-4968-86ca-c62aef3b3bd4@rsg.ci.i.u-tokyo.ac.jp>]
* [PATCH 5/5] virtio-gpu: Force RCU when unmapping blob
2025-10-29 6:12 [PATCH 0/5] virtio-gpu: Force RCU when unmapping blob Akihiko Odaki
` (3 preceding siblings ...)
2025-10-29 6:12 ` [PATCH 4/5] rcu: Wake the RCU thread when draining Akihiko Odaki
@ 2025-10-29 6:12 ` Akihiko Odaki
2025-10-30 11:18 ` Dmitry Osipenko
2025-10-30 11:17 ` [PATCH 0/5] " Dmitry Osipenko
` (2 subsequent siblings)
7 siblings, 1 reply; 29+ messages in thread
From: Akihiko Odaki @ 2025-10-29 6:12 UTC (permalink / raw)
To: qemu-devel, Dmitry Osipenko
Cc: Paolo Bonzini, Michael S. Tsirkin, Alex Bennée,
Akihiko Odaki
Unmapping a blob changes the memory map, which is protected with RCU.
RCU is designed to minimize the read-side overhead at the cost of
reclamation delay. While this design usually makes sense, it is
problematic when unmapping a blob because the operation blocks all
virtio-gpu commands and causes perceivable disruption.
Minimize such disruption with force_rcu(), which minimizes the
reclamation delay at the cost of a read-side overhead.
Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
---
hw/display/virtio-gpu-virgl.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/hw/display/virtio-gpu-virgl.c b/hw/display/virtio-gpu-virgl.c
index 07f6355ad62e..71cde671c193 100644
--- a/hw/display/virtio-gpu-virgl.c
+++ b/hw/display/virtio-gpu-virgl.c
@@ -187,6 +187,7 @@ virtio_gpu_virgl_unmap_resource_blob(VirtIOGPU *g,
memory_region_set_enabled(mr, false);
memory_region_del_subregion(&b->hostmem, mr);
object_unparent(OBJECT(mr));
+ force_rcu();
}
return 0;
--
2.51.0
^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [PATCH 5/5] virtio-gpu: Force RCU when unmapping blob
2025-10-29 6:12 ` [PATCH 5/5] virtio-gpu: Force RCU when unmapping blob Akihiko Odaki
@ 2025-10-30 11:18 ` Dmitry Osipenko
0 siblings, 0 replies; 29+ messages in thread
From: Dmitry Osipenko @ 2025-10-30 11:18 UTC (permalink / raw)
To: Akihiko Odaki, qemu-devel
Cc: Paolo Bonzini, Michael S. Tsirkin, Alex Bennée
On 10/29/25 09:12, Akihiko Odaki wrote:
> Unmapping a blob changes the memory map, which is protected with RCU.
> RCU is designed to minimize the read-side overhead at the cost of
> reclamation delay. While this design usually makes sense, it is
> problematic when unmapping a blob because the operation blocks all
> virtio-gpu commands and causes perceivable disruption.
>
> Minimize such the disruption with force_rcu(), which minimizes the
> reclamation delay at the cost of a read-side overhead.
>
> Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
> ---
> hw/display/virtio-gpu-virgl.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/hw/display/virtio-gpu-virgl.c b/hw/display/virtio-gpu-virgl.c
> index 07f6355ad62e..71cde671c193 100644
> --- a/hw/display/virtio-gpu-virgl.c
> +++ b/hw/display/virtio-gpu-virgl.c
> @@ -187,6 +187,7 @@ virtio_gpu_virgl_unmap_resource_blob(VirtIOGPU *g,
> memory_region_set_enabled(mr, false);
> memory_region_del_subregion(&b->hostmem, mr);
> object_unparent(OBJECT(mr));
> + force_rcu();
> }
>
> return 0;
>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
--
Best regards,
Dmitry
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/5] virtio-gpu: Force RCU when unmapping blob
2025-10-29 6:12 [PATCH 0/5] virtio-gpu: Force RCU when unmapping blob Akihiko Odaki
` (4 preceding siblings ...)
2025-10-29 6:12 ` [PATCH 5/5] virtio-gpu: Force RCU when unmapping blob Akihiko Odaki
@ 2025-10-30 11:17 ` Dmitry Osipenko
2025-10-30 17:59 ` Alex Bennée
2025-10-31 21:32 ` Alex Bennée
7 siblings, 0 replies; 29+ messages in thread
From: Dmitry Osipenko @ 2025-10-30 11:17 UTC (permalink / raw)
To: Akihiko Odaki, qemu-devel
Cc: Paolo Bonzini, Michael S. Tsirkin, Alex Bennée
Hello Akihiko,
On 10/29/25 09:12, Akihiko Odaki wrote:
> Based-on: <20251016-force-v1-1-919a82112498@rsg.ci.i.u-tokyo.ac.jp>
> ("[PATCH] rcu: Unify force quiescent state")
>
> Unmapping a blob changes the memory map, which is protected with RCU.
> RCU is designed to minimize the read-side overhead at the cost of
> reclamation delay. While this design usually makes sense, it is
> problematic when unmapping a blob because the operation blocks all
> virtio-gpu commands and causes perceivable disruption.
>
> Minimize such the disruption with force_rcu(), which minimizes the
> reclamation delay at the cost of a read-side overhead.
>
> Dmitry, can you see if this change makes difference?
Tested this series with venus and native contexts.
The improvement is very noticeable. There are almost no stalls with
venus and much less stalls with native context. The stall now takes
2-10ms at max in oppose to 50ms that was observed previously. No
stability issues spotted, everything works.
Thank you for working on this improvement.
--
Best regards,
Dmitry
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/5] virtio-gpu: Force RCU when unmapping blob
2025-10-29 6:12 [PATCH 0/5] virtio-gpu: Force RCU when unmapping blob Akihiko Odaki
` (5 preceding siblings ...)
2025-10-30 11:17 ` [PATCH 0/5] " Dmitry Osipenko
@ 2025-10-30 17:59 ` Alex Bennée
2025-10-31 21:32 ` Alex Bennée
7 siblings, 0 replies; 29+ messages in thread
From: Alex Bennée @ 2025-10-30 17:59 UTC (permalink / raw)
To: Akihiko Odaki
Cc: qemu-devel, Dmitry Osipenko, Paolo Bonzini, Michael S. Tsirkin
Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> writes:
> Based-on: <20251016-force-v1-1-919a82112498@rsg.ci.i.u-tokyo.ac.jp>
> ("[PATCH] rcu: Unify force quiescent state")
>
> Unmapping a blob changes the memory map, which is protected with RCU.
> RCU is designed to minimize the read-side overhead at the cost of
> reclamation delay. While this design usually makes sense, it is
> problematic when unmapping a blob because the operation blocks all
> virtio-gpu commands and causes perceivable disruption.
>
> Minimize such the disruption with force_rcu(), which minimizes the
> reclamation delay at the cost of a read-side overhead.
>
> Dmitry, can you see if this change makes difference?
Also works with the blob test:
➜ ./pyvenv/bin/meson test --setup thorough func-aarch64-gpu_blob
ninja: Entering directory `/home/alex/lsrc/qemu.git/builds/all'
[1/6] Generating qemu-version.h with a custom command (wrapped by meson to capture output)
1/1 qemu:func-thorough+func-aarch64-thorough+thorough / func-aarch64-gpu_blob OK 0.37s 1 subtests passed
Ok: 1
Expected Fail: 0
Fail: 0
Unexpected Pass: 0
Skipped: 0
Timeout: 0
Full log written to /home/alex/lsrc/qemu.git/builds/all/meson-logs/testlog-thorough.txt
🕙17:57:38 alex@draig:qemu.git/builds/all on virtio-gpu/next [$!?]
so a Tested-by: Alex Bennée <alex.bennee@linaro.org> from me.
>
> Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
> ---
> Akihiko Odaki (5):
> futex: Add qemu_futex_timedwait()
> qemu-thread: Add qemu_event_timedwait()
> rcu: Use call_rcu() in synchronize_rcu()
> rcu: Wake the RCU thread when draining
> virtio-gpu: Force RCU when unmapping blob
>
> include/qemu/futex.h | 29 ++++++--
> include/qemu/rcu.h | 1 +
> include/qemu/thread-posix.h | 11 +++
> include/qemu/thread.h | 8 ++-
> hw/display/virtio-gpu-virgl.c | 1 +
> util/event.c | 34 ++++++++--
> util/qemu-thread-posix.c | 11 +--
> util/rcu.c | 153 ++++++++++++++++++++++++------------------
> 8 files changed, 163 insertions(+), 85 deletions(-)
> ---
> base-commit: ee7fbe81705732785aef2cb568bbc5d8f7d2fce1
> change-id: 20251027-force_rcu-616c743373f7
>
> Best regards,
--
Alex Bennée
Virtualisation Tech Lead @ Linaro
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/5] virtio-gpu: Force RCU when unmapping blob
2025-10-29 6:12 [PATCH 0/5] virtio-gpu: Force RCU when unmapping blob Akihiko Odaki
` (6 preceding siblings ...)
2025-10-30 17:59 ` Alex Bennée
@ 2025-10-31 21:32 ` Alex Bennée
7 siblings, 0 replies; 29+ messages in thread
From: Alex Bennée @ 2025-10-31 21:32 UTC (permalink / raw)
To: Akihiko Odaki
Cc: qemu-devel, Dmitry Osipenko, Paolo Bonzini, Michael S. Tsirkin
Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> writes:
> Based-on: <20251016-force-v1-1-919a82112498@rsg.ci.i.u-tokyo.ac.jp>
> ("[PATCH] rcu: Unify force quiescent state")
>
> Unmapping a blob changes the memory map, which is protected with RCU.
> RCU is designed to minimize the read-side overhead at the cost of
> reclamation delay. While this design usually makes sense, it is
> problematic when unmapping a blob because the operation blocks all
> virtio-gpu commands and causes perceivable disruption.
>
> Minimize such the disruption with force_rcu(), which minimizes the
> reclamation delay at the cost of a read-side overhead.
>
> Dmitry, can you see if this change makes difference?
>
> Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Are you planning a re-spin now that the rcu patch is merged? If the rcu
maintainers are happy, I'm fine taking it via virtio-gpu/next with the
test case.
> ---
> Akihiko Odaki (5):
> futex: Add qemu_futex_timedwait()
> qemu-thread: Add qemu_event_timedwait()
> rcu: Use call_rcu() in synchronize_rcu()
> rcu: Wake the RCU thread when draining
> virtio-gpu: Force RCU when unmapping blob
>
> include/qemu/futex.h | 29 ++++++--
> include/qemu/rcu.h | 1 +
> include/qemu/thread-posix.h | 11 +++
> include/qemu/thread.h | 8 ++-
> hw/display/virtio-gpu-virgl.c | 1 +
> util/event.c | 34 ++++++++--
> util/qemu-thread-posix.c | 11 +--
> util/rcu.c | 153 ++++++++++++++++++++++++------------------
> 8 files changed, 163 insertions(+), 85 deletions(-)
> ---
> base-commit: ee7fbe81705732785aef2cb568bbc5d8f7d2fce1
> change-id: 20251027-force_rcu-616c743373f7
>
> Best regards,
--
Alex Bennée
Virtualisation Tech Lead @ Linaro
^ permalink raw reply [flat|nested] 29+ messages in thread