* [PATCH v3 00/10] Improve futex usage
@ 2025-05-11 6:08 Akihiko Odaki
2025-05-11 6:08 ` [PATCH v3 01/10] futex: Check value after qemu_futex_wait() Akihiko Odaki
` (9 more replies)
0 siblings, 10 replies; 29+ messages in thread
From: Akihiko Odaki @ 2025-05-11 6:08 UTC (permalink / raw)
To: Paolo Bonzini, Stefan Weil, Peter Xu, Fabiano Rosas,
Hailiang Zhang
Cc: Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé,
Akihiko Odaki
In a recent discussion, Phil Dennis-Jordan pointed out a quirk in
QemuEvent destruction due to futex-like abstraction, which prevented
the usage of QemuEvent in new and existing code[1]. With some more
thoughts after this discussion, I also found other problem and room
of improvement in futex usage. Here is a stack of patches to resolve
them.
Patch "futex: Check value after qemu_futex_wait()" ensures
qemu_futex_wait() is used in loops as suggested in the man page.
Patch "futex: Support Windows" implements futex functions for Windows.
Patch "qemu-thread: Avoid futex abstraction for non-Linux" and
"qemu-thread: Use futex for QemuEvent on Windows" enable destroying
QemuEvent immediately after qemu_event_wait().
Patch "qemu-thread: Use futex for QemuEvent on Windows" and
"qemu-thread: Use futex if available for QemuLockCnt" make the use of
futex functions added for Windows.
Patches "migration: Replace QemuSemaphore with QemuEvent",
"migration/colo: Replace QemuSemaphore with QemuEvent",
"migration/postcopy: Replace QemuSemaphore with QemuEvent", and
"hw/display/apple-gfx: Replace QemuSemaphore with QemuEvent" replace
some QemuSemaphores with QemuEvents, which can utilize futex. Some of
them rely on that QemuEvent can be destroyed immediately after
qemu_event_wait().
[1]: https://lore.kernel.org/r/CAAibmn3HZeDeK8FrYhHa1GGwc+N8rBuB2VvMRm7LCt0mUGmsYQ@mail.gmail.com
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
Changes in v3:
- Fixed race between qemu_event_reset() and qemu_event_set().
- Prepared for spurious pthread_cond_wait() wakeups.
- Added patch "futex: Replace __linux__ with CONFIG_LINUX".
- Link to v2: https://lore.kernel.org/qemu-devel/20250510-event-v2-0-7953177ce1b8@daynix.com
Changes in v2:
- Rebased.
- Added patch
"hw/display/apple-gfx: Replace QemuSemaphore with QemuEvent".
- Link to v1: https://lore.kernel.org/r/20241225-event-v1-0-a58c8d63eb70@daynix.com
---
Akihiko Odaki (10):
futex: Check value after qemu_futex_wait()
futex: Support Windows
futex: Replace __linux__ with CONFIG_LINUX
qemu-thread: Avoid futex abstraction for non-Linux
qemu-thread: Use futex for QemuEvent on Windows
qemu-thread: Use futex if available for QemuLockCnt
migration: Replace QemuSemaphore with QemuEvent
migration/colo: Replace QemuSemaphore with QemuEvent
migration/postcopy: Replace QemuSemaphore with QemuEvent
hw/display/apple-gfx: Replace QemuSemaphore with QemuEvent
meson.build | 2 +
include/qemu/futex.h | 44 +++++++++++-
include/qemu/lockcnt.h | 2 +-
include/qemu/thread-posix.h | 9 ---
include/qemu/thread-win32.h | 6 --
include/qemu/thread.h | 11 ++-
migration/migration.h | 16 ++---
migration/colo.c | 20 +++---
migration/migration.c | 29 ++++----
migration/postcopy-ram.c | 10 +--
migration/savevm.c | 2 +-
tests/unit/test-aio-multithread.c | 6 +-
util/event.c | 148 ++++++++++++++++++++++++++++++++++++++
util/lockcnt.c | 9 +--
util/qemu-thread-posix.c | 148 --------------------------------------
util/qemu-thread-win32.c | 129 ---------------------------------
hw/display/apple-gfx.m | 10 +--
util/meson.build | 3 +-
18 files changed, 255 insertions(+), 349 deletions(-)
---
base-commit: 4b1f5b73e060971c434e70694d571adfee553027
change-id: 20241031-event-785a2f0dda4a
Best regards,
--
Akihiko Odaki <akihiko.odaki@daynix.com>
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v3 01/10] futex: Check value after qemu_futex_wait()
2025-05-11 6:08 [PATCH v3 00/10] Improve futex usage Akihiko Odaki
@ 2025-05-11 6:08 ` Akihiko Odaki
2025-05-13 14:39 ` Peter Xu
2025-05-11 6:08 ` [PATCH v3 02/10] futex: Support Windows Akihiko Odaki
` (8 subsequent siblings)
9 siblings, 1 reply; 29+ messages in thread
From: Akihiko Odaki @ 2025-05-11 6:08 UTC (permalink / raw)
To: Paolo Bonzini, Stefan Weil, Peter Xu, Fabiano Rosas,
Hailiang Zhang
Cc: Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé,
Akihiko Odaki
futex(2) - Linux manual page
https://man7.org/linux/man-pages/man2/futex.2.html
> Note that a wake-up can also be caused by common futex usage patterns
> in unrelated code that happened to have previously used the futex
> word's memory location (e.g., typical futex-based implementations of
> Pthreads mutexes can cause this under some conditions). Therefore,
> callers should always conservatively assume that a return value of 0
> can mean a spurious wake-up, and use the futex word's value (i.e.,
> the user-space synchronization scheme) to decide whether to continue
> to block or not.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
include/qemu/futex.h | 9 +++++++++
tests/unit/test-aio-multithread.c | 4 +++-
util/qemu-thread-posix.c | 28 ++++++++++++++++------------
3 files changed, 28 insertions(+), 13 deletions(-)
diff --git a/include/qemu/futex.h b/include/qemu/futex.h
index 91ae88966e12..f57774005330 100644
--- a/include/qemu/futex.h
+++ b/include/qemu/futex.h
@@ -24,6 +24,15 @@ static inline void qemu_futex_wake(void *f, int n)
qemu_futex(f, FUTEX_WAKE, n, NULL, NULL, 0);
}
+/*
+ * Note that a wake-up can also be caused by common futex usage patterns in
+ * unrelated code that happened to have previously used the futex word's
+ * memory location (e.g., typical futex-based implementations of Pthreads
+ * mutexes can cause this under some conditions). Therefore, callers should
+ * always conservatively assume that it is a spurious wake-up, and use the futex
+ * word's value (i.e., the user-space synchronization scheme) to decide whether
+ * to continue to block or not.
+ */
static inline void qemu_futex_wait(void *f, unsigned val)
{
while (qemu_futex(f, FUTEX_WAIT, (int) val, NULL, NULL, 0)) {
diff --git a/tests/unit/test-aio-multithread.c b/tests/unit/test-aio-multithread.c
index 08d4570ccb14..8c2e41545a29 100644
--- a/tests/unit/test-aio-multithread.c
+++ b/tests/unit/test-aio-multithread.c
@@ -305,7 +305,9 @@ static void mcs_mutex_lock(void)
prev = qatomic_xchg(&mutex_head, id);
if (prev != -1) {
qatomic_set(&nodes[prev].next, id);
- qemu_futex_wait(&nodes[id].locked, 1);
+ while (qatomic_read(&nodes[id].locked) == 1) {
+ qemu_futex_wait(&nodes[id].locked, 1);
+ }
}
}
diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
index b2e26e21205b..eade5311d175 100644
--- a/util/qemu-thread-posix.c
+++ b/util/qemu-thread-posix.c
@@ -428,17 +428,21 @@ void qemu_event_wait(QemuEvent *ev)
assert(ev->initialized);
- /*
- * qemu_event_wait must synchronize with qemu_event_set even if it does
- * not go down the slow path, so this load-acquire is needed that
- * synchronizes with the first memory barrier in qemu_event_set().
- *
- * If we do go down the slow path, there is no requirement at all: we
- * might miss a qemu_event_set() here but ultimately the memory barrier in
- * qemu_futex_wait() will ensure the check is done correctly.
- */
- value = qatomic_load_acquire(&ev->value);
- if (value != EV_SET) {
+ while (true) {
+ /*
+ * qemu_event_wait must synchronize with qemu_event_set even if it does
+ * not go down the slow path, so this load-acquire is needed that
+ * synchronizes with the first memory barrier in qemu_event_set().
+ *
+ * If we do go down the slow path, there is no requirement at all: we
+ * might miss a qemu_event_set() here but ultimately the memory barrier
+ * in qemu_futex_wait() will ensure the check is done correctly.
+ */
+ value = qatomic_load_acquire(&ev->value);
+ if (value == EV_SET) {
+ break;
+ }
+
if (value == EV_FREE) {
/*
* Leave the event reset and tell qemu_event_set that there are
@@ -452,7 +456,7 @@ void qemu_event_wait(QemuEvent *ev)
* like the load above.
*/
if (qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET) {
- return;
+ break;
}
}
--
2.49.0
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [PATCH v3 02/10] futex: Support Windows
2025-05-11 6:08 [PATCH v3 00/10] Improve futex usage Akihiko Odaki
2025-05-11 6:08 ` [PATCH v3 01/10] futex: Check value after qemu_futex_wait() Akihiko Odaki
@ 2025-05-11 6:08 ` Akihiko Odaki
2025-05-11 6:08 ` [PATCH v3 03/10] futex: Replace __linux__ with CONFIG_LINUX Akihiko Odaki
` (7 subsequent siblings)
9 siblings, 0 replies; 29+ messages in thread
From: Akihiko Odaki @ 2025-05-11 6:08 UTC (permalink / raw)
To: Paolo Bonzini, Stefan Weil, Peter Xu, Fabiano Rosas,
Hailiang Zhang
Cc: Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé,
Akihiko Odaki
Windows supports futex-like APIs since Windows 8 and Windows Server
2012.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
meson.build | 2 ++
include/qemu/futex.h | 53 ++++++++++++++++++++++++++++++---------
tests/unit/test-aio-multithread.c | 2 +-
util/lockcnt.c | 2 +-
util/qemu-thread-posix.c | 4 +--
util/meson.build | 2 +-
6 files changed, 48 insertions(+), 17 deletions(-)
diff --git a/meson.build b/meson.build
index 27f115015285..11d42e4a7d39 100644
--- a/meson.build
+++ b/meson.build
@@ -835,11 +835,13 @@ emulator_link_args = []
midl = not_found
widl = not_found
pathcch = not_found
+synchronization = not_found
host_dsosuf = '.so'
if host_os == 'windows'
midl = find_program('midl', required: false)
widl = find_program('widl', required: false)
pathcch = cc.find_library('pathcch')
+ synchronization = cc.find_library('Synchronization')
socket = cc.find_library('ws2_32')
winmm = cc.find_library('winmm')
diff --git a/include/qemu/futex.h b/include/qemu/futex.h
index f57774005330..607613eec835 100644
--- a/include/qemu/futex.h
+++ b/include/qemu/futex.h
@@ -1,5 +1,5 @@
/*
- * Wrappers around Linux futex syscall
+ * Wrappers around Linux futex syscall and similar
*
* Copyright Red Hat, Inc. 2017
*
@@ -11,28 +11,37 @@
*
*/
+/*
+ * Note that a wake-up can also be caused by common futex usage patterns in
+ * unrelated code that happened to have previously used the futex word's
+ * memory location (e.g., typical futex-based implementations of Pthreads
+ * mutexes can cause this under some conditions). Therefore, qemu_futex_wait()
+ * callers should always conservatively assume that it is a spurious wake-up,
+ * and use the futex word's value (i.e., the user-space synchronization scheme)
+ * to decide whether to continue to block or not.
+ */
+
#ifndef QEMU_FUTEX_H
#define QEMU_FUTEX_H
+#define HAVE_FUTEX
+
+#ifdef CONFIG_LINUX
#include <sys/syscall.h>
#include <linux/futex.h>
#define qemu_futex(...) syscall(__NR_futex, __VA_ARGS__)
-static inline void qemu_futex_wake(void *f, int n)
+static inline void qemu_futex_wake_all(void *f)
{
- qemu_futex(f, FUTEX_WAKE, n, NULL, NULL, 0);
+ qemu_futex(f, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
+}
+
+static inline void qemu_futex_wake_single(void *f)
+{
+ qemu_futex(f, FUTEX_WAKE, 1, NULL, NULL, 0);
}
-/*
- * Note that a wake-up can also be caused by common futex usage patterns in
- * unrelated code that happened to have previously used the futex word's
- * memory location (e.g., typical futex-based implementations of Pthreads
- * mutexes can cause this under some conditions). Therefore, callers should
- * always conservatively assume that it is a spurious wake-up, and use the futex
- * word's value (i.e., the user-space synchronization scheme) to decide whether
- * to continue to block or not.
- */
static inline void qemu_futex_wait(void *f, unsigned val)
{
while (qemu_futex(f, FUTEX_WAIT, (int) val, NULL, NULL, 0)) {
@@ -46,5 +55,25 @@ static inline void qemu_futex_wait(void *f, unsigned val)
}
}
}
+#elif defined(CONFIG_WIN32)
+#include <synchapi.h>
+
+static inline void qemu_futex_wake_all(void *f)
+{
+ WakeByAddressAll(f);
+}
+
+static inline void qemu_futex_wake_single(void *f)
+{
+ WakeByAddressSingle(f);
+}
+
+static inline void qemu_futex_wait(void *f, unsigned val)
+{
+ WaitOnAddress(f, &val, sizeof(val), INFINITE);
+}
+#else
+#undef HAVE_FUTEX
+#endif
#endif /* QEMU_FUTEX_H */
diff --git a/tests/unit/test-aio-multithread.c b/tests/unit/test-aio-multithread.c
index 8c2e41545a29..0ead6bf34ad1 100644
--- a/tests/unit/test-aio-multithread.c
+++ b/tests/unit/test-aio-multithread.c
@@ -330,7 +330,7 @@ static void mcs_mutex_unlock(void)
/* Wake up the next in line. */
next = qatomic_read(&nodes[id].next);
nodes[next].locked = 0;
- qemu_futex_wake(&nodes[next].locked, 1);
+ qemu_futex_wake_single(&nodes[next].locked);
}
static void test_multi_fair_mutex_entry(void *opaque)
diff --git a/util/lockcnt.c b/util/lockcnt.c
index d07c6cc5cee4..ca27d8e61a5c 100644
--- a/util/lockcnt.c
+++ b/util/lockcnt.c
@@ -106,7 +106,7 @@ static bool qemu_lockcnt_cmpxchg_or_wait(QemuLockCnt *lockcnt, int *val,
static void lockcnt_wake(QemuLockCnt *lockcnt)
{
trace_lockcnt_futex_wake(lockcnt);
- qemu_futex_wake(&lockcnt->count, 1);
+ qemu_futex_wake_single(&lockcnt->count);
}
void qemu_lockcnt_inc(QemuLockCnt *lockcnt)
diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
index eade5311d175..13459e44c768 100644
--- a/util/qemu-thread-posix.c
+++ b/util/qemu-thread-posix.c
@@ -345,7 +345,7 @@ static inline void qemu_futex_wait(QemuEvent *ev, unsigned val)
/* Valid transitions:
* - free->set, when setting the event
- * - busy->set, when setting the event, followed by qemu_futex_wake
+ * - busy->set, when setting the event, followed by qemu_futex_wake_all
* - set->free, when resetting the event
* - free->busy, when waiting
*
@@ -400,7 +400,7 @@ void qemu_event_set(QemuEvent *ev)
smp_mb__after_rmw();
if (old == EV_BUSY) {
/* There were waiters, wake them up. */
- qemu_futex_wake(ev, INT_MAX);
+ qemu_futex_wake_all(ev);
}
}
}
diff --git a/util/meson.build b/util/meson.build
index e5cd327e2767..c756558d6763 100644
--- a/util/meson.build
+++ b/util/meson.build
@@ -27,7 +27,7 @@ else
util_ss.add(files('event_notifier-win32.c'))
util_ss.add(files('oslib-win32.c'))
util_ss.add(files('qemu-thread-win32.c'))
- util_ss.add(winmm, pathcch)
+ util_ss.add(winmm, pathcch, synchronization)
endif
util_ss.add(when: linux_io_uring, if_true: files('fdmon-io_uring.c'))
if glib_has_gslice
--
2.49.0
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [PATCH v3 03/10] futex: Replace __linux__ with CONFIG_LINUX
2025-05-11 6:08 [PATCH v3 00/10] Improve futex usage Akihiko Odaki
2025-05-11 6:08 ` [PATCH v3 01/10] futex: Check value after qemu_futex_wait() Akihiko Odaki
2025-05-11 6:08 ` [PATCH v3 02/10] futex: Support Windows Akihiko Odaki
@ 2025-05-11 6:08 ` Akihiko Odaki
2025-05-11 6:08 ` [PATCH v3 04/10] qemu-thread: Avoid futex abstraction for non-Linux Akihiko Odaki
` (6 subsequent siblings)
9 siblings, 0 replies; 29+ messages in thread
From: Akihiko Odaki @ 2025-05-11 6:08 UTC (permalink / raw)
To: Paolo Bonzini, Stefan Weil, Peter Xu, Fabiano Rosas,
Hailiang Zhang
Cc: Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé,
Akihiko Odaki
scripts/checkpatch.pl warns for __linux__ saying "architecture specific
defines should be avoided".
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
include/qemu/thread-posix.h | 2 +-
util/qemu-thread-posix.c | 6 +++---
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/qemu/thread-posix.h b/include/qemu/thread-posix.h
index 5f2f3d1386bc..c412623a9143 100644
--- a/include/qemu/thread-posix.h
+++ b/include/qemu/thread-posix.h
@@ -33,7 +33,7 @@ struct QemuSemaphore {
};
struct QemuEvent {
-#ifndef __linux__
+#ifndef CONFIG_LINUX
pthread_mutex_t lock;
pthread_cond_t cond;
#endif
diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
index 13459e44c768..dfe26b8d020c 100644
--- a/util/qemu-thread-posix.c
+++ b/util/qemu-thread-posix.c
@@ -317,7 +317,7 @@ void qemu_sem_wait(QemuSemaphore *sem)
qemu_mutex_unlock(&sem->mutex);
}
-#ifdef __linux__
+#ifdef CONFIG_LINUX
#include "qemu/futex.h"
#else
static inline void qemu_futex_wake(QemuEvent *ev, int n)
@@ -363,7 +363,7 @@ static inline void qemu_futex_wait(QemuEvent *ev, unsigned val)
void qemu_event_init(QemuEvent *ev, bool init)
{
-#ifndef __linux__
+#ifndef CONFIG_LINUX
pthread_mutex_init(&ev->lock, NULL);
pthread_cond_init(&ev->cond, NULL);
#endif
@@ -376,7 +376,7 @@ void qemu_event_destroy(QemuEvent *ev)
{
assert(ev->initialized);
ev->initialized = false;
-#ifndef __linux__
+#ifndef CONFIG_LINUX
pthread_mutex_destroy(&ev->lock);
pthread_cond_destroy(&ev->cond);
#endif
--
2.49.0
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [PATCH v3 04/10] qemu-thread: Avoid futex abstraction for non-Linux
2025-05-11 6:08 [PATCH v3 00/10] Improve futex usage Akihiko Odaki
` (2 preceding siblings ...)
2025-05-11 6:08 ` [PATCH v3 03/10] futex: Replace __linux__ with CONFIG_LINUX Akihiko Odaki
@ 2025-05-11 6:08 ` Akihiko Odaki
2025-05-14 0:51 ` Paolo Bonzini
2025-05-11 6:08 ` [PATCH v3 05/10] qemu-thread: Use futex for QemuEvent on Windows Akihiko Odaki
` (5 subsequent siblings)
9 siblings, 1 reply; 29+ messages in thread
From: Akihiko Odaki @ 2025-05-11 6:08 UTC (permalink / raw)
To: Paolo Bonzini, Stefan Weil, Peter Xu, Fabiano Rosas,
Hailiang Zhang
Cc: Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé,
Akihiko Odaki
qemu-thread used to abstract pthread primitives into futex for the
QemuEvent implementation of POSIX systems other than Linux. However,
this abstraction has one key difference: unlike futex, pthread
primitives require an explicit destruction, and it must be ordered after
wait and wake operations.
It would be easier to perform destruction if a wait operation ensures
the corresponding wake operation finishes as POSIX semaphore does, but
that requires to protect state accesses in qemu_event_set() and
qemu_event_wait() with a mutex. On the other hand, real futex does not
need such a protection but needs complex barrier and atomic operations
to ensure ordering between the two functions.
Add special implementations of qemu_event_set() and qemu_event_wait()
using pthread primitives. qemu_event_wait() will ensure qemu_event_set()
finishes, and these functions will avoid complex barrier and atomic
operations to ensure ordering between them.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Tested-by: Phil Dennis-Jordan <phil@philjordan.eu>
Reviewed-by: Phil Dennis-Jordan <phil@philjordan.eu>
---
util/qemu-thread-posix.c | 47 +++++++++++++++++++++--------------------------
1 file changed, 21 insertions(+), 26 deletions(-)
diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
index dfe26b8d020c..6dc6f2b25dce 100644
--- a/util/qemu-thread-posix.c
+++ b/util/qemu-thread-posix.c
@@ -319,28 +319,6 @@ void qemu_sem_wait(QemuSemaphore *sem)
#ifdef CONFIG_LINUX
#include "qemu/futex.h"
-#else
-static inline void qemu_futex_wake(QemuEvent *ev, int n)
-{
- assert(ev->initialized);
- pthread_mutex_lock(&ev->lock);
- if (n == 1) {
- pthread_cond_signal(&ev->cond);
- } else {
- pthread_cond_broadcast(&ev->cond);
- }
- pthread_mutex_unlock(&ev->lock);
-}
-
-static inline void qemu_futex_wait(QemuEvent *ev, unsigned val)
-{
- assert(ev->initialized);
- pthread_mutex_lock(&ev->lock);
- if (ev->value == val) {
- pthread_cond_wait(&ev->cond, &ev->lock);
- }
- pthread_mutex_unlock(&ev->lock);
-}
#endif
/* Valid transitions:
@@ -387,12 +365,17 @@ void qemu_event_set(QemuEvent *ev)
assert(ev->initialized);
/*
- * Pairs with both qemu_event_reset() and qemu_event_wait().
+ * Pairs with qemu_event_wait() (on Linux) and qemu_event_reset().
*
* qemu_event_set has release semantics, but because it *loads*
* ev->value we need a full memory barrier here.
+ *
+ * qemu_event_wait() do not have a paired barrier on non-Linux because
+ * ev->lock ensures ordering.
*/
smp_mb();
+
+#ifdef CONFIG_LINUX
if (qatomic_read(&ev->value) != EV_SET) {
int old = qatomic_xchg(&ev->value, EV_SET);
@@ -403,6 +386,12 @@ void qemu_event_set(QemuEvent *ev)
qemu_futex_wake_all(ev);
}
}
+#else
+ pthread_mutex_lock(&ev->lock);
+ qatomic_set(&ev->value, EV_SET);
+ pthread_cond_broadcast(&ev->cond);
+ pthread_mutex_unlock(&ev->lock);
+#endif
}
void qemu_event_reset(QemuEvent *ev)
@@ -424,10 +413,9 @@ void qemu_event_reset(QemuEvent *ev)
void qemu_event_wait(QemuEvent *ev)
{
- unsigned value;
-
assert(ev->initialized);
+#ifdef CONFIG_LINUX
while (true) {
/*
* qemu_event_wait must synchronize with qemu_event_set even if it does
@@ -438,7 +426,7 @@ void qemu_event_wait(QemuEvent *ev)
* might miss a qemu_event_set() here but ultimately the memory barrier
* in qemu_futex_wait() will ensure the check is done correctly.
*/
- value = qatomic_load_acquire(&ev->value);
+ unsigned value = qatomic_load_acquire(&ev->value);
if (value == EV_SET) {
break;
}
@@ -467,6 +455,13 @@ void qemu_event_wait(QemuEvent *ev)
*/
qemu_futex_wait(ev, EV_BUSY);
}
+#else
+ pthread_mutex_lock(&ev->lock);
+ while (qatomic_read(&ev->value) != EV_SET) {
+ pthread_cond_wait(&ev->cond, &ev->lock);
+ }
+ pthread_mutex_unlock(&ev->lock);
+#endif
}
static __thread NotifierList thread_exit;
--
2.49.0
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [PATCH v3 05/10] qemu-thread: Use futex for QemuEvent on Windows
2025-05-11 6:08 [PATCH v3 00/10] Improve futex usage Akihiko Odaki
` (3 preceding siblings ...)
2025-05-11 6:08 ` [PATCH v3 04/10] qemu-thread: Avoid futex abstraction for non-Linux Akihiko Odaki
@ 2025-05-11 6:08 ` Akihiko Odaki
2025-05-11 6:08 ` [PATCH v3 06/10] qemu-thread: Use futex if available for QemuLockCnt Akihiko Odaki
` (4 subsequent siblings)
9 siblings, 0 replies; 29+ messages in thread
From: Akihiko Odaki @ 2025-05-11 6:08 UTC (permalink / raw)
To: Paolo Bonzini, Stefan Weil, Peter Xu, Fabiano Rosas,
Hailiang Zhang
Cc: Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé,
Akihiko Odaki
Use the futex-based implementation of QemuEvent on Windows to
remove code duplication and remove the overhead of event object
construction and destruction.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
include/qemu/thread-posix.h | 9 ---
include/qemu/thread-win32.h | 6 --
include/qemu/thread.h | 11 +++-
util/event.c | 148 ++++++++++++++++++++++++++++++++++++++++++++
util/qemu-thread-posix.c | 147 -------------------------------------------
util/qemu-thread-win32.c | 129 --------------------------------------
util/meson.build | 1 +
7 files changed, 159 insertions(+), 292 deletions(-)
diff --git a/include/qemu/thread-posix.h b/include/qemu/thread-posix.h
index c412623a9143..758808b705e4 100644
--- a/include/qemu/thread-posix.h
+++ b/include/qemu/thread-posix.h
@@ -32,15 +32,6 @@ struct QemuSemaphore {
unsigned int count;
};
-struct QemuEvent {
-#ifndef CONFIG_LINUX
- pthread_mutex_t lock;
- pthread_cond_t cond;
-#endif
- unsigned value;
- bool initialized;
-};
-
struct QemuThread {
pthread_t thread;
};
diff --git a/include/qemu/thread-win32.h b/include/qemu/thread-win32.h
index d95af4498fc9..da9e7322990c 100644
--- a/include/qemu/thread-win32.h
+++ b/include/qemu/thread-win32.h
@@ -28,12 +28,6 @@ struct QemuSemaphore {
bool initialized;
};
-struct QemuEvent {
- int value;
- HANDLE event;
- bool initialized;
-};
-
typedef struct QemuThreadData QemuThreadData;
struct QemuThread {
QemuThreadData *data;
diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index 6f800aad31a9..573f8c9ede20 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -3,13 +3,22 @@
#include "qemu/processor.h"
#include "qemu/atomic.h"
+#include "qemu/futex.h"
typedef struct QemuCond QemuCond;
typedef struct QemuSemaphore QemuSemaphore;
-typedef struct QemuEvent QemuEvent;
typedef struct QemuLockCnt QemuLockCnt;
typedef struct QemuThread QemuThread;
+typedef struct QemuEvent {
+#ifndef HAVE_FUTEX
+ pthread_mutex_t lock;
+ pthread_cond_t cond;
+#endif
+ unsigned value;
+ bool initialized;
+} QemuEvent;
+
#ifdef _WIN32
#include "qemu/thread-win32.h"
#else
diff --git a/util/event.c b/util/event.c
new file mode 100644
index 000000000000..20853d61a7cd
--- /dev/null
+++ b/util/event.c
@@ -0,0 +1,148 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+#include "qemu/osdep.h"
+#include "qemu/thread.h"
+
+/*
+ * Valid transitions:
+ * - free->set, when setting the event
+ * - busy->set, when setting the event, followed by qemu_futex_wake_all
+ * - set->free, when resetting the event
+ * - free->busy, when waiting
+ *
+ * set->busy does not happen (it can be observed from the outside but
+ * it really is set->free->busy).
+ *
+ * busy->free provably cannot happen; to enforce it, the set->free transition
+ * is done with an OR, which becomes a no-op if the event has concurrently
+ * transitioned to free or busy.
+ */
+
+#define EV_SET 0
+#define EV_FREE 1
+#define EV_BUSY -1
+
+void qemu_event_init(QemuEvent *ev, bool init)
+{
+#ifndef HAVE_FUTEX
+ pthread_mutex_init(&ev->lock, NULL);
+ pthread_cond_init(&ev->cond, NULL);
+#endif
+
+ ev->value = (init ? EV_SET : EV_FREE);
+ ev->initialized = true;
+}
+
+void qemu_event_destroy(QemuEvent *ev)
+{
+ assert(ev->initialized);
+ ev->initialized = false;
+#ifndef HAVE_FUTEX
+ pthread_mutex_destroy(&ev->lock);
+ pthread_cond_destroy(&ev->cond);
+#endif
+}
+
+void qemu_event_set(QemuEvent *ev)
+{
+ assert(ev->initialized);
+
+ /*
+ * Pairs with qemu_event_wait() (on Linux) and qemu_event_reset().
+ *
+ * qemu_event_set has release semantics, but because it *loads*
+ * ev->value we need a full memory barrier here.
+ *
+ * qemu_event_wait() do not have a paired barrier on non-Linux because
+ * ev->lock ensures ordering.
+ */
+ smp_mb();
+
+#ifdef HAVE_FUTEX
+ if (qatomic_read(&ev->value) != EV_SET) {
+ int old = qatomic_xchg(&ev->value, EV_SET);
+
+ /* Pairs with memory barrier in kernel futex_wait system call. */
+ smp_mb__after_rmw();
+ if (old == EV_BUSY) {
+ /* There were waiters, wake them up. */
+ qemu_futex_wake_all(ev);
+ }
+ }
+#else
+ pthread_mutex_lock(&ev->lock);
+ qatomic_set(&ev->value, EV_SET);
+ pthread_cond_broadcast(&ev->cond);
+ pthread_mutex_unlock(&ev->lock);
+#endif
+}
+
+void qemu_event_reset(QemuEvent *ev)
+{
+ assert(ev->initialized);
+
+ /*
+ * If there was a concurrent reset (or even reset+wait),
+ * do nothing. Otherwise change EV_SET->EV_FREE.
+ */
+ qatomic_or(&ev->value, EV_FREE);
+
+ /*
+ * Order reset before checking the condition in the caller.
+ * Pairs with the first memory barrier in qemu_event_set().
+ */
+ smp_mb__after_rmw();
+}
+
+void qemu_event_wait(QemuEvent *ev)
+{
+ assert(ev->initialized);
+
+#ifdef HAVE_FUTEX
+ while (true) {
+ /*
+ * qemu_event_wait must synchronize with qemu_event_set even if it does
+ * not go down the slow path, so this load-acquire is needed that
+ * synchronizes with the first memory barrier in qemu_event_set().
+ *
+ * If we do go down the slow path, there is no requirement at all: we
+ * might miss a qemu_event_set() here but ultimately the memory barrier
+ * in qemu_futex_wait() will ensure the check is done correctly.
+ */
+ unsigned value = qatomic_load_acquire(&ev->value);
+ if (value == EV_SET) {
+ break;
+ }
+
+ if (value == EV_FREE) {
+ /*
+ * Leave the event reset and tell qemu_event_set that there are
+ * waiters. No need to retry, because there cannot be a concurrent
+ * busy->free transition. After the CAS, the event will be either
+ * set or busy.
+ *
+ * This cmpxchg doesn't have particular ordering requirements if it
+ * succeeds (moving the store earlier can only cause
+ * qemu_event_set() to issue _more_ wakeups), the failing case needs
+ * acquire semantics like the load above.
+ */
+ if (qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET) {
+ break;
+ }
+ }
+
+ /*
+ * This is the final check for a concurrent set, so it does need
+ * a smp_mb() pairing with the second barrier of qemu_event_set().
+ * The barrier is inside the FUTEX_WAIT system call.
+ */
+ qemu_futex_wait(ev, EV_BUSY);
+ }
+#else
+ pthread_mutex_lock(&ev->lock);
+ while (qatomic_read(&ev->value) != EV_SET) {
+ pthread_cond_wait(&ev->cond, &ev->lock);
+ }
+ pthread_mutex_unlock(&ev->lock);
+#endif
+}
diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
index 6dc6f2b25dce..ba725444ba63 100644
--- a/util/qemu-thread-posix.c
+++ b/util/qemu-thread-posix.c
@@ -317,153 +317,6 @@ void qemu_sem_wait(QemuSemaphore *sem)
qemu_mutex_unlock(&sem->mutex);
}
-#ifdef CONFIG_LINUX
-#include "qemu/futex.h"
-#endif
-
-/* Valid transitions:
- * - free->set, when setting the event
- * - busy->set, when setting the event, followed by qemu_futex_wake_all
- * - set->free, when resetting the event
- * - free->busy, when waiting
- *
- * set->busy does not happen (it can be observed from the outside but
- * it really is set->free->busy).
- *
- * busy->free provably cannot happen; to enforce it, the set->free transition
- * is done with an OR, which becomes a no-op if the event has concurrently
- * transitioned to free or busy.
- */
-
-#define EV_SET 0
-#define EV_FREE 1
-#define EV_BUSY -1
-
-void qemu_event_init(QemuEvent *ev, bool init)
-{
-#ifndef CONFIG_LINUX
- pthread_mutex_init(&ev->lock, NULL);
- pthread_cond_init(&ev->cond, NULL);
-#endif
-
- ev->value = (init ? EV_SET : EV_FREE);
- ev->initialized = true;
-}
-
-void qemu_event_destroy(QemuEvent *ev)
-{
- assert(ev->initialized);
- ev->initialized = false;
-#ifndef CONFIG_LINUX
- pthread_mutex_destroy(&ev->lock);
- pthread_cond_destroy(&ev->cond);
-#endif
-}
-
-void qemu_event_set(QemuEvent *ev)
-{
- assert(ev->initialized);
-
- /*
- * Pairs with qemu_event_wait() (on Linux) and qemu_event_reset().
- *
- * qemu_event_set has release semantics, but because it *loads*
- * ev->value we need a full memory barrier here.
- *
- * qemu_event_wait() do not have a paired barrier on non-Linux because
- * ev->lock ensures ordering.
- */
- smp_mb();
-
-#ifdef CONFIG_LINUX
- if (qatomic_read(&ev->value) != EV_SET) {
- int old = qatomic_xchg(&ev->value, EV_SET);
-
- /* Pairs with memory barrier in kernel futex_wait system call. */
- smp_mb__after_rmw();
- if (old == EV_BUSY) {
- /* There were waiters, wake them up. */
- qemu_futex_wake_all(ev);
- }
- }
-#else
- pthread_mutex_lock(&ev->lock);
- qatomic_set(&ev->value, EV_SET);
- pthread_cond_broadcast(&ev->cond);
- pthread_mutex_unlock(&ev->lock);
-#endif
-}
-
-void qemu_event_reset(QemuEvent *ev)
-{
- assert(ev->initialized);
-
- /*
- * If there was a concurrent reset (or even reset+wait),
- * do nothing. Otherwise change EV_SET->EV_FREE.
- */
- qatomic_or(&ev->value, EV_FREE);
-
- /*
- * Order reset before checking the condition in the caller.
- * Pairs with the first memory barrier in qemu_event_set().
- */
- smp_mb__after_rmw();
-}
-
-void qemu_event_wait(QemuEvent *ev)
-{
- assert(ev->initialized);
-
-#ifdef CONFIG_LINUX
- while (true) {
- /*
- * qemu_event_wait must synchronize with qemu_event_set even if it does
- * not go down the slow path, so this load-acquire is needed that
- * synchronizes with the first memory barrier in qemu_event_set().
- *
- * If we do go down the slow path, there is no requirement at all: we
- * might miss a qemu_event_set() here but ultimately the memory barrier
- * in qemu_futex_wait() will ensure the check is done correctly.
- */
- unsigned value = qatomic_load_acquire(&ev->value);
- if (value == EV_SET) {
- break;
- }
-
- if (value == EV_FREE) {
- /*
- * Leave the event reset and tell qemu_event_set that there are
- * waiters. No need to retry, because there cannot be a concurrent
- * busy->free transition. After the CAS, the event will be either
- * set or busy.
- *
- * This cmpxchg doesn't have particular ordering requirements if it
- * succeeds (moving the store earlier can only cause qemu_event_set()
- * to issue _more_ wakeups), the failing case needs acquire semantics
- * like the load above.
- */
- if (qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET) {
- break;
- }
- }
-
- /*
- * This is the final check for a concurrent set, so it does need
- * a smp_mb() pairing with the second barrier of qemu_event_set().
- * The barrier is inside the FUTEX_WAIT system call.
- */
- qemu_futex_wait(ev, EV_BUSY);
- }
-#else
- pthread_mutex_lock(&ev->lock);
- while (qatomic_read(&ev->value) != EV_SET) {
- pthread_cond_wait(&ev->cond, &ev->lock);
- }
- pthread_mutex_unlock(&ev->lock);
-#endif
-}
-
static __thread NotifierList thread_exit;
/*
diff --git a/util/qemu-thread-win32.c b/util/qemu-thread-win32.c
index a7fe3cc345f0..ca2e0b512e26 100644
--- a/util/qemu-thread-win32.c
+++ b/util/qemu-thread-win32.c
@@ -231,135 +231,6 @@ void qemu_sem_wait(QemuSemaphore *sem)
}
}
-/* Wrap a Win32 manual-reset event with a fast userspace path. The idea
- * is to reset the Win32 event lazily, as part of a test-reset-test-wait
- * sequence. Such a sequence is, indeed, how QemuEvents are used by
- * RCU and other subsystems!
- *
- * Valid transitions:
- * - free->set, when setting the event
- * - busy->set, when setting the event, followed by SetEvent
- * - set->free, when resetting the event
- * - free->busy, when waiting
- *
- * set->busy does not happen (it can be observed from the outside but
- * it really is set->free->busy).
- *
- * busy->free provably cannot happen; to enforce it, the set->free transition
- * is done with an OR, which becomes a no-op if the event has concurrently
- * transitioned to free or busy (and is faster than cmpxchg).
- */
-
-#define EV_SET 0
-#define EV_FREE 1
-#define EV_BUSY -1
-
-void qemu_event_init(QemuEvent *ev, bool init)
-{
- /* Manual reset. */
- ev->event = CreateEvent(NULL, TRUE, TRUE, NULL);
- ev->value = (init ? EV_SET : EV_FREE);
- ev->initialized = true;
-}
-
-void qemu_event_destroy(QemuEvent *ev)
-{
- assert(ev->initialized);
- ev->initialized = false;
- CloseHandle(ev->event);
-}
-
-void qemu_event_set(QemuEvent *ev)
-{
- assert(ev->initialized);
-
- /*
- * Pairs with both qemu_event_reset() and qemu_event_wait().
- *
- * qemu_event_set has release semantics, but because it *loads*
- * ev->value we need a full memory barrier here.
- */
- smp_mb();
- if (qatomic_read(&ev->value) != EV_SET) {
- int old = qatomic_xchg(&ev->value, EV_SET);
-
- /* Pairs with memory barrier after ResetEvent. */
- smp_mb__after_rmw();
- if (old == EV_BUSY) {
- /* There were waiters, wake them up. */
- SetEvent(ev->event);
- }
- }
-}
-
-void qemu_event_reset(QemuEvent *ev)
-{
- assert(ev->initialized);
-
- /*
- * If there was a concurrent reset (or even reset+wait),
- * do nothing. Otherwise change EV_SET->EV_FREE.
- */
- qatomic_or(&ev->value, EV_FREE);
-
- /*
- * Order reset before checking the condition in the caller.
- * Pairs with the first memory barrier in qemu_event_set().
- */
- smp_mb__after_rmw();
-}
-
-void qemu_event_wait(QemuEvent *ev)
-{
- unsigned value;
-
- assert(ev->initialized);
-
- /*
- * qemu_event_wait must synchronize with qemu_event_set even if it does
- * not go down the slow path, so this load-acquire is needed that
- * synchronizes with the first memory barrier in qemu_event_set().
- *
- * If we do go down the slow path, there is no requirement at all: we
- * might miss a qemu_event_set() here but ultimately the memory barrier in
- * qemu_futex_wait() will ensure the check is done correctly.
- */
- value = qatomic_load_acquire(&ev->value);
- if (value != EV_SET) {
- if (value == EV_FREE) {
- /*
- * Here the underlying kernel event is reset, but qemu_event_set is
- * not yet going to call SetEvent. However, there will be another
- * check for EV_SET below when setting EV_BUSY. At that point it
- * is safe to call WaitForSingleObject.
- */
- ResetEvent(ev->event);
-
- /*
- * It is not clear whether ResetEvent provides this barrier; kernel
- * APIs (KeResetEvent/KeClearEvent) do not. Better safe than sorry!
- */
- smp_mb();
-
- /*
- * Leave the event reset and tell qemu_event_set that there are
- * waiters. No need to retry, because there cannot be a concurrent
- * busy->free transition. After the CAS, the event will be either
- * set or busy.
- */
- if (qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET) {
- return;
- }
- }
-
- /*
- * ev->value is now EV_BUSY. Since we didn't observe EV_SET,
- * qemu_event_set() must observe EV_BUSY and call SetEvent().
- */
- WaitForSingleObject(ev->event, INFINITE);
- }
-}
-
struct QemuThreadData {
/* Passed to win32_start_routine. */
void *(*start_routine)(void *);
diff --git a/util/meson.build b/util/meson.build
index c756558d6763..58033131328a 100644
--- a/util/meson.build
+++ b/util/meson.build
@@ -35,6 +35,7 @@ if glib_has_gslice
endif
util_ss.add(files('defer-call.c'))
util_ss.add(files('envlist.c', 'path.c', 'module.c'))
+util_ss.add(files('event.c'))
util_ss.add(files('host-utils.c'))
util_ss.add(files('bitmap.c', 'bitops.c'))
util_ss.add(files('fifo8.c'))
--
2.49.0
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [PATCH v3 06/10] qemu-thread: Use futex if available for QemuLockCnt
2025-05-11 6:08 [PATCH v3 00/10] Improve futex usage Akihiko Odaki
` (4 preceding siblings ...)
2025-05-11 6:08 ` [PATCH v3 05/10] qemu-thread: Use futex for QemuEvent on Windows Akihiko Odaki
@ 2025-05-11 6:08 ` Akihiko Odaki
2025-05-11 6:08 ` [PATCH v3 07/10] migration: Replace QemuSemaphore with QemuEvent Akihiko Odaki
` (3 subsequent siblings)
9 siblings, 0 replies; 29+ messages in thread
From: Akihiko Odaki @ 2025-05-11 6:08 UTC (permalink / raw)
To: Paolo Bonzini, Stefan Weil, Peter Xu, Fabiano Rosas,
Hailiang Zhang
Cc: Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé,
Akihiko Odaki
This unlocks the futex-based implementation of QemuLockCnt to Windows.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
include/qemu/lockcnt.h | 2 +-
util/lockcnt.c | 7 ++++---
2 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/include/qemu/lockcnt.h b/include/qemu/lockcnt.h
index f4b62a3f7011..5a2800e3f182 100644
--- a/include/qemu/lockcnt.h
+++ b/include/qemu/lockcnt.h
@@ -17,7 +17,7 @@
typedef struct QemuLockCnt QemuLockCnt;
struct QemuLockCnt {
-#ifndef CONFIG_LINUX
+#ifndef HAVE_FUTEX
QemuMutex mutex;
#endif
unsigned count;
diff --git a/util/lockcnt.c b/util/lockcnt.c
index ca27d8e61a5c..92c9f8ceca88 100644
--- a/util/lockcnt.c
+++ b/util/lockcnt.c
@@ -12,10 +12,11 @@
#include "qemu/atomic.h"
#include "trace.h"
-#ifdef CONFIG_LINUX
-#include "qemu/futex.h"
+#ifdef HAVE_FUTEX
-/* On Linux, bits 0-1 are a futex-based lock, bits 2-31 are the counter.
+/*
+ * When futex is available, bits 0-1 are a futex-based lock, bits 2-31 are the
+ * counter.
* For the mutex algorithm see Ulrich Drepper's "Futexes Are Tricky" (ok,
* this is not the most relaxing citation I could make...). It is similar
* to mutex2 in the paper.
--
2.49.0
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [PATCH v3 07/10] migration: Replace QemuSemaphore with QemuEvent
2025-05-11 6:08 [PATCH v3 00/10] Improve futex usage Akihiko Odaki
` (5 preceding siblings ...)
2025-05-11 6:08 ` [PATCH v3 06/10] qemu-thread: Use futex if available for QemuLockCnt Akihiko Odaki
@ 2025-05-11 6:08 ` Akihiko Odaki
2025-05-13 19:13 ` Peter Xu
2025-05-11 6:08 ` [PATCH v3 08/10] migration/colo: " Akihiko Odaki
` (2 subsequent siblings)
9 siblings, 1 reply; 29+ messages in thread
From: Akihiko Odaki @ 2025-05-11 6:08 UTC (permalink / raw)
To: Paolo Bonzini, Stefan Weil, Peter Xu, Fabiano Rosas,
Hailiang Zhang
Cc: Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé,
Akihiko Odaki
rp_pong_acks tells if it has ever received one pong. QemuEvent is
better suited for this usage because it represents a boolean rather
than integer and will not decrement with the wait operation.
pause_event can utilize qemu_event_reset() to discard events.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
migration/migration.h | 6 +++---
migration/migration.c | 29 +++++++++++++----------------
2 files changed, 16 insertions(+), 19 deletions(-)
diff --git a/migration/migration.h b/migration/migration.h
index d53f7cad84d8..11dba5f4da77 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -342,11 +342,11 @@ struct MigrationState {
*/
QemuSemaphore rp_sem;
/*
- * We post to this when we got one PONG from dest. So far it's an
+ * We set this when we got one PONG from dest. So far it's an
* easy way to know the main channel has successfully established
* on dest QEMU.
*/
- QemuSemaphore rp_pong_acks;
+ QemuEvent rp_pong_acks;
} rp_state;
double mbps;
@@ -379,7 +379,7 @@ struct MigrationState {
QemuSemaphore wait_unplug_sem;
/* Migration is paused due to pause-before-switchover */
- QemuSemaphore pause_sem;
+ QemuEvent pause_event;
/* The semaphore is used to notify COLO thread that failover is finished */
QemuSemaphore colo_exit_sem;
diff --git a/migration/migration.c b/migration/migration.c
index 4697732bef91..053c23c58d82 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1630,7 +1630,7 @@ void migration_cancel(void)
}
/* If the migration is paused, kick it out of the pause */
if (old_state == MIGRATION_STATUS_PRE_SWITCHOVER) {
- qemu_sem_post(&s->pause_sem);
+ qemu_event_set(&s->pause_event);
}
migrate_set_state(&s->state, old_state, MIGRATION_STATUS_CANCELLING);
} while (s->state != MIGRATION_STATUS_CANCELLING);
@@ -2342,7 +2342,7 @@ void qmp_migrate_continue(MigrationStatus state, Error **errp)
MigrationStatus_str(s->state));
return;
}
- qemu_sem_post(&s->pause_sem);
+ qemu_event_set(&s->pause_event);
}
int migration_rp_wait(MigrationState *s)
@@ -2550,7 +2550,7 @@ static void *source_return_path_thread(void *opaque)
case MIG_RP_MSG_PONG:
tmp32 = ldl_be_p(buf);
trace_source_return_path_thread_pong(tmp32);
- qemu_sem_post(&ms->rp_state.rp_pong_acks);
+ qemu_event_set(&ms->rp_state.rp_pong_acks);
break;
case MIG_RP_MSG_REQ_PAGES:
@@ -2693,7 +2693,7 @@ static inline void
migration_wait_main_channel(MigrationState *ms)
{
/* Wait until one PONG message received */
- qemu_sem_wait(&ms->rp_state.rp_pong_acks);
+ qemu_event_wait(&ms->rp_state.rp_pong_acks);
}
/*
@@ -2911,21 +2911,18 @@ static bool migration_switchover_prepare(MigrationState *s)
return true;
}
- /* Since leaving this state is not atomic with posting the semaphore
+ /*
+ * Since leaving this state is not atomic with setting the event
* it's possible that someone could have issued multiple migrate_continue
- * and the semaphore is incorrectly positive at this point;
- * the docs say it's undefined to reinit a semaphore that's already
- * init'd, so use timedwait to eat up any existing posts.
+ * and the event is incorrectly set at this point so reset it.
*/
- while (qemu_sem_timedwait(&s->pause_sem, 1) == 0) {
- /* This block intentionally left blank */
- }
+ qemu_event_reset(&s->pause_event);
/* Update [POSTCOPY_]ACTIVE to PRE_SWITCHOVER */
migrate_set_state(&s->state, s->state, MIGRATION_STATUS_PRE_SWITCHOVER);
bql_unlock();
- qemu_sem_wait(&s->pause_sem);
+ qemu_event_wait(&s->pause_event);
bql_lock();
/*
@@ -4057,10 +4054,10 @@ static void migration_instance_finalize(Object *obj)
qemu_mutex_destroy(&ms->qemu_file_lock);
qemu_sem_destroy(&ms->wait_unplug_sem);
qemu_sem_destroy(&ms->rate_limit_sem);
- qemu_sem_destroy(&ms->pause_sem);
+ qemu_event_destroy(&ms->pause_event);
qemu_sem_destroy(&ms->postcopy_pause_sem);
qemu_sem_destroy(&ms->rp_state.rp_sem);
- qemu_sem_destroy(&ms->rp_state.rp_pong_acks);
+ qemu_event_destroy(&ms->rp_state.rp_pong_acks);
qemu_sem_destroy(&ms->postcopy_qemufile_src_sem);
error_free(ms->error);
}
@@ -4072,14 +4069,14 @@ static void migration_instance_init(Object *obj)
ms->state = MIGRATION_STATUS_NONE;
ms->mbps = -1;
ms->pages_per_second = -1;
- qemu_sem_init(&ms->pause_sem, 0);
+ qemu_event_init(&ms->pause_event, false);
qemu_mutex_init(&ms->error_mutex);
migrate_params_init(&ms->parameters);
qemu_sem_init(&ms->postcopy_pause_sem, 0);
qemu_sem_init(&ms->rp_state.rp_sem, 0);
- qemu_sem_init(&ms->rp_state.rp_pong_acks, 0);
+ qemu_event_init(&ms->rp_state.rp_pong_acks, false);
qemu_sem_init(&ms->rate_limit_sem, 0);
qemu_sem_init(&ms->wait_unplug_sem, 0);
qemu_sem_init(&ms->postcopy_qemufile_src_sem, 0);
--
2.49.0
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [PATCH v3 08/10] migration/colo: Replace QemuSemaphore with QemuEvent
2025-05-11 6:08 [PATCH v3 00/10] Improve futex usage Akihiko Odaki
` (6 preceding siblings ...)
2025-05-11 6:08 ` [PATCH v3 07/10] migration: Replace QemuSemaphore with QemuEvent Akihiko Odaki
@ 2025-05-11 6:08 ` Akihiko Odaki
2025-05-12 15:12 ` Fabiano Rosas
2025-05-11 6:08 ` [PATCH v3 09/10] migration/postcopy: " Akihiko Odaki
2025-05-11 6:08 ` [PATCH v3 10/10] hw/display/apple-gfx: " Akihiko Odaki
9 siblings, 1 reply; 29+ messages in thread
From: Akihiko Odaki @ 2025-05-11 6:08 UTC (permalink / raw)
To: Paolo Bonzini, Stefan Weil, Peter Xu, Fabiano Rosas,
Hailiang Zhang
Cc: Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé,
Akihiko Odaki
colo_exit_sem and colo_incoming_sem represent one-shot events so they
can be converted into QemuEvent, which is more lightweight.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
migration/migration.h | 6 +++---
migration/colo.c | 20 ++++++++++----------
2 files changed, 13 insertions(+), 13 deletions(-)
diff --git a/migration/migration.h b/migration/migration.h
index 11dba5f4da77..eec49bf3f893 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -186,7 +186,7 @@ struct MigrationIncomingState {
/* The coroutine we should enter (back) after failover */
Coroutine *colo_incoming_co;
- QemuSemaphore colo_incoming_sem;
+ QemuEvent colo_incoming_event;
/* Optional load threads pool and its thread exit request flag */
ThreadPool *load_threads;
@@ -381,8 +381,8 @@ struct MigrationState {
/* Migration is paused due to pause-before-switchover */
QemuEvent pause_event;
- /* The semaphore is used to notify COLO thread that failover is finished */
- QemuSemaphore colo_exit_sem;
+ /* The event is used to notify COLO thread that failover is finished */
+ QemuEvent colo_exit_event;
/* The event is used to notify COLO thread to do checkpoint */
QemuEvent colo_checkpoint_event;
diff --git a/migration/colo.c b/migration/colo.c
index c976b3ff344d..e0f713c837f5 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -146,7 +146,7 @@ static void secondary_vm_do_failover(void)
return;
}
/* Notify COLO incoming thread that failover work is finished */
- qemu_sem_post(&mis->colo_incoming_sem);
+ qemu_event_set(&mis->colo_incoming_event);
/* For Secondary VM, jump to incoming co */
if (mis->colo_incoming_co) {
@@ -195,7 +195,7 @@ static void primary_vm_do_failover(void)
}
/* Notify COLO thread that failover work is finished */
- qemu_sem_post(&s->colo_exit_sem);
+ qemu_event_set(&s->colo_exit_event);
}
COLOMode get_colo_mode(void)
@@ -620,8 +620,8 @@ out:
}
/* Hope this not to be too long to wait here */
- qemu_sem_wait(&s->colo_exit_sem);
- qemu_sem_destroy(&s->colo_exit_sem);
+ qemu_event_wait(&s->colo_exit_event);
+ qemu_event_destroy(&s->colo_exit_event);
/*
* It is safe to unregister notifier after failover finished.
@@ -651,7 +651,7 @@ void migrate_start_colo_process(MigrationState *s)
s->colo_delay_timer = timer_new_ms(QEMU_CLOCK_HOST,
colo_checkpoint_notify_timer, NULL);
- qemu_sem_init(&s->colo_exit_sem, 0);
+ qemu_event_init(&s->colo_exit_event, false);
colo_process_checkpoint(s);
bql_lock();
}
@@ -808,11 +808,11 @@ void colo_shutdown(void)
case COLO_MODE_PRIMARY:
s = migrate_get_current();
qemu_event_set(&s->colo_checkpoint_event);
- qemu_sem_post(&s->colo_exit_sem);
+ qemu_event_set(&s->colo_exit_event);
break;
case COLO_MODE_SECONDARY:
mis = migration_incoming_get_current();
- qemu_sem_post(&mis->colo_incoming_sem);
+ qemu_event_set(&mis->colo_incoming_event);
break;
default:
break;
@@ -827,7 +827,7 @@ static void *colo_process_incoming_thread(void *opaque)
Error *local_err = NULL;
rcu_register_thread();
- qemu_sem_init(&mis->colo_incoming_sem, 0);
+ qemu_event_init(&mis->colo_incoming_event, false);
migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
MIGRATION_STATUS_COLO);
@@ -923,8 +923,8 @@ out:
}
/* Hope this not to be too long to loop here */
- qemu_sem_wait(&mis->colo_incoming_sem);
- qemu_sem_destroy(&mis->colo_incoming_sem);
+ qemu_event_wait(&mis->colo_incoming_event);
+ qemu_event_destroy(&mis->colo_incoming_event);
rcu_unregister_thread();
return NULL;
--
2.49.0
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [PATCH v3 09/10] migration/postcopy: Replace QemuSemaphore with QemuEvent
2025-05-11 6:08 [PATCH v3 00/10] Improve futex usage Akihiko Odaki
` (7 preceding siblings ...)
2025-05-11 6:08 ` [PATCH v3 08/10] migration/colo: " Akihiko Odaki
@ 2025-05-11 6:08 ` Akihiko Odaki
2025-05-12 15:12 ` Fabiano Rosas
2025-05-11 6:08 ` [PATCH v3 10/10] hw/display/apple-gfx: " Akihiko Odaki
9 siblings, 1 reply; 29+ messages in thread
From: Akihiko Odaki @ 2025-05-11 6:08 UTC (permalink / raw)
To: Paolo Bonzini, Stefan Weil, Peter Xu, Fabiano Rosas,
Hailiang Zhang
Cc: Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé,
Akihiko Odaki
thread_sync_sem is an one-shot event so it can be converted into
QemuEvent, which is more lightweight.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
migration/migration.h | 4 ++--
migration/postcopy-ram.c | 10 +++++-----
migration/savevm.c | 2 +-
3 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/migration/migration.h b/migration/migration.h
index eec49bf3f893..fd45a0f20c0f 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -98,9 +98,9 @@ struct MigrationIncomingState {
void (*transport_cleanup)(void *data);
/*
* Used to sync thread creations. Note that we can't create threads in
- * parallel with this sem.
+ * parallel with this event.
*/
- QemuSemaphore thread_sync_sem;
+ QemuEvent thread_sync_event;
/*
* Free at the start of the main state load, set as the main thread finishes
* loading state.
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 995614b38c9d..75fd310fb2b0 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -90,10 +90,10 @@ void postcopy_thread_create(MigrationIncomingState *mis,
QemuThread *thread, const char *name,
void *(*fn)(void *), int joinable)
{
- qemu_sem_init(&mis->thread_sync_sem, 0);
+ qemu_event_init(&mis->thread_sync_event, false);
qemu_thread_create(thread, name, fn, mis, joinable);
- qemu_sem_wait(&mis->thread_sync_sem);
- qemu_sem_destroy(&mis->thread_sync_sem);
+ qemu_event_wait(&mis->thread_sync_event);
+ qemu_event_destroy(&mis->thread_sync_event);
}
/* Postcopy needs to detect accesses to pages that haven't yet been copied
@@ -964,7 +964,7 @@ static void *postcopy_ram_fault_thread(void *opaque)
trace_postcopy_ram_fault_thread_entry();
rcu_register_thread();
mis->last_rb = NULL; /* last RAMBlock we sent part of */
- qemu_sem_post(&mis->thread_sync_sem);
+ qemu_event_set(&mis->thread_sync_event);
struct pollfd *pfd;
size_t pfd_len = 2 + mis->postcopy_remote_fds->len;
@@ -1716,7 +1716,7 @@ void *postcopy_preempt_thread(void *opaque)
rcu_register_thread();
- qemu_sem_post(&mis->thread_sync_sem);
+ qemu_event_set(&mis->thread_sync_event);
/*
* The preempt channel is established in asynchronous way. Wait
diff --git a/migration/savevm.c b/migration/savevm.c
index 006514c3e301..52105dd2f10b 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2078,7 +2078,7 @@ static void *postcopy_ram_listen_thread(void *opaque)
migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
MIGRATION_STATUS_POSTCOPY_ACTIVE);
- qemu_sem_post(&mis->thread_sync_sem);
+ qemu_event_set(&mis->thread_sync_event);
trace_postcopy_ram_listen_thread_start();
rcu_register_thread();
--
2.49.0
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [PATCH v3 10/10] hw/display/apple-gfx: Replace QemuSemaphore with QemuEvent
2025-05-11 6:08 [PATCH v3 00/10] Improve futex usage Akihiko Odaki
` (8 preceding siblings ...)
2025-05-11 6:08 ` [PATCH v3 09/10] migration/postcopy: " Akihiko Odaki
@ 2025-05-11 6:08 ` Akihiko Odaki
9 siblings, 0 replies; 29+ messages in thread
From: Akihiko Odaki @ 2025-05-11 6:08 UTC (permalink / raw)
To: Paolo Bonzini, Stefan Weil, Peter Xu, Fabiano Rosas,
Hailiang Zhang
Cc: Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé,
Akihiko Odaki
sem in AppleGFXReadMemoryJob is an one-shot event so it can be converted
into QemuEvent, which is more specialized for such a use case.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
hw/display/apple-gfx.m | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/hw/display/apple-gfx.m b/hw/display/apple-gfx.m
index 2ff1c90df71a..173fffc86ef1 100644
--- a/hw/display/apple-gfx.m
+++ b/hw/display/apple-gfx.m
@@ -454,7 +454,7 @@ static void set_cursor_glyph(void *opaque)
/* ------ DMA (device reading system memory) ------ */
typedef struct AppleGFXReadMemoryJob {
- QemuSemaphore sem;
+ QemuEvent event;
hwaddr physical_address;
uint64_t length;
void *dst;
@@ -470,7 +470,7 @@ static void apple_gfx_do_read_memory(void *opaque)
job->dst, job->length, MEMTXATTRS_UNSPECIFIED);
job->success = (r == MEMTX_OK);
- qemu_sem_post(&job->sem);
+ qemu_event_set(&job->event);
}
static bool apple_gfx_read_memory(AppleGFXState *s, hwaddr physical_address,
@@ -483,11 +483,11 @@ static bool apple_gfx_read_memory(AppleGFXState *s, hwaddr physical_address,
trace_apple_gfx_read_memory(physical_address, length, dst);
/* Performing DMA requires BQL, so do it in a BH. */
- qemu_sem_init(&job.sem, 0);
+ qemu_event_init(&job.event, 0);
aio_bh_schedule_oneshot(qemu_get_aio_context(),
apple_gfx_do_read_memory, &job);
- qemu_sem_wait(&job.sem);
- qemu_sem_destroy(&job.sem);
+ qemu_event_wait(&job.event);
+ qemu_event_destroy(&job.event);
return job.success;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [PATCH v3 08/10] migration/colo: Replace QemuSemaphore with QemuEvent
2025-05-11 6:08 ` [PATCH v3 08/10] migration/colo: " Akihiko Odaki
@ 2025-05-12 15:12 ` Fabiano Rosas
0 siblings, 0 replies; 29+ messages in thread
From: Fabiano Rosas @ 2025-05-12 15:12 UTC (permalink / raw)
To: Akihiko Odaki, Paolo Bonzini, Stefan Weil, Peter Xu,
Hailiang Zhang
Cc: Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé,
Akihiko Odaki
Akihiko Odaki <akihiko.odaki@daynix.com> writes:
> colo_exit_sem and colo_incoming_sem represent one-shot events so they
> can be converted into QemuEvent, which is more lightweight.
>
> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 09/10] migration/postcopy: Replace QemuSemaphore with QemuEvent
2025-05-11 6:08 ` [PATCH v3 09/10] migration/postcopy: " Akihiko Odaki
@ 2025-05-12 15:12 ` Fabiano Rosas
0 siblings, 0 replies; 29+ messages in thread
From: Fabiano Rosas @ 2025-05-12 15:12 UTC (permalink / raw)
To: Akihiko Odaki, Paolo Bonzini, Stefan Weil, Peter Xu,
Hailiang Zhang
Cc: Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé,
Akihiko Odaki
Akihiko Odaki <akihiko.odaki@daynix.com> writes:
> thread_sync_sem is an one-shot event so it can be converted into
> QemuEvent, which is more lightweight.
>
> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 01/10] futex: Check value after qemu_futex_wait()
2025-05-11 6:08 ` [PATCH v3 01/10] futex: Check value after qemu_futex_wait() Akihiko Odaki
@ 2025-05-13 14:39 ` Peter Xu
2025-05-14 7:34 ` Akihiko Odaki
0 siblings, 1 reply; 29+ messages in thread
From: Peter Xu @ 2025-05-13 14:39 UTC (permalink / raw)
To: Akihiko Odaki
Cc: Paolo Bonzini, Stefan Weil, Fabiano Rosas, Hailiang Zhang,
Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé
On Sun, May 11, 2025 at 03:08:18PM +0900, Akihiko Odaki wrote:
> futex(2) - Linux manual page
> https://man7.org/linux/man-pages/man2/futex.2.html
> > Note that a wake-up can also be caused by common futex usage patterns
> > in unrelated code that happened to have previously used the futex
> > word's memory location (e.g., typical futex-based implementations of
> > Pthreads mutexes can cause this under some conditions). Therefore,
> > callers should always conservatively assume that a return value of 0
> > can mean a spurious wake-up, and use the futex word's value (i.e.,
> > the user-space synchronization scheme) to decide whether to continue
> > to block or not.
I'm just curious - do you know when this will happen?
AFAIU, QEMU uses futex always on private mappings, internally futex does
use (mm, HVA) tuple to index a futex, afaict. Hence, I don't see how it
can get spurious wakeups.. And _if_ it happens, since mm pointer can't
change it must mean the HVA of the futex word is reused, it sounds like an
UAF user bug to me instead.
I checked the man-pages git repo, this line was introduced in:
https://github.com/mkerrisk/man-pages/commit/4b35dc5dabcf356ce6dcb1f949f7b00e76c7587d
I also didn't see details yet in commit message on why that paragraph was
added.
And..
>
> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> ---
> include/qemu/futex.h | 9 +++++++++
> tests/unit/test-aio-multithread.c | 4 +++-
> util/qemu-thread-posix.c | 28 ++++++++++++++++------------
> 3 files changed, 28 insertions(+), 13 deletions(-)
>
> diff --git a/include/qemu/futex.h b/include/qemu/futex.h
> index 91ae88966e12..f57774005330 100644
> --- a/include/qemu/futex.h
> +++ b/include/qemu/futex.h
> @@ -24,6 +24,15 @@ static inline void qemu_futex_wake(void *f, int n)
> qemu_futex(f, FUTEX_WAKE, n, NULL, NULL, 0);
> }
>
> +/*
> + * Note that a wake-up can also be caused by common futex usage patterns in
> + * unrelated code that happened to have previously used the futex word's
> + * memory location (e.g., typical futex-based implementations of Pthreads
> + * mutexes can cause this under some conditions). Therefore, callers should
.. another thing that was unclear to me is, here it's mentioning "typical
futex-based implementations of pthreads mutexes..", but here
qemu_futex_wait() is using raw futex without any pthread impl. Does it
also mean that this may not be applicable to whatever might cause a
spurious wakeup?
> + * always conservatively assume that it is a spurious wake-up, and use the futex
> + * word's value (i.e., the user-space synchronization scheme) to decide whether
> + * to continue to block or not.
> + */
> static inline void qemu_futex_wait(void *f, unsigned val)
> {
> while (qemu_futex(f, FUTEX_WAIT, (int) val, NULL, NULL, 0)) {
> diff --git a/tests/unit/test-aio-multithread.c b/tests/unit/test-aio-multithread.c
> index 08d4570ccb14..8c2e41545a29 100644
> --- a/tests/unit/test-aio-multithread.c
> +++ b/tests/unit/test-aio-multithread.c
> @@ -305,7 +305,9 @@ static void mcs_mutex_lock(void)
> prev = qatomic_xchg(&mutex_head, id);
> if (prev != -1) {
> qatomic_set(&nodes[prev].next, id);
> - qemu_futex_wait(&nodes[id].locked, 1);
> + while (qatomic_read(&nodes[id].locked) == 1) {
> + qemu_futex_wait(&nodes[id].locked, 1);
> + }
> }
> }
>
> diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
> index b2e26e21205b..eade5311d175 100644
> --- a/util/qemu-thread-posix.c
> +++ b/util/qemu-thread-posix.c
> @@ -428,17 +428,21 @@ void qemu_event_wait(QemuEvent *ev)
>
> assert(ev->initialized);
>
> - /*
> - * qemu_event_wait must synchronize with qemu_event_set even if it does
> - * not go down the slow path, so this load-acquire is needed that
> - * synchronizes with the first memory barrier in qemu_event_set().
> - *
> - * If we do go down the slow path, there is no requirement at all: we
> - * might miss a qemu_event_set() here but ultimately the memory barrier in
> - * qemu_futex_wait() will ensure the check is done correctly.
> - */
> - value = qatomic_load_acquire(&ev->value);
> - if (value != EV_SET) {
> + while (true) {
> + /*
> + * qemu_event_wait must synchronize with qemu_event_set even if it does
> + * not go down the slow path, so this load-acquire is needed that
> + * synchronizes with the first memory barrier in qemu_event_set().
> + *
> + * If we do go down the slow path, there is no requirement at all: we
> + * might miss a qemu_event_set() here but ultimately the memory barrier
> + * in qemu_futex_wait() will ensure the check is done correctly.
> + */
> + value = qatomic_load_acquire(&ev->value);
> + if (value == EV_SET) {
> + break;
> + }
> +
> if (value == EV_FREE) {
> /*
> * Leave the event reset and tell qemu_event_set that there are
> @@ -452,7 +456,7 @@ void qemu_event_wait(QemuEvent *ev)
> * like the load above.
> */
> if (qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET) {
> - return;
> + break;
> }
> }
>
>
> --
> 2.49.0
>
--
Peter Xu
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 07/10] migration: Replace QemuSemaphore with QemuEvent
2025-05-11 6:08 ` [PATCH v3 07/10] migration: Replace QemuSemaphore with QemuEvent Akihiko Odaki
@ 2025-05-13 19:13 ` Peter Xu
2025-05-14 12:36 ` Akihiko Odaki
0 siblings, 1 reply; 29+ messages in thread
From: Peter Xu @ 2025-05-13 19:13 UTC (permalink / raw)
To: Akihiko Odaki
Cc: Paolo Bonzini, Stefan Weil, Fabiano Rosas, Hailiang Zhang,
Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé
On Sun, May 11, 2025 at 03:08:24PM +0900, Akihiko Odaki wrote:
> rp_pong_acks tells if it has ever received one pong. QemuEvent is
> better suited for this usage because it represents a boolean rather
> than integer and will not decrement with the wait operation.
>
> pause_event can utilize qemu_event_reset() to discard events.
>
> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> ---
> migration/migration.h | 6 +++---
> migration/migration.c | 29 +++++++++++++----------------
> 2 files changed, 16 insertions(+), 19 deletions(-)
>
> diff --git a/migration/migration.h b/migration/migration.h
> index d53f7cad84d8..11dba5f4da77 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -342,11 +342,11 @@ struct MigrationState {
> */
> QemuSemaphore rp_sem;
> /*
> - * We post to this when we got one PONG from dest. So far it's an
> + * We set this when we got one PONG from dest. So far it's an
> * easy way to know the main channel has successfully established
> * on dest QEMU.
> */
> - QemuSemaphore rp_pong_acks;
> + QemuEvent rp_pong_acks;
If to do the switch then it needs a rename too because "acks" imply a
non-boolean.
For this one, when introduced it was indeed kept in mind that it might at
some point be useful to count for more than one.
For the other one, it is definitely an improvement to use QemuEvent
especially on the tricky reset..
How about keeping rp_pong_acks as-is, and keep the patch for replacing
pause_sem?
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 04/10] qemu-thread: Avoid futex abstraction for non-Linux
2025-05-11 6:08 ` [PATCH v3 04/10] qemu-thread: Avoid futex abstraction for non-Linux Akihiko Odaki
@ 2025-05-14 0:51 ` Paolo Bonzini
2025-05-14 12:57 ` Akihiko Odaki
0 siblings, 1 reply; 29+ messages in thread
From: Paolo Bonzini @ 2025-05-14 0:51 UTC (permalink / raw)
To: Akihiko Odaki, Stefan Weil, Peter Xu, Fabiano Rosas,
Hailiang Zhang
Cc: Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé
On 5/11/25 08:08, Akihiko Odaki wrote:
> @@ -387,12 +365,17 @@ void qemu_event_set(QemuEvent *ev)
> assert(ev->initialized);
>
> /*
> - * Pairs with both qemu_event_reset() and qemu_event_wait().
> + * Pairs with qemu_event_wait() (on Linux) and qemu_event_reset().
> *
> * qemu_event_set has release semantics, but because it *loads*
> * ev->value we need a full memory barrier here.
> + *
> + * qemu_event_wait() do not have a paired barrier on non-Linux because
> + * ev->lock ensures ordering.
> */
Thanks for trying to write down the idea. This works, but there's
still room for improvement. :) The insight is that the logic becomes:
if (done == 0) { done = 1;
<read ev->value> ev->value = EV_SET;
ev->value = EV_FREE;
if (done == 0) {
pthread_cond_wait(); pthread_cond_broadcast();
}
}
which is a lot simplified compared to the futex case because all the
differences between EV_FREE and EV_BUSY are replaced by the condition
variable. In order to avoid the slow path, all that's needed is to
ensure that if qemu_event_reset() sees EV_SET, the caller also sees
done == 1 in the second "if".
The futex case needs qatomic_or() because there can be a concurrent
EV_FREE->EV_BUSY, and a smp_mb() because qemu_event_set() needs to
*read* ev->value. The non-futex case instead it can do this:
if (done == 0) { done = 1;
v = load_acquire(ev->value); <--- store_release(value, EV_SET);
ev->value = v | EV_FREE;
if (done == 0) {
pthread_cond_wait(); pthread_cond_broadcast();
}
}
where load<---store indicates that the load on the left reads the
value that the store writes, and the store "happens before" the load.
In other words:
- because v reads EV_SET, everything before the store_release is ordered
before everything that follows the load_acquire
- therefore if v reads EV_SET, the pthread_cond_wait() won't be reached
and there's no possibility of a hang (I think :))
This becomes the following patch:
diff --git a/util/event.c b/util/event.c
index 20853d61a7c..489cec08de7 100644
--- a/util/event.c
+++ b/util/event.c
@@ -47,18 +47,14 @@ void qemu_event_set(QemuEvent *ev)
{
assert(ev->initialized);
+#ifdef HAVE_FUTEX
/*
- * Pairs with qemu_event_wait() (on Linux) and qemu_event_reset().
- *
+ * Pairs with qemu_event_wait() and qemu_event_reset().
* qemu_event_set has release semantics, but because it *loads*
* ev->value we need a full memory barrier here.
- *
- * qemu_event_wait() do not have a paired barrier on non-Linux because
- * ev->lock ensures ordering.
*/
smp_mb();
-#ifdef HAVE_FUTEX
if (qatomic_read(&ev->value) != EV_SET) {
int old = qatomic_xchg(&ev->value, EV_SET);
@@ -71,7 +67,8 @@ void qemu_event_set(QemuEvent *ev)
}
#else
pthread_mutex_lock(&ev->lock);
- qatomic_set(&ev->value, EV_SET);
+ /* Pairs with qemu_event_reset()'s load acquire. */
+ qatomic_store_release(&ev->value, EV_SET);
pthread_cond_broadcast(&ev->cond);
pthread_mutex_unlock(&ev->lock);
#endif
@@ -81,6 +78,7 @@ void qemu_event_reset(QemuEvent *ev)
{
assert(ev->initialized);
+#ifdef HAVE_FUTEX
/*
* If there was a concurrent reset (or even reset+wait),
* do nothing. Otherwise change EV_SET->EV_FREE.
@@ -92,6 +90,28 @@ void qemu_event_reset(QemuEvent *ev)
* Pairs with the first memory barrier in qemu_event_set().
*/
smp_mb__after_rmw();
+#else
+ /*
+ * If futexes are not available, there are no EV_FREE->EV_BUSY
+ * transitions because wakeups are done entirely through the
+ * condition variable. Since qatomic_set() always writes EV_FREE
+ * the load seems useless, but in reality its acquire synchronizes
+ * with qemu_event_set()'s store-release: if qemu_event_reset()
+ * sees EV_SET here, then the caller will certainly see a
+ * successful condition and skip qemu_event_wait():
+ *
+ * done = 1; if (done == 0)
+ * qemu_event_set() { qemu_event_reset() {
+ * lock();
+ * ev->value = EV_SET -----> v = ev->value
+ * ev->value = v | EV_FREE
+ * cond_broadcast()
+ * unlock(); }
+ * } if (done == 0)
+ * // qemu_event_wait() not called
+ */
+ qatomic_set(&ev->value, qatomic_load_acquire(&ev->value) | EV_FREE);
+#endif
}
void qemu_event_wait(QemuEvent *ev)
You don't need to post v4 (which would include the above diff in
this patch, in qemu-thread-posix.c); I can post the above change
as a follow up too. What you have now is not wrong, it's only the
explanation that isn't too precise.
Paolo
^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [PATCH v3 01/10] futex: Check value after qemu_futex_wait()
2025-05-13 14:39 ` Peter Xu
@ 2025-05-14 7:34 ` Akihiko Odaki
2025-05-14 17:06 ` Peter Xu
0 siblings, 1 reply; 29+ messages in thread
From: Akihiko Odaki @ 2025-05-14 7:34 UTC (permalink / raw)
To: Peter Xu
Cc: Paolo Bonzini, Stefan Weil, Fabiano Rosas, Hailiang Zhang,
Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé
On 2025/05/13 23:39, 'Peter Xu' via devel wrote:
> On Sun, May 11, 2025 at 03:08:18PM +0900, Akihiko Odaki wrote:
>> futex(2) - Linux manual page
>> https://man7.org/linux/man-pages/man2/futex.2.html
>>> Note that a wake-up can also be caused by common futex usage patterns
>>> in unrelated code that happened to have previously used the futex
>>> word's memory location (e.g., typical futex-based implementations of
>>> Pthreads mutexes can cause this under some conditions). Therefore,
>>> callers should always conservatively assume that a return value of 0
>>> can mean a spurious wake-up, and use the futex word's value (i.e.,
>>> the user-space synchronization scheme) to decide whether to continue
>>> to block or not.
>
> I'm just curious - do you know when this will happen?
>
> AFAIU, QEMU uses futex always on private mappings, internally futex does
> use (mm, HVA) tuple to index a futex, afaict. Hence, I don't see how it
> can get spurious wakeups.. And _if_ it happens, since mm pointer can't
> change it must mean the HVA of the futex word is reused, it sounds like an
> UAF user bug to me instead.
>
> I checked the man-pages git repo, this line was introduced in:
>
> https://github.com/mkerrisk/man-pages/commit/4b35dc5dabcf356ce6dcb1f949f7b00e76c7587d
>
> I also didn't see details yet in commit message on why that paragraph was
> added.
>
> And..
>
>>
>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
>> ---
>> include/qemu/futex.h | 9 +++++++++
>> tests/unit/test-aio-multithread.c | 4 +++-
>> util/qemu-thread-posix.c | 28 ++++++++++++++++------------
>> 3 files changed, 28 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/qemu/futex.h b/include/qemu/futex.h
>> index 91ae88966e12..f57774005330 100644
>> --- a/include/qemu/futex.h
>> +++ b/include/qemu/futex.h
>> @@ -24,6 +24,15 @@ static inline void qemu_futex_wake(void *f, int n)
>> qemu_futex(f, FUTEX_WAKE, n, NULL, NULL, 0);
>> }
>>
>> +/*
>> + * Note that a wake-up can also be caused by common futex usage patterns in
>> + * unrelated code that happened to have previously used the futex word's
>> + * memory location (e.g., typical futex-based implementations of Pthreads
>> + * mutexes can cause this under some conditions). Therefore, callers should
>
> .. another thing that was unclear to me is, here it's mentioning "typical
> futex-based implementations of pthreads mutexes..", but here
> qemu_futex_wait() is using raw futex without any pthread impl. Does it
> also mean that this may not be applicable to whatever might cause a
> spurious wakeup?
No. The man-page mentions "unrelated code that happened to have
previously used the futex word's memory location", so it doesn't matter
whether we use pthread here.
libpthread and even this QemuEvent follows the "common futex usage" so
we should do what is written in the man page.
Unfortunately the man page does not describe the "common futex usage
pattern". It looks like as follows:
Assume there are two threads, one atomic variable, and one futex.
Thread A does the following:
A1. Read the atomic variable.
A2. Go A5 if the atomic variable is zero.
A3. Wait using the futex.
A4. Go A1.
A5. Free the atomic variable and the futex.
Thread B does the following:
B1. Set the atomic variable to zero.
B2. Wake up using the futex.
In this example, the execution may happen in the following order:
B1 -> A1 -> A2 -> A5 -> B2
Here, B2 will cause a spurious wake up of QemuEvent if the freed memory
gets reused for QemuEvent.
Regards,
Akihiko Odaki
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 07/10] migration: Replace QemuSemaphore with QemuEvent
2025-05-13 19:13 ` Peter Xu
@ 2025-05-14 12:36 ` Akihiko Odaki
0 siblings, 0 replies; 29+ messages in thread
From: Akihiko Odaki @ 2025-05-14 12:36 UTC (permalink / raw)
To: Peter Xu
Cc: Paolo Bonzini, Stefan Weil, Fabiano Rosas, Hailiang Zhang,
Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé
On 2025/05/14 4:13, Peter Xu wrote:
> On Sun, May 11, 2025 at 03:08:24PM +0900, Akihiko Odaki wrote:
>> rp_pong_acks tells if it has ever received one pong. QemuEvent is
>> better suited for this usage because it represents a boolean rather
>> than integer and will not decrement with the wait operation.
>>
>> pause_event can utilize qemu_event_reset() to discard events.
>>
>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
>> ---
>> migration/migration.h | 6 +++---
>> migration/migration.c | 29 +++++++++++++----------------
>> 2 files changed, 16 insertions(+), 19 deletions(-)
>>
>> diff --git a/migration/migration.h b/migration/migration.h
>> index d53f7cad84d8..11dba5f4da77 100644
>> --- a/migration/migration.h
>> +++ b/migration/migration.h
>> @@ -342,11 +342,11 @@ struct MigrationState {
>> */
>> QemuSemaphore rp_sem;
>> /*
>> - * We post to this when we got one PONG from dest. So far it's an
>> + * We set this when we got one PONG from dest. So far it's an
>> * easy way to know the main channel has successfully established
>> * on dest QEMU.
>> */
>> - QemuSemaphore rp_pong_acks;
>> + QemuEvent rp_pong_acks;
>
> If to do the switch then it needs a rename too because "acks" imply a
> non-boolean.
>
> For this one, when introduced it was indeed kept in mind that it might at
> some point be useful to count for more than one.
>
> For the other one, it is definitely an improvement to use QemuEvent
> especially on the tricky reset..
>
> How about keeping rp_pong_acks as-is, and keep the patch for replacing
> pause_sem?
I'll keep rp_pong_acks as is as you suggest with the next version. My
objective in this series to replace QemuSemaphores with QemuEvents if
QemuEvents are obviously more suited; this one is not that obvious.
Regards,
Akihiko Odaki
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 04/10] qemu-thread: Avoid futex abstraction for non-Linux
2025-05-14 0:51 ` Paolo Bonzini
@ 2025-05-14 12:57 ` Akihiko Odaki
2025-05-16 11:09 ` Paolo Bonzini
0 siblings, 1 reply; 29+ messages in thread
From: Akihiko Odaki @ 2025-05-14 12:57 UTC (permalink / raw)
To: Paolo Bonzini, Stefan Weil, Peter Xu, Fabiano Rosas,
Hailiang Zhang
Cc: Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé
On 2025/05/14 9:51, 'Paolo Bonzini' via devel wrote:
> On 5/11/25 08:08, Akihiko Odaki wrote:
>> @@ -387,12 +365,17 @@ void qemu_event_set(QemuEvent *ev)
>> assert(ev->initialized);
>> /*
>> - * Pairs with both qemu_event_reset() and qemu_event_wait().
>> + * Pairs with qemu_event_wait() (on Linux) and qemu_event_reset().
>> *
>> * qemu_event_set has release semantics, but because it *loads*
>> * ev->value we need a full memory barrier here.
>> + *
>> + * qemu_event_wait() do not have a paired barrier on non-Linux
>> because
>> + * ev->lock ensures ordering.
>> */
>
> Thanks for trying to write down the idea. This works, but there's
> still room for improvement. :) The insight is that the logic becomes:
>
> if (done == 0) { done = 1;
> <read ev->value> ev->value = EV_SET;
> ev->value = EV_FREE;
> if (done == 0) {
> pthread_cond_wait(); pthread_cond_broadcast();
> }
> }
>
> which is a lot simplified compared to the futex case because all the
> differences between EV_FREE and EV_BUSY are replaced by the condition
> variable. In order to avoid the slow path, all that's needed is to
> ensure that if qemu_event_reset() sees EV_SET, the caller also sees
> done == 1 in the second "if".
>
> The futex case needs qatomic_or() because there can be a concurrent
> EV_FREE->EV_BUSY, and a smp_mb() because qemu_event_set() needs to
> *read* ev->value. The non-futex case instead it can do this:
>
> if (done == 0) { done = 1;
> v = load_acquire(ev->value); <--- store_release(value, EV_SET);
> ev->value = v | EV_FREE;
> if (done == 0) {
> pthread_cond_wait(); pthread_cond_broadcast();
> }
> }
>
> where load<---store indicates that the load on the left reads the
> value that the store writes, and the store "happens before" the load.
> In other words:>
> - because v reads EV_SET, everything before the store_release is ordered
> before everything that follows the load_acquire
>
> - therefore if v reads EV_SET, the pthread_cond_wait() won't be reached
> and there's no possibility of a hang (I think :))
>
>
> This becomes the following patch:
>
> diff --git a/util/event.c b/util/event.c
> index 20853d61a7c..489cec08de7 100644
> --- a/util/event.c
> +++ b/util/event.c
> @@ -47,18 +47,14 @@ void qemu_event_set(QemuEvent *ev)
> {
> assert(ev->initialized);
>
> +#ifdef HAVE_FUTEX
> /*
> - * Pairs with qemu_event_wait() (on Linux) and qemu_event_reset().
> - *
> + * Pairs with qemu_event_wait() and qemu_event_reset().
> * qemu_event_set has release semantics, but because it *loads*
> * ev->value we need a full memory barrier here.
> - *
> - * qemu_event_wait() do not have a paired barrier on non-Linux because
> - * ev->lock ensures ordering.
> */
> smp_mb();
>
> -#ifdef HAVE_FUTEX
> if (qatomic_read(&ev->value) != EV_SET) {
> int old = qatomic_xchg(&ev->value, EV_SET);
>
> @@ -71,7 +67,8 @@ void qemu_event_set(QemuEvent *ev)
> }
> #else
> pthread_mutex_lock(&ev->lock);
> - qatomic_set(&ev->value, EV_SET);
> + /* Pairs with qemu_event_reset()'s load acquire. */
> + qatomic_store_release(&ev->value, EV_SET);
> pthread_cond_broadcast(&ev->cond);
> pthread_mutex_unlock(&ev->lock);
> #endif
> @@ -81,6 +78,7 @@ void qemu_event_reset(QemuEvent *ev)
> {
> assert(ev->initialized);
>
> +#ifdef HAVE_FUTEX
> /*
> * If there was a concurrent reset (or even reset+wait),
> * do nothing. Otherwise change EV_SET->EV_FREE.
> @@ -92,6 +90,28 @@ void qemu_event_reset(QemuEvent *ev)
> * Pairs with the first memory barrier in qemu_event_set().
> */
> smp_mb__after_rmw();
> +#else
> + /*
> + * If futexes are not available, there are no EV_FREE->EV_BUSY
> + * transitions because wakeups are done entirely through the
> + * condition variable. Since qatomic_set() always writes EV_FREE
> + * the load seems useless, but in reality its acquire synchronizes
> + * with qemu_event_set()'s store-release: if qemu_event_reset()
> + * sees EV_SET here, then the caller will certainly see a
> + * successful condition and skip qemu_event_wait():
> + *
> + * done = 1; if (done == 0)
> + * qemu_event_set() { qemu_event_reset() {
> + * lock();
> + * ev->value = EV_SET -----> v = ev->value
> + * ev->value = v | EV_FREE
> + * cond_broadcast()
> + * unlock(); }
> + * } if (done == 0)
> + * // qemu_event_wait() not called
> + */
> + qatomic_set(&ev->value, qatomic_load_acquire(&ev->value) | EV_FREE);
> +#endif
> }
>
> void qemu_event_wait(QemuEvent *ev)
>
>
> You don't need to post v4 (which would include the above diff in
> this patch, in qemu-thread-posix.c); I can post the above change
> as a follow up too. What you have now is not wrong, it's only the
> explanation that isn't too precise.
Honestly, I do not understand why smp_mb() is placed here even in the
futex case. The comment says:
> qemu_event_set has release semantics, but because it *loads*
> ev->value we need a full memory barrier here.
The barrier is indeed followed by a load: qatomic_read(&ev->value) != EV_SET
However, the caller of qemu_event_set() should not care whether the
event is already set or not, so I'm not sure if smp_mb() is needed in
the first place.
Perhaps what is missing here is a clear documentation of concurrency
semantics of these functions. In my understanding, these functions need
to satisfy the following semantics:
qemu_event_set(): release *if the event is not already set*; otherwise
it has no effect on synchronization so we don't need a barrier either.
qemu_event_reset(): acquire; this enables checking the state for the
event set before resetting.
qemu_event_wait(): acquire
Regards,
Akihiko Odaki
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 01/10] futex: Check value after qemu_futex_wait()
2025-05-14 7:34 ` Akihiko Odaki
@ 2025-05-14 17:06 ` Peter Xu
2025-05-16 5:34 ` Akihiko Odaki
0 siblings, 1 reply; 29+ messages in thread
From: Peter Xu @ 2025-05-14 17:06 UTC (permalink / raw)
To: Akihiko Odaki
Cc: Paolo Bonzini, Stefan Weil, Fabiano Rosas, Hailiang Zhang,
Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé
On Wed, May 14, 2025 at 04:34:33PM +0900, Akihiko Odaki wrote:
> On 2025/05/13 23:39, 'Peter Xu' via devel wrote:
> > On Sun, May 11, 2025 at 03:08:18PM +0900, Akihiko Odaki wrote:
> > > futex(2) - Linux manual page
> > > https://man7.org/linux/man-pages/man2/futex.2.html
> > > > Note that a wake-up can also be caused by common futex usage patterns
> > > > in unrelated code that happened to have previously used the futex
> > > > word's memory location (e.g., typical futex-based implementations of
> > > > Pthreads mutexes can cause this under some conditions). Therefore,
> > > > callers should always conservatively assume that a return value of 0
> > > > can mean a spurious wake-up, and use the futex word's value (i.e.,
> > > > the user-space synchronization scheme) to decide whether to continue
> > > > to block or not.
> >
> > I'm just curious - do you know when this will happen?
> >
> > AFAIU, QEMU uses futex always on private mappings, internally futex does
> > use (mm, HVA) tuple to index a futex, afaict. Hence, I don't see how it
> > can get spurious wakeups.. And _if_ it happens, since mm pointer can't
> > change it must mean the HVA of the futex word is reused, it sounds like an
> > UAF user bug to me instead.
[1]
> >
> > I checked the man-pages git repo, this line was introduced in:
> >
> > https://github.com/mkerrisk/man-pages/commit/4b35dc5dabcf356ce6dcb1f949f7b00e76c7587d
> >
> > I also didn't see details yet in commit message on why that paragraph was
> > added.
> >
> > And..
> >
> > >
> > > Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> > > ---
> > > include/qemu/futex.h | 9 +++++++++
> > > tests/unit/test-aio-multithread.c | 4 +++-
> > > util/qemu-thread-posix.c | 28 ++++++++++++++++------------
> > > 3 files changed, 28 insertions(+), 13 deletions(-)
> > >
> > > diff --git a/include/qemu/futex.h b/include/qemu/futex.h
> > > index 91ae88966e12..f57774005330 100644
> > > --- a/include/qemu/futex.h
> > > +++ b/include/qemu/futex.h
> > > @@ -24,6 +24,15 @@ static inline void qemu_futex_wake(void *f, int n)
> > > qemu_futex(f, FUTEX_WAKE, n, NULL, NULL, 0);
> > > }
> > > +/*
> > > + * Note that a wake-up can also be caused by common futex usage patterns in
> > > + * unrelated code that happened to have previously used the futex word's
> > > + * memory location (e.g., typical futex-based implementations of Pthreads
> > > + * mutexes can cause this under some conditions). Therefore, callers should
> >
> > .. another thing that was unclear to me is, here it's mentioning "typical
> > futex-based implementations of pthreads mutexes..", but here
> > qemu_futex_wait() is using raw futex without any pthread impl. Does it
> > also mean that this may not be applicable to whatever might cause a
> > spurious wakeup?
>
> No. The man-page mentions "unrelated code that happened to have previously
> used the futex word's memory location", so it doesn't matter whether we use
> pthread here.
>
> libpthread and even this QemuEvent follows the "common futex usage" so we
> should do what is written in the man page.
>
> Unfortunately the man page does not describe the "common futex usage
> pattern". It looks like as follows:
>
> Assume there are two threads, one atomic variable, and one futex.
>
> Thread A does the following:
> A1. Read the atomic variable.
> A2. Go A5 if the atomic variable is zero.
> A3. Wait using the futex.
> A4. Go A1.
> A5. Free the atomic variable and the futex.
>
> Thread B does the following:
> B1. Set the atomic variable to zero.
> B2. Wake up using the futex.
>
> In this example, the execution may happen in the following order:
> B1 -> A1 -> A2 -> A5 -> B2
>
> Here, B2 will cause a spurious wake up of QemuEvent if the freed memory gets
> reused for QemuEvent.
This is true.
Said that, if to follow my previous statement at [1] above, here I think A5
is the UAF bug I mentioned, trying to free the lock object with existing
user (Thread B) accessing the object.
IMHO, the userapp should make sure the object will never be freed if
there's any possible user of it, and that includes a waker like Thread B.
For futex, the futex word (which is the important bit here relevant to
possible spurious wakeups) is part of the lock object, hence if the lock
object isn't freed too early it won't ever get reused, and then there
should have no chance of spurious wakeups in the futex context.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 01/10] futex: Check value after qemu_futex_wait()
2025-05-14 17:06 ` Peter Xu
@ 2025-05-16 5:34 ` Akihiko Odaki
2025-05-16 14:53 ` Peter Xu
0 siblings, 1 reply; 29+ messages in thread
From: Akihiko Odaki @ 2025-05-16 5:34 UTC (permalink / raw)
To: Peter Xu
Cc: Paolo Bonzini, Stefan Weil, Fabiano Rosas, Hailiang Zhang,
Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé
On 2025/05/15 2:06, Peter Xu wrote:
> On Wed, May 14, 2025 at 04:34:33PM +0900, Akihiko Odaki wrote:
>> On 2025/05/13 23:39, 'Peter Xu' via devel wrote:
>>> On Sun, May 11, 2025 at 03:08:18PM +0900, Akihiko Odaki wrote:
>>>> futex(2) - Linux manual page
>>>> https://man7.org/linux/man-pages/man2/futex.2.html
>>>>> Note that a wake-up can also be caused by common futex usage patterns
>>>>> in unrelated code that happened to have previously used the futex
>>>>> word's memory location (e.g., typical futex-based implementations of
>>>>> Pthreads mutexes can cause this under some conditions). Therefore,
>>>>> callers should always conservatively assume that a return value of 0
>>>>> can mean a spurious wake-up, and use the futex word's value (i.e.,
>>>>> the user-space synchronization scheme) to decide whether to continue
>>>>> to block or not.
>>>
>>> I'm just curious - do you know when this will happen?
>>>
>>> AFAIU, QEMU uses futex always on private mappings, internally futex does
>>> use (mm, HVA) tuple to index a futex, afaict. Hence, I don't see how it
>>> can get spurious wakeups.. And _if_ it happens, since mm pointer can't
>>> change it must mean the HVA of the futex word is reused, it sounds like an
>>> UAF user bug to me instead.
>
> [1]
>
>>>
>>> I checked the man-pages git repo, this line was introduced in:
>>>
>>> https://github.com/mkerrisk/man-pages/commit/4b35dc5dabcf356ce6dcb1f949f7b00e76c7587d
>>>
>>> I also didn't see details yet in commit message on why that paragraph was
>>> added.
>>>
>>> And..
>>>
>>>>
>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
>>>> ---
>>>> include/qemu/futex.h | 9 +++++++++
>>>> tests/unit/test-aio-multithread.c | 4 +++-
>>>> util/qemu-thread-posix.c | 28 ++++++++++++++++------------
>>>> 3 files changed, 28 insertions(+), 13 deletions(-)
>>>>
>>>> diff --git a/include/qemu/futex.h b/include/qemu/futex.h
>>>> index 91ae88966e12..f57774005330 100644
>>>> --- a/include/qemu/futex.h
>>>> +++ b/include/qemu/futex.h
>>>> @@ -24,6 +24,15 @@ static inline void qemu_futex_wake(void *f, int n)
>>>> qemu_futex(f, FUTEX_WAKE, n, NULL, NULL, 0);
>>>> }
>>>> +/*
>>>> + * Note that a wake-up can also be caused by common futex usage patterns in
>>>> + * unrelated code that happened to have previously used the futex word's
>>>> + * memory location (e.g., typical futex-based implementations of Pthreads
>>>> + * mutexes can cause this under some conditions). Therefore, callers should
>>>
>>> .. another thing that was unclear to me is, here it's mentioning "typical
>>> futex-based implementations of pthreads mutexes..", but here
>>> qemu_futex_wait() is using raw futex without any pthread impl. Does it
>>> also mean that this may not be applicable to whatever might cause a
>>> spurious wakeup?
>>
>> No. The man-page mentions "unrelated code that happened to have previously
>> used the futex word's memory location", so it doesn't matter whether we use
>> pthread here.
>>
>> libpthread and even this QemuEvent follows the "common futex usage" so we
>> should do what is written in the man page.
>>
>> Unfortunately the man page does not describe the "common futex usage
>> pattern". It looks like as follows:
>>
>> Assume there are two threads, one atomic variable, and one futex.
>>
>> Thread A does the following:
>> A1. Read the atomic variable.
>> A2. Go A5 if the atomic variable is zero.
>> A3. Wait using the futex.
>> A4. Go A1.
>> A5. Free the atomic variable and the futex.
>>
>> Thread B does the following:
>> B1. Set the atomic variable to zero.
>> B2. Wake up using the futex.
>>
>> In this example, the execution may happen in the following order:
>> B1 -> A1 -> A2 -> A5 -> B2
>>
>> Here, B2 will cause a spurious wake up of QemuEvent if the freed memory gets
>> reused for QemuEvent.
>
> This is true.
>
> Said that, if to follow my previous statement at [1] above, here I think A5
> is the UAF bug I mentioned, trying to free the lock object with existing
> user (Thread B) accessing the object.
>
> IMHO, the userapp should make sure the object will never be freed if
> there's any possible user of it, and that includes a waker like Thread B.
>
> For futex, the futex word (which is the important bit here relevant to
> possible spurious wakeups) is part of the lock object, hence if the lock
> object isn't freed too early it won't ever get reused, and then there
> should have no chance of spurious wakeups in the futex context.
It is a UAF, but it is by design and not a bug.
The principle of the futex design is to use atomic memory operations to
manage the state instead of using a system call, which is more expensive.
This principle motivates tolerating spurious wakeups. If wakeup system
calls after free are forbidden, a thread will need to use a (expensive)
system call to ensure the wake up actually happened before freeing.
Instead, we can tolerate spurious wakeups without causing a buggy
behavior by making the waiting thread perform (cheaper) atomic memory
reads to verify the expected state.
Regards,
Akihiko Odaki
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 04/10] qemu-thread: Avoid futex abstraction for non-Linux
2025-05-14 12:57 ` Akihiko Odaki
@ 2025-05-16 11:09 ` Paolo Bonzini
2025-05-16 12:58 ` Akihiko Odaki
0 siblings, 1 reply; 29+ messages in thread
From: Paolo Bonzini @ 2025-05-16 11:09 UTC (permalink / raw)
To: Akihiko Odaki
Cc: Stefan Weil, Peter Xu, Fabiano Rosas, Hailiang Zhang,
Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé
[-- Attachment #1: Type: text/plain, Size: 2073 bytes --]
Il mer 14 mag 2025, 08:57 Akihiko Odaki <akihiko.odaki@daynix.com> ha
scritto:
> Honestly, I do not understand why smp_mb() is placed here even in the
> futex case. The comment says:
> > qemu_event_set has release semantics, but because it *loads*
> > ev->value we need a full memory barrier here.
>
> The barrier is indeed followed by a load: qatomic_read(&ev->value) !=
> EV_SET
> However, the caller of qemu_event_set() should not care whether the
> event is already set or not
so I'm not sure if smp_mb() is needed in
> the first place.
The barrier is needed to ensure correct ordering in all cases. You have on
one side
done=true
Set
Read ev->value
If not EV_SET, set the event+ wake up waiters
>
And on the other
Write EV_FREE
Write
If not done
Wait
If one that calls qemu_event_set and the other calls qemu_event_reset, you
need to avoid that set reads a previous EV_SET for the value *and* the
other side reads done equal to false. The only way to avoid this is for
both sides to use a memory barrier before the read.
qemu_event_set(): release *if the event is not already set*; otherwise
> it has no effect on synchronization so we don't need a barrier either.
It needs to be release always. This ensures that, whenever the setter
declares the event to be set, the other side sees whatever the setter did
before the call.
It's the full memory barrier that is only needed, in principle, when the
event is not already set. But in practice you cannot know which barrier is
needed until you read the value, so you do need smp_mb().
qemu_event_reset(): acquire; this enables checking the state for the
> event set before resetting.
>
> qemu_event_wait(): acquire
>
These are correct. However, these are the semantics seen by the caller, but
(because of the algorithm used with futexes) there is a store-load ordering
and full barriers are necessary. Without futexes indeed it's enough to have
store-release and load-acquire in set/reset, and the mutex and condvar take
care of synchronizing set with wait.
Paolo
Regards,
> Akihiko Odaki
>
>
[-- Attachment #2: Type: text/html, Size: 4161 bytes --]
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 04/10] qemu-thread: Avoid futex abstraction for non-Linux
2025-05-16 11:09 ` Paolo Bonzini
@ 2025-05-16 12:58 ` Akihiko Odaki
2025-05-16 14:25 ` Paolo Bonzini
0 siblings, 1 reply; 29+ messages in thread
From: Akihiko Odaki @ 2025-05-16 12:58 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Stefan Weil, Peter Xu, Fabiano Rosas, Hailiang Zhang,
Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé
On 2025/05/16 20:09, 'Paolo Bonzini' via devel wrote:
>
>
> Il mer 14 mag 2025, 08:57 Akihiko Odaki <akihiko.odaki@daynix.com
> <mailto:akihiko.odaki@daynix.com>> ha scritto:
>
> Honestly, I do not understand why smp_mb() is placed here even in the
> futex case. The comment says:
> > qemu_event_set has release semantics, but because it *loads*
> > ev->value we need a full memory barrier here.
>
> The barrier is indeed followed by a load: qatomic_read(&ev->value) !
> = EV_SET
> However, the caller of qemu_event_set() should not care whether the
> event is already set or not
>
> so I'm not sure if smp_mb() is needed in
> the first place.
>
>
> The barrier is needed to ensure correct ordering in all cases. You have
> on one side
>
> done=true
> Set
> Read ev->value
> If not EV_SET, set the event+ wake up waiters
>
>
> And on the other
>
> Write EV_FREE
> Write
> If not done
> Wait
>
> If one that calls qemu_event_set and the other calls qemu_event_reset,
> you need to avoid that set reads a previous EV_SET for the value *and*
> the other side reads done equal to false. The only way to avoid this is
> for both sides to use a memory barrier before the read.
>
> qemu_event_set(): release *if the event is not already set*; otherwise
> it has no effect on synchronization so we don't need a barrier either.
>
>
> It needs to be release always. This ensures that, whenever the setter
> declares the event to be set, the other side sees whatever the setter
> did before the call.
That is what I misunderstood, and now I can also imagine why it should
always release. Thinking of the scenario with the "done" flag we have
discussed, if we have two setters, the resetter should acquire the state
from both of the two setters.
If the event is already set, we need to ensure all stores specified
before qatomic_read(&ev->value) != EV_SET should happen before
qatomic_read(&ev->value), which requires us to put a full barrier.
But I have a new question: do we really need "if
(qatomic_read(&ev->value) != EV_SET)"?
With git blame, I found it didn't have the full barrier until commit
374293ca6fb0 ("qemu-thread: use acquire/release to clarify semantics of
QemuEvent"), which added it.
atomic_mb_read() was used until the commit instead.
include/qemu/atomic.h at that time had the following comment:
> /* atomic_mb_read/set semantics map Java volatile variables. They are
> * less expensive on some platforms (notably POWER & ARMv7) than fully
> * sequentially consistent operations.
We place a full barrier anyway, so we no longer have a reason to have
qatomic_read() before performing qatomic_xchg().
I also found smp_mb__after_rmw() before qemu_futex_wake_all() is no
longer necessary. Summing up, I think the code should look like as follows:
#ifdef HAVE_FUTEX
if (qatomic_xchg(&ev->value, EV_SET) == EV_BUSY) {
/* There were waiters, wake them up. */
qemu_futex_wake_all(ev);
}
#else
pthread_mutex_lock(&ev->lock);
qatomic_store_release(&ev->value, EV_SET);
pthread_cond_broadcast(&ev->cond);
pthread_mutex_unlock(&ev->lock);
#endif
Regards,
Akihiko Odaki
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 04/10] qemu-thread: Avoid futex abstraction for non-Linux
2025-05-16 12:58 ` Akihiko Odaki
@ 2025-05-16 14:25 ` Paolo Bonzini
2025-05-17 5:40 ` Akihiko Odaki
0 siblings, 1 reply; 29+ messages in thread
From: Paolo Bonzini @ 2025-05-16 14:25 UTC (permalink / raw)
To: Akihiko Odaki
Cc: Stefan Weil, Peter Xu, Fabiano Rosas, Hailiang Zhang,
Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé
[-- Attachment #1: Type: text/plain, Size: 964 bytes --]
Il ven 16 mag 2025, 08:58 Akihiko Odaki <akihiko.odaki@daynix.com> ha
scritto:
> I also found smp_mb__after_rmw() before qemu_futex_wake_all() is no
> longer necessary. Summing up, I think the code should look like as follows:
>
> #ifdef HAVE_FUTEX
>
You would still need smp_mb__before_rmw() here.
if (qatomic_xchg(&ev->value, EV_SET) == EV_BUSY) {
>
Removing the qatomic_read() works, but it's more expensive in the case that
the event is already set.
The barrier before qemu_futex_wake_all(ev) could be unnecessary because
there is probably one in FUTEX_WAKE. But not being able to audit Windows
makes me a bit uneasy about it.
Paolo
/* There were waiters, wake them up. */
> qemu_futex_wake_all(ev);
> }
#else
> pthread_mutex_lock(&ev->lock);
> qatomic_store_release(&ev->value, EV_SET);
> pthread_cond_broadcast(&ev->cond);
> pthread_mutex_unlock(&ev->lock);
> #endif
>
> Regards,
> Akihiko Odaki
>
>
[-- Attachment #2: Type: text/html, Size: 2240 bytes --]
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 01/10] futex: Check value after qemu_futex_wait()
2025-05-16 5:34 ` Akihiko Odaki
@ 2025-05-16 14:53 ` Peter Xu
2025-05-17 5:01 ` Akihiko Odaki
0 siblings, 1 reply; 29+ messages in thread
From: Peter Xu @ 2025-05-16 14:53 UTC (permalink / raw)
To: Akihiko Odaki
Cc: Paolo Bonzini, Stefan Weil, Fabiano Rosas, Hailiang Zhang,
Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé
On Fri, May 16, 2025 at 02:34:33PM +0900, Akihiko Odaki wrote:
> On 2025/05/15 2:06, Peter Xu wrote:
> > On Wed, May 14, 2025 at 04:34:33PM +0900, Akihiko Odaki wrote:
> > > On 2025/05/13 23:39, 'Peter Xu' via devel wrote:
> > > > On Sun, May 11, 2025 at 03:08:18PM +0900, Akihiko Odaki wrote:
> > > > > futex(2) - Linux manual page
> > > > > https://man7.org/linux/man-pages/man2/futex.2.html
> > > > > > Note that a wake-up can also be caused by common futex usage patterns
> > > > > > in unrelated code that happened to have previously used the futex
> > > > > > word's memory location (e.g., typical futex-based implementations of
> > > > > > Pthreads mutexes can cause this under some conditions). Therefore,
> > > > > > callers should always conservatively assume that a return value of 0
> > > > > > can mean a spurious wake-up, and use the futex word's value (i.e.,
> > > > > > the user-space synchronization scheme) to decide whether to continue
> > > > > > to block or not.
> > > >
> > > > I'm just curious - do you know when this will happen?
> > > >
> > > > AFAIU, QEMU uses futex always on private mappings, internally futex does
> > > > use (mm, HVA) tuple to index a futex, afaict. Hence, I don't see how it
> > > > can get spurious wakeups.. And _if_ it happens, since mm pointer can't
> > > > change it must mean the HVA of the futex word is reused, it sounds like an
> > > > UAF user bug to me instead.
> >
> > [1]
> >
> > > >
> > > > I checked the man-pages git repo, this line was introduced in:
> > > >
> > > > https://github.com/mkerrisk/man-pages/commit/4b35dc5dabcf356ce6dcb1f949f7b00e76c7587d
> > > >
> > > > I also didn't see details yet in commit message on why that paragraph was
> > > > added.
> > > >
> > > > And..
> > > >
> > > > >
> > > > > Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> > > > > ---
> > > > > include/qemu/futex.h | 9 +++++++++
> > > > > tests/unit/test-aio-multithread.c | 4 +++-
> > > > > util/qemu-thread-posix.c | 28 ++++++++++++++++------------
> > > > > 3 files changed, 28 insertions(+), 13 deletions(-)
> > > > >
> > > > > diff --git a/include/qemu/futex.h b/include/qemu/futex.h
> > > > > index 91ae88966e12..f57774005330 100644
> > > > > --- a/include/qemu/futex.h
> > > > > +++ b/include/qemu/futex.h
> > > > > @@ -24,6 +24,15 @@ static inline void qemu_futex_wake(void *f, int n)
> > > > > qemu_futex(f, FUTEX_WAKE, n, NULL, NULL, 0);
> > > > > }
> > > > > +/*
> > > > > + * Note that a wake-up can also be caused by common futex usage patterns in
> > > > > + * unrelated code that happened to have previously used the futex word's
> > > > > + * memory location (e.g., typical futex-based implementations of Pthreads
> > > > > + * mutexes can cause this under some conditions). Therefore, callers should
> > > >
> > > > .. another thing that was unclear to me is, here it's mentioning "typical
> > > > futex-based implementations of pthreads mutexes..", but here
> > > > qemu_futex_wait() is using raw futex without any pthread impl. Does it
> > > > also mean that this may not be applicable to whatever might cause a
> > > > spurious wakeup?
> > >
> > > No. The man-page mentions "unrelated code that happened to have previously
> > > used the futex word's memory location", so it doesn't matter whether we use
> > > pthread here.
> > >
> > > libpthread and even this QemuEvent follows the "common futex usage" so we
> > > should do what is written in the man page.
> > >
> > > Unfortunately the man page does not describe the "common futex usage
> > > pattern". It looks like as follows:
> > >
> > > Assume there are two threads, one atomic variable, and one futex.
> > >
> > > Thread A does the following:
> > > A1. Read the atomic variable.
> > > A2. Go A5 if the atomic variable is zero.
> > > A3. Wait using the futex.
> > > A4. Go A1.
> > > A5. Free the atomic variable and the futex.
> > >
> > > Thread B does the following:
> > > B1. Set the atomic variable to zero.
> > > B2. Wake up using the futex.
> > >
> > > In this example, the execution may happen in the following order:
> > > B1 -> A1 -> A2 -> A5 -> B2
> > >
> > > Here, B2 will cause a spurious wake up of QemuEvent if the freed memory gets
> > > reused for QemuEvent.
> >
> > This is true.
> >
> > Said that, if to follow my previous statement at [1] above, here I think A5
> > is the UAF bug I mentioned, trying to free the lock object with existing
> > user (Thread B) accessing the object.
> >
> > IMHO, the userapp should make sure the object will never be freed if
> > there's any possible user of it, and that includes a waker like Thread B.
> >
> > For futex, the futex word (which is the important bit here relevant to
> > possible spurious wakeups) is part of the lock object, hence if the lock
> > object isn't freed too early it won't ever get reused, and then there
> > should have no chance of spurious wakeups in the futex context.
>
> It is a UAF, but it is by design and not a bug.
>
> The principle of the futex design is to use atomic memory operations to
> manage the state instead of using a system call, which is more expensive.
>
> This principle motivates tolerating spurious wakeups. If wakeup system calls
> after free are forbidden, a thread will need to use a (expensive) system
> call to ensure the wake up actually happened before freeing. Instead, we can
> tolerate spurious wakeups without causing a buggy behavior by making the
> waiting thread perform (cheaper) atomic memory reads to verify the expected
> state.
Right, that's also my understanding that it's by design for futex from
kernel POV.
Which I am not yet sure is whether it's by design to be used in userapp so
that a spurious wakeup could happen. From which regard, I still think
maybe we shouldn't have that paragraph in the man page at all, at least it
can be clearer when put into man pages.
So now the question is, do we have such use case so that QEMU needs to free
a qemu_futex_*() API based lock _before_ any wakeups?
QEMU only has two locks impled on top of direct futex, which is QemuEvent
and QemuLockCnt. From what I can tell on how they're used (not a lot),
none of them will use such as a feature, so IIUC it means QEMU still should
be free from such UAF issue, and it's definitely not a feature either.
Hence if any spurious wakeup happened in QEMU, it's a real bug.
From that POV, IMHO it would make more sense if we allow spurious wakeup
iff it's proved to be necessary (e.g. when QEMU wants to make it a feature
not a bug), and also I worry we're copying the man page content all over
into QEMU tree but just in case it's inaccurate to be applied to QEMU's
context at all.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 01/10] futex: Check value after qemu_futex_wait()
2025-05-16 14:53 ` Peter Xu
@ 2025-05-17 5:01 ` Akihiko Odaki
0 siblings, 0 replies; 29+ messages in thread
From: Akihiko Odaki @ 2025-05-17 5:01 UTC (permalink / raw)
To: Peter Xu
Cc: Paolo Bonzini, Stefan Weil, Fabiano Rosas, Hailiang Zhang,
Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé
On 2025/05/16 23:53, Peter Xu wrote:
> On Fri, May 16, 2025 at 02:34:33PM +0900, Akihiko Odaki wrote:
>> On 2025/05/15 2:06, Peter Xu wrote:
>>> On Wed, May 14, 2025 at 04:34:33PM +0900, Akihiko Odaki wrote:
>>>> On 2025/05/13 23:39, 'Peter Xu' via devel wrote:
>>>>> On Sun, May 11, 2025 at 03:08:18PM +0900, Akihiko Odaki wrote:
>>>>>> futex(2) - Linux manual page
>>>>>> https://man7.org/linux/man-pages/man2/futex.2.html
>>>>>>> Note that a wake-up can also be caused by common futex usage patterns
>>>>>>> in unrelated code that happened to have previously used the futex
>>>>>>> word's memory location (e.g., typical futex-based implementations of
>>>>>>> Pthreads mutexes can cause this under some conditions). Therefore,
>>>>>>> callers should always conservatively assume that a return value of 0
>>>>>>> can mean a spurious wake-up, and use the futex word's value (i.e.,
>>>>>>> the user-space synchronization scheme) to decide whether to continue
>>>>>>> to block or not.
>>>>>
>>>>> I'm just curious - do you know when this will happen?
>>>>>
>>>>> AFAIU, QEMU uses futex always on private mappings, internally futex does
>>>>> use (mm, HVA) tuple to index a futex, afaict. Hence, I don't see how it
>>>>> can get spurious wakeups.. And _if_ it happens, since mm pointer can't
>>>>> change it must mean the HVA of the futex word is reused, it sounds like an
>>>>> UAF user bug to me instead.
>>>
>>> [1]
>>>
>>>>>
>>>>> I checked the man-pages git repo, this line was introduced in:
>>>>>
>>>>> https://github.com/mkerrisk/man-pages/commit/4b35dc5dabcf356ce6dcb1f949f7b00e76c7587d
>>>>>
>>>>> I also didn't see details yet in commit message on why that paragraph was
>>>>> added.
>>>>>
>>>>> And..
>>>>>
>>>>>>
>>>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
>>>>>> ---
>>>>>> include/qemu/futex.h | 9 +++++++++
>>>>>> tests/unit/test-aio-multithread.c | 4 +++-
>>>>>> util/qemu-thread-posix.c | 28 ++++++++++++++++------------
>>>>>> 3 files changed, 28 insertions(+), 13 deletions(-)
>>>>>>
>>>>>> diff --git a/include/qemu/futex.h b/include/qemu/futex.h
>>>>>> index 91ae88966e12..f57774005330 100644
>>>>>> --- a/include/qemu/futex.h
>>>>>> +++ b/include/qemu/futex.h
>>>>>> @@ -24,6 +24,15 @@ static inline void qemu_futex_wake(void *f, int n)
>>>>>> qemu_futex(f, FUTEX_WAKE, n, NULL, NULL, 0);
>>>>>> }
>>>>>> +/*
>>>>>> + * Note that a wake-up can also be caused by common futex usage patterns in
>>>>>> + * unrelated code that happened to have previously used the futex word's
>>>>>> + * memory location (e.g., typical futex-based implementations of Pthreads
>>>>>> + * mutexes can cause this under some conditions). Therefore, callers should
>>>>>
>>>>> .. another thing that was unclear to me is, here it's mentioning "typical
>>>>> futex-based implementations of pthreads mutexes..", but here
>>>>> qemu_futex_wait() is using raw futex without any pthread impl. Does it
>>>>> also mean that this may not be applicable to whatever might cause a
>>>>> spurious wakeup?
>>>>
>>>> No. The man-page mentions "unrelated code that happened to have previously
>>>> used the futex word's memory location", so it doesn't matter whether we use
>>>> pthread here.
>>>>
>>>> libpthread and even this QemuEvent follows the "common futex usage" so we
>>>> should do what is written in the man page.
>>>>
>>>> Unfortunately the man page does not describe the "common futex usage
>>>> pattern". It looks like as follows:
>>>>
>>>> Assume there are two threads, one atomic variable, and one futex.
>>>>
>>>> Thread A does the following:
>>>> A1. Read the atomic variable.
>>>> A2. Go A5 if the atomic variable is zero.
>>>> A3. Wait using the futex.
>>>> A4. Go A1.
>>>> A5. Free the atomic variable and the futex.
>>>>
>>>> Thread B does the following:
>>>> B1. Set the atomic variable to zero.
>>>> B2. Wake up using the futex.
>>>>
>>>> In this example, the execution may happen in the following order:
>>>> B1 -> A1 -> A2 -> A5 -> B2
>>>>
>>>> Here, B2 will cause a spurious wake up of QemuEvent if the freed memory gets
>>>> reused for QemuEvent.
>>>
>>> This is true.
>>>
>>> Said that, if to follow my previous statement at [1] above, here I think A5
>>> is the UAF bug I mentioned, trying to free the lock object with existing
>>> user (Thread B) accessing the object.
>>>
>>> IMHO, the userapp should make sure the object will never be freed if
>>> there's any possible user of it, and that includes a waker like Thread B.
>>>
>>> For futex, the futex word (which is the important bit here relevant to
>>> possible spurious wakeups) is part of the lock object, hence if the lock
>>> object isn't freed too early it won't ever get reused, and then there
>>> should have no chance of spurious wakeups in the futex context.
>>
>> It is a UAF, but it is by design and not a bug.
>>
>> The principle of the futex design is to use atomic memory operations to
>> manage the state instead of using a system call, which is more expensive.
>>
>> This principle motivates tolerating spurious wakeups. If wakeup system calls
>> after free are forbidden, a thread will need to use a (expensive) system
>> call to ensure the wake up actually happened before freeing. Instead, we can
>> tolerate spurious wakeups without causing a buggy behavior by making the
>> waiting thread perform (cheaper) atomic memory reads to verify the expected
>> state.
>
> Right, that's also my understanding that it's by design for futex from
> kernel POV.
I think it also makes sense from the userspace POV; it is a common truth
that atomic memory operations are cheaper than system calls.
>
> Which I am not yet sure is whether it's by design to be used in userapp so
> that a spurious wakeup could happen. From which regard, I still think
> maybe we shouldn't have that paragraph in the man page at all, at least it
> can be clearer when put into man pages.
Eliminating spurious wakeups requires removing the paragraph from the
man page and updating all libraries (including libpthread) not to make
spurious wakeups, which takes a long time. We need to prepare for
spurious wakeups for now.
I agree the man page can be clearer; the paragraph assumes readers
naively follow what it says, but you need more insights; I also had to
spend some time to understand the QemuEvent code and the libpthread
code, which was unnecessary if the man page describes "the common futex
usage pattern".
>
> So now the question is, do we have such use case so that QEMU needs to free
> a qemu_futex_*() API based lock _before_ any wakeups?
We need to care external libraries that may use futex and at least I
know libpthread can cause spurious wakeups.
Regards,
Akihiko Odaki
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 04/10] qemu-thread: Avoid futex abstraction for non-Linux
2025-05-16 14:25 ` Paolo Bonzini
@ 2025-05-17 5:40 ` Akihiko Odaki
2025-05-17 10:24 ` Paolo Bonzini
0 siblings, 1 reply; 29+ messages in thread
From: Akihiko Odaki @ 2025-05-17 5:40 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Stefan Weil, Peter Xu, Fabiano Rosas, Hailiang Zhang,
Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé
On 2025/05/16 23:25, 'Paolo Bonzini' via devel wrote:
>
>
> Il ven 16 mag 2025, 08:58 Akihiko Odaki <akihiko.odaki@daynix.com
> <mailto:akihiko.odaki@daynix.com>> ha scritto:
>
> I also found smp_mb__after_rmw() before qemu_futex_wake_all() is no
> longer necessary. Summing up, I think the code should look like as
> follows:
>
> #ifdef HAVE_FUTEX
>
>
> You would still need smp_mb__before_rmw() here.
docs/devel/atomics.rst says:
> however, according to the C11 memory model that QEMU uses, this order
> does not propagate to other memory accesses on either side of the
> read-modify-write operation. As far as those are concerned, the
> operation consist of just a load-acquire followed by a store-release.
> Stores that precede the RMW operation, and loads that follow it, can
> still be reordered and will happen *in the middle* of the
> read-modify-write operation!
I think we only need a store-release, which is ensured even by the C11
read-modify-write operation; we only need to ensure that ev->value is
set to EV_SET after all stores specified earlier appear.
>
> if (qatomic_xchg(&ev->value, EV_SET) == EV_BUSY) {
>
>
> Removing the qatomic_read() works, but it's more expensive in the case
> that the event is already set.
Removing the qatomic_read() (while keeping smp_mb()) is more expensive
in that case indeed, but I'm not so sure if the case is common enough
that it depreciates the overhead added in the other cases.
I don't know whether the qatomic_read() has improved the performance or
not, but perhaps it is still a premature optimization.
> > The barrier before qemu_futex_wake_all(ev) could be unnecessary
because
> there is probably one in FUTEX_WAKE. But not being able to audit Windows
> makes me a bit uneasy about it.
With "[PATCH v3 01/10] futex: Check value after qemu_futex_wait()",
qemu_event_wait() always calls qatomic_load_acquire(&ev->value) or
qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) before returning, so a
store-release of ev->value is sufficient to ensure ordering and we don't
need to rely on qemu_futex_wait() for that.
Regards,
Akihiko Odaki
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 04/10] qemu-thread: Avoid futex abstraction for non-Linux
2025-05-17 5:40 ` Akihiko Odaki
@ 2025-05-17 10:24 ` Paolo Bonzini
2025-05-17 16:30 ` Akihiko Odaki
0 siblings, 1 reply; 29+ messages in thread
From: Paolo Bonzini @ 2025-05-17 10:24 UTC (permalink / raw)
To: Akihiko Odaki
Cc: Stefan Weil, Peter Xu, Fabiano Rosas, Hailiang Zhang,
Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé
[-- Attachment #1: Type: text/plain, Size: 1690 bytes --]
Il sab 17 mag 2025, 01:41 Akihiko Odaki <akihiko.odaki@daynix.com> ha
scritto:
> I think we only need a store-release, which is ensured even by the C11
> read-modify-write operation; we only need to ensure that ev->value is
> set to EV_SET after all stores specified earlier appear.
>
You really need a barrier to order the store against the load
unfortunately. Likewise in qemu_event_wait(). It's really central to this
synchronization pattern, otherwise it's possible that neither side sees the
action of the other (set does not see the request to wake, or wait does not
see EV_SET).
Paolo
> >
> > if (qatomic_xchg(&ev->value, EV_SET) == EV_BUSY) {
> >
> >
> > Removing the qatomic_read() works, but it's more expensive in the case
> > that the event is already set.
>
> Removing the qatomic_read() (while keeping smp_mb()) is more expensive
> in that case indeed, but I'to m not so sure if the case is common enough
> that it depreciates the overhead added in the other cases.
>
> I don't know whether the qatomic_read() has improved the performance or
> not, but perhaps it is still a premature optimization.
>
> > > The barrier before qemu_futex_wake_all(ev) could be unnecessary
> because
> > there is probably one in FUTEX_WAKE. But not being able to audit Windows
> > makes me a bit uneasy about it.
>
> With "[PATCH v3 01/10] futex: Check value after qemu_futex_wait()",
> qemu_event_wait() always calls qatomic_load_acquire(&ev->value) or
> qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) before returning, so a
> store-release of ev->value is sufficient to ensure ordering and we don't
> need to rely on qemu_futex_wait() for that.
>
> Regards,
> Akihiko Odaki
>
>
[-- Attachment #2: Type: text/html, Size: 2553 bytes --]
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 04/10] qemu-thread: Avoid futex abstraction for non-Linux
2025-05-17 10:24 ` Paolo Bonzini
@ 2025-05-17 16:30 ` Akihiko Odaki
0 siblings, 0 replies; 29+ messages in thread
From: Akihiko Odaki @ 2025-05-17 16:30 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Stefan Weil, Peter Xu, Fabiano Rosas, Hailiang Zhang,
Phil Dennis-Jordan, qemu-devel, devel, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé
On 2025/05/17 19:24, Paolo Bonzini wrote:
>
>
> Il sab 17 mag 2025, 01:41 Akihiko Odaki <akihiko.odaki@daynix.com
> <mailto:akihiko.odaki@daynix.com>> ha scritto:
>
> I think we only need a store-release, which is ensured even by the C11
> read-modify-write operation; we only need to ensure that ev->value is
> set to EV_SET after all stores specified earlier appear.
>
>
> You really need a barrier to order the store against the load
> unfortunately. Likewise in qemu_event_wait(). It's really central to
> this synchronization pattern, otherwise it's possible that neither side
> sees the action of the other (set does not see the request to wake, or
> wait does not see EV_SET).
The code I suggested does not order stores before qemu_event_set() and
the load in the function unlike qemu_event_wait(), which orders stores
before the loads in the function. I'll show how the code still satisfies
its goal below.
Below is the list of all relevant memory accesses in the thread calling
qemu_event_set() and the one calling qemu_event_wait():
Thread A:
A1. Specify stores
A2. Call qemu_event_set()
A2-1. Call qatomic_xchg()
A2-1-1. Load ev->value
A2-1-2. Store ev->value
A2-2. Call qemu_futex_wake_all()
A2-2-1. Wake up
Thread B:
B1. Call qemu_event_wait()
B1-1. Call qatomic_load_acquire()
B1-1-1. Load ev->value
B1-2. Call qatomic_cmpxchg()
B1-2-1. Load ev->value
B1-2-2. Store ev->value
B1-3. Call qemu_futex_wait()
B1-3-1. Load ev->value
B1-3-2. Wait
B2. Specify loads
The goal is to satisfy the following two cross-thread ordering:
a) B1-3-2 -> A2-2-1 (start waiting -> wake up)
b) A1 -> B2 (stores before setting -> loads after waiting)
First, I'll show that a) is satisfied. There are three facts to consider:
- There are only two stores for ev->value: A2-1-2 and B1-2-2.
- A2-1-2 stores EV_SET.
- B1-2-2 stores EV_BUSY.
- qemu_futex_wait() atomically performs B1-3-1 and B1-3-2.
- B1-3-1 will not appear if EV_SET was loaded earlier.
- B1-3-2 (wait) appears when B1-3-1 loads EV_BUSY.
These facts ensures one of the following orders when B1-3-2 appears:
B1-2-2 (store EV_BUSY)
-> B1-3-1 (loads EV_BUSY)
-> B1-3-2 (start waiting)
-> A2-1-2 (store EV_SET)
-> A2-2-1 (wake up)
Next, I'll show that b) is satisfied.
- A1 (loads before qemu_event_set()) appears before A2-1-2.
- Only A2-1-2 stores EV_SET.
- B2 (stores after waiting) appears after B1-1-1 or B1-2-1 loads EV_SET.
Therefore, the following order are ensured when B2 appears:
A1 (specify loads before qemu_event_set())
-> A2-1-2 (store EV_SET)
-> B1-1-1 or B1-2-1 (load EV_SET)
-> B2 (specify stores after qemu_event_wait())
Regards,
Akihiko Odaki
^ permalink raw reply [flat|nested] 29+ messages in thread
end of thread, other threads:[~2025-05-17 16:31 UTC | newest]
Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-11 6:08 [PATCH v3 00/10] Improve futex usage Akihiko Odaki
2025-05-11 6:08 ` [PATCH v3 01/10] futex: Check value after qemu_futex_wait() Akihiko Odaki
2025-05-13 14:39 ` Peter Xu
2025-05-14 7:34 ` Akihiko Odaki
2025-05-14 17:06 ` Peter Xu
2025-05-16 5:34 ` Akihiko Odaki
2025-05-16 14:53 ` Peter Xu
2025-05-17 5:01 ` Akihiko Odaki
2025-05-11 6:08 ` [PATCH v3 02/10] futex: Support Windows Akihiko Odaki
2025-05-11 6:08 ` [PATCH v3 03/10] futex: Replace __linux__ with CONFIG_LINUX Akihiko Odaki
2025-05-11 6:08 ` [PATCH v3 04/10] qemu-thread: Avoid futex abstraction for non-Linux Akihiko Odaki
2025-05-14 0:51 ` Paolo Bonzini
2025-05-14 12:57 ` Akihiko Odaki
2025-05-16 11:09 ` Paolo Bonzini
2025-05-16 12:58 ` Akihiko Odaki
2025-05-16 14:25 ` Paolo Bonzini
2025-05-17 5:40 ` Akihiko Odaki
2025-05-17 10:24 ` Paolo Bonzini
2025-05-17 16:30 ` Akihiko Odaki
2025-05-11 6:08 ` [PATCH v3 05/10] qemu-thread: Use futex for QemuEvent on Windows Akihiko Odaki
2025-05-11 6:08 ` [PATCH v3 06/10] qemu-thread: Use futex if available for QemuLockCnt Akihiko Odaki
2025-05-11 6:08 ` [PATCH v3 07/10] migration: Replace QemuSemaphore with QemuEvent Akihiko Odaki
2025-05-13 19:13 ` Peter Xu
2025-05-14 12:36 ` Akihiko Odaki
2025-05-11 6:08 ` [PATCH v3 08/10] migration/colo: " Akihiko Odaki
2025-05-12 15:12 ` Fabiano Rosas
2025-05-11 6:08 ` [PATCH v3 09/10] migration/postcopy: " Akihiko Odaki
2025-05-12 15:12 ` Fabiano Rosas
2025-05-11 6:08 ` [PATCH v3 10/10] hw/display/apple-gfx: " Akihiko Odaki
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).