* [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
@ 2026-02-20 20:26 André Almeida
2026-02-20 20:26 ` [RFC PATCH 1/2] futex: Create reproducer for robust_list race condition André Almeida
` (3 more replies)
0 siblings, 4 replies; 23+ messages in thread
From: André Almeida @ 2026-02-20 20:26 UTC (permalink / raw)
To: Carlos O'Donell, Sebastian Andrzej Siewior, Peter Zijlstra,
Florian Weimer, Rich Felker, Torvald Riegel, Darren Hart,
Thomas Gleixner, Ingo Molnar, Davidlohr Bueso, Arnd Bergmann,
Mathieu Desnoyers, Liam R . Howlett
Cc: kernel-dev, linux-api, linux-kernel, André Almeida
During LPC 2025, I presented a session about creating a new syscall for
robust_list[0][1]. However, most of the discussion in the session wasn't about
the new syscall itself, but rather about an old bug that exists in the
current robust_list mechanism.
Since at least 2012, there's an open bug reporting a race condition, as
Carlos O'Donell pointed out:
"File corruption race condition in robust mutex unlocking"
https://sourceware.org/bugzilla/show_bug.cgi?id=14485
To help understand the bug, I've created a reproducer (patch 1/2) and a
companion kernel hack (patch 2/2) that helps to make the race condition
more likely. When the bug happens, the reproducer shows a message
comparing the original memory with the corrupted one:
"Memory was corrupted by the kernel: 8001fe8d8001fe8d vs 8001fe8dc0000000"
I'm not sure yet what the appropriate approach to fix it would be, so I
decided to reach out to the community before moving forward in any
particular direction. One suggestion from Peter[2] revolves around
serializing the mmap() and robust list exit paths, which might add
overhead to the common case, where list_op_pending is empty.
However, given that a new interface is being prepared, this could also
be an opportunity to rethink how list_op_pending works and get rid of
the race condition by design.
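For readers unfamiliar with the mechanism, the pointer arithmetic the
kernel's exit-time walk relies on can be sketched like this (struct names
mirror the reproducer in patch 1/2; this is an illustrative stand-in, not
the kernel's actual code):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Minimal stand-in for the kernel/userspace contract: each robust list
 * entry points at a struct robust_list embedded in the user's lock
 * object, and the futex_offset registered via set_robust_list(2)
 * recovers the futex word from that entry pointer.
 */
struct robust_list { struct robust_list *next; };

struct lock_struct {
	uint32_t futex;
	struct robust_list list;
};

/* Offset userspace registers in robust_list_head.futex_offset */
static long lock_futex_offset(void)
{
	return (long)offsetof(struct lock_struct, futex) -
	       (long)offsetof(struct lock_struct, list);
}

/*
 * What the exit path effectively computes for each entry it visits,
 * including a stale list_op_pending pointer -- which is where the race
 * turns into a wild write.
 */
static uint32_t *entry_to_futex(struct robust_list *entry)
{
	return (uint32_t *)((char *)entry + lock_futex_offset());
}
```

If list_op_pending still points into memory that has been reused, the
same arithmetic happily produces an address inside the unrelated object.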
Feedback is very much welcome.
Thanks!
André
[0] https://lore.kernel.org/lkml/20251122-tonyk-robust_futex-v6-0-05fea005a0fd@igalia.com/
[1] https://lpc.events/event/19/contributions/2108/
[2] https://lore.kernel.org/lkml/20241219171344.GA26279@noisy.programming.kicks-ass.net/
André Almeida (2):
futex: Create reproducer for robust_list race condition
futex: Add debug delays
kernel/futex/core.c | 10 +++
robust_bug.c | 178 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 188 insertions(+)
create mode 100644 robust_bug.c
--
2.53.0
^ permalink raw reply [flat|nested] 23+ messages in thread
* [RFC PATCH 1/2] futex: Create reproducer for robust_list race condition
2026-02-20 20:26 [RFC PATCH 0/2] futex: how to solve the robust_list race condition? André Almeida
@ 2026-02-20 20:26 ` André Almeida
2026-03-12 9:04 ` Sebastian Andrzej Siewior
2026-02-20 20:26 ` [RFC PATCH 2/2] futex: hack: Add debug delays André Almeida
` (2 subsequent siblings)
3 siblings, 1 reply; 23+ messages in thread
From: André Almeida @ 2026-02-20 20:26 UTC (permalink / raw)
To: Carlos O'Donell, Sebastian Andrzej Siewior, Peter Zijlstra,
Florian Weimer, Rich Felker, Torvald Riegel, Darren Hart,
Thomas Gleixner, Ingo Molnar, Davidlohr Bueso, Arnd Bergmann,
Mathieu Desnoyers, Liam R . Howlett
Cc: kernel-dev, linux-api, linux-kernel, André Almeida
Create a reproducer for https://sourceware.org/bugzilla/show_bug.cgi?id=14485
This is not supposed to be merged.
Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
robust_bug.c | 178 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 178 insertions(+)
create mode 100644 robust_bug.c
diff --git a/robust_bug.c b/robust_bug.c
new file mode 100644
index 000000000000..1ade4e6d66dd
--- /dev/null
+++ b/robust_bug.c
@@ -0,0 +1,178 @@
+/*
+ * gcc robust_bug.c -o robust_bug
+ *
+ * This is a reproducer for "File corruption race condition in robust
+ * mutex unlocking" from https://sourceware.org/bugzilla/show_bug.cgi?id=14485
+ *
+ * To increase the chances of reaching the race condition, a delay can be added
+ * to the kernel function handle_futex_death(), just before the user memory
+ * write in futex_cmpxchg_value_locked().
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <linux/futex.h>
+#include <pthread.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+#include <time.h>
+
+#define cpu_relax() asm volatile("rep; nop")
+
+/*
+ * This struct is an example of a lock struct, shared between the threads.
+ */
+struct lock_struct {
+ uint32_t futex;
+ struct robust_list list;
+};
+
+static struct lock_struct *lock;
+
+/*
+ * This is the struct that we are going to use to allocate on top of the
+ * freed memory to observe the race condition.
+ */
+struct another_struct {
+ uint64_t value;
+};
+
+static pthread_barrier_t barrier;
+
+static int set_robust_list(struct robust_list_head *head)
+{
+ return syscall(SYS_set_robust_list, head, sizeof(*head));
+}
+
+/*
+ * This thread emulates the behaviour of a thread releasing a robust mutex:
+ * - It starts by adding the mutex to the op_pending field
+ * - Remove the mutex from the robust list
+ * - Release the lock and wake up waiters
+ * - Remove the mutex from the op_pending field
+ *
+ * However, this thread dies before doing this last step, leaving the mutex
+ * behind in the op_pending field.
+ */
+void *func_b(void *arg)
+{
+ static struct robust_list_head head;
+ pid_t tid = gettid() | FUTEX_WAITERS;
+
+ /*
+ * Initial thread setup. This would happen in an earlier stage of the
+ * thread execution.
+ */
+ set_robust_list(&head);
+ head.list.next = &head.list;
+ head.futex_offset = (size_t) offsetof(struct lock_struct, futex) -
+ (size_t) offsetof(struct lock_struct, list);
+
+ /* This thread takes the lock... */
+ lock->futex = tid;
+
+ /* ...would do some work here... */
+
+ /*
+ * ...and starts the release process. Adds the mutex to be released on
+ * the op_pending.
+ */
+ head.list_op_pending = &lock->list;
+
+ /* Barrier to synchronize thread B taking the lock */
+ pthread_barrier_wait(&barrier);
+ usleep(100);
+
+ /*
+ * Here we would release the lock and wake up any waiters.
+ *
+ * lock->futex = LOCK_FREE;
+ * futex_wake(lock->futex, 1);
+ */
+
+ /*
+ * We would remove the lock from op_pending, but we emulate a thread
+ * exiting before doing it.
+ */
+ return NULL;
+}
+
+int main(int argc, char *argv[])
+{
+ struct another_struct *new;
+ uint64_t original_val;
+ pthread_t thread_b;
+ uint32_t value;
+ int ret;
+
+ ret = pthread_barrier_init(&barrier, NULL, 2);
+ if (ret) {
+ puts("pthread_barrier_init failed");
+ return -1;
+ }
+
+ /* Initialize the lock */
+ lock = mmap(NULL, sizeof(struct lock_struct), PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+ if (lock == MAP_FAILED) {
+ puts("mmap failed");
+ return -1;
+ }
+ memset(lock, 0, sizeof(*lock));
+
+ /* Create the thread B that will take the lock */
+ pthread_create(&thread_b, NULL, func_b, NULL);
+
+ /* Barrier to synchronize thread B taking the lock */
+ pthread_barrier_wait(&barrier);
+
+ /* Copy this value as we will use it later */
+ value = lock->futex;
+
+ /*
+ * Here, this thread would do the following:
+ * - It would wait for the lock, and be woken by thread B
+ * - Take the lock, do some work, and release it
+ * - After releasing the lock and being the last user, it can correctly
+ * free it
+ */
+ munmap(lock, sizeof(struct lock_struct));
+
+ /*
+ * After freeing the lock, this thread allocates memory, which
+ * happens to be at the same address as the lock, and by chance, it fills
+ * the memory with the TID of thread B.
+ */
+ new = mmap(NULL, sizeof(struct another_struct), PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+ if (new == MAP_FAILED) {
+ puts("mmap failed");
+ return -1;
+ }
+ if ((uintptr_t) lock != (uintptr_t) new) {
+ puts("mmap got a different address");
+ return -1;
+ }
+
+ new->value = ((uint64_t) value << 32) + value;
+
+ /* Create a backup of the current value */
+ original_val = new->value;
+
+ /* Wait for the memory corruption to happen... */
+ while (new->value == original_val)
+ cpu_relax();
+
+ /* ...and now the kernel just overwrote an unrelated user memory! */
+ printf("Memory was corrupted by the kernel: %lx vs %lx\n",
+ original_val, new->value);
+
+ munmap(new, sizeof(struct another_struct));
+
+ return 0;
+}
--
2.53.0
* [RFC PATCH 2/2] futex: hack: Add debug delays
2026-02-20 20:26 [RFC PATCH 0/2] futex: how to solve the robust_list race condition? André Almeida
2026-02-20 20:26 ` [RFC PATCH 1/2] futex: Create reproducer for robust_list race condition André Almeida
@ 2026-02-20 20:26 ` André Almeida
2026-02-20 20:51 ` [RFC PATCH 0/2] futex: how to solve the robust_list race condition? Liam R. Howlett
2026-02-20 21:42 ` Mathieu Desnoyers
3 siblings, 0 replies; 23+ messages in thread
From: André Almeida @ 2026-02-20 20:26 UTC (permalink / raw)
To: Carlos O'Donell, Sebastian Andrzej Siewior, Peter Zijlstra,
Florian Weimer, Rich Felker, Torvald Riegel, Darren Hart,
Thomas Gleixner, Ingo Molnar, Davidlohr Bueso, Arnd Bergmann,
Mathieu Desnoyers, Liam R . Howlett
Cc: kernel-dev, linux-api, linux-kernel, André Almeida
Add delays to handle_futex_death() to increase the chance of hitting the race
condition.
Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
kernel/futex/core.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index cf7e610eac42..d409b3368cb3 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -44,6 +44,7 @@
#include <linux/prctl.h>
#include <linux/mempolicy.h>
#include <linux/mmap_lock.h>
+#include <linux/delay.h>
#include "futex.h"
#include "../locking/rtmutex_common.h"
@@ -1095,6 +1096,12 @@ static int handle_futex_death(u32 __user *uaddr, struct task_struct *curr,
* does not guarantee R/W access. If that fails we
* give up and leave the futex locked.
*/
+
+ if (!strcmp(current->comm, "robust_bug")) {
+ printk("robust_bug is exiting\n");
+ msleep(500);
+ }
+
if ((err = futex_cmpxchg_value_locked(&nval, uaddr, uval, mval))) {
switch (err) {
case -EFAULT:
@@ -1112,6 +1119,9 @@ static int handle_futex_death(u32 __user *uaddr, struct task_struct *curr,
}
}
+ if (!strcmp(current->comm, "robust_bug"))
+ printk("memory written\n");
+
if (nval != uval)
goto retry;
--
2.53.0
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
2026-02-20 20:26 [RFC PATCH 0/2] futex: how to solve the robust_list race condition? André Almeida
2026-02-20 20:26 ` [RFC PATCH 1/2] futex: Create reproducer for robust_list race condition André Almeida
2026-02-20 20:26 ` [RFC PATCH 2/2] futex: hack: Add debug delays André Almeida
@ 2026-02-20 20:51 ` Liam R. Howlett
2026-02-27 19:15 ` André Almeida
2026-02-20 21:42 ` Mathieu Desnoyers
3 siblings, 1 reply; 23+ messages in thread
From: Liam R. Howlett @ 2026-02-20 20:51 UTC (permalink / raw)
To: André Almeida
Cc: Carlos O'Donell, Sebastian Andrzej Siewior, Peter Zijlstra,
Florian Weimer, Rich Felker, Torvald Riegel, Darren Hart,
Thomas Gleixner, Ingo Molnar, Davidlohr Bueso, Arnd Bergmann,
Mathieu Desnoyers, kernel-dev, linux-api, linux-kernel,
Suren Baghdasaryan, Lorenzo Stoakes, Michal Hocko
+Cc Suren, Lorenzo, and Michal
* André Almeida <andrealmeid@igalia.com> [260220 15:27]:
> During LPC 2025, I presented a session about creating a new syscall for
> robust_list[0][1]. However, most of the session discussion wasn't much related
> to the new syscall itself, but much more related to an old bug that exists in
> the current robust_list mechanism.
Ah, sorry for hijacking the session, that was not my intention, but this
needs to be addressed before we propagate the issue into the next
iteration.
>
> Since at least 2012, there's an open bug reporting a race condition, as
> Carlos O'Donell pointed out:
>
> "File corruption race condition in robust mutex unlocking"
> https://sourceware.org/bugzilla/show_bug.cgi?id=14485
>
> To help understand the bug, I've created a reproducer (patch 1/2) and a
> companion kernel hack (patch 2/2) that helps to make the race condition
> more likely. When the bug happens, the reproducer shows a message
> comparing the original memory with the corrupted one:
>
> "Memory was corrupted by the kernel: 8001fe8d8001fe8d vs 8001fe8dc0000000"
>
> I'm not sure yet what the appropriate approach to fix it would be, so I
> decided to reach out to the community before moving forward in any
> particular direction. One suggestion from Peter[2] revolves around
> serializing the mmap() and robust list exit paths, which might add
> overhead to the common case, where list_op_pending is empty.
>
> However, given that a new interface is being prepared, this could also
> be an opportunity to rethink how list_op_pending works and get rid of
> the race condition by design.
>
> Feedback is very much welcome.
There was a delay added to the oom reaper for these tasks [1] by commit
e4a38402c36e ("oom_kill.c: futex: delay the OOM reaper to allow time for
proper futex cleanup")
We did discuss marking the vmas as needing to be skipped by the oom
reaper, but no clear path forward emerged. It's also not clear if
that's the only area where such a problem exists.
[1]. https://lore.kernel.org/all/20220414144042.677008-1-npache@redhat.com/T/#u
>
> Thanks!
> André
>
> [0] https://lore.kernel.org/lkml/20251122-tonyk-robust_futex-v6-0-05fea005a0fd@igalia.com/
> [1] https://lpc.events/event/19/contributions/2108/
> [2] https://lore.kernel.org/lkml/20241219171344.GA26279@noisy.programming.kicks-ass.net/
>
> André Almeida (2):
> futex: Create reproducer for robust_list race condition
> futex: Add debug delays
>
> kernel/futex/core.c | 10 +++
> robust_bug.c | 178 ++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 188 insertions(+)
> create mode 100644 robust_bug.c
>
> --
> 2.53.0
>
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
2026-02-20 20:26 [RFC PATCH 0/2] futex: how to solve the robust_list race condition? André Almeida
` (2 preceding siblings ...)
2026-02-20 20:51 ` [RFC PATCH 0/2] futex: how to solve the robust_list race condition? Liam R. Howlett
@ 2026-02-20 21:42 ` Mathieu Desnoyers
2026-02-20 22:41 ` Mathieu Desnoyers
3 siblings, 1 reply; 23+ messages in thread
From: Mathieu Desnoyers @ 2026-02-20 21:42 UTC (permalink / raw)
To: André Almeida, Carlos O'Donell,
Sebastian Andrzej Siewior, Peter Zijlstra, Florian Weimer,
Rich Felker, Torvald Riegel, Darren Hart, Thomas Gleixner,
Ingo Molnar, Davidlohr Bueso, Arnd Bergmann, Liam R . Howlett,
Lorenzo Stoakes, Michal Hocko
Cc: kernel-dev, linux-api, linux-kernel, libc-alpha
+CC libc-alpha.
On 2026-02-20 15:26, André Almeida wrote:
> During LPC 2025, I presented a session about creating a new syscall for
> robust_list[0][1]. However, most of the session discussion wasn't much related
> to the new syscall itself, but much more related to an old bug that exists in
> the current robust_list mechanism.
>
> Since at least 2012, there's an open bug reporting a race condition, as
> Carlos O'Donell pointed out:
>
> "File corruption race condition in robust mutex unlocking"
> https://sourceware.org/bugzilla/show_bug.cgi?id=14485
>
> To help understand the bug, I've created a reproducer (patch 1/2) and a
> companion kernel hack (patch 2/2) that helps to make the race condition
> more likely. When the bug happens, the reproducer shows a message
> comparing the original memory with the corrupted one:
>
> "Memory was corrupted by the kernel: 8001fe8d8001fe8d vs 8001fe8dc0000000"
>
> I'm not sure yet what the appropriate approach to fix it would be, so I
> decided to reach out to the community before moving forward in any
> particular direction. One suggestion from Peter[2] revolves around
> serializing the mmap() and robust list exit paths, which might add
> overhead to the common case, where list_op_pending is empty.
>
> However, given that a new interface is being prepared, this could also
> be an opportunity to rethink how list_op_pending works and get rid of
> the race condition by design.
>
> Feedback is very much welcome.
Looking at this bug, one thing I'm starting to consider is that it
appears to be an issue inherent to the lack of synchronization between
pthread_mutex_destroy(3) and the per-thread list_op_pending fields,
and not so much a kernel issue.
Here is why I think the issue is purely userspace:
Let's suppose we have a memory area shared across Process 1 and Process 2,
which internally has its own custom userspace memory allocator to
allocate/free space within that shared memory.
Process 1, Thread A stumbles through the scenario highlighted by this bug, and
basically gets preempted at this FIXME in libc __pthread_mutex_unlock_full():
if (__glibc_unlikely ((atomic_exchange_release (&mutex->__data.__lock, 0)
& FUTEX_WAITERS) != 0))
futex_wake ((unsigned int *) &mutex->__data.__lock, 1, private);
/* We must clear op_pending after we release the mutex.
FIXME However, this violates the mutex destruction requirements
because another thread could acquire the mutex, destroy it, and
reuse the memory for something else; then, if this thread crashes,
and the memory happens to have a value equal to the TID, the kernel
will believe it is still related to the mutex (which has been
destroyed already) and will modify some other random object. */
__asm ("" ::: "memory");
THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
Then Process 1, Thread B runs, grabs the lock, releases it, and based on
program state it knows it can pthread_mutex_destroy() this lock, free its
associated memory through the custom shared memory allocator, and allocate
it for other purposes. Then we get to the point where Process 1 is
killed, and where the robust futex kernel code corrupts data in shared
memory because of the dangling list_op_pending pointer.
That shared memory data is still observable by Process B, which will get a
corrupted state.
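Concretely, the check and write that handle_futex_death() applies to
whatever word list_op_pending points at can be sketched as follows
(constants as in linux/futex.h; the helper names are mine):

```c
#include <stdint.h>
#include <stdbool.h>

#define FUTEX_TID_MASK   0x3fffffffU
#define FUTEX_WAITERS    0x80000000U
#define FUTEX_OWNER_DIED 0x40000000U

/*
 * If the low bits of the word happen to equal the dying thread's TID,
 * the kernel treats it as a held robust futex -- even if the memory has
 * since been reused for something else entirely...
 */
static bool kernel_would_write(uint32_t uval, uint32_t tid)
{
	return (uval & FUTEX_TID_MASK) == tid;
}

/*
 * ...and cmpxchgs this value in, which is exactly the 0xc0000000 low
 * word seen in the reproducer's corruption message.
 */
static uint32_t owner_died_value(uint32_t uval)
{
	return (uval & FUTEX_WAITERS) | FUTEX_OWNER_DIED;
}
```

With the reproducer's value 0x8001fe8d (TID 0x1fe8d with FUTEX_WAITERS
set), both conditions line up and the word becomes 0xc0000000.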
Notice how this all happens without any munmap(2)/mmap(2) in the
sequence? This is why I think this is purely a userspace issue rather
than an issue we can solve by adding extra synchronization in the kernel.
The one point in that sequence where I think we can add synchronization
is pthread_mutex_destroy(3) in libc. One possible "big hammer" solution
would be to make pthread_mutex_destroy iterate over all other threads'
list_op_pending fields and busy-wait if it finds that the mutex address
is in use. It would of course only have to do that for robust futexes.
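A rough sketch of that "big hammer" scan, assuming access to a
glibc-internal list of thread descriptors (struct thread_desc and the
traversal here are hypothetical stand-ins for that internal state):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical per-thread descriptor: list_op_pending mirrors the field
 * the kernel reads from robust_list_head at thread exit.
 */
struct thread_desc {
	_Atomic(void *) list_op_pending;
	struct thread_desc *next;
};

/*
 * Returns true while any sibling thread is still mid-unlock on @mutex,
 * i.e. its list_op_pending points at the mutex being destroyed.
 */
static bool mutex_in_flight(struct thread_desc *threads, const void *mutex)
{
	for (struct thread_desc *t = threads; t; t = t->next)
		if (atomic_load_explicit(&t->list_op_pending,
					 memory_order_acquire) == mutex)
			return true;
	return false;
}

/*
 * pthread_mutex_destroy() would spin (or sleep briefly) on
 * mutex_in_flight() before declaring the memory safe to reuse.
 */
```

The cost scales with the number of threads, which is what motivates the
refcount and hazard-pointer alternatives mentioned below.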
If that big hammer solution is not fast enough for many-threaded use-cases,
then we can think of other approaches such as adding a reference counter
in the mutex structure, or introducing hazard pointers in userspace to reduce
synchronization iteration from nr_threads to nr_cpus (or even down to max
rseq mm_cid).
Thoughts?
Thanks,
Mathieu
>
> Thanks!
> André
>
> [0] https://lore.kernel.org/lkml/20251122-tonyk-robust_futex-v6-0-05fea005a0fd@igalia.com/
> [1] https://lpc.events/event/19/contributions/2108/
> [2] https://lore.kernel.org/lkml/20241219171344.GA26279@noisy.programming.kicks-ass.net/
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
2026-02-20 21:42 ` Mathieu Desnoyers
@ 2026-02-20 22:41 ` Mathieu Desnoyers
2026-02-20 23:17 ` Mathieu Desnoyers
0 siblings, 1 reply; 23+ messages in thread
From: Mathieu Desnoyers @ 2026-02-20 22:41 UTC (permalink / raw)
To: André Almeida, Carlos O'Donell,
Sebastian Andrzej Siewior, Peter Zijlstra, Florian Weimer,
Rich Felker, Torvald Riegel, Darren Hart, Thomas Gleixner,
Ingo Molnar, Davidlohr Bueso, Arnd Bergmann, Liam R . Howlett,
Lorenzo Stoakes, Michal Hocko
Cc: kernel-dev, linux-api, linux-kernel, libc-alpha
On 2026-02-20 16:42, Mathieu Desnoyers wrote:
> +CC libc-alpha.
>
> On 2026-02-20 15:26, André Almeida wrote:
>> During LPC 2025, I presented a session about creating a new syscall for
>> robust_list[0][1]. However, most of the session discussion wasn't much
>> related
>> to the new syscall itself, but much more related to an old bug that
>> exists in
>> the current robust_list mechanism.
>>
>> Since at least 2012, there's an open bug reporting a race condition, as
>> Carlos O'Donell pointed out:
>>
>> "File corruption race condition in robust mutex unlocking"
>> https://sourceware.org/bugzilla/show_bug.cgi?id=14485
>>
>> To help understand the bug, I've created a reproducer (patch 1/2) and a
>> companion kernel hack (patch 2/2) that helps to make the race condition
>> more likely. When the bug happens, the reproducer shows a message
>> comparing the original memory with the corrupted one:
>>
>> "Memory was corrupted by the kernel: 8001fe8d8001fe8d vs
>> 8001fe8dc0000000"
>>
>> I'm not sure yet what the appropriate approach to fix it would be, so I
>> decided to reach out to the community before moving forward in any
>> particular direction. One suggestion from Peter[2] revolves around
>> serializing the mmap() and robust list exit paths, which might add
>> overhead to the common case, where list_op_pending is empty.
>>
>> However, given that a new interface is being prepared, this could also
>> be an opportunity to rethink how list_op_pending works and get rid of
>> the race condition by design.
>>
>> Feedback is very much welcome.
>
> Looking at this bug, one thing I'm starting to consider is that it
> appears to be an issue inherent to lack of synchronization between
> pthread_mutex_destroy(3) and the per-thread list_op_pending fields
> and not so much a kernel issue.
>
> Here is why I think the issue is purely userspace:
>
> Let's suppose we have a shared memory area across Processes 1 and
> Process 2,
> which internally have its own custom memory allocator in userspace to
> allocate/free space within that shared memory.
>
> Process 1, Thread A stumbles through the scenario highlighted by this
> bug, and
> basically gets preempted at this FIXME in libc
> __pthread_mutex_unlock_full():
>
> if (__glibc_unlikely ((atomic_exchange_release (&mutex->__data.__lock, 0)
>                        & FUTEX_WAITERS) != 0))
>   futex_wake ((unsigned int *) &mutex->__data.__lock, 1, private);
>
> /* We must clear op_pending after we release the mutex.
> FIXME However, this violates the mutex destruction requirements
> because another thread could acquire the mutex, destroy it, and
> reuse the memory for something else; then, if this thread
> crashes,
> and the memory happens to have a value equal to the TID, the
> kernel
> will believe it is still related to the mutex (which has been
> destroyed already) and will modify some other random object. */
> __asm ("" ::: "memory");
> THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
>
> Then Process 1, Thread B runs, grabs the lock, releases it, and based on
> program state it knows it can pthread_mutex_destroy() this lock, free its
> associated memory through the custom shared memory allocator, and allocate
> it for other purposes. Then we get to the point where Process 1 is
> killed, and where the robust futex kernel code corrupts data in shared
> memory because of the dangling list_op_pending pointer.
>
> That shared memory data is still observable by Process B, which will get a
> corrupted state.
>
> Notice how this all happens without any munmap(2)/mmap(2) in the sequence ?
> This is why I think this is purely a userspace issue rather than an issue
> we can solve by adding extra synchronization in the kernel.
>
> The one point we have in that sequence where I think we can add
> synchronization
> is pthread_mutex_destroy(3) in libc. One possible "big hammer" solution
> would be
> to make pthread_mutex_destroy iterate on all other threads list_op_pending
> and busy-wait if it finds that the mutex address is in use. It would of
> course
> only have to do that for robust futexes.
>
> If that big hammer solution is not fast enough for many-threaded use-cases,
> then we can think of other approaches such as adding a reference counter
> in the mutex structure, or introducing hazard pointers in userspace to
> reduce
> synchronization iteration from nr_threads to nr_cpus (or even down to max
> rseq mm_cid).
To make matters even worse, the pthread_mutex_destroy(3) and reallocation
could happen from Process 2 rather than Process 1. So iterating over the
threads of Process 1 is not sufficient. We'd need to synchronize
pthread_mutex_destroy on something within the mutex structure which is
observable from all processes using the lock, for instance a reference count.
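Such a cross-process-visible count could live in the shared mutex itself;
a minimal sketch, with hypothetical field and function names (the real
layout would have to fit inside pthread_mutex_t):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical layout: a count of threads currently between "set
 * list_op_pending" and "clear list_op_pending" lives in the mutex, so
 * it is visible to every process mapping the shared memory.
 */
struct robust_mutex {
	_Atomic(uint32_t) futex;      /* the futex word */
	_Atomic(uint32_t) unlockers;  /* threads mid-unlock */
};

/* Taken before setting list_op_pending in the unlock path... */
static void unlock_begin(struct robust_mutex *m)
{
	atomic_fetch_add_explicit(&m->unlockers, 1, memory_order_acquire);
}

/* ...and dropped only after list_op_pending is cleared again. */
static void unlock_end(struct robust_mutex *m)
{
	atomic_fetch_sub_explicit(&m->unlockers, 1, memory_order_release);
}

/* destroy would wait for this before the memory may be reclaimed */
static bool safe_to_destroy(struct robust_mutex *m)
{
	return atomic_load_explicit(&m->unlockers, memory_order_acquire) == 0;
}
```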
Thanks,
Mathieu
>
> Thoughts ?
>
> Thanks,
>
> Mathieu
>
>>
>> Thanks!
>> André
>>
>> [0] https://lore.kernel.org/lkml/20251122-tonyk-robust_futex-
>> v6-0-05fea005a0fd@igalia.com/
>> [1] https://lpc.events/event/19/contributions/2108/
>> [2] https://lore.kernel.org/
>> lkml/20241219171344.GA26279@noisy.programming.kicks-ass.net/
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
2026-02-20 22:41 ` Mathieu Desnoyers
@ 2026-02-20 23:17 ` Mathieu Desnoyers
2026-02-23 11:13 ` Florian Weimer
2026-02-27 19:16 ` André Almeida
0 siblings, 2 replies; 23+ messages in thread
From: Mathieu Desnoyers @ 2026-02-20 23:17 UTC (permalink / raw)
To: André Almeida, Carlos O'Donell,
Sebastian Andrzej Siewior, Peter Zijlstra, Florian Weimer,
Rich Felker, Torvald Riegel, Darren Hart, Thomas Gleixner,
Ingo Molnar, Davidlohr Bueso, Arnd Bergmann, Liam R . Howlett,
Lorenzo Stoakes, Michal Hocko
Cc: kernel-dev, linux-api, linux-kernel, libc-alpha
On 2026-02-20 17:41, Mathieu Desnoyers wrote:
> On 2026-02-20 16:42, Mathieu Desnoyers wrote:
>> +CC libc-alpha.
>>
>> On 2026-02-20 15:26, André Almeida wrote:
>>> During LPC 2025, I presented a session about creating a new syscall for
>>> robust_list[0][1]. However, most of the session discussion wasn't
>>> much related
>>> to the new syscall itself, but much more related to an old bug that
>>> exists in
>>> the current robust_list mechanism.
>>>
>>> Since at least 2012, there's an open bug reporting a race condition, as
>>> Carlos O'Donell pointed out:
>>>
>>> "File corruption race condition in robust mutex unlocking"
>>> https://sourceware.org/bugzilla/show_bug.cgi?id=14485
>>>
>>> To help understand the bug, I've created a reproducer (patch 1/2) and a
>>> companion kernel hack (patch 2/2) that helps to make the race condition
>>> more likely. When the bug happens, the reproducer shows a message
>>> comparing the original memory with the corrupted one:
>>>
>>> "Memory was corrupted by the kernel: 8001fe8d8001fe8d vs
>>> 8001fe8dc0000000"
>>>
>>> I'm not sure yet what the appropriate approach to fix it would be, so I
>>> decided to reach out to the community before moving forward in any
>>> particular direction. One suggestion from Peter[2] revolves around
>>> serializing the mmap() and robust list exit paths, which might add
>>> overhead to the common case, where list_op_pending is empty.
>>>
>>> However, given that a new interface is being prepared, this could also
>>> be an opportunity to rethink how list_op_pending works and get rid of
>>> the race condition by design.
>>>
>>> Feedback is very much welcome.
>>
>> Looking at this bug, one thing I'm starting to consider is that it
>> appears to be an issue inherent to lack of synchronization between
>> pthread_mutex_destroy(3) and the per-thread list_op_pending fields
>> and not so much a kernel issue.
>>
>> Here is why I think the issue is purely userspace:
>>
>> Let's suppose we have a shared memory area across Processes 1 and
>> Process 2,
>> which internally have its own custom memory allocator in userspace to
>> allocate/free space within that shared memory.
>>
>> Process 1, Thread A stumbles through the scenario highlighted by this
>> bug, and
>> basically gets preempted at this FIXME in libc
>> __pthread_mutex_unlock_full():
>>
>> if (__glibc_unlikely ((atomic_exchange_release (&mutex->__data.__lock, 0)
>>                        & FUTEX_WAITERS) != 0))
>>   futex_wake ((unsigned int *) &mutex->__data.__lock, 1, private);
>>
>> /* We must clear op_pending after we release the mutex.
>> FIXME However, this violates the mutex destruction requirements
>> because another thread could acquire the mutex, destroy it, and
>> reuse the memory for something else; then, if this thread
>> crashes,
>> and the memory happens to have a value equal to the TID, the
>> kernel
>> will believe it is still related to the mutex (which has been
>> destroyed already) and will modify some other random
>> object. */
>> __asm ("" ::: "memory");
>> THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
>>
>> Then Process 1, Thread B runs, grabs the lock, releases it, and based on
>> program state it knows it can pthread_mutex_destroy() this lock, free its
>> associated memory through the custom shared memory allocator, and
>> allocate
>> it for other purposes. Then we get to the point where Process 1 is
>> killed, and where the robust futex kernel code corrupts data in shared
>> memory because of the dangling list_op_pending pointer.
>>
>> That shared memory data is still observable by Process B, which will
>> get a
>> corrupted state.
>>
>> Notice how this all happens without any munmap(2)/mmap(2) in the
>> sequence ?
>> This is why I think this is purely a userspace issue rather than an issue
>> we can solve by adding extra synchronization in the kernel.
>>
>> The one point we have in that sequence where I think we can add
>> synchronization
>> is pthread_mutex_destroy(3) in libc. One possible "big hammer"
>> solution would be
>> to make pthread_mutex_destroy iterate on all other threads
>> list_op_pending
>> and busy-wait if it finds that the mutex address is in use. It would
>> of course
>> only have to do that for robust futexes.
>>
>> If that big hammer solution is not fast enough for many-threaded use-
>> cases,
>> then we can think of other approaches such as adding a reference counter
>> in the mutex structure, or introducing hazard pointers in userspace to
>> reduce
>> synchronization iteration from nr_threads to nr_cpus (or even down to max
>> rseq mm_cid).
>
> To make matters even worse, the pthread_mutex_destroy(3) and reallocation
> could happen from Process 2 rather than Process 1. So iterating on a
> threads from Process 1 is not sufficient. We'd need to synchronize
> pthread_mutex_destroy on something within the mutex structure which is
> observable from all processes using the lock, for instance a reference
> count.
Trying to find a backward-compatible way to solve this may be tricky.
Here is one possible approach I have in mind: Introduce a new syscall,
e.g. sys_cleanup_robust_list(void *addr)
This system call would be invoked on pthread_mutex_destroy(3) of
robust mutexes, and do the following:
- Calculate the offset of @addr within its mapping,
- Iterate over all processes which map the backing store that contains
the lock address @addr,
- Iterate on each thread sibling within each of those processes,
- If the thread has a robust list, and its list_op_pending points
to the same offset within the backing store mapping, clear the
list_op_pending pointer.
The overhead would be added specifically to pthread_mutex_destroy(3),
and only for robust mutexes.
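From userspace, the call site would look roughly like this. No syscall
number exists for this proposal; the number below is a placeholder, so
on current kernels the wrapper simply fails with ENOSYS:

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <errno.h>

/*
 * Hypothetical wrapper for the proposed syscall. 4444 is an unallocated
 * placeholder number, so current kernels return -1 with errno == ENOSYS.
 */
#define __NR_cleanup_robust_list_hyp 4444

static long cleanup_robust_list(void *addr)
{
	return syscall(__NR_cleanup_robust_list_hyp, addr);
}

/*
 * A robust pthread_mutex_destroy() would then call
 * cleanup_robust_list() on the mutex's futex word before the backing
 * memory may be released or reused.
 */
```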
Thoughts?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
2026-02-20 23:17 ` Mathieu Desnoyers
@ 2026-02-23 11:13 ` Florian Weimer
2026-02-23 13:37 ` Mathieu Desnoyers
2026-02-27 19:16 ` André Almeida
1 sibling, 1 reply; 23+ messages in thread
From: Florian Weimer @ 2026-02-23 11:13 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: André Almeida, Carlos O'Donell,
Sebastian Andrzej Siewior, Peter Zijlstra, Rich Felker,
Torvald Riegel, Darren Hart, Thomas Gleixner, Ingo Molnar,
Davidlohr Bueso, Arnd Bergmann, Liam R . Howlett, Lorenzo Stoakes,
Michal Hocko, kernel-dev, linux-api, linux-kernel, libc-alpha
* Mathieu Desnoyers:
> Trying to find a backward compatible way to solve this may be tricky.
> Here is one possible approach I have in mind: Introduce a new syscall,
> e.g. sys_cleanup_robust_list(void *addr)
>
> This system call would be invoked on pthread_mutex_destroy(3) of
> robust mutexes, and do the following:
>
> - Calculate the offset of @addr within its mapping,
> - Iterate on all processes which map the backing store which contain
> the lock address @addr.
> - Iterate on each thread sibling within each of those processes,
> - If the thread has a robust list, and its list_op_pending points
> to the same offset within the backing store mapping, clear the
> list_op_pending pointer.
>
> The overhead would be added specifically to pthread_mutex_destroy(3),
> and only for robust mutexes.
Would we have to do this for pthread_mutex_destroy only, or also for
pthread_join? It is defined to exit a thread with mutexes still locked,
and the pthread_join call could mean that the application can determine
by its own logic that the backing store can be deallocated.
Thanks,
Florian
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
2026-02-23 11:13 ` Florian Weimer
@ 2026-02-23 13:37 ` Mathieu Desnoyers
2026-02-23 13:47 ` Rich Felker
0 siblings, 1 reply; 23+ messages in thread
From: Mathieu Desnoyers @ 2026-02-23 13:37 UTC (permalink / raw)
To: Florian Weimer
Cc: André Almeida, Carlos O'Donell,
Sebastian Andrzej Siewior, Peter Zijlstra, Rich Felker,
Torvald Riegel, Darren Hart, Thomas Gleixner, Ingo Molnar,
Davidlohr Bueso, Arnd Bergmann, Liam R . Howlett, Lorenzo Stoakes,
Michal Hocko, kernel-dev, linux-api, linux-kernel, libc-alpha
On 2026-02-23 06:13, Florian Weimer wrote:
> * Mathieu Desnoyers:
>
>> Trying to find a backward compatible way to solve this may be tricky.
>> Here is one possible approach I have in mind: Introduce a new syscall,
>> e.g. sys_cleanup_robust_list(void *addr)
>>
>> This system call would be invoked on pthread_mutex_destroy(3) of
>> robust mutexes, and do the following:
>>
>> - Calculate the offset of @addr within its mapping,
>> - Iterate on all processes which map the backing store which contain
>> the lock address @addr.
>> - Iterate on each thread sibling within each of those processes,
>> - If the thread has a robust list, and its list_op_pending points
>> to the same offset within the backing store mapping, clear the
>> list_op_pending pointer.
>>
>> The overhead would be added specifically to pthread_mutex_destroy(3),
>> and only for robust mutexes.
>
> Would we have to do this for pthread_mutex_destroy only, or also for
> pthread_join? It is defined to exit a thread with mutexes still locked,
> and the pthread_join call could mean that the application can determine
> by its own logic that the backing store can be deallocated.
Let me try to wrap my head around this scenario.
AFAIU, the https://man7.org/linux/man-pages/man3/pthread_join.3.html
NOTES section states the following for pthread_join(3):
After a successful call to pthread_join(), the caller is
guaranteed that the target thread has terminated. The caller may
then choose to do any clean-up that is required after termination
of the thread (e.g., freeing memory or other resources that were
allocated to the target thread).
What is the behavior when a thread exits with a mutex locked ? I would
expect that this mutex stays locked and the pthread_join(3) caller gets
to release that mutex and eventually calls pthread_mutex_destroy(3) if
the application logic allows it.
But it looks like you are implying that the pthread_mutex_destroy(3) is
somehow implicit to pthread_join, and I really don't understand that
part. Am I missing something ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
2026-02-23 13:37 ` Mathieu Desnoyers
@ 2026-02-23 13:47 ` Rich Felker
0 siblings, 0 replies; 23+ messages in thread
From: Rich Felker @ 2026-02-23 13:47 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Florian Weimer, André Almeida, Carlos O'Donell,
Sebastian Andrzej Siewior, Peter Zijlstra, Torvald Riegel,
Darren Hart, Thomas Gleixner, Ingo Molnar, Davidlohr Bueso,
Arnd Bergmann, Liam R . Howlett, Lorenzo Stoakes, Michal Hocko,
kernel-dev, linux-api, linux-kernel, libc-alpha
On Mon, Feb 23, 2026 at 08:37:13AM -0500, Mathieu Desnoyers wrote:
> On 2026-02-23 06:13, Florian Weimer wrote:
> > * Mathieu Desnoyers:
> >
> > > Trying to find a backward compatible way to solve this may be tricky.
> > > Here is one possible approach I have in mind: Introduce a new syscall,
> > > e.g. sys_cleanup_robust_list(void *addr)
> > >
> > > This system call would be invoked on pthread_mutex_destroy(3) of
> > > robust mutexes, and do the following:
> > >
> > > - Calculate the offset of @addr within its mapping,
> > > - Iterate on all processes which map the backing store which contain
> > > the lock address @addr.
> > > - Iterate on each thread sibling within each of those processes,
> > > - If the thread has a robust list, and its list_op_pending points
> > > to the same offset within the backing store mapping, clear the
> > > list_op_pending pointer.
> > >
> > > The overhead would be added specifically to pthread_mutex_destroy(3),
> > > and only for robust mutexes.
> >
> > Would we have to do this for pthread_mutex_destroy only, or also for
> > pthread_join? It is defined to exit a thread with mutexes still locked,
> > and the pthread_join call could mean that the application can determine
> > by its own logic that the backing store can be deallocated.
> Let me try to wrap my head around this scenario.
>
> AFAIU, the https://man7.org/linux/man-pages/man3/pthread_join.3.html
> NOTES section states the following for pthread_join(3):
>
> After a successful call to pthread_join(), the caller is
> guaranteed that the target thread has terminated. The caller may
> then choose to do any clean-up that is required after termination
> of the thread (e.g., freeing memory or other resources that were
> allocated to the target thread).
>
> What is the behavior when a thread exits with a mutex locked ? I would
> expect that this mutex stays locked
For a robust mutex, if the owning thread exits, the mutex enters
EOWNERDEAD state.
Otherwise, per POSIX the mutex just remains permanently locked and
undestroyable. glibc does not actually implement this for recursive or
errorchecking mutexes, as the tid might get reused and then the new
thread that got the same tid will now behave as if it were the owner
(e.g. it's allowed to take further recursive locks or observe itself
as the owner via EDEADLK). In musl we implement this by putting all
recursive and errorchecking mutexes on a robust list to reassign an
unmatchable tid to them at pthread_exit time.
> and the pthread_join(3) caller gets
> to release that mutex and eventually calls pthread_mutex_destroy(3) if
> the application logic allows it.
No other thread can release the mutex that was left locked unless it
was robust and it goes via the EOWNERDEAD/recovery process. Nor can
you legally call pthread_mutex_destroy on a mutex that's still owned.
Rich
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
2026-02-20 20:51 ` [RFC PATCH 0/2] futex: how to solve the robust_list race condition? Liam R. Howlett
@ 2026-02-27 19:15 ` André Almeida
0 siblings, 0 replies; 23+ messages in thread
From: André Almeida @ 2026-02-27 19:15 UTC (permalink / raw)
To: Liam R. Howlett
Cc: Carlos O'Donell, Sebastian Andrzej Siewior, Peter Zijlstra,
Florian Weimer, Rich Felker, Torvald Riegel, Darren Hart,
Thomas Gleixner, Ingo Molnar, Davidlohr Bueso, Arnd Bergmann,
Mathieu Desnoyers, kernel-dev, linux-api, linux-kernel,
Suren Baghdasaryan, Lorenzo Stoakes, Michal Hocko
Hi Liam,
Em 20/02/2026 17:51, Liam R. Howlett escreveu:
> +Cc Suren, Lorenzo, and Michal
>
> * André Almeida <andrealmeid@igalia.com> [260220 15:27]:
>> During LPC 2025, I presented a session about creating a new syscall for
>> robust_list[0][1]. However, most of the session discussion wasn't much related
>> to the new syscall itself, but much more related to an old bug that exists in
>> the current robust_list mechanism.
>
> Ah, sorry for hijacking the session, that was not my intention, but this
> needs to be addressed before we propagate the issue into the next
> iteration.
>
No problem! I believe that this reflects the fact that the race
condition is the main concern about this new interface, and that we
should focus our discussion around this.
>>
>> Since at least 2012, there's an open bug reporting a race condition, as
>> Carlos O'Donell pointed out:
>>
>> "File corruption race condition in robust mutex unlocking"
>> https://sourceware.org/bugzilla/show_bug.cgi?id=14485
>>
[...]
>
> There was a delay added to the oom reaper for these tasks [1] by commit
> e4a38402c36e ("oom_kill.c: futex: delay the OOM reaper to allow time for
> proper futex cleanup")
>
> We did discuss marking the vmas as needing to be skipped by the oom
> manager, but no clear path forward emerged. It's also not clear if
> that's the only area where such a problem exists.
>
> [1]. https://lore.kernel.org/all/20220414144042.677008-1-npache@redhat.com/T/#u
>
So how would you detect which vmas should be skipped? And this won't fix
the issue when the memory is unmapped, right? It would only cover the OOM case.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
2026-02-20 23:17 ` Mathieu Desnoyers
2026-02-23 11:13 ` Florian Weimer
@ 2026-02-27 19:16 ` André Almeida
2026-02-27 19:59 ` Mathieu Desnoyers
1 sibling, 1 reply; 23+ messages in thread
From: André Almeida @ 2026-02-27 19:16 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: kernel-dev, Liam R . Howlett, linux-api, Darren Hart,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Florian Weimer,
Torvald Riegel, Davidlohr Bueso, Lorenzo Stoakes, Rich Felker,
Carlos O'Donell, Michal Hocko, linux-kernel, libc-alpha,
Arnd Bergmann, Sebastian Andrzej Siewior
Hi Mathieu,
Em 20/02/2026 20:17, Mathieu Desnoyers escreveu:
> On 2026-02-20 17:41, Mathieu Desnoyers wrote:
>> On 2026-02-20 16:42, Mathieu Desnoyers wrote:
>>> +CC libc-alpha.
>>>
>>> On 2026-02-20 15:26, André Almeida wrote:
>>>> During LPC 2025, I presented a session about creating a new syscall for
>>>> robust_list[0][1]. However, most of the session discussion wasn't
>>>> much related
>>>> to the new syscall itself, but much more related to an old bug that
>>>> exists in
>>>> the current robust_list mechanism.
>>>>
>>>> Since at least 2012, there's an open bug reporting a race condition, as
>>>> Carlos O'Donell pointed out:
>>>>
>>>> "File corruption race condition in robust mutex unlocking"
>>>> https://sourceware.org/bugzilla/show_bug.cgi?id=14485
>>>>
>>>> To help understand the bug, I've created a reproducer (patch 1/2) and a
>>>> companion kernel hack (patch 2/2) that helps to make the race condition
>>>> more likely. When the bug happens, the reproducer shows a message
>>>> comparing the original memory with the corrupted one:
>>>>
>>>> "Memory was corrupted by the kernel: 8001fe8d8001fe8d vs
>>>> 8001fe8dc0000000"
>>>>
>>>> I'm not sure yet what would be the appropriate approach to fix it,
>>>> so I
>>>> decided to reach the community before moving forward in some direction.
>>>> One suggestion from Peter[2] revolves around serializing the mmap()
>>>> and the
>>>> robust list exit path, which might cause overheads for the common case,
>>>> where list_op_pending is empty.
>>>>
>>>> However, given that there's a new interface being prepared, this could
>>>> also give the opportunity to rethink how list_op_pending works, and get
>>>> rid of the race condition by design.
>>>>
>>>> Feedback is very much welcome.
>>>
>>> Looking at this bug, one thing I'm starting to consider is that it
>>> appears to be an issue inherent to lack of synchronization between
>>> pthread_mutex_destroy(3) and the per-thread list_op_pending fields
>>> and not so much a kernel issue.
>>>
>>> Here is why I think the issue is purely userspace:
>>>
>>> Let's suppose we have a shared memory area across Processes 1 and
>>> Process 2,
>>> which internally have its own custom memory allocator in userspace to
>>> allocate/free space within that shared memory.
>>>
>>> Process 1, Thread A stumbles through the scenario highlighted by this
>>> bug, and
>>> basically gets preempted at this FIXME in libc
>>> __pthread_mutex_unlock_full():
>>>
>>> if (__glibc_unlikely ((atomic_exchange_release (&mutex-
>>> >__data.__lock, 0)
>>> & FUTEX_WAITERS) != 0))
>>> futex_wake ((unsigned int *) &mutex->__data.__lock, 1,
>>> private);
>>>
>>> /* We must clear op_pending after we release the mutex.
>>> FIXME However, this violates the mutex destruction
>>> requirements
>>> because another thread could acquire the mutex, destroy it,
>>> and
>>> reuse the memory for something else; then, if this thread
>>> crashes,
>>> and the memory happens to have a value equal to the TID,
>>> the kernel
>>> will believe it is still related to the mutex (which has been
>>> destroyed already) and will modify some other random
>>> object. */
>>> __asm ("" ::: "memory");
>>> THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
>>>
>>> Then Process 1, Thread B runs, grabs the lock, releases it, and based on
>>> program state it knows it can pthread_mutex_destroy() this lock, free
>>> its
>>> associated memory through the custom shared memory allocator, and
>>> allocate
>>> it for other purposes. Then we get to the point where Process 1 is
>>> killed, and where the robust futex kernel code corrupts data in shared
>>> memory because of the dangling list_op_pending pointer.
>>>
>>> That shared memory data is still observable by Process B, which will
>>> get a
>>> corrupted state.
>>>
>>> Notice how this all happens without any munmap(2)/mmap(2) in the
>>> sequence ?
>>> This is why I think this is purely a userspace issue rather than an
>>> issue
>>> we can solve by adding extra synchronization in the kernel.
>>>
>>> The one point we have in that sequence where I think we can add
>>> synchronization
>>> is pthread_mutex_destroy(3) in libc. One possible "big hammer"
>>> solution would be
>>> to make pthread_mutex_destroy iterate on all other threads
>>> list_op_pending
>>> and busy-wait if it finds that the mutex address is in use. It would
>>> of course
>>> only have to do that for robust futexes.
>>>
>>> If that big hammer solution is not fast enough for many-threaded use-
>>> cases,
>>> then we can think of other approaches such as adding a reference counter
>>> in the mutex structure, or introducing hazard pointers in userspace
>>> to reduce
>>> synchronization iteration from nr_threads to nr_cpus (or even down to
>>> max
>>> rseq mm_cid).
>>
>> To make matters even worse, the pthread_mutex_destroy(3) and reallocation
>> could happen from Process 2 rather than Process 1. So iterating on the
>> threads of Process 1 is not sufficient. We'd need to synchronize
>> pthread_mutex_destroy on something within the mutex structure which is
>> observable from all processes using the lock, for instance a reference
>> count.
> Trying to find a backward compatible way to solve this may be tricky.
> Here is one possible approach I have in mind: Introduce a new syscall,
> e.g. sys_cleanup_robust_list(void *addr)
>
> This system call would be invoked on pthread_mutex_destroy(3) of
> robust mutexes, and do the following:
>
> - Calculate the offset of @addr within its mapping,
> - Iterate on all processes which map the backing store which contain
> the lock address @addr.
> - Iterate on each thread sibling within each of those processes,
> - If the thread has a robust list, and its list_op_pending points
> to the same offset within the backing store mapping, clear the
> list_op_pending pointer.
>
> The overhead would be added specifically to pthread_mutex_destroy(3),
> and only for robust mutexes.
>
> Thoughts ?
>
Right, your explanation makes sense to me. I think the only difference
between alloc/free and map/munmap is that "freeing memory does not
actually return it to the operating system for other applications to
use"[1], so I don't know if this custom allocator is violating some
memory rules.
About the system call, we would call sys_cleanup_robust_list() before
freeing/unmapping the robust mutex. To guarantee that we check every
process that shares the memory region, would we need to check *every*
single process? I don't think there's a way to find such maps
without checking them all.
I'm trying to explore the idea of the reference counter. Would the
munmap() be blocked until the refcount goes to zero, or something like
that? I've also tried to find more examples of a memory region that's
shared between one or more processes and the kernel at the same time to
get some inspiration, but it seems robust_list is a quite unique design
regarding this memory sharing problem.
[1] https://sourceware.org/glibc/wiki/MallocInternals
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
2026-02-27 19:16 ` André Almeida
@ 2026-02-27 19:59 ` Mathieu Desnoyers
2026-02-27 20:41 ` Suren Baghdasaryan
2026-03-01 15:49 ` Mathieu Desnoyers
0 siblings, 2 replies; 23+ messages in thread
From: Mathieu Desnoyers @ 2026-02-27 19:59 UTC (permalink / raw)
To: André Almeida
Cc: kernel-dev, Liam R . Howlett, linux-api, Darren Hart,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Florian Weimer,
Torvald Riegel, Davidlohr Bueso, Lorenzo Stoakes, Rich Felker,
Carlos O'Donell, Michal Hocko, linux-kernel, libc-alpha,
Arnd Bergmann, Sebastian Andrzej Siewior
On 2026-02-27 14:16, André Almeida wrote:
[...]
>> Trying to find a backward compatible way to solve this may be tricky.
>> Here is one possible approach I have in mind: Introduce a new syscall,
>> e.g. sys_cleanup_robust_list(void *addr)
>>
>> This system call would be invoked on pthread_mutex_destroy(3) of
>> robust mutexes, and do the following:
>>
>> - Calculate the offset of @addr within its mapping,
>> - Iterate on all processes which map the backing store which contain
>> the lock address @addr.
>> - Iterate on each thread sibling within each of those processes,
>> - If the thread has a robust list, and its list_op_pending points
>> to the same offset within the backing store mapping, clear the
>> list_op_pending pointer.
>>
>> The overhead would be added specifically to pthread_mutex_destroy(3),
>> and only for robust mutexes.
>>
>> Thoughts ?
>>
[...]
>
> About the system call, we would call sys_cleanup_robust_list() before
> freeing/unmapping the robust mutex. To guarantee that we check every
> process that shares the memory region, would we need to check *every*
> > single process? I don't think there's a way to find such maps
> without checking them all.
We should be able to do it with just an iteration on the struct address_space
reverse mapping (list of vma which map the shared mapping).
AFAIU we'd want to get the struct address_space associated with the
__user pointer, then, while holding i_mmap_lock_read(mapping), iterate
on its reverse mapping (i_mmap field) with vma_interval_tree_foreach. We
can get each mm_struct through vma->vm_mm.
We'd want to do most of this in a kthread and use other mm_struct through
use_mm().
For each mm_struct, we go through the owner field to get the thread
group leader, and iterate on all thread siblings (for_each_thread).
For each of those threads, we'd want to clear the list_op_pending
if it matches the offset of @addr within the mapping. I suspect we'd
want to clear that userspace pointer with a futex_atomic_cmpxchg_inatomic
which only clears the pointer if the old value matches the one we expect.
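That walk could be sketched as kernel-side pseudocode (not buildable:
locking details, the user-pointer cmpxchg, and the fact that mm->owner
only exists with CONFIG_MEMCG are all glossed over):

```c
/* Pseudocode only: names follow the kernel APIs mentioned above,
 * but error handling and most locking are omitted. */
long cleanup_robust_list(void __user *uaddr)
{
	struct vm_area_struct *vma = find_vma(current->mm, (unsigned long)uaddr);
	struct address_space *mapping = vma->vm_file->f_mapping;
	pgoff_t pgoff = vma->vm_pgoff +
			(((unsigned long)uaddr - vma->vm_start) >> PAGE_SHIFT);

	i_mmap_lock_read(mapping);
	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
		struct task_struct *t;

		/* mm->owner requires CONFIG_MEMCG; that linkage is one of
		 * the open problems with this approach. */
		for_each_thread(vma->vm_mm->owner, t) {
			/* If t's robust list has a list_op_pending that
			 * resolves to the same offset in this mapping,
			 * cmpxchg it to NULL, but only if it still holds
			 * the expected value. */
		}
	}
	i_mmap_unlock_read(mapping);
	return 0;
}
```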
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
2026-02-27 19:59 ` Mathieu Desnoyers
@ 2026-02-27 20:41 ` Suren Baghdasaryan
2026-03-01 15:49 ` Mathieu Desnoyers
1 sibling, 0 replies; 23+ messages in thread
From: Suren Baghdasaryan @ 2026-02-27 20:41 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: André Almeida, kernel-dev, Liam R . Howlett, linux-api,
Darren Hart, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
Florian Weimer, Torvald Riegel, Davidlohr Bueso, Lorenzo Stoakes,
Rich Felker, Carlos O'Donell, Michal Hocko, linux-kernel,
libc-alpha, Arnd Bergmann, Sebastian Andrzej Siewior, npache
On Fri, Feb 27, 2026 at 8:00 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2026-02-27 14:16, André Almeida wrote:
> [...]
> >> Trying to find a backward compatible way to solve this may be tricky.
> >> Here is one possible approach I have in mind: Introduce a new syscall,
> >> e.g. sys_cleanup_robust_list(void *addr)
> >>
> >> This system call would be invoked on pthread_mutex_destroy(3) of
> >> robust mutexes, and do the following:
> >>
> >> - Calculate the offset of @addr within its mapping,
> >> - Iterate on all processes which map the backing store which contain
> >> the lock address @addr.
> >> - Iterate on each thread sibling within each of those processes,
> >> - If the thread has a robust list, and its list_op_pending points
> >> to the same offset within the backing store mapping, clear the
> >> list_op_pending pointer.
> >>
> >> The overhead would be added specifically to pthread_mutex_destroy(3),
> >> and only for robust mutexes.
> >>
> >> Thoughts ?
> >>
> [...]
> >
> > About the system call, we would call sys_cleanup_robust_list() before
> > freeing/unmapping the robust mutex. To guarantee that we check every
> > process that shares the memory region, would we need to check *every*
> > > single process? I don't think there's a way to find such maps
> > without checking them all.
>
> We should be able to do it with just an iteration on the struct address_space
> reverse mapping (list of vma which map the shared mapping).
>
> AFAIU we'd want to get the struct address_space associated with the
> __user pointer, then, while holding i_mmap_lock_read(mapping), iterate
> on its reverse mapping (i_mmap field) with vma_interval_tree_foreach. We
> can get each mm_struct through vma->vm_mm.
>
> We'd want to do most of this in a kthread and use other mm_struct through
> use_mm().
>
> For each mm_struct, we go through the owner field to get the thread
> group leader, and iterate on all thread siblings (for_each_thread).
>
> For each of those threads, we'd want to clear the list_op_pending
> if it matches the offset of @addr within the mapping. I suspect we'd
> want to clear that userspace pointer with a futex_atomic_cmpxchg_inatomic
> which only clears the pointer if the old value matches the one we expect.
I've been looking into this problem this week and IIUC Nico Pache
pursued this direction at some point (see [1]). I'm CC'ing him to
share his experience.
FYI, the link also contains an interesting discussion between Thomas
and Michal about difficulty of identifying all the VMAs possibly
involved in the lock chain and some technical challenges.
[1] https://lore.kernel.org/all/bd61369c-ef50-2eb4-2cca-91422fbfa328@redhat.com/
Thanks,
Suren.
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
2026-02-27 19:59 ` Mathieu Desnoyers
2026-02-27 20:41 ` Suren Baghdasaryan
@ 2026-03-01 15:49 ` Mathieu Desnoyers
2026-03-02 7:31 ` Florian Weimer
1 sibling, 1 reply; 23+ messages in thread
From: Mathieu Desnoyers @ 2026-03-01 15:49 UTC (permalink / raw)
To: André Almeida
Cc: kernel-dev, Liam R . Howlett, linux-api, Darren Hart,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Florian Weimer,
Torvald Riegel, Davidlohr Bueso, Lorenzo Stoakes, Rich Felker,
Carlos O'Donell, Michal Hocko, linux-kernel, libc-alpha,
Arnd Bergmann, Sebastian Andrzej Siewior
Hi André,
So it looks like I got a simpler idea on how to solve this at some
point between going to bed and waking up.
Let's extend the rseq system call. Here is how:
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index 863c4a00a66b..0592be0c3b32 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -86,6 +86,59 @@ struct rseq_slice_ctrl {
};
};
+/**
+ * rseq_rl_cs - Robust list unlock transaction descriptor
+ *
+ * rseq_rl_cs describes a transaction which begins with a successful
+ * robust mutex unlock followed by clearing a robust list pending ops.
+ *
+ * Userspace prepares for a robust_list unlock transaction by storing
+ * the address of a struct rseq_rl_cs descriptor into its per-thread
+ * rseq area rseq_rl_cs field. After the transaction is over, userspace
+ * clears the rseq_rl_cs pointer.
+ *
+ * A thread is considered to be within a rseq_rl_cs transaction if
+ * either of those conditions are true:
+ *
+ * - ip >= post_cond_store_ip && ip < post_success_ip && ll_sc_success(pt_regs)
+ * - ip >= post_success_ip && ip < post_clear_op_pending_ip
+ *
+ * If the kernel terminates a process within an active robust list
+ * unlock transaction, it should consider the robust list op pending
+ * as empty even if it contains an op pending address.
+ */
+struct rseq_rl_cs {
+ /* Version of this structure. */
+ __u32 version;
+ /* Reserved flags. */
+ __u32 flags;
+ /*
+ * Address immediately after store which unlocks the robust
+ * mutex. This store is usually implemented with an atomic
+ * exchange, or linked-load/store-conditional. In case it is
+ * implemented with ll/sc, the kernel needs to check whether the
+ * conditional store has succeeded with the appropriate registers
+ * or flags, as defined by the architecture ABI.
+ */
+ __u64 post_cond_store_ip;
+ /*
+ * For architectures implementing atomic exchange as ll/sc,
+ * a conditional branch is needed to handle failure.
+ * The unlock success IP is the address immediately after
+ * the conditional branch instruction after which the kernel
+ * can assume that the ll/sc has succeeded without checking
+ * registers or flags. For architectures where the mutex
+ * unlock store instruction cannot fail, this address is equal
+ * to post_cond_store_ip.
+ */
+ __u64 post_success_ip;
+ /*
+ * Address after the instruction which clears the op pending
+ * list. This store is the last instruction of this sequence.
+ */
+ __u64 post_clear_op_pending_ip;
+} __attribute__((aligned(4 * sizeof(__u64))));
+
/*
* struct rseq is aligned on 4 * 8 bytes to ensure it is always
* contained within a single cache-line.
@@ -180,6 +233,28 @@ struct rseq {
*/
struct rseq_slice_ctrl slice_ctrl;
+ /*
+ * Restartable sequences rseq_rl_cs field.
+ *
+ * Contains NULL when no robust list unlock transaction is
+ * active for the current thread, or holds a pointer to the
+ * currently active struct rseq_rl_cs.
+ *
+ * Updated by user-space, which sets the address of the currently
+ * active rseq_rl_cs at some point before the beginning of the
+ * transaction, and set to NULL by user-space at some point
+ * after the transaction has completed.
+ *
+ * Read by the kernel. Set by user-space with single-copy
+ * atomicity semantics. This field should only be updated by the
+ * thread which registered this data structure. Aligned on
+ * 64-bit.
+ *
+ * 32-bit architectures should update the low order bits of the
+ * rseq_rl_cs field, leaving the high order bits initialized to 0.
+ */
+ __u64 rseq_rl_cs;
+
/*
* Flexible array member at end of structure, after last feature field.
*/
Of course, we'd have to implement the whole transaction in assembler for each
architecture.
Feedback is welcome!
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
2026-03-01 15:49 ` Mathieu Desnoyers
@ 2026-03-02 7:31 ` Florian Weimer
2026-03-02 14:57 ` Mathieu Desnoyers
0 siblings, 1 reply; 23+ messages in thread
From: Florian Weimer @ 2026-03-02 7:31 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: André Almeida, kernel-dev, Liam R . Howlett, linux-api,
Darren Hart, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
Torvald Riegel, Davidlohr Bueso, Lorenzo Stoakes, Rich Felker,
Carlos O'Donell, Michal Hocko, linux-kernel, libc-alpha
* Mathieu Desnoyers:
> Of course, we'd have to implement the whole transaction in assembler
> for each architecture.
Could this be hidden in a vDSO call? It would have to receive a pointer
to the rseq area in addition to other arguments that identify the unlock
operation to be performed. The advantage is that the kernel would know
the addresses involved, so a single rseq flag should be sufficient. It
could also vary the LL/SC sequence based on architecture capabilities.
The question is whether we can model the unlock operation so that it's
sufficiently generic.
Thanks,
Florian
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
2026-03-02 7:31 ` Florian Weimer
@ 2026-03-02 14:57 ` Mathieu Desnoyers
2026-03-02 15:32 ` Florian Weimer
0 siblings, 1 reply; 23+ messages in thread
From: Mathieu Desnoyers @ 2026-03-02 14:57 UTC (permalink / raw)
To: Florian Weimer
Cc: André Almeida, kernel-dev, Liam R . Howlett, linux-api,
Darren Hart, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
Torvald Riegel, Davidlohr Bueso, Lorenzo Stoakes, Rich Felker,
Carlos O'Donell, Michal Hocko, linux-kernel,
libc-alpha@sourceware.org, Arnd Bergmann,
Sebastian Andrzej Siewior
On 2026-03-02 02:31, Florian Weimer wrote:
> * Mathieu Desnoyers:
>
>> Of course, we'd have to implement the whole transaction in assembler
>> for each architecture.
>
> Could this be hidden ina vDSO call?
Yes, good idea! I think this approach could work as well and reduce coupling
between kernel and userspace compared to the rseq_rl_cs approach. It's OK
as long as an extra function call on robust mutex unlock is not an issue
performance wise.
> It would have to receive a pointer
> to the rseq area in addition to other arguments that identify the unlock
> operation to be performed. The advantage is that the kernel would know
> the addresses involved, so a single rseq flag should be sufficient.
But if we implement the robust list unlock operation in a vDSO, if we
don't consider signal handlers nesting, then we would not even need a
rseq flag, right ?
Having this in a vDSO makes it so that the kernel knows when it's
terminating a process while it runs specific ranges of instruction
pointers within the vDSO. It even knows about the relevant registers
(e.g. ll/sc success) within specific instruction pointer ranges.
The remaining question is how to handle signal handlers which can
nest over vDSO. When this happens, we can end up terminating a process
while it is running within a signal handler which has been delivered on
top of the vDSO, so the topmost frame's instruction pointer points to
the signal handler code rather than the vDSO.
One possible approach to take care of this would be to add a robust list
pending ops clear on signal delivery. When a signal is delivered
on top of the robust list unlock vDSO range, *and* the mutex is known
to have been successfully unlocked, but the pending ops was not cleared
yet, the signal delivery could clear the pending ops before delivering
the signal.
> It
> could also vary the LL/SC sequence based on architecture capabilities.
Yes. It would be good for dynamically selecting between aarch64 LL/SC and
LSE atomics.
>
> The question is whether we can model the unlock operation so that it's
> sufficiently generic.
I suspect the IP ranges and associated store-conditional flags I identified
for the rseq_rl_cs approach are pretty much the key states we need to track.
Architectures which support atomic exchange instructions are even simpler.
We'd just have to keep track of this unlock operation's steps internally
between the kernel and the vDSO.
But you mentioned that rseq would be needed for a flag, so what am I
missing?
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
2026-03-02 14:57 ` Mathieu Desnoyers
@ 2026-03-02 15:32 ` Florian Weimer
2026-03-02 16:32 ` Mathieu Desnoyers
0 siblings, 1 reply; 23+ messages in thread
From: Florian Weimer @ 2026-03-02 15:32 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: André Almeida, kernel-dev, Liam R . Howlett, linux-api,
Darren Hart, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
Torvald Riegel, Davidlohr Bueso, Lorenzo Stoakes, Rich Felker,
Carlos O'Donell, Michal Hocko, linux-kernel,
libc-alpha@sourceware.org, Arnd Bergmann,
Sebastian Andrzej Siewior
* Mathieu Desnoyers:
> On 2026-03-02 02:31, Florian Weimer wrote:
>> * Mathieu Desnoyers:
>>
>>> Of course, we'd have to implement the whole transaction in assembler
>>> for each architecture.
>> Could this be hidden in a vDSO call?
>
> Yes, good idea! I think this approach could work as well and reduce coupling
> between kernel and userspace compared to the rseq_rl_cs approach. It's OK
> as long as an extra function call on robust mutex unlock is not an issue
> performance wise.
I don't have a performance concern there. It would be specific to
robust mutexes.
>> The question is whether we can model the unlock operation so that
>> it's sufficiently generic.
>
> I suspect the IP ranges and associated store-conditional flags I identified
> for the rseq_rl_cs approach are pretty much the key states we need to track.
> Architectures which support atomic exchange instructions are even simpler.
> We'd just have to keep track of this unlock operation's steps internally
> between the kernel and the vDSO.
If the unlock operation is in the vDSO, we need to parameterize it
somehow, regarding offsets, values written etc., so that it's not
specific to exactly one robust mutex implementation.
> But you mentioned that rseq would be needed for a flag, so what am I
> missing?
It's so that you don't have to figure out that the program counter is
somewhere in the special robust mutex unlock code every time a task gets
descheduled.
Thanks,
Florian
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
2026-03-02 15:32 ` Florian Weimer
@ 2026-03-02 16:32 ` Mathieu Desnoyers
2026-03-02 16:42 ` Florian Weimer
0 siblings, 1 reply; 23+ messages in thread
From: Mathieu Desnoyers @ 2026-03-02 16:32 UTC (permalink / raw)
To: Florian Weimer
Cc: André Almeida, kernel-dev, Liam R . Howlett, linux-api,
Darren Hart, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
Torvald Riegel, Davidlohr Bueso, Lorenzo Stoakes, Rich Felker,
Carlos O'Donell, Michal Hocko, linux-kernel,
libc-alpha@sourceware.org, Arnd Bergmann,
Sebastian Andrzej Siewior
On 2026-03-02 10:32, Florian Weimer wrote:
> * Mathieu Desnoyers:
>
>> On 2026-03-02 02:31, Florian Weimer wrote:
>>> * Mathieu Desnoyers:
>>>
>>>> Of course, we'd have to implement the whole transaction in assembler
>>>> for each architecture.
>>> Could this be hidden in a vDSO call?
>>
[...]
>> I suspect the IP ranges and associated store-conditional flags I identified
>> for the rseq_rl_cs approach are pretty much the key states we need to track.
>> Architectures which support atomic exchange instructions are even simpler.
>> We'd just have to keep track of this unlock operation's steps internally
>> between the kernel and the vDSO.
>
> If the unlock operation is in the vDSO, we need to parameterize it
> somehow, regarding offsets, values written etc., so that it's not
> specific to exactly one robust mutex implementation.
Agreed.
>
>> But you mentioned that rseq would be needed for a flag, so what am I
>> missing?
>
> It's so that you don't have to figure out that the program counter is
> somewhere in the special robust mutex unlock code every time a task gets
> descheduled.
AFAIU we don't need to evaluate this on context switch. We only need
to evaluate it at:
(a) Signal delivery,
(b) Process exit.
Also, the tradeoff here is not clear cut to me: the only thing the rseq
flag would prevent is comparisons of the instruction pointer against a
vDSO range at (a) and (b), which are not as performance critical as
context switches. I'm not sure it would warrant the added complexity of
the rseq flag, and coupling with rseq. Moreover, I'm not convinced that
loading an extra rseq flag field from userspace would be faster than
just comparing with a known range of vDSO addresses.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
2026-03-02 16:32 ` Mathieu Desnoyers
@ 2026-03-02 16:42 ` Florian Weimer
2026-03-02 16:56 ` Mathieu Desnoyers
0 siblings, 1 reply; 23+ messages in thread
From: Florian Weimer @ 2026-03-02 16:42 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: André Almeida, kernel-dev, Liam R . Howlett, linux-api,
Darren Hart, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
Torvald Riegel, Davidlohr Bueso, Lorenzo Stoakes, Rich Felker,
Carlos O'Donell, Michal Hocko, linux-kernel,
libc-alpha@sourceware.org, Arnd Bergmann,
Sebastian Andrzej Siewior
* Mathieu Desnoyers:
> On 2026-03-02 10:32, Florian Weimer wrote:
>> * Mathieu Desnoyers:
>>
>>> On 2026-03-02 02:31, Florian Weimer wrote:
>>>> * Mathieu Desnoyers:
>>>>
>>>>> Of course, we'd have to implement the whole transaction in assembler
>>>>> for each architecture.
>>>> Could this be hidden in a vDSO call?
>>>
> [...]
>>> I suspect the IP ranges and associated store-conditional flags I identified
>>> for the rseq_rl_cs approach are pretty much the key states we need to track.
>>> Architectures which support atomic exchange instructions are even simpler.
>>> We'd just have to keep track of this unlock operation's steps internally
>>> between the kernel and the vDSO.
>> If the unlock operation is in the vDSO, we need to parameterize it
>> somehow, regarding offsets, values written etc., so that it's not
>> specific to exactly one robust mutex implementation.
>
> Agreed.
>
>>
>>> But you mentioned that rseq would be needed for a flag, so what am I
>>> missing?
>> It's so that you don't have to figure out that the program counter is
>> somewhere in the special robust mutex unlock code every time a task gets
>> descheduled.
>
> AFAIU we don't need to evaluate this on context switch. We only need
> to evaluate it at:
>
> (a) Signal delivery,
> (b) Process exit.
Ah, missed that part. It changes the rules somewhat.
> Also, the tradeoff here is not clear cut to me: the only thing the rseq
> flag would prevent is comparisons of the instruction pointer against a
> vDSO range at (a) and (b), which are not as performance critical as
> context switches. I'm not sure it would warrant the added complexity of
> the rseq flag, and coupling with rseq. Moreover, I'm not convinced that
> loading an extra rseq flag field from userspace would be faster than
> just comparing with a known range of vDSO addresses.
It wouldn't work for the signal case anyway. That would need space in
rseq for some kind of write-ahead log of the operation before it's being
carried out, so that it can be completed on signal delivery/process
exit.
Thanks,
Florian
* Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
2026-03-02 16:42 ` Florian Weimer
@ 2026-03-02 16:56 ` Mathieu Desnoyers
0 siblings, 0 replies; 23+ messages in thread
From: Mathieu Desnoyers @ 2026-03-02 16:56 UTC (permalink / raw)
To: Florian Weimer
Cc: André Almeida, kernel-dev, Liam R . Howlett, linux-api,
Darren Hart, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
Torvald Riegel, Davidlohr Bueso, Lorenzo Stoakes, Rich Felker,
Carlos O'Donell, Michal Hocko, linux-kernel,
libc-alpha@sourceware.org, Arnd Bergmann,
Sebastian Andrzej Siewior
On 2026-03-02 11:42, Florian Weimer wrote:
> * Mathieu Desnoyers:
[...]
>> AFAIU we don't need to evaluate this on context switch. We only need
>> to evaluate it at:
>>
>> (a) Signal delivery,
>> (b) Process exit.
>
> Ah, missed that part. It changes the rules somewhat.
>
>> Also, the tradeoff here is not clear cut to me: the only thing the rseq
>> flag would prevent is comparisons of the instruction pointer against a
>> vDSO range at (a) and (b), which are not as performance critical as
>> context switches. I'm not sure it would warrant the added complexity of
>> the rseq flag, and coupling with rseq. Moreover, I'm not convinced that
>> loading an extra rseq flag field from userspace would be faster than
>> just comparing with a known range of vDSO addresses.
>
> It wouldn't work for the signal case anyway. That would need space in
> rseq for some kind of write-ahead log of the operation before it's being
> carried out, so that it can be completed on signal delivery/process
> exit.
The signal handler case can be dealt with by making sure we clear the
pending ops list on signal delivery. AFAIU with that in place we would
not need a write-ahead log. But even then, I don't think the rseq flag
would bring any benefit over simple vDSO instruction pointer ranges
comparisons.
Also the rseq flag set/clear cannot be done atomically with respect
to the mutex unlock (success) and pending ops clear state transitions,
so we'd need instruction pointer comparisons anyway.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH 1/2] futex: Create reproducer for robust_list race condition
2026-02-20 20:26 ` [RFC PATCH 1/2] futex: Create reproducer for robust_list race condition André Almeida
@ 2026-03-12 9:04 ` Sebastian Andrzej Siewior
2026-03-12 13:36 ` André Almeida
0 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-12 9:04 UTC (permalink / raw)
To: André Almeida
Cc: Carlos O'Donell, Peter Zijlstra, Florian Weimer, Rich Felker,
Torvald Riegel, Darren Hart, Thomas Gleixner, Ingo Molnar,
Davidlohr Bueso, Arnd Bergmann, Mathieu Desnoyers,
Liam R . Howlett, kernel-dev, linux-api, linux-kernel
On 2026-02-20 17:26:19 [-0300], André Almeida wrote:
> --- /dev/null
> +++ b/robust_bug.c
…
> + new->value = ((uint64_t) value << 32) + value;
> +
> + /* Create a backup of the current value */
> + original_val = new->value;
Now I think I finally got it and understood the issue.
You exit before unlocking the futex. You free this block and this new
memory (address) is the same as the old one. Your corruption comes from
the fact that the old content is the same as the new content.
If the thread does the unlock in userland (or kernel) but the lock remains
on the robust_list while it gets killed, then the kernel will attempt to
unlock the lock. But this requires that the futex value matches the dead
thread's ID.
So if it is unlocked (0x0) or used again, then nothing happens, unless the
new memory by accident gets assigned the same value as the TID. Then it
gets changed…
If the unlock did not happen and the lock is still owned by the thread
that is killed, then the "fixup" here is the right thing to do. The memory
should not be free()d, because the lock was still owned by the thread.
The misunderstanding here might be "once the thread is gone, the lock is
free, we can throw away the memory". At the very least, it was a locked
mutex and I think pthread_mutex_destroy() would complain here.
So is the issue here that the "new" value is the same as the "old" value
and the robust-death-handling part in the kernel does its job? Or did I
oversimplify something?
Let me continue with the thread…
Sebastian
* Re: [RFC PATCH 1/2] futex: Create reproducer for robust_list race condition
2026-03-12 9:04 ` Sebastian Andrzej Siewior
@ 2026-03-12 13:36 ` André Almeida
0 siblings, 0 replies; 23+ messages in thread
From: André Almeida @ 2026-03-12 13:36 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Carlos O'Donell, Peter Zijlstra, Florian Weimer, Rich Felker,
Torvald Riegel, Darren Hart, Thomas Gleixner, Ingo Molnar,
Davidlohr Bueso, Arnd Bergmann, Mathieu Desnoyers,
Liam R . Howlett, kernel-dev, linux-api, linux-kernel
On 12/03/2026 06:04, Sebastian Andrzej Siewior wrote:
> On 2026-02-20 17:26:19 [-0300], André Almeida wrote:
>> --- /dev/null
>> +++ b/robust_bug.c
> …
>> + new->value = ((uint64_t) value << 32) + value;
>> +
>> + /* Create a backup of the current value */
>> + original_val = new->value;
>
> Now that I finally got it and I might have understood the issue.
>
> You exit before unlocking the futex. You free this block and this new
> memory (address) is the same as the old one. Your corruption comes from
> the fact that the old content is the same as the new content.
>
> If the thread does unlock in userland (or kernel) but the lock remains
> on the robust_list while it gets killed then the kernel will attempt to
> unlock the lock. But this requires that the futex value matches the
> value.
> So if it is unlocked (0x0) or used again then nothing happens. Unless
> the new memory gets the same value assigned as the pid value by
> accident. Then it gets changed…
>
> If the unlock did not happen and is still owned by the thread, that is
> killed, then the "fixup" here is the right thing to do. The memory
> should not be free()ed because the lock was still owned by the thread.
> The misunderstanding here might be "once the thread is gone, the lock is
> free we can throw away the memory". At the very least, it was a locked
> mutex and I think pthread_mutex_destroy() would complain here.
>
> So is the issue here that the "new" value is the same as the "old" value
> and the robust-death-handle part in the kernel does its job? Or did I
> over simplify something?
> Let me continue with the thread…
>
Yes, this is exactly what I understood as well.
User thread A releases the lock, but exits before setting op_pending =
NULL. Thread B can free the lock after using it, and by chance stores the
same value as the TID in the same memory. Then thread A's robust list
handling runs inside the kernel and the corruption happens.
end of thread, other threads: [~2026-03-12 13:37 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-20 20:26 [RFC PATCH 0/2] futex: how to solve the robust_list race condition? André Almeida
2026-02-20 20:26 ` [RFC PATCH 1/2] futex: Create reproducer for robust_list race condition André Almeida
2026-03-12 9:04 ` Sebastian Andrzej Siewior
2026-03-12 13:36 ` André Almeida
2026-02-20 20:26 ` [RFC PATCH 2/2] futex: hack: Add debug delays André Almeida
2026-02-20 20:51 ` [RFC PATCH 0/2] futex: how to solve the robust_list race condition? Liam R. Howlett
2026-02-27 19:15 ` André Almeida
2026-02-20 21:42 ` Mathieu Desnoyers
2026-02-20 22:41 ` Mathieu Desnoyers
2026-02-20 23:17 ` Mathieu Desnoyers
2026-02-23 11:13 ` Florian Weimer
2026-02-23 13:37 ` Mathieu Desnoyers
2026-02-23 13:47 ` Rich Felker
2026-02-27 19:16 ` André Almeida
2026-02-27 19:59 ` Mathieu Desnoyers
2026-02-27 20:41 ` Suren Baghdasaryan
2026-03-01 15:49 ` Mathieu Desnoyers
2026-03-02 7:31 ` Florian Weimer
2026-03-02 14:57 ` Mathieu Desnoyers
2026-03-02 15:32 ` Florian Weimer
2026-03-02 16:32 ` Mathieu Desnoyers
2026-03-02 16:42 ` Florian Weimer
2026-03-02 16:56 ` Mathieu Desnoyers