From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: "André Almeida" <andrealmeid@igalia.com>,
"Carlos O'Donell" <carlos@redhat.com>,
"Sebastian Andrzej Siewior" <bigeasy@linutronix.de>,
"Peter Zijlstra" <peterz@infradead.org>,
"Florian Weimer" <fweimer@redhat.com>,
"Rich Felker" <dalias@aerifal.cx>,
"Torvald Riegel" <triegel@redhat.com>,
"Darren Hart" <dvhart@infradead.org>,
"Thomas Gleixner" <tglx@kernel.org>,
"Ingo Molnar" <mingo@redhat.com>,
"Davidlohr Bueso" <dave@stgolabs.net>,
"Arnd Bergmann" <arnd@arndb.de>,
"Liam R . Howlett" <Liam.Howlett@oracle.com>,
"Lorenzo Stoakes" <lorenzo.stoakes@oracle.com>,
"Michal Hocko" <mhocko@suse.com>
Cc: kernel-dev@igalia.com, linux-api@vger.kernel.org,
linux-kernel@vger.kernel.org,
libc-alpha <libc-alpha@sourceware.org>
Subject: Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
Date: Fri, 20 Feb 2026 18:17:36 -0500 [thread overview]
Message-ID: <a1e24288-6ffc-438d-8a2a-d152134c9555@efficios.com> (raw)
In-Reply-To: <67be0aa1-2241-43ef-aa01-a89ced26c8f6@efficios.com>
On 2026-02-20 17:41, Mathieu Desnoyers wrote:
> On 2026-02-20 16:42, Mathieu Desnoyers wrote:
>> +CC libc-alpha.
>>
>> On 2026-02-20 15:26, André Almeida wrote:
>>> During LPC 2025, I presented a session about creating a new syscall for
>>> robust_list[0][1]. However, most of the session discussion wasn't so
>>> much about the new syscall itself as about an old bug that exists in
>>> the current robust_list mechanism.
>>>
>>> Since at least 2012, there's an open bug reporting a race condition, as
>>> Carlos O'Donell pointed out:
>>>
>>> "File corruption race condition in robust mutex unlocking"
>>> https://sourceware.org/bugzilla/show_bug.cgi?id=14485
>>>
>>> To help understand the bug, I've created a reproducer (patch 1/2) and a
>>> companion kernel hack (patch 2/2) that helps to make the race condition
>>> more likely. When the bug happens, the reproducer shows a message
>>> comparing the original memory with the corrupted one:
>>>
>>> "Memory was corrupted by the kernel: 8001fe8d8001fe8d vs
>>> 8001fe8dc0000000"
>>>
>>> I'm not sure yet what the appropriate approach to fix it would be, so I
>>> decided to reach out to the community before moving forward in some
>>> direction. One suggestion from Peter[2] revolves around serializing
>>> mmap() and the robust list exit path, which might add overhead to the
>>> common case, where list_op_pending is empty.
>>>
>>> However, given that there's a new interface being prepared, this could
>>> also be an opportunity to rethink how list_op_pending works, and get
>>> rid of the race condition by design.
>>>
>>> Feedback is very much welcome.
>>
>> Looking at this bug, one thing I'm starting to consider is that it
>> appears to be an issue inherent to the lack of synchronization between
>> pthread_mutex_destroy(3) and the per-thread list_op_pending fields,
>> and not so much a kernel issue.
>>
>> Here is why I think the issue is purely userspace:
>>
>> Let's suppose we have a shared memory area mapped by both Process 1 and
>> Process 2, which internally has its own custom userspace memory
>> allocator to allocate/free space within that shared memory.
>>
>> Process 1, Thread A stumbles through the scenario highlighted by this
>> bug, and basically gets preempted at this FIXME in libc
>> __pthread_mutex_unlock_full():
>>
>>       if (__glibc_unlikely ((atomic_exchange_release (&mutex->__data.__lock, 0)
>>                              & FUTEX_WAITERS) != 0))
>>         futex_wake ((unsigned int *) &mutex->__data.__lock, 1, private);
>>
>>       /* We must clear op_pending after we release the mutex.
>>          FIXME However, this violates the mutex destruction requirements
>>          because another thread could acquire the mutex, destroy it, and
>>          reuse the memory for something else; then, if this thread crashes,
>>          and the memory happens to have a value equal to the TID, the kernel
>>          will believe it is still related to the mutex (which has been
>>          destroyed already) and will modify some other random object.  */
>>       __asm ("" ::: "memory");
>>       THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
>>
>> Then Process 1, Thread B runs, grabs the lock, releases it, and based on
>> program state it knows it can pthread_mutex_destroy() this lock, free its
>> associated memory through the custom shared memory allocator, and
>> allocate it for other purposes. Then we get to the point where Process 1
>> is killed, and where the robust futex kernel code corrupts data in shared
>> memory because of the dangling list_op_pending pointer.
>>
>> That shared memory data is still observable by Process 2, which will get
>> a corrupted state.
>>
>> Notice how this all happens without any munmap(2)/mmap(2) in the
>> sequence? This is why I think this is purely a userspace issue rather
>> than an issue we can solve by adding extra synchronization in the kernel.
>>
>> The one point we have in that sequence where I think we can add
>> synchronization is pthread_mutex_destroy(3) in libc. One possible
>> "big hammer" solution would be to make pthread_mutex_destroy iterate
>> over all other threads' list_op_pending and busy-wait if it finds that
>> the mutex address is in use. It would of course only have to do that
>> for robust futexes.
>>
>> If that big hammer solution is not fast enough for many-threaded
>> use-cases, then we can think of other approaches such as adding a
>> reference counter in the mutex structure, or introducing hazard pointers
>> in userspace to reduce synchronization iteration from nr_threads to
>> nr_cpus (or even down to max rseq mm_cid).
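
For concreteness, the "big hammer" iteration could look roughly like the
sketch below. glibc has no thread list exposed under this exact shape, so
__pthread_threads, the t->next link and the helper name are purely
illustrative:

  /* Hypothetical sketch of a "big hammer" helper for
     pthread_mutex_destroy(): busy-wait until no sibling thread has
     @mutex recorded in its list_op_pending.  The thread list traversal
     (__pthread_threads, t->next) is illustrative only.  */
  static void
  wait_for_pending_robust_ops (pthread_mutex_t *mutex)
  {
    for (;;)
      {
        bool pending = false;

        for (struct pthread *t = __pthread_threads; t != NULL; t = t->next)
          if (atomic_load_relaxed (&t->robust_head.list_op_pending)
              == (void *) &mutex->__data.__list)
            {
              pending = true;
              break;
            }

        if (!pending)
          return;

        /* The owner is between releasing the lock and clearing
           list_op_pending in __pthread_mutex_unlock_full(); give it a
           chance to finish.  */
        sched_yield ();
      }
  }

The spin window is bounded by the few instructions between the lock
release and the list_op_pending clear, so the busy-wait should be short
in practice.
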
>
> To make matters even worse, the pthread_mutex_destroy(3) and reallocation
> could happen from Process 2 rather than Process 1. So iterating over the
> threads of Process 1 is not sufficient. We'd need to synchronize
> pthread_mutex_destroy on something within the mutex structure which is
> observable from all processes using the lock, for instance a reference
> count.
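
As a strawman for that reference count, something along the following lines
could work, with the counter living in the mutex memory itself so that all
processes mapping the lock can observe it. The __pending_refs field is
invented, and finding room for it in the existing pthread_mutex_t ABI is an
open question:

  /* Hypothetical: in __pthread_mutex_unlock_full(), take a reference
     before releasing the lock and drop it only after list_op_pending
     has been cleared.  The counter lives in the (shared) mutex itself.  */
  atomic_fetch_add_acquire (&mutex->__data.__pending_refs, 1);
  /* ... release the lock, futex_wake() waiters ... */
  THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
  atomic_fetch_sub_release (&mutex->__data.__pending_refs, 1);

  /* Hypothetical: in pthread_mutex_destroy(), for robust mutexes only,
     wait until no unlock is in flight in any process.  */
  while (atomic_load_acquire (&mutex->__data.__pending_refs) != 0)
    sched_yield ();

The obvious hole is a thread dying inside that window, which would leave
the counter elevated; that is exactly the crash case the robust list is
meant to handle, so the exit path (or the kernel) would need to repair it.
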
Trying to find a backward-compatible way to solve this may be tricky.

Here is one possible approach I have in mind: introduce a new syscall,
e.g.:

  sys_cleanup_robust_list(void *addr)

This system call would be invoked on pthread_mutex_destroy(3) of
robust mutexes, and do the following:

- Calculate the offset of @addr within its mapping,
- Iterate over all processes which map the backing store containing
  the lock address @addr,
- Iterate over each sibling thread within each of those processes,
- If a thread has a robust list, and its list_op_pending points
  to the same offset within the backing store mapping, clear the
  list_op_pending pointer.
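
In kernel terms, this would look roughly like the pseudocode below. None of
the helpers exist under these names, and the costly part (walking every mm
that maps the backing store) is hidden behind an rmap-style iterator; this
only shows the intended shape of the operation:

  /* Rough pseudocode only; the syscall and every helper named below
     are hypothetical.  */
  SYSCALL_DEFINE1(cleanup_robust_list, void __user *, addr)
  {
          struct backing_store *bs;
          unsigned long offset;
          struct mm_struct *mm;
          struct task_struct *t;

          /* Resolve @addr to its backing object + offset in the caller.  */
          if (lookup_backing_store(current->mm, addr, &bs, &offset))
                  return -EFAULT;

          /* rmap-style walk of every address space mapping that object.  */
          for_each_mm_mapping(bs, mm) {
                  for_each_thread_of_mm(mm, t) {
                          void __user *pending;

                          if (!t->robust_list)
                                  continue;
                          if (read_list_op_pending(t, &pending))
                                  continue;
                          /* Same backing object + offset as @addr?  */
                          if (pending_matches(mm, pending, bs, offset))
                                  clear_list_op_pending(t, pending);
                  }
          }
          return 0;
  }
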
The overhead would be added specifically to pthread_mutex_destroy(3),
and only for robust mutexes.
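
On the glibc side the call site would be small, and only robust mutexes
would pay the cost; the syscall number is of course made up here:

  /* Hypothetical call in pthread_mutex_destroy();
     __NR_cleanup_robust_list does not exist today.  Pass the embedded
     robust-list node, since that is the address list_op_pending holds.  */
  if (mutex->__data.__kind & PTHREAD_MUTEX_ROBUST_NORMAL_NP)
    (void) syscall (__NR_cleanup_robust_list, &mutex->__data.__list);
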
Thoughts?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com