From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: "André Almeida" <andrealmeid@igalia.com>,
"Carlos O'Donell" <carlos@redhat.com>,
"Sebastian Andrzej Siewior" <bigeasy@linutronix.de>,
"Peter Zijlstra" <peterz@infradead.org>,
"Florian Weimer" <fweimer@redhat.com>,
"Rich Felker" <dalias@aerifal.cx>,
"Torvald Riegel" <triegel@redhat.com>,
"Darren Hart" <dvhart@infradead.org>,
"Thomas Gleixner" <tglx@kernel.org>,
"Ingo Molnar" <mingo@redhat.com>,
"Davidlohr Bueso" <dave@stgolabs.net>,
"Arnd Bergmann" <arnd@arndb.de>,
"Liam R . Howlett" <Liam.Howlett@oracle.com>,
"Lorenzo Stoakes" <lorenzo.stoakes@oracle.com>,
"Michal Hocko" <mhocko@suse.com>
Cc: kernel-dev@igalia.com, linux-api@vger.kernel.org,
linux-kernel@vger.kernel.org,
libc-alpha <libc-alpha@sourceware.org>
Subject: Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?
Date: Fri, 20 Feb 2026 18:17:36 -0500 [thread overview]
Message-ID: <a1e24288-6ffc-438d-8a2a-d152134c9555@efficios.com> (raw)
In-Reply-To: <67be0aa1-2241-43ef-aa01-a89ced26c8f6@efficios.com>
On 2026-02-20 17:41, Mathieu Desnoyers wrote:
> On 2026-02-20 16:42, Mathieu Desnoyers wrote:
>> +CC libc-alpha.
>>
>> On 2026-02-20 15:26, André Almeida wrote:
>>> During LPC 2025, I presented a session about creating a new syscall for
>>> robust_list[0][1]. However, most of the session discussion wasn't so
>>> much about the new syscall itself as about an old bug that exists in
>>> the current robust_list mechanism.
>>>
>>> Since at least 2012, there's an open bug reporting a race condition, as
>>> Carlos O'Donell pointed out:
>>>
>>> "File corruption race condition in robust mutex unlocking"
>>> https://sourceware.org/bugzilla/show_bug.cgi?id=14485
>>>
>>> To help understand the bug, I've created a reproducer (patch 1/2) and a
>>> companion kernel hack (patch 2/2) that helps to make the race condition
>>> more likely. When the bug happens, the reproducer shows a message
>>> comparing the original memory with the corrupted one:
>>>
>>> "Memory was corrupted by the kernel: 8001fe8d8001fe8d vs
>>> 8001fe8dc0000000"
>>>
>>> I'm not sure yet what the appropriate approach to fix it would be, so I
>>> decided to reach out to the community before moving forward in some
>>> direction. One suggestion from Peter[2] revolves around serializing
>>> mmap() and the robust list exit path, which might add overhead to the
>>> common case, where list_op_pending is empty.
>>>
>>> However, given that there's a new interface being prepared, this could
>>> also be an opportunity to rethink how list_op_pending works, and get
>>> rid of the race condition by design.
>>>
>>> Feedback is very much welcome.
>>
>> Looking at this bug, one thing I'm starting to consider is that it
>> appears to be an issue inherent to the lack of synchronization between
>> pthread_mutex_destroy(3) and the per-thread list_op_pending fields,
>> and not so much a kernel issue.
>>
>> Here is why I think the issue is purely userspace:
>>
>> Let's suppose we have a shared memory area mapped by both Process 1 and
>> Process 2, which internally has its own custom userspace memory
>> allocator to allocate/free space within that shared memory.
>>
>> Process 1, Thread A stumbles through the scenario highlighted by this
>> bug, and basically gets preempted at this FIXME in libc
>> __pthread_mutex_unlock_full():
>>
>>       if (__glibc_unlikely ((atomic_exchange_release (&mutex->__data.__lock, 0)
>>                              & FUTEX_WAITERS) != 0))
>>         futex_wake ((unsigned int *) &mutex->__data.__lock, 1, private);
>>
>>       /* We must clear op_pending after we release the mutex.
>>          FIXME However, this violates the mutex destruction requirements
>>          because another thread could acquire the mutex, destroy it, and
>>          reuse the memory for something else; then, if this thread crashes,
>>          and the memory happens to have a value equal to the TID, the kernel
>>          will believe it is still related to the mutex (which has been
>>          destroyed already) and will modify some other random object.  */
>>       __asm ("" ::: "memory");
>>       THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
>>
>> Then Process 1, Thread B runs, grabs the lock, releases it, and based on
>> program state it knows it can pthread_mutex_destroy() this lock, free its
>> associated memory through the custom shared memory allocator, and
>> allocate it for other purposes. Then we get to the point where Process 1
>> is killed, and where the robust futex kernel code corrupts data in shared
>> memory because of the dangling list_op_pending pointer.
>>
>> That shared memory data is still observable by Process 2, which will get
>> a corrupted state.
>>
>> Notice how this all happens without any munmap(2)/mmap(2) in the
>> sequence? This is why I think this is purely a userspace issue rather
>> than an issue we can solve by adding extra synchronization in the kernel.
>>
>> The one point we have in that sequence where I think we can add
>> synchronization is pthread_mutex_destroy(3) in libc. One possible
>> "big hammer" solution would be to make pthread_mutex_destroy iterate
>> over all other threads' list_op_pending and busy-wait if it finds that
>> the mutex address is in use. It would of course only have to do that
>> for robust futexes.
>>
>> If that big hammer solution is not fast enough for many-threaded
>> use-cases, then we can think of other approaches such as adding a
>> reference counter in the mutex structure, or introducing hazard pointers
>> in userspace to reduce synchronization iteration from nr_threads to
>> nr_cpus (or even down to max rseq mm_cid).
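
For concreteness, the "big hammer" iteration could look roughly like the
sketch below. glibc has no thread list exposed under this exact shape, so
__pthread_threads, the t->next link and the helper name are purely
illustrative:

  /* Hypothetical sketch of a "big hammer" helper for
     pthread_mutex_destroy(): busy-wait until no sibling thread has
     @mutex recorded in its list_op_pending.  The thread list traversal
     (__pthread_threads, t->next) is illustrative only.  */
  static void
  wait_for_pending_robust_ops (pthread_mutex_t *mutex)
  {
    for (;;)
      {
        bool pending = false;

        for (struct pthread *t = __pthread_threads; t != NULL; t = t->next)
          if (atomic_load_relaxed (&t->robust_head.list_op_pending)
              == (void *) &mutex->__data.__list)
            {
              pending = true;
              break;
            }

        if (!pending)
          return;

        /* The owner is between releasing the lock and clearing
           list_op_pending in __pthread_mutex_unlock_full(); give it a
           chance to finish.  */
        sched_yield ();
      }
  }

The spin window is bounded by the few instructions between the lock
release and the list_op_pending clear, so the busy-wait should be short
in practice.
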
>
> To make matters even worse, the pthread_mutex_destroy(3) and reallocation
> could happen from Process 2 rather than Process 1. So iterating over the
> threads of Process 1 is not sufficient. We'd need to synchronize
> pthread_mutex_destroy on something within the mutex structure which is
> observable from all processes using the lock, for instance a reference
> count.
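
As a strawman for that reference count, something along the following lines
could work, with the counter living in the mutex memory itself so that all
processes mapping the lock can observe it. The __pending_refs field is
invented, and finding room for it in the existing pthread_mutex_t ABI is an
open question:

  /* Hypothetical: in __pthread_mutex_unlock_full(), take a reference
     before releasing the lock and drop it only after list_op_pending
     has been cleared.  The counter lives in the (shared) mutex itself.  */
  atomic_fetch_add_acquire (&mutex->__data.__pending_refs, 1);
  /* ... release the lock, futex_wake() waiters ... */
  THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
  atomic_fetch_sub_release (&mutex->__data.__pending_refs, 1);

  /* Hypothetical: in pthread_mutex_destroy(), for robust mutexes only,
     wait until no unlock is in flight in any process.  */
  while (atomic_load_acquire (&mutex->__data.__pending_refs) != 0)
    sched_yield ();

The obvious hole is a thread dying inside that window, which would leave
the counter elevated; that is exactly the crash case the robust list is
meant to handle, so the exit path (or the kernel) would need to repair it.
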
Trying to find a backward-compatible way to solve this may be tricky.

Here is one possible approach I have in mind: introduce a new syscall,
e.g.:

  sys_cleanup_robust_list(void *addr)

This system call would be invoked on pthread_mutex_destroy(3) of
robust mutexes, and do the following:

- Calculate the offset of @addr within its mapping,
- Iterate over all processes which map the backing store containing
  the lock address @addr,
- Iterate over each sibling thread within each of those processes,
- If a thread has a robust list, and its list_op_pending points
  to the same offset within the backing store mapping, clear the
  list_op_pending pointer.
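
In kernel terms, this would look roughly like the pseudocode below. None of
the helpers exist under these names, and the costly part (walking every mm
that maps the backing store) is hidden behind an rmap-style iterator; this
only shows the intended shape of the operation:

  /* Rough pseudocode only; the syscall and every helper named below
     are hypothetical.  */
  SYSCALL_DEFINE1(cleanup_robust_list, void __user *, addr)
  {
          struct backing_store *bs;
          unsigned long offset;
          struct mm_struct *mm;
          struct task_struct *t;

          /* Resolve @addr to its backing object + offset in the caller.  */
          if (lookup_backing_store(current->mm, addr, &bs, &offset))
                  return -EFAULT;

          /* rmap-style walk of every address space mapping that object.  */
          for_each_mm_mapping(bs, mm) {
                  for_each_thread_of_mm(mm, t) {
                          void __user *pending;

                          if (!t->robust_list)
                                  continue;
                          if (read_list_op_pending(t, &pending))
                                  continue;
                          /* Same backing object + offset as @addr?  */
                          if (pending_matches(mm, pending, bs, offset))
                                  clear_list_op_pending(t, pending);
                  }
          }
          return 0;
  }
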
The overhead would be added specifically to pthread_mutex_destroy(3),
and only for robust mutexes.
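
On the glibc side the call site would be small, and only robust mutexes
would pay the cost; the syscall number is of course made up here:

  /* Hypothetical call in pthread_mutex_destroy();
     __NR_cleanup_robust_list does not exist today.  Pass the embedded
     robust-list node, since that is the address list_op_pending holds.  */
  if (mutex->__data.__kind & PTHREAD_MUTEX_ROBUST_NORMAL_NP)
    (void) syscall (__NR_cleanup_robust_list, &mutex->__data.__list);
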
Thoughts?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com