From: Thomas Gleixner <tglx@kernel.org>
To: LKML <linux-kernel@vger.kernel.org>
Cc: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>,
"André Almeida" <andrealmeid@igalia.com>,
"Sebastian Andrzej Siewior" <bigeasy@linutronix.de>,
"Carlos O'Donell" <carlos@redhat.com>,
"Peter Zijlstra" <peterz@infradead.org>,
"Florian Weimer" <fweimer@redhat.com>,
"Rich Felker" <dalias@aerifal.cx>,
"Torvald Riegel" <triegel@redhat.com>,
"Darren Hart" <dvhart@infradead.org>,
"Ingo Molnar" <mingo@kernel.org>,
"Davidlohr Bueso" <dave@stgolabs.net>,
"Arnd Bergmann" <arnd@arndb.de>,
"Liam R . Howlett" <Liam.Howlett@oracle.com>
Subject: [patch 0/8] futex: Address the robust futex unlock race for real
Date: Mon, 16 Mar 2026 18:12:50 +0100
Message-ID: <20260316162316.356674433@kernel.org>
The robust futex unlock mechanism is racy with respect to the clearing of the
robust_list_head::list_op_pending pointer as described here:
https://sourceware.org/bugzilla/show_bug.cgi?id=14485
This issue has been ignored for almost 15 years and was brought up
again recently by André:
https://lore.kernel.org/lkml/20260220202620.139584-1-andrealmeid@igalia.com/
Florian asked for this to be addressed in a VDSO function and a couple of
days ago Mathieu proposed a fix:
https://lore.kernel.org/20260311185409.1988269-1-mathieu.desnoyers@efficios.com
While the idea of using a VDSO unlock helper function along with kernel
side fixups is correct, the proposed solution is incomplete and
untested. Apart from that, it just glues the fix somehow into the existing
code without taking the larger picture into account.
So I sat down and did a complete analysis. The problem is in the unlock
sequences of non-PI and PI futexes.
non-PI futexes do:

  1) robust_list_set_op_pending(mutex);
  2) robust_list_remove(mutex);
     lval = 0;
  3) lval = atomic_xchg(lock, lval);
  4) if (lval & FUTEX_WAITERS)
  5)         sys_futex(WAKE, ...);
  6) robust_list_clear_op_pending();
PI futexes do:

  1) robust_list_set_op_pending(mutex);
  2) robust_list_remove(mutex);
     lval = gettid();
  3) if (!atomic_try_cmpxchg(lock, lval, 0))
  4)         sys_futex(UNLOCK_PI, ...);
  5) robust_list_clear_op_pending();
In both cases the late clearing of the robust list pending op pointer
(#6/#5) can cause the following problem:
  Task 1                                  Task 2

  robust_list_set_op_pending(mutex);
  unlock(mutex);
  --> interrupt which
      forces T1 to exit.
                                          lock(mutex);    // Non-contended
                                          // Last user of the mapping
                                          unlock(mutex);
                                          destroy(mutex);
                                          unmap();
                                          map(some other file);
  do_exit()
    handle_robust_list()
      handle_pending_op()
        val = mutex->lock;
        if (val == TID)
                mutex->lock = val | FUTEX_OWNER_DIED;
This causes memory corruption in the new mapping when:
1) the new mapping covers the address of mutex->lock
2) the memory contains the TID of the exiting task
Unlikely to happen, but possible.
It does not matter whether the unlock happens in the kernel or whether
there are waiters or not as in all scenarios T2 can observe being the last
user _before_ T1 manages to clear the pending op pointer.
So the first question to ask is why the unlock sequences of non-PI and PI
futexes are different.
The unlock sequence of non-PI predates both PI and robust futexes. The
robust handling was added around it. PI must use try_cmpxchg() and cannot
unlock a contended futex in user space as that would create inconsistent
PI state in the kernel.
But technically there is no requirement to keep the non-PI unlock sequence
as it is. It can do the same as PI and let the kernel unlock the futex in
the contended case. It has to go into the kernel anyway when there are
waiters. That functionality does not exist in the kernel, but it is trivial
to add.
With that the unlock sequence of non-PI and PI becomes identical and can be
easily extended to let the kernel do the clearing of the pending op pointer
as well. So that becomes:
  robust_list_set_op_pending(mutex);
  lval = gettid();
  if (atomic_try_cmpxchg(lock, lval, 0))
          robust_list_clear_op_pending();
  else
          sys_futex($OP | FUTEX_ROBUST_UNLOCK, ..., op_pointer);
which reduces the problem space significantly. The remaining issue is the
non-contended unlock case, which can be handled with a VDSO unlock sequence
and kernel side fixup.
But that op pointer has an issue when COMPAT is enabled. To support
mixed-mode futexes for gaming emulators the kernel allows a 64-bit
application to register both a 64-bit native robust list head and a 32-bit
compat robust list head. That means that on COMPAT kernels the size of the
op pointer is ambiguous.
That can be solved by marking the pointer size in the pointer itself,
similar to the way the robust list entries mark PI futexes in bit 0. If
the bit is set, the pointer has to be treated as a 32-bit pointer. If not,
it uses the native size (64-bit). This is required for several reasons:
 1) sys_futex() has no compat variant

 2) The gaming emulators use both 64-bit and compat 32-bit robust
    lists in the same 64-bit application

 3) Having the pointer handed in spares the evaluation of the registered
    robust lists to figure out whether the futex address matches the
    registered [compat_]robust_list_head::list_op_pending pointer,
    which would become even worse once the multi robust list changes
    André is working on get merged.
As a consequence 32-bit applications have to set this bit unconditionally
so they can run on a 64-bit kernel in compat mode unmodified. 32-bit
kernels return an error code when the bit is not set. 64-bit kernels will
happily clear the full 64 bits if user space fails to set it.
Arguably this could be done with different functions and more flags, but
there is no benefit in that.
With that addressed there is "only" the non-contended case left.
The important question is when this has to be dealt with. It's only
relevant when:
1) the unlocking task was interrupted in the unlock sequence between
successful unlock and the store which clears the pointer.
2) the unlocking task needs to handle a signal on return from interrupt
to user space
The proper place to check for this is therefore right before the invocation
of arch_do_signal_or_restart() in __exit_to_user_mode_loop().
That needs to check whether the task was interrupted and the instruction
pointer is within the critical section. On x86 (COMPAT=n) this critical
section boils down to:
	mov	%esi,%eax		// Load TID into EAX
	xor	%ecx,%ecx		// Set ECX to 0
	lock	cmpxchg %ecx,(%rdi)	// Try the TID -> 0 transition
  .Lstart:
	jnz	.Lend
	movq	$0x0,(%rdx)		// Clear list_op_pending
  .Lend:
I.e. the critical section is: IP >= .Lstart && IP < .Lend. That's a very
lightweight range check which won't affect signal-heavy applications in a
noticeable way.
If the range check observes IP inside the critical section, it invokes the
fixup function, which determines whether the pointer has to be cleared. In
the above x86 case this just has to evaluate regs->flags & X86_EFLAGS_ZF.
If ZF is set, then the pointer must be cleared. The pointer address is
retrieved from regs->dx in this case.
The COMPAT case with different sizes is slightly more complex as the size
check has to be done in the success path by testing bit 0. That obviously
modifies ZF too, so the fixup has to check whether IP is within the success
region.
	mov	%esi,%eax		// Load TID into EAX
	mov	%rdx,%rsi		// Get the op pointer
	xor	%ecx,%ecx		// Set ECX to 0
	and	$0xfffffffffffffffe,%rsi // Clear the size bit
	lock	cmpxchg %ecx,(%rdi)	// Try the TID -> 0 transition
  .Lstart:
	jnz	.Lend
  .Lsuccess:
	testb	$0x1,%dl		// Test the size bit
	jz	.Lop64			// Not set: 64-bit
	movl	$0x0,(%rsi)		// Clear 32-bit
	jmp	.Lend
  .Lop64:
	movq	$0x0,(%rsi)		// Clear 64-bit
  .Lend:
	ret
Again it could be argued to do this with two different fixup functions, but
there is no point:

 1) User space would need to handle the two variants instead of just
    relying on a bit which can be stored in the mutex at initialization
    time.

 2) It becomes pointlessly different from the syscall op pointer handling.

 3) The fixup decision function would then have to evaluate which code
    path is used. That just adds more symbols and range checking for no
    real value.
Note that the fixup is done when a signal is pending even if no signal is
delivered. That's required because fatal signals never reach the actual
signal delivery code where e.g. RSEQ is handled; they never return from
get_signal(). Doing the fixup at that point is harmless and the only
downside is a redundant store if the task continues. The pending op
pointer is task local, so it does not matter whether it's nullified twice
or not.
The remaining question is how to do the range checking. Mathieu's proposal
[ab]used the exception table for the range check. That's not really the
correct approach because:
1) It is not an exception handler
2) It is a fixup similar to RSEQ critical section fixups
So the right approach is to use a dedicated mechanism and not to glue it
into something which has a completely different purpose. That just creates
complexity and overhead.
The critical section labels can be made visible in the VDSO and the VDSO
image generator provides the offsets in the image struct to the kernel. The
VDSO [re]map() evaluates them and puts them into mm_struct::futex::XXX.
Arguably the range check could chase all of that down via
mm->context->vdso->image, but that touches way more cache lines and
requires architecture specific range checking functions. Letting the
architecture specific VDSO [re]map() code store the information in
mm_struct::futex reduces the cache line impact and makes the range check
generic.
The series implements all of the above and passes the simple user space
test case appended below, compiled for both 32-bit and 64-bit and tested on
a COMPAT enabled 64-bit kernel. A 32-bit kernel should just work, but has
not been tested yet.
The fixup mechanism is easy to validate. Run the program in gdb and set a
breakpoint at __vdso_futex_robust_try_unlock_cs_start or at
__vdso_futex_robust_try_unlock_cs_success.
Hitting the breakpoint causes the fixup to run because gdb/ptrace uses
signals to control the tracee. So the op pointer must have been cleared if
the cmpxchg() succeeded:
(gdb) b __vdso_futex_robust_try_unlock_cs_start
(gdb) run
Testing non contended unlock
Breakpoint 1, 0x00007ffff7fc67a3 in __vdso_futex_robust_try_unlock ()
(gdb) info r
rdx 0x7ffff7dc1a70 140737351785072
rsi 0x7ffff7dc1a70 140737351785072
rdi 0x7fffffffea90 140737488349840
eflags 0x246 [ PF ZF IF ]
(gdb) x/xg 0x7ffff7dc1a70
0x7ffff7dc1a70: 0x0000000000000000
ZF is set, which indicates that the cmpxchg() succeeded. The fixup is
"premature" but as explained above that is harmless. The success path
writing 0 again is just a redundant store with no side effects as the
pointer is task local.
Letting it continue hits the contended case:
(gdb) info r
rdx 0x7ffff7dc1a70 140737351785072
rsi 0x7ffff7dc1a70 140737351785072
rdi 0x7fffffffea90 140737488349840
eflags 0xa87 [ CF PF SF IF OF ]
(gdb) x/xg 0x7ffff7dc1a70
0x7ffff7dc1a70: 0x00007fffffffeab0
Note the lack of ZF in eflags, which means the cmpxchg() failed. RDI points
to mutex->lock and the op pointer still points to mutex->list (offset 0x20)
as it should.
A 32-bit executable with the same breakpoint shows the bit 0 handling:
eax 0x1c0 448
ecx 0x0 0
edx 0xf7fbc5f5 -134494731
ebx 0x56558ff4 1448447988
esi 0xf7fbc5f4 -134494732
edi 0xffffdbf8 -9224
eflags 0x246 [ PF ZF IF ]
(gdb) x/xw 0xf7fbc5f4
0xf7fbc5f4: 0x00000000
EDX has bit 0 set and ESI has the real pointer address with bit 0
cleared. ESI is used to clear it. EDX is handed to the fixup function and
that takes the correct path by testing bit 0.
The series applies on v7.0-rc3 and is also available via git:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git locking/futex
Thanks
tglx
---
It's so much easier to suggest solutions when you don't know too much about the problem.
- Malcolm Forbes
---
Documentation/locking/robust-futexes.rst | 8
arch/x86/Kconfig | 1
arch/x86/entry/vdso/common/vfutex.c | 72 +++++++
arch/x86/entry/vdso/vdso32/Makefile | 5
arch/x86/entry/vdso/vdso32/vdso32.lds.S | 6
arch/x86/entry/vdso/vdso32/vfutex.c | 1
arch/x86/entry/vdso/vdso64/Makefile | 7
arch/x86/entry/vdso/vdso64/vdso64.lds.S | 6
arch/x86/entry/vdso/vdso64/vdsox32.lds.S | 6
arch/x86/entry/vdso/vdso64/vfutex.c | 1
arch/x86/entry/vdso/vma.c | 20 ++
arch/x86/include/asm/futex_robust.h | 44 ++++
arch/x86/include/asm/vdso.h | 3
arch/x86/tools/vdso2c.c | 17 +
include/linux/futex.h | 43 +++-
include/linux/futex_types.h | 74 +++++++
include/linux/mm_types.h | 12 -
include/linux/sched.h | 16 -
include/uapi/linux/futex.h | 24 ++
include/vdso/futex.h | 44 ++++
init/Kconfig | 6
io_uring/futex.c | 2
kernel/entry/common.c | 9
kernel/exit.c | 4
kernel/futex/core.c | 303 ++++++++++++++++++-------------
kernel/futex/futex.h | 11 -
kernel/futex/pi.c | 41 ++--
kernel/futex/syscalls.c | 36 +--
kernel/futex/waitwake.c | 27 ++
29 files changed, 636 insertions(+), 213 deletions(-)
---
#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#include <linux/futex.h>
#include <linux/prctl.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>

typedef uint32_t (*frtu_t)(uint32_t *, uint32_t, void *);

static frtu_t frtu;

static void get_vdso(void)
{
	void *vdso = dlopen("linux-vdso.so.1",
			    RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);

	if (!vdso)
		vdso = dlopen("linux-gate.so.1",
			      RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
	if (!vdso)
		vdso = dlopen("linux-vdso32.so.1",
			      RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
	if (!vdso)
		vdso = dlopen("linux-vdso64.so.1",
			      RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
	if (!vdso) {
		printf("Failed to find vDSO\n");
		exit(1);
	}

	frtu = (frtu_t)dlsym(vdso, "__vdso_futex_robust_try_unlock");
	if (!frtu) {
		printf("Failed to find frtu in vDSO\n");
		exit(1);
	}
}

static void *mangle_pop(struct robust_list_head *rh)
{
	unsigned long addr = (unsigned long) &rh->list_op_pending;

	if (__BITS_PER_LONG != 64)
		addr |= 0x01;
	return (void *)addr;
}
#define FUTEX_ROBUST_UNLOCK	512
#define UNLOCK			(FUTEX_WAKE | FUTEX_ROBUST_UNLOCK)
#define UNLOCK_PI		(FUTEX_UNLOCK_PI | FUTEX_ROBUST_UNLOCK)

int main(int argc, char * const argv[])
{
	pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
	uint32_t *lock = (uint32_t *)&mutex.__data.__lock;
	struct robust_list_head *rhead;
	pid_t ltid, tid = gettid();
	struct robust_list *list;
	size_t sz;
	void *pop;

	get_vdso();

	syscall(SYS_get_robust_list, 0, &rhead, &sz);
	pop = mangle_pop(rhead);
	list = (struct robust_list *)&mutex.__data.__list.__next;

	printf("Testing non contended unlock\n");
	*lock = tid;
	rhead->list_op_pending = list;
	ltid = frtu(lock, tid, pop);
	if (ltid != tid)
		printf("Non contended unlock failed: LTID %08x\n", ltid);
	else if (rhead->list_op_pending)
		printf("List op not cleared: %16lx\n", (unsigned long) rhead->list_op_pending);
	else if (*lock)
		printf("Non contended unlock failed: LOCK %08x\n", *lock);
	else
		printf("Non contended unlock succeeded\n");

	printf("Testing contended FUTEX_WAKE unlock\n");
	*lock = tid | FUTEX_WAITERS;
	rhead->list_op_pending = list;
	ltid = frtu(lock, tid, pop);
	if (ltid == tid) {
		printf("Contended unlock succeeded %08x\n", ltid);
	} else if (!rhead->list_op_pending) {
		printf("List op cleared: %16lx\n", (unsigned long) rhead->list_op_pending);
	} else if (!*lock) {
		printf("Contended unlock cleared LOCK %08x\n", *lock);
	} else {
		int ret = syscall(SYS_futex, lock, UNLOCK, 1, NULL, pop, 0, 0);

		if (ret < 0)
			printf("FUTEX_WAKE|FUTEX_ROBUST_UNLOCK failed %d\n", errno);
		else if (rhead->list_op_pending)
			printf("List op not cleared in syscall: %16lx\n", (unsigned long) rhead->list_op_pending);
		else if (*lock)
			printf("Contended syscall unlock failed: LOCK %08x\n", *lock);
		else
			printf("Contended unlock syscall succeeded\n");
	}

	printf("Testing contended FUTEX_UNLOCK_PI unlock\n");
	*lock = tid | FUTEX_WAITERS;
	rhead->list_op_pending = list;
	ltid = frtu(lock, tid, pop);
	if (ltid == tid) {
		printf("Contended unlock succeeded %08x\n", ltid);
	} else if (!rhead->list_op_pending) {
		printf("List op cleared: %16lx\n", (unsigned long) rhead->list_op_pending);
	} else if (!*lock) {
		printf("Contended unlock cleared LOCK %08x\n", *lock);
	} else {
		int ret = syscall(SYS_futex, lock, UNLOCK_PI, 0, NULL, pop, 0, 0);

		if (ret < 0)
			printf("FUTEX_UNLOCK_PI|FUTEX_ROBUST_UNLOCK failed %d\n", errno);
		else if (rhead->list_op_pending)
			printf("List op not cleared in syscall: %16lx\n", (unsigned long) rhead->list_op_pending);
		else if (*lock)
			printf("Contended syscall unlock failed: LOCK %08x\n", *lock);
		else
			printf("Contended unlock syscall succeeded\n");
	}
	return 0;
}