public inbox for linux-kernel@vger.kernel.org
* [patch 0/8] futex: Address the robust futex unlock race for real
@ 2026-03-16 17:12 Thomas Gleixner
  2026-03-16 17:12 ` [patch 1/8] futex: Move futex task related data into a struct Thomas Gleixner
                   ` (7 more replies)
  0 siblings, 8 replies; 57+ messages in thread
From: Thomas Gleixner @ 2026-03-16 17:12 UTC (permalink / raw)
  To: LKML
  Cc: Mathieu Desnoyers, André Almeida, Sebastian Andrzej Siewior,
	Carlos O'Donell, Peter Zijlstra, Florian Weimer, Rich Felker,
	Torvald Riegel, Darren Hart, Ingo Molnar, Davidlohr Bueso,
	Arnd Bergmann, Liam R . Howlett

The robust futex unlock mechanism is racy with respect to the clearing of the
robust_list_head::list_op_pending pointer as described here:

  https://sourceware.org/bugzilla/show_bug.cgi?id=14485

This issue has been ignored for almost 15 years and has been brought up
again recently by André:

   https://lore.kernel.org/lkml/20260220202620.139584-1-andrealmeid@igalia.com/

Florian asked for this to be addressed in a VDSO function and a couple of
days ago Mathieu proposed a fix:

   https://lore.kernel.org/20260311185409.1988269-1-mathieu.desnoyers@efficios.com

While the idea of using a VDSO unlock helper function along with kernel
side fixups is correct, the proposed solution is incomplete and
untested. Aside from that it just glues the fix into the existing code
without taking the larger picture into account.

So I sat down and did a complete analysis. The problem is in the unlock
sequences of non-PI and PI futexes.

non-PI futexes do:

  1)	robust_list_set_op_pending(mutex);
  2)	robust_list_remove(mutex);
	
  	lval = 0;
  3)	atomic_xchg(lock, lval);
  4)	if (lval & FUTEX_WAITERS)
  5)		sys_futex(WAKE,....);
  6)	robust_list_clear_op_pending();

PI futexes do:

  1)	robust_list_set_op_pending(mutex);
  2)	robust_list_remove(mutex);
	
  	lval = gettid();
  3)	if (!atomic_try_cmpxchg(lock, lval, 0))
  4)		sys_futex(UNLOCK_PI,....);
  5)	robust_list_clear_op_pending();

In both cases the late clearing of the robust list pending op pointer
(#6/#5) can cause the following problem:

  Task 1			Task 2

       robust_list_set_op_pending(mutex);
       unlock(mutex);

    -->  interrupt which
    	 forces the T1 to exit.

				lock(mutex); // Non-contended

				// Last user of the mapping
				unlock(mutex);
				destroy(mutex);

				unmap();
				map(some other file);

	do_exit()
	  handle_robust_list()
	    handle_pending_op()
	      val = mutex->lock;
	      if (val == TID)
	      	 mutex->lock = val | FUTEX_OWNER_DIED;

This causes memory corruption in the new mapping when:

  1) the new mapping covers the address of mutex->lock

  2) the memory contains the TID of the exiting task

Unlikely to happen, but possible.

It does not matter whether the unlock happens in the kernel or whether
there are waiters or not, as in all scenarios T2 can observe itself being
the last user _before_ T1 manages to clear the pending op pointer.


So the first question to ask is why the unlock sequences of non-PI and PI
futexes are different.

The unlock sequence of non-PI predates both PI and robust futexes. The
robust handling was added around it. PI must use try_cmpxchg() and cannot
unlock a contended futex in user space as that would create inconsistent
PI state in the kernel.

But technically there is no requirement to keep the non-PI unlock sequence
as it is. It can do the same as PI and let the kernel unlock the futex in
the contended case. It has to go into the kernel anyway when there are
waiters. That functionality does not exist in the kernel, but it is trivial
to add.

With that the unlock sequence of non-PI and PI becomes identical and can be
easily extended to let the kernel do the clearing of the pending op pointer
as well. So that becomes:

        robust_list_set_op_pending(mutex);

	lval = gettid();

	if (atomic_try_cmpxchg(lock, lval, 0))
		robust_list_clear_op_pending();
	else
  		sys_futex($OP | FUTEX_ROBUST_UNLOCK, .... op_pointer);

which reduces the problem space significantly. The remaining issue is the
non-contended unlock case, which can be handled with a VDSO unlock sequence
and kernel side fixup.


But that op pointer has an issue when COMPAT is enabled. To support
mixed-mode futexes for gaming emulators the kernel allows a 64-bit
application to register both a 64-bit native robust list head and a 32-bit
compat robust list head. That means on COMPAT kernels the type of the op
pointer is undefined.

That can be solved by marking the pointer size in the pointer itself,
similar to the way the robust list pointer marks PI futexes in bit 0. If
the bit is set the pointer has to be treated as a 32-bit pointer. If not,
it uses the native size (64-bit). This is required for several reasons:

    1) sys_futex() has no compat variant

    2) The gaming emulators use both 64-bit and compat 32-bit robust
       lists in the same 64-bit application

    3) Having the pointer handed in spares the evaluation of the registered
       robust lists to figure out whether the futex address matches the
       registered [compat_]robust_list_head::list_op_pending pointer.

       This would become even worse once the multi robust list changes
       André is working on get merged.

As a consequence 32-bit applications have to set this bit unconditionally
so they can run on a 64-bit kernel in compat mode unmodified. 32-bit
kernels return an error code when the bit is not set. 64-bit kernels will
happily clear the full 64 bits if user space fails to set it.

Arguably this could be done with different functions and more flags, but
there is no benefit to that.

With that addressed there is "only" the non-contended case left.

The important question is when this has to be dealt with. It's only
relevant when:

    1) the unlocking task was interrupted in the unlock sequence between
       successful unlock and the store which clears the pointer.

    2) the unlocking task needs to handle a signal on return from interrupt
       to user space

The proper place to check for this is therefore right before the invocation
of arch_do_signal_or_restart() in __exit_to_user_mode_loop().

That needs to check whether the task was interrupted and the instruction
pointer is within the critical section. On x86 (COMPAT=n) this critical
section boils down to:

		mov		%esi,%eax	// Load TID into EAX
        	xor		%ecx,%ecx	// Set ECX to 0
   		lock cmpxchg	%ecx,(%rdi)	// Try the TID -> 0 transition
	.Lstart:
		jnz    		.Lend
    		movq		$0x0,(%rdx)	// Clear list_op_pending
	.Lend:

I.e. the critical section is: IP >= .Lstart && IP < .Lend. That's a very
lightweight range check which won't affect signal heavy applications in a
noticeable way.

If the range check observes IP in the critical section it invokes the fixup
function which determines whether the pointer has to be cleared. In the
above x86 case this just has to evaluate regs->flags & X86_EFLAGS_ZF.

If ZF is set then the pointer must be cleared. The pointer address is
retrieved from regs->dx in this case.

The COMPAT case with different sizes is slightly more complex as it has to
do the size checking in the success path by testing bit 0. That obviously
modifies ZF too, so it has to check whether IP is within the success
region.

	mov		%esi,%eax	// Load TID into EAX
	mov		%rdx,%rsi	// Get the op pointer
	xor		%ecx,%ecx	// Set ECX to 0
	and		$0xfffffffffffffffe,%rsi // Clear the size bit
	lock cmpxchg	%ecx,(%rdi)	// Try the TID -> 0 transition
  .Lstart:
	jnz    		.Lend
  .Lsuccess:
	testl		$0x1,(%rdx)	// Test the size bit
	jz     		.Lop64		// Not set: 64-bit
	movl		$0x0,(%rsi)	// Clear 32-bit
	jmp    		.Lend
  .Lop64:	
	movq		$0x0,(%rsi)	// Clear 64-bit
  .Lend:
	ret

Again it could be argued to do that with two different fixup functions, but
there is no point in that:

   1) User space needs to handle the two variants instead of just
      relying on a bit which can be saved in the mutex at
      initialization time.

   2) It becomes pointlessly different from the syscall op pointer handling

   3) The fixup decision function then has to evaluate which code path is
      used. That just adds more symbols and range checking for no real
      value.

Note that the fixup is done when a signal is pending even if no signal is
delivered. That's required because fatal signals do not reach the actual
signal delivery code where e.g. RSEQ is handled. They never return from
get_signal(). Doing it at that point is harmless and the only
downside is a redundant store if the task continues. The pending op
pointer is task local so it does not matter whether it's nullified twice or
not.

The remaining question is how to do the range checking. Mathieu's proposal
[ab]used the exception table to do the range check. That's not really the
correct approach because:

   1) It is not an exception handler

   2) It is a fixup similar to RSEQ critical section fixups

So the right approach is to use a dedicated mechanism and not to glue it
into something which has a completely different purpose. That just creates
complexity and overhead.

The critical section labels can be made visible in the VDSO and the VDSO
image generator provides the offsets in the image struct to the kernel. The
VDSO [re]map() evaluates them and puts them into mm_struct::futex::XXX.

Arguably the range check could chase all of that down via
mm->context->vdso->image, but that touches way more cache lines and
requires architecture specific range checking functions. Letting the
architecture specific VDSO [re]map() code store the information in
mm_struct::futex reduces the cache line impact and makes the range check
generic.

The series implements all of the above and passes the simple user space
test case appended below, compiled for both 32-bit and 64-bit and tested on
a COMPAT enabled 64-bit kernel. A 32-bit kernel should just work, but hasn't
been tested yet.

The fixup mechanism is easy to validate. Run the program in gdb and set a
breakpoint at __vdso_futex_robust_try_unlock_cs_start or at
__vdso_futex_robust_try_unlock_cs_success.

Hitting the breakpoint causes the fixup to run because gdb/ptrace uses
signals to control the tracee. So the op pointer must have been cleared if
the cmpxchg() succeeded:

  (gdb) b __vdso_futex_robust_try_unlock_cs_start
  (gdb) run
  Testing non contended unlock

  Breakpoint 1, 0x00007ffff7fc67a3 in __vdso_futex_robust_try_unlock ()
  (gdb) info r
  rdx            0x7ffff7dc1a70      140737351785072
  rsi            0x7ffff7dc1a70      140737351785072
  rdi            0x7fffffffea90      140737488349840
  eflags         0x246               [ PF ZF IF ]
  (gdb) x/xg 0x7ffff7dc1a70
  0x7ffff7dc1a70:	0x0000000000000000

ZF is set, which indicates that the cmpxchg() succeeded. The fixup is
"premature" but as explained above that is harmless. The success path
writing 0 again is just a redundant store with no side effects as the
pointer is task local.

Letting it continue hits the contended case:

  (gdb) info r
  rdx            0x7ffff7dc1a70      140737351785072
  rsi            0x7ffff7dc1a70      140737351785072
  rdi            0x7fffffffea90      140737488349840
  eflags         0xa87               [ CF PF SF IF OF ]
  (gdb) x/xg 0x7ffff7dc1a70
  0x7ffff7dc1a70:	0x00007fffffffeab0

Note the lack of ZF in eflags, which means cmpxchg() failed. RDI points to
mutex->lock and the op pointer still points to mutex->list (offset 0x20),
as it should.

A 32-bit executable with the same breakpoint shows the bit 0 handling:

  eax            0x1c0               448
  ecx            0x0                 0
  edx            0xf7fbc5f5          -134494731
  ebx            0x56558ff4          1448447988
  esi            0xf7fbc5f4          -134494732
  edi            0xffffdbf8          -9224
  eflags         0x246               [ PF ZF IF ]

  (gdb) x/xw 0xf7fbc5f4
  0xf7fbc5f4:	0x00000000

EDX has bit 0 set and ESI has the real pointer address with bit 0
cleared. ESI is used to clear it. EDX is handed to the fixup function and
that takes the correct path by testing bit 0.

The series applies on v7.0-rc3 and is also available via git:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git locking/futex

Thanks

	tglx
---
  It's so much easier to suggest solutions when you don't know too much about the problem.
     - Malcolm Forbes
---
 Documentation/locking/robust-futexes.rst |    8 
 arch/x86/Kconfig                         |    1 
 arch/x86/entry/vdso/common/vfutex.c      |   72 +++++++
 arch/x86/entry/vdso/vdso32/Makefile      |    5 
 arch/x86/entry/vdso/vdso32/vdso32.lds.S  |    6 
 arch/x86/entry/vdso/vdso32/vfutex.c      |    1 
 arch/x86/entry/vdso/vdso64/Makefile      |    7 
 arch/x86/entry/vdso/vdso64/vdso64.lds.S  |    6 
 arch/x86/entry/vdso/vdso64/vdsox32.lds.S |    6 
 arch/x86/entry/vdso/vdso64/vfutex.c      |    1 
 arch/x86/entry/vdso/vma.c                |   20 ++
 arch/x86/include/asm/futex_robust.h      |   44 ++++
 arch/x86/include/asm/vdso.h              |    3 
 arch/x86/tools/vdso2c.c                  |   17 +
 include/linux/futex.h                    |   43 +++-
 include/linux/futex_types.h              |   74 +++++++
 include/linux/mm_types.h                 |   12 -
 include/linux/sched.h                    |   16 -
 include/uapi/linux/futex.h               |   24 ++
 include/vdso/futex.h                     |   44 ++++
 init/Kconfig                             |    6 
 io_uring/futex.c                         |    2 
 kernel/entry/common.c                    |    9 
 kernel/exit.c                            |    4 
 kernel/futex/core.c                      |  303 ++++++++++++++++++-------------
 kernel/futex/futex.h                     |   11 -
 kernel/futex/pi.c                        |   41 ++--
 kernel/futex/syscalls.c                  |   36 +--
 kernel/futex/waitwake.c                  |   27 ++
 29 files changed, 636 insertions(+), 213 deletions(-)
---
#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#include <linux/futex.h>
#include <linux/prctl.h>

#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>

typedef uint32_t (*frtu_t)(uint32_t *, uint32_t, void *);

static frtu_t frtu;

static void get_vdso(void)
{
	void *vdso = dlopen("linux-vdso.so.1",
			    RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
	if (!vdso)
		vdso = dlopen("linux-gate.so.1",
			      RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
	if (!vdso)
		vdso = dlopen("linux-vdso32.so.1",
			      RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
	if (!vdso)
		vdso = dlopen("linux-vdso64.so.1",
			      RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
	if (!vdso) {
		printf("Failed to find vDSO\n");
		exit(1);
	}

	frtu = (frtu_t)dlsym(vdso, "__vdso_futex_robust_try_unlock");
	if (!frtu) {
		printf("Failed to find frtu in vDSO\n");
		exit(1);
	}
}

static void *mangle_pop(struct robust_list_head *rh)
{
	unsigned long addr = (unsigned long) &rh->list_op_pending;

	if (__BITS_PER_LONG != 64)
		addr |= 0x01;
	return (void *)addr;
}

#define FUTEX_ROBUST_UNLOCK	512
#define UNLOCK			(FUTEX_WAKE | FUTEX_ROBUST_UNLOCK)
#define UNLOCK_PI		(FUTEX_UNLOCK_PI | FUTEX_ROBUST_UNLOCK)

int main(int argc, char * const argv[])
{
	pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
	uint32_t *lock = (uint32_t *)&mutex.__data.__lock;
	struct robust_list_head *rhead;
	pid_t ltid, tid = gettid();
	struct robust_list *list;
	size_t sz;
	void *pop;

	get_vdso();

	syscall(SYS_get_robust_list, 0, &rhead, &sz);
	pop = mangle_pop(rhead);

	list = (struct robust_list *)&mutex.__data.__list.__next;

	printf("Testing non contended unlock\n");
	*lock = tid;
	rhead->list_op_pending = list;

	ltid = frtu(lock, tid, pop);
	if (ltid != tid)
		printf("Non contended unlock failed: LTID %08x\n", ltid);
	else if (rhead->list_op_pending)
		printf("List op not cleared: %16lx\n", (unsigned long) rhead->list_op_pending);
	else if (*lock)
		printf("Non contended unlock failed: LOCK %08x\n", *lock);
	else
		printf("Non contended unlock succeeded\n");

	printf("Testing contended FUTEX_WAKE unlock\n");

	*lock = tid | FUTEX_WAITERS;
	rhead->list_op_pending = list;

	ltid = frtu(lock, tid, pop);
	if (ltid == tid) {
		printf("Contended unlock succeeded %08x\n", ltid);
	} else if (!rhead->list_op_pending) {
		printf("List op cleared: %16lx\n", (unsigned long) rhead->list_op_pending);
	} else if (!*lock) {
		printf("Contended unlock cleared LOCK %08x\n", *lock);
	} else {
		int ret = syscall(SYS_futex, lock, UNLOCK, 1, NULL, pop, 0, 0);

		if (ret < 0)
			printf("FUTEX_WAKE|FUTEX_ROBUST_UNLOCK failed %d\n", errno);
		else if (rhead->list_op_pending)
			printf("List op not cleared in syscall: %16lx\n", (unsigned long) rhead->list_op_pending);
		else if (*lock)
			printf("Contended syscall unlock failed: LOCK %08x\n", *lock);
		else
			printf("Contended unlock syscall succeeded\n");
	}

	printf("Testing contended FUTEX_UNLOCK_PI unlock\n");

	*lock = tid | FUTEX_WAITERS;
	rhead->list_op_pending = list;

	ltid = frtu(lock, tid, pop);
	if (ltid == tid) {
		printf("Contended unlock succeeded %08x\n", ltid);
	} else if (!rhead->list_op_pending) {
		printf("List op cleared: %16lx\n", (unsigned long) rhead->list_op_pending);
	} else if (!*lock) {
		printf("Contended unlock cleared LOCK %08x\n", *lock);
	} else {
		int ret = syscall(SYS_futex, lock, UNLOCK_PI, 0, NULL, pop, 0, 0);

		if (ret < 0)
			printf("FUTEX_UNLOCK_PI|FUTEX_ROBUST_UNLOCK failed %d\n", errno);
		else if (rhead->list_op_pending)
			printf("List op not cleared in syscall: %16lx\n", (unsigned long) rhead->list_op_pending);
		else if (*lock)
			printf("Contended syscall unlock failed: LOCK %08x\n", *lock);
		else
			printf("Contended unlock syscall succeeded\n");
	}

	return 0;
}









^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2026-03-19 23:32 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-16 17:12 [patch 0/8] futex: Address the robust futex unlock race for real Thomas Gleixner
2026-03-16 17:12 ` [patch 1/8] futex: Move futex task related data into a struct Thomas Gleixner
2026-03-16 17:55   ` Mathieu Desnoyers
2026-03-17  2:24   ` André Almeida
2026-03-17  9:52     ` Thomas Gleixner
2026-03-16 17:13 ` [patch 2/8] futex: Move futex related mm_struct " Thomas Gleixner
2026-03-16 18:00   ` Mathieu Desnoyers
2026-03-16 17:13 ` [patch 3/8] futex: Provide UABI defines for robust list entry modifiers Thomas Gleixner
2026-03-16 18:02   ` Mathieu Desnoyers
2026-03-17  2:38   ` André Almeida
2026-03-17  9:53     ` Thomas Gleixner
2026-03-16 17:13 ` [patch 4/8] futex: Add support for unlocking robust futexes Thomas Gleixner
2026-03-16 18:24   ` Mathieu Desnoyers
2026-03-17 16:17   ` André Almeida
2026-03-17 20:46     ` Peter Zijlstra
2026-03-17 22:40       ` Thomas Gleixner
2026-03-18  8:02         ` Peter Zijlstra
2026-03-18  8:06           ` Florian Weimer
2026-03-18 14:47           ` Peter Zijlstra
2026-03-18 16:03             ` Thomas Gleixner
2026-03-16 17:13 ` [patch 5/8] futex: Add robust futex unlock IP range Thomas Gleixner
2026-03-16 18:36   ` Mathieu Desnoyers
2026-03-17 19:19   ` André Almeida
2026-03-16 17:13 ` [patch 6/8] futex: Provide infrastructure to plug the non contended robust futex unlock race Thomas Gleixner
2026-03-16 18:35   ` Mathieu Desnoyers
2026-03-16 20:29     ` Thomas Gleixner
2026-03-16 20:52       ` Mathieu Desnoyers
2026-03-16 17:13 ` [patch 7/8] x86/vdso: Prepare for robust futex unlock support Thomas Gleixner
2026-03-16 17:13 ` [patch 8/8] x86/vdso: Implement __vdso_futex_robust_try_unlock() Thomas Gleixner
2026-03-16 19:19   ` Mathieu Desnoyers
2026-03-16 21:02     ` Thomas Gleixner
2026-03-16 22:35       ` Mathieu Desnoyers
2026-03-16 21:14     ` Thomas Gleixner
2026-03-16 21:29     ` Thomas Gleixner
2026-03-17  7:25   ` Thomas Weißschuh
2026-03-17  9:51     ` Thomas Gleixner
2026-03-17 11:17       ` Thomas Weißschuh
2026-03-18 16:17         ` Thomas Gleixner
2026-03-19  7:41           ` Thomas Weißschuh
2026-03-19  8:53             ` Florian Weimer
2026-03-19  9:04               ` Thomas Weißschuh
2026-03-19  9:08               ` Peter Zijlstra
2026-03-19 23:31                 ` Thomas Gleixner
2026-03-19 10:36             ` Sebastian Andrzej Siewior
2026-03-19 10:49               ` Thomas Weißschuh
2026-03-19 10:55                 ` Sebastian Andrzej Siewior
2026-03-17  8:28   ` Florian Weimer
2026-03-17  9:36     ` Thomas Gleixner
2026-03-17 10:37       ` Florian Weimer
2026-03-17 22:32         ` Thomas Gleixner
2026-03-18 22:08           ` Thomas Gleixner
2026-03-18 22:10             ` Peter Zijlstra
2026-03-19  2:05             ` André Almeida
2026-03-19  7:10               ` Thomas Gleixner
2026-03-17 15:33   ` Uros Bizjak
2026-03-18  8:21     ` Thomas Gleixner
2026-03-18  8:32       ` Uros Bizjak
