public inbox for linux-kernel@vger.kernel.org
From: Thomas Gleixner <tglx@kernel.org>
To: LKML <linux-kernel@vger.kernel.org>
Cc: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>,
	"André Almeida" <andrealmeid@igalia.com>,
	"Sebastian Andrzej Siewior" <bigeasy@linutronix.de>,
	"Carlos O'Donell" <carlos@redhat.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Florian Weimer" <fweimer@redhat.com>,
	"Rich Felker" <dalias@aerifal.cx>,
	"Torvald Riegel" <triegel@redhat.com>,
	"Darren Hart" <dvhart@infradead.org>,
	"Ingo Molnar" <mingo@kernel.org>,
	"Davidlohr Bueso" <dave@stgolabs.net>,
	"Arnd Bergmann" <arnd@arndb.de>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	"Uros Bizjak" <ubizjak@gmail.com>,
	"Thomas Weißschuh" <linux@weissschuh.net>
Subject: [patch V3 12/14] x86/vdso: Implement __vdso_futex_robust_try_unlock()
Date: Mon, 30 Mar 2026 14:03:01 +0200	[thread overview]
Message-ID: <20260330120117.878203456@kernel.org> (raw)
In-Reply-To: 20260330114212.927686587@kernel.org

When the FUTEX_ROBUST_UNLOCK mechanism is used for unlocking (PI-)futexes,
the unlock sequence in userspace looks like this:

  1)	robust_list_set_op_pending(mutex);
  2)	robust_list_remove(mutex);
	
  	lval = gettid();
  3)	if (atomic_try_cmpxchg(&mutex->lock, lval, 0))
  4)		robust_list_clear_op_pending();
  	else
  5)		sys_futex(OP,...FUTEX_ROBUST_UNLOCK);

This still leaves a small race window between #3 and #4 in which the mutex
can be acquired by some other task, which observes that it is the last
user and:

  1) unmaps the mutex memory
  2) maps a different file, which ends up covering the same address

If the original task then exits before reaching #4, the kernel robust list
handling observes the pending op entry and tries to fix up user space.

If the newly mapped data contains the TID of the exiting thread at the
address of the mutex/futex, the kernel sets the owner died bit in that
memory and thereby corrupts unrelated data.

Provide a VDSO function which exposes the critical section window in the
VDSO symbol table. The resulting addresses are updated in the task's mm
when the VDSO is (re)mapped.

The core code detects when a task was interrupted within the critical
section and is about to deliver a signal. It then invokes an
architecture-specific function which determines whether the pending op
pointer has to be cleared or not. The unlock assembly sequence on 64-bit is:

	mov		%esi,%eax	// Load TID into EAX
	xor		%ecx,%ecx	// Set ECX to 0
	lock cmpxchg	%ecx,(%rdi)	// Try the TID -> 0 transition
  .Lstart:
	jnz		.Lend
	movq		%rcx,(%rdx)	// Clear list_op_pending
  .Lend:
	ret

So the decision can simply be based on the ZF state in regs->flags. The
pending op pointer is always in DX, independent of the build mode
(32/64-bit), to make its retrieval uniform. The size of the pointer is
stored in the matching critical section range struct and the core code
retrieves it from there, so the retrieval function does not have to care.
It is bit-size independent:

     return regs->flags & X86_EFLAGS_ZF ? regs->dx : NULL;

There are two entry points to handle the different robust list pending op
pointer size:

	__vdso_futex_robust_list64_try_unlock()
	__vdso_futex_robust_list32_try_unlock()

The 32-bit VDSO provides only __vdso_futex_robust_list32_try_unlock().

The 64-bit VDSO always provides __vdso_futex_robust_list64_try_unlock()
and, when COMPAT is enabled, also the list32 variant, which is required to
support the mixed-size robust list pointers used by gaming emulators.

The unlock function is inspired by an idea from Mathieu Desnoyers.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Acked-by: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/20260311185409.1988269-1-mathieu.desnoyers@efficios.com
--
V3: Use 'r' for the zero register       - Uros
V2: Provide different entry points	- Florian
    Use __u32 and __x86_64__		- Thomas
    Use private labels			- Thomas
    Optimize assembly		   	- Uros
    
    Split the functions up now that ranges are supported in the core and
    document the actual assembly.
---
 arch/x86/Kconfig                         |    1 
 arch/x86/entry/vdso/common/vfutex.c      |   71 +++++++++++++++++++++++++++++++
 arch/x86/entry/vdso/vdso32/Makefile      |    5 +-
 arch/x86/entry/vdso/vdso32/vdso32.lds.S  |    3 +
 arch/x86/entry/vdso/vdso32/vfutex.c      |    1 
 arch/x86/entry/vdso/vdso64/Makefile      |    7 +--
 arch/x86/entry/vdso/vdso64/vdso64.lds.S  |    7 +++
 arch/x86/entry/vdso/vdso64/vdsox32.lds.S |    7 +++
 arch/x86/entry/vdso/vdso64/vfutex.c      |    1 
 arch/x86/include/asm/futex_robust.h      |   19 ++++++++
 10 files changed, 117 insertions(+), 5 deletions(-)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -238,6 +238,7 @@ config X86
 	select HAVE_EFFICIENT_UNALIGNED_ACCESS
 	select HAVE_EISA			if X86_32
 	select HAVE_EXIT_THREAD
+	select HAVE_FUTEX_ROBUST_UNLOCK
 	select HAVE_GENERIC_TIF_BITS
 	select HAVE_GUP_FAST
 	select HAVE_FENTRY			if X86_64 || DYNAMIC_FTRACE
--- /dev/null
+++ b/arch/x86/entry/vdso/common/vfutex.c
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <vdso/futex.h>
+
+/*
+ * Assembly template for the try unlock functions. The basic functionality is:
+ *
+ *		mov		%esi, %eax	Move the TID into EAX
+ *		xor		%ecx, %ecx	Clear ECX
+ *		lock cmpxchgl	%ecx, (%rdi)	Attempt the TID -> 0 transition
+ * .Lcs_start:					Start of the critical section
+ *		jnz		.Lcs_end	If the cmpxchg failed jump to the end
+ * .Lcs_success:				Start of the success section
+ *		movq		%rcx, (%rdx)	Set the pending op pointer to 0
+ * .Lcs_end:					End of the critical section
+ *
+ * .Lcs_start and .Lcs_end establish the critical section range. .Lcs_success is
+ * technically not required, but there for illustration, debugging and testing.
+ *
+ * When CONFIG_COMPAT is enabled then the 64-bit VDSO provides two functions.
+ * One for the regular 64-bit sized pending operation pointer and one for a
+ * 32-bit sized pointer to support gaming emulators.
+ *
+ * The 32-bit VDSO provides only the one for 32-bit sized pointers.
+ */
+#define __stringify_1(x...)	#x
+#define __stringify(x...)	__stringify_1(x)
+
+#define LABEL(prefix, which)	__stringify(prefix##_try_unlock_cs_##which:)
+
+#define JNZ_END(prefix)		"jnz " __stringify(prefix) "_try_unlock_cs_end\n"
+
+#define CLEAR_POPQ		"movq	%[zero],  %a[pop]\n"
+#define CLEAR_POPL		"movl	%k[zero], %a[pop]\n"
+
+#define futex_robust_try_unlock(prefix, clear_pop, __lock, __tid, __pop)	\
+({									\
+	asm volatile (							\
+		"						\n"	\
+		"	lock cmpxchgl	%k[zero], %a[lock]	\n"	\
+		"						\n"	\
+		LABEL(prefix, start)					\
+		"						\n"	\
+		JNZ_END(prefix)						\
+		"						\n"	\
+		LABEL(prefix, success)					\
+		"						\n"	\
+			clear_pop					\
+		"						\n"	\
+		LABEL(prefix, end)					\
+		: [tid]   "+&a" (__tid)					\
+		: [lock]  "D"   (__lock),				\
+		  [pop]   "d"   (__pop),				\
+		  [zero]  "r"   (0UL)					\
+		: "memory"						\
+	);								\
+	__tid;								\
+})
+
+#ifdef __x86_64__
+__u32 __vdso_futex_robust_list64_try_unlock(__u32 *lock, __u32 tid, __u64 *pop)
+{
+	return futex_robust_try_unlock(__futex_list64, CLEAR_POPQ, lock, tid, pop);
+}
+#endif /* __x86_64__ */
+
+#if defined(CONFIG_X86_32) || defined(CONFIG_COMPAT)
+__u32 __vdso_futex_robust_list32_try_unlock(__u32 *lock, __u32 tid, __u32 *pop)
+{
+	return futex_robust_try_unlock(__futex_list32, CLEAR_POPL, lock, tid, pop);
+}
+#endif /* CONFIG_X86_32 || CONFIG_COMPAT */
--- a/arch/x86/entry/vdso/vdso32/Makefile
+++ b/arch/x86/entry/vdso/vdso32/Makefile
@@ -7,8 +7,9 @@
 vdsos-y			:= 32
 
 # Files to link into the vDSO:
-vobjs-y			:= note.o vclock_gettime.o vgetcpu.o
-vobjs-y			+= system_call.o sigreturn.o
+vobjs-y					:= note.o vclock_gettime.o vgetcpu.o
+vobjs-y					+= system_call.o sigreturn.o
+vobjs-$(CONFIG_FUTEX_ROBUST_UNLOCK)	+= vfutex.o
 
 # Compilation flags
 flags-y			:= -DBUILD_VDSO32 -m32 -mregparm=0
--- a/arch/x86/entry/vdso/vdso32/vdso32.lds.S
+++ b/arch/x86/entry/vdso/vdso32/vdso32.lds.S
@@ -30,6 +30,9 @@ VERSION
 		__vdso_clock_gettime64;
 		__vdso_clock_getres_time64;
 		__vdso_getcpu;
+#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
+		__vdso_futex_robust_list32_try_unlock;
+#endif
 	};
 
 	LINUX_2.5 {
--- /dev/null
+++ b/arch/x86/entry/vdso/vdso32/vfutex.c
@@ -0,0 +1 @@
+#include "common/vfutex.c"
--- a/arch/x86/entry/vdso/vdso64/Makefile
+++ b/arch/x86/entry/vdso/vdso64/Makefile
@@ -8,9 +8,10 @@ vdsos-y				:= 64
 vdsos-$(CONFIG_X86_X32_ABI)	+= x32
 
 # Files to link into the vDSO:
-vobjs-y				:= note.o vclock_gettime.o vgetcpu.o
-vobjs-y				+= vgetrandom.o vgetrandom-chacha.o
-vobjs-$(CONFIG_X86_SGX)		+= vsgx.o
+vobjs-y					:= note.o vclock_gettime.o vgetcpu.o
+vobjs-y					+= vgetrandom.o vgetrandom-chacha.o
+vobjs-$(CONFIG_X86_SGX)			+= vsgx.o
+vobjs-$(CONFIG_FUTEX_ROBUST_UNLOCK)	+= vfutex.o
 
 # Compilation flags
 flags-y				:= -DBUILD_VDSO64 -m64 -mcmodel=small
--- a/arch/x86/entry/vdso/vdso64/vdso64.lds.S
+++ b/arch/x86/entry/vdso/vdso64/vdso64.lds.S
@@ -32,6 +32,13 @@ VERSION {
 #endif
 		getrandom;
 		__vdso_getrandom;
+
+#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
+		__vdso_futex_robust_list64_try_unlock;
+#ifdef CONFIG_COMPAT
+		__vdso_futex_robust_list32_try_unlock;
+#endif
+#endif
 	local: *;
 	};
 }
--- a/arch/x86/entry/vdso/vdso64/vdsox32.lds.S
+++ b/arch/x86/entry/vdso/vdso64/vdsox32.lds.S
@@ -22,6 +22,13 @@ VERSION {
 		__vdso_getcpu;
 		__vdso_time;
 		__vdso_clock_getres;
+
+#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
+		__vdso_futex_robust_list64_try_unlock;
+#ifdef CONFIG_COMPAT
+		__vdso_futex_robust_list32_try_unlock;
+#endif
+#endif
 	local: *;
 	};
 }
--- /dev/null
+++ b/arch/x86/entry/vdso/vdso64/vfutex.c
@@ -0,0 +1 @@
+#include "common/vfutex.c"
--- /dev/null
+++ b/arch/x86/include/asm/futex_robust.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_FUTEX_ROBUST_H
+#define _ASM_X86_FUTEX_ROBUST_H
+
+#include <asm/ptrace.h>
+
+static __always_inline void __user *x86_futex_robust_unlock_get_pop(struct pt_regs *regs)
+{
+	/*
+	 * If ZF is set then the cmpxchg succeeded and the pending op pointer
+	 * needs to be cleared.
+	 */
+	return regs->flags & X86_EFLAGS_ZF ? (void __user *)regs->dx : NULL;
+}
+
+#define arch_futex_robust_unlock_get_pop(regs)	\
+	x86_futex_robust_unlock_get_pop(regs)
+
+#endif /* _ASM_X86_FUTEX_ROBUST_H */


Thread overview: 27+ messages
2026-03-30 12:01 [patch V3 00/14] futex: Address the robust futex unlock race for real Thomas Gleixner
2026-03-30 12:02 ` [patch V3 01/14] futex: Move futex task related data into a struct Thomas Gleixner
2026-03-30 12:02 ` [patch V3 02/14] futex: Make futex_mm_init() void Thomas Gleixner
2026-03-30 12:02 ` [patch V3 03/14] futex: Move futex related mm_struct data into a struct Thomas Gleixner
2026-03-30 15:23   ` Alexander Kuleshov
2026-03-30 12:02 ` [patch V3 04/14] futex: Provide UABI defines for robust list entry modifiers Thomas Gleixner
2026-03-30 12:02 ` [patch V3 05/14] uaccess: Provide unsafe_atomic_store_release_user() Thomas Gleixner
2026-03-30 13:33   ` Mark Rutland
2026-03-30 12:02 ` [patch V3 06/14] x86: Select ARCH_MEMORY_ORDER_TOS Thomas Gleixner
2026-03-30 13:34   ` Mark Rutland
2026-03-30 19:48     ` Thomas Gleixner
2026-03-30 12:02 ` [patch V3 07/14] futex: Cleanup UAPI defines Thomas Gleixner
2026-03-30 12:02 ` [patch V3 08/14] futex: Add support for unlocking robust futexes Thomas Gleixner
2026-03-30 12:02 ` [patch V3 09/14] futex: Add robust futex unlock IP range Thomas Gleixner
2026-03-30 12:02 ` [patch V3 10/14] futex: Provide infrastructure to plug the non contended robust futex unlock race Thomas Gleixner
2026-03-30 12:02 ` [patch V3 11/14] x86/vdso: Prepare for robust futex unlock support Thomas Gleixner
2026-03-30 12:03 ` Thomas Gleixner [this message]
2026-03-30 12:03 ` [patch V3 13/14] Documentation: futex: Add a note about robust list race condition Thomas Gleixner
2026-03-30 12:03 ` [patch V3 14/14] selftests: futex: Add tests for robust release operations Thomas Gleixner
2026-03-30 13:45 ` [patch V3 00/14] futex: Address the robust futex unlock race for real Mark Rutland
2026-03-30 13:51   ` Peter Zijlstra
2026-03-30 19:36   ` Thomas Gleixner
2026-03-31 14:12     ` Mark Rutland
2026-03-31 12:59   ` André Almeida
2026-03-31 13:03     ` Sebastian Andrzej Siewior
2026-03-31 14:13     ` Mark Rutland
2026-03-31 15:22   ` Thomas Gleixner
