public inbox for linux-kernel@vger.kernel.org
From: Thomas Gleixner <tglx@kernel.org>
To: LKML <linux-kernel@vger.kernel.org>
Cc: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>,
	"André Almeida" <andrealmeid@igalia.com>,
	"Sebastian Andrzej Siewior" <bigeasy@linutronix.de>,
	"Carlos O'Donell" <carlos@redhat.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Florian Weimer" <fweimer@redhat.com>,
	"Rich Felker" <dalias@aerifal.cx>,
	"Torvald Riegel" <triegel@redhat.com>,
	"Darren Hart" <dvhart@infradead.org>,
	"Ingo Molnar" <mingo@kernel.org>,
	"Davidlohr Bueso" <dave@stgolabs.net>,
	"Arnd Bergmann" <arnd@arndb.de>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>
Subject: [patch 6/8] futex: Provide infrastructure to plug the non contended robust futex unlock race
Date: Mon, 16 Mar 2026 18:13:24 +0100
Message-ID: <20260316164951.345973752@kernel.org>
In-Reply-To: <20260316162316.356674433@kernel.org>

When the FUTEX_ROBUST_UNLOCK mechanism is used for unlocking (PI-)futexes,
the unlock sequence in user space looks like this:

  1)	robust_list_set_op_pending(mutex);
  2)	robust_list_remove(mutex);
	
  	lval = gettid();
  3)	if (atomic_try_cmpxchg(&mutex->lock, &lval, 0))
  4)		robust_list_clear_op_pending();
  	else
  5)		sys_futex(OP | FUTEX_ROBUST_UNLOCK, ....);
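For illustration, the fast-path part of the sequence (#3 to #5) can be
modelled in plain C with GCC atomic builtins. This is a minimal sketch with
made-up helper names, not the actual libc implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Stand-in for the robust list bookkeeping (#1, #2, #4). Hypothetical
 * names; the real protocol manipulates robust_list_head in user space.
 */
static void *list_op_pending;

static void robust_list_set_op_pending(void *mutex) { list_op_pending = mutex; }
static void robust_list_clear_op_pending(void)      { list_op_pending = NULL; }

/*
 * Steps #3 to #5 of the unlock sequence. Returns true when the fast path
 * succeeded; false means the caller has to invoke
 * sys_futex(OP | FUTEX_ROBUST_UNLOCK, ...).
 */
static bool robust_try_unlock(uint32_t *lock, uint32_t tid)
{
	uint32_t lval = tid;

	/* #3: try the TID -> 0 transition */
	if (__atomic_compare_exchange_n(lock, &lval, 0, false,
					__ATOMIC_RELEASE, __ATOMIC_RELAXED)) {
		/* #4: clear the pending op entry */
		robust_list_clear_op_pending();
		return true;
	}
	/* #5: contended (e.g. FUTEX_WAITERS set): the kernel must unlock */
	return false;
}
```

The window the kernel has to fix up is exactly between the successful
compare-exchange (#3) and the clearing store (#4).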

That still leaves a minimal race window between #3 and #4 where the mutex
could be acquired by some other task, which observes that it is the last
user and:

  1) unmaps the mutex memory
  2) maps a different file, which ends up covering the same address

If the original task then exits before reaching #4, the kernel robust list
handling observes the pending op entry and tries to fix up user space.

If the newly mapped data contains the TID of the exiting thread at the
address of the mutex/futex, the kernel will set the owner died bit in that
memory and thereby corrupt unrelated data.

On X86 this boils down to this simplified assembly sequence:

		mov		%esi,%eax	// Load TID into EAX
		xor		%ecx,%ecx	// Set ECX to 0
   #3		lock cmpxchg	%ecx,(%rdi)	// Try the TID -> 0 transition
	.Lstart:
		jnz		.Lend		// Failed: let the syscall handle it
   #4		movq		$0x0,(%rdx)	// Clear list_op_pending
	.Lend:

If the cmpxchg() succeeds, but the task is interrupted before it can clear
list_op_pending in the robust list head (#4) and then crashes in a signal
handler or gets killed, it ends up in do_exit() and subsequently in the
robust list handling, which then might run into the unmap/map issue
described above.

This is only relevant when user space was interrupted and a signal is
pending. The fix-up has to be done before signal delivery is attempted
because:

   1) The signal might be fatal so get_signal() ends up in do_exit()

   2) The signal handler might crash or the task might be killed before
      returning from the handler. At that point the instruction pointer in
      pt_regs is no longer the instruction pointer of the initially
      interrupted unlock sequence.

The right place to handle this is in __exit_to_user_mode_loop() before
invoking arch_do_signal_or_restart(), as this obviously covers both
scenarios.

As this is only relevant when the task was interrupted in user space, this
is tied to RSEQ and the generic entry code, as RSEQ keeps track of user
space interrupts unconditionally even if the task does not have an RSEQ
region installed. That makes the decision very lightweight:

       if (current->rseq.event.user_irq && within(regs, unlock_ip_range))
       		futex_fixup_robust_unlock(regs);

futex_fixup_robust_unlock() then invokes an architecture specific function
which evaluates the register content to decide whether the pending op
pointer in the robust list head needs to be cleared.

Assuming the above unlock sequence, on x86 this results in the trivial
evaluation of the zero flag:

	return regs->flags & X86_EFLAGS_ZF;

Other architectures might need to do more complex evaluations due to LL/SC,
but the approach is valid in general. If COMPAT is enabled, the decision
function is a bit more complex, but that's an implementation detail.

The handling code also needs to retrieve the pending op pointer via an
architecture specific function to be able to perform the clearing.

The unlock sequence is going to be placed in the VDSO so that the kernel
can keep everything synchronized. The resulting code sequence for user
space is:

   if (__vdso_futex_robust_try_unlock(lock, tid, &pending_op) != tid)
 	err = sys_futex($OP | FUTEX_ROBUST_UNLOCK,....);

Both the VDSO unlock and the kernel side unlock ensure that the pending_op
pointer is always cleared when the lock becomes unlocked.
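The documented semantics of the helper can be modelled in plain C. This is
a sketch only, assuming a native-size @op pointer (the COMPAT variable-size
case is omitted); the real VDSO code is architecture specific assembly
placed inside the IP range the kernel knows about:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * C model of __vdso_futex_robust_try_unlock(): returns the observed lock
 * value, which equals @tid if and only if the TID -> 0 transition
 * succeeded. On success the pending op pointer is cleared as well.
 */
static uint32_t vdso_futex_robust_try_unlock_model(uint32_t *lock,
						   uint32_t tid, void **op)
{
	uint32_t val = tid;

	if (__atomic_compare_exchange_n(lock, &val, 0, false,
					__ATOMIC_RELEASE, __ATOMIC_RELAXED))
		*op = NULL;	/* clear robust_list_head::list_op_pending */

	/* On failure @val holds the observed value, e.g. tid | FUTEX_WAITERS */
	return val;
}
```

A caller falls back to sys_futex(FUTEX_ROBUST_UNLOCK) exactly when the
return value differs from the TID, as in the sequence above.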

The pending op pointer has the same modifier requirements as the @uaddr2
argument of sys_futex(FUTEX_ROBUST_UNLOCK) for the very same reasons. That
means VDSO implementations need to support the variable size case for the
pending op pointer as well if COMPAT is enabled.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
 include/linux/futex.h |   31 ++++++++++++++++++++++++++++++-
 include/vdso/futex.h  |   44 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/entry/common.c |    9 ++++++---
 kernel/futex/core.c   |   13 +++++++++++++
 4 files changed, 93 insertions(+), 4 deletions(-)

--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -110,7 +110,36 @@ static inline int futex_hash_allocate_de
 }
 static inline int futex_hash_free(struct mm_struct *mm) { return 0; }
 static inline int futex_mm_init(struct mm_struct *mm) { return 0; }
+#endif /* !CONFIG_FUTEX */
 
-#endif
+#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
+#include <asm/futex_robust.h>
+
+void __futex_fixup_robust_unlock(struct pt_regs *regs);
+
+static inline bool futex_within_robust_unlock(struct pt_regs *regs)
+{
+	unsigned long ip = instruction_pointer(regs);
+
+	return ip >= current->mm->futex.unlock_cs_start_ip &&
+		ip < current->mm->futex.unlock_cs_end_ip;
+}
+
+static inline void futex_fixup_robust_unlock(struct pt_regs *regs)
+{
+	/*
+	 * Avoid dereferencing current->mm if not returning from interrupt.
+	 * current->rseq.event is going to be used anyway in the exit to user
+	 * code, so bringing it in is not a big deal.
+	 */
+	if (!current->rseq.event.user_irq)
+		return;
+
+	if (unlikely(futex_within_robust_unlock(regs)))
+		__futex_fixup_robust_unlock(regs);
+}
+#else /* CONFIG_FUTEX_ROBUST_UNLOCK */
+static inline void futex_fixup_robust_unlock(struct pt_regs *regs) {}
+#endif /* !CONFIG_FUTEX_ROBUST_UNLOCK */
 
 #endif
--- /dev/null
+++ b/include/vdso/futex.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _VDSO_FUTEX_H
+#define _VDSO_FUTEX_H
+
+#include <linux/types.h>
+
+struct robust_list;
+
+/**
+ * __vdso_futex_robust_try_unlock - Try to unlock an uncontended robust futex
+ * @lock:	Pointer to the futex lock object
+ * @tid:	The TID of the calling task
+ * @op:		Pointer to the task's robust_list_head::list_pending_op
+ *
+ * Return: The content of *@lock. On success this is the same as @tid.
+ *
+ * The function implements:
+ *	if (atomic_try_cmpxchg(lock, &tid, 0))
+ *		*op = NULL;
+ *	return tid;
+ *
+ * There is a race between a successful unlock and clearing the pending op
+ * pointer in the robust list head. If the calling task is interrupted in the
+ * race window and has to handle a (fatal) signal on return to user space then
+ * the kernel handles the clearing of @op before attempting to deliver
+ * the signal. That ensures that a task cannot exit with a potentially invalid
+ * pending op pointer.
+ *
+ * User space uses it in the following way:
+ *
+ * if (__vdso_futex_robust_try_unlock(lock, tid, &pending_op) != tid)
+ *	err = sys_futex($OP | FUTEX_ROBUST_UNLOCK,....);
+ *
+ * If the unlock attempt fails due to the FUTEX_WAITERS bit set in the lock,
+ * then the syscall does the unlock, clears the pending op pointer and wakes the
+ * requested number of waiters.
+ *
+ * The @op pointer is intentionally void. It has the same requirements as the
+ * @uaddr2 argument for sys_futex(FUTEX_ROBUST_UNLOCK) operations. See the
+ * modifier and the related documentation in include/uapi/linux/futex.h
+ */
+uint32_t __vdso_futex_robust_try_unlock(uint32_t *lock, uint32_t tid, void *op);
+
+#endif
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -1,11 +1,12 @@
 // SPDX-License-Identifier: GPL-2.0
 
-#include <linux/irq-entry-common.h>
-#include <linux/resume_user_mode.h>
+#include <linux/futex.h>
 #include <linux/highmem.h>
+#include <linux/irq-entry-common.h>
 #include <linux/jump_label.h>
 #include <linux/kmsan.h>
 #include <linux/livepatch.h>
+#include <linux/resume_user_mode.h>
 #include <linux/tick.h>
 
 /* Workaround to allow gradual conversion of architecture code */
@@ -60,8 +61,10 @@ static __always_inline unsigned long __e
 		if (ti_work & _TIF_PATCH_PENDING)
 			klp_update_patch_state(current);
 
-		if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
+		if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)) {
+			futex_fixup_robust_unlock(regs);
 			arch_do_signal_or_restart(regs);
+		}
 
 		if (ti_work & _TIF_NOTIFY_RESUME)
 			resume_user_mode_work(regs);
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1455,6 +1455,19 @@ bool futex_robust_list_clear_pending(voi
 	return robust_list_clear_pending(pop);
 }
 
+#ifdef CONFIG_FUTEX_ROBUST_UNLOCK
+void __futex_fixup_robust_unlock(struct pt_regs *regs)
+{
+	void __user *pop;
+
+	if (!arch_futex_needs_robust_unlock_fixup(regs))
+		return;
+
+	pop = arch_futex_robust_unlock_get_pop(regs);
+	futex_robust_list_clear_pending(pop);
+}
+#endif /* CONFIG_FUTEX_ROBUST_UNLOCK */
+
 static void futex_cleanup(struct task_struct *tsk)
 {
 	if (unlikely(tsk->futex.robust_list)) {


Thread overview: 57+ messages
2026-03-16 17:12 [patch 0/8] futex: Address the robust futex unlock race for real Thomas Gleixner
2026-03-16 17:12 ` [patch 1/8] futex: Move futex task related data into a struct Thomas Gleixner
2026-03-16 17:55   ` Mathieu Desnoyers
2026-03-17  2:24   ` André Almeida
2026-03-17  9:52     ` Thomas Gleixner
2026-03-16 17:13 ` [patch 2/8] futex: Move futex related mm_struct " Thomas Gleixner
2026-03-16 18:00   ` Mathieu Desnoyers
2026-03-16 17:13 ` [patch 3/8] futex: Provide UABI defines for robust list entry modifiers Thomas Gleixner
2026-03-16 18:02   ` Mathieu Desnoyers
2026-03-17  2:38   ` André Almeida
2026-03-17  9:53     ` Thomas Gleixner
2026-03-16 17:13 ` [patch 4/8] futex: Add support for unlocking robust futexes Thomas Gleixner
2026-03-16 18:24   ` Mathieu Desnoyers
2026-03-17 16:17   ` André Almeida
2026-03-17 20:46     ` Peter Zijlstra
2026-03-17 22:40       ` Thomas Gleixner
2026-03-18  8:02         ` Peter Zijlstra
2026-03-18  8:06           ` Florian Weimer
2026-03-18 14:47           ` Peter Zijlstra
2026-03-18 16:03             ` Thomas Gleixner
2026-03-16 17:13 ` [patch 5/8] futex: Add robust futex unlock IP range Thomas Gleixner
2026-03-16 18:36   ` Mathieu Desnoyers
2026-03-17 19:19   ` André Almeida
2026-03-16 17:13 ` Thomas Gleixner [this message]
2026-03-16 18:35   ` [patch 6/8] futex: Provide infrastructure to plug the non contended robust futex unlock race Mathieu Desnoyers
2026-03-16 20:29     ` Thomas Gleixner
2026-03-16 20:52       ` Mathieu Desnoyers
2026-03-16 17:13 ` [patch 7/8] x86/vdso: Prepare for robust futex unlock support Thomas Gleixner
2026-03-16 17:13 ` [patch 8/8] x86/vdso: Implement __vdso_futex_robust_try_unlock() Thomas Gleixner
2026-03-16 19:19   ` Mathieu Desnoyers
2026-03-16 21:02     ` Thomas Gleixner
2026-03-16 22:35       ` Mathieu Desnoyers
2026-03-16 21:14     ` Thomas Gleixner
2026-03-16 21:29     ` Thomas Gleixner
2026-03-17  7:25   ` Thomas Weißschuh
2026-03-17  9:51     ` Thomas Gleixner
2026-03-17 11:17       ` Thomas Weißschuh
2026-03-18 16:17         ` Thomas Gleixner
2026-03-19  7:41           ` Thomas Weißschuh
2026-03-19  8:53             ` Florian Weimer
2026-03-19  9:04               ` Thomas Weißschuh
2026-03-19  9:08               ` Peter Zijlstra
2026-03-19 23:31                 ` Thomas Gleixner
2026-03-19 10:36             ` Sebastian Andrzej Siewior
2026-03-19 10:49               ` Thomas Weißschuh
2026-03-19 10:55                 ` Sebastian Andrzej Siewior
2026-03-17  8:28   ` Florian Weimer
2026-03-17  9:36     ` Thomas Gleixner
2026-03-17 10:37       ` Florian Weimer
2026-03-17 22:32         ` Thomas Gleixner
2026-03-18 22:08           ` Thomas Gleixner
2026-03-18 22:10             ` Peter Zijlstra
2026-03-19  2:05             ` André Almeida
2026-03-19  7:10               ` Thomas Gleixner
2026-03-17 15:33   ` Uros Bizjak
2026-03-18  8:21     ` Thomas Gleixner
2026-03-18  8:32       ` Uros Bizjak
