public inbox for linux-kernel@vger.kernel.org
From: Thomas Gleixner <tglx@kernel.org>
To: LKML <linux-kernel@vger.kernel.org>
Cc: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>,
	"André Almeida" <andrealmeid@igalia.com>,
	"Sebastian Andrzej Siewior" <bigeasy@linutronix.de>,
	"Carlos O'Donell" <carlos@redhat.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Florian Weimer" <fweimer@redhat.com>,
	"Rich Felker" <dalias@aerifal.cx>,
	"Torvald Riegel" <triegel@redhat.com>,
	"Darren Hart" <dvhart@infradead.org>,
	"Ingo Molnar" <mingo@kernel.org>,
	"Davidlohr Bueso" <dave@stgolabs.net>,
	"Arnd Bergmann" <arnd@arndb.de>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	"Uros Bizjak" <ubizjak@gmail.com>,
	"Thomas Weißschuh" <linux@weissschuh.net>
Subject: [patch v2 07/11] futex: Add support for unlocking robust futexes
Date: Fri, 20 Mar 2026 00:24:41 +0100
Message-ID: <20260319231239.613218128@kernel.org>
In-Reply-To: <20260319225224.853416463@kernel.org>

Unlocking robust non-PI futexes happens in user space with the following
sequence:

  1)	robust_list_set_op_pending(mutex);
  2)	robust_list_remove(mutex);
	
  	lval = 0;
  3)	lval = atomic_xchg(lock, lval);
  4)	if (lval & WAITERS)
  5)		sys_futex(WAKE,....);
  6)	robust_list_clear_op_pending();

That opens a window between #3 and #6 where the mutex could be acquired by
some other task which observes that it is the last user and:

  A) unmaps the mutex memory
  B) maps a different file, which ends up covering the same address

If the original task exits before reaching #6, the kernel robust list
handling observes the pending op entry and tries to fix up user space.

If the newly mapped data happens to contain the TID of the exiting
thread at the address of the mutex/futex, the kernel sets the owner died
bit in that memory, thereby corrupting unrelated data.

PI futexes have a similar problem, both for the non-contended user space
unlock and the in-kernel unlock:

  1)	robust_list_set_op_pending(mutex);
  2)	robust_list_remove(mutex);
	
  	lval = gettid();
  3)	if (!atomic_try_cmpxchg(lock, lval, 0))
  4)		sys_futex(UNLOCK_PI,....);
  5)	robust_list_clear_op_pending();

Address the first part of the problem where the futexes have waiters and
need to enter the kernel anyway. Add a new FUTEX_ROBUST_UNLOCK flag, which
is valid for the sys_futex() FUTEX_UNLOCK_PI, FUTEX_WAKE, FUTEX_WAKE_BITSET
operations.

FUTEX_WAKE_OP is deliberately omitted from this treatment as it is
unclear whether it is needed, and there is no usage of it in glibc to
investigate either.

For the futex2 syscall family this needs to be implemented with a new
syscall.

The sys_futex() case [ab]uses the @uaddr2 argument to hand the pointer to
the kernel. This argument is only evaluated when the FUTEX_ROBUST_UNLOCK
bit is set and is therefore backward compatible.

This requires a second flag, FUTEX_ROBUST_LIST32, which indicates that
the robust list pointer points to a u32 and not to a u64. This is
required for two reasons:

    1) sys_futex() has no compat variant

    2) Gaming emulators use both 64-bit and compat 32-bit robust
       lists in the same 64-bit application

As a consequence, 32-bit applications have to set this flag
unconditionally so they can run unmodified on a 64-bit kernel in compat
mode. 32-bit kernels return an error code when the flag is not set.
64-bit kernels will happily clear the full 64 bits if user space fails
to set it.

In case of FUTEX_UNLOCK_PI this clears the robust list pending op when the
unlock succeeded. In case of errors, the user space value is still locked
by the caller and therefore the above cannot happen.

In case of FUTEX_WAKE* this does the unlock of the futex in the kernel and
clears the robust list pending op when the unlock was successful. If not,
the user space value is still locked and user space has to deal with the
returned error. That means that the unlocking of non-PI robust futexes has
to use the same try_cmpxchg() unlock scheme as PI futexes.

If the clearing of the pending list op fails (fault), the kernel clears
the registered robust list pointer, if it matches, to prevent exit()
from trying to handle invalid data. That's a valid paranoid decision
because the robust list head usually sits in the TLS, and if the TLS is
no longer accessible the chance of fixing up the resulting mess is very
close to zero.

The problem of non-contended unlocks still exists and will be addressed
separately.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
V2: Use store release for unlock	- Andre, Peter
    Use a separate FLAG for 32bit lists	- Florian
    Add command defines
---
 include/uapi/linux/futex.h |   29 +++++++++++++++++++++++-
 io_uring/futex.c           |    2 -
 kernel/futex/core.c        |   53 +++++++++++++++++++++++++++++++++++++++++++--
 kernel/futex/futex.h       |   15 +++++++++++-
 kernel/futex/pi.c          |   15 +++++++++++-
 kernel/futex/syscalls.c    |   13 ++++++++---
 kernel/futex/waitwake.c    |   30 +++++++++++++++++++++++--
 7 files changed, 144 insertions(+), 13 deletions(-)

--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -25,8 +25,11 @@
 
 #define FUTEX_PRIVATE_FLAG	128
 #define FUTEX_CLOCK_REALTIME	256
+#define FUTEX_UNLOCK_ROBUST	512
+#define FUTEX_ROBUST_LIST32	1024
 
-#define FUTEX_CMD_MASK			~(FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME)
+#define FUTEX_CMD_MASK			~(FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME | \
+					  FUTEX_UNLOCK_ROBUST | FUTEX_ROBUST_LIST32)
 
 #define FUTEX_WAIT_PRIVATE		(FUTEX_WAIT | FUTEX_PRIVATE_FLAG)
 #define FUTEX_WAKE_PRIVATE		(FUTEX_WAKE | FUTEX_PRIVATE_FLAG)
@@ -43,6 +46,30 @@
 #define FUTEX_CMP_REQUEUE_PI_PRIVATE	(FUTEX_CMP_REQUEUE_PI | FUTEX_PRIVATE_FLAG)
 
 /*
+ * Operations to unlock a futex, clear the robust list pending op pointer and
+ * wake waiters.
+ */
+#define FUTEX_UNLOCK_PI_LIST64			(FUTEX_UNLOCK_PI | FUTEX_UNLOCK_ROBUST)
+#define FUTEX_UNLOCK_PI_LIST64_PRIVATE		(FUTEX_UNLOCK_PI_LIST64 | FUTEX_PRIVATE_FLAG)
+#define FUTEX_UNLOCK_PI_LIST32			(FUTEX_UNLOCK_PI | FUTEX_UNLOCK_ROBUST | \
+						 FUTEX_ROBUST_LIST32)
+#define FUTEX_UNLOCK_PI_LIST32_PRIVATE		(FUTEX_UNLOCK_PI_LIST32 | FUTEX_PRIVATE_FLAG)
+
+#define FUTEX_UNLOCK_WAKE_LIST64		(FUTEX_WAKE | FUTEX_UNLOCK_ROBUST)
+#define FUTEX_UNLOCK_WAKE_LIST64_PRIVATE	(FUTEX_UNLOCK_WAKE_LIST64 | FUTEX_PRIVATE_FLAG)
+
+#define FUTEX_UNLOCK_WAKE_LIST32		(FUTEX_WAKE | FUTEX_UNLOCK_ROBUST | \
+						 FUTEX_ROBUST_LIST32)
+#define FUTEX_UNLOCK_WAKE_LIST32_PRIVATE	(FUTEX_UNLOCK_WAKE_LIST32 | FUTEX_PRIVATE_FLAG)
+
+#define FUTEX_UNLOCK_BITSET_LIST64		(FUTEX_WAKE_BITSET | FUTEX_UNLOCK_ROBUST)
+#define FUTEX_UNLOCK_BITSET_LIST64_PRIVATE	(FUTEX_UNLOCK_BITSET_LIST64 | FUTEX_PRIVATE_FLAG)
+
+#define FUTEX_UNLOCK_BITSET_LIST32		(FUTEX_WAKE_BITSET | FUTEX_UNLOCK_ROBUST | \
+						 FUTEX_ROBUST_LIST32)
+#define FUTEX_UNLOCK_BITSET_LIST32_PRIVATE	(FUTEX_UNLOCK_BITSET_LIST32 | FUTEX_PRIVATE_FLAG)
+
+/*
  * Flags for futex2 syscalls.
  *
  * NOTE: these are not pure flags, they can also be seen as:
--- a/io_uring/futex.c
+++ b/io_uring/futex.c
@@ -325,7 +325,7 @@ int io_futex_wake(struct io_kiocb *req,
 	 * Strict flags - ensure that waking 0 futexes yields a 0 result.
 	 * See commit 43adf8449510 ("futex: FLAGS_STRICT") for details.
 	 */
-	ret = futex_wake(iof->uaddr, FLAGS_STRICT | iof->futex_flags,
+	ret = futex_wake(iof->uaddr, FLAGS_STRICT | iof->futex_flags, NULL,
 			 iof->futex_val, iof->futex_mask);
 	if (ret < 0)
 		req_set_fail(req);
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1063,7 +1063,7 @@ static int handle_futex_death(u32 __user
 	owner = uval & FUTEX_TID_MASK;
 
 	if (pending_op && !pi && !owner) {
-		futex_wake(uaddr, FLAGS_SIZE_32 | FLAGS_SHARED, 1,
+		futex_wake(uaddr, FLAGS_SIZE_32 | FLAGS_SHARED, NULL, 1,
 			   FUTEX_BITSET_MATCH_ANY);
 		return 0;
 	}
@@ -1117,7 +1117,7 @@ static int handle_futex_death(u32 __user
 	 * PI futexes happens in exit_pi_state():
 	 */
 	if (!pi && (uval & FUTEX_WAITERS)) {
-		futex_wake(uaddr, FLAGS_SIZE_32 | FLAGS_SHARED, 1,
+		futex_wake(uaddr, FLAGS_SIZE_32 | FLAGS_SHARED, NULL, 1,
 			   FUTEX_BITSET_MATCH_ANY);
 	}
 
@@ -1209,6 +1209,27 @@ static void exit_robust_list(struct task
 	}
 }
 
+static bool robust_list_clear_pending(unsigned long __user *pop)
+{
+	struct robust_list_head __user *head = current->futex.robust_list;
+
+	if (!put_user(0UL, pop))
+		return true;
+
+	/*
+	 * Just give up. The robust list head is usually part of TLS, so the
+	 * chance that this gets resolved is close to zero.
+	 *
+	 * If @pop is the robust_list_head::list_op_pending pointer then
+	 * clear the robust list head pointer to prevent further damage when the
+	 * task exits.  Better a few stale futexes than corrupted memory. But
+	 * that's mostly an academic exercise.
+	 */
+	if (pop == (unsigned long __user *)&head->list_op_pending)
+		current->futex.robust_list = NULL;
+	return false;
+}
+
 #ifdef CONFIG_COMPAT
 static void __user *futex_uaddr(struct robust_list __user *entry,
 				compat_long_t futex_offset)
@@ -1305,6 +1326,21 @@ static void compat_exit_robust_list(stru
 		handle_futex_death(uaddr, curr, pend_mod, HANDLE_DEATH_PENDING);
 	}
 }
+
+static bool compat_robust_list_clear_pending(u32 __user *pop)
+{
+	struct compat_robust_list_head __user *head = current->futex.compat_robust_list;
+
+	if (!put_user(0U, pop))
+		return true;
+
+	/* See comment in robust_list_clear_pending(). */
+	if (pop == &head->list_op_pending)
+		current->futex.compat_robust_list = NULL;
+	return false;
+}
+#else
+static bool compat_robust_list_clear_pending(u32 __user *pop) { return false; }
 #endif
 
 #ifdef CONFIG_FUTEX_PI
@@ -1398,6 +1434,19 @@ static void exit_pi_state_list(struct ta
 static inline void exit_pi_state_list(struct task_struct *curr) { }
 #endif
 
+bool futex_robust_list_clear_pending(void __user *pop, unsigned int flags)
+{
+	bool size32bit = !!(flags & FLAGS_ROBUST_LIST32);
+
+	if (!IS_ENABLED(CONFIG_64BIT) && !size32bit)
+		return false;
+
+	if (IS_ENABLED(CONFIG_64BIT) && size32bit)
+		return compat_robust_list_clear_pending(pop);
+
+	return robust_list_clear_pending(pop);
+}
+
 static void futex_cleanup(struct task_struct *tsk)
 {
 	if (unlikely(tsk->futex.robust_list)) {
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -40,6 +40,8 @@
 #define FLAGS_NUMA		0x0080
 #define FLAGS_STRICT		0x0100
 #define FLAGS_MPOL		0x0200
+#define FLAGS_UNLOCK_ROBUST	0x0400
+#define FLAGS_ROBUST_LIST32	0x0800
 
 /* FUTEX_ to FLAGS_ */
 static inline unsigned int futex_to_flags(unsigned int op)
@@ -52,6 +54,12 @@ static inline unsigned int futex_to_flag
 	if (op & FUTEX_CLOCK_REALTIME)
 		flags |= FLAGS_CLOCKRT;
 
+	if (op & FUTEX_UNLOCK_ROBUST)
+		flags |= FLAGS_UNLOCK_ROBUST;
+
+	if (op & FUTEX_ROBUST_LIST32)
+		flags |= FLAGS_ROBUST_LIST32;
+
 	return flags;
 }
 
@@ -438,13 +446,16 @@ extern int futex_unqueue_multiple(struct
 extern int futex_wait_multiple(struct futex_vector *vs, unsigned int count,
 			       struct hrtimer_sleeper *to);
 
-extern int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset);
+extern int futex_wake(u32 __user *uaddr, unsigned int flags, void __user *pop,
+		      int nr_wake, u32 bitset);
 
 extern int futex_wake_op(u32 __user *uaddr1, unsigned int flags,
 			 u32 __user *uaddr2, int nr_wake, int nr_wake2, int op);
 
-extern int futex_unlock_pi(u32 __user *uaddr, unsigned int flags);
+extern int futex_unlock_pi(u32 __user *uaddr, unsigned int flags, void __user *pop);
 
 extern int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int trylock);
 
+bool futex_robust_list_clear_pending(void __user *pop, unsigned int flags);
+
 #endif /* _FUTEX_H */
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -1129,7 +1129,7 @@ int futex_lock_pi(u32 __user *uaddr, uns
  * This is the in-kernel slowpath: we look up the PI state (if any),
  * and do the rt-mutex unlock.
  */
-int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
+static int __futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 {
 	u32 curval, uval, vpid = task_pid_vnr(current);
 	union futex_key key = FUTEX_KEY_INIT;
@@ -1138,7 +1138,6 @@ int futex_unlock_pi(u32 __user *uaddr, u
 
 	if (!IS_ENABLED(CONFIG_FUTEX_PI))
 		return -ENOSYS;
-
 retry:
 	if (get_user(uval, uaddr))
 		return -EFAULT;
@@ -1292,3 +1291,15 @@ int futex_unlock_pi(u32 __user *uaddr, u
 	return ret;
 }
 
+int futex_unlock_pi(u32 __user *uaddr, unsigned int flags, void __user *pop)
+{
+	int ret = __futex_unlock_pi(uaddr, flags);
+
+	if (ret || !(flags & FLAGS_UNLOCK_ROBUST))
+		return ret;
+
+	if (!futex_robust_list_clear_pending(pop, flags))
+		return -EFAULT;
+
+	return 0;
+}
--- a/kernel/futex/syscalls.c
+++ b/kernel/futex/syscalls.c
@@ -118,6 +118,13 @@ long do_futex(u32 __user *uaddr, int op,
 			return -ENOSYS;
 	}
 
+	if (flags & FLAGS_UNLOCK_ROBUST) {
+		if (cmd != FUTEX_WAKE &&
+		    cmd != FUTEX_WAKE_BITSET &&
+		    cmd != FUTEX_UNLOCK_PI)
+			return -ENOSYS;
+	}
+
 	switch (cmd) {
 	case FUTEX_WAIT:
 		val3 = FUTEX_BITSET_MATCH_ANY;
@@ -128,7 +135,7 @@ long do_futex(u32 __user *uaddr, int op,
 		val3 = FUTEX_BITSET_MATCH_ANY;
 		fallthrough;
 	case FUTEX_WAKE_BITSET:
-		return futex_wake(uaddr, flags, val, val3);
+		return futex_wake(uaddr, flags, uaddr2, val, val3);
 	case FUTEX_REQUEUE:
 		return futex_requeue(uaddr, flags, uaddr2, flags, val, val2, NULL, 0);
 	case FUTEX_CMP_REQUEUE:
@@ -141,7 +148,7 @@ long do_futex(u32 __user *uaddr, int op,
 	case FUTEX_LOCK_PI2:
 		return futex_lock_pi(uaddr, flags, timeout, 0);
 	case FUTEX_UNLOCK_PI:
-		return futex_unlock_pi(uaddr, flags);
+		return futex_unlock_pi(uaddr, flags, uaddr2);
 	case FUTEX_TRYLOCK_PI:
 		return futex_lock_pi(uaddr, flags, NULL, 1);
 	case FUTEX_WAIT_REQUEUE_PI:
@@ -375,7 +382,7 @@ SYSCALL_DEFINE4(futex_wake,
 	if (!futex_validate_input(flags, mask))
 		return -EINVAL;
 
-	return futex_wake(uaddr, FLAGS_STRICT | flags, nr, mask);
+	return futex_wake(uaddr, FLAGS_STRICT | flags, NULL, nr, mask);
 }
 
 /*
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -150,12 +150,35 @@ void futex_wake_mark(struct wake_q_head
 }
 
 /*
+ * If requested, clear the robust list pending op and unlock the futex
+ */
+static bool futex_robust_unlock(u32 __user *uaddr, unsigned int flags, void __user *pop)
+{
+	if (!(flags & FLAGS_UNLOCK_ROBUST))
+		return true;
+
+	/* First unlock the futex, which requires release semantics. */
+	scoped_user_write_access(uaddr, efault)
+		unsafe_atomic_store_release_user(0, uaddr, efault);
+
+	/*
+	 * Clear the pending list op now. If that fails, then the task is in
+	 * deeper trouble as the robust list head is usually part of the TLS.
+	 * The chance of survival is close to zero.
+	 */
+	return futex_robust_list_clear_pending(pop, flags);
+
+efault:
+	return false;
+}
+
+/*
  * Wake up waiters matching bitset queued on this futex (uaddr).
  */
-int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
+int futex_wake(u32 __user *uaddr, unsigned int flags, void __user *pop, int nr_wake, u32 bitset)
 {
-	struct futex_q *this, *next;
 	union futex_key key = FUTEX_KEY_INIT;
+	struct futex_q *this, *next;
 	DEFINE_WAKE_Q(wake_q);
 	int ret;
 
@@ -166,6 +189,9 @@ int futex_wake(u32 __user *uaddr, unsign
 	if (unlikely(ret != 0))
 		return ret;
 
+	if (!futex_robust_unlock(uaddr, flags, pop))
+		return -EFAULT;
+
 	if ((flags & FLAGS_STRICT) && !nr_wake)
 		return 0;
 


