Date: Mon, 16 Mar 2026 18:12:50 +0100
Message-ID: <20260316162316.356674433@kernel.org>
User-Agent: quilt/0.68
From: Thomas Gleixner
To: LKML
Cc: Mathieu Desnoyers, André Almeida, Sebastian Andrzej Siewior,
    Carlos O'Donell, Peter Zijlstra, Florian Weimer, Rich Felker,
    Torvald Riegel, Darren Hart, Ingo Molnar, Davidlohr Bueso,
    Arnd Bergmann, "Liam R. Howlett"
Subject: [patch 0/8] futex: Address the robust futex unlock race for real

The robust futex unlock mechanism is racy with respect to the clearing of
the robust_list_head::list_op_pending pointer as described here:

   https://sourceware.org/bugzilla/show_bug.cgi?id=14485

This issue has been ignored for almost 15 years and was brought up again
recently by Andre:

   https://lore.kernel.org/lkml/20260220202620.139584-1-andrealmeid@igalia.com/

Florian asked for this to be addressed in a VDSO function, and a couple of
days ago Mathieu proposed a fix:

   https://lore.kernel.org/20260311185409.1988269-1-mathieu.desnoyers@efficios.com

While the idea of using a VDSO unlock helper function along with kernel
side fixups is correct, the proposed solution is incomplete and untested.
Aside of that, it just glues the mechanism somehow into the existing code
without taking the larger picture into account.

So I sat down and did a complete analysis. The problem is in the unlock
sequences of non-PI and PI futexes.
non-PI futexes do:

 1) robust_list_set_op_pending(mutex);
 2) robust_list_remove(mutex);
    lval = 0;
 3) atomic_xchg(lock, lval);
 4) if (lval & FUTEX_WAITERS)
 5)	sys_futex(WAKE,....);
 6) robust_list_clear_op_pending();

PI futexes do:

 1) robust_list_set_op_pending(mutex);
 2) robust_list_remove(mutex);
    lval = gettid();
 3) if (!atomic_try_cmpxchg(lock, lval, 0))
 4)	sys_futex(UNLOCK_PI,....);
 5) robust_list_clear_op_pending();

In both cases the late clearing of the robust list pending op pointer
(#6/#5) can cause the following problem:

Task 1					Task 2

robust_list_set_op_pending(mutex);
unlock(mutex);
--> interrupt which forces T1 to exit
					lock(mutex);	// Non-contended
					// Last user of the mapping
					unlock(mutex);
					destroy(mutex);
					unmap();
					map(some other file);
do_exit()
  handle_robust_list()
    handle_pending_op()
      val = mutex->lock;
      if (val == TID)
	      mutex->lock = val | FUTEX_OWNER_DIED;

This causes memory corruption in the new mapping when:

 1) the new mapping covers the address of mutex->lock

 2) the memory contains the TID of the exiting task

Unlikely to happen, but possible. It does not matter whether the unlock
happens in the kernel or whether there are waiters or not, as in all
scenarios T2 can observe being the last user _before_ T1 manages to clear
the pending op pointer.

So the first question to ask is why the unlock sequences of non-PI and PI
futexes are different. The unlock sequence of non-PI predates both PI and
robust futexes. The robust handling was added around it. PI must use
try_cmpxchg() and cannot unlock a contended futex in user space as that
would create inconsistent PI state in the kernel.

But technically there is no requirement to keep the non-PI unlock sequence
as it is. It can do the same as PI and let the kernel unlock the futex in
the contended case. It has to go into the kernel anyway when there are
waiters. That functionality does not exist in the kernel, but it is
trivial to add.
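For reference, the racy non-PI sequence can be condensed into plain C. This
is a minimal sketch with made-up names; the robust list removal and the
futex wake syscall are elided as they don't matter for the race:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define FUTEX_WAITERS	0x80000000u

/* Hypothetical stand-ins for the libc internal data structures */
struct mutex_sketch {
	_Atomic uint32_t lock;
	void *list;			/* robust list entry */
};

struct robust_head_sketch {
	void *list_op_pending;
};

/* Sketch of the racy non-PI unlock path described above */
static uint32_t unlock_nonpi(struct robust_head_sketch *rh, struct mutex_sketch *m)
{
	uint32_t lval;

	rh->list_op_pending = &m->list;		/* 1) set pending op */
	/* 2) robust_list_remove(m) elided */
	lval = atomic_exchange(&m->lock, 0);	/* 3) unlock */
	/* 4)+5) if (lval & FUTEX_WAITERS) sys_futex(WAKE, ...) elided */

	/*
	 * <-- The window: if the task is forced to exit here, the robust
	 *     list walk at exit time still sees list_op_pending and may
	 *     store FUTEX_OWNER_DIED into memory which was unmapped and
	 *     reused in the meantime.
	 */
	rh->list_op_pending = NULL;		/* 6) clear pending op */
	return lval;
}
```

Closing the window between the exchange and the final store is exactly what
the rest of this series is about.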
With that, the unlock sequences of non-PI and PI become identical and can
easily be extended to let the kernel do the clearing of the pending op
pointer as well. So that becomes:

    robust_list_set_op_pending(mutex);
    lval = gettid();
    if (atomic_try_cmpxchg(lock, lval, 0))
	    robust_list_clear_op_pending();
    else
	    sys_futex($OP | FUTEX_ROBUST_UNLOCK, .... op_pointer);

which reduces the problem space significantly. The remaining issue is the
non-contended unlock case, which can be handled with a VDSO unlock sequence
and kernel side fixup.

But that op pointer has an issue when COMPAT is enabled. To support
mixed-mode futexes for gaming emulators the kernel allows a 64-bit
application to register both a 64-bit native robust list head and a 32-bit
compat robust list head. That means on COMPAT kernels the type of the op
pointer is undefined.

That can be solved by marking the pointer size in the pointer itself,
similar to the way the robust list entry marks PI futexes in bit 0. If the
bit is set, the pointer has to be treated as a 32-bit pointer. If not, it
uses the native size (64-bit). This is required for several reasons:

 1) sys_futex() has no compat variant

 2) The gaming emulators use both 64-bit and compat 32-bit robust lists in
    the same 64-bit application

 3) Having the pointer handed in spares the evaluation of the registered
    robust lists to figure out whether the futex address matches the
    registered [compat_]robust_list_head::list_op_pending pointer. That
    would become even worse once the multi robust list changes Andre is
    working on get merged.

As a consequence, 32-bit applications have to set this bit unconditionally
so they can run unmodified on a 64-bit kernel in compat mode. 32-bit
kernels return an error code when the bit is not set. 64-bit kernels will
happily clear the full 64 bits if user space fails to set it.

Arguably this could be done with different functions and more flags, but
there is no benefit to that.
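The size-bit encoding described above can be sketched in C. The helper
names are made up for illustration and are not part of the series:

```c
#include <assert.h>
#include <stdint.h>

#define OP_PTR_SIZE_BIT	0x1UL	/* bit 0 set: treat as 32-bit pointer */

/*
 * Encode: 32-bit user space (and 64-bit applications operating on a
 * compat robust list) set bit 0; native pointers leave it clear.
 * Pointers to the (at least 4 byte aligned) list_op_pending field
 * never have bit 0 set on their own, so the bit is free for the marker.
 */
static inline unsigned long op_ptr_encode(void *p, int is_compat32)
{
	return (unsigned long)p | (is_compat32 ? OP_PTR_SIZE_BIT : 0);
}

/* Decode: strip the size bit to recover the real pointer address */
static inline void *op_ptr_decode(unsigned long val)
{
	return (void *)(val & ~OP_PTR_SIZE_BIT);
}

static inline int op_ptr_is_compat32(unsigned long val)
{
	return val & OP_PTR_SIZE_BIT;
}
```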
With that addressed there is "only" the non-contended case left. The
important question is when this has to be dealt with. It's only relevant
when:

 1) the unlocking task was interrupted in the unlock sequence between the
    successful unlock and the store which clears the pointer, and

 2) the unlocking task needs to handle a signal on return from interrupt
    to user space.

The proper place to check for this is therefore right before the
invocation of arch_do_signal_or_restart() in __exit_to_user_mode_loop().
That needs to check whether the task was interrupted and the instruction
pointer is within the critical section.

On x86 (COMPAT=n) this critical section boils down to:

	mov	%esi,%eax		// Load TID into EAX
	xor	%ecx,%ecx		// Set ECX to 0
	lock cmpxchg %ecx,(%rdi)	// Try the TID -> 0 transition
.Lstart:
	jnz	.Lend
	movq	$0x0,(%rdx)		// Clear list_op_pending
.Lend:

I.e. the critical section is: IP >= .Lstart && IP < .Lend. That's a very
lightweight range check operation which won't affect signal heavy
applications in a noticeable way.

If the range check observes IP in the critical section, it invokes the
fixup function which determines whether the pointer has to be cleared. In
the above x86 case this just has to evaluate regs->flags & X86_EFLAGS_ZF.
If ZF is set then the pointer must be cleared. The pointer address is
retrieved from regs->dx in this case.

The COMPAT case with different sizes is slightly more complex as it has to
do the size checking in the success path by testing bit 0. That obviously
modifies ZF too, so it has to check whether IP is within the success
region:
	mov	%esi,%eax			// Load TID into EAX
	mov	%rdx,%rsi			// Get the op pointer
	xor	%ecx,%ecx			// Set ECX to 0
	and	$0xfffffffffffffffe,%rsi	// Clear the size bit
	lock cmpxchg %ecx,(%rdi)		// Try the TID -> 0 transition
.Lstart:
	jnz	.Lend
.Lsuccess:
	testl	$0x1,%edx			// Test the size bit
	jz	.Lop64				// Not set: 64-bit
	movl	$0x0,(%rsi)			// Clear 32-bit
	jmp	.Lend
.Lop64:
	movq	$0x0,(%rsi)			// Clear 64-bit
.Lend:
	ret

Again it could be argued to do that with two different fixup functions,
but there is no point to that:

 1) User space would need to handle the two variants instead of just
    relying on a bit which can be saved in the mutex at initialization
    time.

 2) It becomes pointlessly different from the syscall op pointer handling.

 3) The fixup decision function then has to evaluate which code path is
    used. That just adds more symbols and range checking for no real
    value.

Note that the fixup is done when a signal is pending even if no signal is
delivered. That's required because fatal signals never reach the actual
signal delivery code where e.g. RSEQ is handled. They never return from
get_signal(). Doing it at that point is harmless and the only downside is
a redundant store if the task continues. The pending op pointer is task
local, so it does not matter whether it's nullified twice or not.

The remaining question is how to do the range checking. Mathieu's proposal
[ab]used the exception table to do the range check. That's not really the
correct approach because:

 1) It is not an exception handler

 2) It is a fixup similar to RSEQ critical section fixups

So the right approach is to use a dedicated mechanism and not to glue it
into something which has a completely different purpose. That just creates
complexity and overhead.

The critical section labels can be made visible in the VDSO, and the VDSO
image generator provides the offsets in the image struct to the kernel.
The VDSO [re]map() code evaluates them and puts them into
mm_struct::futex::XXX.
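The fixup logic described above, i.e. the IP range check, the ZF evaluation
and the width-aware clearing of the pending op pointer, boils down to
something like the following. The function names and the standalone form
are illustrative only; in the kernel the store would obviously go through
the proper user space accessors:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define X86_EFLAGS_ZF	0x40UL	/* Zero flag in EFLAGS */

/*
 * Fixup is needed only when the interrupted IP lies inside the critical
 * section and ZF shows that the cmpxchg() succeeded, i.e. the futex was
 * unlocked but the pending op pointer was not yet cleared.
 */
static bool fixup_needed(unsigned long ip, unsigned long cs_start,
			 unsigned long cs_end, unsigned long eflags)
{
	if (ip < cs_start || ip >= cs_end)	/* cheap range check */
		return false;
	return eflags & X86_EFLAGS_ZF;
}

/*
 * Clear the pending op pointer with the width selected by bit 0 of the
 * raw pointer value (retrieved from regs->dx in the x86 case).
 */
static void clear_pending_op(unsigned long raw)
{
	void *p = (void *)(raw & ~0x1UL);

	if (raw & 0x1)
		*(uint32_t *)p = 0;		/* compat 32-bit head */
	else
		*(unsigned long *)p = 0;	/* native head */
}
```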
Arguably the range check could chase all of that down via
mm->context->vdso->image, but that touches way more cache lines and
requires architecture specific range checking functions. Letting the
architecture specific VDSO [re]map() code store the information in
mm_struct::futex reduces the cache line impact and makes the range check
generic.

The series implements all of the above and passes the simple user space
test case appended below, compiled for both 32-bit and 64-bit and tested
on a COMPAT enabled 64-bit kernel. A 32-bit kernel should just work, but
has not been tested yet.

The fixup mechanism is easy to validate. Run the program in gdb and set a
breakpoint at __vdso_futex_robust_try_unlock_cs_start or at
__vdso_futex_robust_try_unlock_cs_success. Hitting the breakpoint causes
the fixup to run because gdb/ptrace uses signals to control the tracee. So
the op pointer must have been cleared if the cmpxchg() succeeded:

  (gdb) b __vdso_futex_robust_try_unlock_cs_start
  (gdb) run
  Testing non contended unlock

  Breakpoint 1, 0x00007ffff7fc67a3 in __vdso_futex_robust_try_unlock ()
  (gdb) info r
  rdx            0x7ffff7dc1a70	140737351785072
  rsi            0x7ffff7dc1a70	140737351785072
  rdi            0x7fffffffea90	140737488349840
  eflags         0x246		[ PF ZF IF ]
  (gdb) x/xg 0x7ffff7dc1a70
  0x7ffff7dc1a70:	0x0000000000000000

ZF is set, which indicates that the cmpxchg() succeeded. The fixup is
"premature", but as explained above that is harmless. The success path
writing 0 again is just a redundant store with no side effects as the
pointer is task local.

Letting it continue hits the contended case:

  (gdb) info r
  rdx            0x7ffff7dc1a70	140737351785072
  rsi            0x7ffff7dc1a70	140737351785072
  rdi            0x7fffffffea90	140737488349840
  eflags         0xa87		[ CF PF SF IF OF ]
  (gdb) x/xg 0x7ffff7dc1a70
  0x7ffff7dc1a70:	0x00007fffffffeab0

Note the lack of ZF in eflags, which means the cmpxchg() failed. RDI
points to mutex->lock and the op pointer still points to mutex->list
(offset 0x20) as it should.
A 32-bit executable with the same breakpoint shows the bit 0 handling:

  eax            0x1c0	448
  ecx            0x0	0
  edx            0xf7fbc5f5	-134494731
  ebx            0x56558ff4	1448447988
  esi            0xf7fbc5f4	-134494732
  edi            0xffffdbf8	-9224
  eflags         0x246	[ PF ZF IF ]
  (gdb) x/xw 0xf7fbc5f4
  0xf7fbc5f4:	0x00000000

EDX has bit 0 set and ESI has the real pointer address with bit 0 cleared.
ESI is used to clear the pending op pointer. EDX is handed to the fixup
function, which takes the correct path by testing bit 0.

The series applies on v7.0-rc3 and is also available via git:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git locking/futex

Thanks

	tglx
---
 It's so much easier to suggest solutions when you don't
 know too much about the problem.
			- Malcolm Forbes
---
 Documentation/locking/robust-futexes.rst |    8 
 arch/x86/Kconfig                         |    1 
 arch/x86/entry/vdso/common/vfutex.c      |   72 +++++++
 arch/x86/entry/vdso/vdso32/Makefile      |    5 
 arch/x86/entry/vdso/vdso32/vdso32.lds.S  |    6 
 arch/x86/entry/vdso/vdso32/vfutex.c      |    1 
 arch/x86/entry/vdso/vdso64/Makefile      |    7 
 arch/x86/entry/vdso/vdso64/vdso64.lds.S  |    6 
 arch/x86/entry/vdso/vdso64/vdsox32.lds.S |    6 
 arch/x86/entry/vdso/vdso64/vfutex.c      |    1 
 arch/x86/entry/vdso/vma.c                |   20 ++
 arch/x86/include/asm/futex_robust.h      |   44 ++++
 arch/x86/include/asm/vdso.h              |    3 
 arch/x86/tools/vdso2c.c                  |   17 +
 include/linux/futex.h                    |   43 +++-
 include/linux/futex_types.h              |   74 +++++++
 include/linux/mm_types.h                 |   12 -
 include/linux/sched.h                    |   16 -
 include/uapi/linux/futex.h               |   24 ++
 include/vdso/futex.h                     |   44 ++++
 init/Kconfig                             |    6 
 io_uring/futex.c                         |    2 
 kernel/entry/common.c                    |    9 
 kernel/exit.c                            |    4 
 kernel/futex/core.c                      |  303 ++++++++++++++++++-------------
 kernel/futex/futex.h                     |   11 -
 kernel/futex/pi.c                        |   41 ++--
 kernel/futex/syscalls.c                  |   36 +--
 kernel/futex/waitwake.c                  |   27 ++
 29 files changed, 636 insertions(+), 213 deletions(-)
---
#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/bitsperlong.h>
#include <linux/futex.h>

typedef uint32_t (*frtu_t)(uint32_t *, uint32_t, void *);

static frtu_t frtu;
static void get_vdso(void)
{
	void *vdso = dlopen("linux-vdso.so.1", RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);

	if (!vdso)
		vdso = dlopen("linux-gate.so.1", RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
	if (!vdso)
		vdso = dlopen("linux-vdso32.so.1", RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
	if (!vdso)
		vdso = dlopen("linux-vdso64.so.1", RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
	if (!vdso) {
		printf("Failed to find vDSO\n");
		exit(1);
	}

	frtu = (frtu_t)dlsym(vdso, "__vdso_futex_robust_try_unlock");
	if (!frtu) {
		printf("Failed to find frtu in vDSO\n");
		exit(1);
	}
}

static void *mangle_pop(struct robust_list_head *rh)
{
	unsigned long addr = (unsigned long)&rh->list_op_pending;

	if (__BITS_PER_LONG != 64)
		addr |= 0x01;
	return (void *)addr;
}

#define FUTEX_ROBUST_UNLOCK	512
#define UNLOCK		(FUTEX_WAKE | FUTEX_ROBUST_UNLOCK)
#define UNLOCK_PI	(FUTEX_UNLOCK_PI | FUTEX_ROBUST_UNLOCK)

int main(int argc, char * const argv[])
{
	pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
	uint32_t *lock = (uint32_t *)&mutex.__data.__lock;
	struct robust_list_head *rhead;
	pid_t ltid, tid = gettid();
	struct robust_list *list;
	size_t sz;
	void *pop;

	get_vdso();

	syscall(SYS_get_robust_list, 0, &rhead, &sz);
	pop = mangle_pop(rhead);
	list = (struct robust_list *)&mutex.__data.__list.__next;

	printf("Testing non contended unlock\n");
	*lock = tid;
	rhead->list_op_pending = list;
	ltid = frtu(lock, tid, pop);
	if (ltid != tid)
		printf("Non contended unlock failed: LTID %08x\n", ltid);
	else if (rhead->list_op_pending)
		printf("List op not cleared: %16lx\n",
		       (unsigned long)rhead->list_op_pending);
	else if (*lock)
		printf("Non contended unlock failed: LOCK %08x\n", *lock);
	else
		printf("Non contended unlock succeeded\n");

	printf("Testing contended FUTEX_WAKE unlock\n");
	*lock = tid | FUTEX_WAITERS;
	rhead->list_op_pending = list;
	ltid = frtu(lock, tid, pop);
	if (ltid == tid) {
		printf("Contended unlock succeeded %08x\n", ltid);
	} else if (!rhead->list_op_pending) {
		printf("List op cleared: %16lx\n", (unsigned long)
		       rhead->list_op_pending);
	} else if (!*lock) {
		printf("Contended unlock cleared LOCK %08x\n", *lock);
	} else {
		int ret = syscall(SYS_futex, lock, UNLOCK, 1, NULL, pop, 0, 0);

		if (ret < 0)
			printf("FUTEX_WAKE|FUTEX_ROBUST_UNLOCK failed %d\n", errno);
		else if (rhead->list_op_pending)
			printf("List op not cleared in syscall: %16lx\n",
			       (unsigned long)rhead->list_op_pending);
		else if (*lock)
			printf("Contended syscall unlock failed: LOCK %08x\n", *lock);
		else
			printf("Contended unlock syscall succeeded\n");
	}

	printf("Testing contended FUTEX_UNLOCK_PI unlock\n");
	*lock = tid | FUTEX_WAITERS;
	rhead->list_op_pending = list;
	ltid = frtu(lock, tid, pop);
	if (ltid == tid) {
		printf("Contended unlock succeeded %08x\n", ltid);
	} else if (!rhead->list_op_pending) {
		printf("List op cleared: %16lx\n",
		       (unsigned long)rhead->list_op_pending);
	} else if (!*lock) {
		printf("Contended unlock cleared LOCK %08x\n", *lock);
	} else {
		int ret = syscall(SYS_futex, lock, UNLOCK_PI, 0, NULL, pop, 0, 0);

		if (ret < 0)
			printf("FUTEX_UNLOCK_PI|FUTEX_ROBUST_UNLOCK failed %d\n", errno);
		else if (rhead->list_op_pending)
			printf("List op not cleared in syscall: %16lx\n",
			       (unsigned long)rhead->list_op_pending);
		else if (*lock)
			printf("Contended syscall unlock failed: LOCK %08x\n", *lock);
		else
			printf("Contended unlock syscall succeeded\n");
	}
	return 0;
}