From: Thomas Gleixner
To: André Almeida, Rich Felker
Cc: LKML, Mathieu Desnoyers, Sebastian Andrzej Siewior, Carlos O'Donell,
 Peter Zijlstra, Florian Weimer, Torvald Riegel, Darren Hart, Ingo Molnar,
 Davidlohr Bueso, Arnd Bergmann, "Liam R. Howlett", Uros Bizjak,
 Thomas Weißschuh
Subject: Re: [patch v2 00/11] futex: Address the robust futex unlock race for real
References: <20260319225224.853416463@kernel.org> <87bjgackw7.ffs@tglx>
 <20260326220815.GE18807@brightrain.aerifal.cx>
Date: Fri, 27 Mar 2026 11:08:10 +0100
Message-ID: <875x6hd1px.ffs@tglx>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Mar 27 2026 at 00:42, André Almeida wrote:
> On 26/03/2026 19:08, Rich Felker wrote:
>> On Thu, Mar 26, 2026 at 10:59:20PM +0100, Thomas Gleixner wrote:
>>> On Fri, Mar 20 2026 at 00:24, Thomas Gleixner wrote:
>>>> If the functionality itself is agreed on we only need to agree on the names
>>>> and signatures of the functions exposed through the VDSO before we set them
>>>> in stone. That will hopefully not take another 15 years :)
>>>
>>> Have the libc folks any further opinion on the syscall and the vDSO part
>>> before I prepare v3?
>>
>> This whole conversation has been way too much for me to keep up with,
>> so I'm not sure where it's at right now.
>>
>> From musl's perspective, the way we make robust mutex unlocking safe
>> right now is by inhibiting munmap/mremap/MAP_FIXED and
>> pthread_mutex_destroy while there are any in-flight robust unlocks.
>> It will be nice to be able to conditionally stop doing that if vdso is
>> available, but I can't see using a fallback that requires a syscall,
>> as that would just be a lot more expensive than what we're doing right
>> now and still not work on older kernels. So I think the only part
>> we're interested in is the fully-userspace approach in vdso.
>>
>
> You just need the syscall for the contended case (where you would need a
> syscall anyway for a FUTEX_WAKE).
>
> As Thomas wrote in patch 09/11:
>
>   The resulting code sequence for user space is:
>
>   if (__vdso_futex_robust_list$SZ_try_unlock(lock, tid, &pending_op) != tid)
>           err = sys_futex($OP | FUTEX_ROBUST_UNLOCK,....);
>
>   Both the VDSO unlock and the kernel side unlock ensure that the
>   pending_op pointer is always cleared when the lock becomes unlocked.
>
>
> So you call the vDSO first. If it fails, it means that the lock is
> contended and you need to call futex(). It will wake a waiter, release
> the lock and clean list_op_pending.

See also the V1 cover letter which has a full deep dive:

   https://lore.kernel.org/20260316162316.356674433@kernel.org

TLDR:

The problem can be split into two issues:

  1) Contended unlock

  2) Uncontended unlock

#1 is solved by moving the unlock into the kernel instead of unlocking
   first and then invoking the syscall to wake waiters. The syscall
   takes the list_pending_op pointer as an argument and after unlocking,
   i.e. *lock = 0, it clears the list_pending_op pointer.

   For this to work, it needs to use try_cmpxchg() like PI unlock does.

#2 The race is between the successful try_cmpxchg() and the clearing of
   the list_pending_op pointer.

   That's where the VDSO comes into play. Instead of having the
   try_cmpxchg() in the library code, the library invokes the VDSO
   provided variant. That allows the kernel to check in the signal
   delivery path whether a successful unlock requires a helping hand to
   clear the list_pending_op pointer.
   If the interrupted IP is in the critical section _and_ the
   try_cmpxchg() succeeded then the kernel clears the pointer.

   In x86 ASM:

     0000000000001590 <__vdso_futex_robust_list64_try_unlock@@LINUX_2.6>:
     1590:  mov    %esi,%eax
     1592:  xor    %ecx,%ecx
     1594:  lock cmpxchg %ecx,(%rdi)   // Result goes into ZF
     1598:  jne    159d                <- CS start
     159a:  mov    %rcx,(%rdx)         // Clear list_pending_op
     159d:  ret                        <- CS end
     159e:  xchg   %ax,%ax

   So if the kernel observes

       IP >= CS start && IP < CS end

   then it checks the ZF flag in pt_regs and if set it clears the
   list_pending_op pointer.

Obviously #1 depends on #2 to close all holes.

Thanks,

        tglx
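
For illustration only, the uncontended path and the signal-path check
described above can be modelled in plain C. This is a sketch under
assumptions, not code from the patch series: model_try_unlock(),
model_needs_fixup(), struct model_regs and MODEL_EFLAGS_ZF are
hypothetical names, C11 atomics stand in for the lock cmpxchg, and the
critical section bounds are simply the addresses from the listing above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ZF is bit 6 of the x86 flags register. */
#define MODEL_EFLAGS_ZF	0x40UL

/* Hypothetical, reduced register snapshot; the kernel uses struct pt_regs. */
struct model_regs {
	uintptr_t	ip;
	unsigned long	flags;
};

/*
 * Model of the uncontended unlock. Returns the value the lock word held
 * before the attempt, so the caller compares the result against its TID.
 */
uint32_t model_try_unlock(_Atomic uint32_t *lock, uint32_t tid, void **pending_op)
{
	uint32_t old = tid;

	if (atomic_compare_exchange_strong(lock, &old, 0)) {
		/* Lock is free here but the pointer is still set: the race window. */
		*pending_op = NULL;
	}
	return old;
}

/* Model of the check: IP inside [cs_start, cs_end) and ZF set. */
bool model_needs_fixup(const struct model_regs *regs, uintptr_t cs_start,
		       uintptr_t cs_end)
{
	if (regs->ip < cs_start || regs->ip >= cs_end)
		return false;
	return (regs->flags & MODEL_EFLAGS_ZF) != 0;
}

void model_demo(void)
{
	_Atomic uint32_t lock = 42;
	int dummy = 0;
	void *pending = &dummy;
	struct model_regs regs = { .ip = 0x159a, .flags = MODEL_EFLAGS_ZF };

	/* Uncontended: lock word equals the TID, so it is released and cleared. */
	assert(model_try_unlock(&lock, 42, &pending) == 42);
	assert(atomic_load(&lock) == 0 && pending == NULL);

	/* Contended (FUTEX_WAITERS bit set): nothing changes, syscall needed. */
	atomic_store(&lock, 42 | 0x80000000u);
	pending = &dummy;
	assert(model_try_unlock(&lock, 42, &pending) != 42);
	assert(pending == &dummy);

	/* Interrupted between jne and mov with ZF set: the pointer needs help. */
	assert(model_needs_fixup(&regs, 0x1598, 0x159d));

	/* Already at ret: the store has been done, nothing to fix up. */
	regs.ip = 0x159d;
	assert(!model_needs_fixup(&regs, 0x1598, 0x159d));
}
```

In this model a return value different from the TID is the case where
user space would fall back to the futex() syscall with
FUTEX_ROBUST_UNLOCK, as in the sequence quoted from patch 09/11.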