From: Thomas Gleixner <tglx@kernel.org>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>, André Almeida
 <andrealmeid@igalia.com>
Cc: linux-kernel@vger.kernel.org, Carlos O'Donell <carlos@redhat.com>,
 Sebastian Andrzej Siewior <bigeasy@linutronix.de>, Peter Zijlstra
 <peterz@infradead.org>, Florian Weimer <fweimer@redhat.com>, Rich Felker
 <dalias@aerifal.cx>, Torvald Riegel <triegel@redhat.com>, Darren Hart
 <dvhart@infradead.org>, Ingo Molnar <mingo@redhat.com>, Davidlohr Bueso
 <dave@stgolabs.net>, Arnd Bergmann <arnd@arndb.de>, "Liam R . Howlett"
 <Liam.Howlett@oracle.com>
Subject: Re: [RFC PATCH] futex: Introduce __vdso_robust_futex_unlock
In-Reply-To: <b91a3ad3-7d87-4eb2-9664-652f809f047f@efficios.com>
References: <20260311185409.1988269-1-mathieu.desnoyers@efficios.com>
 <87eclopu0j.ffs@tglx> <b91a3ad3-7d87-4eb2-9664-652f809f047f@efficios.com>
Date: Mon, 16 Mar 2026 18:12:34 +0100
Message-ID: <874imfpukd.ffs@tglx>

On Thu, Mar 12 2026 at 18:52, Mathieu Desnoyers wrote:
> On 2026-03-12 18:23, Thomas Gleixner wrote:
>>          xchg(futex->uval, h->unlock_val);
>
> Here is the problem with your proposed approach:
>
>    "XCHG - Exchange Register/Memory With Register"
>                                          ^^^^^^^^
>
> So only one of the xchg arguments can be a memory location.
> Therefore, you will end up needing an extra store after xchg
> to store the content of the result register into h->unlock_val.

Indeed.

> If the process dies between those two instructions, your proposed
> robust list code will be fooled and fall into the same bug that's
> been lingering for 14 years.

s/lingering/ignored/
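
The window described in the quoted text can be sketched in plain C11.
This is only an illustration: struct robust_head_sketch and
unlock_two_steps() are hypothetical names, and atomic_exchange() stands
in for the XCHG instruction.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-thread robust list head. */
struct robust_head_sketch {
	uint32_t unlock_val;	/* new lock value going in, old value coming out */
};

/*
 * XCHG swaps a register with exactly one memory location, so swapping
 * two memory words takes an atomic exchange plus a separate store.
 */
static uint32_t unlock_two_steps(_Atomic uint32_t *uval,
				 struct robust_head_sketch *h)
{
	uint32_t old;

	assert(uval && h);

	/* Step 1: atomically swap the new value into the futex word. */
	old = atomic_exchange(uval, h->unlock_val);

	/* A task killed right here leaves h->unlock_val stale. */

	/* Step 2: separate, non-atomic store of the old value. */
	h->unlock_val = old;
	return old;
}
```

The lock is already released after step 1, but the robust list state
only reflects that after step 2.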

To fix this for correctness' sake it needs more than a hack in the kernel
which does not even look at the overall larger picture. I sat down and
did a full analysis and here are the most important questions:

Q: Do non-PI and PI have to be treated differently?

A: No.

   That's just historical evolution. While PI can't use XCHG because that
   would create inconsistent state, there is absolutely no reason why
   non-PI can't use try_cmpxchg().


Q: Is it required to unlock in user space first and then go into the kernel
   to wake up waiters?

A: No.

   That's again a historical leftover from the 1st generation futexes which
   preceded both robust and PI. There is no technical reason to keep it
   this way.

   So both can do:

       if (cmpxchg(lock, tid, 0) != tid)
               sys_futex(UNLOCK,....);

   which then allows for both non-PI and PI to hand the pending op pointer
   into the syscall and let the kernel deal with the unlock, the op pointer
   and the wake up in one go.

   That reduces the problem space to the non-contended unlock case, where
   the pending op is cleared after the cmpxchg() succeeded.

   And yes, that part can be done with the VDSO and a fixup mechanism in
   the kernel.
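
A minimal user space sketch of that unlock sequence, under loud
assumptions: pending_op, futex_unlock_slow() and robust_unlock() are
hypothetical names, WAITERS_BIT mirrors the value of FUTEX_WAITERS, and
the slow path is a stub which only counts calls because the UNLOCK
syscall operation is a proposal here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define WAITERS_BIT 0x80000000u	/* same value as FUTEX_WAITERS */

/* Hypothetical per-thread slot: the lock an unlock is in flight for. */
static void *pending_op;

/* Number of times the stubbed slow path was taken. */
static int slow_path_calls;

/*
 * Stub for the proposed UNLOCK syscall. The idea is that the kernel
 * unlocks, clears the op pointer and wakes a waiter in one go.
 */
static void futex_unlock_slow(_Atomic uint32_t *lock, void **op)
{
	assert(lock && op);
	slow_path_calls++;
	atomic_store(lock, 0);
	*op = NULL;
}

static void robust_unlock(_Atomic uint32_t *lock, uint32_t tid)
{
	uint32_t expected = tid;

	assert(lock);
	pending_op = (void *)lock;

	if (atomic_compare_exchange_strong(lock, &expected, 0)) {
		/*
		 * Uncontended: the lock is free but pending_op still points
		 * at it. This is the window left for the VDSO and the fixup.
		 */
		pending_op = NULL;
		return;
	}

	/* Contended (waiters bit set): hand lock and op pointer over. */
	futex_unlock_slow(lock, &pending_op);
}
```

With the contended case handled entirely behind the syscall, only the
two statements in the success branch remain exposed.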


Q: Are robust list pointers guaranteed to be 64-bit when running as a
   64-bit task?

A: No.

   The gaming emulators use both the native 64-bit robust list and the
   32-bit robust list from the same 64-bit application to make the
   emulation work.

   So both the UNLOCK syscall and the fixup need a means to figure out
   the size of the pointer to be cleared.

   Sure, this can be done with a boatload of different functions and flags
   and whatever, but that makes the actual fixup handling in the kernel
   more complicated than necessary.
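
Why the size matters can be shown with a small sketch. OP_SIZE32 and
clear_pending_op() are hypothetical; nothing here prescribes how the
size would actually be conveyed to the kernel.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flag: the pending op slot is 32 bits wide. */
#define OP_SIZE32 0x1u

/*
 * Clear a pending op slot of the given width. Clearing 64 bits on a
 * 32-bit slot would clobber whatever sits next to it, and clearing
 * 32 bits on a 64-bit slot would leave half of the pointer behind.
 */
static void clear_pending_op(void *slot, unsigned int flags)
{
	assert(slot);

	if (flags & OP_SIZE32) {
		uint32_t zero = 0;

		memcpy(slot, &zero, sizeof(zero));
	} else {
		uint64_t zero = 0;

		memcpy(slot, &zero, sizeof(zero));
	}
}
```

A single size indication shared by the syscall and the fixup keeps this
down to one code path with two store widths.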


Q: Do regular signal delivery and process exit, in case of a crash or
   being killed by an external signal, have to be treated differently?

A: No.

   A task always goes through the same signal code path for both cases so
   all of this can be handled in _one_ place without even touching the
   robust list cleanup code.

   sys_exit() is different because there a task voluntarily exits and if
   it does so between the unlock and the clearing of the op pointer,
   then so be it. That'd be willful ignorance or malice and not any
   different from the task doing the corruption itself in user space
   right away.


Q: Are exception tables a good idea?

A: No.

   This is not an exception handling case. It's a fixup similar to RSEQ
   critical section fixups and so it has to be handled with dedicated
   mechanisms which are performant and not glued onto something which has a
   completely different purpose.
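
An RSEQ-style fixup boils down to an instruction pointer range check
done in the signal delivery path. The following is a sketch only:
struct unlock_cs, struct user_regs_sketch and unlock_fixup() are
hypothetical, and it assumes the critical section starts after the
successful cmpxchg so that being inside the range implies the lock was
released.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of the VDSO unlock critical section. */
struct unlock_cs {
	uintptr_t start;	/* first instruction after the successful cmpxchg */
	uintptr_t end;		/* first instruction after the op clear (exclusive) */
};

/* Hypothetical subset of the interrupted user space state. */
struct user_regs_sketch {
	uintptr_t ip;		/* interrupted instruction pointer */
	uint64_t *pending_op;	/* location of the pending op pointer */
};

/*
 * If the task was interrupted inside the critical section, finish the
 * job on its behalf by clearing the pending op. Returns true when a
 * fixup was applied.
 */
static bool unlock_fixup(const struct unlock_cs *cs,
			 struct user_regs_sketch *regs)
{
	assert(cs && regs && regs->pending_op);

	if (regs->ip < cs->start || regs->ip >= cs->end)
		return false;

	*regs->pending_op = 0;
	return true;
}
```

The check costs two comparisons per signal delivery, which is the kind
of dedicated and cheap mechanism an exception table lookup is not.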


>> This fixes a long standing data corruption race condition with robust
>>  futexes, as pointed out here:
>>
>>  "File corruption race condition in robust mutex unlocking"
>>  https://sourceware.org/bugzilla/show_bug.cgi?id=14485

No comment.

Thanks,

	tglx