From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6B9561E5B64 for ; Wed, 26 Nov 2025 00:45:13 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764117915; cv=none; b=hUcVLEm6WWVh9nURttgcoQKoiY0Mb1sSY4v/uJDOnc+p0m3Q/RxEsJeAzfrRUSL4jnSTAsv2pBg+KldfB8VVZKWc015HxuvA/MD6vOCjarbimVdjMVmBkHnRA/oyUjsyADHfMLMXUxcCSvcaC53aUQtOQ1uhP3NBJRjC7Sg9fyk= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764117915; c=relaxed/simple; bh=t1Oex2NF1eYOxHSaX2M1CdtRViN9+6LsO1KGAKzof54=; h=From:To:Cc:Subject:In-Reply-To:References:Date:Message-ID: MIME-Version:Content-Type; b=IbpG5GIzqL2HbIGKErMTu6HRTtbflMdjdCzH6Zg3ZKHzAmmXV1XEQG9ibslpvBfoF0bStp/LkhXiuoVn9juy7m7inuwyb0XodpZw+l3bsrIddRS5nHp08muYlV0I8svLOYuodluNxtTb0eurBIPUT4c1v+WS6mMHZmU3fyyczPI= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=yCtt5jvg; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=VfGAI0iT; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="yCtt5jvg"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="VfGAI0iT" From: Thomas Gleixner DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1764117910; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=StwR4CkEtCWM8NCmcT1A3UZgrMoXKgqtfDy333h1DXU=; b=yCtt5jvg8BvR9cBrA/6CcmPbh7jO72YDQ+Eed5bsYvEYiZGq95EYmyIz2LjxKJTNChU3TE W8jheWvPIQ2VRebroXnkmaA6DKah381lVDN+35aWknLEKAqAf1qjfEBbkMea7Mn6tfvVvY ZfAxKhoa4fe3eU9Ra3VIkLWtu8eujKgg2rjABRIoUHepSHbecfrdwcEvWtuI4/WyUHyjc+ hSYvrW4DjYHrgFmS2eMUDaiU6pYTP84hU5cn+FWSt/qpMr7axJSEp4A48u+VUzVf5p51xC 6Xb5kJ6/h8gj107oCB8EAV8pRUMjWoat65kLt6kTL4xH/3arm5GPZzi+4M/DLw== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1764117910; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=StwR4CkEtCWM8NCmcT1A3UZgrMoXKgqtfDy333h1DXU=; b=VfGAI0iTAcdN5NyA9RjkJGM/gTxEjSGObqBkAIUJR2yTxP+TyX/6NVrsz1yQVY8qOewMVF yCUPShIO9FY1K6Dw== To: Kevin Brodsky , Dmitry Vyukov , mathieu.desnoyers@efficios.com, peterz@infradead.org, boqun.feng@gmail.com, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com, aruna.ramakrishna@oracle.com, elver@google.com Cc: "Paul E. McKenney" , x86@kernel.org, linux-kernel@vger.kernel.org, Florian Weimer Subject: Re: [PATCH v7 3/4] rseq: Make rseq work with protection keys In-Reply-To: <8079f564-cec0-45e4-857b-74b2e630a9d5@arm.com> References: <138c29bd5f5a0a22270c9384ecc721c40b7d8fbd.1747817128.git.dvyukov@google.com> <8079f564-cec0-45e4-857b-74b2e630a9d5@arm.com> Date: Wed, 26 Nov 2025 01:45:10 +0100 Message-ID: <87ikexhbah.ffs@tglx> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain On Mon, Oct 20 2025 at 15:46, Kevin Brodsky wrote: > +Florian Weimer >> * A single struct rseq per thread is allowed. >> + * >> + * If struct rseq or struct rseq_cs is used with Memory Protection Keys, >> + * then the assigned pkey should either be accessible whenever these structs >> + * are registered/installed, or they should be protected with pkey 0. > > I think it's worth pointing out that every case of async uaccess seems > to be handled differently w.r.t. pkeys. Some interesting cases: > > * During signal delivery, the pkey register is reset to permissive > before writing to the signal stack (this is the logic that patch 2 > touches in fact). This is handled in the same way on x86 [1] and arm64 [2]. > > * AFAICT io_uring worker threads inherit the user thread's pkey register > on x86, since they are forked without setting PF_KTHREAD. OTOH on arm64 > the pkey register is reset to the init value (RW access to pkey 0 only) > for all non-user threads [3]. > > * Now with this patch accesses to struct rseq are made with whatever the > pkey register is set to when the thread is interrupted + RW access for > pkey 0. > > This is clearly a difficult problem, since no approach will satisfy all > users. I believe a new API would be desirable to allow users to > configure this; see the thread at [4] regarding signal handlers. > > In any case, it seems best to ignore the value of the pkey register at > the point where the thread is interrupted, because userspace has no > control on that value and it could lead to spurious faults (depending on > when the interruption happens). For rseq, maybe the most logical option > would be to set the pkey register to permissive in > __rseq_handle_notify_resume(), because there is nothing sensible to > check at that point. Checking could be done at the point of registration > instead. That's all broken. Assume: 1) process starts with pkey 0 (default) 2) glibc creates TLS (protected by pkey 0) 3) main() switches to protection pkey 1 If the switch to pkey 1 does not ensure that TLS (where RSEQ sits) is accessible by pkey 1, then how is userspace able to survive? You then do not even need the help of the kernel to die. If the process accesses TLS it dies on it's own. As each map can only be attached to a single pkey, the only solutions for this are: 1) enable both pkey0 and pkey1 or 2) ensure that everything which was mapped _before_ main() changes the protection key is covered by pkey1 when it disables pkey0. For that to pull off you need #1 temporary to finish the transition of _all_ affected mappings via pkey_mprotect() No? And no matter how many PKEY hacks you sprinkle into the kernel none of them will solve all of the related problems. If user space registers shared memory with the kernel (that's what RSEQ is), user space has to make sure that it is accessible under it's own PKEY constraints. If it fails to do so, it dies rightfully whether in user space or in kernel space does not matter. The only exception is the alternative signal stack, which can be set up as an catch all sink which is not covered by the currently active PKEY set on purpose. And looking at how x86 handles this it gets it actually wrong because it enables all keys independent of the stack target it delivers to while this should be strictly restricted to the alternative stack, but that's a different discussion to have. So no, these patches are going nowhere because they try to paper over a fundamental bug in the application and the understanding of how all of this protection key mechanism actualy can work. Just let me quote the cover letter to prove my point: "If an application registers rseq, and ever switches to another pkey protection (such that the rseq becomes inaccessible), then any context switch will cause failure in __rseq_handle_notify_resume() attempting to read/write struct rseq and/or rseq_cs. Since context switches are asynchronous and are outside of the application control (not part of the restricted code scope), temporarily enable access to 0 (default) PKEY to read/write rseq/rseq_cs. 0 is the only PKEY supported for rseq for now." Clearly that does not even notice the fact that switching to pkey 1 prevents access to the TLS. Fact is that RSEQ is part of the TLS whether you like it or not. See above. "Theoretically other PKEYs can be supported, but it's unclear how/if that can work. So for now we don't support that to simplify code." Clearly tells me that this is attempt to scratch an itch which was not understood at all. There are exactly two options to solve this in user space and there is zero justification to inflict any of this nonsense on the kernel. Thanks, tglx