* [PATCH v7 0/4] rseq: Make rseq work with protection keys
@ 2025-05-21 8:47 Dmitry Vyukov
2025-05-21 8:47 ` [PATCH v7 1/4] pkeys: add API to switch to permissive/zero pkey register Dmitry Vyukov
` (3 more replies)
0 siblings, 4 replies; 22+ messages in thread
From: Dmitry Vyukov @ 2025-05-21 8:47 UTC (permalink / raw)
To: mathieu.desnoyers, peterz, boqun.feng, tglx, mingo, bp,
dave.hansen, hpa, aruna.ramakrishna, elver
Cc: Dmitry Vyukov, Paul E. McKenney, x86, linux-kernel
If an application registers rseq, and ever switches to another pkey
protection (such that the rseq becomes inaccessible), then any
context switch will cause failure in __rseq_handle_notify_resume()
attempting to read/write struct rseq and/or rseq_cs. Since context
switches are asynchronous and are outside of the application control
(not part of the restricted code scope), temporarily enable access
to 0 (default) PKEY to read/write rseq/rseq_cs.
0 is the only PKEY supported for rseq for now.
Theoretically other PKEYs can be supported, but it's unclear
how/if that can work. So for now we don't support that to simplify
code.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Dmitry Vyukov (4):
pkeys: add API to switch to permissive/zero pkey register
x86/signal: Use write_permissive_pkey_val() helper
rseq: Make rseq work with protection keys
selftests/rseq: Add test for rseq+pkeys
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pkeys.h | 30 +++++++
arch/x86/include/asm/pkru.h | 10 ++-
arch/x86/kernel/signal.c | 6 +-
include/linux/pkeys.h | 31 +++++++
include/uapi/linux/rseq.h | 4 +
kernel/rseq.c | 11 +++
mm/Kconfig | 2 +
tools/testing/selftests/rseq/Makefile | 2 +-
tools/testing/selftests/rseq/pkey_test.c | 101 +++++++++++++++++++++++
tools/testing/selftests/rseq/rseq.h | 1 +
11 files changed, 191 insertions(+), 8 deletions(-)
create mode 100644 tools/testing/selftests/rseq/pkey_test.c
base-commit: 4a95bc121ccdaee04c4d72f84dbfa6b880a514b6
--
2.49.0.1143.g0be31eac6b-goog
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v7 1/4] pkeys: add API to switch to permissive/zero pkey register
2025-05-21 8:47 [PATCH v7 0/4] rseq: Make rseq work with protection keys Dmitry Vyukov
@ 2025-05-21 8:47 ` Dmitry Vyukov
2025-05-21 8:47 ` [PATCH v7 2/4] x86/signal: Use write_permissive_pkey_val() helper Dmitry Vyukov
` (2 subsequent siblings)
3 siblings, 0 replies; 22+ messages in thread
From: Dmitry Vyukov @ 2025-05-21 8:47 UTC (permalink / raw)
To: mathieu.desnoyers, peterz, boqun.feng, tglx, mingo, bp,
dave.hansen, hpa, aruna.ramakrishna, elver
Cc: Dmitry Vyukov, Paul E. McKenney, x86, linux-kernel
The API allows to switch to permissive pkey register that allows accesses
to all PKEYs, and to a value that allows access to the 0 (default) PKEY.
x86 signal delivery already uses switching to permissive PKEY register
value, and rseq needs to allow access to PKEY 0 while accessing
struct rseq/rseq_cs.
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: d7822b1e24f2 ("rseq: Introduce restartable sequences system call")
---
Changes in v5:
- Removed leftover dead code in enable_zero_pkey_val
- Clarified commit message
Changes in v4:
- Added Fixes tag
Changes in v3:
- Renamed API functions to write_permissive_pkey_val/write_pkey_val
- Added enable_zero_pkey_val for rseq
- Added Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Changes in v2:
- Fixed typo in commit description
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pkeys.h | 30 ++++++++++++++++++++++++++++++
arch/x86/include/asm/pkru.h | 10 +++++++---
include/linux/pkeys.h | 31 +++++++++++++++++++++++++++++++
mm/Kconfig | 2 ++
5 files changed, 71 insertions(+), 3 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e21cca404943e..90e60f5651bb6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1820,6 +1820,7 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
depends on X86_64 && (CPU_SUP_INTEL || CPU_SUP_AMD)
select ARCH_USES_HIGH_VMA_FLAGS
select ARCH_HAS_PKEYS
+ select ARCH_HAS_PERMISSIVE_PKEY
help
Memory Protection Keys provides a mechanism for enforcing
page-based protections, but without requiring modification of the
diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index 2e6c04d8a45b4..614099766d5f2 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -2,6 +2,8 @@
#ifndef _ASM_X86_PKEYS_H
#define _ASM_X86_PKEYS_H
+#include "pkru.h"
+
/*
* If more than 16 keys are ever supported, a thorough audit
* will be necessary to ensure that the types that store key
@@ -123,4 +125,32 @@ static inline int vma_pkey(struct vm_area_struct *vma)
return (vma->vm_flags & vma_pkey_mask) >> VM_PKEY_SHIFT;
}
+typedef u32 pkey_reg_t;
+
+static inline pkey_reg_t write_permissive_pkey_val(void)
+{
+ return write_pkru(0);
+}
+
+static inline pkey_reg_t enable_zero_pkey_val(void)
+{
+ u32 pkru;
+
+ if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
+ return 0;
+ /*
+ * WRPKRU is relatively expensive compared to RDPKRU,
+ * avoid it if possible.
+ */
+ pkru = rdpkru();
+ if ((pkru & (PKRU_AD_BIT|PKRU_WD_BIT)) != 0)
+ wrpkru(pkru & ~(PKRU_AD_BIT|PKRU_WD_BIT));
+ return pkru;
+}
+
+static inline void write_pkey_val(pkey_reg_t val)
+{
+ write_pkru(val);
+}
+
#endif /*_ASM_X86_PKEYS_H */
diff --git a/arch/x86/include/asm/pkru.h b/arch/x86/include/asm/pkru.h
index 74f0a2d34ffdd..b9bf9b7f2753b 100644
--- a/arch/x86/include/asm/pkru.h
+++ b/arch/x86/include/asm/pkru.h
@@ -39,16 +39,20 @@ static inline u32 read_pkru(void)
return 0;
}
-static inline void write_pkru(u32 pkru)
+static inline u32 write_pkru(u32 pkru)
{
+ u32 old_pkru;
+
if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
- return;
+ return 0;
/*
* WRPKRU is relatively expensive compared to RDPKRU.
* Avoid WRPKRU when it would not change the value.
*/
- if (pkru != rdpkru())
+ old_pkru = rdpkru();
+ if (pkru != old_pkru)
wrpkru(pkru);
+ return old_pkru;
}
static inline void pkru_write_default(void)
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index 86be8bf27b41b..262d60f6a15f8 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -48,4 +48,35 @@ static inline bool arch_pkeys_enabled(void)
#endif /* ! CONFIG_ARCH_HAS_PKEYS */
+#ifndef CONFIG_ARCH_HAS_PERMISSIVE_PKEY
+
+/*
+ * Common name for value of the register that controls access to PKEYs
+ * (called differently on different arches: PKRU, POR, AMR).
+ */
+typedef char pkey_reg_t;
+
+/*
+ * Sets PKEY access register to the most permissive value that allows
+ * accesses to all PKEYs. Returns the current value of PKEY register.
+ * Code should generally arrange switching back to the old value
+ * using write_pkey_val(old_value).
+ */
+static inline pkey_reg_t write_permissive_pkey_val(void)
+{
+ return 0;
+}
+
+/*
+ * Sets PKEY access register to a value that allows access to the 0 (default)
+ * PKEY. Returns the current value of PKEY register.
+ */
+static inline pkey_reg_t enable_zero_pkey_val(void)
+{
+ return 0;
+}
+
+static inline void write_pkey_val(pkey_reg_t val) {}
+#endif /* ! CONFIG_ARCH_HAS_PERMISSIVE_PKEY */
+
#endif /* _LINUX_PKEYS_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index e113f713b4938..37f5706445e2e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1131,6 +1131,8 @@ config ARCH_USES_HIGH_VMA_FLAGS
bool
config ARCH_HAS_PKEYS
bool
+config ARCH_HAS_PERMISSIVE_PKEY
+ bool
config ARCH_USES_PG_ARCH_2
bool
--
2.49.0.1143.g0be31eac6b-goog
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH v7 2/4] x86/signal: Use write_permissive_pkey_val() helper
2025-05-21 8:47 [PATCH v7 0/4] rseq: Make rseq work with protection keys Dmitry Vyukov
2025-05-21 8:47 ` [PATCH v7 1/4] pkeys: add API to switch to permissive/zero pkey register Dmitry Vyukov
@ 2025-05-21 8:47 ` Dmitry Vyukov
2025-05-21 8:47 ` [PATCH v7 3/4] rseq: Make rseq work with protection keys Dmitry Vyukov
2025-05-21 8:47 ` [PATCH v7 4/4] selftests/rseq: Add test for rseq+pkeys Dmitry Vyukov
3 siblings, 0 replies; 22+ messages in thread
From: Dmitry Vyukov @ 2025-05-21 8:47 UTC (permalink / raw)
To: mathieu.desnoyers, peterz, boqun.feng, tglx, mingo, bp,
dave.hansen, hpa, aruna.ramakrishna, elver
Cc: Dmitry Vyukov, Paul E. McKenney, x86, linux-kernel
Use the new switch_to_permissive_pkey_reg() helper instead of the
custom code. No functional changes intended.
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
Changes in v3:
- restore sig_prepare_pkru with the large comment and
make it call the new write_permissive_pkey_val
---
arch/x86/kernel/signal.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 5f441039b5725..27a66a0697dd2 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -28,6 +28,7 @@
#include <linux/entry-common.h>
#include <linux/syscalls.h>
#include <linux/rseq.h>
+#include <linux/pkeys.h>
#include <asm/processor.h>
#include <asm/ucontext.h>
@@ -72,10 +73,7 @@ static inline int is_x32_frame(struct ksignal *ksig)
*/
static inline u32 sig_prepare_pkru(void)
{
- u32 orig_pkru = read_pkru();
-
- write_pkru(0);
- return orig_pkru;
+ return write_permissive_pkey_val();
}
/*
--
2.49.0.1143.g0be31eac6b-goog
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH v7 3/4] rseq: Make rseq work with protection keys
2025-05-21 8:47 [PATCH v7 0/4] rseq: Make rseq work with protection keys Dmitry Vyukov
2025-05-21 8:47 ` [PATCH v7 1/4] pkeys: add API to switch to permissive/zero pkey register Dmitry Vyukov
2025-05-21 8:47 ` [PATCH v7 2/4] x86/signal: Use write_permissive_pkey_val() helper Dmitry Vyukov
@ 2025-05-21 8:47 ` Dmitry Vyukov
2025-05-21 8:59 ` Dmitry Vyukov
2025-10-20 13:46 ` Kevin Brodsky
2025-05-21 8:47 ` [PATCH v7 4/4] selftests/rseq: Add test for rseq+pkeys Dmitry Vyukov
3 siblings, 2 replies; 22+ messages in thread
From: Dmitry Vyukov @ 2025-05-21 8:47 UTC (permalink / raw)
To: mathieu.desnoyers, peterz, boqun.feng, tglx, mingo, bp,
dave.hansen, hpa, aruna.ramakrishna, elver
Cc: Dmitry Vyukov, Paul E. McKenney, x86, linux-kernel
If an application registers rseq, and ever switches to another pkey
protection (such that the rseq becomes inaccessible), then any
context switch will cause failure in __rseq_handle_notify_resume()
attempting to read/write struct rseq and/or rseq_cs. Since context
switches are asynchronous and are outside of the application control
(not part of the restricted code scope), temporarily switch to
pkey value that allows access to the 0 (default) PKEY.
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Fixes: d7822b1e24f2 ("rseq: Introduce restartable sequences system call")
---
Changes in v7:
- Added Mathieu's Reviewed-by
Changes in v6:
- Added a comment to struct rseq with MPK rules
Changes in v4:
- Added Fixes tag
Changes in v3:
- simplify control flow to always enable access to 0 pkey
Changes in v2:
- fixed typos and reworded the comment
---
include/uapi/linux/rseq.h | 4 ++++
kernel/rseq.c | 11 +++++++++++
2 files changed, 15 insertions(+)
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index c233aae5eac90..019fd248cf749 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -58,6 +58,10 @@ struct rseq_cs {
* contained within a single cache-line.
*
* A single struct rseq per thread is allowed.
+ *
+ * If struct rseq or struct rseq_cs is used with Memory Protection Keys,
+ * then the assigned pkey should either be accessible whenever these structs
+ * are registered/installed, or they should be protected with pkey 0.
*/
struct rseq {
/*
diff --git a/kernel/rseq.c b/kernel/rseq.c
index b7a1ec327e811..88fc8cb789b3b 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -10,6 +10,7 @@
#include <linux/sched.h>
#include <linux/uaccess.h>
+#include <linux/pkeys.h>
#include <linux/syscalls.h>
#include <linux/rseq.h>
#include <linux/types.h>
@@ -424,11 +425,19 @@ static int rseq_ip_fixup(struct pt_regs *regs)
void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
{
struct task_struct *t = current;
+ pkey_reg_t saved_pkey;
int ret, sig;
if (unlikely(t->flags & PF_EXITING))
return;
+ /*
+ * Enable access to the default (0) pkey in case the thread has
+ * currently disabled access to it and struct rseq/rseq_cs has
+ * 0 pkey assigned (the only supported value for now).
+ */
+ saved_pkey = enable_zero_pkey_val();
+
/*
* regs is NULL if and only if the caller is in a syscall path. Skip
* fixup and leave rseq_cs as is so that rseq_sycall() will detect and
@@ -441,9 +450,11 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
}
if (unlikely(rseq_update_cpu_node_id(t)))
goto error;
+ write_pkey_val(saved_pkey);
return;
error:
+ write_pkey_val(saved_pkey);
sig = ksig ? ksig->sig : 0;
force_sigsegv(sig);
}
--
2.49.0.1143.g0be31eac6b-goog
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH v7 4/4] selftests/rseq: Add test for rseq+pkeys
2025-05-21 8:47 [PATCH v7 0/4] rseq: Make rseq work with protection keys Dmitry Vyukov
` (2 preceding siblings ...)
2025-05-21 8:47 ` [PATCH v7 3/4] rseq: Make rseq work with protection keys Dmitry Vyukov
@ 2025-05-21 8:47 ` Dmitry Vyukov
3 siblings, 0 replies; 22+ messages in thread
From: Dmitry Vyukov @ 2025-05-21 8:47 UTC (permalink / raw)
To: mathieu.desnoyers, peterz, boqun.feng, tglx, mingo, bp,
dave.hansen, hpa, aruna.ramakrishna, elver
Cc: Dmitry Vyukov, Paul E. McKenney, x86, linux-kernel
Add a test that ensures that PKEY-protected struct rseq_cs
works and does not lead to process kills.
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Fixes: d7822b1e24f2 ("rseq: Introduce restartable sequences system call")
---
Changes in v7:
- Add rseq_unregister_current_thread() call in the test
Changes in v5:
- Use static for variables/functions
- Use RSEQ_READ/WRITE_ONCE instead of volatile
Changes in v4:
- Added Fixes tag
Changes in v3:
- added Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
- rework the test to work when only pkey 0 is supported for rseq
Changes in v2:
- change test to install protected rseq_cs instead of rseq
---
tools/testing/selftests/rseq/Makefile | 2 +-
tools/testing/selftests/rseq/pkey_test.c | 101 +++++++++++++++++++++++
tools/testing/selftests/rseq/rseq.h | 1 +
3 files changed, 103 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile
index 0d0a5fae59547..e5fd819011582 100644
--- a/tools/testing/selftests/rseq/Makefile
+++ b/tools/testing/selftests/rseq/Makefile
@@ -17,7 +17,7 @@ OVERRIDE_TARGETS = 1
TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \
param_test_benchmark param_test_compare_twice param_test_mm_cid \
param_test_mm_cid_benchmark param_test_mm_cid_compare_twice \
- syscall_errors_test
+ syscall_errors_test pkey_test
TEST_GEN_PROGS_EXTENDED = librseq.so
diff --git a/tools/testing/selftests/rseq/pkey_test.c b/tools/testing/selftests/rseq/pkey_test.c
new file mode 100644
index 0000000000000..5dc214cd7a1e6
--- /dev/null
+++ b/tools/testing/selftests/rseq/pkey_test.c
@@ -0,0 +1,101 @@
+// SPDX-License-Identifier: LGPL-2.1
+/*
+ * Ensure that rseq works when rseq data is inaccessible due to PKEYs.
+ */
+
+#define _GNU_SOURCE
+#include <err.h>
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/syscall.h>
+#include <ucontext.h>
+#include <unistd.h>
+
+#include "rseq.h"
+#include "rseq-abi.h"
+
+static int pkey;
+static ucontext_t ucp0, ucp1;
+
+static void coroutine(void)
+{
+ int i, orig_pk0, old_pk0, old_pk1, pk0, pk1;
+ /*
+ * When we disable access to pkey 0, globals and TLS become
+ * inaccessible too, so we need to tread carefully.
+ * Pkey is global so we need to copy it onto the stack.
+ */
+ int pk = RSEQ_READ_ONCE(pkey);
+ struct timespec ts;
+
+ orig_pk0 = pkey_get(0);
+ if (pkey_set(0, PKEY_DISABLE_ACCESS))
+ err(1, "pkey_set failed");
+ old_pk0 = pkey_get(0);
+ old_pk1 = pkey_get(pk);
+
+ /*
+ * Prevent compiler from initializing it by loading a 16-global.
+ */
+ RSEQ_WRITE_ONCE(ts.tv_sec, 0);
+ RSEQ_WRITE_ONCE(ts.tv_nsec, 10 * 1000);
+ /*
+ * If the kernel misbehaves, context switches in the following loop
+ * will terminate the process with SIGSEGV.
+ * Trigger preemption w/o accessing TLS.
+ * Note that glibc's usleep touches errno always.
+ */
+ for (i = 0; i < 10; i++)
+ syscall(SYS_clock_nanosleep, CLOCK_MONOTONIC, 0, &ts, NULL);
+
+ pk0 = pkey_get(0);
+ pk1 = pkey_get(pk);
+ if (pkey_set(0, orig_pk0))
+ err(1, "pkey_set failed");
+
+ /*
+ * Ensure that the kernel has restored the previous value of pkeys
+ * register after changing them.
+ */
+ if (old_pk0 != pk0)
+ errx(1, "pkey 0 changed %d->%d", old_pk0, pk0);
+ if (old_pk1 != pk1)
+ errx(1, "pkey 1 changed %d->%d", old_pk1, pk1);
+
+ swapcontext(&ucp1, &ucp0);
+ abort();
+}
+
+int main(int argc, char **argv)
+{
+ pkey = pkey_alloc(0, 0);
+ if (pkey == -1) {
+ printf("[SKIP]\tKernel does not support PKEYs: %s\n",
+ strerror(errno));
+ return 0;
+ }
+
+ if (rseq_register_current_thread())
+ err(1, "rseq_register_current_thread failed");
+
+ if (getcontext(&ucp1))
+ err(1, "getcontext failed");
+ ucp1.uc_stack.ss_size = getpagesize() * 4;
+ ucp1.uc_stack.ss_sp = mmap(NULL, ucp1.uc_stack.ss_size,
+ PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, -1, 0);
+ if (ucp1.uc_stack.ss_sp == MAP_FAILED)
+ err(1, "mmap failed");
+ if (pkey_mprotect(ucp1.uc_stack.ss_sp, ucp1.uc_stack.ss_size,
+ PROT_READ | PROT_WRITE, pkey))
+ err(1, "pkey_mprotect failed");
+ makecontext(&ucp1, coroutine, 0);
+ if (swapcontext(&ucp0, &ucp1))
+ err(1, "swapcontext failed");
+
+ if (rseq_unregister_current_thread())
+ err(1, "rseq_unregister_current_thread failed");
+ return 0;
+}
diff --git a/tools/testing/selftests/rseq/rseq.h b/tools/testing/selftests/rseq/rseq.h
index f51a5fdb04443..cdb34cbad1adb 100644
--- a/tools/testing/selftests/rseq/rseq.h
+++ b/tools/testing/selftests/rseq/rseq.h
@@ -8,6 +8,7 @@
#ifndef RSEQ_H
#define RSEQ_H
+#include <assert.h>
#include <stdint.h>
#include <stdbool.h>
#include <pthread.h>
--
2.49.0.1143.g0be31eac6b-goog
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH v7 3/4] rseq: Make rseq work with protection keys
2025-05-21 8:47 ` [PATCH v7 3/4] rseq: Make rseq work with protection keys Dmitry Vyukov
@ 2025-05-21 8:59 ` Dmitry Vyukov
2025-06-24 9:17 ` Dmitry Vyukov
2025-10-20 13:46 ` Kevin Brodsky
1 sibling, 1 reply; 22+ messages in thread
From: Dmitry Vyukov @ 2025-05-21 8:59 UTC (permalink / raw)
To: mathieu.desnoyers, peterz, boqun.feng, tglx, mingo, bp,
dave.hansen, hpa, aruna.ramakrishna, elver
Cc: Paul E. McKenney, x86, linux-kernel, Ingo Molnar
On Wed, 21 May 2025 at 10:52, Dmitry Vyukov <dvyukov@google.com> wrote:
>
> If an application registers rseq, and ever switches to another pkey
> protection (such that the rseq becomes inaccessible), then any
> context switch will cause failure in __rseq_handle_notify_resume()
> attempting to read/write struct rseq and/or rseq_cs. Since context
> switches are asynchronous and are outside of the application control
> (not part of the restricted code scope), temporarily switch to
> pkey value that allows access to the 0 (default) PKEY.
>
> Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
> Cc: x86@kernel.org
> Cc: linux-kernel@vger.kernel.org
> Fixes: d7822b1e24f2 ("rseq: Introduce restartable sequences system call")
Dave, can you please ack this patch? Ingo said he was waiting for your
review before taking this to -tip.
> ---
> Changes in v7:
> - Added Mathieu's Reviewed-by
>
> Changes in v6:
> - Added a comment to struct rseq with MPK rules
>
> Changes in v4:
> - Added Fixes tag
>
> Changes in v3:
> - simplify control flow to always enable access to 0 pkey
>
> Changes in v2:
> - fixed typos and reworded the comment
> ---
> include/uapi/linux/rseq.h | 4 ++++
> kernel/rseq.c | 11 +++++++++++
> 2 files changed, 15 insertions(+)
>
> diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
> index c233aae5eac90..019fd248cf749 100644
> --- a/include/uapi/linux/rseq.h
> +++ b/include/uapi/linux/rseq.h
> @@ -58,6 +58,10 @@ struct rseq_cs {
> * contained within a single cache-line.
> *
> * A single struct rseq per thread is allowed.
> + *
> + * If struct rseq or struct rseq_cs is used with Memory Protection Keys,
> + * then the assigned pkey should either be accessible whenever these structs
> + * are registered/installed, or they should be protected with pkey 0.
> */
> struct rseq {
> /*
> diff --git a/kernel/rseq.c b/kernel/rseq.c
> index b7a1ec327e811..88fc8cb789b3b 100644
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -10,6 +10,7 @@
>
> #include <linux/sched.h>
> #include <linux/uaccess.h>
> +#include <linux/pkeys.h>
> #include <linux/syscalls.h>
> #include <linux/rseq.h>
> #include <linux/types.h>
> @@ -424,11 +425,19 @@ static int rseq_ip_fixup(struct pt_regs *regs)
> void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
> {
> struct task_struct *t = current;
> + pkey_reg_t saved_pkey;
> int ret, sig;
>
> if (unlikely(t->flags & PF_EXITING))
> return;
>
> + /*
> + * Enable access to the default (0) pkey in case the thread has
> + * currently disabled access to it and struct rseq/rseq_cs has
> + * 0 pkey assigned (the only supported value for now).
> + */
> + saved_pkey = enable_zero_pkey_val();
> +
> /*
> * regs is NULL if and only if the caller is in a syscall path. Skip
> * fixup and leave rseq_cs as is so that rseq_sycall() will detect and
> @@ -441,9 +450,11 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
> }
> if (unlikely(rseq_update_cpu_node_id(t)))
> goto error;
> + write_pkey_val(saved_pkey);
> return;
>
> error:
> + write_pkey_val(saved_pkey);
> sig = ksig ? ksig->sig : 0;
> force_sigsegv(sig);
> }
> --
> 2.49.0.1143.g0be31eac6b-goog
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v7 3/4] rseq: Make rseq work with protection keys
2025-05-21 8:59 ` Dmitry Vyukov
@ 2025-06-24 9:17 ` Dmitry Vyukov
2025-07-18 9:01 ` Dmitry Vyukov
0 siblings, 1 reply; 22+ messages in thread
From: Dmitry Vyukov @ 2025-06-24 9:17 UTC (permalink / raw)
To: mathieu.desnoyers, peterz, boqun.feng, tglx, mingo, bp,
dave.hansen, hpa, aruna.ramakrishna, elver
Cc: Paul E. McKenney, x86, linux-kernel, Ingo Molnar
On Wed, 21 May 2025 at 10:59, Dmitry Vyukov <dvyukov@google.com> wrote:
>
> On Wed, 21 May 2025 at 10:52, Dmitry Vyukov <dvyukov@google.com> wrote:
> >
> > If an application registers rseq, and ever switches to another pkey
> > protection (such that the rseq becomes inaccessible), then any
> > context switch will cause failure in __rseq_handle_notify_resume()
> > attempting to read/write struct rseq and/or rseq_cs. Since context
> > switches are asynchronous and are outside of the application control
> > (not part of the restricted code scope), temporarily switch to
> > pkey value that allows access to the 0 (default) PKEY.
> >
> > Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
> > Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: "Paul E. McKenney" <paulmck@kernel.org>
> > Cc: Boqun Feng <boqun.feng@gmail.com>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Borislav Petkov <bp@alien8.de>
> > Cc: Dave Hansen <dave.hansen@linux.intel.com>
> > Cc: "H. Peter Anvin" <hpa@zytor.com>
> > Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
> > Cc: x86@kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Fixes: d7822b1e24f2 ("rseq: Introduce restartable sequences system call")
>
> Dave, can you please ack this patch? Ingo said he was waiting for your
> review before taking this to -tip.
Are there any remaining concerns with this series? If not, Thomas,
Ingo, can you please take this to -tip tree?
> > ---
> > Changes in v7:
> > - Added Mathieu's Reviewed-by
> >
> > Changes in v6:
> > - Added a comment to struct rseq with MPK rules
> >
> > Changes in v4:
> > - Added Fixes tag
> >
> > Changes in v3:
> > - simplify control flow to always enable access to 0 pkey
> >
> > Changes in v2:
> > - fixed typos and reworded the comment
> > ---
> > include/uapi/linux/rseq.h | 4 ++++
> > kernel/rseq.c | 11 +++++++++++
> > 2 files changed, 15 insertions(+)
> >
> > diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
> > index c233aae5eac90..019fd248cf749 100644
> > --- a/include/uapi/linux/rseq.h
> > +++ b/include/uapi/linux/rseq.h
> > @@ -58,6 +58,10 @@ struct rseq_cs {
> > * contained within a single cache-line.
> > *
> > * A single struct rseq per thread is allowed.
> > + *
> > + * If struct rseq or struct rseq_cs is used with Memory Protection Keys,
> > + * then the assigned pkey should either be accessible whenever these structs
> > + * are registered/installed, or they should be protected with pkey 0.
> > */
> > struct rseq {
> > /*
> > diff --git a/kernel/rseq.c b/kernel/rseq.c
> > index b7a1ec327e811..88fc8cb789b3b 100644
> > --- a/kernel/rseq.c
> > +++ b/kernel/rseq.c
> > @@ -10,6 +10,7 @@
> >
> > #include <linux/sched.h>
> > #include <linux/uaccess.h>
> > +#include <linux/pkeys.h>
> > #include <linux/syscalls.h>
> > #include <linux/rseq.h>
> > #include <linux/types.h>
> > @@ -424,11 +425,19 @@ static int rseq_ip_fixup(struct pt_regs *regs)
> > void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
> > {
> > struct task_struct *t = current;
> > + pkey_reg_t saved_pkey;
> > int ret, sig;
> >
> > if (unlikely(t->flags & PF_EXITING))
> > return;
> >
> > + /*
> > + * Enable access to the default (0) pkey in case the thread has
> > + * currently disabled access to it and struct rseq/rseq_cs has
> > + * 0 pkey assigned (the only supported value for now).
> > + */
> > + saved_pkey = enable_zero_pkey_val();
> > +
> > /*
> > * regs is NULL if and only if the caller is in a syscall path. Skip
> > * fixup and leave rseq_cs as is so that rseq_sycall() will detect and
> > @@ -441,9 +450,11 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
> > }
> > if (unlikely(rseq_update_cpu_node_id(t)))
> > goto error;
> > + write_pkey_val(saved_pkey);
> > return;
> >
> > error:
> > + write_pkey_val(saved_pkey);
> > sig = ksig ? ksig->sig : 0;
> > force_sigsegv(sig);
> > }
> > --
> > 2.49.0.1143.g0be31eac6b-goog
> >
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v7 3/4] rseq: Make rseq work with protection keys
2025-06-24 9:17 ` Dmitry Vyukov
@ 2025-07-18 9:01 ` Dmitry Vyukov
2025-07-21 13:25 ` Mathieu Desnoyers
0 siblings, 1 reply; 22+ messages in thread
From: Dmitry Vyukov @ 2025-07-18 9:01 UTC (permalink / raw)
To: mathieu.desnoyers, peterz, boqun.feng, tglx, mingo, bp,
dave.hansen, hpa, aruna.ramakrishna, elver
Cc: Paul E. McKenney, x86, linux-kernel, Ingo Molnar,
Stephen Röttger
On Tue, 24 Jun 2025 at 11:17, Dmitry Vyukov <dvyukov@google.com> wrote:
> > > If an application registers rseq, and ever switches to another pkey
> > > protection (such that the rseq becomes inaccessible), then any
> > > context switch will cause failure in __rseq_handle_notify_resume()
> > > attempting to read/write struct rseq and/or rseq_cs. Since context
> > > switches are asynchronous and are outside of the application control
> > > (not part of the restricted code scope), temporarily switch to
> > > pkey value that allows access to the 0 (default) PKEY.
> > >
> > > Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
> > > Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > Cc: "Paul E. McKenney" <paulmck@kernel.org>
> > > Cc: Boqun Feng <boqun.feng@gmail.com>
> > > Cc: Thomas Gleixner <tglx@linutronix.de>
> > > Cc: Ingo Molnar <mingo@redhat.com>
> > > Cc: Borislav Petkov <bp@alien8.de>
> > > Cc: Dave Hansen <dave.hansen@linux.intel.com>
> > > Cc: "H. Peter Anvin" <hpa@zytor.com>
> > > Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
> > > Cc: x86@kernel.org
> > > Cc: linux-kernel@vger.kernel.org
> > > Fixes: d7822b1e24f2 ("rseq: Introduce restartable sequences system call")
> >
> > Dave, can you please ack this patch? Ingo said he was waiting for your
> > review before taking this to -tip.
>
> Are there any remaining concerns with this series? If not, Thomas,
> Ingo, can you please take this to -tip tree?
Gentle ping. What needs to happen for this series to be merged?
> > > ---
> > > Changes in v7:
> > > - Added Mathieu's Reviewed-by
> > >
> > > Changes in v6:
> > > - Added a comment to struct rseq with MPK rules
> > >
> > > Changes in v4:
> > > - Added Fixes tag
> > >
> > > Changes in v3:
> > > - simplify control flow to always enable access to 0 pkey
> > >
> > > Changes in v2:
> > > - fixed typos and reworded the comment
> > > ---
> > > include/uapi/linux/rseq.h | 4 ++++
> > > kernel/rseq.c | 11 +++++++++++
> > > 2 files changed, 15 insertions(+)
> > >
> > > diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
> > > index c233aae5eac90..019fd248cf749 100644
> > > --- a/include/uapi/linux/rseq.h
> > > +++ b/include/uapi/linux/rseq.h
> > > @@ -58,6 +58,10 @@ struct rseq_cs {
> > > * contained within a single cache-line.
> > > *
> > > * A single struct rseq per thread is allowed.
> > > + *
> > > + * If struct rseq or struct rseq_cs is used with Memory Protection Keys,
> > > + * then the assigned pkey should either be accessible whenever these structs
> > > + * are registered/installed, or they should be protected with pkey 0.
> > > */
> > > struct rseq {
> > > /*
> > > diff --git a/kernel/rseq.c b/kernel/rseq.c
> > > index b7a1ec327e811..88fc8cb789b3b 100644
> > > --- a/kernel/rseq.c
> > > +++ b/kernel/rseq.c
> > > @@ -10,6 +10,7 @@
> > >
> > > #include <linux/sched.h>
> > > #include <linux/uaccess.h>
> > > +#include <linux/pkeys.h>
> > > #include <linux/syscalls.h>
> > > #include <linux/rseq.h>
> > > #include <linux/types.h>
> > > @@ -424,11 +425,19 @@ static int rseq_ip_fixup(struct pt_regs *regs)
> > > void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
> > > {
> > > struct task_struct *t = current;
> > > + pkey_reg_t saved_pkey;
> > > int ret, sig;
> > >
> > > if (unlikely(t->flags & PF_EXITING))
> > > return;
> > >
> > > + /*
> > > + * Enable access to the default (0) pkey in case the thread has
> > > + * currently disabled access to it and struct rseq/rseq_cs has
> > > + * 0 pkey assigned (the only supported value for now).
> > > + */
> > > + saved_pkey = enable_zero_pkey_val();
> > > +
> > > /*
> > > * regs is NULL if and only if the caller is in a syscall path. Skip
> > > * fixup and leave rseq_cs as is so that rseq_sycall() will detect and
> > > @@ -441,9 +450,11 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
> > > }
> > > if (unlikely(rseq_update_cpu_node_id(t)))
> > > goto error;
> > > + write_pkey_val(saved_pkey);
> > > return;
> > >
> > > error:
> > > + write_pkey_val(saved_pkey);
> > > sig = ksig ? ksig->sig : 0;
> > > force_sigsegv(sig);
> > > }
> > > --
> > > 2.49.0.1143.g0be31eac6b-goog
> > >
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v7 3/4] rseq: Make rseq work with protection keys
2025-07-18 9:01 ` Dmitry Vyukov
@ 2025-07-21 13:25 ` Mathieu Desnoyers
2025-07-21 17:41 ` Dave Hansen
0 siblings, 1 reply; 22+ messages in thread
From: Mathieu Desnoyers @ 2025-07-21 13:25 UTC (permalink / raw)
To: Dmitry Vyukov, dave.hansen
Cc: Paul E. McKenney, x86, linux-kernel, Ingo Molnar,
Stephen Röttger, peterz, boqun.feng, tglx, mingo, bp, hpa,
aruna.ramakrishna, elver
On 2025-07-18 05:01, Dmitry Vyukov wrote:
> On Tue, 24 Jun 2025 at 11:17, Dmitry Vyukov <dvyukov@google.com> wrote:
>>>> If an application registers rseq, and ever switches to another pkey
>>>> protection (such that the rseq becomes inaccessible), then any
>>>> context switch will cause failure in __rseq_handle_notify_resume()
>>>> attempting to read/write struct rseq and/or rseq_cs. Since context
>>>> switches are asynchronous and are outside of the application control
>>>> (not part of the restricted code scope), temporarily switch to
>>>> pkey value that allows access to the 0 (default) PKEY.
>>>>
>>>> Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
>>>> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>>> Cc: "Paul E. McKenney" <paulmck@kernel.org>
>>>> Cc: Boqun Feng <boqun.feng@gmail.com>
>>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>>> Cc: Ingo Molnar <mingo@redhat.com>
>>>> Cc: Borislav Petkov <bp@alien8.de>
>>>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>>>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>>>> Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
>>>> Cc: x86@kernel.org
>>>> Cc: linux-kernel@vger.kernel.org
>>>> Fixes: d7822b1e24f2 ("rseq: Introduce restartable sequences system call")
>>>
>>> Dave, can you please ack this patch? Ingo said he was waiting for your
>>> review before taking this to -tip.
>>
>> Are there any remaining concerns with this series? If not, Thomas,
>> Ingo, can you please take this to -tip tree?
>
> Gentle ping. What needs to happen for this series to be merged?
This series looks OK from my perspective. I think the last piece that
was missing was to get a review from Dave Hansen.
Dave ?
Thanks,
Mathieu
>
>
>>>> ---
>>>> Changes in v7:
>>>> - Added Mathieu's Reviewed-by
>>>>
>>>> Changes in v6:
>>>> - Added a comment to struct rseq with MPK rules
>>>>
>>>> Changes in v4:
>>>> - Added Fixes tag
>>>>
>>>> Changes in v3:
>>>> - simplify control flow to always enable access to 0 pkey
>>>>
>>>> Changes in v2:
>>>> - fixed typos and reworded the comment
>>>> ---
>>>> include/uapi/linux/rseq.h | 4 ++++
>>>> kernel/rseq.c | 11 +++++++++++
>>>> 2 files changed, 15 insertions(+)
>>>>
>>>> diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
>>>> index c233aae5eac90..019fd248cf749 100644
>>>> --- a/include/uapi/linux/rseq.h
>>>> +++ b/include/uapi/linux/rseq.h
>>>> @@ -58,6 +58,10 @@ struct rseq_cs {
>>>> * contained within a single cache-line.
>>>> *
>>>> * A single struct rseq per thread is allowed.
>>>> + *
>>>> + * If struct rseq or struct rseq_cs is used with Memory Protection Keys,
>>>> + * then the assigned pkey should either be accessible whenever these structs
>>>> + * are registered/installed, or they should be protected with pkey 0.
>>>> */
>>>> struct rseq {
>>>> /*
>>>> diff --git a/kernel/rseq.c b/kernel/rseq.c
>>>> index b7a1ec327e811..88fc8cb789b3b 100644
>>>> --- a/kernel/rseq.c
>>>> +++ b/kernel/rseq.c
>>>> @@ -10,6 +10,7 @@
>>>>
>>>> #include <linux/sched.h>
>>>> #include <linux/uaccess.h>
>>>> +#include <linux/pkeys.h>
>>>> #include <linux/syscalls.h>
>>>> #include <linux/rseq.h>
>>>> #include <linux/types.h>
>>>> @@ -424,11 +425,19 @@ static int rseq_ip_fixup(struct pt_regs *regs)
>>>> void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
>>>> {
>>>> struct task_struct *t = current;
>>>> + pkey_reg_t saved_pkey;
>>>> int ret, sig;
>>>>
>>>> if (unlikely(t->flags & PF_EXITING))
>>>> return;
>>>>
>>>> + /*
>>>> + * Enable access to the default (0) pkey in case the thread has
>>>> + * currently disabled access to it and struct rseq/rseq_cs has
>>>> + * 0 pkey assigned (the only supported value for now).
>>>> + */
>>>> + saved_pkey = enable_zero_pkey_val();
>>>> +
>>>> /*
>>>> * regs is NULL if and only if the caller is in a syscall path. Skip
>>>> * fixup and leave rseq_cs as is so that rseq_sycall() will detect and
>>>> @@ -441,9 +450,11 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
>>>> }
>>>> if (unlikely(rseq_update_cpu_node_id(t)))
>>>> goto error;
>>>> + write_pkey_val(saved_pkey);
>>>> return;
>>>>
>>>> error:
>>>> + write_pkey_val(saved_pkey);
>>>> sig = ksig ? ksig->sig : 0;
>>>> force_sigsegv(sig);
>>>> }
>>>> --
>>>> 2.49.0.1143.g0be31eac6b-goog
>>>>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v7 3/4] rseq: Make rseq work with protection keys
2025-07-21 13:25 ` Mathieu Desnoyers
@ 2025-07-21 17:41 ` Dave Hansen
2025-08-21 15:12 ` Dmitry Vyukov
0 siblings, 1 reply; 22+ messages in thread
From: Dave Hansen @ 2025-07-21 17:41 UTC (permalink / raw)
To: Mathieu Desnoyers, Dmitry Vyukov, dave.hansen
Cc: Paul E. McKenney, x86, linux-kernel, Ingo Molnar,
Stephen Röttger, peterz, boqun.feng, tglx, mingo, bp, hpa,
aruna.ramakrishna, elver
On 7/21/25 06:25, Mathieu Desnoyers wrote:
> This series looks OK from my perspective. I think the last piece that
> was missing was to get a review from Dave Hansen.
>
> Dave ?
It looks fine to me. I think the best thing is if you folks send it in
as an rseq fix. I'm OK with the x86 bits. For the series:
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v7 3/4] rseq: Make rseq work with protection keys
2025-07-21 17:41 ` Dave Hansen
@ 2025-08-21 15:12 ` Dmitry Vyukov
2025-09-19 13:07 ` Dmitry Vyukov
0 siblings, 1 reply; 22+ messages in thread
From: Dmitry Vyukov @ 2025-08-21 15:12 UTC (permalink / raw)
To: Ingo Molnar
Cc: Mathieu Desnoyers, dave.hansen, Paul E. McKenney, x86,
linux-kernel, Stephen Röttger, peterz, boqun.feng, tglx,
mingo, bp, hpa, aruna.ramakrishna, elver, Dave Hansen
On Mon, 21 Jul 2025 at 10:41, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 7/21/25 06:25, Mathieu Desnoyers wrote:
> > This series looks OK from my perspective. I think the last piece that
> > was missing was to get a review from Dave Hansen.
> >
> > Dave ?
>
> It looks fine to me. I think the best thing is if you folks send it in
> as an rseq fix. I'm OK with the x86 bits. For the series:
>
> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Ingo,
Now both Mathieu and Dave reviewed this series.
Can you please take it to your tree? Or suggest who can take it?
Thanks
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v7 3/4] rseq: Make rseq work with protection keys
2025-08-21 15:12 ` Dmitry Vyukov
@ 2025-09-19 13:07 ` Dmitry Vyukov
2025-09-22 13:06 ` Mathieu Desnoyers
0 siblings, 1 reply; 22+ messages in thread
From: Dmitry Vyukov @ 2025-09-19 13:07 UTC (permalink / raw)
To: Ingo Molnar, x86, mingo, bp
Cc: Mathieu Desnoyers, dave.hansen, Paul E. McKenney, linux-kernel,
Stephen Röttger, peterz, boqun.feng, tglx, hpa,
aruna.ramakrishna, elver, Dave Hansen
On Thu, 21 Aug 2025 at 17:12, Dmitry Vyukov <dvyukov@google.com> wrote:
>
> On Mon, 21 Jul 2025 at 10:41, Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 7/21/25 06:25, Mathieu Desnoyers wrote:
> > > This series looks OK from my perspective. I think the last piece that
> > > was missing was to get a review from Dave Hansen.
> > >
> > > Dave ?
> >
> > It looks fine to me. I think the best thing is if you folks send it in
> > as an rseq fix. I'm OK with the x86 bits. For the series:
> >
> > Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
>
> Ingo,
>
> Now both Mathieu and Dave reviewed this series.
>
> Can you please take it to your tree? Or suggest who can take it?
>
> Thanks
Ingo, Borislav (not sure who else maintaining tip tree),
Please merge this change into the tip tree.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v7 3/4] rseq: Make rseq work with protection keys
2025-09-19 13:07 ` Dmitry Vyukov
@ 2025-09-22 13:06 ` Mathieu Desnoyers
0 siblings, 0 replies; 22+ messages in thread
From: Mathieu Desnoyers @ 2025-09-22 13:06 UTC (permalink / raw)
To: tglx
Cc: dave.hansen, Paul E. McKenney, linux-kernel, Stephen Röttger,
peterz, boqun.feng, hpa, aruna.ramakrishna, elver, Dave Hansen,
Dmitry Vyukov, Ingo Molnar, x86, mingo, bp
On 2025-09-19 09:07, Dmitry Vyukov wrote:
> On Thu, 21 Aug 2025 at 17:12, Dmitry Vyukov <dvyukov@google.com> wrote:
>>
>> On Mon, 21 Jul 2025 at 10:41, Dave Hansen <dave.hansen@intel.com> wrote:
>>>
>>> On 7/21/25 06:25, Mathieu Desnoyers wrote:
>>>> This series looks OK from my perspective. I think the last piece that
>>>> was missing was to get a review from Dave Hansen.
>>>>
>>>> Dave ?
>>>
>>> It looks fine to me. I think the best thing is if you folks send it in
>>> as an rseq fix. I'm OK with the x86 bits. For the series:
>>>
>>> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
>>
>> Ingo,
>>
>> Now both Mathieu and Dave reviewed this series.
>>
>> Can you please take it to your tree? Or suggest who can take it?
>>
>> Thanks
>
> Ingo, Borislav (not sure who else maintaining tip tree),
>
> Please merge this change into the tip tree.
Hi Thomas,
This series from Dmitry is at v7:
https://lore.kernel.org/lkml/cover.1747817128.git.dvyukov@google.com/
The initial version from February 2025 can be found here:
https://lore.kernel.org/lkml/ffd123bb0c73df5cdd3a5807b360bd390983150b.1739790300.git.dvyukov@google.com/
What are the next steps to get it upstream ? We have not heard back
from the tip tree maintainers since June 2025 after repeated pings
from Dmitry.
Note: it will likely conflict with your rseq rework.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v7 3/4] rseq: Make rseq work with protection keys
2025-05-21 8:47 ` [PATCH v7 3/4] rseq: Make rseq work with protection keys Dmitry Vyukov
2025-05-21 8:59 ` Dmitry Vyukov
@ 2025-10-20 13:46 ` Kevin Brodsky
2025-11-26 0:45 ` Thomas Gleixner
1 sibling, 1 reply; 22+ messages in thread
From: Kevin Brodsky @ 2025-10-20 13:46 UTC (permalink / raw)
To: Dmitry Vyukov, mathieu.desnoyers, peterz, boqun.feng, tglx, mingo,
bp, dave.hansen, hpa, aruna.ramakrishna, elver
Cc: Paul E. McKenney, x86, linux-kernel, Florian Weimer
+Florian Weimer
On 21/05/2025 10:47, Dmitry Vyukov wrote:
> If an application registers rseq, and ever switches to another pkey
> protection (such that the rseq becomes inaccessible), then any
> context switch will cause failure in __rseq_handle_notify_resume()
> attempting to read/write struct rseq and/or rseq_cs. Since context
> switches are asynchronous and are outside of the application control
> (not part of the restricted code scope), temporarily switch to
> pkey value that allows access to the 0 (default) PKEY.
>
> Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
> Cc: x86@kernel.org
> Cc: linux-kernel@vger.kernel.org
> Fixes: d7822b1e24f2 ("rseq: Introduce restartable sequences system call")
>
> ---
> Changes in v7:
> - Added Mathieu's Reviewed-by
>
> Changes in v6:
> - Added a comment to struct rseq with MPK rules
>
> Changes in v4:
> - Added Fixes tag
>
> Changes in v3:
> - simplify control flow to always enable access to 0 pkey
>
> Changes in v2:
> - fixed typos and reworded the comment
> ---
> include/uapi/linux/rseq.h | 4 ++++
> kernel/rseq.c | 11 +++++++++++
> 2 files changed, 15 insertions(+)
>
> diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
> index c233aae5eac90..019fd248cf749 100644
> --- a/include/uapi/linux/rseq.h
> +++ b/include/uapi/linux/rseq.h
> @@ -58,6 +58,10 @@ struct rseq_cs {
> * contained within a single cache-line.
> *
> * A single struct rseq per thread is allowed.
> + *
> + * If struct rseq or struct rseq_cs is used with Memory Protection Keys,
> + * then the assigned pkey should either be accessible whenever these structs
> + * are registered/installed, or they should be protected with pkey 0.
I think it's worth pointing out that every case of async uaccess seems
to be handled differently w.r.t. pkeys. Some interesting cases:
* During signal delivery, the pkey register is reset to permissive
before writing to the signal stack (this is the logic that patch 2
touches in fact). This is handled in the same way on x86 [1] and arm64 [2].
* AFAICT io_uring worker threads inherit the user thread's pkey register
on x86, since they are forked without setting PF_KTHREAD. OTOH on arm64
the pkey register is reset to the init value (RW access to pkey 0 only)
for all non-user threads [3].
* Now with this patch accesses to struct rseq are made with whatever the
pkey register is set to when the thread is interrupted + RW access for
pkey 0.
This is clearly a difficult problem, since no approach will satisfy all
users. I believe a new API would be desirable to allow users to
configure this; see the thread at [4] regarding signal handlers.
In any case, it seems best to ignore the value of the pkey register at
the point where the thread is interrupted, because userspace has no
control on that value and it could lead to spurious faults (depending on
when the interruption happens). For rseq, maybe the most logical option
would be to set the pkey register to permissive in
__rseq_handle_notify_resume(), because there is nothing sensible to
check at that point. Checking could be done at the point of registration
instead.
- Kevin
[1]
https://lore.kernel.org/lkml/20240802061318.2140081-1-aruna.ramakrishna@oracle.com/
[2]
https://lore.kernel.org/all/20241023150511.3923558-1-kevin.brodsky@arm.com/
[3] https://lore.kernel.org/all/20241001133618.1547996-2-joey.gouly@arm.com/
[4]
https://inbox.sourceware.org/libc-alpha/87jza5pxmx.fsf@oldenburg.str.redhat.com/
> */
> struct rseq {
> /*
> diff --git a/kernel/rseq.c b/kernel/rseq.c
> index b7a1ec327e811..88fc8cb789b3b 100644
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -10,6 +10,7 @@
>
> #include <linux/sched.h>
> #include <linux/uaccess.h>
> +#include <linux/pkeys.h>
> #include <linux/syscalls.h>
> #include <linux/rseq.h>
> #include <linux/types.h>
> @@ -424,11 +425,19 @@ static int rseq_ip_fixup(struct pt_regs *regs)
> void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
> {
> struct task_struct *t = current;
> + pkey_reg_t saved_pkey;
> int ret, sig;
>
> if (unlikely(t->flags & PF_EXITING))
> return;
>
> + /*
> + * Enable access to the default (0) pkey in case the thread has
> + * currently disabled access to it and struct rseq/rseq_cs has
> + * 0 pkey assigned (the only supported value for now).
> + */
> + saved_pkey = enable_zero_pkey_val();
> +
> /*
> * regs is NULL if and only if the caller is in a syscall path. Skip
> * fixup and leave rseq_cs as is so that rseq_sycall() will detect and
> @@ -441,9 +450,11 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
> }
> if (unlikely(rseq_update_cpu_node_id(t)))
> goto error;
> + write_pkey_val(saved_pkey);
> return;
>
> error:
> + write_pkey_val(saved_pkey);
> sig = ksig ? ksig->sig : 0;
> force_sigsegv(sig);
> }
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v7 3/4] rseq: Make rseq work with protection keys
2025-10-20 13:46 ` Kevin Brodsky
@ 2025-11-26 0:45 ` Thomas Gleixner
2025-11-26 9:32 ` Florian Weimer
0 siblings, 1 reply; 22+ messages in thread
From: Thomas Gleixner @ 2025-11-26 0:45 UTC (permalink / raw)
To: Kevin Brodsky, Dmitry Vyukov, mathieu.desnoyers, peterz,
boqun.feng, mingo, bp, dave.hansen, hpa, aruna.ramakrishna, elver
Cc: Paul E. McKenney, x86, linux-kernel, Florian Weimer
On Mon, Oct 20 2025 at 15:46, Kevin Brodsky wrote:
> +Florian Weimer
>> * A single struct rseq per thread is allowed.
>> + *
>> + * If struct rseq or struct rseq_cs is used with Memory Protection Keys,
>> + * then the assigned pkey should either be accessible whenever these structs
>> + * are registered/installed, or they should be protected with pkey 0.
>
> I think it's worth pointing out that every case of async uaccess seems
> to be handled differently w.r.t. pkeys. Some interesting cases:
>
> * During signal delivery, the pkey register is reset to permissive
> before writing to the signal stack (this is the logic that patch 2
> touches in fact). This is handled in the same way on x86 [1] and arm64 [2].
>
> * AFAICT io_uring worker threads inherit the user thread's pkey register
> on x86, since they are forked without setting PF_KTHREAD. OTOH on arm64
> the pkey register is reset to the init value (RW access to pkey 0 only)
> for all non-user threads [3].
>
> * Now with this patch accesses to struct rseq are made with whatever the
> pkey register is set to when the thread is interrupted + RW access for
> pkey 0.
>
> This is clearly a difficult problem, since no approach will satisfy all
> users. I believe a new API would be desirable to allow users to
> configure this; see the thread at [4] regarding signal handlers.
>
> In any case, it seems best to ignore the value of the pkey register at
> the point where the thread is interrupted, because userspace has no
> control on that value and it could lead to spurious faults (depending on
> when the interruption happens). For rseq, maybe the most logical option
> would be to set the pkey register to permissive in
> __rseq_handle_notify_resume(), because there is nothing sensible to
> check at that point. Checking could be done at the point of registration
> instead.
That's all broken. Assume:
1) process starts with pkey 0 (default)
2) glibc creates TLS (protected by pkey 0)
3) main() switches to protection pkey 1
If the switch to pkey 1 does not ensure that TLS (where RSEQ sits) is
accessible by pkey 1, then how is userspace able to survive?
You then do not even need the help of the kernel to die. If the process
accesses TLS it dies on it's own.
As each map can only be attached to a single pkey, the only solutions for
this are:
1) enable both pkey0 and pkey1
or
2) ensure that everything which was mapped _before_ main() changes
the protection key is covered by pkey1 when it disables pkey0.
For that to pull off you need #1 temporary to finish the
transition of _all_ affected mappings via pkey_mprotect()
No?
And no matter how many PKEY hacks you sprinkle into the kernel none of
them will solve all of the related problems.
If user space registers shared memory with the kernel (that's what RSEQ
is), user space has to make sure that it is accessible under it's own
PKEY constraints. If it fails to do so, it dies rightfully whether in
user space or in kernel space does not matter.
The only exception is the alternative signal stack, which can be set up
as an catch all sink which is not covered by the currently active PKEY
set on purpose.
And looking at how x86 handles this it gets it actually wrong because it
enables all keys independent of the stack target it delivers to while
this should be strictly restricted to the alternative stack, but that's
a different discussion to have.
So no, these patches are going nowhere because they try to paper over a
fundamental bug in the application and the understanding of how all of
this protection key mechanism actualy can work.
Just let me quote the cover letter to prove my point:
"If an application registers rseq, and ever switches to another pkey
protection (such that the rseq becomes inaccessible), then any
context switch will cause failure in __rseq_handle_notify_resume()
attempting to read/write struct rseq and/or rseq_cs. Since context
switches are asynchronous and are outside of the application control
(not part of the restricted code scope), temporarily enable access
to 0 (default) PKEY to read/write rseq/rseq_cs.
0 is the only PKEY supported for rseq for now."
Clearly that does not even notice the fact that switching to pkey 1
prevents access to the TLS. Fact is that RSEQ is part of the TLS whether
you like it or not. See above.
"Theoretically other PKEYs can be supported, but it's unclear
how/if that can work. So for now we don't support that to simplify
code."
Clearly tells me that this is attempt to scratch an itch which was not
understood at all.
There are exactly two options to solve this in user space and there is
zero justification to inflict any of this nonsense on the kernel.
Thanks,
tglx
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v7 3/4] rseq: Make rseq work with protection keys
2025-11-26 0:45 ` Thomas Gleixner
@ 2025-11-26 9:32 ` Florian Weimer
2025-11-26 17:56 ` Thomas Gleixner
0 siblings, 1 reply; 22+ messages in thread
From: Florian Weimer @ 2025-11-26 9:32 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Kevin Brodsky, Dmitry Vyukov, mathieu.desnoyers, peterz,
boqun.feng, mingo, bp, dave.hansen, hpa, aruna.ramakrishna, elver,
Paul E. McKenney, x86, linux-kernel
* Thomas Gleixner:
> That's all broken. Assume:
>
> 1) process starts with pkey 0 (default)
> 2) glibc creates TLS (protected by pkey 0)
> 3) main() switches to protection pkey 1
>
> If the switch to pkey 1 does not ensure that TLS (where RSEQ sits) is
> accessible by pkey 1, then how is userspace able to survive?
>
> You then do not even need the help of the kernel to die. If the process
> accesses TLS it dies on it's own.
Signals have the same problem. With the x86 approach to disable all
access, protection keys are not really usable without tight control over
all code in the process. This behavior breaks encapsulation.
I'm less concerned about the impact on restart of restartable sequences
because by design, it's a non-modular feature: syscalls and function
calls are already banned. If the code wants to restart, it has to make
sure that the access rights at the restart point are correct. But
that's like any other register contents, I think.
In the other direction, code that sets a restrictive access mask is
already not allowed to call into arbitrary code. For example, we could
use protection keys internally within glibc in the dynamic linker and
require that a key that we allocated retains read access.
Unfortunately, there's a use case for singleton access rights that does
not include key 0: validate that a pointer points to memory colored in a
specific way (e.g, for vtables, or for bytecode).
If the kernel/scheduler cannot bypass restrictions on access key 0, then
supporting this kind of memory color check is rather difficult because
userspace would always have to put key 0 into the accessible set.
Would it help to allocate a dedicated key for rseq and specify that
userspace must always include this access in the accessible set?
In glibc, we cannot easily set a different key for the TLS area today
because it's not necessarily on an isolated page on which we could call
pkey_mprotect. We plan to fix this next year, but it's not a trivial
change.
On the other hand, I get the idea that protection keys are pretty dead.
So far, I couldn't get the x86 signal issue fixed in the kernel, so we
can't use them for glibc hardening. AArch64 duplicated the x86
behavior, too. And POWER removed protection key support with the switch
to the radix MMU.
Thanks,
Florian
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v7 3/4] rseq: Make rseq work with protection keys
2025-11-26 9:32 ` Florian Weimer
@ 2025-11-26 17:56 ` Thomas Gleixner
2025-11-26 19:06 ` Florian Weimer
2025-12-02 19:19 ` Kevin Brodsky
0 siblings, 2 replies; 22+ messages in thread
From: Thomas Gleixner @ 2025-11-26 17:56 UTC (permalink / raw)
To: Florian Weimer
Cc: Kevin Brodsky, Dmitry Vyukov, mathieu.desnoyers, peterz,
boqun.feng, mingo, bp, dave.hansen, hpa, aruna.ramakrishna, elver,
Paul E. McKenney, x86, linux-kernel, Jens Axboe
On Wed, Nov 26 2025 at 10:32, Florian Weimer wrote:
> * Thomas Gleixner:
>
>> That's all broken. Assume:
>>
>> 1) process starts with pkey 0 (default)
>> 2) glibc creates TLS (protected by pkey 0)
>> 3) main() switches to protection pkey 1
>>
>> If the switch to pkey 1 does not ensure that TLS (where RSEQ sits) is
>> accessible by pkey 1, then how is userspace able to survive?
>>
>> You then do not even need the help of the kernel to die. If the process
>> accesses TLS it dies on it's own.
>
> Signals have the same problem. With the x86 approach to disable all
> access, protection keys are not really usable without tight control over
> all code in the process. This behavior breaks encapsulation.
It enables PKEY0, but I agree that the signal muck is broken. See below.
> I'm less concerned about the impact on restart of restartable sequences
> because by design, it's a non-modular feature: syscalls and function
> calls are already banned. If the code wants to restart, it has to make
> sure that the access rights at the restart point are correct. But
> that's like any other register contents, I think.
It's not only restart. RSEQ is also accessed by the kernel for storing
CPUID, NODEID, CID. Some of that is used in glibc today, no?
> In the other direction, code that sets a restrictive access mask is
> already not allowed to call into arbitrary code. For example, we could
> use protection keys internally within glibc in the dynamic linker and
> require that a key that we allocated retains read access.
>
> Unfortunately, there's a use case for singleton access rights that does
> not include key 0: validate that a pointer points to memory colored in a
> specific way (e.g, for vtables, or for bytecode).
Fair enough.
> If the kernel/scheduler cannot bypass restrictions on access key 0, then
> supporting this kind of memory color check is rather difficult because
> userspace would always have to put key 0 into the accessible set.
Right, but blindly bypassing restrictions on key 0 is not a real good
solution either. It's just another piece of duct tape.
> Would it help to allocate a dedicated key for rseq and specify that
> userspace must always include this access in the accessible set?
That would definitely be helpful to avoid switching PKRU in rseq
handling code on exit to user space.
Though with the reworked RSEQ code the extra overhead might not be
horrible. See below.
But like with signals just blindly enabling key0 and hope that it works
is not really a solution. Nothing prevents me from disabling RSEQ for
glibc. Then install my own RSEQ page and mprotect it. When that key
becomes disabled in PKRU and the code section is interrupted then exit
to user space will fault and die in exactly the same way as
today. That's progress...
> In glibc, we cannot easily set a different key for the TLS area today
> because it's not necessarily on an isolated page on which we could call
> pkey_mprotect. We plan to fix this next year, but it's not a trivial
> change.
I understand.
> On the other hand, I get the idea that protection keys are pretty dead.
> So far, I couldn't get the x86 signal issue fixed in the kernel, so we
> can't use them for glibc hardening.
Then let's sit down and fix it once and forever.
> AArch64 duplicated the x86 behavior, too. And POWER removed
> protection key support with the switch to the radix MMU.
I'm not sure whether we should declare them dead.
They definitely have a value, but none of this PKEY muck has been really
thought through and we just ended up with a cobbled together ABI and a
hard (impossible for some stuff) to use programming model.
So we really need to sit down and actually define a proper programming
model first instead of trying to duct tape the current ill defined mess
forever.
What do we have to take into account:
1) signals
Broken as we know already.
IMO, the proper solution is to provide a mechanism to register a
set of permissions which are used for signal delivery. The
resulting hardware value should expand the permission, but keep
the current active ones enabled.
That can be kinda kept backwards compatible as the signal perms
would default to PKEY0.
2) rseq
The option of having a separate key which needs to be always
enabled is definitely simple, but it wastes a key just for
that. There are only 16 of them :(
If we solve the signal case with an explicit permission set, we
can just reuse those signal permissions. They are maybe wider than
what's required to access RSEQ, but the signal permissions have to
include the TLS/RSEQ area to actually work.
3) io-uring
"Works" on x86 as the worker inherits the permissions of the task
which creates the worker unless the user memory is not accessible
with the tasks current permissions. That's pretty much preventing
a full isolation of the memory which that worker can access
because the task which creates it must have the keys to access
stack, rseq and whatever enabled to survive the syscall.
Fails on ARM64 if the user memory is not accessible via the
default key, which enforces that stack, rseq and the worker memory
is accessible via the default key. Again no isolation of the
worker memory possible.
I think it should have a mechanism to set the required permissions
explicitly and default to the current behaviour, but that's
solvable within the io uring space I think. Jens?
Thoughts?
Thanks,
tglx
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v7 3/4] rseq: Make rseq work with protection keys
2025-11-26 17:56 ` Thomas Gleixner
@ 2025-11-26 19:06 ` Florian Weimer
2025-11-26 20:52 ` Thomas Gleixner
2025-12-02 19:19 ` Kevin Brodsky
1 sibling, 1 reply; 22+ messages in thread
From: Florian Weimer @ 2025-11-26 19:06 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Kevin Brodsky, Dmitry Vyukov, mathieu.desnoyers, peterz,
boqun.feng, mingo, bp, dave.hansen, hpa, aruna.ramakrishna, elver,
Paul E. McKenney, x86, linux-kernel, Jens Axboe
* Thomas Gleixner:
>> I'm less concerned about the impact on restart of restartable sequences
>> because by design, it's a non-modular feature: syscalls and function
>> calls are already banned. If the code wants to restart, it has to make
>> sure that the access rights at the restart point are correct. But
>> that's like any other register contents, I think.
>
> It's not only restart. RSEQ is also accessed by the kernel for storing
> CPUID, NODEID, CID. Some of that is used in glibc today, no?
But glibc code cannot run from within an rseq critical section. And I
think it's not reasonable to expect that if you revoke access to all
allocated protection keys, it's well-defined t o call library code.
>> Would it help to allocate a dedicated key for rseq and specify that
>> userspace must always include this access in the accessible set?
>
> That would definitely be helpful to avoid switching PKRU in rseq
> handling code on exit to user space.
>
> Though with the reworked RSEQ code the extra overhead might not be
> horrible. See below.
We might have to dedicate an extra page, too. So I prefer to avoid it
possible.
I think I missed the below part?
> But like with signals just blindly enabling key0 and hope that it works
> is not really a solution. Nothing prevents me from disabling RSEQ for
> glibc. Then install my own RSEQ page and mprotect it. When that key
> becomes disabled in PKRU and the code section is interrupted then exit
> to user space will fault and die in exactly the same way as
> today. That's progress...
But does that matter? If I mprotect the stack and a signal arrives,
that results in a crash, too. Some things just don't work.
> So we really need to sit down and actually define a proper programming
> model first instead of trying to duct tape the current ill defined mess
> forever.
>
> What do we have to take into account:
>
> 1) signals
>
> Broken as we know already.
>
> IMO, the proper solution is to provide a mechanism to register a
> set of permissions which are used for signal delivery. The
> resulting hardware value should expand the permission, but keep
> the current active ones enabled.
>
> That can be kinda kept backwards compatible as the signal perms
> would default to PKEY0.
I had validated at one point that this works (although the patch that
enables internal pkeys usage in glibc did not exist back then).
pkeys: Support setting access rights for signal handlers
<https://lore.kernel.org/linux-mm/5fee976a-42d4-d469-7058-b78ad8897219@redhat.com/>
> 2) rseq
>
> The option of having a separate key which needs to be always
> enabled is definitely simple, but it wastes a key just for
> that. There are only 16 of them :(
>
> If we solve the signal case with an explicit permission set, we
> can just reuse those signal permissions. They are maybe wider than
> what's required to access RSEQ, but the signal permissions have to
> include the TLS/RSEQ area to actually work.
Would it address the use case for single-colored memory access? Or
would that still crash if the process gets descheduled while the access
rights register is set to the restricted value?
Thanks,
Florian
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v7 3/4] rseq: Make rseq work with protection keys
2025-11-26 19:06 ` Florian Weimer
@ 2025-11-26 20:52 ` Thomas Gleixner
2025-11-26 22:06 ` Florian Weimer
0 siblings, 1 reply; 22+ messages in thread
From: Thomas Gleixner @ 2025-11-26 20:52 UTC (permalink / raw)
To: Florian Weimer
Cc: Kevin Brodsky, Dmitry Vyukov, mathieu.desnoyers, peterz,
boqun.feng, mingo, bp, dave.hansen, hpa, aruna.ramakrishna, elver,
Paul E. McKenney, x86, linux-kernel, Jens Axboe
On Wed, Nov 26 2025 at 20:06, Florian Weimer wrote:
> * Thomas Gleixner:
>> But like with signals just blindly enabling key0 and hope that it works
>> is not really a solution. Nothing prevents me from disabling RSEQ for
>> glibc. Then install my own RSEQ page and mprotect it. When that key
>> becomes disabled in PKRU and the code section is interrupted then exit
>> to user space will fault and die in exactly the same way as
>> today. That's progress...
>
> But does that matter? If I mprotect the stack and a signal arrives,
> that results in a crash, too. Some things just don't work.
They can be made work when we have a dedicated permission setting for
signals, which can be used for rseq access too. And having the explicit
signal permissions make a lot of sense independent of the above absurd
use case which I just used for illustration.
>> So we really need to sit down and actually define a proper programming
>> model first instead of trying to duct tape the current ill defined mess
>> forever.
>>
>> What do we have to take into account:
>>
>> 1) signals
>>
>> Broken as we know already.
>>
>> IMO, the proper solution is to provide a mechanism to register a
>> set of permissions which are used for signal delivery. The
>> resulting hardware value should expand the permission, but keep
>> the current active ones enabled.
>>
>> That can be kinda kept backwards compatible as the signal perms
>> would default to PKEY0.
>
> I had validated at one point that this works (although the patch that
> enables internal pkeys usage in glibc did not exist back then).
>
> pkeys: Support setting access rights for signal handlers
> <https://lore.kernel.org/linux-mm/5fee976a-42d4-d469-7058-b78ad8897219@redhat.com/>
That looks about right and what I had in mind. Seems I missed that back
in the days and that discussion unfortunately ran into a dead end :(
>> 2) rseq
>>
>> The option of having a separate key which needs to be always
>> enabled is definitely simple, but it wastes a key just for
>> that. There are only 16 of them :(
>>
>> If we solve the signal case with an explicit permission set, we
>> can just reuse those signal permissions. They are maybe wider than
>> what's required to access RSEQ, but the signal permissions have to
>> include the TLS/RSEQ area to actually work.
>
> Would it address the use case for single-colored memory access? Or
> would that still crash if the process gets descheduled while the access
> rights register is set to the restricted value?
It would just work the same way as signals. Assume
signal_perms = [PK0=RW, PK1=R, PK2=RW]
set_pkey(PK0..6=NONE, PK7=R)
access() <- can fault
<- or interrupt can happen
set_pkey(normal)
So when the fault or interrupt results in a signal and/or the return to
user space needs to access RSEQ we have in signal delivery:
cur = pkey_extend(signal_perms);
--> Perms are now [PK0=RW, PK1=R, PK2=RW, PK7=R]
access_user_stack();
....
// Return with the extended permissions to deliver the signal
// Will be restored on sigreturn
and in rseq:
cur = pkey_extend(signal_perms);
--> Perms are now [PK0=RW, PK1=R, PK2=RW, PK7=R]
access_user_rseq();
pkey_set(cur);
If the RSEQ access is nested in the signal delivery return then nothing
happens as the permissions are not changing because they are already
extended: A | A = A :).
The kernel does not care about the PKEY permissions when the user to
kernel transition is due to an interrupt/exception except for the signal
and rseq case.
In fact the above also works with my made up example. Just assume the
RSEQ page is protected by PK2. :)
Syscalls are a different story as copy_to/from_user() obviously requires
the proper permissions and the kernel can rightfully expect that stack
and rseq are accessible, but that's not what we are debating here.
Thanks,
tglx
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v7 3/4] rseq: Make rseq work with protection keys
2025-11-26 20:52 ` Thomas Gleixner
@ 2025-11-26 22:06 ` Florian Weimer
2025-11-27 14:38 ` Thomas Gleixner
0 siblings, 1 reply; 22+ messages in thread
From: Florian Weimer @ 2025-11-26 22:06 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Kevin Brodsky, Dmitry Vyukov, mathieu.desnoyers, peterz,
boqun.feng, mingo, bp, dave.hansen, hpa, aruna.ramakrishna, elver,
Paul E. McKenney, x86, linux-kernel, Jens Axboe
* Thomas Gleixner:
>>> What do we have to take into account:
>>>
>>> 1) signals
>>>
>>> Broken as we know already.
>>>
>>> IMO, the proper solution is to provide a mechanism to register a
>>> set of permissions which are used for signal delivery. The
>>> resulting hardware value should expand the permission, but keep
>>> the current active ones enabled.
>>>
>>> That can be kinda kept backwards compatible as the signal perms
>>> would default to PKEY0.
>>
>> I had validated at one point that this works (although the patch that
>> enables internal pkeys usage in glibc did not exist back then).
>>
>> pkeys: Support setting access rights for signal handlers
>> <https://lore.kernel.org/linux-mm/5fee976a-42d4-d469-7058-b78ad8897219@redhat.com/>
>
> That looks about right and what I had in mind. Seems I missed that back
> in the days and that discussion unfortunately ran into a dead end :(
There was a follow-up where I tried to incorporate the feedback
(PKEY_ALLOC_SIGNALINHERIT), but based on more recent discussions (here
and before that), the original approach referenced above seems
preferable.
>>> 2) rseq
>>>
>>> The option of having a separate key which needs to be always
>>> enabled is definitely simple, but it wastes a key just for
>>> that. There are only 16 of them :(
>>>
>>> If we solve the signal case with an explicit permission set, we
>>> can just reuse those signal permissions. They are maybe wider than
>>> what's required to access RSEQ, but the signal permissions have to
>>> include the TLS/RSEQ area to actually work.
>>
>> Would it address the use case for single-colored memory access? Or
>> would that still crash if the process gets descheduled while the access
>> rights register is set to the restricted value?
>
> It would just work the same way as signals. Assume
>
> signal_perms = [PK0=RW, PK1=R, PK2=RW]
>
> set_pkey(PK0..6=NONE, PK7=R)
>
> access() <- can fault
> <- or interrupt can happen
>
> set_pkey(normal)
>
> So when the fault or interrupt results in a signal and/or the return to
> user space needs to access RSEQ we have in signal delivery:
>
> cur = pkey_extend(signal_perms);
>
> --> Perms are now [PK0=RW, PK1=R, PK2=RW, PK7=R]
>
> access_user_stack();
> ....
> // Return with the extended permissions to deliver the signal
> // Will be restored on sigreturn
>
> and in rseq:
>
> cur = pkey_extend(signal_perms);
>
> --> Perms are now [PK0=RW, PK1=R, PK2=RW, PK7=R]
>
> access_user_rseq();
> pkey_set(cur);
>
> If the RSEQ access is nested in the signal delivery return then nothing
> happens as the permissions are not changing because they are already
> extended: A | A = A :).
Agreed. And the pkey_extend/pkey_set don't have a prohibitive cost, I
assume. I got the impression you were trying to avoid that sequence,
but I think it's more about defining the way pkey_extend works.
There's an unmerged glibc patch that allocates a protection key for the
dynamic linker, so we might end up with every process using rseq
(without critical sections) and protection keys.
Thanks,
Florian
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v7 3/4] rseq: Make rseq work with protection keys
2025-11-26 22:06 ` Florian Weimer
@ 2025-11-27 14:38 ` Thomas Gleixner
0 siblings, 0 replies; 22+ messages in thread
From: Thomas Gleixner @ 2025-11-27 14:38 UTC (permalink / raw)
To: Florian Weimer
Cc: Kevin Brodsky, Dmitry Vyukov, mathieu.desnoyers, peterz,
boqun.feng, mingo, bp, dave.hansen, hpa, aruna.ramakrishna, elver,
Paul E. McKenney, x86, linux-kernel, Jens Axboe
On Wed, Nov 26 2025 at 23:06, Florian Weimer wrote:
> * Thomas Gleixner:
>> cur = pkey_extend(signal_perms);
>>
>> --> Perms are now [PK0=RW, PK1=R, PK2=RW, PK7=R]
>>
>> access_user_rseq();
>> pkey_set(cur);
>>
>> If the RSEQ access is nested in the signal delivery return then nothing
>> happens as the permissions are not changing because they are already
>> extended: A | A = A :).
>
> Agreed. And the pkey_extend/pkey_set don't have a prohibitive cost, I
> assume. I got the impression you were trying to avoid that sequence,
> but I think it's more about defining the way pkey_extend works.
Correct.
I was trying to avoid doing it on the exit to user path, but after
thinking more about it, I'm not so worried about it anymore.
With the recent rewrite of the RSEQ handling in the exit code path that
should more or less vanish in the noise. I'll hack something up and
evaluate the impact.
The real question is the semantics of this mechanism. In your proposed
patch you define it as (if I read it correctly):
1) The default signal protection is the same as the default for fork,
which is PKEY0 unless user space changes it.
2) pkey_alloc() with PKEY_ALLOC_SETSIGNAL set propagates the
protection to the signal mask/permission set
3) The permissions on signal delivery are:
(default & signal_mask) | signal_perms
I'm not entirely sure whether #3 is the right thing to do. My initial
thought was to keep the current permission set at the point where the
task got interrupted and expand it with the signal permissions, i.e:
(current & signal_mask) | signal_perms
Both variants can be argued for, but we have to agree on one them :)
Thanks,
tglx
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v7 3/4] rseq: Make rseq work with protection keys
2025-11-26 17:56 ` Thomas Gleixner
2025-11-26 19:06 ` Florian Weimer
@ 2025-12-02 19:19 ` Kevin Brodsky
1 sibling, 0 replies; 22+ messages in thread
From: Kevin Brodsky @ 2025-12-02 19:19 UTC (permalink / raw)
To: Thomas Gleixner, Florian Weimer
Cc: Dmitry Vyukov, mathieu.desnoyers, peterz, boqun.feng, mingo, bp,
dave.hansen, hpa, aruna.ramakrishna, elver, Paul E. McKenney, x86,
linux-kernel, Jens Axboe, Catalin Marinas
+ Catalin
On 26/11/2025 18:56, Thomas Gleixner wrote:
> [...]
>
>> In the other direction, code that sets a restrictive access mask is
>> already not allowed to call into arbitrary code. For example, we could
>> use protection keys internally within glibc in the dynamic linker and
>> require that a key that we allocated retains read access.
>>
>> Unfortunately, there's a use case for singleton access rights that does
>> not include key 0: validate that a pointer points to memory colored in a
>> specific way (e.g, for vtables, or for bytecode).
> Fair enough.
This goes beyond this singleton pattern: if one wants to isolate
untrusted (e.g. jitted) code by giving it its own pkey, then preventing
access to the default/privileged pkey 0 is essential while executing the
sandboxed code.
> [...]
>
>> On the other hand, I get the idea that protection keys are pretty dead.
>> So far, I couldn't get the x86 signal issue fixed in the kernel, so we
>> can't use them for glibc hardening.
> Then let's sit down and fix it once and forever.
That's all I've been hoping for :)
>> AArch64 duplicated the x86 behavior, too. And POWER removed
>> protection key support with the switch to the radix MMU.
> I'm not sure whether we should declare them dead.
>
> They definitely have a value, but none of this PKEY muck has been really
> thought through and we just ended up with a cobbled together ABI and a
> hard (impossible for some stuff) to use programming model.
Agreed, all that has been done so far in terms of async uaccess (such as
writing the signal frame) is working around the problem rather than
providing an actual solution.
> So we really need to sit down and actually define a proper programming
> model first instead of trying to duct tape the current ill defined mess
> forever.
>
> What do we have to take into account:
>
> 1) signals
>
> Broken as we know already.
And unnecessarily complicated, as we use two different pkey register
values: one to write the signal frame (no restriction), one to invoke
the signal handler (pkru_init / pkey 0 on arm64). A big piece of duct
tape :)
> IMO, the proper solution is to provide a mechanism to register a
> set of permissions which are used for signal delivery. The
> resulting hardware value should expand the permission, but keep
> the current active ones enabled.
It feels like there are really two situations to consider, as you
mentioned in your first reply:
a. Delivering the signal on the regular stack. This should work without
intervention: the interrupted context should be able to write to its own
stack pointer.
b. Delivering on the alternate signal stack. There is no expectation or
requirement that the interrupted context is able to access it, so we
need to update the pkey register.
In this specific case I feel that signal handlers should be able to rely
on a consistent ABI regardless of SA_ONSTACK. Preserving the register
value and expanding it with user-defined permissions seems like a
reasonable compromise.
> That can be kinda kept backwards compatible as the signal perms
> would default to PKEY0.
If anyone uses the current cobbled-up mechanism [1], i.e. uses some
assembly trampoline to switch the pkey register before invoking the
actual signal handler on the alternate signal stack (pkey != 0), then it
would break. I doubt there are many users if at all, though.
[1]
https://lore.kernel.org/lkml/CABi2SkWxNkP2O7ipkP67WKz0-LV33e5brReevTTtba6oKUfHRw@mail.gmail.com/
> 2) rseq
>
> The option of having a separate key which needs to be always
> enabled is definitely simple, but it wastes a key just for
> that. There are only 16 of them :(
And even fewer on arm64, just 8 :/ I don't think reserving a pkey just
for rseq is a reasonable option.
> If we solve the signal case with an explicit permission set, we
> can just reuse those signal permissions. They are maybe wider than
> what's required to access RSEQ, but the signal permissions have to
> include the TLS/RSEQ area to actually work.
But as you mentioned further up, nothing prevents the user from
bypassing glibc and using rseq directly. It seems rather strange to bake
in such assumption in the kernel.
Because accesses to the rseq struct are truly asynchronous and unrelated
to the interrupted context, I do not think that inheriting the
interrupted pkey register makes sense in that case. We could have a
separate mechanism to set that value, or maybe use the same value as
when the struct was registered (surely the context that called
rseq(&rseq_struct) must have access to rseq_struct).
> 3) io-uring
>
> "Works" on x86 as the worker inherits the permissions of the task
> which creates the worker unless the user memory is not accessible
> with the tasks current permissions. That's pretty much preventing
> a full isolation of the memory which that worker can access
> because the task which creates it must have the keys to access
> stack, rseq and whatever enabled to survive the syscall.
>
> Fails on ARM64 if the user memory is not accessible via the
> default key, which enforces that stack, rseq and the worker memory
> is accessible via the default key. Again no isolation of the
> worker memory possible.
>
> I think it should have a mechanism to set the required permissions
> explicitly and default to the current behaviour, but that's
> solvable within the io uring space I think. Jens?
This case seems very similar to rseq to me, but we may want more control
as io_uring allows to perform a wide range of accesses.
There are other situations where async uaccess occurs, and I think those
should be considered too. Users of __copy_from_user_inatomic() and
functions based on it such as copy_from_user_nofault() often fall in
that category.
From what I've gathered so far, most of these situations don't look too
concerning:
- Functions that inspect the stack (perf_callchain_user(),
arch_stack_walk_user()). The access is asynchronous, but like in the
signal delivery case it seems safe to assume that the stack is
accessible in any context.
- uprobe tracing allows reading memory at some arbitrary address. This
feels like a synchronous situation to me: when a particular function is
called, read some memory. The access will fail if prevented by the pkey
register, but that is exactly the same as if the function itself
accessed the memory.
One situation looks a lot more concerning:
- There are BPF helpers to read/write user memory (bpf_copy_from_user,
bpf_probe_read_user, etc.). These accesses do not necessarily occur in
any given user context AFAIU. Does it really make sense to apply pkey
restrictions in that case?
> Thoughts?
Thanks for getting the ball rolling! I really hope we can settle on a
consistent way to handle the pkey register for all those uaccess calls
that are not directly initiated by userspace.
- Kevin
^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2025-12-02 19:19 UTC | newest]
Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-21 8:47 [PATCH v7 0/4] rseq: Make rseq work with protection keys Dmitry Vyukov
2025-05-21 8:47 ` [PATCH v7 1/4] pkeys: add API to switch to permissive/zero pkey register Dmitry Vyukov
2025-05-21 8:47 ` [PATCH v7 2/4] x86/signal: Use write_permissive_pkey_val() helper Dmitry Vyukov
2025-05-21 8:47 ` [PATCH v7 3/4] rseq: Make rseq work with protection keys Dmitry Vyukov
2025-05-21 8:59 ` Dmitry Vyukov
2025-06-24 9:17 ` Dmitry Vyukov
2025-07-18 9:01 ` Dmitry Vyukov
2025-07-21 13:25 ` Mathieu Desnoyers
2025-07-21 17:41 ` Dave Hansen
2025-08-21 15:12 ` Dmitry Vyukov
2025-09-19 13:07 ` Dmitry Vyukov
2025-09-22 13:06 ` Mathieu Desnoyers
2025-10-20 13:46 ` Kevin Brodsky
2025-11-26 0:45 ` Thomas Gleixner
2025-11-26 9:32 ` Florian Weimer
2025-11-26 17:56 ` Thomas Gleixner
2025-11-26 19:06 ` Florian Weimer
2025-11-26 20:52 ` Thomas Gleixner
2025-11-26 22:06 ` Florian Weimer
2025-11-27 14:38 ` Thomas Gleixner
2025-12-02 19:19 ` Kevin Brodsky
2025-05-21 8:47 ` [PATCH v7 4/4] selftests/rseq: Add test for rseq+pkeys Dmitry Vyukov
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox