linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH v2 0/5] Improve arm64 pkeys handling in signal delivery
@ 2024-10-23 15:05 Kevin Brodsky
  2024-10-23 15:05 ` [PATCH v2 1/5] arm64: signal: Remove unused macro Kevin Brodsky
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Kevin Brodsky @ 2024-10-23 15:05 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Kevin Brodsky, akpm, anshuman.khandual, aruna.ramakrishna,
	broonie, catalin.marinas, dave.hansen, dave.martin, jeffxu,
	joey.gouly, pierre.langlois, shuah, sroettger, will,
	linux-kselftest, x86

This series is a follow-up to Joey's Permission Overlay Extension (POE)
series [1] that recently landed on mainline. The goal is to improve the
way we handle the register that governs which pkeys/POIndex are
accessible (POR_EL0) during signal delivery. As things stand, we may
unexpectedly fail to write the signal frame on the stack because POR_EL0
is not reset before the uaccess operations. See patch 3 for more details
and the main changes this series brings.

A similar series landed recently for x86/MPK [2]; the present series
aims at aligning arm64 with x86. Worth noting: once the signal frame is
written, POR_EL0 is still set to POR_EL0_INIT, granting access to pkey 0
only. This means that a program that sets up an alternate signal stack
with a non-zero pkey will need some assembly trampoline to set POR_EL0
before invoking the real signal handler, as discussed here [3]. This is
not ideal, but it makes experimentation with pkeys in signal handlers
possible while waiting for a potential interface to control the pkey
state when delivering a signal. See Pierre's reply [4] for more
information about use-cases and a potential interface.

The x86 series also added kselftests to ensure that no spurious SIGSEGV
occurs during signal delivery regardless of which pkey is accessible at
the point where the signal is delivered. This series adapts those
kselftests to allow running them on arm64 (patches 4-5).

Finally, patch 2 is a clean-up following feedback on Joey's series [5].

I have tested this series on arm64 and x86_64 (booting and running the
protection_keys and pkey_sighandler_tests mm kselftests).

v1..v2:
* In setup_rt_frame(), ensured that POR_EL0 is reset to its original
  value if we fail to deliver the signal (addresses Catalin's concern [6]).
* Renamed *unpriv_access* to *user_access* in patch 3 (suggestion from
  Dave).
* Made what patches 1-2 do explicit in the commit message body (suggestion
  from Dave).

- Kevin

[1] https://lore.kernel.org/linux-arm-kernel/20240822151113.1479789-1-joey.gouly@arm.com/
[2] https://lore.kernel.org/lkml/20240802061318.2140081-1-aruna.ramakrishna@oracle.com/
[3] https://lore.kernel.org/lkml/CABi2SkWxNkP2O7ipkP67WKz0-LV33e5brReevTTtba6oKUfHRw@mail.gmail.com/
[4] https://lore.kernel.org/linux-arm-kernel/87plns8owh.fsf@arm.com/
[5] https://lore.kernel.org/linux-arm-kernel/20241015114116.GA19334@willie-the-truck/
[6] https://lore.kernel.org/linux-arm-kernel/Zw6D2waVyIwYE7wd@arm.com/

Cc: akpm@linux-foundation.org
Cc: anshuman.khandual@arm.com
Cc: aruna.ramakrishna@oracle.com
Cc: broonie@kernel.org
Cc: catalin.marinas@arm.com
Cc: dave.hansen@linux.intel.com
Cc: dave.martin@arm.com
Cc: jeffxu@chromium.org
Cc: joey.gouly@arm.com
Cc: pierre.langlois@arm.com
Cc: shuah@kernel.org
Cc: sroettger@google.com
Cc: will@kernel.org
Cc: linux-kselftest@vger.kernel.org
Cc: x86@kernel.org


Kevin Brodsky (5):
  arm64: signal: Remove unused macro
  arm64: signal: Remove unnecessary check when saving POE state
  arm64: signal: Improve POR_EL0 handling to avoid uaccess failures
  selftests/mm: Use generic pkey register manipulation
  selftests/mm: Enable pkey_sighandler_tests on arm64

 arch/arm64/kernel/signal.c                    |  95 +++++++++++++---
 tools/testing/selftests/mm/Makefile           |   8 +-
 tools/testing/selftests/mm/pkey-arm64.h       |   1 +
 tools/testing/selftests/mm/pkey-x86.h         |   2 +
 .../selftests/mm/pkey_sighandler_tests.c      | 101 +++++++++++++-----
 5 files changed, 162 insertions(+), 45 deletions(-)

-- 
2.43.0




* [PATCH v2 1/5] arm64: signal: Remove unused macro
  2024-10-23 15:05 [PATCH v2 0/5] Improve arm64 pkeys handling in signal delivery Kevin Brodsky
@ 2024-10-23 15:05 ` Kevin Brodsky
  2024-10-23 15:05 ` [PATCH v2 2/5] arm64: signal: Remove unnecessary check when saving POE state Kevin Brodsky
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Kevin Brodsky @ 2024-10-23 15:05 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Kevin Brodsky, akpm, anshuman.khandual, aruna.ramakrishna,
	broonie, catalin.marinas, dave.hansen, dave.martin, jeffxu,
	joey.gouly, pierre.langlois, shuah, sroettger, will,
	linux-kselftest, x86, Dave Martin

Commit 33f082614c34 ("arm64: signal: Allow expansion of the signal
frame") introduced the BASE_SIGFRAME_SIZE macro but it has
apparently never been used; just remove it.

Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/kernel/signal.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 561986947530..dc998326e24d 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -66,7 +66,6 @@ struct rt_sigframe_user_layout {
 	unsigned long end_offset;
 };
 
-#define BASE_SIGFRAME_SIZE round_up(sizeof(struct rt_sigframe), 16)
 #define TERMINATOR_SIZE round_up(sizeof(struct _aarch64_ctx), 16)
 #define EXTRA_CONTEXT_SIZE round_up(sizeof(struct extra_context), 16)
 
-- 
2.43.0




* [PATCH v2 2/5] arm64: signal: Remove unnecessary check when saving POE state
  2024-10-23 15:05 [PATCH v2 0/5] Improve arm64 pkeys handling in signal delivery Kevin Brodsky
  2024-10-23 15:05 ` [PATCH v2 1/5] arm64: signal: Remove unused macro Kevin Brodsky
@ 2024-10-23 15:05 ` Kevin Brodsky
  2024-10-23 15:05 ` [PATCH v2 3/5] arm64: signal: Improve POR_EL0 handling to avoid uaccess failures Kevin Brodsky
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Kevin Brodsky @ 2024-10-23 15:05 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Kevin Brodsky, akpm, anshuman.khandual, aruna.ramakrishna,
	broonie, catalin.marinas, dave.hansen, dave.martin, jeffxu,
	joey.gouly, pierre.langlois, shuah, sroettger, will,
	linux-kselftest, x86, Dave Martin

The POE frame record is allocated unconditionally if POE is
supported. If the allocation fails, a SIGSEGV is delivered before
setup_sigframe() can be reached. As a result there is no need to
consider poe_offset before saving POR_EL0; just remove that check.
This is in line with other frame records (FPMR, TPIDR2).

Reviewed-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/kernel/signal.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index dc998326e24d..f5fb48dabebe 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -1092,7 +1092,7 @@ static int setup_sigframe(struct rt_sigframe_user_layout *user,
 		err |= preserve_fpmr_context(fpmr_ctx);
 	}
 
-	if (system_supports_poe() && err == 0 && user->poe_offset) {
+	if (system_supports_poe() && err == 0) {
 		struct poe_context __user *poe_ctx =
 			apply_user_offset(user, user->poe_offset);
 
-- 
2.43.0




* [PATCH v2 3/5] arm64: signal: Improve POR_EL0 handling to avoid uaccess failures
  2024-10-23 15:05 [PATCH v2 0/5] Improve arm64 pkeys handling in signal delivery Kevin Brodsky
  2024-10-23 15:05 ` [PATCH v2 1/5] arm64: signal: Remove unused macro Kevin Brodsky
  2024-10-23 15:05 ` [PATCH v2 2/5] arm64: signal: Remove unnecessary check when saving POE state Kevin Brodsky
@ 2024-10-23 15:05 ` Kevin Brodsky
  2024-10-24 10:59   ` Catalin Marinas
  2024-10-23 15:05 ` [PATCH v2 4/5] selftests/mm: Use generic pkey register manipulation Kevin Brodsky
  2024-10-23 15:05 ` [PATCH v2 5/5] selftests/mm: Enable pkey_sighandler_tests on arm64 Kevin Brodsky
  4 siblings, 1 reply; 19+ messages in thread
From: Kevin Brodsky @ 2024-10-23 15:05 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Kevin Brodsky, akpm, anshuman.khandual, aruna.ramakrishna,
	broonie, catalin.marinas, dave.hansen, dave.martin, jeffxu,
	joey.gouly, pierre.langlois, shuah, sroettger, will,
	linux-kselftest, x86

TL;DR: reset POR_EL0 to "allow all" before writing the signal frame,
preventing spurious uaccess failures. Also, make sure that POR_EL0
remains unchanged if delivering the signal fails.

When POE is supported, the POR_EL0 register constrains memory
accesses based on the target page's POIndex (pkey). This raises the
question: what constraints should apply to a signal handler? The
current answer is that POR_EL0 is reset to POR_EL0_INIT when
invoking the handler, giving it full access to POIndex 0. This is in
line with x86's MPK support and remains unchanged.

This is only part of the story, though. POR_EL0 constrains all
unprivileged memory accesses, meaning that uaccess routines such as
put_user() are also impacted. As a result POR_EL0 may prevent the
signal frame from being written to the signal stack (ultimately
causing a SIGSEGV). This is especially concerning when an alternate
signal stack is used, because userspace may want to prevent access
to it outside of signal handlers. There is currently no provision
for that: POR_EL0 is reset after writing to the stack, and
POR_EL0_INIT only enables access to POIndex 0.

This patch ensures that POR_EL0 is reset to its most permissive
state before the signal stack is accessed. Once the signal frame has
been fully written, POR_EL0 is still set to POR_EL0_INIT - it is up
to the signal handler to enable access to additional pkeys if
needed. As for sigreturn(), it expects to have access to the stack
like any other syscall; we only need to ensure that POR_EL0 is
restored from the signal frame after all uaccess calls. This
approach is in line with the recent x86/pkeys series [1].
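
As an illustration of what "most permissive" means here, the value can
be built by replicating one 4-bit permission field across the register.
This is only a sketch: it assumes 4 bits per POIndex, an RXW encoding of
0x7 and 8 keys in the low 32 bits, and por_repeat() is a hypothetical
helper that is not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed encodings for illustration: 4 bits per key, 0x7 = read + execute + write. */
#define POE_MASK 0xfu
#define POE_RXW  0x7u

/*
 * Replicate a 4-bit permission into each of the 8 fields of a 32-bit value.
 * ~0u / 0xf is 0x11111111, so multiplying by perm copies it into every nibble.
 */
static uint64_t por_repeat(unsigned int perm)
{
	assert(perm <= POE_MASK);
	return (uint64_t)((~0u / POE_MASK) * perm);
}
```

Under those assumptions, por_repeat(POE_RXW) yields 0x77777777, which is
the same value the arm64 selftest header uses for PKEY_ALLOW_ALL.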

Resetting POR_EL0 early introduces some complications, in that we
can no longer read the register directly in preserve_poe_context().
This is addressed by introducing a struct (user_access_state)
and helpers to manage any such register impacting user accesses
(uaccess and accesses in userspace). Things look like this on signal
delivery:

1. Save original POR_EL0 into struct [save_reset_user_access_state()]
2. Set POR_EL0 to "allow all"  [save_reset_user_access_state()]
3. Create signal frame
4. Write saved POR_EL0 value to the signal frame [preserve_poe_context()]
5. Finalise signal frame
6. Depending on whether all operations succeeded:
  a. On success, set POR_EL0 to POR_EL0_INIT [set_handler_user_access_state()]
  b. On failure, reset POR_EL0 to its original value [restore_user_access_state()]

If any step fails when setting up the signal frame, the process will
be sent a SIGSEGV, which it may be able to handle. Step 6.b ensures
that the original POR_EL0 is saved in the signal frame when
delivering that SIGSEGV (so that the original value is restored by
sigreturn).
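
The delivery sequence above can be modelled in plain userspace C. This
is a sketch only: a global variable stands in for POR_EL0, the two
constants are assumed values, and every sketch_* name is hypothetical
rather than taken from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_POR_INIT      0x7ull        /* assumed: RXW for key 0 only */
#define SKETCH_POR_ALLOW_ALL 0x77777777ull /* assumed: RXW for all 8 keys */

static uint64_t sketch_por_el0;	/* stands in for the real register */

struct sketch_ua_state { uint64_t por_el0; };
struct sketch_frame { uint64_t por_el0; };

/* Steps 1-2: remember the interrupted value, then lift all restrictions. */
static void sketch_save_reset(struct sketch_ua_state *ua)
{
	ua->por_el0 = sketch_por_el0;
	sketch_por_el0 = SKETCH_POR_ALLOW_ALL;
}

/* Steps 3-5: "write" the frame; uaccess_fails simulates a faulting store. */
static int sketch_write_frame(struct sketch_frame *frame,
			      const struct sketch_ua_state *ua,
			      bool uaccess_fails)
{
	/* Frame writes must run while every key is accessible. */
	assert(sketch_por_el0 == SKETCH_POR_ALLOW_ALL);
	if (uaccess_fails)
		return -1;
	frame->por_el0 = ua->por_el0;	/* step 4: saved value, not the live one */
	return 0;
}

/* Full delivery path, returning 0 on success. */
static int sketch_deliver(struct sketch_frame *frame, bool uaccess_fails)
{
	struct sketch_ua_state ua;
	int err;

	sketch_save_reset(&ua);
	err = sketch_write_frame(frame, &ua, uaccess_fails);
	if (err == 0)
		sketch_por_el0 = SKETCH_POR_INIT;	/* step 6.a */
	else
		sketch_por_el0 = ua.por_el0;		/* step 6.b */
	return err;
}
```

The point of the model is that the frame records the value saved in
step 1, and that a failed delivery leaves the register exactly as it
was when the signal arrived.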

The return path (sys_rt_sigreturn) doesn't strictly require any change
since restore_poe_context() is already called last. However, to
avoid uaccess calls being accidentally added after that point, we
use the same approach as in the delivery path, i.e. separating
uaccess from writing to the register:

1. Read saved POR_EL0 value from the signal frame [restore_poe_context()]
2. Set POR_EL0 to the saved value [restore_user_access_state()]
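
The same separation on the return path can be sketched as follows,
again with a global variable standing in for POR_EL0 and with
hypothetical demo_* names; it is a model of the ordering, not the
kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static uint64_t demo_por_el0;	/* stands in for the real register */

struct demo_ua_state { uint64_t por_el0; };
struct demo_frame { uint64_t por_el0; };

/* Step 1: read the saved value from the frame; the register is not touched. */
static int demo_restore_poe_context(const struct demo_frame *frame,
				    struct demo_ua_state *ua, bool read_fails)
{
	assert(frame && ua);
	if (read_fails)
		return -1;	/* simulated uaccess fault */
	ua->por_el0 = frame->por_el0;
	return 0;
}

static int demo_sigreturn(const struct demo_frame *frame, bool read_fails)
{
	struct demo_ua_state ua;

	if (demo_restore_poe_context(frame, &ua, read_fails))
		return -1;	/* bad frame: register left untouched */

	/* Any further reads of the frame would go here, before the write. */

	demo_por_el0 = ua.por_el0;	/* step 2: last, after all uaccess */
	return 0;
}
```

Keeping the register write as the final statement means that a uaccess
call added later can only land before it.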

[1] https://lore.kernel.org/lkml/20240802061318.2140081-1-aruna.ramakrishna@oracle.com/

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---

Note: the additional check on err at the end of setup_rt_frame() is
made on purpose, in expectation of [2] which will allow setup_return()
to fail too.

[2] https://lore.kernel.org/all/20241001-arm64-gcs-v13-25-222b78d87eee@kernel.org/
---

 arch/arm64/kernel/signal.c | 92 ++++++++++++++++++++++++++++++++------
 1 file changed, 78 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index f5fb48dabebe..d2e4e50977ae 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -66,9 +66,63 @@ struct rt_sigframe_user_layout {
 	unsigned long end_offset;
 };
 
+/*
+ * Holds any EL0-controlled state that influences unprivileged memory accesses.
+ * This includes both accesses done in userspace and uaccess done in the kernel.
+ *
+ * This state needs to be carefully managed to ensure that it doesn't cause
+ * uaccess to fail when setting up the signal frame, and the signal handler
+ * itself also expects a well-defined state when entered.
+ */
+struct user_access_state {
+	u64 por_el0;
+};
+
 #define TERMINATOR_SIZE round_up(sizeof(struct _aarch64_ctx), 16)
 #define EXTRA_CONTEXT_SIZE round_up(sizeof(struct extra_context), 16)
 
+/*
+ * Save the user access state into ua_state and reset it to disable any
+ * restrictions.
+ */
+static void save_reset_user_access_state(struct user_access_state *ua_state)
+{
+	if (system_supports_poe()) {
+		/*
+		 * Enable all permissions in all 8 keys
+		 * (inspired by REPEAT_BYTE())
+		 */
+		u64 por_enable_all = (~0u / POE_MASK) * POE_RXW;
+
+		ua_state->por_el0 = read_sysreg_s(SYS_POR_EL0);
+		write_sysreg_s(por_enable_all, SYS_POR_EL0);
+		/* Ensure that any subsequent uaccess observes the updated value */
+		isb();
+	}
+}
+
+/*
+ * Set the user access state for invoking the signal handler.
+ *
+ * No uaccess should be done after that function is called.
+ */
+static void set_handler_user_access_state(void)
+{
+	if (system_supports_poe())
+		write_sysreg_s(POR_EL0_INIT, SYS_POR_EL0);
+}
+
+/*
+ * Restore the user access state to the values saved in ua_state.
+ *
+ * No uaccess should be done after that function is called.
+ */
+static void restore_user_access_state(const struct user_access_state *ua_state)
+{
+	if (system_supports_poe())
+		write_sysreg_s(ua_state->por_el0, SYS_POR_EL0);
+}
+
 static void init_user_layout(struct rt_sigframe_user_layout *user)
 {
 	const size_t reserved_size =
@@ -260,18 +314,20 @@ static int restore_fpmr_context(struct user_ctxs *user)
 	return err;
 }
 
-static int preserve_poe_context(struct poe_context __user *ctx)
+static int preserve_poe_context(struct poe_context __user *ctx,
+				const struct user_access_state *ua_state)
 {
 	int err = 0;
 
 	__put_user_error(POE_MAGIC, &ctx->head.magic, err);
 	__put_user_error(sizeof(*ctx), &ctx->head.size, err);
-	__put_user_error(read_sysreg_s(SYS_POR_EL0), &ctx->por_el0, err);
+	__put_user_error(ua_state->por_el0, &ctx->por_el0, err);
 
 	return err;
 }
 
-static int restore_poe_context(struct user_ctxs *user)
+static int restore_poe_context(struct user_ctxs *user,
+			       struct user_access_state *ua_state)
 {
 	u64 por_el0;
 	int err = 0;
@@ -281,7 +337,7 @@ static int restore_poe_context(struct user_ctxs *user)
 
 	__get_user_error(por_el0, &(user->poe->por_el0), err);
 	if (!err)
-		write_sysreg_s(por_el0, SYS_POR_EL0);
+		ua_state->por_el0 = por_el0;
 
 	return err;
 }
@@ -849,7 +905,8 @@ static int parse_user_sigframe(struct user_ctxs *user,
 }
 
 static int restore_sigframe(struct pt_regs *regs,
-			    struct rt_sigframe __user *sf)
+			    struct rt_sigframe __user *sf,
+			    struct user_access_state *ua_state)
 {
 	sigset_t set;
 	int i, err;
@@ -898,7 +955,7 @@ static int restore_sigframe(struct pt_regs *regs,
 		err = restore_zt_context(&user);
 
 	if (err == 0 && system_supports_poe() && user.poe)
-		err = restore_poe_context(&user);
+		err = restore_poe_context(&user, ua_state);
 
 	return err;
 }
@@ -907,6 +964,7 @@ SYSCALL_DEFINE0(rt_sigreturn)
 {
 	struct pt_regs *regs = current_pt_regs();
 	struct rt_sigframe __user *frame;
+	struct user_access_state ua_state;
 
 	/* Always make any pending restarted system calls return -EINTR */
 	current->restart_block.fn = do_no_restart_syscall;
@@ -923,12 +981,14 @@ SYSCALL_DEFINE0(rt_sigreturn)
 	if (!access_ok(frame, sizeof (*frame)))
 		goto badframe;
 
-	if (restore_sigframe(regs, frame))
+	if (restore_sigframe(regs, frame, &ua_state))
 		goto badframe;
 
 	if (restore_altstack(&frame->uc.uc_stack))
 		goto badframe;
 
+	restore_user_access_state(&ua_state);
+
 	return regs->regs[0];
 
 badframe:
@@ -1034,7 +1094,8 @@ static int setup_sigframe_layout(struct rt_sigframe_user_layout *user,
 }
 
 static int setup_sigframe(struct rt_sigframe_user_layout *user,
-			  struct pt_regs *regs, sigset_t *set)
+			  struct pt_regs *regs, sigset_t *set,
+			  const struct user_access_state *ua_state)
 {
 	int i, err = 0;
 	struct rt_sigframe __user *sf = user->sigframe;
@@ -1096,10 +1157,9 @@ static int setup_sigframe(struct rt_sigframe_user_layout *user,
 		struct poe_context __user *poe_ctx =
 			apply_user_offset(user, user->poe_offset);
 
-		err |= preserve_poe_context(poe_ctx);
+		err |= preserve_poe_context(poe_ctx, ua_state);
 	}
 
-
 	/* ZA state if present */
 	if (system_supports_sme() && err == 0 && user->za_offset) {
 		struct za_context __user *za_ctx =
@@ -1236,9 +1296,6 @@ static void setup_return(struct pt_regs *regs, struct k_sigaction *ka,
 		sme_smstop();
 	}
 
-	if (system_supports_poe())
-		write_sysreg_s(POR_EL0_INIT, SYS_POR_EL0);
-
 	if (ka->sa.sa_flags & SA_RESTORER)
 		sigtramp = ka->sa.sa_restorer;
 	else
@@ -1252,6 +1309,7 @@ static int setup_rt_frame(int usig, struct ksignal *ksig, sigset_t *set,
 {
 	struct rt_sigframe_user_layout user;
 	struct rt_sigframe __user *frame;
+	struct user_access_state ua_state;
 	int err = 0;
 
 	fpsimd_signal_preserve_current_state();
@@ -1259,13 +1317,14 @@ static int setup_rt_frame(int usig, struct ksignal *ksig, sigset_t *set,
 	if (get_sigframe(&user, ksig, regs))
 		return 1;
 
+	save_reset_user_access_state(&ua_state);
 	frame = user.sigframe;
 
 	__put_user_error(0, &frame->uc.uc_flags, err);
 	__put_user_error(NULL, &frame->uc.uc_link, err);
 
 	err |= __save_altstack(&frame->uc.uc_stack, regs->sp);
-	err |= setup_sigframe(&user, regs, set);
+	err |= setup_sigframe(&user, regs, set, &ua_state);
 	if (err == 0) {
 		setup_return(regs, &ksig->ka, &user, usig);
 		if (ksig->ka.sa.sa_flags & SA_SIGINFO) {
@@ -1275,6 +1334,11 @@ static int setup_rt_frame(int usig, struct ksignal *ksig, sigset_t *set,
 		}
 	}
 
+	if (err == 0)
+		set_handler_user_access_state();
+	else
+		restore_user_access_state(&ua_state);
+
 	return err;
 }
 
-- 
2.43.0




* [PATCH v2 4/5] selftests/mm: Use generic pkey register manipulation
  2024-10-23 15:05 [PATCH v2 0/5] Improve arm64 pkeys handling in signal delivery Kevin Brodsky
                   ` (2 preceding siblings ...)
  2024-10-23 15:05 ` [PATCH v2 3/5] arm64: signal: Improve POR_EL0 handling to avoid uaccess failures Kevin Brodsky
@ 2024-10-23 15:05 ` Kevin Brodsky
  2024-10-23 16:51   ` Dave Hansen
  2024-10-23 15:05 ` [PATCH v2 5/5] selftests/mm: Enable pkey_sighandler_tests on arm64 Kevin Brodsky
  4 siblings, 1 reply; 19+ messages in thread
From: Kevin Brodsky @ 2024-10-23 15:05 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Kevin Brodsky, akpm, anshuman.khandual, aruna.ramakrishna,
	broonie, catalin.marinas, dave.hansen, dave.martin, jeffxu,
	joey.gouly, pierre.langlois, shuah, sroettger, will,
	linux-kselftest, x86

pkey_sighandler_tests.c currently hardcodes x86 PKRU encodings. The
first step towards running those tests on arm64 is to abstract away
the pkey register values.

Since those tests want to deny access to all keys except a few,
we have each arch define PKEY_ALLOW_NONE, the pkey register value
denying access to all keys. We then use the existing set_pkey_bits()
helper to grant access to specific keys.
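
To show how this reproduces the constants being replaced, here is a
sketch of the x86 case. x86_set_pkey_bits() is a simplified stand-in
written for this example (2 bits per key, flags stored directly); it is
assumed to behave like the selftests' set_pkey_bits() but is not copied
from it.

```c
#include <assert.h>
#include <stdint.h>

#define X86_BITS_PER_PKEY 2
#define X86_PKEY_MASK     0x3ull
#define X86_ALLOW_NONE    0x55555555ull	/* access-disable bit set for all 16 keys */

/* Replace the two permission bits of one key with the given flags. */
static uint64_t x86_set_pkey_bits(uint64_t reg, int pkey, uint64_t flags)
{
	unsigned int shift;

	assert(pkey >= 0 && pkey < 16);
	shift = (unsigned int)pkey * X86_BITS_PER_PKEY;
	reg &= ~(X86_PKEY_MASK << shift);		/* clear the old bits */
	reg |= (flags & X86_PKEY_MASK) << shift;	/* flags == 0 grants access */
	return reg;
}
```

With that helper, granting key 1 on top of X86_ALLOW_NONE gives
0x55555551, and granting keys 0 and 2 gives 0x55555544, matching the
literals the tests previously hardcoded.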

Because pkeys may also remove the execute permission on arm64, we
need to be a little careful: all code is mapped with pkey 0, and we
need it to remain executable. pkey_reg_no_access is introduced for
that purpose: this value prevents RW access to all pkeys, but
retains X permission for pkey 0.
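
A sketch of the arm64 side follows. It rests on assumptions that should
be checked against the selftest headers: 4 bits per key, overlay
encodings X = 0x2, RX = 0x3 and RWX = 0x7, and a helper that translates
the generic PKEY_DISABLE_* flags into those encodings. All DEMO_* and
demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Generic pkey flags (values as in the Linux uapi). */
#define DEMO_PKEY_DISABLE_ACCESS 0x1ull
#define DEMO_PKEY_DISABLE_WRITE  0x2ull

/* Assumed arm64 overlay encodings, 4 bits per key. */
#define DEMO_POE_X    0x2ull
#define DEMO_POE_RX   0x3ull
#define DEMO_POE_RWX  0x7ull
#define DEMO_POE_MASK 0xfull

/* Translate generic flags to an overlay value and store it for one key. */
static uint64_t demo_arm64_set_pkey_bits(uint64_t reg, int pkey, uint64_t flags)
{
	unsigned int shift;
	uint64_t perm = DEMO_POE_RWX;	/* no flags: full access */

	assert(pkey >= 0 && pkey < 8);
	shift = (unsigned int)pkey * 4;
	if (flags & DEMO_PKEY_DISABLE_ACCESS)
		perm = DEMO_POE_X;	/* no data access, still executable */
	else if (flags & DEMO_PKEY_DISABLE_WRITE)
		perm = DEMO_POE_RX;
	reg &= ~(DEMO_POE_MASK << shift);
	reg |= perm << shift;
	return reg;
}
```

Under these assumptions, starting from 0 and disabling access for key 0
yields 0x2: code mapped with pkey 0 stays executable while every data
access is denied.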

test_pkru_preserved_after_sigusr1() only checks that the pkey
register value remains unchanged after a signal is delivered, so the
particular value is irrelevant. We enable pkey 0 and a few more
arbitrary keys in the smallest range available on all architectures
(8 keys on arm64).

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 tools/testing/selftests/mm/pkey-arm64.h       |  1 +
 tools/testing/selftests/mm/pkey-x86.h         |  2 +
 .../selftests/mm/pkey_sighandler_tests.c      | 39 ++++++++++++++-----
 3 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/tools/testing/selftests/mm/pkey-arm64.h b/tools/testing/selftests/mm/pkey-arm64.h
index 580e1b0bb38e..5ec53d67dfc7 100644
--- a/tools/testing/selftests/mm/pkey-arm64.h
+++ b/tools/testing/selftests/mm/pkey-arm64.h
@@ -31,6 +31,7 @@
 #define NR_RESERVED_PKEYS	1 /* pkey-0 */
 
 #define PKEY_ALLOW_ALL		0x77777777
+#define PKEY_ALLOW_NONE		0
 
 #define PKEY_BITS_PER_PKEY	4
 #define PAGE_SIZE		sysconf(_SC_PAGESIZE)
diff --git a/tools/testing/selftests/mm/pkey-x86.h b/tools/testing/selftests/mm/pkey-x86.h
index 5f28e26a2511..53ed9a336ffe 100644
--- a/tools/testing/selftests/mm/pkey-x86.h
+++ b/tools/testing/selftests/mm/pkey-x86.h
@@ -34,6 +34,8 @@
 #define PAGE_SIZE		4096
 #define MB			(1<<20)
 
+#define PKEY_ALLOW_NONE		0x55555555
+
 static inline void __page_o_noops(void)
 {
 	/* 8-bytes of instruction * 512 bytes = 1 page */
diff --git a/tools/testing/selftests/mm/pkey_sighandler_tests.c b/tools/testing/selftests/mm/pkey_sighandler_tests.c
index a8088b645ad6..b5e1767ee5d9 100644
--- a/tools/testing/selftests/mm/pkey_sighandler_tests.c
+++ b/tools/testing/selftests/mm/pkey_sighandler_tests.c
@@ -37,6 +37,8 @@ pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
 pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
 siginfo_t siginfo = {0};
 
+static u64 pkey_reg_no_access;
+
 /*
  * We need to use inline assembly instead of glibc's syscall because glibc's
  * syscall will attempt to access the PLT in order to call a library function
@@ -113,7 +115,7 @@ static void raise_sigusr2(void)
 static void *thread_segv_with_pkey0_disabled(void *ptr)
 {
 	/* Disable MPK 0 (and all others too) */
-	__write_pkey_reg(0x55555555);
+	__write_pkey_reg(pkey_reg_no_access);
 
 	/* Segfault (with SEGV_MAPERR) */
 	*(int *) (0x1) = 1;
@@ -123,7 +125,7 @@ static void *thread_segv_with_pkey0_disabled(void *ptr)
 static void *thread_segv_pkuerr_stack(void *ptr)
 {
 	/* Disable MPK 0 (and all others too) */
-	__write_pkey_reg(0x55555555);
+	__write_pkey_reg(pkey_reg_no_access);
 
 	/* After we disable MPK 0, we can't access the stack to return */
 	return NULL;
@@ -133,6 +135,7 @@ static void *thread_segv_maperr_ptr(void *ptr)
 {
 	stack_t *stack = ptr;
 	int *bad = (int *)1;
+	u64 pkey_reg;
 
 	/*
 	 * Setup alternate signal stack, which should be pkey_mprotect()ed by
@@ -142,7 +145,8 @@ static void *thread_segv_maperr_ptr(void *ptr)
 	syscall_raw(SYS_sigaltstack, (long)stack, 0, 0, 0, 0, 0);
 
 	/* Disable MPK 0.  Only MPK 1 is enabled. */
-	__write_pkey_reg(0x55555551);
+	pkey_reg = set_pkey_bits(pkey_reg_no_access, 1, 0);
+	__write_pkey_reg(pkey_reg);
 
 	/* Segfault */
 	*bad = 1;
@@ -240,6 +244,7 @@ static void test_sigsegv_handler_with_different_pkey_for_stack(void)
 	int pkey;
 	int parent_pid = 0;
 	int child_pid = 0;
+	u64 pkey_reg;
 
 	sa.sa_flags = SA_SIGINFO | SA_ONSTACK;
 
@@ -257,7 +262,9 @@ static void test_sigsegv_handler_with_different_pkey_for_stack(void)
 	assert(stack != MAP_FAILED);
 
 	/* Allow access to MPK 0 and MPK 1 */
-	__write_pkey_reg(0x55555550);
+	pkey_reg = set_pkey_bits(pkey_reg_no_access, 0, 0);
+	pkey_reg = set_pkey_bits(pkey_reg, 1, 0);
+	__write_pkey_reg(pkey_reg);
 
 	/* Protect the new stack with MPK 1 */
 	pkey = pkey_alloc(0, 0);
@@ -307,7 +314,12 @@ static void test_sigsegv_handler_with_different_pkey_for_stack(void)
 static void test_pkru_preserved_after_sigusr1(void)
 {
 	struct sigaction sa;
-	unsigned long pkru = 0x45454544;
+	u64 pkey_reg;
+
+	/* Allow access to MPK 0 and an arbitrary set of keys */
+	pkey_reg = set_pkey_bits(pkey_reg_no_access, 0, 0);
+	pkey_reg = set_pkey_bits(pkey_reg, 3, 0);
+	pkey_reg = set_pkey_bits(pkey_reg, 7, 0);
 
 	sa.sa_flags = SA_SIGINFO;
 
@@ -320,7 +332,7 @@ static void test_pkru_preserved_after_sigusr1(void)
 
 	memset(&siginfo, 0, sizeof(siginfo));
 
-	__write_pkey_reg(pkru);
+	__write_pkey_reg(pkey_reg);
 
 	raise(SIGUSR1);
 
@@ -330,7 +342,7 @@ static void test_pkru_preserved_after_sigusr1(void)
 	pthread_mutex_unlock(&mutex);
 
 	/* Ensure the pkru value is the same after returning from signal. */
-	ksft_test_result(pkru == __read_pkey_reg() &&
+	ksft_test_result(pkey_reg == __read_pkey_reg() &&
 			 siginfo.si_signo == SIGUSR1,
 			 "%s\n", __func__);
 }
@@ -347,6 +359,7 @@ static noinline void *thread_sigusr2_self(void *ptr)
 		'S', 'I', 'G', 'U', 'S', 'R', '2',
 		'.', '.', '.', '\n', '\0'};
 	stack_t *stack = ptr;
+	u64 pkey_reg;
 
 	/*
 	 * Setup alternate signal stack, which should be pkey_mprotect()ed by
@@ -356,7 +369,8 @@ static noinline void *thread_sigusr2_self(void *ptr)
 	syscall(SYS_sigaltstack, (long)stack, 0, 0, 0, 0, 0);
 
 	/* Disable MPK 0.  Only MPK 2 is enabled. */
-	__write_pkey_reg(0x55555545);
+	pkey_reg = set_pkey_bits(pkey_reg_no_access, 2, 0);
+	__write_pkey_reg(pkey_reg);
 
 	raise_sigusr2();
 
@@ -384,6 +398,7 @@ static void test_pkru_sigreturn(void)
 	int pkey;
 	int parent_pid = 0;
 	int child_pid = 0;
+	u64 pkey_reg;
 
 	sa.sa_handler = SIG_DFL;
 	sa.sa_flags = 0;
@@ -418,7 +433,9 @@ static void test_pkru_sigreturn(void)
 	 * the current thread's stack is protected by the default MPK 0. Hence
 	 * both need to be enabled.
 	 */
-	__write_pkey_reg(0x55555544);
+	pkey_reg = set_pkey_bits(pkey_reg_no_access, 0, 0);
+	pkey_reg = set_pkey_bits(pkey_reg, 2, 0);
+	__write_pkey_reg(pkey_reg);
 
 	/* Protect the stack with MPK 2 */
 	pkey = pkey_alloc(0, 0);
@@ -473,6 +490,10 @@ int main(int argc, char *argv[])
 	ksft_print_header();
 	ksft_set_plan(ARRAY_SIZE(pkey_tests));
 
+	/* Only allow X for MPK 0 and nothing for other keys */
+	pkey_reg_no_access = set_pkey_bits(PKEY_ALLOW_NONE, 0,
+					   PKEY_DISABLE_ACCESS);
+
 	for (i = 0; i < ARRAY_SIZE(pkey_tests); i++)
 		(*pkey_tests[i])();
 
-- 
2.43.0




* [PATCH v2 5/5] selftests/mm: Enable pkey_sighandler_tests on arm64
  2024-10-23 15:05 [PATCH v2 0/5] Improve arm64 pkeys handling in signal delivery Kevin Brodsky
                   ` (3 preceding siblings ...)
  2024-10-23 15:05 ` [PATCH v2 4/5] selftests/mm: Use generic pkey register manipulation Kevin Brodsky
@ 2024-10-23 15:05 ` Kevin Brodsky
  4 siblings, 0 replies; 19+ messages in thread
From: Kevin Brodsky @ 2024-10-23 15:05 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Kevin Brodsky, akpm, anshuman.khandual, aruna.ramakrishna,
	broonie, catalin.marinas, dave.hansen, dave.martin, jeffxu,
	joey.gouly, pierre.langlois, shuah, sroettger, will,
	linux-kselftest, x86

pkey_sighandler_tests.c makes raw syscalls using its own helper,
syscall_raw(). One of those syscalls is clone, which is problematic
as every architecture has a different opinion on the order of its
arguments.

To complete arm64 support, we therefore add an appropriate
implementation in syscall_raw(), and introduce a clone_raw() helper
that shuffles arguments as needed for each arch.
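
The difference in argument order can be shown without issuing a
syscall. The sketch below only fills an array in the order each
architecture's raw clone expects; the enum and demo_clone_args() are
hypothetical names introduced for this example.

```c
#include <assert.h>

enum demo_arch { DEMO_X86_64, DEMO_ARM64 };

/*
 * Fill args[0..4] in raw clone syscall order:
 *   x86_64: flags, stack, parent_tid, child_tid, tls
 *   arm64:  flags, stack, parent_tid, tls, child_tid
 */
static void demo_clone_args(enum demo_arch arch, long flags, long stack,
			    long parent_tid, long child_tid, long tls,
			    long args[5])
{
	assert(args != 0);
	args[0] = flags;
	args[1] = stack;
	args[2] = parent_tid;
	if (arch == DEMO_X86_64) {
		args[3] = child_tid;
		args[4] = tls;
	} else {
		args[3] = tls;
		args[4] = child_tid;
	}
}
```

Only the last two slots differ, which is why a single wrapper taking
named parameters is enough to hide the difference from the tests.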

Having done this, we enable building pkey_sighandler_tests for arm64
in the Makefile.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 tools/testing/selftests/mm/Makefile           |  8 +--
 .../selftests/mm/pkey_sighandler_tests.c      | 62 ++++++++++++++-----
 2 files changed, 50 insertions(+), 20 deletions(-)

diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index 02e1204971b0..0f8c110e0805 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -105,12 +105,12 @@ endif
 ifeq ($(CAN_BUILD_X86_64),1)
 TEST_GEN_FILES += $(BINARIES_64)
 endif
-else
 
-ifneq (,$(filter $(ARCH),arm64 powerpc))
+else ifeq ($(ARCH),arm64)
+TEST_GEN_FILES += protection_keys
+TEST_GEN_FILES += pkey_sighandler_tests
+else ifeq ($(ARCH),powerpc)
 TEST_GEN_FILES += protection_keys
-endif
-
 endif
 
 ifneq (,$(filter $(ARCH),arm64 mips64 parisc64 powerpc riscv64 s390x sparc64 x86_64 s390))
diff --git a/tools/testing/selftests/mm/pkey_sighandler_tests.c b/tools/testing/selftests/mm/pkey_sighandler_tests.c
index b5e1767ee5d9..97460980811c 100644
--- a/tools/testing/selftests/mm/pkey_sighandler_tests.c
+++ b/tools/testing/selftests/mm/pkey_sighandler_tests.c
@@ -61,12 +61,44 @@ long syscall_raw(long n, long a1, long a2, long a3, long a4, long a5, long a6)
 		      : "=a"(ret)
 		      : "a"(n), "b"(a1), "c"(a2), "d"(a3), "S"(a4), "D"(a5)
 		      : "memory");
+#elif defined __aarch64__
+	register long x0 asm("x0") = a1;
+	register long x1 asm("x1") = a2;
+	register long x2 asm("x2") = a3;
+	register long x3 asm("x3") = a4;
+	register long x4 asm("x4") = a5;
+	register long x5 asm("x5") = a6;
+	register long x8 asm("x8") = n;
+	asm volatile ("svc #0"
+		      : "=r"(x0)
+		      : "r"(x0), "r"(x1), "r"(x2), "r"(x3), "r"(x4), "r"(x5), "r"(x8)
+		      : "memory");
+	ret = x0;
 #else
 # error syscall_raw() not implemented
 #endif
 	return ret;
 }
 
+static inline long clone_raw(unsigned long flags, void *stack,
+			     int *parent_tid, int *child_tid)
+{
+	long a1 = flags;
+	long a2 = (long)stack;
+	long a3 = (long)parent_tid;
+#if defined(__x86_64__) || defined(__i386)
+	long a4 = (long)child_tid;
+	long a5 = 0;
+#elif defined(__aarch64__)
+	long a4 = 0;
+	long a5 = (long)child_tid;
+#else
+# error clone_raw() not implemented
+#endif
+
+	return syscall_raw(SYS_clone, a1, a2, a3, a4, a5, 0);
+}
+
 static void sigsegv_handler(int signo, siginfo_t *info, void *ucontext)
 {
 	pthread_mutex_lock(&mutex);
@@ -279,14 +311,13 @@ static void test_sigsegv_handler_with_different_pkey_for_stack(void)
 	memset(&siginfo, 0, sizeof(siginfo));
 
 	/* Use clone to avoid newer glibcs using rseq on new threads */
-	long ret = syscall_raw(SYS_clone,
-			       CLONE_VM | CLONE_FS | CLONE_FILES |
-			       CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM |
-			       CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID |
-			       CLONE_DETACHED,
-			       (long) ((char *)(stack) + STACK_SIZE),
-			       (long) &parent_pid,
-			       (long) &child_pid, 0, 0);
+	long ret = clone_raw(CLONE_VM | CLONE_FS | CLONE_FILES |
+			     CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM |
+			     CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID |
+			     CLONE_DETACHED,
+			     stack + STACK_SIZE,
+			     &parent_pid,
+			     &child_pid);
 
 	if (ret < 0) {
 		errno = -ret;
@@ -448,14 +479,13 @@ static void test_pkru_sigreturn(void)
 	sigstack.ss_size = STACK_SIZE;
 
 	/* Use clone to avoid newer glibcs using rseq on new threads */
-	long ret = syscall_raw(SYS_clone,
-			       CLONE_VM | CLONE_FS | CLONE_FILES |
-			       CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM |
-			       CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID |
-			       CLONE_DETACHED,
-			       (long) ((char *)(stack) + STACK_SIZE),
-			       (long) &parent_pid,
-			       (long) &child_pid, 0, 0);
+	long ret = clone_raw(CLONE_VM | CLONE_FS | CLONE_FILES |
+			     CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM |
+			     CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID |
+			     CLONE_DETACHED,
+			     stack + STACK_SIZE,
+			     &parent_pid,
+			     &child_pid);
 
 	if (ret < 0) {
 		errno = -ret;
-- 
2.43.0




* Re: [PATCH v2 4/5] selftests/mm: Use generic pkey register manipulation
  2024-10-23 15:05 ` [PATCH v2 4/5] selftests/mm: Use generic pkey register manipulation Kevin Brodsky
@ 2024-10-23 16:51   ` Dave Hansen
  2024-10-25  8:31     ` Kevin Brodsky
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Hansen @ 2024-10-23 16:51 UTC (permalink / raw)
  To: Kevin Brodsky, linux-arm-kernel
  Cc: akpm, anshuman.khandual, aruna.ramakrishna, broonie,
	catalin.marinas, dave.hansen, dave.martin, jeffxu, joey.gouly,
	pierre.langlois, shuah, sroettger, will, linux-kselftest, x86

On 10/23/24 08:05, Kevin Brodsky wrote:
...
> diff --git a/tools/testing/selftests/mm/pkey-x86.h b/tools/testing/selftests/mm/pkey-x86.h
> index 5f28e26a2511..53ed9a336ffe 100644
> --- a/tools/testing/selftests/mm/pkey-x86.h
> +++ b/tools/testing/selftests/mm/pkey-x86.h
> @@ -34,6 +34,8 @@
>  #define PAGE_SIZE		4096
>  #define MB			(1<<20)
>  
> +#define PKEY_ALLOW_NONE		0x55555555

Hi Kevin,

Looking at this in context, I think "PKEY_ALLOW_NONE" is not a great
name.  On one hand, we have:

	PKEY_DISABLE_ACCESS
	PKEY_DISABLE_WRITE

which are values for *A* pkey.

But PKEY_ALLOW_NONE is a whole register value and spans permissions for
many keys.  We don't want folks trying to do something like:

	pkey_alloc(flags, PKEY_ALLOW_NONE);

If I were naming it in x86 code, I'd probably call it:

	PKRU_ALLOW_NONE

or something.

>  static inline void __page_o_noops(void)
>  {
>  	/* 8-bytes of instruction * 512 bytes = 1 page */
> diff --git a/tools/testing/selftests/mm/pkey_sighandler_tests.c b/tools/testing/selftests/mm/pkey_sighandler_tests.c
> index a8088b645ad6..b5e1767ee5d9 100644
> --- a/tools/testing/selftests/mm/pkey_sighandler_tests.c
> +++ b/tools/testing/selftests/mm/pkey_sighandler_tests.c
> @@ -37,6 +37,8 @@ pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
>  pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
>  siginfo_t siginfo = {0};
>  
> +static u64 pkey_reg_no_access;

Ideally, this would be a real const or a #define because it really is
static.  Right?  Or is there something dynamic about the ARM
implementation's value?

...
>  	 * Setup alternate signal stack, which should be pkey_mprotect()ed by
> @@ -142,7 +145,8 @@ static void *thread_segv_maperr_ptr(void *ptr)
>  	syscall_raw(SYS_sigaltstack, (long)stack, 0, 0, 0, 0, 0);
>  
>  	/* Disable MPK 0.  Only MPK 1 is enabled. */
> -	__write_pkey_reg(0x55555551);
> +	pkey_reg = set_pkey_bits(pkey_reg_no_access, 1, 0);
> +	__write_pkey_reg(pkey_reg);

The existing magic numbers are not great, but could we do:

#define PKEY_ALLOW_ALL 0x0

So that this can be written like this:

	pkey_reg = PKRU_ALLOW_NONE;
	pkey_reg = set_pkey_bits(pkey_reg, 1, PKEY_ALLOW_ALL);

That would get rid of the magic '0'.
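
For illustration, the pattern suggested above can be sketched as a standalone userspace helper. This is only a sketch, not the selftest's actual code: the sketch_ names are hypothetical, and the layout (2 bits per key, as in x86 PKRU) is an assumption. The 0x55555555 value and the PKEY_ALLOW_ALL idea come from the messages above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86 PKRU layout: 2 bits per key. */
#define SKETCH_BITS_PER_PKEY	2
#define SKETCH_PKEY_MASK	0x3ull
#define SKETCH_PKRU_ALLOW_NONE	0x55555555ull	/* whole-register value from the patch */
#define SKETCH_PKEY_ALLOW_ALL	0x0ull		/* per-key value, name suggested above */

/* Replace the permission bits of one key inside a whole-register value. */
static uint64_t sketch_set_pkey_bits(uint64_t reg, int pkey, uint64_t flags)
{
	unsigned int shift = (unsigned int)pkey * SKETCH_BITS_PER_PKEY;

	reg &= ~(SKETCH_PKEY_MASK << shift);		/* clear the key's old bits */
	reg |= (flags & SKETCH_PKEY_MASK) << shift;	/* install the new bits */
	return reg;
}

/* Mirrors the "only MPK 1 is enabled" case from the quoted hunk. */
static uint64_t sketch_only_key1(void)
{
	uint64_t reg = SKETCH_PKRU_ALLOW_NONE;

	reg = sketch_set_pkey_bits(reg, 1, SKETCH_PKEY_ALLOW_ALL);
	assert((reg & SKETCH_PKEY_MASK) != 0);	/* key 0 stays restricted */
	return reg;
}
```

Under those assumptions, sketch_only_key1() reproduces the old magic number 0x55555551, and clearing keys 0 and 1 gives 0x55555550.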

>  	/* Segfault */
>  	*bad = 1;
> @@ -240,6 +244,7 @@ static void test_sigsegv_handler_with_different_pkey_for_stack(void)
>  	int pkey;
>  	int parent_pid = 0;
>  	int child_pid = 0;
> +	u64 pkey_reg;
>  
>  	sa.sa_flags = SA_SIGINFO | SA_ONSTACK;
>  
> @@ -257,7 +262,9 @@ static void test_sigsegv_handler_with_different_pkey_for_stack(void)
>  	assert(stack != MAP_FAILED);
>  
>  	/* Allow access to MPK 0 and MPK 1 */
> -	__write_pkey_reg(0x55555550);
> +	pkey_reg = set_pkey_bits(pkey_reg_no_access, 0, 0);
> +	pkey_reg = set_pkey_bits(pkey_reg, 1, 0);
> +	__write_pkey_reg(pkey_reg);

... and using the pattern from above, this is quite a bit more readable:

	pkey_reg = PKRU_ALLOW_NONE;
	pkey_reg = set_pkey_bits(pkey_reg, 0, PKEY_ALLOW_ALL);
	pkey_reg = set_pkey_bits(pkey_reg, 1, PKEY_ALLOW_ALL);

...
> +	/* Only allow X for MPK 0 and nothing for other keys */
> +	pkey_reg_no_access = set_pkey_bits(PKEY_ALLOW_NONE, 0,
> +					   PKEY_DISABLE_ACCESS);

If the comment says "only allow X", then I'd expect the code to say:

	pkey_reg_no_access = set_pkey_bits(PKEY_ALLOW_NONE, 0, PKEY_X);

... or something similar.



* Re: [PATCH v2 3/5] arm64: signal: Improve POR_EL0 handling to avoid uaccess failures
  2024-10-23 15:05 ` [PATCH v2 3/5] arm64: signal: Improve POR_EL0 handling to avoid uaccess failures Kevin Brodsky
@ 2024-10-24 10:59   ` Catalin Marinas
  2024-10-24 14:55     ` Kevin Brodsky
  0 siblings, 1 reply; 19+ messages in thread
From: Catalin Marinas @ 2024-10-24 10:59 UTC (permalink / raw)
  To: Kevin Brodsky
  Cc: linux-arm-kernel, akpm, anshuman.khandual, aruna.ramakrishna,
	broonie, dave.hansen, dave.martin, jeffxu, joey.gouly,
	pierre.langlois, shuah, sroettger, will, linux-kselftest, x86

On Wed, Oct 23, 2024 at 04:05:09PM +0100, Kevin Brodsky wrote:
> diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
> index f5fb48dabebe..d2e4e50977ae 100644
> --- a/arch/arm64/kernel/signal.c
> +++ b/arch/arm64/kernel/signal.c
> @@ -66,9 +66,63 @@ struct rt_sigframe_user_layout {
>  	unsigned long end_offset;
>  };
>  
> +/*
> + * Holds any EL0-controlled state that influences unprivileged memory accesses.
> + * This includes both accesses done in userspace and uaccess done in the kernel.
> + *
> + * This state needs to be carefully managed to ensure that it doesn't cause
> + * uaccess to fail when setting up the signal frame, and the signal handler
> + * itself also expects a well-defined state when entered.
> + */
> +struct user_access_state {
> +	u64 por_el0;
> +};
> +
>  #define TERMINATOR_SIZE round_up(sizeof(struct _aarch64_ctx), 16)
>  #define EXTRA_CONTEXT_SIZE round_up(sizeof(struct extra_context), 16)
>  
> +/*
> + * Save the unpriv access state into ua_state and reset it to disable any
> + * restrictions.
> + */
> +static void save_reset_user_access_state(struct user_access_state *ua_state)
> +{
> +	if (system_supports_poe()) {
> +		/*
> +		 * Enable all permissions in all 8 keys
> +		 * (inspired by REPEAT_BYTE())
> +		 */
> +		u64 por_enable_all = (~0u / POE_MASK) * POE_RXW;

I think this should be ~0ul.

> @@ -907,6 +964,7 @@ SYSCALL_DEFINE0(rt_sigreturn)
>  {
>  	struct pt_regs *regs = current_pt_regs();
>  	struct rt_sigframe __user *frame;
> +	struct user_access_state ua_state;
>  
>  	/* Always make any pending restarted system calls return -EINTR */
>  	current->restart_block.fn = do_no_restart_syscall;
> @@ -923,12 +981,14 @@ SYSCALL_DEFINE0(rt_sigreturn)
>  	if (!access_ok(frame, sizeof (*frame)))
>  		goto badframe;
>  
> -	if (restore_sigframe(regs, frame))
> +	if (restore_sigframe(regs, frame, &ua_state))
>  		goto badframe;
>  
>  	if (restore_altstack(&frame->uc.uc_stack))
>  		goto badframe;
>  
> +	restore_user_access_state(&ua_state);
> +
>  	return regs->regs[0];
>  
>  badframe:

The saving part I'm fine with. For restoring, I was wondering whether we
can get a more privileged POR_EL0 if reading the frame somehow failed.
This is largely theoretical; there are easier ways to attack, like
writing POR_EL0 directly, than unmapping/remapping the signal stack.

What I'd change here is always restore_user_access_state() to
POR_EL0_INIT. Maybe just initialise ua_state above and add the function
call after the badframe label.

Either way:

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>



* Re: [PATCH v2 3/5] arm64: signal: Improve POR_EL0 handling to avoid uaccess failures
  2024-10-24 10:59   ` Catalin Marinas
@ 2024-10-24 14:55     ` Kevin Brodsky
  2024-10-24 15:42       ` Catalin Marinas
  0 siblings, 1 reply; 19+ messages in thread
From: Kevin Brodsky @ 2024-10-24 14:55 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: linux-arm-kernel, akpm, anshuman.khandual, aruna.ramakrishna,
	broonie, dave.hansen, dave.martin, jeffxu, joey.gouly,
	pierre.langlois, shuah, sroettger, will, linux-kselftest, x86

On 24/10/2024 12:59, Catalin Marinas wrote:
> On Wed, Oct 23, 2024 at 04:05:09PM +0100, Kevin Brodsky wrote:
>> +/*
>> + * Save the unpriv access state into ua_state and reset it to disable any
>> + * restrictions.
>> + */
>> +static void save_reset_user_access_state(struct user_access_state *ua_state)
>> +{
>> +	if (system_supports_poe()) {
>> +		/*
>> +		 * Enable all permissions in all 8 keys
>> +		 * (inspired by REPEAT_BYTE())
>> +		 */
>> +		u64 por_enable_all = (~0u / POE_MASK) * POE_RXW;
> I think this should be ~0ul.

It is ~0u on purpose, because unlike in REPEAT_BYTE(), I only wanted the
lower 32 bits to be filled with POE_RXW (we only have 8 keys, the top 32
bits are RES0). That said, given that D128 has 4-bit pkeys, we could
anticipate and fill the top 32 bits too (should make no difference on D64).
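
To make the arithmetic concrete, here is a small userspace sketch of the two variants being discussed. The sketch_ names are hypothetical, and the constants (a 4-bit field per key, 0x7 for R|X|W) are assumptions standing in for POE_MASK and POE_RXW; ~0ull is used rather than ~0ul so that the 64-bit variant does not depend on the width of long.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed stand-ins for POE_MASK and POE_RXW. */
#define SKETCH_POE_MASK	0xfu
#define SKETCH_POE_RXW	0x7u

/*
 * 32-bit variant: ~0u / 0xf is 0x11111111, so the product puts 0x7 in
 * each of the 8 low nibbles and leaves the top 32 bits clear.
 */
static uint64_t sketch_por_enable_all_32(void)
{
	uint64_t por = (~0u / SKETCH_POE_MASK) * SKETCH_POE_RXW;

	assert((por >> 32) == 0);
	return por;
}

/* 64-bit variant: the same trick fills all 16 nibbles. */
static uint64_t sketch_por_enable_all_64(void)
{
	return (~0ull / SKETCH_POE_MASK) * SKETCH_POE_RXW;
}
```

With these assumed constants the first function yields 0x77777777 and the second 0x7777777777777777, which is the whole difference between ~0u and a 64-bit all-ones operand.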

>> @@ -907,6 +964,7 @@ SYSCALL_DEFINE0(rt_sigreturn)
>>  {
>>  	struct pt_regs *regs = current_pt_regs();
>>  	struct rt_sigframe __user *frame;
>> +	struct user_access_state ua_state;
>>  
>>  	/* Always make any pending restarted system calls return -EINTR */
>>  	current->restart_block.fn = do_no_restart_syscall;
>> @@ -923,12 +981,14 @@ SYSCALL_DEFINE0(rt_sigreturn)
>>  	if (!access_ok(frame, sizeof (*frame)))
>>  		goto badframe;
>>  
>> -	if (restore_sigframe(regs, frame))
>> +	if (restore_sigframe(regs, frame, &ua_state))
>>  		goto badframe;
>>  
>>  	if (restore_altstack(&frame->uc.uc_stack))
>>  		goto badframe;
>>  
>> +	restore_user_access_state(&ua_state);
>> +
>>  	return regs->regs[0];
>>  
>>  badframe:
> The saving part I'm fine with. For restoring, I was wondering whether we
> can get a more privileged POR_EL0 if reading the frame somehow failed.
> This is largely theoretical, there are other ways to attack like
> writing POR_EL0 directly than unmapping/remapping the signal stack.
>
> What I'd change here is always restore_user_access_state() to
> POR_EL0_INIT. Maybe just initialise ua_state above and add the function
> call after the badframe label.

I'm not sure I understand. When we enter this function, POR_EL0 is set
to whatever the signal handler set it to (POR_EL0_INIT by default).
There are then two cases:
1) Everything succeeds, including reading the saved POR_EL0 from the
frame. We then call restore_user_access_state(), setting POR_EL0 to the
value we've read, and return to userspace.
2) Any uaccess fails (for instance reading POR_EL0). In that case we
leave POR_EL0 unchanged and deliver SIGSEGV.

In case 2 POR_EL0 is most likely already set to POR_EL0_INIT, or
whatever the signal handler set it to. It's not clear to me that forcing
it to POR_EL0_INIT helps much. Either way it's doubtful that the SIGSEGV
handler will be able to recover, since the new signal frame we will
create for it may be a mix of interrupted state and signal handler state
(depending on exactly where we fail).
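
The two cases above, plus the alternative of forcing POR_EL0_INIT on failure, can be modelled in a few lines of userspace C. This is a toy model only, not kernel code: the sketch_ names are hypothetical, a plain variable stands in for the register, and the value used for POR_EL0_INIT (full access to pkey 0 only) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_POR_EL0_INIT	0x7ull	/* assumed: all permissions for pkey 0 only */

static uint64_t sketch_por_el0;	/* stands in for the hardware register */

/*
 * Toy model of the sigreturn path.
 * frame_ok: whether every read from the signal frame succeeded.
 * saved_por: the POR_EL0 value stored in the frame.
 * force_init_on_failure: model the "always reset to INIT" alternative.
 * Returns true on success, false where SIGSEGV would be delivered.
 */
static bool sketch_sigreturn(bool frame_ok, uint64_t saved_por,
			     bool force_init_on_failure)
{
	if (!frame_ok) {
		/* Case 2: as posted, the register is left unchanged. */
		if (force_init_on_failure)
			sketch_por_el0 = SKETCH_POR_EL0_INIT;
		return false;
	}

	/* Case 1: restore the value read from the frame. */
	sketch_por_el0 = saved_por;
	assert(sketch_por_el0 == saved_por);
	return true;
}
```

In this model the only observable difference between the two behaviours is the register value seen by whatever runs after a failed sigreturn.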

Kevin



* Re: [PATCH v2 3/5] arm64: signal: Improve POR_EL0 handling to avoid uaccess failures
  2024-10-24 14:55     ` Kevin Brodsky
@ 2024-10-24 15:42       ` Catalin Marinas
  2024-10-24 16:19         ` Dave Martin
  0 siblings, 1 reply; 19+ messages in thread
From: Catalin Marinas @ 2024-10-24 15:42 UTC (permalink / raw)
  To: Kevin Brodsky
  Cc: linux-arm-kernel, akpm, anshuman.khandual, aruna.ramakrishna,
	broonie, dave.hansen, dave.martin, jeffxu, joey.gouly,
	pierre.langlois, shuah, sroettger, will, linux-kselftest, x86

On Thu, Oct 24, 2024 at 04:55:48PM +0200, Kevin Brodsky wrote:
> On 24/10/2024 12:59, Catalin Marinas wrote:
> > On Wed, Oct 23, 2024 at 04:05:09PM +0100, Kevin Brodsky wrote:
> >> +/*
> >> + * Save the unpriv access state into ua_state and reset it to disable any
> >> + * restrictions.
> >> + */
> >> +static void save_reset_user_access_state(struct user_access_state *ua_state)
> >> +{
> >> +	if (system_supports_poe()) {
> >> +		/*
> >> +		 * Enable all permissions in all 8 keys
> >> +		 * (inspired by REPEAT_BYTE())
> >> +		 */
> >> +		u64 por_enable_all = (~0u / POE_MASK) * POE_RXW;
> > I think this should be ~0ul.
> 
> It is ~0u on purpose, because unlike in REPEAT_BYTE(), I only wanted the
> lower 32 bits to be filled with POE_RXW (we only have 8 keys, the top 32
> bits are RES0). That said, given that D128 has 4-bit pkeys, we could
> anticipate and fill the top 32 bits too (should make no difference on D64).

I guess we could leave it as 32-bit for now and remember to update it
when we enable more keys with D128. Setting the top RES0 bits doesn't
hurt either since they are already documented in the Arm ARM. Up to you,
it's fine like above as well.

> >> @@ -907,6 +964,7 @@ SYSCALL_DEFINE0(rt_sigreturn)
> >>  {
> >>  	struct pt_regs *regs = current_pt_regs();
> >>  	struct rt_sigframe __user *frame;
> >> +	struct user_access_state ua_state;
> >>  
> >>  	/* Always make any pending restarted system calls return -EINTR */
> >>  	current->restart_block.fn = do_no_restart_syscall;
> >> @@ -923,12 +981,14 @@ SYSCALL_DEFINE0(rt_sigreturn)
> >>  	if (!access_ok(frame, sizeof (*frame)))
> >>  		goto badframe;
> >>  
> >> -	if (restore_sigframe(regs, frame))
> >> +	if (restore_sigframe(regs, frame, &ua_state))
> >>  		goto badframe;
> >>  
> >>  	if (restore_altstack(&frame->uc.uc_stack))
> >>  		goto badframe;
> >>  
> >> +	restore_user_access_state(&ua_state);
> >> +
> >>  	return regs->regs[0];
> >>  
> >>  badframe:
> > The saving part I'm fine with. For restoring, I was wondering whether we
> > can get a more privileged POR_EL0 if reading the frame somehow failed.
> > This is largely theoretical, there are other ways to attack like
> > writing POR_EL0 directly than unmapping/remapping the signal stack.
> >
> > What I'd change here is always restore_user_access_state() to
> > POR_EL0_INIT. Maybe just initialise ua_state above and add the function
> > call after the badframe label.
> 
> I'm not sure I understand. When we enter this function, POR_EL0 is set
> to whatever the signal handler set it to (POR_EL0_INIT by default).
> There are then two cases:
> 1) Everything succeeds, including reading the saved POR_EL0 from the
> frame. We then call restore_user_access_state(), setting POR_EL0 to the
> value we've read, and return to userspace.
> 2) Any uaccess fails (for instance reading POR_EL0). In that case we
> leave POR_EL0 unchanged and deliver SIGSEGV.
> 
> In case 2 POR_EL0 is most likely already set to POR_EL0_INIT, or
> whatever the signal handler set it to. It's not clear to me that forcing
> it to POR_EL0_INIT helps much. Either way it's doubtful that the SIGSEGV
> handler will be able to recover, since the new signal frame we will
> create for it may be a mix of interrupted state and signal handler state
> (depending on exactly where we fail).

If the SIGSEGV delivery succeeds, returning would restore the POR_EL0
set up by the previous signal handler, potentially more privileged. Does
it matter? Can it return all the way to the original context?

-- 
Catalin



* Re: [PATCH v2 3/5] arm64: signal: Improve POR_EL0 handling to avoid uaccess failures
  2024-10-24 15:42       ` Catalin Marinas
@ 2024-10-24 16:19         ` Dave Martin
  2024-10-25  8:24           ` Kevin Brodsky
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Martin @ 2024-10-24 16:19 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Kevin Brodsky, linux-arm-kernel, akpm, anshuman.khandual,
	aruna.ramakrishna, broonie, dave.hansen, jeffxu, joey.gouly,
	pierre.langlois, shuah, sroettger, will, linux-kselftest, x86

On Thu, Oct 24, 2024 at 04:42:10PM +0100, Catalin Marinas wrote:
> On Thu, Oct 24, 2024 at 04:55:48PM +0200, Kevin Brodsky wrote:
> > On 24/10/2024 12:59, Catalin Marinas wrote:
> > > On Wed, Oct 23, 2024 at 04:05:09PM +0100, Kevin Brodsky wrote:
> > >> +/*
> > >> + * Save the unpriv access state into ua_state and reset it to disable any
> > >> + * restrictions.
> > >> + */
> > >> +static void save_reset_user_access_state(struct user_access_state *ua_state)
> > >> +{
> > >> +	if (system_supports_poe()) {
> > >> +		/*
> > >> +		 * Enable all permissions in all 8 keys
> > >> +		 * (inspired by REPEAT_BYTE())
> > >> +		 */
> > >> +		u64 por_enable_all = (~0u / POE_MASK) * POE_RXW;
> > > I think this should be ~0ul.
> > 
> > It is ~0u on purpose, because unlike in REPEAT_BYTE(), I only wanted the
> > lower 32 bits to be filled with POE_RXW (we only have 8 keys, the top 32
> > bits are RES0). That said, given that D128 has 4-bit pkeys, we could
> > anticipate and fill the top 32 bits too (should make no difference on D64).
> 
> I guess we could leave it as 32-bit for now and remember to update it
> when we enable more keys with D128. Setting the top RES0 bits doesn't
> hurt either since they are already documented in the Arm ARM. Up to you,
> it's fine like above as well.

Can we maybe just have a brute-force loop that constructs the value
using the appropriate #define macros?

The compiler will const-fold it; I'd be prepared to bet that the
generated code would be identical...


> > >> @@ -907,6 +964,7 @@ SYSCALL_DEFINE0(rt_sigreturn)
> > >>  {
> > >>  	struct pt_regs *regs = current_pt_regs();
> > >>  	struct rt_sigframe __user *frame;
> > >> +	struct user_access_state ua_state;
> > >>  
> > >>  	/* Always make any pending restarted system calls return -EINTR */
> > >>  	current->restart_block.fn = do_no_restart_syscall;
> > >> @@ -923,12 +981,14 @@ SYSCALL_DEFINE0(rt_sigreturn)
> > >>  	if (!access_ok(frame, sizeof (*frame)))
> > >>  		goto badframe;
> > >>  
> > >> -	if (restore_sigframe(regs, frame))
> > >> +	if (restore_sigframe(regs, frame, &ua_state))
> > >>  		goto badframe;
> > >>  
> > >>  	if (restore_altstack(&frame->uc.uc_stack))
> > >>  		goto badframe;
> > >>  
> > >> +	restore_user_access_state(&ua_state);
> > >> +
> > >>  	return regs->regs[0];
> > >>  
> > >>  badframe:
> > > The saving part I'm fine with. For restoring, I was wondering whether we
> > > can get a more privileged POR_EL0 if reading the frame somehow failed.
> > > This is largely theoretical, there are other ways to attack like
> > > writing POR_EL0 directly than unmapping/remapping the signal stack.
> > >
> > > What I'd change here is always restore_user_access_state() to
> > > POR_EL0_INIT. Maybe just initialise ua_state above and add the function
> > > call after the badframe label.
> > 
> > I'm not sure I understand. When we enter this function, POR_EL0 is set
> > to whatever the signal handler set it to (POR_EL0_INIT by default).
> > There are then two cases:
> > 1) Everything succeeds, including reading the saved POR_EL0 from the
> > frame. We then call restore_user_access_state(), setting POR_EL0 to the
> > value we've read, and return to userspace.
> > 2) Any uaccess fails (for instance reading POR_EL0). In that case we
> > leave POR_EL0 unchanged and deliver SIGSEGV.
> > 
> > In case 2 POR_EL0 is most likely already set to POR_EL0_INIT, or
> > whatever the signal handler set it to. It's not clear to me that forcing
> > it to POR_EL0_INIT helps much. Either way it's doubtful that the SIGSEGV
> > handler will be able to recover, since the new signal frame we will
> > create for it may be a mix of interrupted state and signal handler state
> > (depending on exactly where we fail).
> 
> If the SIGSEGV delivery succeeds, returning would restore the POR_EL0
> set up by the previous signal handler, potentially more privileged. Does
> it matter? Can it return all the way to the original context?

That seems a valid concern.

It looks a bit like we don't back out the temporary change to POR_EL0
if writing the sigframe fails, so the temporary "allow all" perms might
get saved out into the SIGSEGV sigframe on the alternate signal
stack, and will then be restored as the user thread's POR_EL0 when the
SIGSEGV returns.

(This is all assuming that the force_sig(SIGSEGV) logic works properly
at all...  I'm still trying to puzzle it out!)

Cheers
---Dave



* Re: [PATCH v2 3/5] arm64: signal: Improve POR_EL0 handling to avoid uaccess failures
  2024-10-24 16:19         ` Dave Martin
@ 2024-10-25  8:24           ` Kevin Brodsky
  2024-10-25 11:04             ` Dave Martin
  2024-10-25 11:33             ` Dave Martin
  0 siblings, 2 replies; 19+ messages in thread
From: Kevin Brodsky @ 2024-10-25  8:24 UTC (permalink / raw)
  To: Dave Martin, Catalin Marinas
  Cc: linux-arm-kernel, akpm, anshuman.khandual, aruna.ramakrishna,
	broonie, dave.hansen, jeffxu, joey.gouly, pierre.langlois, shuah,
	sroettger, will, linux-kselftest, x86

On 24/10/2024 18:19, Dave Martin wrote:
> On Thu, Oct 24, 2024 at 04:42:10PM +0100, Catalin Marinas wrote:
>> On Thu, Oct 24, 2024 at 04:55:48PM +0200, Kevin Brodsky wrote:
>>> On 24/10/2024 12:59, Catalin Marinas wrote:
>>>> On Wed, Oct 23, 2024 at 04:05:09PM +0100, Kevin Brodsky wrote:
>>>>> +/*
>>>>> + * Save the unpriv access state into ua_state and reset it to disable any
>>>>> + * restrictions.
>>>>> + */
>>>>> +static void save_reset_user_access_state(struct user_access_state *ua_state)
>>>>> +{
>>>>> +	if (system_supports_poe()) {
>>>>> +		/*
>>>>> +		 * Enable all permissions in all 8 keys
>>>>> +		 * (inspired by REPEAT_BYTE())
>>>>> +		 */
>>>>> +		u64 por_enable_all = (~0u / POE_MASK) * POE_RXW;
>>>> I think this should be ~0ul.
>>> It is ~0u on purpose, because unlike in REPEAT_BYTE(), I only wanted the
>>> lower 32 bits to be filled with POE_RXW (we only have 8 keys, the top 32
>>> bits are RES0). That said, given that D128 has 4-bit pkeys, we could
>>> anticipate and fill the top 32 bits too (should make no difference on D64).
>> I guess we could leave it as 32-bit for now and remember to update it
>> when we enable more keys with D128. Setting the top RES0 bits doesn't
>> hurt either since they are already documented in the Arm ARM. Up to you,
>> it's fine like above as well.
> Can we maybe just have a brute-force loop that constructs the value
> using the appropriate #define macros?
>
> The compiler will const-fold it; I'd be prepared to bet that the
> generated code would be identical...

Fine by me, I suppose I was too eager to use the one-liner I had found
:) Building that value based on arch_max_pkey() is probably a better
idea in the long run. (And indeed the codegen is the same, it boils down
to a mov w0, #0x77777777 in both cases.)
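
A loop of the kind discussed above might look roughly like the following userspace sketch. The sketch_ names and constants are hypothetical: SKETCH_NR_PKEYS stands in for whatever arch_max_pkey() would provide, and the field width and R|X|W encoding are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_POE_RXW			0x7ull	/* assumed R|X|W encoding */
#define SKETCH_POR_BITS_PER_PKEY	4	/* assumed field width */
#define SKETCH_NR_PKEYS			8	/* stand-in for the arch limit */

/* Build the "all permissions for every key" value one key at a time. */
static uint64_t sketch_por_enable_all(void)
{
	uint64_t por = 0;
	int pkey;

	for (pkey = 0; pkey < SKETCH_NR_PKEYS; pkey++)
		por |= SKETCH_POE_RXW << (pkey * SKETCH_POR_BITS_PER_PKEY);

	assert(por != 0);
	return por;
}
```

The number of keys and the field width are now explicit rather than implied by the type of the literal, which is the point of the suggestion; with the values assumed here the result is again 0x77777777.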

>>>>> @@ -907,6 +964,7 @@ SYSCALL_DEFINE0(rt_sigreturn)
>>>>>  {
>>>>>  	struct pt_regs *regs = current_pt_regs();
>>>>>  	struct rt_sigframe __user *frame;
>>>>> +	struct user_access_state ua_state;
>>>>>  
>>>>>  	/* Always make any pending restarted system calls return -EINTR */
>>>>>  	current->restart_block.fn = do_no_restart_syscall;
>>>>> @@ -923,12 +981,14 @@ SYSCALL_DEFINE0(rt_sigreturn)
>>>>>  	if (!access_ok(frame, sizeof (*frame)))
>>>>>  		goto badframe;
>>>>>  
>>>>> -	if (restore_sigframe(regs, frame))
>>>>> +	if (restore_sigframe(regs, frame, &ua_state))
>>>>>  		goto badframe;
>>>>>  
>>>>>  	if (restore_altstack(&frame->uc.uc_stack))
>>>>>  		goto badframe;
>>>>>  
>>>>> +	restore_user_access_state(&ua_state);
>>>>> +
>>>>>  	return regs->regs[0];
>>>>>  
>>>>>  badframe:
>>>> The saving part I'm fine with. For restoring, I was wondering whether we
>>>> can get a more privileged POR_EL0 if reading the frame somehow failed.
>>>> This is largely theoretical, there are other ways to attack like
>>>> writing POR_EL0 directly than unmapping/remapping the signal stack.
>>>>
>>>> What I'd change here is always restore_user_access_state() to
>>>> POR_EL0_INIT. Maybe just initialise ua_state above and add the function
>>>> call after the badframe label.
>>> I'm not sure I understand. When we enter this function, POR_EL0 is set
>>> to whatever the signal handler set it to (POR_EL0_INIT by default).
>>> There are then two cases:
>>> 1) Everything succeeds, including reading the saved POR_EL0 from the
>>> frame. We then call restore_user_access_state(), setting POR_EL0 to the
>>> value we've read, and return to userspace.
>>> 2) Any uaccess fails (for instance reading POR_EL0). In that case we
>>> leave POR_EL0 unchanged and deliver SIGSEGV.
>>>
>>> In case 2 POR_EL0 is most likely already set to POR_EL0_INIT, or
>>> whatever the signal handler set it to. It's not clear to me that forcing
>>> it to POR_EL0_INIT helps much. Either way it's doubtful that the SIGSEGV
>>> handler will be able to recover, since the new signal frame we will
>>> create for it may be a mix of interrupted state and signal handler state
>>> (depending on exactly where we fail).
>> If the SIGSEGV delivery succeeds, returning would restore the POR_EL0
>> set up by the previous signal handler, potentially more privileged. Does
>> it matter? Can it return all the way to the original context?

What we store into the signal frame when delivering that SIGSEGV is a
mixture of the original state (up to the point of failure) and the
signal handler's state (what we couldn't restore). It's hard to reason
about how that SIGSEGV handler could possibly handle this, but in any
case it would have to massage its signal frame so that the next
sigreturn does the right thing. Restoring only part of the frame records
is bound to cause trouble and that's true for POR_EL0 as well - I doubt
there's much value in special-casing it.

>
> That seems a valid concern.
>
> It looks a bit like we don't back out the temporary change to POR_EL0
> if writing the sigframe fails, so the temporary "allow all" perms might
> get saved out into the SIGSEGV sigframe on the alternate signal
> stack, and will then be restored as the user thread's POR_EL0 when the
> SIGSEGV returns.

It sounds like you're referring to the delivery case, not the return
case. In the delivery case (setup_rt_frame()), the "allow all" value
will never be saved into the sigframe because we call
restore_user_access_state() if anything failed (this is new in v2,
exactly to prevent that scenario).
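
As a rough userspace model of the delivery path described above (not the kernel code: the sketch_ names are hypothetical, a variable stands in for the register, and both constants are assumptions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_POR_EL0_INIT	0x7ull		/* assumed: pkey 0 only */
#define SKETCH_POR_ENABLE_ALL	0x77777777ull	/* assumed "allow all" value */

static uint64_t sketch_por_el0;	/* stands in for the hardware register */

/*
 * Toy model of signal delivery.
 * write_ok: whether writing the frame to the user stack succeeded.
 * frame_por: receives the value recorded in the frame on success.
 * Returns 0 on success, -1 on failure.
 */
static int sketch_setup_frame(bool write_ok, uint64_t *frame_por)
{
	uint64_t saved = sketch_por_el0;	/* save the interrupted state */

	sketch_por_el0 = SKETCH_POR_ENABLE_ALL;	/* lift pkey restrictions for uaccess */

	if (!write_ok) {
		sketch_por_el0 = saved;		/* back out the temporary value */
		return -1;
	}

	*frame_por = saved;			/* the frame holds the saved value */
	assert(*frame_por != SKETCH_POR_ENABLE_ALL || saved == SKETCH_POR_ENABLE_ALL);
	sketch_por_el0 = SKETCH_POR_EL0_INIT;	/* the handler starts from INIT */
	return 0;
}
```

In this model the temporary "allow all" value is visible only between the save and either the back-out or the switch to INIT, so it never reaches the frame unless it was the interrupted context's own value.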

Kevin



* Re: [PATCH v2 4/5] selftests/mm: Use generic pkey register manipulation
  2024-10-23 16:51   ` Dave Hansen
@ 2024-10-25  8:31     ` Kevin Brodsky
  2024-10-25 15:09       ` Dave Hansen
  0 siblings, 1 reply; 19+ messages in thread
From: Kevin Brodsky @ 2024-10-25  8:31 UTC (permalink / raw)
  To: Dave Hansen, linux-arm-kernel
  Cc: akpm, anshuman.khandual, aruna.ramakrishna, broonie,
	catalin.marinas, dave.hansen, dave.martin, jeffxu, joey.gouly,
	pierre.langlois, shuah, sroettger, will, linux-kselftest, x86,
	Yury Khrustalev

On 23/10/2024 18:51, Dave Hansen wrote:
> On 10/23/24 08:05, Kevin Brodsky wrote:
> ...
>> diff --git a/tools/testing/selftests/mm/pkey-x86.h b/tools/testing/selftests/mm/pkey-x86.h
>> index 5f28e26a2511..53ed9a336ffe 100644
>> --- a/tools/testing/selftests/mm/pkey-x86.h
>> +++ b/tools/testing/selftests/mm/pkey-x86.h
>> @@ -34,6 +34,8 @@
>>  #define PAGE_SIZE		4096
>>  #define MB			(1<<20)
>>  
>> +#define PKEY_ALLOW_NONE		0x55555555
> Hi Kevin,
>
> Looking at this in context, I think "PKEY_ALLOW_NONE" is not a great
> name.  On one hand, we have:
>
> 	PKEY_DISABLE_ACCESS
> 	PKEY_DISABLE_WRITE
>
> which are values for *A* pkey.
>
> But PKEY_ALLOW_NONE is a whole register value and spans permissions for
> many keys.  We don't want folks trying to do something like:
>
> 	pkey_alloc(flags, PKEY_ALLOW_NONE);
>
> If I were naming it in x86 code, I'd probably call it:
>
> 	PKRU_ALLOW_NONE
>
> or something.

I agree, the naming is not ideal, I lacked inspiration! Maybe
PKEY_REG_ALLOW_NONE to remain generic?

>
>>  static inline void __page_o_noops(void)
>>  {
>>  	/* 8-bytes of instruction * 512 bytes = 1 page */
>> diff --git a/tools/testing/selftests/mm/pkey_sighandler_tests.c b/tools/testing/selftests/mm/pkey_sighandler_tests.c
>> index a8088b645ad6..b5e1767ee5d9 100644
>> --- a/tools/testing/selftests/mm/pkey_sighandler_tests.c
>> +++ b/tools/testing/selftests/mm/pkey_sighandler_tests.c
>> @@ -37,6 +37,8 @@ pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
>>  pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
>>  siginfo_t siginfo = {0};
>>  
>> +static u64 pkey_reg_no_access;
> Ideally, this would be a real const or a #define because it really is
> static.  Right?  Or is there something dynamic about the ARM
> implementation's value?

It isn't dynamic, no; the issue is that on architectures where pkeys
restrict execution we need to allow X for pkey 0. Of course it would be
possible to define PKEY_REG_ALLOW_NONE in such a way that X is allowed
for pkey 0, but I was concerned this might be misleading. No strong
opinion either way, happy to make it purely a macro, maybe with a better
name?

> ...
>>  	 * Setup alternate signal stack, which should be pkey_mprotect()ed by
>> @@ -142,7 +145,8 @@ static void *thread_segv_maperr_ptr(void *ptr)
>>  	syscall_raw(SYS_sigaltstack, (long)stack, 0, 0, 0, 0, 0);
>>  
>>  	/* Disable MPK 0.  Only MPK 1 is enabled. */
>> -	__write_pkey_reg(0x55555551);
>> +	pkey_reg = set_pkey_bits(pkey_reg_no_access, 1, 0);
>> +	__write_pkey_reg(pkey_reg);
> The existing magic numbers are not great, but could we do:
>
> #define PKEY_ALLOW_ALL 0x0
>
> So that this can be written like this:
>
> 	pkey_reg = PKRU_ALLOW_NONE;
> 	pkey_reg = set_pkey_bits(pkey_reg, 1, PKEY_ALLOW_ALL);
>
> That would get rid of the magic '0'.

Definitely better yes. But how about using Yury's uapi addition,
PKEY_UNRESTRICTED [1]?

[1]
https://lore.kernel.org/all/20241022120128.359652-1-yury.khrustalev@arm.com/

>
>>  	/* Segfault */
>>  	*bad = 1;
>> @@ -240,6 +244,7 @@ static void test_sigsegv_handler_with_different_pkey_for_stack(void)
>>  	int pkey;
>>  	int parent_pid = 0;
>>  	int child_pid = 0;
>> +	u64 pkey_reg;
>>  
>>  	sa.sa_flags = SA_SIGINFO | SA_ONSTACK;
>>  
>> @@ -257,7 +262,9 @@ static void test_sigsegv_handler_with_different_pkey_for_stack(void)
>>  	assert(stack != MAP_FAILED);
>>  
>>  	/* Allow access to MPK 0 and MPK 1 */
>> -	__write_pkey_reg(0x55555550);
>> +	pkey_reg = set_pkey_bits(pkey_reg_no_access, 0, 0);
>> +	pkey_reg = set_pkey_bits(pkey_reg, 1, 0);
>> +	__write_pkey_reg(pkey_reg);
> ... and using the pattern from above, this is quite a bit more readable:
>
> 	pkey_reg = PKRU_ALLOW_NONE;
> 	pkey_reg = set_pkey_bits(pkey_reg, 0, PKEY_ALLOW_ALL);
> 	pkey_reg = set_pkey_bits(pkey_reg, 1, PKEY_ALLOW_ALL);
>
> ...
>> +	/* Only allow X for MPK 0 and nothing for other keys */
>> +	pkey_reg_no_access = set_pkey_bits(PKEY_ALLOW_NONE, 0,
>> +					   PKEY_DISABLE_ACCESS);
> If the comment says "only allow X", then I'd expect the code to say:
>
> 	pkey_reg_no_access = set_pkey_bits(PKEY_ALLOW_NONE, 0, PKEY_X);
>
> ... or something similar.

I could #define PKEY_X PKEY_DISABLE_ACCESS but is the mixture of
negative and positive polarity really helping? We cannot define PKEY_R
and PKEY_W so that (for instance) PKEY_R | PKEY_X does what it says.
Having to use PKEY_DISABLE_ACCESS to mean "X only" is not ideal, but
this is what userspace already has to do.

Either way if we define PKEY_REG_ALLOW_NONE or similar to allow X for
pkey 0 as suggested then this will go.
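
To illustrate the polarity point, here is a userspace sketch of how negative PKEY_DISABLE_* flags could translate into positive overlay permissions. Everything in it is an assumption for illustration: the sketch_ names are hypothetical, the flag values follow the generic uapi definitions, and the R/X/W bit encoding is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Generic uapi flag values (negative polarity). */
#define SKETCH_PKEY_DISABLE_ACCESS	0x1u
#define SKETCH_PKEY_DISABLE_WRITE	0x2u

/* Assumed overlay permission bits (positive polarity). */
#define SKETCH_POE_R	0x1ull
#define SKETCH_POE_X	0x2ull
#define SKETCH_POE_W	0x4ull

/* Start from "everything allowed" and remove what the flags disable. */
static uint64_t sketch_flags_to_perm(unsigned int flags)
{
	uint64_t perm = SKETCH_POE_R | SKETCH_POE_X | SKETCH_POE_W;

	if (flags & SKETCH_PKEY_DISABLE_WRITE)
		perm &= ~SKETCH_POE_W;
	if (flags & SKETCH_PKEY_DISABLE_ACCESS)
		perm &= ~(SKETCH_POE_R | SKETCH_POE_W);

	assert(perm & SKETCH_POE_X);	/* neither flag can remove X */
	return perm;
}
```

Under these assumptions PKEY_DISABLE_ACCESS leaves only X, which is why "X only" has to be spelled with a negative flag, and no combination of the two flags can express "R and X but not W" other than PKEY_DISABLE_WRITE.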

Thanks for the review!

Kevin




* Re: [PATCH v2 3/5] arm64: signal: Improve POR_EL0 handling to avoid uaccess failures
  2024-10-25  8:24           ` Kevin Brodsky
@ 2024-10-25 11:04             ` Dave Martin
  2024-10-25 11:33             ` Dave Martin
  1 sibling, 0 replies; 19+ messages in thread
From: Dave Martin @ 2024-10-25 11:04 UTC (permalink / raw)
  To: Kevin Brodsky
  Cc: Catalin Marinas, linux-arm-kernel, akpm, anshuman.khandual,
	aruna.ramakrishna, broonie, dave.hansen, jeffxu, joey.gouly,
	pierre.langlois, shuah, sroettger, will, linux-kselftest, x86

On Fri, Oct 25, 2024 at 10:24:41AM +0200, Kevin Brodsky wrote:
> On 24/10/2024 18:19, Dave Martin wrote:
> > On Thu, Oct 24, 2024 at 04:42:10PM +0100, Catalin Marinas wrote:
> >> On Thu, Oct 24, 2024 at 04:55:48PM +0200, Kevin Brodsky wrote:
> >>> On 24/10/2024 12:59, Catalin Marinas wrote:
> >>>> On Wed, Oct 23, 2024 at 04:05:09PM +0100, Kevin Brodsky wrote:
> >>>>> +/*
> >>>>> + * Save the unpriv access state into ua_state and reset it to disable any
> >>>>> + * restrictions.
> >>>>> + */
> >>>>> +static void save_reset_user_access_state(struct user_access_state *ua_state)
> >>>>> +{
> >>>>> +	if (system_supports_poe()) {
> >>>>> +		/*
> >>>>> +		 * Enable all permissions in all 8 keys
> >>>>> +		 * (inspired by REPEAT_BYTE())
> >>>>> +		 */
> >>>>> +		u64 por_enable_all = (~0u / POE_MASK) * POE_RXW;
> >>>> I think this should be ~0ul.
> >>> It is ~0u on purpose, because unlike in REPEAT_BYTE(), I only wanted the
> >>> lower 32 bits to be filled with POE_RXW (we only have 8 keys, the top 32
> >>> bits are RES0). That said, given that D128 has 4-bit pkeys, we could
> >>> anticipate and fill the top 32 bits too (should make no difference on D64).
> >> I guess we could leave it as 32-bit for now and remember to update it
> >> when we enable more keys with D128. Setting the top RES0 bits doesn't
> >> hurt either since they are already documented in the Arm ARM. Up to you,
> >> it's fine like above as well.
> > Can we maybe just have a brute-force loop that constructs the value
> > using the appropriate #define macros?
> >
> > The compiler will const-fold it; I'd be prepared to bet that the
> > generated code would be identical...
> 
> Fine by me, I suppose I was too eager to use the one-liner I had found
> :) Building that value based on arch_max_pkey() is probably a better
> idea in the long run. (And indeed the codegen is the same, it boils down
> to a mov w0, #0x77777777 in both cases.)

The one-liner was a neat trick (after the brief WTF moment) :)

I guess my uneasiness comes from baking the number of pkeys in via the
type of 0u and an implicit relationship that this happens to have with
the number of bits per pkey in the POR.
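
To make the comparison concrete, here is a minimal standalone sketch (not
code from the patch) of the two ways of building the value. The constants
are assumptions chosen to match the 0x77777777 quoted above (4 bits per
key, RXW = 0x7), and the function names are made up for illustration; the
nr_pkeys parameter stands in for whatever arch_max_pkey() would provide.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, chosen to match the 0x77777777 constant quoted above. */
#define POE_RXW			0x7u	/* read + execute + write for one key */
#define POE_MASK		0xfu	/* one 4-bit permission field */
#define POR_BITS_PER_PKEY	4

/*
 * The one-liner: ~0u / 0xf is 0x11111111, and multiplying by 0x7 puts
 * RXW into each of the eight nibbles. The key count is implied by the
 * 32-bit width of 0u.
 */
uint64_t por_enable_all_trick(void)
{
	return (~0u / POE_MASK) * POE_RXW;
}

/* The brute-force loop: the number of keys is an explicit input. */
uint64_t por_enable_all_loop(int nr_pkeys)
{
	uint64_t por = 0;

	assert(nr_pkeys >= 0 && nr_pkeys * POR_BITS_PER_PKEY <= 64);

	for (int pkey = 0; pkey < nr_pkeys; pkey++)
		por |= (uint64_t)POE_RXW << (pkey * POR_BITS_PER_PKEY);

	return por;
}
```

With nr_pkeys = 8 both functions yield the same value; with 16 keys only
the loop changes its result, which is the implicit coupling I was uneasy
about.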

[...]

Cheers
---Dave



* Re: [PATCH v2 3/5] arm64: signal: Improve POR_EL0 handling to avoid uaccess failures
  2024-10-25  8:24           ` Kevin Brodsky
  2024-10-25 11:04             ` Dave Martin
@ 2024-10-25 11:33             ` Dave Martin
  2024-10-25 15:34               ` Kevin Brodsky
  1 sibling, 1 reply; 19+ messages in thread
From: Dave Martin @ 2024-10-25 11:33 UTC (permalink / raw)
  To: Kevin Brodsky
  Cc: Catalin Marinas, linux-arm-kernel, akpm, anshuman.khandual,
	aruna.ramakrishna, broonie, dave.hansen, jeffxu, joey.gouly,
	pierre.langlois, shuah, sroettger, will, linux-kselftest, x86

On Fri, Oct 25, 2024 at 10:24:41AM +0200, Kevin Brodsky wrote:
> On 24/10/2024 18:19, Dave Martin wrote:
> > On Thu, Oct 24, 2024 at 04:42:10PM +0100, Catalin Marinas wrote:
> >> On Thu, Oct 24, 2024 at 04:55:48PM +0200, Kevin Brodsky wrote:
> >>> On 24/10/2024 12:59, Catalin Marinas wrote:
> >>>> On Wed, Oct 23, 2024 at 04:05:09PM +0100, Kevin Brodsky wrote:
> >>>>> +/*
> >>>>> + * Save the unpriv access state into ua_state and reset it to disable any
> >>>>> + * restrictions.
> >>>>> + */
> >>>>> +static void save_reset_user_access_state(struct user_access_state *ua_state)
> >>>>> +{
> >>>>> +	if (system_supports_poe()) {
> >>>>> +		/*
> >>>>> +		 * Enable all permissions in all 8 keys
> >>>>> +		 * (inspired by REPEAT_BYTE())
> >>>>> +		 */
> >>>>> +		u64 por_enable_all = (~0u / POE_MASK) * POE_RXW;
> >>>> I think this should be ~0ul.
> >>> It is ~0u on purpose, because unlike in REPEAT_BYTE(), I only wanted the
> >>> lower 32 bits to be filled with POE_RXW (we only have 8 keys, the top 32
> >>> bits are RES0). That said, given that D128 has 4-bit pkeys, we could
> >>> anticipate and fill the top 32 bits too (should make no difference on D64).
> >> I guess we could leave it as 32-bit for now and remember to update it
> >> when we enable more keys with D128. Setting the top RES0 bits doesn't
> >> hurt either since they are already documented in the Arm ARM. Up to you,
> >> it's fine like above as well.
> > Can we maybe just have a brute-force loop that constructs the value
> > using the appropriate #define macros?
> >
> > The compiler will const-fold it; I'd be prepared to bet that the
> > generated code would be identical...
> 
> Fine by me, I suppose I was too eager to use the one-liner I had found
> :) Building that value based on arch_max_pkey() is probably a better
> idea in the long run. (And indeed the codegen is the same, it boils down
> to a mov w0, #0x77777777 in both cases.)
> 
> >>>>> @@ -907,6 +964,7 @@ SYSCALL_DEFINE0(rt_sigreturn)
> >>>>>  {
> >>>>>  	struct pt_regs *regs = current_pt_regs();
> >>>>>  	struct rt_sigframe __user *frame;
> >>>>> +	struct user_access_state ua_state;
> >>>>>  
> >>>>>  	/* Always make any pending restarted system calls return -EINTR */
> >>>>>  	current->restart_block.fn = do_no_restart_syscall;
> >>>>> @@ -923,12 +981,14 @@ SYSCALL_DEFINE0(rt_sigreturn)
> >>>>>  	if (!access_ok(frame, sizeof (*frame)))
> >>>>>  		goto badframe;
> >>>>>  
> >>>>> -	if (restore_sigframe(regs, frame))
> >>>>> +	if (restore_sigframe(regs, frame, &ua_state))
> >>>>>  		goto badframe;
> >>>>>  
> >>>>>  	if (restore_altstack(&frame->uc.uc_stack))
> >>>>>  		goto badframe;
> >>>>>  
> >>>>> +	restore_user_access_state(&ua_state);
> >>>>> +
> >>>>>  	return regs->regs[0];
> >>>>>  
> >>>>>  badframe:
> >>>> The saving part I'm fine with. For restoring, I was wondering whether we
> >>>> can get a more privileged POR_EL0 if reading the frame somehow failed.
> >>>> This is largely theoretical, there are other ways to attack like
> >>>> writing POR_EL0 directly than unmapping/remapping the signal stack.
> >>>>
> >>>> What I'd change here is always restore_user_access_state() to
> >>>> POR_EL0_INIT. Maybe just initialise ua_state above and add the function
> >>>> call after the badframe label.
> >>> I'm not sure I understand. When we enter this function, POR_EL0 is set
> >>> to whatever the signal handler set it to (POR_EL0_INIT by default).
> >>> There are then two cases:
> >>> 1) Everything succeeds, including reading the saved POR_EL0 from the
> >>> frame. We then call restore_user_access_state(), setting POR_EL0 to the
> >>> value we've read, and return to userspace.
> >>> 2) Any uaccess fails (for instance reading POR_EL0). In that case we
> >>> leave POR_EL0 unchanged and deliver SIGSEGV.
> >>>
> >>> In case 2 POR_EL0 is most likely already set to POR_EL0_INIT, or
> >>> whatever the signal handler set it to. It's not clear to me that forcing
> >>> it to POR_EL0_INIT helps much. Either way it's doubtful that the SIGSEGV
> >>> handler will be able to recover, since the new signal frame we will
> >>> create for it may be a mix of interrupted state and signal handler state
> >>> (depending on exactly where we fail).
> >> If the SIGSEGV delivery succeeds, returning would restore the POR_EL0
> >> set up by the previous signal handler, potentially more privileged. Does
> >> it matter? Can it return all the way to the original context?
> 
> What we store into the signal frame when delivering that SIGSEGV is a
> mixture of the original state (up to the point of failure) and the
> signal handler's state (what we couldn't restore). It's hard to reason
> about how that SIGSEGV handler could possibly handle this, but in any
> case it would have to massage its signal frame so that the next
> sigreturn does the right thing. Restoring only part of the frame records
> is bound to cause trouble and that's true for POR_EL0 as well - I doubt
> there's much value in special-casing it.

This feels like a simplification?

We can leave a mix of restored and unrestored state when generating the
SIGSEGV signal frame, providing that those changes will make no
difference when the rt_sigreturn is replayed.

POR_EL0 will make a difference, though.

The POR_EL0 image in the SIGSEGV signal frame needs to be the same value
that caused the original rt_sigreturn to barf (if this is what caused
the barf).  It should be up to the SIGSEGV handler to decide what (if
anything) to do about that.  The kernel can't know what userspace
intended.

Note that for this to work, the SIGSEGV stack (whether main or
alternate) must be accessible with POR_EL0_INIT permissions, or the
SIGSEGV handler must start with a (gross) asm shim to establish a
usable POR_EL0.  But that's not really our problem here.

(I'm not saying that the kernel necessarily fails to do this -- I
haven't checked -- but just trying to understand the problem here.)


The actual problem here is that if the SIGSEGV handler wants to bail
out with a siglongjmp(), there is no way to determine the correct value
of POR_EL0 to restore.

I wonder whether POR_EL0 should be saved in sigjmp_buf (depending on
whether sigjmp_buf is horribly inextensible and also full up, or merely
horribly inextensible).

Does anyone know whether PKRU is saved in sigjmp_buf on x86?

> 
> >
> > That seems a valid concern.
> >
> > It looks a bit like we don't back out the temporary change to POR_EL0
> > if writing the sigframe fails, so the temporary "allow all" perms might
> > get saved out into the SIGSEGV sigframe on the alternate signal
> > stack, and will then be restored as the user thread's POR_EL0 when the
> > SIGSEGV returns.
> 
> It sounds like you're referring to the delivery case, not the return
> case. In the delivery case (setup_rt_frame()), the "allow all" value
> will never be saved into the sigframe because we call
> restore_user_access_state() if anything failed (this is new in v2,
> exactly to prevent that scenario).

Ah, right -- I missed that detail.

Cheers
---Dave



* Re: [PATCH v2 4/5] selftests/mm: Use generic pkey register manipulation
  2024-10-25  8:31     ` Kevin Brodsky
@ 2024-10-25 15:09       ` Dave Hansen
  2024-10-28 10:20         ` Kevin Brodsky
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Hansen @ 2024-10-25 15:09 UTC (permalink / raw)
  To: Kevin Brodsky, linux-arm-kernel
  Cc: akpm, anshuman.khandual, aruna.ramakrishna, broonie,
	catalin.marinas, dave.hansen, dave.martin, jeffxu, joey.gouly,
	pierre.langlois, shuah, sroettger, will, linux-kselftest, x86,
	Yury Khrustalev

On 10/25/24 01:31, Kevin Brodsky wrote:
> I agree, the naming is not ideal, I lacked inspiration! Maybe
> PKEY_REG_ALLOW_NONE to remain generic?

Works for me.

>>>  static inline void __page_o_noops(void)
>>>  {
>>>  	/* 8-bytes of instruction * 512 bytes = 1 page */
>>> diff --git a/tools/testing/selftests/mm/pkey_sighandler_tests.c b/tools/testing/selftests/mm/pkey_sighandler_tests.c
>>> index a8088b645ad6..b5e1767ee5d9 100644
>>> --- a/tools/testing/selftests/mm/pkey_sighandler_tests.c
>>> +++ b/tools/testing/selftests/mm/pkey_sighandler_tests.c
>>> @@ -37,6 +37,8 @@ pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
>>>  pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
>>>  siginfo_t siginfo = {0};
>>>  
>>> +static u64 pkey_reg_no_access;
>> Ideally, this would be a real const or a #define because it really is
>> static.  Right?  Or is there something dynamic about the ARM
>> implementation's value?
> 
> It isn't dynamic no, the issue is that on architectures where pkeys
> restrict execution we need to allow X for pkey 0. Of course it would be
> possible to define PKEY_REG_ALLOW_ALL in such a way that X is allowed
> for pkey 0, but I was concerned this might be misleading. No strong
> opinion either way, happy to make it purely a macro, maybe with a better
> name?

I do think we should differentiate the truly "no access" value from the one
that allows X on pkey 0, at least in the selftest.  Define a helper that
uses the *real* "no access" value:

/*
 * Returns the most restrictive register value
 * that can be used in the selftest.
 */
static inline u64 pkey_reg_restrictive_default(void)
{
	/*
	 * The selftest code runs (mostly) with its code mapped with
	 * pkey-0.  Allows execution on pkey-0 so that each site doesn't
	 * have to do this:
	 */
	return set_pkey_bits(PKEY_REG_NO_ACCESS, 0, PKEY_X);
}

and then use it like this:

	pkey_reg = pkey_reg_restrictive_default();
 	pkey_reg = set_pkey_bits(pkey_reg, 1, PKEY_ALLOW_ALL);

>>>  	 * Setup alternate signal stack, which should be pkey_mprotect()ed by
>>> @@ -142,7 +145,8 @@ static void *thread_segv_maperr_ptr(void *ptr)
>>>  	syscall_raw(SYS_sigaltstack, (long)stack, 0, 0, 0, 0, 0);
>>>  
>>>  	/* Disable MPK 0.  Only MPK 1 is enabled. */
>>> -	__write_pkey_reg(0x55555551);
>>> +	pkey_reg = set_pkey_bits(pkey_reg_no_access, 1, 0);
>>> +	__write_pkey_reg(pkey_reg);
>> The existing magic numbers are not great, but could we do:
>>
>> #define PKEY_ALLOW_ALL 0x0
>>
>> So that this can be written like this:
>>
>> 	pkey_reg = PKRU_ALLOW_NONE;
>> 	pkey_reg = set_pkey_bits(pkey_reg, 1, PKEY_ALLOW_ALL);
>>
>> That would get rid of the magic '0'.
> 
> Definitely better yes. But how about using Yury's uapi addition,
> PKEY_UNRESTRICTED [1]?
> 

Works for me.
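
For reference, a minimal standalone sketch of how the magic 0x55555551
falls out of the generic pattern on an x86-style layout. This is only an
illustration: the sketch_* names are hypothetical (the selftest has its
own set_pkey_bits()), the 2-bits-per-key layout is the x86 PKRU one, and
the value 0 for PKEY_UNRESTRICTED is an assumption about the uapi
addition mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Existing uapi flags (negative polarity: a set bit removes a permission). */
#define PKEY_DISABLE_ACCESS	0x1
#define PKEY_DISABLE_WRITE	0x2
/* Assumed value of the proposed uapi macro: no restriction bits set. */
#define PKEY_UNRESTRICTED	0x0

/* Hypothetical names for this sketch only, modelling the x86 PKRU layout. */
#define SKETCH_BITS_PER_PKEY	2
#define SKETCH_NR_PKEYS		16
#define SKETCH_REG_ALLOW_NONE	0x55555555u	/* access disabled for every key */

/* Replace the permission field of one pkey, leaving the others untouched. */
uint64_t sketch_set_pkey_bits(uint64_t reg, int pkey, uint64_t flags)
{
	int shift = pkey * SKETCH_BITS_PER_PKEY;
	uint64_t mask =
		(uint64_t)(PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE) << shift;

	assert(pkey >= 0 && pkey < SKETCH_NR_PKEYS);

	reg &= ~mask;			/* clear the old field */
	reg |= (flags << shift) & mask;	/* install the new one */
	return reg;
}
```

Starting from SKETCH_REG_ALLOW_NONE and setting pkey 1 to
PKEY_UNRESTRICTED gives back the constant the test used to hard-code,
without any magic '0' at the call site.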

>> ...
>>> +	/* Only allow X for MPK 0 and nothing for other keys */
>>> +	pkey_reg_no_access = set_pkey_bits(PKEY_ALLOW_NONE, 0,
>>> +					   PKEY_DISABLE_ACCESS);
>> If the comment says "only allow X", then I'd expect the code to say:
>>
>> 	pkey_reg_no_access = set_pkey_bits(PKEY_ALLOW_NONE, 0, PKEY_X);
>>
>> ... or something similar.
> 
> I could #define PKEY_X PKEY_DISABLE_ACCESS but is the mixture of
> negative and positive polarity really helping? We cannot define PKEY_R
> and PKEY_W so that (for instance) PKEY_R | PKEY_X does what it says.
> Having to use PKEY_DISABLE_ACCESS to mean "X only" is not ideal, but
> this is what userspace already has to do.

There would be some churn, but we could also convert the whole thing
over to just use explicit RWX enable bits, like in the
thread_segv_maperr_ptr() test:

	// Truly turn everything off:
	pkey_reg = PKEY_REG_NO_ACCESS;
 	pkey_reg = set_pkey_perm(pkey_reg, 1, PKEY_RW);

I'm not sure that's worth the churn though.





* Re: [PATCH v2 3/5] arm64: signal: Improve POR_EL0 handling to avoid uaccess failures
  2024-10-25 11:33             ` Dave Martin
@ 2024-10-25 15:34               ` Kevin Brodsky
  2024-11-18 15:06                 ` Dave Martin
  0 siblings, 1 reply; 19+ messages in thread
From: Kevin Brodsky @ 2024-10-25 15:34 UTC (permalink / raw)
  To: Dave Martin
  Cc: Catalin Marinas, linux-arm-kernel, akpm, anshuman.khandual,
	aruna.ramakrishna, broonie, dave.hansen, jeffxu, joey.gouly,
	pierre.langlois, shuah, sroettger, will, linux-kselftest, x86

On 25/10/2024 13:33, Dave Martin wrote:
> On Fri, Oct 25, 2024 at 10:24:41AM +0200, Kevin Brodsky wrote:
>>>>>>> @@ -907,6 +964,7 @@ SYSCALL_DEFINE0(rt_sigreturn)
>>>>>>>  {
>>>>>>>  	struct pt_regs *regs = current_pt_regs();
>>>>>>>  	struct rt_sigframe __user *frame;
>>>>>>> +	struct user_access_state ua_state;
>>>>>>>  
>>>>>>>  	/* Always make any pending restarted system calls return -EINTR */
>>>>>>>  	current->restart_block.fn = do_no_restart_syscall;
>>>>>>> @@ -923,12 +981,14 @@ SYSCALL_DEFINE0(rt_sigreturn)
>>>>>>>  	if (!access_ok(frame, sizeof (*frame)))
>>>>>>>  		goto badframe;
>>>>>>>  
>>>>>>> -	if (restore_sigframe(regs, frame))
>>>>>>> +	if (restore_sigframe(regs, frame, &ua_state))
>>>>>>>  		goto badframe;
>>>>>>>  
>>>>>>>  	if (restore_altstack(&frame->uc.uc_stack))
>>>>>>>  		goto badframe;
>>>>>>>  
>>>>>>> +	restore_user_access_state(&ua_state);
>>>>>>> +
>>>>>>>  	return regs->regs[0];
>>>>>>>  
>>>>>>>  badframe:
>>>>>> The saving part I'm fine with. For restoring, I was wondering whether we
>>>>>> can get a more privileged POR_EL0 if reading the frame somehow failed.
>>>>>> This is largely theoretical, there are other ways to attack like
>>>>>> writing POR_EL0 directly than unmapping/remapping the signal stack.
>>>>>>
>>>>>> What I'd change here is always restore_user_access_state() to
>>>>>> POR_EL0_INIT. Maybe just initialise ua_state above and add the function
>>>>>> call after the badframe label.
>>>>> I'm not sure I understand. When we enter this function, POR_EL0 is set
>>>>> to whatever the signal handler set it to (POR_EL0_INIT by default).
>>>>> There are then two cases:
>>>>> 1) Everything succeeds, including reading the saved POR_EL0 from the
>>>>> frame. We then call restore_user_access_state(), setting POR_EL0 to the
>>>>> value we've read, and return to userspace.
>>>>> 2) Any uaccess fails (for instance reading POR_EL0). In that case we
>>>>> leave POR_EL0 unchanged and deliver SIGSEGV.
>>>>>
>>>>> In case 2 POR_EL0 is most likely already set to POR_EL0_INIT, or
>>>>> whatever the signal handler set it to. It's not clear to me that forcing
>>>>> it to POR_EL0_INIT helps much. Either way it's doubtful that the SIGSEGV
>>>>> handler will be able to recover, since the new signal frame we will
>>>>> create for it may be a mix of interrupted state and signal handler state
>>>>> (depending on exactly where we fail).
>>>> If the SIGSEGV delivery succeeds, returning would restore the POR_EL0
>>>> set up by the previous signal handler, potentially more privileged. Does
>>>> it matter? Can it return all the way to the original context?
>> What we store into the signal frame when delivering that SIGSEGV is a
>> mixture of the original state (up to the point of failure) and the
>> signal handler's state (what we couldn't restore). It's hard to reason
>> about how that SIGSEGV handler could possibly handle this, but in any
>> case it would have to massage its signal frame so that the next
>> sigreturn does the right thing. Restoring only part of the frame records
>> is bound to cause trouble and that's true for POR_EL0 as well - I doubt
>> there's much value in special-casing it.
> This feels like a simplification?
>
> We can leave a mix of restored and unrestored state when generating the
> SIGSEGV signal frame, providing that those changes will make no
> difference when the rt_sigreturn is replayed.

I'm not sure I understand what this means. If the SIGSEGV handler were
to sigreturn without touching its signal frame, things are likely to
explode: it may be returning to the point where the original handler
called sigreturn, for instance (if the first uaccess failed during that
sigreturn call).

> POR_EL0 will make a difference, though.
>
> The POR_EL0 image in the SIGSEGV signal frame needs to be the same value
> that caused the original rt_sigreturn to barf (if this is what caused
> the barf).  It should be up to the SIGSEGV handler to decide what (if
> anything) to do about that.  The kernel can't know what userspace
> intended.

Unless I'm missing something this is exactly what happens now: what we
store in the SIGSEGV frame is the POR_EL0 value the original handler was
using.

> Note that for this to work, the SIGSEGV stack (whether main or
> alternate) must be accessible with POR_EL0_INIT permissions, or the
> SIGSEGV handler must start with a (gross) asm shim to establish a
> usable POR_EL0.  But that's not really our problem here.

This is indeed orthogonal - the SIGSEGV handler will be run with
POR_EL0_INIT, like any other handler. The value we store in the frame is
unrelated.

> (I'm not saying that the kernel necessarily fails to do this -- I
> haven't checked -- but just trying to understand the problem here.)
>
>
> The actual problem here is that if the SIGSEGV handler wants to bail
> out with a siglongjmp(), there is no way to determine the correct value
> of POR_EL0 to restore.

Correct, but again this is true of any other record - for instance TPIDR2.

> I wonder whether POR_EL0 should be saved in sigjmp_buf (depending on
> whether sigjmp_buf is horribly inextensible and also full up, or merely
> horribly inextensible).

It very much feels that this is the case - if a handler relies on
longjmp() or setcontext() to restore a known state, then POR_EL0 should
be part of that state.

>
> Does anyone know whether PKRU is saved in sigjmp_buf on x86?

I can't say for sure but I don't see PKRU being handled in
setjmp/longjmp in glibc at least.
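
If that is right, userspace that bails out of a handler with siglongjmp()
would have to carry the pkey register alongside the jump buffer itself.
A minimal sketch of that idea, under loud assumptions: fake_pkey_reg is a
plain variable standing in for POR_EL0/PKRU (a real program would read
and write the register with arch-specific code), and the struct and
function names are hypothetical, not an existing libc interface.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <stdint.h>

/* Stand-in for the real pkey register (POR_EL0 / PKRU); illustrative only. */
uint64_t fake_pkey_reg = 0x7;

/* Hypothetical wrapper: a sigjmp_buf plus the register value to restore. */
struct pkey_jmp_ctx {
	sigjmp_buf env;
	uint64_t saved_pkey_reg;
};

/* Example callee: changes the register, then bails out as a handler might. */
void clobber_and_bail(struct pkey_jmp_ctx *ctx)
{
	fake_pkey_reg = 0x77777777;
	siglongjmp(ctx->env, 1);
}

/* Returns 0 if fn returned normally, -1 if it bailed out via siglongjmp(). */
int run_with_pkey_recovery(void (*fn)(struct pkey_jmp_ctx *))
{
	struct pkey_jmp_ctx ctx;

	assert(fn);

	/* Record the register before arming the jump buffer. */
	ctx.saved_pkey_reg = fake_pkey_reg;

	if (sigsetjmp(ctx.env, 1) != 0) {
		/* The signal mask came back with the jump; the register did not. */
		fake_pkey_reg = ctx.saved_pkey_reg;
		return -1;
	}

	fn(&ctx);
	return 0;
}
```

This only covers the case where the jump target knows which value it
wants; it does not help a handler recover the value that was live in the
interrupted context.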

Kevin



* Re: [PATCH v2 4/5] selftests/mm: Use generic pkey register manipulation
  2024-10-25 15:09       ` Dave Hansen
@ 2024-10-28 10:20         ` Kevin Brodsky
  0 siblings, 0 replies; 19+ messages in thread
From: Kevin Brodsky @ 2024-10-28 10:20 UTC (permalink / raw)
  To: Dave Hansen, linux-arm-kernel
  Cc: akpm, anshuman.khandual, aruna.ramakrishna, broonie,
	catalin.marinas, dave.hansen, dave.martin, jeffxu, joey.gouly,
	pierre.langlois, shuah, sroettger, will, linux-kselftest, x86,
	Yury Khrustalev

On 25/10/2024 17:09, Dave Hansen wrote:
> [...]
>>>>  static inline void __page_o_noops(void)
>>>>  {
>>>>  	/* 8-bytes of instruction * 512 bytes = 1 page */
>>>> diff --git a/tools/testing/selftests/mm/pkey_sighandler_tests.c b/tools/testing/selftests/mm/pkey_sighandler_tests.c
>>>> index a8088b645ad6..b5e1767ee5d9 100644
>>>> --- a/tools/testing/selftests/mm/pkey_sighandler_tests.c
>>>> +++ b/tools/testing/selftests/mm/pkey_sighandler_tests.c
>>>> @@ -37,6 +37,8 @@ pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
>>>>  pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
>>>>  siginfo_t siginfo = {0};
>>>>  
>>>> +static u64 pkey_reg_no_access;
>>> Ideally, this would be a real const or a #define because it really is
>>> static.  Right?  Or is there something dynamic about the ARM
>>> implementation's value?
>> It isn't dynamic no, the issue is that on architectures where pkeys
>> restrict execution we need to allow X for pkey 0. Of course it would be
>> possible to define PKEY_REG_ALLOW_ALL in such a way that X is allowed
>> for pkey 0, but I was concerned this might be misleading. No strong
>> opinion either way, happy to make it purely a macro, maybe with a better
>> name?
> I do think we should differentiate the truly "no access" value from the one
> that allows X on pkey 0, at least in the selftest.  Define a helper that
> uses the *real* "no access" value:
>
> /*
>  * Returns the most restrictive register value
>  * that can be used in the selftest.
>  */
> static inline u64 pkey_reg_restrictive_default(void)
> {
> 	/*
> 	 * The selftest code runs (mostly) with its code mapped with
> 	 * pkey-0.  Allows execution on pkey-0 so that each site doesn't
> 	 * have to do this:
> 	 */
> 	return set_pkey_bits(PKEY_REG_NO_ACCESS, 0, PKEY_X);
> }
>
> and then use it like this:
>
> 	pkey_reg = pkey_reg_restrictive_default();
>  	pkey_reg = set_pkey_bits(pkey_reg, 1, PKEY_ALLOW_ALL);

Didn't think of introducing a helper, that looks good, thanks! Inlining
and optimisations should be able to reduce the call to a constant,
another win compared to using a global. Will go for that in v3.

> [...]
>>> ...
>>>> +	/* Only allow X for MPK 0 and nothing for other keys */
>>>> +	pkey_reg_no_access = set_pkey_bits(PKEY_ALLOW_NONE, 0,
>>>> +					   PKEY_DISABLE_ACCESS);
>>> If the comment says "only allow X", then I'd expect the code to say:
>>>
>>> 	pkey_reg_no_access = set_pkey_bits(PKEY_ALLOW_NONE, 0, PKEY_X);
>>>
>>> ... or something similar.
>> I could #define PKEY_X PKEY_DISABLE_ACCESS but is the mixture of
>> negative and positive polarity really helping? We cannot define PKEY_R
>> and PKEY_W so that (for instance) PKEY_R | PKEY_X does what it says.
>> Having to use PKEY_DISABLE_ACCESS to mean "X only" is not ideal, but
>> this is what userspace already has to do.
> There would be some churn, but we could also convert the whole thing
> over to just use explicit RWX enable bits, like in the
> thread_segv_maperr_ptr() test:
>
> 	// Truly turn everything off:
> 	pkey_reg = PKEY_REG_NO_ACCESS;
>  	pkey_reg = set_pkey_perm(pkey_reg, 1, PKEY_RW);

I suppose that not granting X permission for pkey 1 would make sense,
but since we don't have anything mapped with PROT_EXEC and pkey 1 the
benefits are minimal.

> I'm not sure that's worth the churn though.

I'm not convinced either. I'd rather keep using the standard uapi
macros, even though they can be confusing (but they at least combine in
the way you'd expect).

Kevin



* Re: [PATCH v2 3/5] arm64: signal: Improve POR_EL0 handling to avoid uaccess failures
  2024-10-25 15:34               ` Kevin Brodsky
@ 2024-11-18 15:06                 ` Dave Martin
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Martin @ 2024-11-18 15:06 UTC (permalink / raw)
  To: Kevin Brodsky
  Cc: Catalin Marinas, linux-arm-kernel, akpm, anshuman.khandual,
	aruna.ramakrishna, broonie, dave.hansen, jeffxu, joey.gouly,
	pierre.langlois, shuah, sroettger, will, linux-kselftest, x86

On Fri, Oct 25, 2024 at 05:34:24PM +0200, Kevin Brodsky wrote:
> On 25/10/2024 13:33, Dave Martin wrote:
> > On Fri, Oct 25, 2024 at 10:24:41AM +0200, Kevin Brodsky wrote:
> >>>>>>> @@ -907,6 +964,7 @@ SYSCALL_DEFINE0(rt_sigreturn)
> >>>>>>>  {
> >>>>>>>  	struct pt_regs *regs = current_pt_regs();
> >>>>>>>  	struct rt_sigframe __user *frame;
> >>>>>>> +	struct user_access_state ua_state;
> >>>>>>>  
> >>>>>>>  	/* Always make any pending restarted system calls return -EINTR */
> >>>>>>>  	current->restart_block.fn = do_no_restart_syscall;
> >>>>>>> @@ -923,12 +981,14 @@ SYSCALL_DEFINE0(rt_sigreturn)
> >>>>>>>  	if (!access_ok(frame, sizeof (*frame)))
> >>>>>>>  		goto badframe;
> >>>>>>>  
> >>>>>>> -	if (restore_sigframe(regs, frame))
> >>>>>>> +	if (restore_sigframe(regs, frame, &ua_state))
> >>>>>>>  		goto badframe;
> >>>>>>>  
> >>>>>>>  	if (restore_altstack(&frame->uc.uc_stack))
> >>>>>>>  		goto badframe;
> >>>>>>>  
> >>>>>>> +	restore_user_access_state(&ua_state);
> >>>>>>> +
> >>>>>>>  	return regs->regs[0];
> >>>>>>>  
> >>>>>>>  badframe:
> >>>>>> The saving part I'm fine with. For restoring, I was wondering whether we
> >>>>>> can get a more privileged POR_EL0 if reading the frame somehow failed.
> >>>>>> This is largely theoretical, there are other ways to attack like
> >>>>>> writing POR_EL0 directly than unmapping/remapping the signal stack.
> >>>>>>
> >>>>>> What I'd change here is always restore_user_access_state() to
> >>>>>> POR_EL0_INIT. Maybe just initialise ua_state above and add the function
> >>>>>> call after the badframe label.
> >>>>> I'm not sure I understand. When we enter this function, POR_EL0 is set
> >>>>> to whatever the signal handler set it to (POR_EL0_INIT by default).
> >>>>> There are then two cases:
> >>>>> 1) Everything succeeds, including reading the saved POR_EL0 from the
> >>>>> frame. We then call restore_user_access_state(), setting POR_EL0 to the
> >>>>> value we've read, and return to userspace.
> >>>>> 2) Any uaccess fails (for instance reading POR_EL0). In that case we
> >>>>> leave POR_EL0 unchanged and deliver SIGSEGV.
> >>>>>
> >>>>> In case 2 POR_EL0 is most likely already set to POR_EL0_INIT, or
> >>>>> whatever the signal handler set it to. It's not clear to me that forcing
> >>>>> it to POR_EL0_INIT helps much. Either way it's doubtful that the SIGSEGV
> >>>>> handler will be able to recover, since the new signal frame we will
> >>>>> create for it may be a mix of interrupted state and signal handler state
> >>>>> (depending on exactly where we fail).
> >>>> If the SIGSEGV delivery succeeds, returning would restore the POR_EL0
> >>>> set up by the previous signal handler, potentially more privileged. Does
> >>>> it matter? Can it return all the way to the original context?
> >> What we store into the signal frame when delivering that SIGSEGV is a
> >> mixture of the original state (up to the point of failure) and the
> >> signal handler's state (what we couldn't restore). It's hard to reason
> >> about how that SIGSEGV handler could possibly handle this, but in any
> >> case it would have to massage its signal frame so that the next
> >> sigreturn does the right thing. Restoring only part of the frame records
> >> is bound to cause trouble and that's true for POR_EL0 as well - I doubt
> >> there's much value in special-casing it.
> > This feels like a simplification?
> >
> > We can leave a mix of restored and unrestored state when generating the
> > SIGSEGV signal frame, providing that those changes will make no
> > difference when the rt_sigreturn is replayed.
> 
> I'm not sure I understand what this means. If the SIGSEGV handler were
> to sigreturn without touching its signal frame, things are likely to
> explode: it may be returning to the point where the original handler
> called sigreturn, for instance (if the first uaccess failed during that
> sigreturn call).

Not sure if this still matters, but the scenario I was thinking of is
where the initial sigreturn fails due to something under process
control (such as mprotect()) that the SIGSEGV handler may patch up before
returning.

If the initial sigreturn has restored X0, say, before it blows up, this
won't affect what the sigreturn does on the second pass after the
SIGSEGV handler fixes up whatever needs fixing up.

It does matter if sigreturn can blow up and trigger a SIGSEGV after
restoring parts of the thread state that will themselves affect the
replay of the interrupted sigreturn.

This use case is pretty marginal; I don't know how much software really
relies on it.


It does look like there are potential issues here: I think that we may be
able to barf after destructively restoring SP, which would mean that a
replay of the sigreturn would go wrong.

The POE support doesn't seem to make this worse though (the ua_state
restoration is done after confirming that the sigreturn can succeed,
if I understand right).
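
The ordering I have in mind can be modelled in a few lines. This is a
userspace simulation under stated assumptions, not the kernel code: the
sketch_* names are hypothetical, the boolean fields stand in for uaccess
operations that may fault, and 0x7 is assumed for the initial register
value (access to pkey 0 only).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_POR_INIT	0x7ull	/* assumed: RWX for pkey 0 only */

/* Stand-in for the live POR_EL0 register. */
uint64_t sketch_por_el0 = SKETCH_POR_INIT;

/* Simplified signal frame; the flags simulate whether each read faults. */
struct sketch_frame {
	uint64_t saved_por;
	bool por_readable;
	bool altstack_readable;
};

/* Staging area, playing the role of struct user_access_state. */
struct sketch_ua_state {
	uint64_t por_el0;
};

/* Reads the frame into the staging area only; the register is untouched. */
int sketch_restore_sigframe(const struct sketch_frame *frame,
			    struct sketch_ua_state *ua)
{
	if (!frame->por_readable)
		return -1;

	ua->por_el0 = frame->saved_por;
	return 0;
}

/* Returns 0 on success, -1 where the kernel would deliver SIGSEGV. */
int sketch_sigreturn(const struct sketch_frame *frame)
{
	struct sketch_ua_state ua;

	assert(frame);

	if (sketch_restore_sigframe(frame, &ua))
		goto badframe;

	if (!frame->altstack_readable)
		goto badframe;

	/* Every fallible step has passed: commit the staged value. */
	sketch_por_el0 = ua.por_el0;
	return 0;

badframe:
	/* The register still holds the failing handler's value. */
	return -1;
}
```

Because the commit is the last step, a failure anywhere earlier leaves
the handler's register value in place for the SIGSEGV frame.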

> 
> > POR_EL0 will make a difference, though.
> >
> > The POR_EL0 image in the SIGSEGV signal frame needs to be the same value
> > that caused the original rt_sigreturn to barf (if this is what caused
> > the barf).  It should be up to the SIGSEGV handler to decide what (if
> > anything) to do about that.  The kernel can't know what userspace
> > intended.
> 
> Unless I'm missing something this is exactly what happens now: what we
> store in the SIGSEGV frame is the POR_EL0 value the original handler was
> using.

Yep, I had forgotten how the SIGSEGV injection works with regard to this.

> 
> > Note that for this to work, the SIGSEGV stack (whether main or
> > alternate) must be accessible with POR_EL0_INIT permissions, or the
> > SIGSEGV handler must start with a (gross) asm shim to establish a
> > usable POR_EL0.  But that's not really our problem here.
> 
> This is indeed orthogonal - the SIGSEGV handler will be run with
> POR_EL0_INIT, like any other handler. The value we store in the frame is
> unrelated.
> 
> > (I'm not saying that the kernel necessarily fails to do this -- I
> > haven't checked -- but just trying to understand the problem here.)
> >
> >
> > The actual problem here is that if the SIGSEGV handler wants to bail
> > out with a siglongjmp(), there is no way to determine the correct value
> > of POR_EL0 to restore.
> 
> Correct, but again this is true of any other record - for instance TPIDR2.
> 
> > I wonder whether POR_EL0 should be saved in sigjmp_buf (depending on
> > whether sigjmp_buf is horribly inextensible and also full up, or merely
> > horribly inextensible).
> 
> It very much feels that this is the case - if a handler relies on
> longjmp() or setcontext() to restore a known state, then POR_EL0 should
> be part of that state.

Oh well.  This ship sailed a while ago.

Jumping out of a signal handler with siglongjmp() is non-portable in
most practical situations anyway.

For POR_EL0 specifically though, it should be a non-issue if no SIGSEGV
can be generated after sigreturn destroys the failing handler's POR_EL0.

[...]

Cheers
---Dave



end of thread, other threads:[~2024-11-18 15:09 UTC | newest]

Thread overview: 19+ messages
2024-10-23 15:05 [PATCH v2 0/5] Improve arm64 pkeys handling in signal delivery Kevin Brodsky
2024-10-23 15:05 ` [PATCH v2 1/5] arm64: signal: Remove unused macro Kevin Brodsky
2024-10-23 15:05 ` [PATCH v2 2/5] arm64: signal: Remove unnecessary check when saving POE state Kevin Brodsky
2024-10-23 15:05 ` [PATCH v2 3/5] arm64: signal: Improve POR_EL0 handling to avoid uaccess failures Kevin Brodsky
2024-10-24 10:59   ` Catalin Marinas
2024-10-24 14:55     ` Kevin Brodsky
2024-10-24 15:42       ` Catalin Marinas
2024-10-24 16:19         ` Dave Martin
2024-10-25  8:24           ` Kevin Brodsky
2024-10-25 11:04             ` Dave Martin
2024-10-25 11:33             ` Dave Martin
2024-10-25 15:34               ` Kevin Brodsky
2024-11-18 15:06                 ` Dave Martin
2024-10-23 15:05 ` [PATCH v2 4/5] selftests/mm: Use generic pkey register manipulation Kevin Brodsky
2024-10-23 16:51   ` Dave Hansen
2024-10-25  8:31     ` Kevin Brodsky
2024-10-25 15:09       ` Dave Hansen
2024-10-28 10:20         ` Kevin Brodsky
2024-10-23 15:05 ` [PATCH v2 5/5] selftests/mm: Enable pkey_sighandler_tests on arm64 Kevin Brodsky
