* [PATCH v3 0/3] Fix bugs and performance of kstack offset randomisation
@ 2026-01-02 13:11 Ryan Roberts
2026-01-02 13:11 ` [PATCH v3 1/3] randomize_kstack: Maintain kstack_offset per task Ryan Roberts
` (3 more replies)
0 siblings, 4 replies; 26+ messages in thread
From: Ryan Roberts @ 2026-01-02 13:11 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland,
Jason A. Donenfeld, Ard Biesheuvel, Jeremy Linton
Cc: Ryan Roberts, linux-kernel, linux-arm-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, linux-hardening
Hi All,
As I reported at [1], kstack offset randomisation suffers from a couple of bugs
and, on arm64 at least, the performance is poor. This series attempts to fix
both: patch 1 provides back-portable fixes for the functional bugs, and patches
2-3 propose a performance improvement approach.
I've looked at a few different options but ultimately decided that Jeremy's
original prng approach is the fastest. I made the argument that this approach is
secure "enough" in the RFC [2] and the responses indicated agreement.
More details in the commit logs.
Performance
===========
Mean and tail performance of 3 "small" syscalls was measured. Each syscall was
made 10 million times, with every invocation individually timed and binned.
These results have low noise, so I'm confident that they are trustworthy.
The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
the performance cost of turning it on without any changes to the implementation,
then the reduced performance cost of turning it on with my changes applied.
**NOTE**: The below results were generated using the RFC patches but there is no
meaningful change, so the numbers are still valid.
arm64 (AWS Graviton3):
+-----------------+--------------+-------------+---------------+
| Benchmark | Result Class | v6.18-rc5 | per-task-prng |
| | | rndstack-on | |
| | | | |
+=================+==============+=============+===============+
| syscall/getpid | mean (ns) | (R) 15.62% | (R) 3.43% |
| | p99 (ns) | (R) 155.01% | (R) 3.20% |
| | p99.9 (ns) | (R) 156.71% | (R) 2.93% |
+-----------------+--------------+-------------+---------------+
| syscall/getppid | mean (ns) | (R) 14.09% | (R) 2.12% |
| | p99 (ns) | (R) 152.81% | 1.55% |
| | p99.9 (ns) | (R) 153.67% | 1.77% |
+-----------------+--------------+-------------+---------------+
| syscall/invalid | mean (ns) | (R) 13.89% | (R) 3.32% |
| | p99 (ns) | (R) 165.82% | (R) 3.51% |
| | p99.9 (ns) | (R) 168.83% | (R) 3.77% |
+-----------------+--------------+-------------+---------------+
Because arm64 was previously using get_random_u16(), syscalls were expensive
whenever there were no buffered bits left and get_random_u16() had to call into
the crng. That's what caused the enormous tail latency.
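For illustration, here is a rough sketch of that batching behaviour; this is
not the actual drivers/char/random.c code, and refill_batch_from_crng() is a
hypothetical helper standing in for the crng refill path. Most calls are a
cheap copy out of a per-cpu buffer, but the occasional refill is what shows up
at p99:

struct entropy_batch {
	u16 entropy[48];
	unsigned int pos;
};

static u16 sketch_get_random_u16(struct entropy_batch *batch)
{
	/* Slow path: buffer exhausted, refill from the ChaCha-based crng. */
	if (batch->pos >= ARRAY_SIZE(batch->entropy)) {
		refill_batch_from_crng(batch->entropy, sizeof(batch->entropy));
		batch->pos = 0;
	}

	/* Fast path: hand out an already-buffered value. */
	return batch->entropy[batch->pos++];
}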
x86 (AWS Sapphire Rapids):
+-----------------+--------------+-------------+---------------+
| Benchmark | Result Class | v6.18-rc5 | per-task-prng |
| | | rndstack-on | |
| | | | |
+=================+==============+=============+===============+
| syscall/getpid | mean (ns) | (R) 13.32% | (R) 4.60% |
| | p99 (ns) | (R) 13.38% | (R) 18.08% |
| | p99.9 (ns) | 16.26% | (R) 19.38% |
+-----------------+--------------+-------------+---------------+
| syscall/getppid | mean (ns) | (R) 11.96% | (R) 5.26% |
| | p99 (ns) | (R) 11.83% | (R) 8.35% |
| | p99.9 (ns) | (R) 11.42% | (R) 22.37% |
+-----------------+--------------+-------------+---------------+
| syscall/invalid | mean (ns) | (R) 10.58% | (R) 2.91% |
| | p99 (ns) | (R) 10.51% | (R) 4.36% |
| | p99.9 (ns) | (R) 10.35% | (R) 21.97% |
+-----------------+--------------+-------------+---------------+
I was surprised to see that the baseline cost on x86 is 10-12% since it is just
using rdtsc. But as I say, I believe the results are accurate.
Changes since v2 (RFC) [3]
==========================
- Moved the late_initcall() that initializes kstack_rnd_state out of
  randomize_kstack.h and into main.c. (issue noticed by kernel test robot)
Changes since v1 (RFC) [2]
==========================
- Introduced patch 2 to make prandom_u32_state() __always_inline (needed since
  it's called from noinstr code)
- In patch 3, prng is now per-cpu instead of per-task (per Ard)
[1] https://lore.kernel.org/all/dd8c37bc-795f-4c7a-9086-69e584d8ab24@arm.com/
[2] https://lore.kernel.org/all/20251127105958.2427758-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/all/20251215163520.1144179-1-ryan.roberts@arm.com/
Thanks,
Ryan
Ryan Roberts (3):
randomize_kstack: Maintain kstack_offset per task
prandom: Convert prandom_u32_state() to __always_inline
randomize_kstack: Unify random source across arches
arch/Kconfig | 5 ++-
arch/arm64/kernel/syscall.c | 11 ------
arch/loongarch/kernel/syscall.c | 11 ------
arch/powerpc/kernel/syscall.c | 12 -------
arch/riscv/kernel/traps.c | 12 -------
arch/s390/include/asm/entry-common.h | 8 -----
arch/x86/include/asm/entry-common.h | 12 -------
include/linux/prandom.h | 19 +++++++++-
include/linux/randomize_kstack.h | 54 +++++++++++-----------------
init/main.c | 9 ++++-
kernel/fork.c | 1 +
lib/random32.c | 19 ----------
12 files changed, 49 insertions(+), 124 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH v3 1/3] randomize_kstack: Maintain kstack_offset per task
2026-01-02 13:11 [PATCH v3 0/3] Fix bugs and performance of kstack offset randomisation Ryan Roberts
@ 2026-01-02 13:11 ` Ryan Roberts
2026-01-02 22:44 ` David Laight
2026-01-19 10:23 ` Mark Rutland
2026-01-02 13:11 ` [PATCH v3 2/3] prandom: Convert prandom_u32_state() to __always_inline Ryan Roberts
` (2 subsequent siblings)
3 siblings, 2 replies; 26+ messages in thread
From: Ryan Roberts @ 2026-01-02 13:11 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland,
Jason A. Donenfeld, Ard Biesheuvel, Jeremy Linton
Cc: Ryan Roberts, linux-kernel, linux-arm-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, linux-hardening, stable
kstack_offset was previously maintained per-cpu, but this caused a
couple of issues. So let's instead make it per-task.
Issue 1: add_random_kstack_offset() and choose_random_kstack_offset() were
expected and required to be called with interrupts and preemption disabled so
that they could manipulate per-cpu state. But arm64, loongarch and risc-v call
them with interrupts and preemption enabled. I don't _think_ this causes any
functional issues, but it's certainly unexpected and could lead to manipulating
the wrong cpu's state, which could cause a minor performance degradation due to
bouncing the cache lines. By maintaining the state per-task, those functions
can safely be called in preemptible context.
Issue 2: add_random_kstack_offset() is called before executing the
syscall and expands the stack using a previously chosen random offset.
choose_random_kstack_offset() is called after executing the syscall and
chooses and stores a new random offset for the next syscall. With
per-cpu storage for this offset, an attacker could force cpu migration
during the execution of the syscall and prevent the offset from being
updated for the original cpu such that it is predictable for the next
syscall on that cpu. By maintaining the state per-task, this problem
goes away because the per-task random offset is updated after the
syscall regardless of which cpu it is executing on.
Fixes: 39218ff4c625 ("stack: Optionally randomize kernel stack offset each syscall")
Closes: https://lore.kernel.org/all/dd8c37bc-795f-4c7a-9086-69e584d8ab24@arm.com/
Cc: stable@vger.kernel.org
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
include/linux/randomize_kstack.h | 26 +++++++++++++++-----------
include/linux/sched.h | 4 ++++
init/main.c | 1 -
kernel/fork.c | 2 ++
4 files changed, 21 insertions(+), 12 deletions(-)
diff --git a/include/linux/randomize_kstack.h b/include/linux/randomize_kstack.h
index 1d982dbdd0d0..5d3916ca747c 100644
--- a/include/linux/randomize_kstack.h
+++ b/include/linux/randomize_kstack.h
@@ -9,7 +9,6 @@
DECLARE_STATIC_KEY_MAYBE(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
randomize_kstack_offset);
-DECLARE_PER_CPU(u32, kstack_offset);
/*
* Do not use this anywhere else in the kernel. This is used here because
@@ -50,15 +49,14 @@ DECLARE_PER_CPU(u32, kstack_offset);
* add_random_kstack_offset - Increase stack utilization by previously
* chosen random offset
*
- * This should be used in the syscall entry path when interrupts and
- * preempt are disabled, and after user registers have been stored to
- * the stack. For testing the resulting entropy, please see:
- * tools/testing/selftests/lkdtm/stack-entropy.sh
+ * This should be used in the syscall entry path after user registers have been
+ * stored to the stack. Preemption may be enabled. For testing the resulting
+ * entropy, please see: tools/testing/selftests/lkdtm/stack-entropy.sh
*/
#define add_random_kstack_offset() do { \
if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
&randomize_kstack_offset)) { \
- u32 offset = raw_cpu_read(kstack_offset); \
+ u32 offset = current->kstack_offset; \
u8 *ptr = __kstack_alloca(KSTACK_OFFSET_MAX(offset)); \
/* Keep allocation even after "ptr" loses scope. */ \
asm volatile("" :: "r"(ptr) : "memory"); \
@@ -69,9 +67,9 @@ DECLARE_PER_CPU(u32, kstack_offset);
* choose_random_kstack_offset - Choose the random offset for the next
* add_random_kstack_offset()
*
- * This should only be used during syscall exit when interrupts and
- * preempt are disabled. This position in the syscall flow is done to
- * frustrate attacks from userspace attempting to learn the next offset:
+ * This should only be used during syscall exit. Preemption may be enabled. This
+ * position in the syscall flow is done to frustrate attacks from userspace
+ * attempting to learn the next offset:
* - Maximize the timing uncertainty visible from userspace: if the
* offset is chosen at syscall entry, userspace has much more control
* over the timing between choosing offsets. "How long will we be in
@@ -85,14 +83,20 @@ DECLARE_PER_CPU(u32, kstack_offset);
#define choose_random_kstack_offset(rand) do { \
if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
&randomize_kstack_offset)) { \
- u32 offset = raw_cpu_read(kstack_offset); \
+ u32 offset = current->kstack_offset; \
offset = ror32(offset, 5) ^ (rand); \
- raw_cpu_write(kstack_offset, offset); \
+ current->kstack_offset = offset; \
} \
} while (0)
+
+static inline void random_kstack_task_init(struct task_struct *tsk)
+{
+ tsk->kstack_offset = 0;
+}
#else /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
#define add_random_kstack_offset() do { } while (0)
#define choose_random_kstack_offset(rand) do { } while (0)
+#define random_kstack_task_init(tsk) do { } while (0)
#endif /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d395f2810fac..9e0080ed1484 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1591,6 +1591,10 @@ struct task_struct {
unsigned long prev_lowest_stack;
#endif
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+ u32 kstack_offset;
+#endif
+
#ifdef CONFIG_X86_MCE
void __user *mce_vaddr;
__u64 mce_kflags;
diff --git a/init/main.c b/init/main.c
index b84818ad9685..27fcbbde933e 100644
--- a/init/main.c
+++ b/init/main.c
@@ -830,7 +830,6 @@ static inline void initcall_debug_enable(void)
#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
randomize_kstack_offset);
-DEFINE_PER_CPU(u32, kstack_offset);
static int __init early_randomize_kstack_offset(char *buf)
{
diff --git a/kernel/fork.c b/kernel/fork.c
index b1f3915d5f8e..b061e1edbc43 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -95,6 +95,7 @@
#include <linux/thread_info.h>
#include <linux/kstack_erase.h>
#include <linux/kasan.h>
+#include <linux/randomize_kstack.h>
#include <linux/scs.h>
#include <linux/io_uring.h>
#include <linux/bpf.h>
@@ -2231,6 +2232,7 @@ __latent_entropy struct task_struct *copy_process(
if (retval)
goto bad_fork_cleanup_io;
+ random_kstack_task_init(p);
stackleak_task_init(p);
if (pid != &init_struct_pid) {
--
2.43.0
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v3 2/3] prandom: Convert prandom_u32_state() to __always_inline
2026-01-02 13:11 [PATCH v3 0/3] Fix bugs and performance of kstack offset randomisation Ryan Roberts
2026-01-02 13:11 ` [PATCH v3 1/3] randomize_kstack: Maintain kstack_offset per task Ryan Roberts
@ 2026-01-02 13:11 ` Ryan Roberts
2026-01-02 13:39 ` Jason A. Donenfeld
2026-01-19 10:26 ` Mark Rutland
2026-01-02 13:11 ` [PATCH v3 3/3] randomize_kstack: Unify random source across arches Ryan Roberts
2026-01-19 10:52 ` [PATCH v3 0/3] Fix bugs and performance of kstack offset randomisation Mark Rutland
3 siblings, 2 replies; 26+ messages in thread
From: Ryan Roberts @ 2026-01-02 13:11 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland,
Jason A. Donenfeld, Ard Biesheuvel, Jeremy Linton
Cc: Ryan Roberts, linux-kernel, linux-arm-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, linux-hardening
We will shortly use prandom_u32_state() to implement kstack offset
randomization and some arches need to call it from non-instrumentable
context. Given the function is just a handful of operations and doesn't
call out to any other functions, let's take the easy path and make it
__always_inline.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
include/linux/prandom.h | 19 ++++++++++++++++++-
lib/random32.c | 19 -------------------
2 files changed, 18 insertions(+), 20 deletions(-)
diff --git a/include/linux/prandom.h b/include/linux/prandom.h
index ff7dcc3fa105..e797b3709f5c 100644
--- a/include/linux/prandom.h
+++ b/include/linux/prandom.h
@@ -17,7 +17,24 @@ struct rnd_state {
__u32 s1, s2, s3, s4;
};
-u32 prandom_u32_state(struct rnd_state *state);
+/**
+ * prandom_u32_state - seeded pseudo-random number generator.
+ * @state: pointer to state structure holding seeded state.
+ *
+ * This is used for pseudo-randomness with no outside seeding.
+ * For more random results, use get_random_u32().
+ */
+static __always_inline u32 prandom_u32_state(struct rnd_state *state)
+{
+#define TAUSWORTHE(s, a, b, c, d) ((s & c) << d) ^ (((s << a) ^ s) >> b)
+ state->s1 = TAUSWORTHE(state->s1, 6U, 13U, 4294967294U, 18U);
+ state->s2 = TAUSWORTHE(state->s2, 2U, 27U, 4294967288U, 2U);
+ state->s3 = TAUSWORTHE(state->s3, 13U, 21U, 4294967280U, 7U);
+ state->s4 = TAUSWORTHE(state->s4, 3U, 12U, 4294967168U, 13U);
+
+ return (state->s1 ^ state->s2 ^ state->s3 ^ state->s4);
+}
+
void prandom_bytes_state(struct rnd_state *state, void *buf, size_t nbytes);
void prandom_seed_full_state(struct rnd_state __percpu *pcpu_state);
diff --git a/lib/random32.c b/lib/random32.c
index 24e7acd9343f..d57baf489d4a 100644
--- a/lib/random32.c
+++ b/lib/random32.c
@@ -42,25 +42,6 @@
#include <linux/slab.h>
#include <linux/unaligned.h>
-/**
- * prandom_u32_state - seeded pseudo-random number generator.
- * @state: pointer to state structure holding seeded state.
- *
- * This is used for pseudo-randomness with no outside seeding.
- * For more random results, use get_random_u32().
- */
-u32 prandom_u32_state(struct rnd_state *state)
-{
-#define TAUSWORTHE(s, a, b, c, d) ((s & c) << d) ^ (((s << a) ^ s) >> b)
- state->s1 = TAUSWORTHE(state->s1, 6U, 13U, 4294967294U, 18U);
- state->s2 = TAUSWORTHE(state->s2, 2U, 27U, 4294967288U, 2U);
- state->s3 = TAUSWORTHE(state->s3, 13U, 21U, 4294967280U, 7U);
- state->s4 = TAUSWORTHE(state->s4, 3U, 12U, 4294967168U, 13U);
-
- return (state->s1 ^ state->s2 ^ state->s3 ^ state->s4);
-}
-EXPORT_SYMBOL(prandom_u32_state);
-
/**
* prandom_bytes_state - get the requested number of pseudo-random bytes
*
--
2.43.0
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v3 3/3] randomize_kstack: Unify random source across arches
2026-01-02 13:11 [PATCH v3 0/3] Fix bugs and performance of kstack offset randomisation Ryan Roberts
2026-01-02 13:11 ` [PATCH v3 1/3] randomize_kstack: Maintain kstack_offset per task Ryan Roberts
2026-01-02 13:11 ` [PATCH v3 2/3] prandom: Convert prandom_u32_state() to __always_inline Ryan Roberts
@ 2026-01-02 13:11 ` Ryan Roberts
2026-01-04 23:01 ` David Laight
2026-01-19 10:48 ` Mark Rutland
2026-01-19 10:52 ` [PATCH v3 0/3] Fix bugs and performance of kstack offset randomisation Mark Rutland
3 siblings, 2 replies; 26+ messages in thread
From: Ryan Roberts @ 2026-01-02 13:11 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland,
Jason A. Donenfeld, Ard Biesheuvel, Jeremy Linton
Cc: Ryan Roberts, linux-kernel, linux-arm-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, linux-hardening
Previously different architectures were using random sources of
differing strength and cost to decide the random kstack offset. A number
of architectures (loongarch, powerpc, s390, x86) were using their
timestamp counter, at whatever the frequency happened to be. Other
arches (arm64, riscv) were using entropy from the crng via
get_random_u16().
There have been concerns that in some cases the timestamp counters may
be too weak, because they can be easily guessed or influenced by user
space. And get_random_u16() has been shown to be too costly for the
level of protection kstack offset randomization provides.
So let's use a common, architecture-agnostic source of entropy: a
per-cpu prng, seeded at boot time from the crng. This has a few
benefits:
- We can remove choose_random_kstack_offset(); that was only there to
  try to make the timestamp counter value a bit harder to influence
  from user space.
- The architecture code is simplified. All it has to do now is call
add_random_kstack_offset() in the syscall path.
- The strength of the randomness can be reasoned about independently
of the architecture.
- Arches previously using get_random_u16() now have much faster
syscall paths, see below results.
There have been some claims that a prng may be less strong than the
timestamp counter if not regularly reseeded. But the prng has a period
of about 2^113, so as long as the prng state remains secret, it should
not be possible to predict its output. If the prng state can be
accessed, we have bigger problems.
Additionally, we are only consuming 6 bits to randomize the stack, so
there are only 64 possible random offsets. I assert that it would be
trivial for an attacker to brute force this by simply repeating their
attack until the random stack offset happens to be the desired one. The
prng approach seems entirely proportional to this level of protection.
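For reference, a rough sketch of how those 6 bits fall out (this assumes the
16-byte stack alignment described in the per-arch comments removed below; the
mask mirrors KSTACK_OFFSET_MAX() in randomize_kstack.h):

	u32 raw     = prandom_u32_state(state); /* full 32-bit prng output */
	u32 capped  = raw & 0b1111111100;       /* KSTACK_OFFSET_MAX(): keeps bits [9:2] */
	u32 applied = capped & ~0xfu;           /* 16-byte SP alignment discards bits [3:0]; */
	                                        /* bits [9:4] remain: 2^6 = 64 offsets */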
Performance data are provided below. The baseline is v6.18 with rndstack
on for each respective arch. (I)/(R) indicate statistically significant
improvement/regression. arm64 platform is AWS Graviton3 (m7g.metal).
x86_64 platform is AWS Sapphire Rapids (m7i.24xlarge):
+-----------------+--------------+---------------+---------------+
| Benchmark | Result Class | per-task-prng | per-task-prng |
| | | arm64 (metal) | x86_64 (VM) |
+=================+==============+===============+===============+
| syscall/getpid | mean (ns) | (I) -9.50% | (I) -17.65% |
| | p99 (ns) | (I) -59.24% | (I) -24.41% |
| | p99.9 (ns) | (I) -59.52% | (I) -28.52% |
+-----------------+--------------+---------------+---------------+
| syscall/getppid | mean (ns) | (I) -9.52% | (I) -19.24% |
| | p99 (ns) | (I) -59.25% | (I) -25.03% |
| | p99.9 (ns) | (I) -59.50% | (I) -28.17% |
+-----------------+--------------+---------------+---------------+
| syscall/invalid | mean (ns) | (I) -10.31% | (I) -18.56% |
| | p99 (ns) | (I) -60.79% | (I) -20.06% |
| | p99.9 (ns) | (I) -61.04% | (I) -25.04% |
+-----------------+--------------+---------------+---------------+
I tested an earlier version of this change on x86 bare metal and it
showed a smaller but still significant improvement. The bare metal
system wasn't available this time around so testing was done in a VM
instance. I'm guessing the cost of rdtsc is higher for VMs.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
arch/Kconfig | 5 ++-
arch/arm64/kernel/syscall.c | 11 ------
arch/loongarch/kernel/syscall.c | 11 ------
arch/powerpc/kernel/syscall.c | 12 -------
arch/riscv/kernel/traps.c | 12 -------
arch/s390/include/asm/entry-common.h | 8 -----
arch/x86/include/asm/entry-common.h | 12 -------
include/linux/randomize_kstack.h | 52 +++++++++-------------------
include/linux/sched.h | 4 ---
init/main.c | 8 +++++
kernel/fork.c | 1 -
11 files changed, 27 insertions(+), 109 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 31220f512b16..8591fe7b4ac1 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1516,9 +1516,8 @@ config HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
def_bool n
help
An arch should select this symbol if it can support kernel stack
- offset randomization with calls to add_random_kstack_offset()
- during syscall entry and choose_random_kstack_offset() during
- syscall exit. Careful removal of -fstack-protector-strong and
+ offset randomization with a call to add_random_kstack_offset()
+ during syscall entry. Careful removal of -fstack-protector-strong and
-fstack-protector should also be applied to the entry code and
closely examined, as the artificial stack bump looks like an array
to the compiler, so it will attempt to add canary checks regardless
diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
index c062badd1a56..358ddfbf1401 100644
--- a/arch/arm64/kernel/syscall.c
+++ b/arch/arm64/kernel/syscall.c
@@ -52,17 +52,6 @@ static void invoke_syscall(struct pt_regs *regs, unsigned int scno,
}
syscall_set_return_value(current, regs, 0, ret);
-
- /*
- * This value will get limited by KSTACK_OFFSET_MAX(), which is 10
- * bits. The actual entropy will be further reduced by the compiler
- * when applying stack alignment constraints: the AAPCS mandates a
- * 16-byte aligned SP at function boundaries, which will remove the
- * 4 low bits from any entropy chosen here.
- *
- * The resulting 6 bits of entropy is seen in SP[9:4].
- */
- choose_random_kstack_offset(get_random_u16());
}
static inline bool has_syscall_work(unsigned long flags)
diff --git a/arch/loongarch/kernel/syscall.c b/arch/loongarch/kernel/syscall.c
index 1249d82c1cd0..85da7e050d97 100644
--- a/arch/loongarch/kernel/syscall.c
+++ b/arch/loongarch/kernel/syscall.c
@@ -79,16 +79,5 @@ void noinstr __no_stack_protector do_syscall(struct pt_regs *regs)
regs->regs[7], regs->regs[8], regs->regs[9]);
}
- /*
- * This value will get limited by KSTACK_OFFSET_MAX(), which is 10
- * bits. The actual entropy will be further reduced by the compiler
- * when applying stack alignment constraints: 16-bytes (i.e. 4-bits)
- * aligned, which will remove the 4 low bits from any entropy chosen
- * here.
- *
- * The resulting 6 bits of entropy is seen in SP[9:4].
- */
- choose_random_kstack_offset(get_cycles());
-
syscall_exit_to_user_mode(regs);
}
diff --git a/arch/powerpc/kernel/syscall.c b/arch/powerpc/kernel/syscall.c
index be159ad4b77b..b3d8b0f9823b 100644
--- a/arch/powerpc/kernel/syscall.c
+++ b/arch/powerpc/kernel/syscall.c
@@ -173,17 +173,5 @@ notrace long system_call_exception(struct pt_regs *regs, unsigned long r0)
}
#endif
- /*
- * Ultimately, this value will get limited by KSTACK_OFFSET_MAX(),
- * so the maximum stack offset is 1k bytes (10 bits).
- *
- * The actual entropy will be further reduced by the compiler when
- * applying stack alignment constraints: the powerpc architecture
- * may have two kinds of stack alignment (16-bytes and 8-bytes).
- *
- * So the resulting 6 or 7 bits of entropy is seen in SP[9:4] or SP[9:3].
- */
- choose_random_kstack_offset(mftb());
-
return ret;
}
diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
index 80230de167de..79b285bdfd1a 100644
--- a/arch/riscv/kernel/traps.c
+++ b/arch/riscv/kernel/traps.c
@@ -342,18 +342,6 @@ void do_trap_ecall_u(struct pt_regs *regs)
if (syscall >= 0 && syscall < NR_syscalls)
syscall_handler(regs, syscall);
- /*
- * Ultimately, this value will get limited by KSTACK_OFFSET_MAX(),
- * so the maximum stack offset is 1k bytes (10 bits).
- *
- * The actual entropy will be further reduced by the compiler when
- * applying stack alignment constraints: 16-byte (i.e. 4-bit) aligned
- * for RV32I or RV64I.
- *
- * The resulting 6 bits of entropy is seen in SP[9:4].
- */
- choose_random_kstack_offset(get_random_u16());
-
syscall_exit_to_user_mode(regs);
} else {
irqentry_state_t state = irqentry_nmi_enter(regs);
diff --git a/arch/s390/include/asm/entry-common.h b/arch/s390/include/asm/entry-common.h
index 979af986a8fe..35450a485323 100644
--- a/arch/s390/include/asm/entry-common.h
+++ b/arch/s390/include/asm/entry-common.h
@@ -51,14 +51,6 @@ static __always_inline void arch_exit_to_user_mode(void)
#define arch_exit_to_user_mode arch_exit_to_user_mode
-static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
- unsigned long ti_work)
-{
- choose_random_kstack_offset(get_tod_clock_fast());
-}
-
-#define arch_exit_to_user_mode_prepare arch_exit_to_user_mode_prepare
-
static __always_inline bool arch_in_rcu_eqs(void)
{
if (IS_ENABLED(CONFIG_KVM))
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index ce3eb6d5fdf9..7535131c711b 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -82,18 +82,6 @@ static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
current_thread_info()->status &= ~(TS_COMPAT | TS_I386_REGS_POKED);
#endif
- /*
- * This value will get limited by KSTACK_OFFSET_MAX(), which is 10
- * bits. The actual entropy will be further reduced by the compiler
- * when applying stack alignment constraints (see cc_stack_align4/8 in
- * arch/x86/Makefile), which will remove the 3 (x86_64) or 2 (ia32)
- * low bits from any entropy chosen here.
- *
- * Therefore, final stack offset entropy will be 7 (x86_64) or
- * 8 (ia32) bits.
- */
- choose_random_kstack_offset(rdtsc());
-
/* Avoid unnecessary reads of 'x86_ibpb_exit_to_user' */
if (cpu_feature_enabled(X86_FEATURE_IBPB_EXIT_TO_USER) &&
this_cpu_read(x86_ibpb_exit_to_user)) {
diff --git a/include/linux/randomize_kstack.h b/include/linux/randomize_kstack.h
index 5d3916ca747c..024fc20e7762 100644
--- a/include/linux/randomize_kstack.h
+++ b/include/linux/randomize_kstack.h
@@ -6,6 +6,7 @@
#include <linux/kernel.h>
#include <linux/jump_label.h>
#include <linux/percpu-defs.h>
+#include <linux/prandom.h>
DECLARE_STATIC_KEY_MAYBE(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
randomize_kstack_offset);
@@ -45,9 +46,22 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
#define KSTACK_OFFSET_MAX(x) ((x) & 0b1111111100)
#endif
+DECLARE_PER_CPU(struct rnd_state, kstack_rnd_state);
+
+static __always_inline u32 get_kstack_offset(void)
+{
+ struct rnd_state *state;
+ u32 rnd;
+
+ state = &get_cpu_var(kstack_rnd_state);
+ rnd = prandom_u32_state(state);
+ put_cpu_var(kstack_rnd_state);
+
+ return rnd;
+}
+
/**
- * add_random_kstack_offset - Increase stack utilization by previously
- * chosen random offset
+ * add_random_kstack_offset - Increase stack utilization by a random offset.
*
* This should be used in the syscall entry path after user registers have been
* stored to the stack. Preemption may be enabled. For testing the resulting
@@ -56,47 +70,15 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
#define add_random_kstack_offset() do { \
if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
&randomize_kstack_offset)) { \
- u32 offset = current->kstack_offset; \
+ u32 offset = get_kstack_offset(); \
u8 *ptr = __kstack_alloca(KSTACK_OFFSET_MAX(offset)); \
/* Keep allocation even after "ptr" loses scope. */ \
asm volatile("" :: "r"(ptr) : "memory"); \
} \
} while (0)
-/**
- * choose_random_kstack_offset - Choose the random offset for the next
- * add_random_kstack_offset()
- *
- * This should only be used during syscall exit. Preemption may be enabled. This
- * position in the syscall flow is done to frustrate attacks from userspace
- * attempting to learn the next offset:
- * - Maximize the timing uncertainty visible from userspace: if the
- * offset is chosen at syscall entry, userspace has much more control
- * over the timing between choosing offsets. "How long will we be in
- * kernel mode?" tends to be more difficult to predict than "how long
- * will we be in user mode?"
- * - Reduce the lifetime of the new offset sitting in memory during
- * kernel mode execution. Exposure of "thread-local" memory content
- * (e.g. current, percpu, etc) tends to be easier than arbitrary
- * location memory exposure.
- */
-#define choose_random_kstack_offset(rand) do { \
- if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
- &randomize_kstack_offset)) { \
- u32 offset = current->kstack_offset; \
- offset = ror32(offset, 5) ^ (rand); \
- current->kstack_offset = offset; \
- } \
-} while (0)
-
-static inline void random_kstack_task_init(struct task_struct *tsk)
-{
- tsk->kstack_offset = 0;
-}
#else /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
#define add_random_kstack_offset() do { } while (0)
-#define choose_random_kstack_offset(rand) do { } while (0)
-#define random_kstack_task_init(tsk) do { } while (0)
#endif /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9e0080ed1484..d395f2810fac 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1591,10 +1591,6 @@ struct task_struct {
unsigned long prev_lowest_stack;
#endif
-#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
- u32 kstack_offset;
-#endif
-
#ifdef CONFIG_X86_MCE
void __user *mce_vaddr;
__u64 mce_kflags;
diff --git a/init/main.c b/init/main.c
index 27fcbbde933e..8626e048095a 100644
--- a/init/main.c
+++ b/init/main.c
@@ -830,6 +830,14 @@ static inline void initcall_debug_enable(void)
#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
randomize_kstack_offset);
+DEFINE_PER_CPU(struct rnd_state, kstack_rnd_state);
+
+static int __init random_kstack_init(void)
+{
+ prandom_seed_full_state(&kstack_rnd_state);
+ return 0;
+}
+late_initcall(random_kstack_init);
static int __init early_randomize_kstack_offset(char *buf)
{
diff --git a/kernel/fork.c b/kernel/fork.c
index b061e1edbc43..68d9766288fd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2232,7 +2232,6 @@ __latent_entropy struct task_struct *copy_process(
if (retval)
goto bad_fork_cleanup_io;
- random_kstack_task_init(p);
stackleak_task_init(p);
if (pid != &init_struct_pid) {
--
2.43.0
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH v3 2/3] prandom: Convert prandom_u32_state() to __always_inline
2026-01-02 13:11 ` [PATCH v3 2/3] prandom: Convert prandom_u32_state() to __always_inline Ryan Roberts
@ 2026-01-02 13:39 ` Jason A. Donenfeld
2026-01-02 14:09 ` Ryan Roberts
2026-01-02 22:54 ` David Laight
2026-01-19 10:26 ` Mark Rutland
1 sibling, 2 replies; 26+ messages in thread
From: Jason A. Donenfeld @ 2026-01-02 13:39 UTC (permalink / raw)
To: Ryan Roberts
Cc: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland, Ard Biesheuvel,
Jeremy Linton, linux-kernel, linux-arm-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, linux-hardening
Hi Ryan,
On Fri, Jan 2, 2026 at 2:12 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> context. Given the function is just a handful of operations and doesn't
How many? What's this looking like in terms of assembly? It'd also be
nice to have some brief analysis of other call sites to have
confirmation this isn't blowing up other users.
> +static __always_inline u32 prandom_u32_state(struct rnd_state *state)
Why not just normal `inline`? Is gcc disagreeing with the inlinability
of this function?
Jason
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 2/3] prandom: Convert prandom_u32_state() to __always_inline
2026-01-02 13:39 ` Jason A. Donenfeld
@ 2026-01-02 14:09 ` Ryan Roberts
2026-01-03 8:00 ` Christophe Leroy (CS GROUP)
2026-01-03 10:46 ` David Laight
2026-01-02 22:54 ` David Laight
1 sibling, 2 replies; 26+ messages in thread
From: Ryan Roberts @ 2026-01-02 14:09 UTC (permalink / raw)
To: Jason A. Donenfeld
Cc: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland, Ard Biesheuvel,
Jeremy Linton, linux-kernel, linux-arm-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, linux-hardening
On 02/01/2026 13:39, Jason A. Donenfeld wrote:
> Hi Ryan,
>
> On Fri, Jan 2, 2026 at 2:12 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>> context. Given the function is just a handful of operations and doesn't
>
> How many? What's this looking like in terms of assembly?
25 instructions on arm64:
0000000000000000 <prandom_u32_state>:
0: 29401403 ldp w3, w5, [x0]
4: aa0003e1 mov x1, x0
8: 29410002 ldp w2, w0, [x0, #8]
c: 531e74a4 lsl w4, w5, #2
10: 530e3468 lsl w8, w3, #18
14: 4a0400a5 eor w5, w5, w4
18: 4a031863 eor w3, w3, w3, lsl #6
1c: 53196047 lsl w7, w2, #7
20: 53134806 lsl w6, w0, #13
24: 4a023442 eor w2, w2, w2, lsl #13
28: 4a000c00 eor w0, w0, w0, lsl #3
2c: 121b6884 and w4, w4, #0xffffffe0
30: 120d3108 and w8, w8, #0xfff80000
34: 121550e7 and w7, w7, #0xfffff800
38: 120c2cc6 and w6, w6, #0xfff00000
3c: 2a456c85 orr w5, w4, w5, lsr #27
40: 2a433504 orr w4, w8, w3, lsr #13
44: 2a4254e3 orr w3, w7, w2, lsr #21
48: 2a4030c2 orr w2, w6, w0, lsr #12
4c: 4a020066 eor w6, w3, w2
50: 4a050080 eor w0, w4, w5
54: 4a0000c0 eor w0, w6, w0
58: 29001424 stp w4, w5, [x1]
5c: 29010823 stp w3, w2, [x1, #8]
60: d65f03c0 ret
> It'd also be
> nice to have some brief analysis of other call sites to have
> confirmation this isn't blowing up other users.
I compiled defconfig before and after this patch on arm64 and compared the text
sizes:
$ ./scripts/bloat-o-meter -t vmlinux.before vmlinux.after
add/remove: 3/4 grow/shrink: 4/1 up/down: 836/-128 (708)
Function old new delta
prandom_seed_full_state 364 932 +568
pick_next_task_fair 1940 2036 +96
bpf_user_rnd_u32 104 196 +92
prandom_bytes_state 204 260 +56
e843419@0f2b_00012d69_e34 - 8 +8
e843419@0db7_00010ec3_23ec - 8 +8
e843419@02cb_00003767_25c - 8 +8
bpf_prog_select_runtime 448 444 -4
e843419@0aa3_0000cfd1_1580 8 - -8
e843419@0aa2_0000cfba_147c 8 - -8
e843419@075f_00008d8c_184 8 - -8
prandom_u32_state 100 - -100
Total: Before=19078072, After=19078780, chg +0.00%
So 708 bytes more after inlining. The main cost is prandom_seed_full_state(),
which calls prandom_u32_state() 10 times (via prandom_warmup()). I expect we
could turn that into a loop to reduce ~450 bytes overall.
I'm not really sure if 708 is good or bad...
>
>> +static __always_inline u32 prandom_u32_state(struct rnd_state *state)
>
> Why not just normal `inline`? Is gcc disagreeing with the inlinability
> of this function?
Given this needs to be called from a noinstr function, I didn't want to give the
compiler the opportunity to decide not to inline it, since in that case, some
instrumentation might end up being applied to the function body which would blow
up when called in the noinstr context.
I think the other 2 options are to keep prandom_u32_state() in the C file but
mark it noinstr, or rearrange all the users so that they don't call it until
instrumentation is allowable. The latter is something I was trying to avoid.
There is some previous discussion of this at [1].
[1] https://lore.kernel.org/all/aS65LFUfdgRPKv1l@J2N7QTR9R3/
Perhaps keeping prandom_u32_state() in the C file and making it noinstr is the
best compromise?
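For concreteness, a rough sketch of that option (the body is unchanged from
today's lib/random32.c; only the annotation differs, and whether exporting a
noinstr function is acceptable here is an open question):

noinstr u32 prandom_u32_state(struct rnd_state *state)
{
#define TAUSWORTHE(s, a, b, c, d) ((s & c) << d) ^ (((s << a) ^ s) >> b)
	/* Step all four Tausworthe generators and combine their outputs. */
	state->s1 = TAUSWORTHE(state->s1, 6U, 13U, 4294967294U, 18U);
	state->s2 = TAUSWORTHE(state->s2, 2U, 27U, 4294967288U, 2U);
	state->s3 = TAUSWORTHE(state->s3, 13U, 21U, 4294967280U, 7U);
	state->s4 = TAUSWORTHE(state->s4, 3U, 12U, 4294967168U, 13U);

	return (state->s1 ^ state->s2 ^ state->s3 ^ state->s4);
}
EXPORT_SYMBOL(prandom_u32_state);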
Thanks,
Ryan
>
> Jason
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 1/3] randomize_kstack: Maintain kstack_offset per task
2026-01-02 13:11 ` [PATCH v3 1/3] randomize_kstack: Maintain kstack_offset per task Ryan Roberts
@ 2026-01-02 22:44 ` David Laight
2026-01-05 10:30 ` Ryan Roberts
2026-01-19 10:23 ` Mark Rutland
1 sibling, 1 reply; 26+ messages in thread
From: David Laight @ 2026-01-02 22:44 UTC (permalink / raw)
To: Ryan Roberts
Cc: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland,
Jason A. Donenfeld, Ard Biesheuvel, Jeremy Linton, linux-kernel,
linux-arm-kernel, loongarch, linuxppc-dev, linux-riscv,
linux-s390, linux-hardening, stable
On Fri, 2 Jan 2026 13:11:52 +0000
Ryan Roberts <ryan.roberts@arm.com> wrote:
> kstack_offset was previously maintained per-cpu, but this caused a
> couple of issues. So let's instead make it per-task.
>
> Issue 1: add_random_kstack_offset() and choose_random_kstack_offset()
> expected and required to be called with interrupts and preemption
> disabled so that it could manipulate per-cpu state. But arm64, loongarch
> and risc-v are calling them with interrupts and preemption enabled. I
> don't _think_ this causes any functional issues, but it's certainly
> unexpected and could lead to manipulating the wrong cpu's state, which
> could cause a minor performance degradation due to bouncing the cache
> lines. By maintaining the state per-task those functions can safely be
> called in preemptible context.
>
> Issue 2: add_random_kstack_offset() is called before executing the
> syscall and expands the stack using a previously chosen rnadom offset.
<>
David
> choose_random_kstack_offset() is called after executing the syscall and
> chooses and stores a new random offset for the next syscall. With
> per-cpu storage for this offset, an attacker could force cpu migration
> during the execution of the syscall and prevent the offset from being
> updated for the original cpu such that it is predictable for the next
> syscall on that cpu. By maintaining the state per-task, this problem
> goes away because the per-task random offset is updated after the
> syscall regardless of which cpu it is executing on.
>
> Fixes: 39218ff4c625 ("stack: Optionally randomize kernel stack offset each syscall")
> Closes: https://lore.kernel.org/all/dd8c37bc-795f-4c7a-9086-69e584d8ab24@arm.com/
> Cc: stable@vger.kernel.org
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> include/linux/randomize_kstack.h | 26 +++++++++++++++-----------
> include/linux/sched.h | 4 ++++
> init/main.c | 1 -
> kernel/fork.c | 2 ++
> 4 files changed, 21 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/randomize_kstack.h b/include/linux/randomize_kstack.h
> index 1d982dbdd0d0..5d3916ca747c 100644
> --- a/include/linux/randomize_kstack.h
> +++ b/include/linux/randomize_kstack.h
> @@ -9,7 +9,6 @@
>
> DECLARE_STATIC_KEY_MAYBE(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
> randomize_kstack_offset);
> -DECLARE_PER_CPU(u32, kstack_offset);
>
> /*
> * Do not use this anywhere else in the kernel. This is used here because
> @@ -50,15 +49,14 @@ DECLARE_PER_CPU(u32, kstack_offset);
> * add_random_kstack_offset - Increase stack utilization by previously
> * chosen random offset
> *
> - * This should be used in the syscall entry path when interrupts and
> - * preempt are disabled, and after user registers have been stored to
> - * the stack. For testing the resulting entropy, please see:
> - * tools/testing/selftests/lkdtm/stack-entropy.sh
> + * This should be used in the syscall entry path after user registers have been
> + * stored to the stack. Preemption may be enabled. For testing the resulting
> + * entropy, please see: tools/testing/selftests/lkdtm/stack-entropy.sh
> */
> #define add_random_kstack_offset() do { \
> if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
> &randomize_kstack_offset)) { \
> - u32 offset = raw_cpu_read(kstack_offset); \
> + u32 offset = current->kstack_offset; \
> u8 *ptr = __kstack_alloca(KSTACK_OFFSET_MAX(offset)); \
> /* Keep allocation even after "ptr" loses scope. */ \
> asm volatile("" :: "r"(ptr) : "memory"); \
> @@ -69,9 +67,9 @@ DECLARE_PER_CPU(u32, kstack_offset);
> * choose_random_kstack_offset - Choose the random offset for the next
> * add_random_kstack_offset()
> *
> - * This should only be used during syscall exit when interrupts and
> - * preempt are disabled. This position in the syscall flow is done to
> - * frustrate attacks from userspace attempting to learn the next offset:
> + * This should only be used during syscall exit. Preemption may be enabled. This
> + * position in the syscall flow is done to frustrate attacks from userspace
> + * attempting to learn the next offset:
> * - Maximize the timing uncertainty visible from userspace: if the
> * offset is chosen at syscall entry, userspace has much more control
> * over the timing between choosing offsets. "How long will we be in
> @@ -85,14 +83,20 @@ DECLARE_PER_CPU(u32, kstack_offset);
> #define choose_random_kstack_offset(rand) do { \
> if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
> &randomize_kstack_offset)) { \
> - u32 offset = raw_cpu_read(kstack_offset); \
> + u32 offset = current->kstack_offset; \
> offset = ror32(offset, 5) ^ (rand); \
> - raw_cpu_write(kstack_offset, offset); \
> + current->kstack_offset = offset; \
> } \
> } while (0)
> +
> +static inline void random_kstack_task_init(struct task_struct *tsk)
> +{
> + tsk->kstack_offset = 0;
> +}
> #else /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
> #define add_random_kstack_offset() do { } while (0)
> #define choose_random_kstack_offset(rand) do { } while (0)
> +#define random_kstack_task_init(tsk) do { } while (0)
> #endif /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
>
> #endif
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d395f2810fac..9e0080ed1484 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1591,6 +1591,10 @@ struct task_struct {
> unsigned long prev_lowest_stack;
> #endif
>
> +#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
> + u32 kstack_offset;
> +#endif
> +
> #ifdef CONFIG_X86_MCE
> void __user *mce_vaddr;
> __u64 mce_kflags;
> diff --git a/init/main.c b/init/main.c
> index b84818ad9685..27fcbbde933e 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -830,7 +830,6 @@ static inline void initcall_debug_enable(void)
> #ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
> DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
> randomize_kstack_offset);
> -DEFINE_PER_CPU(u32, kstack_offset);
>
> static int __init early_randomize_kstack_offset(char *buf)
> {
> diff --git a/kernel/fork.c b/kernel/fork.c
> index b1f3915d5f8e..b061e1edbc43 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -95,6 +95,7 @@
> #include <linux/thread_info.h>
> #include <linux/kstack_erase.h>
> #include <linux/kasan.h>
> +#include <linux/randomize_kstack.h>
> #include <linux/scs.h>
> #include <linux/io_uring.h>
> #include <linux/bpf.h>
> @@ -2231,6 +2232,7 @@ __latent_entropy struct task_struct *copy_process(
> if (retval)
> goto bad_fork_cleanup_io;
>
> + random_kstack_task_init(p);
> stackleak_task_init(p);
>
> if (pid != &init_struct_pid) {
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 2/3] prandom: Convert prandom_u32_state() to __always_inline
2026-01-02 13:39 ` Jason A. Donenfeld
2026-01-02 14:09 ` Ryan Roberts
@ 2026-01-02 22:54 ` David Laight
1 sibling, 0 replies; 26+ messages in thread
From: David Laight @ 2026-01-02 22:54 UTC (permalink / raw)
To: Jason A. Donenfeld
Cc: Ryan Roberts, Catalin Marinas, Will Deacon, Huacai Chen,
Madhavan Srinivasan, Michael Ellerman, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Heiko Carstens, Vasily Gorbik,
Alexander Gordeev, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Kees Cook, Gustavo A. R. Silva, Arnd Bergmann,
Mark Rutland, Ard Biesheuvel, Jeremy Linton, linux-kernel,
linux-arm-kernel, loongarch, linuxppc-dev, linux-riscv,
linux-s390, linux-hardening
On Fri, 2 Jan 2026 14:39:21 +0100
"Jason A. Donenfeld" <Jason@zx2c4.com> wrote:
> Hi Ryan,
...
> > +static __always_inline u32 prandom_u32_state(struct rnd_state *state)
>
> Why not just normal `inline`? Is gcc disagreeing with the inlinability
> of this function?
gcc has a mind of its own when it comes to inlining.
If there weren't some massive functions marked 'inline' that should never
really be inlined then making 'inline' '__always_inline' would make sense.
But first an audit would be needed.
(This has come up several times in the past.)
But if you need a function to be inlined (for any reason) it needs to be
always_inline.
Whether there should be a non-inlined 'option' here is another matter.
There could be a normal function that calls the inlined version.
David
>
> Jason
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 2/3] prandom: Convert prandom_u32_state() to __always_inline
2026-01-02 14:09 ` Ryan Roberts
@ 2026-01-03 8:00 ` Christophe Leroy (CS GROUP)
2026-01-05 10:36 ` Ryan Roberts
2026-01-03 10:46 ` David Laight
1 sibling, 1 reply; 26+ messages in thread
From: Christophe Leroy (CS GROUP) @ 2026-01-03 8:00 UTC (permalink / raw)
To: Ryan Roberts, Jason A. Donenfeld
Cc: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland, Ard Biesheuvel,
Jeremy Linton, linux-kernel, linux-arm-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, linux-hardening
On 02/01/2026 at 15:09, Ryan Roberts wrote:
> On 02/01/2026 13:39, Jason A. Donenfeld wrote:
>> Hi Ryan,
>>
>> On Fri, Jan 2, 2026 at 2:12 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>> context. Given the function is just a handful of operations and doesn't
>>
>> How many? What's this looking like in terms of assembly?
>
> 25 instructions on arm64:
31 instructions on powerpc:
00000000 <prandom_u32_state>:
0: 7c 69 1b 78 mr r9,r3
4: 80 63 00 00 lwz r3,0(r3)
8: 80 89 00 08 lwz r4,8(r9)
c: 81 69 00 04 lwz r11,4(r9)
10: 80 a9 00 0c lwz r5,12(r9)
14: 54 67 30 32 slwi r7,r3,6
18: 7c e7 1a 78 xor r7,r7,r3
1c: 55 66 10 3a slwi r6,r11,2
20: 54 88 68 24 slwi r8,r4,13
24: 54 63 90 18 rlwinm r3,r3,18,0,12
28: 7d 6b 32 78 xor r11,r11,r6
2c: 7d 08 22 78 xor r8,r8,r4
30: 54 aa 18 38 slwi r10,r5,3
34: 54 e7 9b 7e srwi r7,r7,13
38: 7c e7 1a 78 xor r7,r7,r3
3c: 51 66 2e fe rlwimi r6,r11,5,27,31
40: 54 84 38 28 rlwinm r4,r4,7,0,20
44: 7d 4a 2a 78 xor r10,r10,r5
48: 55 08 5d 7e srwi r8,r8,21
4c: 7d 08 22 78 xor r8,r8,r4
50: 7c e3 32 78 xor r3,r7,r6
54: 54 a5 68 16 rlwinm r5,r5,13,0,11
58: 55 4a a3 3e srwi r10,r10,12
5c: 7d 4a 2a 78 xor r10,r10,r5
60: 7c 63 42 78 xor r3,r3,r8
64: 90 e9 00 00 stw r7,0(r9)
68: 90 c9 00 04 stw r6,4(r9)
6c: 91 09 00 08 stw r8,8(r9)
70: 91 49 00 0c stw r10,12(r9)
74: 7c 63 52 78 xor r3,r3,r10
78: 4e 80 00 20 blr
Among those, 8 instructions are for reading/writing the state on the stack.
They of course disappear when the function is inlined.
>
>> It'd also be
>> nice to have some brief analysis of other call sites to have
>> confirmation this isn't blowing up other users.
>
> I compiled defconfig before and after this patch on arm64 and compared the text
> sizes:
>
> $ ./scripts/bloat-o-meter -t vmlinux.before vmlinux.after
> add/remove: 3/4 grow/shrink: 4/1 up/down: 836/-128 (708)
> Function old new delta
> prandom_seed_full_state 364 932 +568
> pick_next_task_fair 1940 2036 +96
> bpf_user_rnd_u32 104 196 +92
> prandom_bytes_state 204 260 +56
> e843419@0f2b_00012d69_e34 - 8 +8
> e843419@0db7_00010ec3_23ec - 8 +8
> e843419@02cb_00003767_25c - 8 +8
> bpf_prog_select_runtime 448 444 -4
> e843419@0aa3_0000cfd1_1580 8 - -8
> e843419@0aa2_0000cfba_147c 8 - -8
> e843419@075f_00008d8c_184 8 - -8
> prandom_u32_state 100 - -100
> Total: Before=19078072, After=19078780, chg +0.00%
>
> So 708 bytes more after inlining. The main cost is prandom_seed_full_state(),
> which calls prandom_u32_state() 10 times (via prandom_warmup()). I expect we
> could turn that into a loop to reduce ~450 bytes overall.
>
With the following change, the increase of prandom_seed_full_state() remains
reasonable, and performance-wise it is a lot better as it avoids reading and
writing the state via the stack:
diff --git a/lib/random32.c b/lib/random32.c
index 24e7acd9343f6..28a5b109c9018 100644
--- a/lib/random32.c
+++ b/lib/random32.c
@@ -94,17 +94,11 @@ EXPORT_SYMBOL(prandom_bytes_state);
static void prandom_warmup(struct rnd_state *state)
{
+ int i;
+
/* Calling RNG ten times to satisfy recurrence condition */
- prandom_u32_state(state);
- prandom_u32_state(state);
- prandom_u32_state(state);
- prandom_u32_state(state);
- prandom_u32_state(state);
- prandom_u32_state(state);
- prandom_u32_state(state);
- prandom_u32_state(state);
- prandom_u32_state(state);
- prandom_u32_state(state);
+ for (i = 0; i < 10; i++)
+ prandom_u32_state(state);
}
void prandom_seed_full_state(struct rnd_state __percpu *pcpu_state)
The loop is:
248: 38 e0 00 0a li r7,10
24c: 7c e9 03 a6 mtctr r7
250: 55 05 30 32 slwi r5,r8,6
254: 55 46 68 24 slwi r6,r10,13
258: 55 27 18 38 slwi r7,r9,3
25c: 7c a5 42 78 xor r5,r5,r8
260: 7c c6 52 78 xor r6,r6,r10
264: 7c e7 4a 78 xor r7,r7,r9
268: 54 8b 10 3a slwi r11,r4,2
26c: 7d 60 22 78 xor r0,r11,r4
270: 54 a5 9b 7e srwi r5,r5,13
274: 55 08 90 18 rlwinm r8,r8,18,0,12
278: 54 c6 5d 7e srwi r6,r6,21
27c: 55 4a 38 28 rlwinm r10,r10,7,0,20
280: 54 e7 a3 3e srwi r7,r7,12
284: 55 29 68 16 rlwinm r9,r9,13,0,11
288: 7d 64 5b 78 mr r4,r11
28c: 7c a8 42 78 xor r8,r5,r8
290: 7c ca 52 78 xor r10,r6,r10
294: 7c e9 4a 78 xor r9,r7,r9
298: 50 04 2e fe rlwimi r4,r0,5,27,31
29c: 42 00 ff b4 bdnz 250 <prandom_seed_full_state+0x7c>
Which replaces the 10 calls to prandom_u32_state()
fc: 91 3f 00 0c stw r9,12(r31)
100: 7f e3 fb 78 mr r3,r31
104: 48 00 00 01 bl 104 <prandom_seed_full_state+0x88>
104: R_PPC_REL24 prandom_u32_state
108: 7f e3 fb 78 mr r3,r31
10c: 48 00 00 01 bl 10c <prandom_seed_full_state+0x90>
10c: R_PPC_REL24 prandom_u32_state
110: 7f e3 fb 78 mr r3,r31
114: 48 00 00 01 bl 114 <prandom_seed_full_state+0x98>
114: R_PPC_REL24 prandom_u32_state
118: 7f e3 fb 78 mr r3,r31
11c: 48 00 00 01 bl 11c <prandom_seed_full_state+0xa0>
11c: R_PPC_REL24 prandom_u32_state
120: 7f e3 fb 78 mr r3,r31
124: 48 00 00 01 bl 124 <prandom_seed_full_state+0xa8>
124: R_PPC_REL24 prandom_u32_state
128: 7f e3 fb 78 mr r3,r31
12c: 48 00 00 01 bl 12c <prandom_seed_full_state+0xb0>
12c: R_PPC_REL24 prandom_u32_state
130: 7f e3 fb 78 mr r3,r31
134: 48 00 00 01 bl 134 <prandom_seed_full_state+0xb8>
134: R_PPC_REL24 prandom_u32_state
138: 7f e3 fb 78 mr r3,r31
13c: 48 00 00 01 bl 13c <prandom_seed_full_state+0xc0>
13c: R_PPC_REL24 prandom_u32_state
140: 7f e3 fb 78 mr r3,r31
144: 48 00 00 01 bl 144 <prandom_seed_full_state+0xc8>
144: R_PPC_REL24 prandom_u32_state
148: 80 01 00 24 lwz r0,36(r1)
14c: 7f e3 fb 78 mr r3,r31
150: 83 e1 00 1c lwz r31,28(r1)
154: 7c 08 03 a6 mtlr r0
158: 38 21 00 20 addi r1,r1,32
15c: 48 00 00 00 b 15c <prandom_seed_full_state+0xe0>
15c: R_PPC_REL24 prandom_u32_state
So approximately the same size in instructions, with better performance.
> I'm not really sure if 708 is good or bad...
That's in the noise compared to the overall size of vmlinux, but if we
change it to a loop we also reduce pressure on the cache.
Christophe
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH v3 2/3] prandom: Convert prandom_u32_state() to __always_inline
2026-01-02 14:09 ` Ryan Roberts
2026-01-03 8:00 ` Christophe Leroy (CS GROUP)
@ 2026-01-03 10:46 ` David Laight
2026-01-05 10:34 ` Ryan Roberts
1 sibling, 1 reply; 26+ messages in thread
From: David Laight @ 2026-01-03 10:46 UTC (permalink / raw)
To: Ryan Roberts
Cc: Jason A. Donenfeld, Catalin Marinas, Will Deacon, Huacai Chen,
Madhavan Srinivasan, Michael Ellerman, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Heiko Carstens, Vasily Gorbik,
Alexander Gordeev, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Kees Cook, Gustavo A. R. Silva, Arnd Bergmann,
Mark Rutland, Ard Biesheuvel, Jeremy Linton, linux-kernel,
linux-arm-kernel, loongarch, linuxppc-dev, linux-riscv,
linux-s390, linux-hardening
On Fri, 2 Jan 2026 14:09:26 +0000
Ryan Roberts <ryan.roberts@arm.com> wrote:
> On 02/01/2026 13:39, Jason A. Donenfeld wrote:
> > Hi Ryan,
> >
> > On Fri, Jan 2, 2026 at 2:12 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >> context. Given the function is just a handful of operations and doesn't
> >
> > How many? What's this looking like in terms of assembly?
>
> 25 instructions on arm64:
>
> 0000000000000000 <prandom_u32_state>:
> 0: 29401403 ldp w3, w5, [x0]
> 4: aa0003e1 mov x1, x0
> 8: 29410002 ldp w2, w0, [x0, #8]
> c: 531e74a4 lsl w4, w5, #2
> 10: 530e3468 lsl w8, w3, #18
> 14: 4a0400a5 eor w5, w5, w4
> 18: 4a031863 eor w3, w3, w3, lsl #6
> 1c: 53196047 lsl w7, w2, #7
> 20: 53134806 lsl w6, w0, #13
> 24: 4a023442 eor w2, w2, w2, lsl #13
> 28: 4a000c00 eor w0, w0, w0, lsl #3
> 2c: 121b6884 and w4, w4, #0xffffffe0
> 30: 120d3108 and w8, w8, #0xfff80000
> 34: 121550e7 and w7, w7, #0xfffff800
> 38: 120c2cc6 and w6, w6, #0xfff00000
> 3c: 2a456c85 orr w5, w4, w5, lsr #27
> 40: 2a433504 orr w4, w8, w3, lsr #13
> 44: 2a4254e3 orr w3, w7, w2, lsr #21
> 48: 2a4030c2 orr w2, w6, w0, lsr #12
> 4c: 4a020066 eor w6, w3, w2
> 50: 4a050080 eor w0, w4, w5
> 54: 4a0000c0 eor w0, w6, w0
> 58: 29001424 stp w4, w5, [x1]
> 5c: 29010823 stp w3, w2, [x1, #8]
> 60: d65f03c0 ret
That is gcc; clang seems to generate something horrid (from godbolt).
I'm not sure what it has tried to do (and maybe it can't in the kernel)
but it clearly doesn't help!
.LCPI0_0:
.word 18
.word 2
.word 7
.word 13
.LCPI0_1:
.word 6
.word 2
.word 13
.word 3
.LCPI0_2:
.word 4294443008
.word 4294967264
.word 4294965248
.word 4293918720
.LCPI0_3:
.word 4294967283
.word 4294967269
.word 4294967275
.word 4294967284
prandom_u32_state:
adrp x9, .LCPI0_1
ldr q0, [x0]
adrp x10, .LCPI0_3
ldr q1, [x9, :lo12:.LCPI0_1]
adrp x9, .LCPI0_0
ldr q3, [x10, :lo12:.LCPI0_3]
ldr q2, [x9, :lo12:.LCPI0_0]
adrp x9, .LCPI0_2
mov x8, x0
ushl v1.4s, v0.4s, v1.4s
ushl v2.4s, v0.4s, v2.4s
eor v0.16b, v1.16b, v0.16b
ldr q1, [x9, :lo12:.LCPI0_2]
and v1.16b, v2.16b, v1.16b
ushl v0.4s, v0.4s, v3.4s
orr v0.16b, v0.16b, v1.16b
ext v1.16b, v0.16b, v0.16b, #8
str q0, [x8]
eor v1.8b, v0.8b, v1.8b
fmov x9, d1
lsr x10, x9, #32
eor w0, w9, w10
ret
The x86 versions are a little longer (arm's barrel shifter helps a lot).
>
> > It'd also be
> > nice to have some brief analysis of other call sites to have
> > confirmation this isn't blowing up other users.
>
> I compiled defconfig before and after this patch on arm64 and compared the text
> sizes:
>
> $ ./scripts/bloat-o-meter -t vmlinux.before vmlinux.after
> add/remove: 3/4 grow/shrink: 4/1 up/down: 836/-128 (708)
> Function old new delta
> prandom_seed_full_state 364 932 +568
> pick_next_task_fair 1940 2036 +96
> bpf_user_rnd_u32 104 196 +92
> prandom_bytes_state 204 260 +56
> e843419@0f2b_00012d69_e34 - 8 +8
> e843419@0db7_00010ec3_23ec - 8 +8
> e843419@02cb_00003767_25c - 8 +8
> bpf_prog_select_runtime 448 444 -4
> e843419@0aa3_0000cfd1_1580 8 - -8
> e843419@0aa2_0000cfba_147c 8 - -8
> e843419@075f_00008d8c_184 8 - -8
> prandom_u32_state 100 - -100
> Total: Before=19078072, After=19078780, chg +0.00%
>
> So 708 bytes more after inlining.
Doesn't look like there are many calls.
> The main cost is prandom_seed_full_state(),
> which calls prandom_u32_state() 10 times (via prandom_warmup()). I expect we
> could turn that into a loop to reduce ~450 bytes overall.
That would always have helped the code size.
And I suspect the other costs of that code make unrolling the loop pointless.
>
> I'm not really sure if 708 is good or bad...
>
> >
> >> +static __always_inline u32 prandom_u32_state(struct rnd_state *state)
> >
> > Why not just normal `inline`? Is gcc disagreeing with the inlinability
> > of this function?
>
> Given this needs to be called from a noinstr function, I didn't want to give the
> compiler the opportunity to decide not to inline it, since in that case, some
> instrumentation might end up being applied to the function body which would blow
> up when called in the noinstr context.
>
> I think the other 2 options are to keep prandom_u32_state() in the c file but
> mark it noinstr or rearrange all the users so that they don't call it until
> instrumentation is allowable. The latter is something I was trying to avoid.
>
> There is some previous discussion of this at [1].
>
> [1] https://lore.kernel.org/all/aS65LFUfdgRPKv1l@J2N7QTR9R3/
>
> Perhaps keeping prandom_u32_state() in the c file and making it noinstr is the
> best compromise?
Or define prandom_u32_state_inline() as always_inline and have the
real function:
u32 prandom_u32_state(struct rnd_state *state)
{
return prandom_u32_state_inline(state);
}
So that the callers can pick the inline version if it really matters.
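Roughly, as a sketch only (reusing the existing body; the _inline name is
just illustrative):

/* include/linux/prandom.h */
static __always_inline u32 prandom_u32_state_inline(struct rnd_state *state)
{
#define TAUSWORTHE(s, a, b, c, d) ((s & c) << d) ^ (((s << a) ^ s) >> b)
	/* Same generator as today, only guaranteed to be inlined. */
	state->s1 = TAUSWORTHE(state->s1, 6U, 13U, 4294967294U, 18U);
	state->s2 = TAUSWORTHE(state->s2, 2U, 27U, 4294967288U, 2U);
	state->s3 = TAUSWORTHE(state->s3, 13U, 21U, 4294967280U, 7U);
	state->s4 = TAUSWORTHE(state->s4, 3U, 12U, 4294967168U, 13U);

	return (state->s1 ^ state->s2 ^ state->s3 ^ state->s4);
}

u32 prandom_u32_state(struct rnd_state *state);

with the out-of-line wrapper above staying in lib/random32.c, so existing
callers keep a single copy of the code and only the syscall/noinstr path
pays the inlining cost.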
David
>
> Thanks,
> Ryan
>
> >
> > Jason
>
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 3/3] randomize_kstack: Unify random source across arches
2026-01-02 13:11 ` [PATCH v3 3/3] randomize_kstack: Unify random source across arches Ryan Roberts
@ 2026-01-04 23:01 ` David Laight
2026-01-05 11:05 ` Ryan Roberts
2026-01-07 14:05 ` David Laight
2026-01-19 10:48 ` Mark Rutland
1 sibling, 2 replies; 26+ messages in thread
From: David Laight @ 2026-01-04 23:01 UTC (permalink / raw)
To: Ryan Roberts
Cc: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland,
Jason A. Donenfeld, Ard Biesheuvel, Jeremy Linton, linux-kernel,
linux-arm-kernel, loongarch, linuxppc-dev, linux-riscv,
linux-s390, linux-hardening
On Fri, 2 Jan 2026 13:11:54 +0000
Ryan Roberts <ryan.roberts@arm.com> wrote:
> Previously different architectures were using random sources of
> differing strength and cost to decide the random kstack offset. A number
> of architectures (loongarch, powerpc, s390, x86) were using their
> timestamp counter, at whatever the frequency happened to be. Other
> arches (arm64, riscv) were using entropy from the crng via
> get_random_u16().
>
> There have been concerns that in some cases the timestamp counters may
> be too weak, because they can be easily guessed or influenced by user
> space. And get_random_u16() has been shown to be too costly for the
> level of protection kstack offset randomization provides.
>
> So let's use a common, architecture-agnostic source of entropy; a
> per-cpu prng, seeded at boot-time from the crng. This has a few
> benefits:
>
> - We can remove choose_random_kstack_offset(); That was only there to
> try to make the timestamp counter value a bit harder to influence
> from user space.
>
> - The architecture code is simplified. All it has to do now is call
> add_random_kstack_offset() in the syscall path.
>
> - The strength of the randomness can be reasoned about independently
> of the architecture.
>
> - Arches previously using get_random_u16() now have much faster
> syscall paths, see below results.
>
> There have been some claims that a prng may be less strong than the
> timestamp counter if not regularly reseeded. But the prng has a period
> of about 2^113. So as long as the prng state remains secret, it should
> not be possible to guess. If the prng state can be accessed, we have
> bigger problems.
If you have 128 bits of output from consecutive outputs I think you
can trivially determine the full state using (almost) 'school boy' maths
that could be done on pencil and paper.
(Most of the work only has to be done once.)
The underlying problem is that the TAUSWORTHE() transformation is 'linear'
So that TAUSWORTHE(x ^ y) == TAUSWORTHE(x) ^ TAUSWORTHE(y).
(This is true of an LFSR/CRC and TAUSWORTHE() is doing some subset of CRCs.)
This means that each output bit is the 'xor' of some of the input bits.
The four new 'state' values are just xors of the bits of the old ones.
The final xor of the four states gives a 32bit value with each bit just
an xor of some of the 128 state bits.
Get four consecutive 32 bit values and you can solve the 128 simultaneous
equations (by trivial substitution) and get the initial state.
The solution gives you the 128 128bit constants for:
u128 state = 0;
u128 val = 'value returned from 4 calls';
for (int i = 0; i < 128; i++)
state |= parity(const128[i] ^ val) << i;
You don't need all 32bits, just accumulate 128 bits.
So if you can get the 5bit stack offset from 26 system calls you know the
value that will be used for all the subsequent calls.
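(As a quick illustration - a throwaway userspace sketch, not kernel code,
using the TAUSWORTHE definition verbatim - the whole update really is
xor-linear:)

#include <assert.h>
#include <stdlib.h>

typedef unsigned int u32;
struct rnd_state { u32 s1, s2, s3, s4; };

/* Copy of the kernel generator, unchanged. */
static u32 prandom_u32_state(struct rnd_state *state)
{
#define TAUSWORTHE(s, a, b, c, d) ((s & c) << d) ^ (((s << a) ^ s) >> b)
	state->s1 = TAUSWORTHE(state->s1, 6U, 13U, 4294967294U, 18U);
	state->s2 = TAUSWORTHE(state->s2, 2U, 27U, 4294967288U, 2U);
	state->s3 = TAUSWORTHE(state->s3, 13U, 21U, 4294967280U, 7U);
	state->s4 = TAUSWORTHE(state->s4, 3U, 12U, 4294967168U, 13U);
	return (state->s1 ^ state->s2 ^ state->s3 ^ state->s4);
}

int main(void)
{
	for (int i = 0; i < 100000; i++) {
		/* Two random states and their xor. */
		struct rnd_state x = { rand(), rand(), rand(), rand() };
		struct rnd_state y = { rand(), rand(), rand(), rand() };
		struct rnd_state z = { x.s1 ^ y.s1, x.s2 ^ y.s2,
				       x.s3 ^ y.s3, x.s4 ^ y.s4 };
		/* Output of the xor'd state is the xor of the outputs. */
		assert(prandom_u32_state(&z) ==
		       (prandom_u32_state(&x) ^ prandom_u32_state(&y)));
	}
	return 0;
}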
Simply changing the final line to use + not ^ makes the output non-linear
and solving the equations a lot harder.
I might sit down tomorrow and see if I can actually code it...
David
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 1/3] randomize_kstack: Maintain kstack_offset per task
2026-01-02 22:44 ` David Laight
@ 2026-01-05 10:30 ` Ryan Roberts
0 siblings, 0 replies; 26+ messages in thread
From: Ryan Roberts @ 2026-01-05 10:30 UTC (permalink / raw)
To: David Laight
Cc: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland,
Jason A. Donenfeld, Ard Biesheuvel, Jeremy Linton, linux-kernel,
linux-arm-kernel, loongarch, linuxppc-dev, linux-riscv,
linux-s390, linux-hardening, stable
On 02/01/2026 22:44, David Laight wrote:
> On Fri, 2 Jan 2026 13:11:52 +0000
> Ryan Roberts <ryan.roberts@arm.com> wrote:
>
>> kstack_offset was previously maintained per-cpu, but this caused a
>> couple of issues. So let's instead make it per-task.
>>
>> Issue 1: add_random_kstack_offset() and choose_random_kstack_offset()
>> were expected and required to be called with interrupts and preemption
>> disabled so that they could manipulate per-cpu state. But arm64, loongarch
>> and risc-v are calling them with interrupts and preemption enabled. I
>> don't _think_ this causes any functional issues, but it's certainly
>> unexpected and could lead to manipulating the wrong cpu's state, which
>> could cause a minor performance degradation due to bouncing the cache
>> lines. By maintaining the state per-task those functions can safely be
>> called in preemptible context.
>>
>> Issue 2: add_random_kstack_offset() is called before executing the
>> syscall and expands the stack using a previously chosen random offset.
> <>
> David
Cheers; will fix in next version.
Thanks,
Ryan
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 2/3] prandom: Convert prandom_u32_state() to __always_inline
2026-01-03 10:46 ` David Laight
@ 2026-01-05 10:34 ` Ryan Roberts
0 siblings, 0 replies; 26+ messages in thread
From: Ryan Roberts @ 2026-01-05 10:34 UTC (permalink / raw)
To: David Laight
Cc: Jason A. Donenfeld, Catalin Marinas, Will Deacon, Huacai Chen,
Madhavan Srinivasan, Michael Ellerman, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Heiko Carstens, Vasily Gorbik,
Alexander Gordeev, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Kees Cook, Gustavo A. R. Silva, Arnd Bergmann,
Mark Rutland, Ard Biesheuvel, Jeremy Linton, linux-kernel,
linux-arm-kernel, loongarch, linuxppc-dev, linux-riscv,
linux-s390, linux-hardening
On 03/01/2026 10:46, David Laight wrote:
> On Fri, 2 Jan 2026 14:09:26 +0000
> Ryan Roberts <ryan.roberts@arm.com> wrote:
>
>> On 02/01/2026 13:39, Jason A. Donenfeld wrote:
>>> Hi Ryan,
>>>
>>> On Fri, Jan 2, 2026 at 2:12 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>> context. Given the function is just a handful of operations and doesn't
>>>
>>> How many? What's this looking like in terms of assembly?
>>
>> 25 instructions on arm64:
>>
>> 0000000000000000 <prandom_u32_state>:
>> 0: 29401403 ldp w3, w5, [x0]
>> 4: aa0003e1 mov x1, x0
>> 8: 29410002 ldp w2, w0, [x0, #8]
>> c: 531e74a4 lsl w4, w5, #2
>> 10: 530e3468 lsl w8, w3, #18
>> 14: 4a0400a5 eor w5, w5, w4
>> 18: 4a031863 eor w3, w3, w3, lsl #6
>> 1c: 53196047 lsl w7, w2, #7
>> 20: 53134806 lsl w6, w0, #13
>> 24: 4a023442 eor w2, w2, w2, lsl #13
>> 28: 4a000c00 eor w0, w0, w0, lsl #3
>> 2c: 121b6884 and w4, w4, #0xffffffe0
>> 30: 120d3108 and w8, w8, #0xfff80000
>> 34: 121550e7 and w7, w7, #0xfffff800
>> 38: 120c2cc6 and w6, w6, #0xfff00000
>> 3c: 2a456c85 orr w5, w4, w5, lsr #27
>> 40: 2a433504 orr w4, w8, w3, lsr #13
>> 44: 2a4254e3 orr w3, w7, w2, lsr #21
>> 48: 2a4030c2 orr w2, w6, w0, lsr #12
>> 4c: 4a020066 eor w6, w3, w2
>> 50: 4a050080 eor w0, w4, w5
>> 54: 4a0000c0 eor w0, w6, w0
>> 58: 29001424 stp w4, w5, [x1]
>> 5c: 29010823 stp w3, w2, [x1, #8]
>> 60: d65f03c0 ret
>
> That is gcc, clang seems to generate something horrid (from godbolt).
> I'm not sure what it has tried to do (and maybe it can't in kernel)
> but it clearly doesn't help!
> .LCPI0_0:
> .word 18
> .word 2
> .word 7
> .word 13
> .LCPI0_1:
> .word 6
> .word 2
> .word 13
> .word 3
> .LCPI0_2:
> .word 4294443008
> .word 4294967264
> .word 4294965248
> .word 4293918720
> .LCPI0_3:
> .word 4294967283
> .word 4294967269
> .word 4294967275
> .word 4294967284
> prandom_u32_state:
> adrp x9, .LCPI0_1
> ldr q0, [x0]
> adrp x10, .LCPI0_3
> ldr q1, [x9, :lo12:.LCPI0_1]
> adrp x9, .LCPI0_0
> ldr q3, [x10, :lo12:.LCPI0_3]
> ldr q2, [x9, :lo12:.LCPI0_0]
> adrp x9, .LCPI0_2
> mov x8, x0
> ushl v1.4s, v0.4s, v1.4s
> ushl v2.4s, v0.4s, v2.4s
> eor v0.16b, v1.16b, v0.16b
> ldr q1, [x9, :lo12:.LCPI0_2]
> and v1.16b, v2.16b, v1.16b
> ushl v0.4s, v0.4s, v3.4s
> orr v0.16b, v0.16b, v1.16b
> ext v1.16b, v0.16b, v0.16b, #8
> str q0, [x8]
> eor v1.8b, v0.8b, v1.8b
> fmov x9, d1
> lsr x10, x9, #32
> eor w0, w9, w10
> ret
>
> The x86 versions are a little longer (arm's barrel shifter helps a lot).
>
>>
>>> It'd also be
>>> nice to have some brief analysis of other call sites to have
>>> confirmation this isn't blowing up other users.
>>
>> I compiled defconfig before and after this patch on arm64 and compared the text
>> sizes:
>>
>> $ ./scripts/bloat-o-meter -t vmlinux.before vmlinux.after
>> add/remove: 3/4 grow/shrink: 4/1 up/down: 836/-128 (708)
>> Function old new delta
>> prandom_seed_full_state 364 932 +568
>> pick_next_task_fair 1940 2036 +96
>> bpf_user_rnd_u32 104 196 +92
>> prandom_bytes_state 204 260 +56
>> e843419@0f2b_00012d69_e34 - 8 +8
>> e843419@0db7_00010ec3_23ec - 8 +8
>> e843419@02cb_00003767_25c - 8 +8
>> bpf_prog_select_runtime 448 444 -4
>> e843419@0aa3_0000cfd1_1580 8 - -8
>> e843419@0aa2_0000cfba_147c 8 - -8
>> e843419@075f_00008d8c_184 8 - -8
>> prandom_u32_state 100 - -100
>> Total: Before=19078072, After=19078780, chg +0.00%
>>
>> So 708 bytes more after inlining.
>
> Doesn't look like there are many calls.
>
>> The main cost is prandom_seed_full_state(),
>> which calls prandom_u32_state() 10 times (via prandom_warmup()). I expect we
>> could turn that into a loop to reduce ~450 bytes overall.
>
> That would always have helped the code size.
> And I suspect the other costs of that code make unrolling the loop pointless.
>
>>
>> I'm not really sure if 708 is good or bad...
>>
>>>
>>>> +static __always_inline u32 prandom_u32_state(struct rnd_state *state)
>>>
>>> Why not just normal `inline`? Is gcc disagreeing with the inlinability
>>> of this function?
>>
>> Given this needs to be called from a noinstr function, I didn't want to give the
>> compiler the opportunity to decide not to inline it, since in that case, some
>> instrumentation might end up being applied to the function body which would blow
>> up when called in the noinstr context.
>>
>> I think the other 2 options are to keep prandom_u32_state() in the c file but
>> mark it noinstr or rearrange all the users so that they don't call it until
>> instrumentation is allowable. The latter is something I was trying to avoid.
>>
>> There is some previous discussion of this at [1].
>>
>> [1] https://lore.kernel.org/all/aS65LFUfdgRPKv1l@J2N7QTR9R3/
>>
>> Perhaps keeping prandom_u32_state() in the c file and making it noinstr is the
>> best compromise?
>
> Or define prandom_u32_state_inline() as always_inline and have the
> real function:
> u32 prandom_u32_state(struct rnd_state *state)
> {
> return prandom_u32_state_inline(state);
> }
>
> So that the callers can pick the inline version if it really matters.
Ahh yes, that sounds like the simplest/best idea to me. I'll take this approach
for the next version assuming Jason is ok with it?
Thanks,
Ryan
>
> David
>
>>
>> Thanks,
>> Ryan
>>
>>>
>>> Jason
>>
>>
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 2/3] prandom: Convert prandom_u32_state() to __always_inline
2026-01-03 8:00 ` Christophe Leroy (CS GROUP)
@ 2026-01-05 10:36 ` Ryan Roberts
0 siblings, 0 replies; 26+ messages in thread
From: Ryan Roberts @ 2026-01-05 10:36 UTC (permalink / raw)
To: Christophe Leroy (CS GROUP), Jason A. Donenfeld
Cc: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland, Ard Biesheuvel,
Jeremy Linton, linux-kernel, linux-arm-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, linux-hardening
On 03/01/2026 08:00, Christophe Leroy (CS GROUP) wrote:
>
>
> On 02/01/2026 15:09, Ryan Roberts wrote:
>> On 02/01/2026 13:39, Jason A. Donenfeld wrote:
>>> Hi Ryan,
>>>
>>> On Fri, Jan 2, 2026 at 2:12 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>> context. Given the function is just a handful of operations and doesn't
>>>
>>> How many? What's this looking like in terms of assembly?
>>
>> 25 instructions on arm64:
>
> 31 instructions on powerpc:
>
> 00000000 <prandom_u32_state>:
> 0: 7c 69 1b 78 mr r9,r3
> 4: 80 63 00 00 lwz r3,0(r3)
> 8: 80 89 00 08 lwz r4,8(r9)
> c: 81 69 00 04 lwz r11,4(r9)
> 10: 80 a9 00 0c lwz r5,12(r9)
> 14: 54 67 30 32 slwi r7,r3,6
> 18: 7c e7 1a 78 xor r7,r7,r3
> 1c: 55 66 10 3a slwi r6,r11,2
> 20: 54 88 68 24 slwi r8,r4,13
> 24: 54 63 90 18 rlwinm r3,r3,18,0,12
> 28: 7d 6b 32 78 xor r11,r11,r6
> 2c: 7d 08 22 78 xor r8,r8,r4
> 30: 54 aa 18 38 slwi r10,r5,3
> 34: 54 e7 9b 7e srwi r7,r7,13
> 38: 7c e7 1a 78 xor r7,r7,r3
> 3c: 51 66 2e fe rlwimi r6,r11,5,27,31
> 40: 54 84 38 28 rlwinm r4,r4,7,0,20
> 44: 7d 4a 2a 78 xor r10,r10,r5
> 48: 55 08 5d 7e srwi r8,r8,21
> 4c: 7d 08 22 78 xor r8,r8,r4
> 50: 7c e3 32 78 xor r3,r7,r6
> 54: 54 a5 68 16 rlwinm r5,r5,13,0,11
> 58: 55 4a a3 3e srwi r10,r10,12
> 5c: 7d 4a 2a 78 xor r10,r10,r5
> 60: 7c 63 42 78 xor r3,r3,r8
> 64: 90 e9 00 00 stw r7,0(r9)
> 68: 90 c9 00 04 stw r6,4(r9)
> 6c: 91 09 00 08 stw r8,8(r9)
> 70: 91 49 00 0c stw r10,12(r9)
> 74: 7c 63 52 78 xor r3,r3,r10
> 78: 4e 80 00 20 blr
>
> Among those, 8 instructions are for reading/writing the state on the stack. They of
> course disappear when inlining.
>
>>
>>> It'd also be
>>> nice to have some brief analysis of other call sites to have
>>> confirmation this isn't blowing up other users.
>>
>> I compiled defconfig before and after this patch on arm64 and compared the text
>> sizes:
>>
>> $ ./scripts/bloat-o-meter -t vmlinux.before vmlinux.after
>> add/remove: 3/4 grow/shrink: 4/1 up/down: 836/-128 (708)
>> Function old new delta
>> prandom_seed_full_state 364 932 +568
>> pick_next_task_fair 1940 2036 +96
>> bpf_user_rnd_u32 104 196 +92
>> prandom_bytes_state 204 260 +56
>> e843419@0f2b_00012d69_e34 - 8 +8
>> e843419@0db7_00010ec3_23ec - 8 +8
>> e843419@02cb_00003767_25c - 8 +8
>> bpf_prog_select_runtime 448 444 -4
>> e843419@0aa3_0000cfd1_1580 8 - -8
>> e843419@0aa2_0000cfba_147c 8 - -8
>> e843419@075f_00008d8c_184 8 - -8
>> prandom_u32_state 100 - -100
>> Total: Before=19078072, After=19078780, chg +0.00%
>>
>> So 708 bytes more after inlining. The main cost is prandom_seed_full_state(),
>> which calls prandom_u32_state() 10 times (via prandom_warmup()). I expect we
>> could turn that into a loop to reduce ~450 bytes overall.
>>
> With the following change the increase of prandom_seed_full_state() remains
> reasonable and performance-wise it is a lot better as it avoids the read/write
> of the state via the stack
>
> diff --git a/lib/random32.c b/lib/random32.c
> index 24e7acd9343f6..28a5b109c9018 100644
> --- a/lib/random32.c
> +++ b/lib/random32.c
> @@ -94,17 +94,11 @@ EXPORT_SYMBOL(prandom_bytes_state);
>
> static void prandom_warmup(struct rnd_state *state)
> {
> + int i;
> +
> /* Calling RNG ten times to satisfy recurrence condition */
> - prandom_u32_state(state);
> - prandom_u32_state(state);
> - prandom_u32_state(state);
> - prandom_u32_state(state);
> - prandom_u32_state(state);
> - prandom_u32_state(state);
> - prandom_u32_state(state);
> - prandom_u32_state(state);
> - prandom_u32_state(state);
> - prandom_u32_state(state);
> + for (i = 0; i < 10; i++)
> + prandom_u32_state(state);
> }
>
> void prandom_seed_full_state(struct rnd_state __percpu *pcpu_state)
>
> The loop is:
>
> 248: 38 e0 00 0a li r7,10
> 24c: 7c e9 03 a6 mtctr r7
> 250: 55 05 30 32 slwi r5,r8,6
> 254: 55 46 68 24 slwi r6,r10,13
> 258: 55 27 18 38 slwi r7,r9,3
> 25c: 7c a5 42 78 xor r5,r5,r8
> 260: 7c c6 52 78 xor r6,r6,r10
> 264: 7c e7 4a 78 xor r7,r7,r9
> 268: 54 8b 10 3a slwi r11,r4,2
> 26c: 7d 60 22 78 xor r0,r11,r4
> 270: 54 a5 9b 7e srwi r5,r5,13
> 274: 55 08 90 18 rlwinm r8,r8,18,0,12
> 278: 54 c6 5d 7e srwi r6,r6,21
> 27c: 55 4a 38 28 rlwinm r10,r10,7,0,20
> 280: 54 e7 a3 3e srwi r7,r7,12
> 284: 55 29 68 16 rlwinm r9,r9,13,0,11
> 288: 7d 64 5b 78 mr r4,r11
> 28c: 7c a8 42 78 xor r8,r5,r8
> 290: 7c ca 52 78 xor r10,r6,r10
> 294: 7c e9 4a 78 xor r9,r7,r9
> 298: 50 04 2e fe rlwimi r4,r0,5,27,31
> 29c: 42 00 ff b4 bdnz 250 <prandom_seed_full_state+0x7c>
>
> Which replaces the 10 calls to prandom_u32_state()
>
> fc: 91 3f 00 0c stw r9,12(r31)
> 100: 7f e3 fb 78 mr r3,r31
> 104: 48 00 00 01 bl 104 <prandom_seed_full_state+0x88>
> 104: R_PPC_REL24 prandom_u32_state
> 108: 7f e3 fb 78 mr r3,r31
> 10c: 48 00 00 01 bl 10c <prandom_seed_full_state+0x90>
> 10c: R_PPC_REL24 prandom_u32_state
> 110: 7f e3 fb 78 mr r3,r31
> 114: 48 00 00 01 bl 114 <prandom_seed_full_state+0x98>
> 114: R_PPC_REL24 prandom_u32_state
> 118: 7f e3 fb 78 mr r3,r31
> 11c: 48 00 00 01 bl 11c <prandom_seed_full_state+0xa0>
> 11c: R_PPC_REL24 prandom_u32_state
> 120: 7f e3 fb 78 mr r3,r31
> 124: 48 00 00 01 bl 124 <prandom_seed_full_state+0xa8>
> 124: R_PPC_REL24 prandom_u32_state
> 128: 7f e3 fb 78 mr r3,r31
> 12c: 48 00 00 01 bl 12c <prandom_seed_full_state+0xb0>
> 12c: R_PPC_REL24 prandom_u32_state
> 130: 7f e3 fb 78 mr r3,r31
> 134: 48 00 00 01 bl 134 <prandom_seed_full_state+0xb8>
> 134: R_PPC_REL24 prandom_u32_state
> 138: 7f e3 fb 78 mr r3,r31
> 13c: 48 00 00 01 bl 13c <prandom_seed_full_state+0xc0>
> 13c: R_PPC_REL24 prandom_u32_state
> 140: 7f e3 fb 78 mr r3,r31
> 144: 48 00 00 01 bl 144 <prandom_seed_full_state+0xc8>
> 144: R_PPC_REL24 prandom_u32_state
> 148: 80 01 00 24 lwz r0,36(r1)
> 14c: 7f e3 fb 78 mr r3,r31
> 150: 83 e1 00 1c lwz r31,28(r1)
> 154: 7c 08 03 a6 mtlr r0
> 158: 38 21 00 20 addi r1,r1,32
> 15c: 48 00 00 00 b 15c <prandom_seed_full_state+0xe0>
> 15c: R_PPC_REL24 prandom_u32_state
>
>
> So approx the same number of instructions in size, but with better performance.
>
>> I'm not really sure if 708 is good or bad...
>
> That's in the noise compared to the overall size of vmlinux, but if we change it
> to a loop we also reduce pressure on the cache.
Thanks for the analysis; I'm going to follow David's suggestion and refactor
this into both an __always_inline and an out-of-line version. That way the
existing callsites can continue to use the out-of-line version and we will only
use the inline version for the kstack offset randomization.
Thanks,
Ryan
>
> Christophe
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 3/3] randomize_kstack: Unify random source across arches
2026-01-04 23:01 ` David Laight
@ 2026-01-05 11:05 ` Ryan Roberts
2026-01-05 14:45 ` David Laight
2026-01-07 14:05 ` David Laight
1 sibling, 1 reply; 26+ messages in thread
From: Ryan Roberts @ 2026-01-05 11:05 UTC (permalink / raw)
To: David Laight
Cc: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland,
Jason A. Donenfeld, Ard Biesheuvel, Jeremy Linton, linux-kernel,
linux-arm-kernel, loongarch, linuxppc-dev, linux-riscv,
linux-s390, linux-hardening
On 04/01/2026 23:01, David Laight wrote:
> On Fri, 2 Jan 2026 13:11:54 +0000
> Ryan Roberts <ryan.roberts@arm.com> wrote:
>
>> Previously different architectures were using random sources of
>> differing strength and cost to decide the random kstack offset. A number
>> of architectures (loongarch, powerpc, s390, x86) were using their
>> timestamp counter, at whatever the frequency happened to be. Other
>> arches (arm64, riscv) were using entropy from the crng via
>> get_random_u16().
>>
>> There have been concerns that in some cases the timestamp counters may
>> be too weak, because they can be easily guessed or influenced by user
>> space. And get_random_u16() has been shown to be too costly for the
>> level of protection kstack offset randomization provides.
>>
>> So let's use a common, architecture-agnostic source of entropy; a
>> per-cpu prng, seeded at boot-time from the crng. This has a few
>> benefits:
>>
>> - We can remove choose_random_kstack_offset(); That was only there to
>> try to make the timestamp counter value a bit harder to influence
>> from user space.
>>
>> - The architecture code is simplified. All it has to do now is call
>> add_random_kstack_offset() in the syscall path.
>>
>> - The strength of the randomness can be reasoned about independently
>> of the architecture.
>>
>> - Arches previously using get_random_u16() now have much faster
>> syscall paths, see below results.
>>
>> There have been some claims that a prng may be less strong than the
>> timestamp counter if not regularly reseeded. But the prng has a period
>> of about 2^113. So as long as the prng state remains secret, it should
>> not be possible to guess. If the prng state can be accessed, we have
>> bigger problems.
>
> If you have 128 bits of output from consecutive outputs I think you
> can trivially determine the full state using (almost) 'school boy' maths
> that could be done on pencil and paper.
> (Most of the work only has to be done once.)
>
> The underlying problem is that the TAUSWORTHE() transformation is 'linear'
> So that TAUSWORTHE(x ^ y) == TAUSWORTHE(x) ^ TAUSWORTHE(y).
> (This is true of an LFSR/CRC and TAUSWORTHE() is doing some subset of CRCs.)
> This means that each output bit is the 'xor' of some of the input bits.
> The four new 'state' values are just xors of the bits of the old ones.
> The final xor of the four states gives a 32bit value with each bit just
> an xor of some of the 128 state bits.
> Get four consecutive 32 bit values and you can solve the 128 simultaneous
> equations (by trivial substitution) and get the initial state.
> The solution gives you the 128 128bit constants for:
> u128 state = 0;
> u128 val = 'value returned from 4 calls';
> for (int i = 0; i < 128; i++)
> state |= parity(const128[i] ^ val) << i;
What is const128[] here?
> You don't need all 32bits, just accumulate 128 bits.
> So if you can get the 5bit stack offset from 26 system calls you know the
> value that will be used for all the subsequent calls.
It's not immediately obvious to me how user space would do this, but I'll take
it on faith that it may be possible.
>
> Simply changing the final line to use + not ^ makes the output non-linear
> and solving the equations a lot harder.
There has been pushback on introducing new primitives [1] but I don't think
that's a reason not to consider it.
[1] https://lore.kernel.org/all/aRyppb8PCxFKVphr@zx2c4.com/
>
> I might sit down tomorrow and see if I can actually code it...
Thanks for the analysis! I look forward to seeing your conclusion... although
I'm not sure I'll be qualified to evaluate it mathematically.
FWIW, I previously had a go at various schemes using siphash to calculate some
random bits. I found it to be significantly slower than this prng. I've so far
taken the view that 6 bits of randomness is not much of a defence against brute
force so we really shouldn't be spending too many cycles to generate the bits.
If we can get this approach to work, I think that's best.
Thanks,
Ryan
>
> David
>
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 3/3] randomize_kstack: Unify random source across arches
2026-01-05 11:05 ` Ryan Roberts
@ 2026-01-05 14:45 ` David Laight
0 siblings, 0 replies; 26+ messages in thread
From: David Laight @ 2026-01-05 14:45 UTC (permalink / raw)
To: Ryan Roberts
Cc: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland,
Jason A. Donenfeld, Ard Biesheuvel, Jeremy Linton, linux-kernel,
linux-arm-kernel, loongarch, linuxppc-dev, linux-riscv,
linux-s390, linux-hardening
On Mon, 5 Jan 2026 11:05:18 +0000
Ryan Roberts <ryan.roberts@arm.com> wrote:
> On 04/01/2026 23:01, David Laight wrote:
> > On Fri, 2 Jan 2026 13:11:54 +0000
> > Ryan Roberts <ryan.roberts@arm.com> wrote:
> >
> >> Previously different architectures were using random sources of
> >> differing strength and cost to decide the random kstack offset. A number
> >> of architectures (loongarch, powerpc, s390, x86) were using their
> >> timestamp counter, at whatever the frequency happened to be. Other
> >> arches (arm64, riscv) were using entropy from the crng via
> >> get_random_u16().
> >>
> >> There have been concerns that in some cases the timestamp counters may
> >> be too weak, because they can be easily guessed or influenced by user
> >> space. And get_random_u16() has been shown to be too costly for the
> >> level of protection kstack offset randomization provides.
> >>
> >> So let's use a common, architecture-agnostic source of entropy; a
> >> per-cpu prng, seeded at boot-time from the crng. This has a few
> >> benefits:
> >>
> >> - We can remove choose_random_kstack_offset(); That was only there to
> >> try to make the timestamp counter value a bit harder to influence
> >> from user space.
> >>
> >> - The architecture code is simplified. All it has to do now is call
> >> add_random_kstack_offset() in the syscall path.
> >>
> >> - The strength of the randomness can be reasoned about independently
> >> of the architecture.
> >>
> >> - Arches previously using get_random_u16() now have much faster
> >> syscall paths, see below results.
> >>
> >> There have been some claims that a prng may be less strong than the
> >> timestamp counter if not regularly reseeded. But the prng has a period
> >> of about 2^113. So as long as the prng state remains secret, it should
> >> not be possible to guess. If the prng state can be accessed, we have
> >> bigger problems.
> >
> > If you have 128 bits of output from consecutive outputs I think you
> > can trivially determine the full state using (almost) 'school boy' maths
> > that could be done on pencil and paper.
> > (Most of the work only has to be done once.)
> >
> > The underlying problem is that the TAUSWORTHE() transformation is 'linear'
> > So that TAUSWORTHE(x ^ y) == TAUSWORTHE(x) ^ TAUSWORTHE(y).
> > (This is true of an LFSR/CRC and TAUSWORTHE() is doing some subset of CRCs.)
> > This means that each output bit is the 'xor' of some of the input bits.
> > The four new 'state' values are just xors of the bits of the old ones.
> > The final xor of the four states gives a 32bit value with each bit just
> > an xor of some of the 128 state bits.
> > Get four consecutive 32 bit values and you can solve the 128 simultaneous
> > equations (by trivial substitution) and get the initial state.
> > The solution gives you the 128 128bit constants for:
> > u128 state = 0;
> > u128 val = 'value returned from 4 calls';
> > for (int i = 0; i < 128; i++)
> > state |= parity(const128[i] ^ val) << i;
>
> What is const128[] here?
Some values you prepared earlier :-)
> > You don't need all 32bits, just accumulate 128 bits.
> > So if you can get the 5bit stack offset from 26 system calls you know the
> > value that will be used for all the subsequent calls.
>
> It's not immediately obvious to me how user space would do this, but I'll take
> it on faith that it may be possible.
It shouldn't be possible, but anything that leaks a stack address would
give it away.
It is also pretty much why you care about the cycle length of the PRNG.
(If the length is short a rogue application can remember all the values.)
> >
> > Simply changing the final line to use + not ^ makes the output non-linear
> > and solving the equations a lot harder.
>
> There has been pushback on introducing new primitives [1] but I don't think
> that's a reason not to consider it.
That is a more general issue with the PRNG.
ISTR it was true for the previous version that explicitly used four CRCs.
Jason should know more about whether the xors are a good idea.
>
> [1] https://lore.kernel.org/all/aRyppb8PCxFKVphr@zx2c4.com/
>
> >
> > I might sit down tomorrow and see if I can actually code it...
>
> Thanks for the analysis! I look forward to seeing your conclusion... although
> I'm not sure I'll be qualified to evaluate it mathematically.
I need to drag out the brain cells from when I learnt about CRC (actually
relating to burst error correction) over 40 years ago...
> FWIW, I previously had a go at various schemes using siphash to calculate some
> random bits. I found it to be significantly slower than this prng. I've so far
> taken the view that 6 bits of randomness is not much of a defence against brute
> force so we really shouldn't be spending too many cycles to generate the bits.
> If we can get this approach to work, I think that's best.
Indeed.
A single 32bit CRC using (crc + (crc >> 16)) & 0x3f could be 'good enough'.
Especially if the value is 'perturbed' during (say) context switch.
The '16' might need adjusting for the actual CRC, especially if TAUSWORTHE()
is used - you don't want the value to match one of the shifts it uses.
prandom_u32_state() is defined as:
#define TAUSWORTHE(s, a, b, c, d) ((s & c) << d) ^ (((s << a) ^ s) >> b)
state->s1 = TAUSWORTHE(state->s1, 6U, 13U, 4294967294U, 18U);
state->s2 = TAUSWORTHE(state->s2, 2U, 27U, 4294967288U, 2U);
state->s3 = TAUSWORTHE(state->s3, 13U, 21U, 4294967280U, 7U);
state->s4 = TAUSWORTHE(state->s4, 3U, 12U, 4294967168U, 13U);
return (state->s1 ^ state->s2 ^ state->s3 ^ state->s4);
This is equivalent to:
#define TAUSWORTHE(s, a, b, c, d) ((s & ~c) << d) ^ (s >> a) ^ (s >> b)
state->s1 = TAUSWORTHE(state->s1, 7, 13, 1, 18);
state->s2 = TAUSWORTHE(state->s2, 25, 27, 7, 2);
state->s3 = TAUSWORTHE(state->s3, 8, 21, 15, 7);
state->s4 = TAUSWORTHE(state->s4, 9, 12, 127, 13);
which makes it clear that some low bits of each 's' get discarded reducing
the length of each CRC to (I think) 31, 29, 28 and 25.
Since 'b + d' matches the bits discarded by 'c', two of those shifts are
actually just a rotate, so there isn't really much 'bit stirring' going on.
By comparison CRC-16 (for hdlc comms like x25, isdn and ss7) reduces to:
u32 crc_step(u32 crc, u8 byte_val)
{
u8 t = crc ^ byte_val;
t = (t ^ t << 4);
return crc >> 8 ^ t << 8 ^ t << 3 ^ t >> 4;
}
Much more 'stirring'.
David
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 3/3] randomize_kstack: Unify random source across arches
2026-01-04 23:01 ` David Laight
2026-01-05 11:05 ` Ryan Roberts
@ 2026-01-07 14:05 ` David Laight
2026-01-12 12:26 ` Ryan Roberts
1 sibling, 1 reply; 26+ messages in thread
From: David Laight @ 2026-01-07 14:05 UTC (permalink / raw)
To: Ryan Roberts
Cc: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland,
Jason A. Donenfeld, Ard Biesheuvel, Jeremy Linton, linux-kernel,
linux-arm-kernel, loongarch, linuxppc-dev, linux-riscv,
linux-s390, linux-hardening
On Sun, 4 Jan 2026 23:01:36 +0000
David Laight <david.laight.linux@gmail.com> wrote:
> On Fri, 2 Jan 2026 13:11:54 +0000
> Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> > Previously different architectures were using random sources of
> > differing strength and cost to decide the random kstack offset. A number
> > of architectures (loongarch, powerpc, s390, x86) were using their
> > timestamp counter, at whatever the frequency happened to be. Other
> > arches (arm64, riscv) were using entropy from the crng via
> > get_random_u16().
> >
> > There have been concerns that in some cases the timestamp counters may
> > be too weak, because they can be easily guessed or influenced by user
> > space. And get_random_u16() has been shown to be too costly for the
> > level of protection kstack offset randomization provides.
> >
> > So let's use a common, architecture-agnostic source of entropy; a
> > per-cpu prng, seeded at boot-time from the crng. This has a few
> > benefits:
> >
> > - We can remove choose_random_kstack_offset(); That was only there to
> > try to make the timestamp counter value a bit harder to influence
> > from user space.
> >
> > - The architecture code is simplified. All it has to do now is call
> > add_random_kstack_offset() in the syscall path.
> >
> > - The strength of the randomness can be reasoned about independently
> > of the architecture.
> >
> > - Arches previously using get_random_u16() now have much faster
> > syscall paths, see below results.
> >
> > There have been some claims that a prng may be less strong than the
> > timestamp counter if not regularly reseeded. But the prng has a period
> > of about 2^113. So as long as the prng state remains secret, it should
> > not be possible to guess. If the prng state can be accessed, we have
> > bigger problems.
>
> If you have 128 bits of output from consecutive outputs I think you
> can trivially determine the full state using (almost) 'school boy' maths
> that could be done on pencil and paper.
> (Most of the work only has to be done once.)
>
> The underlying problem is that the TAUSWORTHE() transformation is 'linear'
> So that TAUSWORTHE(x ^ y) == TAUSWORTHE(x) ^ TAUSWORTHE(y).
> (This is true of an LFSR/CRC and TAUSWORTHE() is doing some subset of CRCs.)
> This means that each output bit is the 'xor' of some of the input bits.
> The four new 'state' values are just xors of the bits of the old ones.
> The final xor of the four states gives a 32bit value with each bit just
> an xor of some of the 128 state bits.
> Get four consecutive 32 bit values and you can solve the 128 simultaneous
> equations (by trivial substitution) and get the initial state.
> The solution gives you the 128 128bit constants for:
> u128 state = 0;
> u128 val = 'value returned from 4 calls';
> for (int i = 0; i < 128; i++)
> state |= parity(const128[i] ^ val) << i;
> You don't need all 32bits, just accumulate 128 bits.
> So if you can get the 5bit stack offset from 26 system calls you know the
> value that will be used for all the subsequent calls.
Some of the state bits don't get used, so you only need 123 bits.
The stack offset is 6 bits - so you need the values from 19 calls.
> Simply changing the final line to use + not ^ makes the output non-linear
> and solving the equations a lot harder.
>
> I might sit down tomorrow and see if I can actually code it...
Finally done:
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
typedef unsigned int u32;
typedef unsigned long long u64;
typedef unsigned __int128 u128;
struct rnd_state { u32 s1; u32 s2; u32 s3; u32 s4; };
u32 prandom_u32_state(struct rnd_state *state)
{
#define TAUSWORTHE(s, a, b, c, d) ((s & c) << d) ^ (((s << a) ^ s) >> b)
state->s1 = TAUSWORTHE(state->s1, 6U, 13U, 4294967294U, 18U);
state->s2 = TAUSWORTHE(state->s2, 2U, 27U, 4294967288U, 2U);
state->s3 = TAUSWORTHE(state->s3, 13U, 21U, 4294967280U, 7U);
state->s4 = TAUSWORTHE(state->s4, 3U, 12U, 4294967168U, 13U);
return (state->s1 ^ state->s2 ^ state->s3 ^ state->s4);
}
#define X(n, hi, lo) [n] = (u128)0x##hi << 64 | 0x##lo
u128 map[128] = {
X( 1, 23acb122e4a76, e206c3f6fe435cb6),
...
X(127, 00d3276d8a76a, e560d1975675be24) };
u128 parity_128(u128 v)
{
return __builtin_parityll(v) ^ __builtin_parityll(v >> 64);
}
int main(int argc, char **argv)
{
struct rnd_state s = {};
u128 s0, v, r = 0;
read(open("/dev/urandom", O_RDONLY), &s, sizeof s);
// Remove low bits that get masked by the (s & c) term.
s.s1 &= ~1; s.s2 &= ~7; s.s3 &= ~15; s.s4 &= ~127;
s0 = (((u128)s.s4 << 32 | s.s3) << 32 | s.s2) << 32 | s.s1;
v = prandom_u32_state(&s);
v |= (u128)prandom_u32_state(&s) << 32;
v |= (u128)prandom_u32_state(&s) << 64;
v |= (u128)prandom_u32_state(&s) << 96;
for (int n = 0; n < 128; n++)
r |= parity_128(v & map[n]) << n;
printf("%016llx%016llx\n", (u64)(s0 >> 64), (u64)s0);
printf("values%s match\n", r == s0 ? "" : " do not");
return r != s0;
}
I've trimmed the initialiser - it is very boring.
The code to create the initialiser is actually slightly smaller than it is.
Doable by hand provided you can do 128bit shift and xor without making
any mistakes.
I've just done a quick search through the kernel sources and haven't found
many uses of prandom_u32_state() outside of test code.
There is sched_rng() which uses a per-cpu rng to throw a 1024 sized die.
bpf also has a per-cpu one for 'unprivileged user space'.
net/sched/sch_netem.c seems to use one - mostly for packet loss generation.
Since the randomize_kstack code is now using a per-task rng (initialised
by clone?) that could be used instead of all the others provided they
are run when 'current' is valid.
But the existing prandom_u32_state() needs a big health warning that
four outputs leak the entire state.
That is fixable by changing the last line to:
return state->s1 + state->s2 + state->s3 + state->s4;
That only affects the output value, the period is unchanged.
David
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 3/3] randomize_kstack: Unify random source across arches
2026-01-07 14:05 ` David Laight
@ 2026-01-12 12:26 ` Ryan Roberts
2026-01-12 13:36 ` David Laight
0 siblings, 1 reply; 26+ messages in thread
From: Ryan Roberts @ 2026-01-12 12:26 UTC (permalink / raw)
To: David Laight
Cc: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland,
Jason A. Donenfeld, Ard Biesheuvel, Jeremy Linton, linux-kernel,
linux-arm-kernel, loongarch, linuxppc-dev, linux-riscv,
linux-s390, linux-hardening
On 07/01/2026 14:05, David Laight wrote:
> On Sun, 4 Jan 2026 23:01:36 +0000
> David Laight <david.laight.linux@gmail.com> wrote:
>
>> On Fri, 2 Jan 2026 13:11:54 +0000
>> Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>>> Previously different architectures were using random sources of
>>> differing strength and cost to decide the random kstack offset. A number
>>> of architectures (loongarch, powerpc, s390, x86) were using their
>>> timestamp counter, at whatever the frequency happened to be. Other
>>> arches (arm64, riscv) were using entropy from the crng via
>>> get_random_u16().
>>>
>>> There have been concerns that in some cases the timestamp counters may
>>> be too weak, because they can be easily guessed or influenced by user
>>> space. And get_random_u16() has been shown to be too costly for the
>>> level of protection kstack offset randomization provides.
>>>
>>> So let's use a common, architecture-agnostic source of entropy; a
>>> per-cpu prng, seeded at boot-time from the crng. This has a few
>>> benefits:
>>>
>>> - We can remove choose_random_kstack_offset(); That was only there to
>>> try to make the timestamp counter value a bit harder to influence
>>> from user space.
>>>
>>> - The architecture code is simplified. All it has to do now is call
>>> add_random_kstack_offset() in the syscall path.
>>>
>>> - The strength of the randomness can be reasoned about independently
>>> of the architecture.
>>>
>>> - Arches previously using get_random_u16() now have much faster
>>> syscall paths, see below results.
>>>
>>> There have been some claims that a prng may be less strong than the
>>> timestamp counter if not regularly reseeded. But the prng has a period
>>> of about 2^113. So as long as the prng state remains secret, it should
>>> not be possible to guess. If the prng state can be accessed, we have
>>> bigger problems.
>>
>> If you have 128 bits of output from consecutive outputs I think you
>> can trivially determine the full state using (almost) 'school boy' maths
>> that could be done on pencil and paper.
>> (Most of the work only has to be done once.)
>>
>> The underlying problem is that the TAUSWORTHE() transformation is 'linear'
>> So that TAUSWORTHE(x ^ y) == TAUSWORTHE(x) ^ TAUSWORTHE(y).
>> (This is true of an LFSR/CRC and TAUSWORTHE() is doing some subset of CRCs.)
>> This means that each output bit is the 'xor' of some of the input bits.
>> The four new 'state' values are just xors of the bits of the old ones.
>> The final xor of the four states gives a 32bit value with each bit just
>> an xor of some of the 128 state bits.
>> Get four consecutive 32 bit values and you can solve the 128 simultaneous
>> equations (by trivial substitution) and get the initial state.
>> The solution gives you the 128 128bit constants for:
>> u128 state = 0;
>> u128 val = 'value returned from 4 calls';
>> for (int i = 0; i < 128; i++)
>> state |= parity(const128[i] ^ val) << i;
>> You don't need all 32bits, just accumulate 128 bits.
>> So if you can get the 5bit stack offset from 26 system calls you know the
>> value that will be used for all the subsequent calls.
>
> Some of the state bits don't get used, so you only need 123 bits.
> The stack offset is 6 bits - so you need the values from 19 calls.
>
>> Simply changing the final line to use + not ^ makes the output non-linear
>> and solving the equations a lot harder.
>>
>> I might sit down tomorrow and see if I can actually code it...
>
> Finally done:
>
> #include <stdio.h>
> #include <unistd.h>
> #include <fcntl.h>
>
> typedef unsigned int u32;
> typedef unsigned long long u64;
> typedef unsigned __int128 u128;
>
> struct rnd_state { u32 s1; u32 s2; u32 s3; u32 s4; };
> u32 prandom_u32_state(struct rnd_state *state)
> {
> #define TAUSWORTHE(s, a, b, c, d) ((s & c) << d) ^ (((s << a) ^ s) >> b)
> state->s1 = TAUSWORTHE(state->s1, 6U, 13U, 4294967294U, 18U);
> state->s2 = TAUSWORTHE(state->s2, 2U, 27U, 4294967288U, 2U);
> state->s3 = TAUSWORTHE(state->s3, 13U, 21U, 4294967280U, 7U);
> state->s4 = TAUSWORTHE(state->s4, 3U, 12U, 4294967168U, 13U);
>
> return (state->s1 ^ state->s2 ^ state->s3 ^ state->s4);
> }
>
> #define X(n, hi, lo) [n] = (u128)0x##hi << 64 | 0x##lo
> u128 map[128] = {
> X( 1, 23acb122e4a76, e206c3f6fe435cb6),
> ...
> X(127, 00d3276d8a76a, e560d1975675be24) };
>
> u128 parity_128(u128 v)
> {
> return __builtin_parityll(v) ^ __builtin_parityll(v >> 64);
> }
>
> int main(int argc, char **argv)
> {
> struct rnd_state s = {};
> u128 s0, v, r = 0;
>
> read(open("/dev/urandom", O_RDONLY), &s, sizeof s);
> // Remove low bits that get masked by the (s & c) term.
> s.s1 &= ~1; s.s2 &= ~7; s.s3 &= ~15; s.s4 &= ~127;
> s0 = (((u128)s.s4 << 32 | s.s3) << 32 | s.s2) << 32 | s.s1;
> v = prandom_u32_state(&s);
> v |= (u128)prandom_u32_state(&s) << 32;
> v |= (u128)prandom_u32_state(&s) << 64;
> v |= (u128)prandom_u32_state(&s) << 96;
>
> for (int n = 0; n < 128; n++)
> r |= parity_128(v & map[n]) << n;
>
> printf("%016llx%016llx\n", (u64)(s0 >> 64), (u64)s0);
> printf("values%s match\n", r == s0 ? "" : " do not");
>
> return r != s0;
> }
>
> I've trimmed the initialiser - it is very boring.
> The code to create the initialiser is actually slightly smaller than it is.
> Doable by hand provided you can do 128bit shift and xor without making
> any mistakes.
>
> I've just done a quick search through the kernel sources and haven't found
> many uses of prandom_u32_state() outside of test code.
> There is sched_rng() which uses a per-cpu rng to throw a 1024 sized die.
> bpf also has a per-cpu one for 'unprivileged user space'.
> net/sched/sch_netem.c seems to use one - mostly for packet loss generation.
>
> Since the randomize_kstack code is now using a per-task rng (initialised
> by clone?) that could be used instead of all the others provided they
> are run when 'current' is valid.
>
> But the existing prandom_u32_state() needs a big health warning that
> four outputs leak the entire state.
> That is fixable by changing the last line to:
> return state->s1 + state->s2 + state->s3 + state->s4;
> That only affects the output value, the period is unchanged.
Hi David,
This all seems interesting, but I'm not clear that it is a blocker for this
series. As I keep saying, we only use 6 bits for offset randomization so it is
trivial to brute force, regardless of how easy it is to recover the prng state.
Perhaps we can decouple these 2 things and make them independent:
- this series, which is motivated by speeding up syscalls on arm64; given 6
bits is not hard to brute force, spending a lot of cycles calculating those
bits is unjustified.
- Your observation that the current prng could be improved to make
recovering its state harder.
What do you think?
Thanks,
Ryan
>
> David
>
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 3/3] randomize_kstack: Unify random source across arches
2026-01-12 12:26 ` Ryan Roberts
@ 2026-01-12 13:36 ` David Laight
0 siblings, 0 replies; 26+ messages in thread
From: David Laight @ 2026-01-12 13:36 UTC (permalink / raw)
To: Ryan Roberts
Cc: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland,
Jason A. Donenfeld, Ard Biesheuvel, Jeremy Linton, linux-kernel,
linux-arm-kernel, loongarch, linuxppc-dev, linux-riscv,
linux-s390, linux-hardening
On Mon, 12 Jan 2026 12:26:26 +0000
Ryan Roberts <ryan.roberts@arm.com> wrote:
> On 07/01/2026 14:05, David Laight wrote:
> > On Sun, 4 Jan 2026 23:01:36 +0000
> > David Laight <david.laight.linux@gmail.com> wrote:
...
> > I've trimmed the initialiser - it is very boring.
> > The code to create the initialiser is actually slightly smaller than it is.
> > Doable by hand provided you can do 128bit shift and xor without making
> > any mistakes.
> >
> > I've just done a quick search through the kernel sources and haven't found
> > many uses of prandom_u32_state() outside of test code.
> > There is sched_rng() which uses a per-cpu rng to throw a 1024 sized die.
> > bpf also has a per-cpu one for 'unprivileged user space'.
> > net/sched/sch_netem.c seems to use one - mostly for packet loss generation.
> >
> > Since the randomize_kstack code is now using a per-task rng (initialised
> > by clone?) that could be used instead of all the others provided they
> > are run when 'current' is valid.
> >
> > But the existing prandom_u32_state() needs a big health warning that
> > four outputs leak the entire state.
> > That is fixable by changing the last line to:
> > return state->s1 + state->s2 + state->s3 + state->s4;
> > That only affects the output value, the period is unchanged.
>
> Hi David,
>
> This all seems interesting, but I'm not clear that it is a blocker for this
> series. As I keep saying, we only use 6 bits for offset randomization so it is
> trivial to brute force, regardless of how easy it is to recover the prng state.
>
> Perhaps we can decouple these 2 things and make them independent:
>
> - this series, which is motivated by speeding up syscalls on arm64; given 6
> bits is not hard to brute force, spending a lot of cycles calculating those
> bits is unjustified.
>
> - Your observation that the current prng could be improved to make
> recovering its state harder.
>
> What do you think?
They are separate.
I should have a 'mostly written' patch series for prandom_u32_state().
If you unconditionally add a per-task prng there are a few places that could
use it instead of a per-cpu one.
It could be 'perturbed' during task switch - eg by:
s->s1 = (s->s1 ^ something) | 2;
(The 2 stops the new value being 0 or 1; losing 1 bit won't be significant.)
This one is much nearer 'ready' and has an obvious impact.
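(For the perturbation above, a minimal sketch - the helper name, the hook
point and the 'entropy' argument are all illustrative, nothing here is from
a posted patch:)

/* Sketch only: stir one lane of a per-task prng, e.g. at context switch.
 * 'entropy' is any cheap, hard-to-observe per-switch value. */
static inline void prandom_state_perturb(struct rnd_state *s, u32 entropy)
{
	/* The '| 2' keeps s1 away from the degenerate 0/1 values. */
	s->s1 = (s->s1 ^ entropy) | 2;
}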
David
>
> Thanks,
> Ryan
>
>
> >
> > David
> >
> >
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 1/3] randomize_kstack: Maintain kstack_offset per task
2026-01-02 13:11 ` [PATCH v3 1/3] randomize_kstack: Maintain kstack_offset per task Ryan Roberts
2026-01-02 22:44 ` David Laight
@ 2026-01-19 10:23 ` Mark Rutland
1 sibling, 0 replies; 26+ messages in thread
From: Mark Rutland @ 2026-01-19 10:23 UTC (permalink / raw)
To: Ryan Roberts
Cc: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Jason A. Donenfeld,
Ard Biesheuvel, Jeremy Linton, linux-kernel, linux-arm-kernel,
loongarch, linuxppc-dev, linux-riscv, linux-s390, linux-hardening,
stable
On Fri, Jan 02, 2026 at 01:11:52PM +0000, Ryan Roberts wrote:
> kstack_offset was previously maintained per-cpu, but this caused a
> couple of issues. So let's instead make it per-task.
>
> Issue 1: add_random_kstack_offset() and choose_random_kstack_offset()
> were expected and required to be called with interrupts and preemption
> disabled so that they could manipulate per-cpu state. But arm64, loongarch
> and risc-v are calling them with interrupts and preemption enabled. I
> don't _think_ this causes any functional issues, but it's certainly
> unexpected and could lead to manipulating the wrong cpu's state, which
> could cause a minor performance degradation due to bouncing the cache
> lines. By maintaining the state per-task those functions can safely be
> called in preemptible context.
>
> Issue 2: add_random_kstack_offset() is called before executing the
> syscall and expands the stack using a previously chosen random offset.
> choose_random_kstack_offset() is called after executing the syscall and
> chooses and stores a new random offset for the next syscall. With
> per-cpu storage for this offset, an attacker could force cpu migration
> during the execution of the syscall and prevent the offset from being
> updated for the original cpu such that it is predictable for the next
> syscall on that cpu. By maintaining the state per-task, this problem
> goes away because the per-task random offset is updated after the
> syscall regardless of which cpu it is executing on.
>
> Fixes: 39218ff4c625 ("stack: Optionally randomize kernel stack offset each syscall")
> Closes: https://lore.kernel.org/all/dd8c37bc-795f-4c7a-9086-69e584d8ab24@arm.com/
> Cc: stable@vger.kernel.org
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Mark.
> ---
> include/linux/randomize_kstack.h | 26 +++++++++++++++-----------
> include/linux/sched.h | 4 ++++
> init/main.c | 1 -
> kernel/fork.c | 2 ++
> 4 files changed, 21 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/randomize_kstack.h b/include/linux/randomize_kstack.h
> index 1d982dbdd0d0..5d3916ca747c 100644
> --- a/include/linux/randomize_kstack.h
> +++ b/include/linux/randomize_kstack.h
> @@ -9,7 +9,6 @@
>
> DECLARE_STATIC_KEY_MAYBE(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
> randomize_kstack_offset);
> -DECLARE_PER_CPU(u32, kstack_offset);
>
> /*
> * Do not use this anywhere else in the kernel. This is used here because
> @@ -50,15 +49,14 @@ DECLARE_PER_CPU(u32, kstack_offset);
> * add_random_kstack_offset - Increase stack utilization by previously
> * chosen random offset
> *
> - * This should be used in the syscall entry path when interrupts and
> - * preempt are disabled, and after user registers have been stored to
> - * the stack. For testing the resulting entropy, please see:
> - * tools/testing/selftests/lkdtm/stack-entropy.sh
> + * This should be used in the syscall entry path after user registers have been
> + * stored to the stack. Preemption may be enabled. For testing the resulting
> + * entropy, please see: tools/testing/selftests/lkdtm/stack-entropy.sh
> */
> #define add_random_kstack_offset() do { \
> if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
> &randomize_kstack_offset)) { \
> - u32 offset = raw_cpu_read(kstack_offset); \
> + u32 offset = current->kstack_offset; \
> u8 *ptr = __kstack_alloca(KSTACK_OFFSET_MAX(offset)); \
> /* Keep allocation even after "ptr" loses scope. */ \
> asm volatile("" :: "r"(ptr) : "memory"); \
> @@ -69,9 +67,9 @@ DECLARE_PER_CPU(u32, kstack_offset);
> * choose_random_kstack_offset - Choose the random offset for the next
> * add_random_kstack_offset()
> *
> - * This should only be used during syscall exit when interrupts and
> - * preempt are disabled. This position in the syscall flow is done to
> - * frustrate attacks from userspace attempting to learn the next offset:
> + * This should only be used during syscall exit. Preemption may be enabled. This
> + * position in the syscall flow is done to frustrate attacks from userspace
> + * attempting to learn the next offset:
> * - Maximize the timing uncertainty visible from userspace: if the
> * offset is chosen at syscall entry, userspace has much more control
> * over the timing between choosing offsets. "How long will we be in
> @@ -85,14 +83,20 @@ DECLARE_PER_CPU(u32, kstack_offset);
> #define choose_random_kstack_offset(rand) do { \
> if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
> &randomize_kstack_offset)) { \
> - u32 offset = raw_cpu_read(kstack_offset); \
> + u32 offset = current->kstack_offset; \
> offset = ror32(offset, 5) ^ (rand); \
> - raw_cpu_write(kstack_offset, offset); \
> + current->kstack_offset = offset; \
> } \
> } while (0)
> +
> +static inline void random_kstack_task_init(struct task_struct *tsk)
> +{
> + tsk->kstack_offset = 0;
> +}
> #else /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
> #define add_random_kstack_offset() do { } while (0)
> #define choose_random_kstack_offset(rand) do { } while (0)
> +#define random_kstack_task_init(tsk) do { } while (0)
> #endif /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
>
> #endif
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d395f2810fac..9e0080ed1484 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1591,6 +1591,10 @@ struct task_struct {
> unsigned long prev_lowest_stack;
> #endif
>
> +#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
> + u32 kstack_offset;
> +#endif
> +
> #ifdef CONFIG_X86_MCE
> void __user *mce_vaddr;
> __u64 mce_kflags;
> diff --git a/init/main.c b/init/main.c
> index b84818ad9685..27fcbbde933e 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -830,7 +830,6 @@ static inline void initcall_debug_enable(void)
> #ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
> DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
> randomize_kstack_offset);
> -DEFINE_PER_CPU(u32, kstack_offset);
>
> static int __init early_randomize_kstack_offset(char *buf)
> {
> diff --git a/kernel/fork.c b/kernel/fork.c
> index b1f3915d5f8e..b061e1edbc43 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -95,6 +95,7 @@
> #include <linux/thread_info.h>
> #include <linux/kstack_erase.h>
> #include <linux/kasan.h>
> +#include <linux/randomize_kstack.h>
> #include <linux/scs.h>
> #include <linux/io_uring.h>
> #include <linux/bpf.h>
> @@ -2231,6 +2232,7 @@ __latent_entropy struct task_struct *copy_process(
> if (retval)
> goto bad_fork_cleanup_io;
>
> + random_kstack_task_init(p);
> stackleak_task_init(p);
>
> if (pid != &init_struct_pid) {
> --
> 2.43.0
>
* Re: [PATCH v3 2/3] prandom: Convert prandom_u32_state() to __always_inline
2026-01-02 13:11 ` [PATCH v3 2/3] prandom: Convert prandom_u32_state() to __always_inline Ryan Roberts
2026-01-02 13:39 ` Jason A. Donenfeld
@ 2026-01-19 10:26 ` Mark Rutland
1 sibling, 0 replies; 26+ messages in thread
From: Mark Rutland @ 2026-01-19 10:26 UTC (permalink / raw)
To: Ryan Roberts
Cc: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Jason A. Donenfeld,
Ard Biesheuvel, Jeremy Linton, linux-kernel, linux-arm-kernel,
loongarch, linuxppc-dev, linux-riscv, linux-s390, linux-hardening
On Fri, Jan 02, 2026 at 01:11:53PM +0000, Ryan Roberts wrote:
> We will shortly use prandom_u32_state() to implement kstack offset
> randomization and some arches need to call it from non-instrumentable
> context. Given the function is just a handful of operations and doesn't
> call out to any other functions, let's take the easy path and make it
> __always_inline.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
I see there were some comments about keeping an out-of-line wrapper.
With or without that, this looks good to me, and either way:
Acked-by: Mark Rutland <mark.rutland@arm.com>
Mark.
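
For anyone skimming the archive: the "out-of-line wrapper" being discussed
would be roughly the shape below. This is a hypothetical sketch, not code
from this series, and the _outline name is made up; the idea is simply to
keep the __always_inline copy in the header for noinstr callers while still
offering one shared, exported copy so that ordinary callers don't each have
to inline the generator.

u32 prandom_u32_state_outline(struct rnd_state *state)
{
	/* Delegate to the header's __always_inline implementation. */
	return prandom_u32_state(state);
}
EXPORT_SYMBOL(prandom_u32_state_outline);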
> ---
> include/linux/prandom.h | 19 ++++++++++++++++++-
> lib/random32.c | 19 -------------------
> 2 files changed, 18 insertions(+), 20 deletions(-)
>
> diff --git a/include/linux/prandom.h b/include/linux/prandom.h
> index ff7dcc3fa105..e797b3709f5c 100644
> --- a/include/linux/prandom.h
> +++ b/include/linux/prandom.h
> @@ -17,7 +17,24 @@ struct rnd_state {
> __u32 s1, s2, s3, s4;
> };
>
> -u32 prandom_u32_state(struct rnd_state *state);
> +/**
> + * prandom_u32_state - seeded pseudo-random number generator.
> + * @state: pointer to state structure holding seeded state.
> + *
> + * This is used for pseudo-randomness with no outside seeding.
> + * For more random results, use get_random_u32().
> + */
> +static __always_inline u32 prandom_u32_state(struct rnd_state *state)
> +{
> +#define TAUSWORTHE(s, a, b, c, d) ((s & c) << d) ^ (((s << a) ^ s) >> b)
> + state->s1 = TAUSWORTHE(state->s1, 6U, 13U, 4294967294U, 18U);
> + state->s2 = TAUSWORTHE(state->s2, 2U, 27U, 4294967288U, 2U);
> + state->s3 = TAUSWORTHE(state->s3, 13U, 21U, 4294967280U, 7U);
> + state->s4 = TAUSWORTHE(state->s4, 3U, 12U, 4294967168U, 13U);
> +
> + return (state->s1 ^ state->s2 ^ state->s3 ^ state->s4);
> +}
> +
> void prandom_bytes_state(struct rnd_state *state, void *buf, size_t nbytes);
> void prandom_seed_full_state(struct rnd_state __percpu *pcpu_state);
>
> diff --git a/lib/random32.c b/lib/random32.c
> index 24e7acd9343f..d57baf489d4a 100644
> --- a/lib/random32.c
> +++ b/lib/random32.c
> @@ -42,25 +42,6 @@
> #include <linux/slab.h>
> #include <linux/unaligned.h>
>
> -/**
> - * prandom_u32_state - seeded pseudo-random number generator.
> - * @state: pointer to state structure holding seeded state.
> - *
> - * This is used for pseudo-randomness with no outside seeding.
> - * For more random results, use get_random_u32().
> - */
> -u32 prandom_u32_state(struct rnd_state *state)
> -{
> -#define TAUSWORTHE(s, a, b, c, d) ((s & c) << d) ^ (((s << a) ^ s) >> b)
> - state->s1 = TAUSWORTHE(state->s1, 6U, 13U, 4294967294U, 18U);
> - state->s2 = TAUSWORTHE(state->s2, 2U, 27U, 4294967288U, 2U);
> - state->s3 = TAUSWORTHE(state->s3, 13U, 21U, 4294967280U, 7U);
> - state->s4 = TAUSWORTHE(state->s4, 3U, 12U, 4294967168U, 13U);
> -
> - return (state->s1 ^ state->s2 ^ state->s3 ^ state->s4);
> -}
> -EXPORT_SYMBOL(prandom_u32_state);
> -
> /**
> * prandom_bytes_state - get the requested number of pseudo-random bytes
> *
> --
> 2.43.0
>
* Re: [PATCH v3 3/3] randomize_kstack: Unify random source across arches
2026-01-02 13:11 ` [PATCH v3 3/3] randomize_kstack: Unify random source across arches Ryan Roberts
2026-01-04 23:01 ` David Laight
@ 2026-01-19 10:48 ` Mark Rutland
1 sibling, 0 replies; 26+ messages in thread
From: Mark Rutland @ 2026-01-19 10:48 UTC (permalink / raw)
To: Ryan Roberts
Cc: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Jason A. Donenfeld,
Ard Biesheuvel, Jeremy Linton, linux-kernel, linux-arm-kernel,
loongarch, linuxppc-dev, linux-riscv, linux-s390, linux-hardening
Hi Ryan,
I have a couple of comments below, but those are largely for posterity.
On Fri, Jan 02, 2026 at 01:11:54PM +0000, Ryan Roberts wrote:
> Previously different architectures were using random sources of
> differing strength and cost to decide the random kstack offset. A number
> of architectures (loongarch, powerpc, s390, x86) were using their
> timestamp counter, at whatever the frequency happened to be. Other
> arches (arm64, riscv) were using entropy from the crng via
> get_random_u16().
>
> There have been concerns that in some cases the timestamp counters may
> be too weak, because they can be easily guessed or influenced by user
> space. And get_random_u16() has been shown to be too costly for the
> level of protection kstack offset randomization provides.
>
> So let's use a common, architecture-agnostic source of entropy; a
> per-cpu prng, seeded at boot-time from the crng. This has a few
> benefits:
>
> - We can remove choose_random_kstack_offset(); That was only there to
> try to make the timestamp counter value a bit harder to influence
> from user space.
It *might* be worth mentioning that this gets rid of some redundant work
on s390 and x86. Before this patch, those architectures called
choose_random_kstack_offset() under arch_exit_to_user_mode_prepare(),
which is also called for exception returns to userspace that are *not*
syscalls (e.g. regular interrupts). Getting rid of
choose_random_kstack_offset() avoids a small amount of redundant work
for those non-syscall cases.
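
For context, the generic entry code runs that hook on every return to
userspace, not only on syscall exit. A simplified sketch (paraphrased from
kernel/entry/common.c; exact details vary by kernel version):

static void exit_to_user_mode_prepare(struct pt_regs *regs)
{
	unsigned long ti_work = read_thread_flags();

	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
		ti_work = exit_to_user_mode_loop(regs, ti_work);

	/* Called for syscall returns *and* interrupt/exception returns. */
	arch_exit_to_user_mode_prepare(regs, ti_work);
}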
> - The architecture code is simplified. All it has to do now is call
> add_random_kstack_offset() in the syscall path.
>
> - The strength of the randomness can be reasoned about independently
> of the architecture.
>
> - Arches previously using get_random_u16() now have much faster
> syscall paths, see below results.
>
> There have been some claims that a prng may be less strong than the
> timestamp counter if not regularly reseeded. But the prng has a period
> of about 2^113. So as long as the prng state remains secret, it should
> not be possible to guess. If the prng state can be accessed, we have
> bigger problems.
>
> Additionally, we are only consuming 6 bits to randomize the stack, so
> there are only 64 possible random offsets. I assert that it would be
> trivial for an attacker to brute force by repeating their attack and
> waiting for the random stack offset to be the desired one. The prng
> approach seems entirely proportional to this level of protection.
FWIW, I agree with all of the above rationale.
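
As a quick sanity check of the 6-bit figure (taking the 0b1111111100 mask
variant from randomize_kstack.h as the example; other configurations use a
different mask):

/*
 * KSTACK_OFFSET_MAX(x) = (x) & 0b1111111100 keeps bits [9:2] of the raw
 * value; 16-byte stack alignment then zeroes bits [3:0] of the resulting
 * SP, so the surviving randomness is SP[9:4]: 6 bits, i.e. 2^6 = 64
 * possible offsets.
 */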
> Performance data are provided below. The baseline is v6.18 with rndstack
> on for each respective arch. (I)/(R) indicate statistically significant
> improvement/regression. arm64 platform is AWS Graviton3 (m7g.metal).
> x86_64 platform is AWS Sapphire Rapids (m7i.24xlarge):
>
> +-----------------+--------------+---------------+---------------+
> | Benchmark | Result Class | per-task-prng | per-task-prng |
> | | | arm64 (metal) | x86_64 (VM) |
> +=================+==============+===============+===============+
> | syscall/getpid | mean (ns) | (I) -9.50% | (I) -17.65% |
> | | p99 (ns) | (I) -59.24% | (I) -24.41% |
> | | p99.9 (ns) | (I) -59.52% | (I) -28.52% |
> +-----------------+--------------+---------------+---------------+
> | syscall/getppid | mean (ns) | (I) -9.52% | (I) -19.24% |
> | | p99 (ns) | (I) -59.25% | (I) -25.03% |
> | | p99.9 (ns) | (I) -59.50% | (I) -28.17% |
> +-----------------+--------------+---------------+---------------+
> | syscall/invalid | mean (ns) | (I) -10.31% | (I) -18.56% |
> | | p99 (ns) | (I) -60.79% | (I) -20.06% |
> | | p99.9 (ns) | (I) -61.04% | (I) -25.04% |
> +-----------------+--------------+---------------+---------------+
>
> I tested an earlier version of this change on x86 bare metal and it
> showed a smaller but still significant improvement. The bare metal
> system wasn't available this time around so testing was done in a VM
> instance. I'm guessing the cost of rdtsc is higher for VMs.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Mark.
> ---
> arch/Kconfig | 5 ++-
> arch/arm64/kernel/syscall.c | 11 ------
> arch/loongarch/kernel/syscall.c | 11 ------
> arch/powerpc/kernel/syscall.c | 12 -------
> arch/riscv/kernel/traps.c | 12 -------
> arch/s390/include/asm/entry-common.h | 8 -----
> arch/x86/include/asm/entry-common.h | 12 -------
> include/linux/randomize_kstack.h | 52 +++++++++-------------------
> include/linux/sched.h | 4 ---
> init/main.c | 8 +++++
> kernel/fork.c | 1 -
> 11 files changed, 27 insertions(+), 109 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 31220f512b16..8591fe7b4ac1 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -1516,9 +1516,8 @@ config HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
> def_bool n
> help
> An arch should select this symbol if it can support kernel stack
> - offset randomization with calls to add_random_kstack_offset()
> - during syscall entry and choose_random_kstack_offset() during
> - syscall exit. Careful removal of -fstack-protector-strong and
> + offset randomization with a call to add_random_kstack_offset()
> + during syscall entry. Careful removal of -fstack-protector-strong and
> -fstack-protector should also be applied to the entry code and
> closely examined, as the artificial stack bump looks like an array
> to the compiler, so it will attempt to add canary checks regardless
> diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
> index c062badd1a56..358ddfbf1401 100644
> --- a/arch/arm64/kernel/syscall.c
> +++ b/arch/arm64/kernel/syscall.c
> @@ -52,17 +52,6 @@ static void invoke_syscall(struct pt_regs *regs, unsigned int scno,
> }
>
> syscall_set_return_value(current, regs, 0, ret);
> -
> - /*
> - * This value will get limited by KSTACK_OFFSET_MAX(), which is 10
> - * bits. The actual entropy will be further reduced by the compiler
> - * when applying stack alignment constraints: the AAPCS mandates a
> - * 16-byte aligned SP at function boundaries, which will remove the
> - * 4 low bits from any entropy chosen here.
> - *
> - * The resulting 6 bits of entropy is seen in SP[9:4].
> - */
> - choose_random_kstack_offset(get_random_u16());
> }
>
> static inline bool has_syscall_work(unsigned long flags)
> diff --git a/arch/loongarch/kernel/syscall.c b/arch/loongarch/kernel/syscall.c
> index 1249d82c1cd0..85da7e050d97 100644
> --- a/arch/loongarch/kernel/syscall.c
> +++ b/arch/loongarch/kernel/syscall.c
> @@ -79,16 +79,5 @@ void noinstr __no_stack_protector do_syscall(struct pt_regs *regs)
> regs->regs[7], regs->regs[8], regs->regs[9]);
> }
>
> - /*
> - * This value will get limited by KSTACK_OFFSET_MAX(), which is 10
> - * bits. The actual entropy will be further reduced by the compiler
> - * when applying stack alignment constraints: 16-bytes (i.e. 4-bits)
> - * aligned, which will remove the 4 low bits from any entropy chosen
> - * here.
> - *
> - * The resulting 6 bits of entropy is seen in SP[9:4].
> - */
> - choose_random_kstack_offset(get_cycles());
> -
> syscall_exit_to_user_mode(regs);
> }
> diff --git a/arch/powerpc/kernel/syscall.c b/arch/powerpc/kernel/syscall.c
> index be159ad4b77b..b3d8b0f9823b 100644
> --- a/arch/powerpc/kernel/syscall.c
> +++ b/arch/powerpc/kernel/syscall.c
> @@ -173,17 +173,5 @@ notrace long system_call_exception(struct pt_regs *regs, unsigned long r0)
> }
> #endif
>
> - /*
> - * Ultimately, this value will get limited by KSTACK_OFFSET_MAX(),
> - * so the maximum stack offset is 1k bytes (10 bits).
> - *
> - * The actual entropy will be further reduced by the compiler when
> - * applying stack alignment constraints: the powerpc architecture
> - * may have two kinds of stack alignment (16-bytes and 8-bytes).
> - *
> - * So the resulting 6 or 7 bits of entropy is seen in SP[9:4] or SP[9:3].
> - */
> - choose_random_kstack_offset(mftb());
> -
> return ret;
> }
> diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
> index 80230de167de..79b285bdfd1a 100644
> --- a/arch/riscv/kernel/traps.c
> +++ b/arch/riscv/kernel/traps.c
> @@ -342,18 +342,6 @@ void do_trap_ecall_u(struct pt_regs *regs)
> if (syscall >= 0 && syscall < NR_syscalls)
> syscall_handler(regs, syscall);
>
> - /*
> - * Ultimately, this value will get limited by KSTACK_OFFSET_MAX(),
> - * so the maximum stack offset is 1k bytes (10 bits).
> - *
> - * The actual entropy will be further reduced by the compiler when
> - * applying stack alignment constraints: 16-byte (i.e. 4-bit) aligned
> - * for RV32I or RV64I.
> - *
> - * The resulting 6 bits of entropy is seen in SP[9:4].
> - */
> - choose_random_kstack_offset(get_random_u16());
> -
> syscall_exit_to_user_mode(regs);
> } else {
> irqentry_state_t state = irqentry_nmi_enter(regs);
> diff --git a/arch/s390/include/asm/entry-common.h b/arch/s390/include/asm/entry-common.h
> index 979af986a8fe..35450a485323 100644
> --- a/arch/s390/include/asm/entry-common.h
> +++ b/arch/s390/include/asm/entry-common.h
> @@ -51,14 +51,6 @@ static __always_inline void arch_exit_to_user_mode(void)
>
> #define arch_exit_to_user_mode arch_exit_to_user_mode
>
> -static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
> - unsigned long ti_work)
> -{
> - choose_random_kstack_offset(get_tod_clock_fast());
> -}
> -
> -#define arch_exit_to_user_mode_prepare arch_exit_to_user_mode_prepare
> -
> static __always_inline bool arch_in_rcu_eqs(void)
> {
> if (IS_ENABLED(CONFIG_KVM))
> diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
> index ce3eb6d5fdf9..7535131c711b 100644
> --- a/arch/x86/include/asm/entry-common.h
> +++ b/arch/x86/include/asm/entry-common.h
> @@ -82,18 +82,6 @@ static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
> current_thread_info()->status &= ~(TS_COMPAT | TS_I386_REGS_POKED);
> #endif
>
> - /*
> - * This value will get limited by KSTACK_OFFSET_MAX(), which is 10
> - * bits. The actual entropy will be further reduced by the compiler
> - * when applying stack alignment constraints (see cc_stack_align4/8 in
> - * arch/x86/Makefile), which will remove the 3 (x86_64) or 2 (ia32)
> - * low bits from any entropy chosen here.
> - *
> - * Therefore, final stack offset entropy will be 7 (x86_64) or
> - * 8 (ia32) bits.
> - */
> - choose_random_kstack_offset(rdtsc());
> -
> /* Avoid unnecessary reads of 'x86_ibpb_exit_to_user' */
> if (cpu_feature_enabled(X86_FEATURE_IBPB_EXIT_TO_USER) &&
> this_cpu_read(x86_ibpb_exit_to_user)) {
> diff --git a/include/linux/randomize_kstack.h b/include/linux/randomize_kstack.h
> index 5d3916ca747c..024fc20e7762 100644
> --- a/include/linux/randomize_kstack.h
> +++ b/include/linux/randomize_kstack.h
> @@ -6,6 +6,7 @@
> #include <linux/kernel.h>
> #include <linux/jump_label.h>
> #include <linux/percpu-defs.h>
> +#include <linux/prandom.h>
>
> DECLARE_STATIC_KEY_MAYBE(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
> randomize_kstack_offset);
> @@ -45,9 +46,22 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
> #define KSTACK_OFFSET_MAX(x) ((x) & 0b1111111100)
> #endif
>
> +DECLARE_PER_CPU(struct rnd_state, kstack_rnd_state);
> +
> +static __always_inline u32 get_kstack_offset(void)
> +{
> + struct rnd_state *state;
> + u32 rnd;
> +
> + state = &get_cpu_var(kstack_rnd_state);
> + rnd = prandom_u32_state(state);
> + put_cpu_var(kstack_rnd_state);
> +
> + return rnd;
> +}
> +
> /**
> - * add_random_kstack_offset - Increase stack utilization by previously
> - * chosen random offset
> + * add_random_kstack_offset - Increase stack utilization by a random offset.
> *
> * This should be used in the syscall entry path after user registers have been
> * stored to the stack. Preemption may be enabled. For testing the resulting
> @@ -56,47 +70,15 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
> #define add_random_kstack_offset() do { \
> if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
> &randomize_kstack_offset)) { \
> - u32 offset = current->kstack_offset; \
> + u32 offset = get_kstack_offset(); \
> u8 *ptr = __kstack_alloca(KSTACK_OFFSET_MAX(offset)); \
> /* Keep allocation even after "ptr" loses scope. */ \
> asm volatile("" :: "r"(ptr) : "memory"); \
> } \
> } while (0)
>
> -/**
> - * choose_random_kstack_offset - Choose the random offset for the next
> - * add_random_kstack_offset()
> - *
> - * This should only be used during syscall exit. Preemption may be enabled. This
> - * position in the syscall flow is done to frustrate attacks from userspace
> - * attempting to learn the next offset:
> - * - Maximize the timing uncertainty visible from userspace: if the
> - * offset is chosen at syscall entry, userspace has much more control
> - * over the timing between choosing offsets. "How long will we be in
> - * kernel mode?" tends to be more difficult to predict than "how long
> - * will we be in user mode?"
> - * - Reduce the lifetime of the new offset sitting in memory during
> - * kernel mode execution. Exposure of "thread-local" memory content
> - * (e.g. current, percpu, etc) tends to be easier than arbitrary
> - * location memory exposure.
> - */
> -#define choose_random_kstack_offset(rand) do { \
> - if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
> - &randomize_kstack_offset)) { \
> - u32 offset = current->kstack_offset; \
> - offset = ror32(offset, 5) ^ (rand); \
> - current->kstack_offset = offset; \
> - } \
> -} while (0)
> -
> -static inline void random_kstack_task_init(struct task_struct *tsk)
> -{
> - tsk->kstack_offset = 0;
> -}
> #else /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
> #define add_random_kstack_offset() do { } while (0)
> -#define choose_random_kstack_offset(rand) do { } while (0)
> -#define random_kstack_task_init(tsk) do { } while (0)
> #endif /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
>
> #endif
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 9e0080ed1484..d395f2810fac 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1591,10 +1591,6 @@ struct task_struct {
> unsigned long prev_lowest_stack;
> #endif
>
> -#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
> - u32 kstack_offset;
> -#endif
> -
> #ifdef CONFIG_X86_MCE
> void __user *mce_vaddr;
> __u64 mce_kflags;
> diff --git a/init/main.c b/init/main.c
> index 27fcbbde933e..8626e048095a 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -830,6 +830,14 @@ static inline void initcall_debug_enable(void)
> #ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
> DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
> randomize_kstack_offset);
> +DEFINE_PER_CPU(struct rnd_state, kstack_rnd_state);
> +
> +static int __init random_kstack_init(void)
> +{
> + prandom_seed_full_state(&kstack_rnd_state);
> + return 0;
> +}
> +late_initcall(random_kstack_init);
>
> static int __init early_randomize_kstack_offset(char *buf)
> {
> diff --git a/kernel/fork.c b/kernel/fork.c
> index b061e1edbc43..68d9766288fd 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2232,7 +2232,6 @@ __latent_entropy struct task_struct *copy_process(
> if (retval)
> goto bad_fork_cleanup_io;
>
> - random_kstack_task_init(p);
> stackleak_task_init(p);
>
> if (pid != &init_struct_pid) {
> --
> 2.43.0
>
* Re: [PATCH v3 0/3] Fix bugs and performance of kstack offset randomisation
2026-01-02 13:11 [PATCH v3 0/3] Fix bugs and performance of kstack offset randomisation Ryan Roberts
` (2 preceding siblings ...)
2026-01-02 13:11 ` [PATCH v3 3/3] randomize_kstack: Unify random source across arches Ryan Roberts
@ 2026-01-19 10:52 ` Mark Rutland
2026-01-19 12:22 ` David Laight
2026-01-19 12:59 ` Ryan Roberts
3 siblings, 2 replies; 26+ messages in thread
From: Mark Rutland @ 2026-01-19 10:52 UTC (permalink / raw)
To: Ryan Roberts, Kees Cook
Cc: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Gustavo A. R. Silva,
Arnd Bergmann, Jason A. Donenfeld, Ard Biesheuvel, Jeremy Linton,
linux-kernel, linux-arm-kernel, loongarch, linuxppc-dev,
linux-riscv, linux-s390, linux-hardening
On Fri, Jan 02, 2026 at 01:11:51PM +0000, Ryan Roberts wrote:
> Hi All,
Hi Ryan,
> As I reported at [1], kstack offset randomisation suffers from a couple of bugs
> and, on arm64 at least, the performance is poor. This series attempts to fix
> both; patch 1 provides back-portable fixes for the functional bugs. Patches 2-3
> propose a performance improvement approach.
>
> I've looked at a few different options but ultimately decided that Jeremy's
> original prng approach is the fastest. I made the argument that this approach is
> secure "enough" in the RFC [2] and the responses indicated agreement.
FWIW, the series all looks good to me. I understand you're likely to
spin a v4 with a couple of minor tweaks (fixing typos and adding an
out-of-line wrapper for a prandom function), but I don't think there's
anything material that needs to change.
I've given my Ack on all three patches. I've given the series a quick
boot test (atop v6.19-rc4) with a bunch of debug options enabled, and
all looks well.
Kees, do you have any comments? It would be nice if we could queue this
up soon.
Mark.
> More details in the commit logs.
>
>
> Performance
> ===========
>
> Mean and tail performance of 3 "small" syscalls was measured. syscall was made
> 10 million times and each individually measured and binned. These results have
> low noise so I'm confident that they are trustworthy.
>
> The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
> performance cost of turning it on without any changes to the implementation,
> then the reduced performance cost of turning it on with my changes applied.
>
> **NOTE**: The below results were generated using the RFC patches but there is no
> meaningful change, so the numbers are still valid.
>
> arm64 (AWS Graviton3):
> +-----------------+--------------+-------------+---------------+
> | Benchmark | Result Class | v6.18-rc5 | per-task-prng |
> | | | rndstack-on | |
> | | | | |
> +=================+==============+=============+===============+
> | syscall/getpid | mean (ns) | (R) 15.62% | (R) 3.43% |
> | | p99 (ns) | (R) 155.01% | (R) 3.20% |
> | | p99.9 (ns) | (R) 156.71% | (R) 2.93% |
> +-----------------+--------------+-------------+---------------+
> | syscall/getppid | mean (ns) | (R) 14.09% | (R) 2.12% |
> | | p99 (ns) | (R) 152.81% | 1.55% |
> | | p99.9 (ns) | (R) 153.67% | 1.77% |
> +-----------------+--------------+-------------+---------------+
> | syscall/invalid | mean (ns) | (R) 13.89% | (R) 3.32% |
> | | p99 (ns) | (R) 165.82% | (R) 3.51% |
> | | p99.9 (ns) | (R) 168.83% | (R) 3.77% |
> +-----------------+--------------+-------------+---------------+
>
> Because arm64 was previously using get_random_u16(), it was expensive when it
> didn't have any buffered bits and had to call into the crng. That's what caused
> the enormous tail latency.
>
>
> x86 (AWS Sapphire Rapids):
> +-----------------+--------------+-------------+---------------+
> | Benchmark | Result Class | v6.18-rc5 | per-task-prng |
> | | | rndstack-on | |
> | | | | |
> +=================+==============+=============+===============+
> | syscall/getpid | mean (ns) | (R) 13.32% | (R) 4.60% |
> | | p99 (ns) | (R) 13.38% | (R) 18.08% |
> | | p99.9 (ns) | 16.26% | (R) 19.38% |
> +-----------------+--------------+-------------+---------------+
> | syscall/getppid | mean (ns) | (R) 11.96% | (R) 5.26% |
> | | p99 (ns) | (R) 11.83% | (R) 8.35% |
> | | p99.9 (ns) | (R) 11.42% | (R) 22.37% |
> +-----------------+--------------+-------------+---------------+
> | syscall/invalid | mean (ns) | (R) 10.58% | (R) 2.91% |
> | | p99 (ns) | (R) 10.51% | (R) 4.36% |
> | | p99.9 (ns) | (R) 10.35% | (R) 21.97% |
> +-----------------+--------------+-------------+---------------+
>
> I was surprised to see that the baseline cost on x86 is 10-12% since it is just
> using rdtsc. But as I say, I believe the results are accurate.
>
>
> Changes since v2 (RFC) [3]
> ==========================
>
> - Moved late_initcall() to initialize kstack_rnd_state out of
> randomize_kstack.h and into main.c. (issue noticed by kernel test robot)
>
> Changes since v1 (RFC) [2]
> ==========================
>
> - Introduced patch 2 to make prandom_u32_state() __always_inline (needed since
> its called from noinstr code)
> - In patch 3, prng is now per-cpu instead of per-task (per Ard)
>
>
> [1] https://lore.kernel.org/all/dd8c37bc-795f-4c7a-9086-69e584d8ab24@arm.com/
> [2] https://lore.kernel.org/all/20251127105958.2427758-1-ryan.roberts@arm.com/
> [3] https://lore.kernel.org/all/20251215163520.1144179-1-ryan.roberts@arm.com/
>
> Thanks,
> Ryan
>
>
> Ryan Roberts (3):
> randomize_kstack: Maintain kstack_offset per task
> prandom: Convert prandom_u32_state() to __always_inline
> randomize_kstack: Unify random source across arches
>
> arch/Kconfig | 5 ++-
> arch/arm64/kernel/syscall.c | 11 ------
> arch/loongarch/kernel/syscall.c | 11 ------
> arch/powerpc/kernel/syscall.c | 12 -------
> arch/riscv/kernel/traps.c | 12 -------
> arch/s390/include/asm/entry-common.h | 8 -----
> arch/x86/include/asm/entry-common.h | 12 -------
> include/linux/prandom.h | 19 +++++++++-
> include/linux/randomize_kstack.h | 54 +++++++++++-----------------
> init/main.c | 9 ++++-
> kernel/fork.c | 1 +
> lib/random32.c | 19 ----------
> 12 files changed, 49 insertions(+), 124 deletions(-)
>
> --
> 2.43.0
>
* Re: [PATCH v3 0/3] Fix bugs and performance of kstack offset randomisation
2026-01-19 10:52 ` [PATCH v3 0/3] Fix bugs and performance of kstack offset randomisation Mark Rutland
@ 2026-01-19 12:22 ` David Laight
2026-01-19 12:58 ` Ryan Roberts
2026-01-19 12:59 ` Ryan Roberts
1 sibling, 1 reply; 26+ messages in thread
From: David Laight @ 2026-01-19 12:22 UTC (permalink / raw)
To: Mark Rutland
Cc: Ryan Roberts, Kees Cook, Catalin Marinas, Will Deacon,
Huacai Chen, Madhavan Srinivasan, Michael Ellerman, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Heiko Carstens, Vasily Gorbik,
Alexander Gordeev, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Gustavo A. R. Silva, Arnd Bergmann,
Jason A. Donenfeld, Ard Biesheuvel, Jeremy Linton, linux-kernel,
linux-arm-kernel, loongarch, linuxppc-dev, linux-riscv,
linux-s390, linux-hardening
On Mon, 19 Jan 2026 10:52:59 +0000
Mark Rutland <mark.rutland@arm.com> wrote:
> On Fri, Jan 02, 2026 at 01:11:51PM +0000, Ryan Roberts wrote:
> > Hi All,
>
> Hi Ryan,
>
> > As I reported at [1], kstack offset randomisation suffers from a couple of bugs
> > and, on arm64 at least, the performance is poor. This series attempts to fix
> > both; patch 1 provides back-portable fixes for the functional bugs. Patches 2-3
> > propose a performance improvement approach.
> >
> > I've looked at a few different options but ultimately decided that Jeremy's
> > original prng approach is the fastest. I made the argument that this approach is
> > secure "enough" in the RFC [2] and the responses indicated agreement.
>
> FWIW, the series all looks good to me. I understand you're likely to
> spin a v4 with a couple of minor tweaks (fixing typos and adding an
> out-of-line wrapper for a prandom function), but I don't think there's
> anything material that needs to change.
>
> I've given my Ack on all three patches. I've given the series a quick
> boot test (atop v6.19-rc4) with a bunch of debug options enabled, and
> all looks well.
>
> Kees, do you have any comments? It would be nice if we could queue this
> up soon.
I don't want to stop this being queued up in its current form.
But I don't see an obvious need for multiple per-cpu prngs
(there are a couple of others lurking); surely one will do.
How much overhead does get_cpu_var() add?
I think it has to disable pre-emption (or interrupts), which might
be more expensive on non-x86 (x86 itself can just do 'inc %gs:address').
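
For reference, get_cpu_var()/put_cpu_var() expand to roughly the following
(from include/linux/percpu-defs.h, quoted from memory, so treat the exact
form as approximate); the extra cost over a raw per-cpu access is the
preempt_disable()/preempt_enable() pair:

#define get_cpu_var(var)						\
(*({									\
	preempt_disable();						\
	this_cpu_ptr(&var);						\
}))

#define put_cpu_var(var)						\
do {									\
	(void)&(var);							\
	preempt_enable();						\
} while (0)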
I'm sure I remember a version that used a per-task prng.
That just needs 'current' - which might already be known and/or be cheaper
to get.
(Although I also remember a reference to some system where it was slow...)
The other option is just to play 'fast and loose' with the prng data.
Using the state from the 'wrong cpu' (if the code is pre-empted) won't
really matter.
You might get an RrwW (or even RrwrwW) update sequence, but the prng isn't
used for anything 'really important', so it shouldn't matter.
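
A "fast and loose" variant might look something like this (hypothetical
sketch, not part of the series): drop the preemption guard and accept that
a task migrating mid-update can mix two CPUs' states, which is harmless for
this purpose:

static __always_inline u32 get_kstack_offset(void)
{
	/*
	 * Racy by design: no preempt_disable(), so a migration here may
	 * interleave updates of two CPUs' prng states.
	 */
	struct rnd_state *state = raw_cpu_ptr(&kstack_rnd_state);

	return prandom_u32_state(state);
}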
David
* Re: [PATCH v3 0/3] Fix bugs and performance of kstack offset randomisation
2026-01-19 12:22 ` David Laight
@ 2026-01-19 12:58 ` Ryan Roberts
0 siblings, 0 replies; 26+ messages in thread
From: Ryan Roberts @ 2026-01-19 12:58 UTC (permalink / raw)
To: David Laight, Mark Rutland
Cc: Kees Cook, Catalin Marinas, Will Deacon, Huacai Chen,
Madhavan Srinivasan, Michael Ellerman, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Heiko Carstens, Vasily Gorbik,
Alexander Gordeev, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Gustavo A. R. Silva, Arnd Bergmann,
Jason A. Donenfeld, Ard Biesheuvel, Jeremy Linton, linux-kernel,
linux-arm-kernel, loongarch, linuxppc-dev, linux-riscv,
linux-s390, linux-hardening
On 19/01/2026 12:22, David Laight wrote:
> On Mon, 19 Jan 2026 10:52:59 +0000
> Mark Rutland <mark.rutland@arm.com> wrote:
>
>> On Fri, Jan 02, 2026 at 01:11:51PM +0000, Ryan Roberts wrote:
>>> Hi All,
>>
>> Hi Ryan,
>>
>>> As I reported at [1], kstack offset randomisation suffers from a couple of bugs
>>> and, on arm64 at least, the performance is poor. This series attempts to fix
>>> both; patch 1 provides back-portable fixes for the functional bugs. Patches 2-3
>>> propose a performance improvement approach.
>>>
>>> I've looked at a few different options but ultimately decided that Jeremy's
>>> original prng approach is the fastest. I made the argument that this approach is
>>> secure "enough" in the RFC [2] and the responses indicated agreement.
>>
>> FWIW, the series all looks good to me. I understand you're likely to
>> spin a v4 with a couple of minor tweaks (fixing typos and adding an
>> out-of-line wrapper for a prandom function), but I don't think there's
>> anything material that needs to change.
>>
>> I've given my Ack on all three patches. I've given the series a quick
>> boot test (atop v6.19-rc4) with a bunch of debug options enabled, and
>> all looks well.
>>
>> Kees, do you have any comments? It would be nice if we could queue this
>> up soon.
>
> I don't want to stop this being queued up in its current form.
> But I don't see an obvious need for multiple per-cpu prngs
> (there are a couple of others lurking); surely one will do.
I see two other per-cpu prngs: one for BPF and one for the scheduler. The state
is 16 bytes per prng, per cpu. So personally I think the maintainability
advantages of keeping them separate in their respective subsystems win out over
the memory cost in this particular case?
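
(For reference, the state in question is just the four u32 generator words
from include/linux/prandom.h:)

struct rnd_state {
	__u32 s1, s2, s3, s4;	/* 4 x 4 bytes = 16 bytes per prng instance */
};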
>
> How much overhead does get_cpu_var() add?
> I think it has to disable pre-emption (or interrupts), which might
> be more expensive on non-x86 (x86 itself can just do 'inc %gs:address').
The RFC used a per-task prng, then v2 switched to per-cpu. Performance numbers
can be compared between those two for arm64 only (the x86 numbers are from
different systems in the two versions):
RFC: https://lore.kernel.org/all/20251127105958.2427758-3-ryan.roberts@arm.com/
v2: https://lore.kernel.org/all/20251215163520.1144179-4-ryan.roberts@arm.com/
+-----------------+--------------+---------------+---------------+
| Benchmark | Result Class | per-task-prng | per-cpu-prng |
| | | arm64 | arm64 |
+=================+==============+===============+===============+
| syscall/getpid | mean (ns) | (I) -10.54% | (I) -9.50% |
| | p99 (ns) | (I) -59.53% | (I) -59.24% |
| | p99.9 (ns) | (I) -59.90% | (I) -59.52% |
+-----------------+--------------+---------------+---------------+
| syscall/getppid | mean (ns) | (I) -10.49% | (I) -9.52% |
| | p99 (ns) | (I) -59.83% | (I) -59.25% |
| | p99.9 (ns) | (I) -59.88% | (I) -59.50% |
+-----------------+--------------+---------------+---------------+
| syscall/invalid | mean (ns) | (I) -9.28% | (I) -10.31% |
| | p99 (ns) | (I) -61.06% | (I) -60.79% |
| | p99.9 (ns) | (I) -61.40% | (I) -61.04% |
+-----------------+--------------+---------------+---------------+
So getpid and getppid are slightly better with per-task, while invalid is
slightly better with per-cpu. I decided that it's likely mostly noise, and
per-cpu is therefore preferable since it costs (a bit) less memory.
>
> I'm sure I remember a version that used a per-task prng.
Yes; as per above.
> That just needs 'current' - which might already be known and/or be cheaper
> to get.
> (Although I also remember a reference to some system where it was slow...)
>
> The other option is just to play 'fast and loose' with the prng data.
> Using the state from the 'wrong cpu' (if the code is pre-empted) won't
> really matter.
> You might get an RrwW (or even RrwrwW) update sequence, but the prng isn't
> used for anything 'really important', so it shouldn't matter.
As per above, I'm not really seeing much performance cost.
My opinion is that this series represents an improvement over what's already
there. I'd be happy to review an additional series to merge per-cpu prngs, but I
don't think that should be a prerequisite for getting this series merged.
Thanks,
Ryan
>
> David
* Re: [PATCH v3 0/3] Fix bugs and performance of kstack offset randomisation
2026-01-19 10:52 ` [PATCH v3 0/3] Fix bugs and performance of kstack offset randomisation Mark Rutland
2026-01-19 12:22 ` David Laight
@ 2026-01-19 12:59 ` Ryan Roberts
1 sibling, 0 replies; 26+ messages in thread
From: Ryan Roberts @ 2026-01-19 12:59 UTC (permalink / raw)
To: Mark Rutland, Kees Cook
Cc: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Gustavo A. R. Silva,
Arnd Bergmann, Jason A. Donenfeld, Ard Biesheuvel, Jeremy Linton,
linux-kernel, linux-arm-kernel, loongarch, linuxppc-dev,
linux-riscv, linux-s390, linux-hardening
On 19/01/2026 10:52, Mark Rutland wrote:
> On Fri, Jan 02, 2026 at 01:11:51PM +0000, Ryan Roberts wrote:
>> Hi All,
>
> Hi Ryan,
>
>> As I reported at [1], kstack offset randomisation suffers from a couple of bugs
>> and, on arm64 at least, the performance is poor. This series attempts to fix
>> both; patch 1 provides back-portable fixes for the functional bugs. Patches 2-3
>> propose a performance improvement approach.
>>
>> I've looked at a few different options but ultimately decided that Jeremy's
>> original prng approach is the fastest. I made the argument that this approach is
>> secure "enough" in the RFC [2] and the responses indicated agreement.
>
> FWIW, the series all looks good to me. I understand you're likely to
> spin a v4 with a couple of minor tweaks (fixing typos and adding an
> out-of-line wrapper for a prandom function), but I don't think there's
> anything material that needs to change.
Thanks for the review, Mark! v4 incoming...
>
> I've given my Ack on all three patches. I've given the series a quick
> boot test (atop v6.19-rc4) with a bunch of debug options enabled, and
> all looks well.
>
> Kees, do you have any comments? It would be nice if we could queue this
> up soon.
>
> Mark.
>
>> More details in the commit logs.
>>
>>
>> Performance
>> ===========
>>
>> Mean and tail performance of 3 "small" syscalls was measured. syscall was made
>> 10 million times and each individually measured and binned. These results have
>> low noise so I'm confident that they are trustworthy.
>>
>> The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
>> performance cost of turning it on without any changes to the implementation,
>> then the reduced performance cost of turning it on with my changes applied.
>>
>> **NOTE**: The below results were generated using the RFC patches but there is no
>> meaningful change, so the numbers are still valid.
>>
>> arm64 (AWS Graviton3):
>> +-----------------+--------------+-------------+---------------+
>> | Benchmark | Result Class | v6.18-rc5 | per-task-prng |
>> | | | rndstack-on | |
>> | | | | |
>> +=================+==============+=============+===============+
>> | syscall/getpid | mean (ns) | (R) 15.62% | (R) 3.43% |
>> | | p99 (ns) | (R) 155.01% | (R) 3.20% |
>> | | p99.9 (ns) | (R) 156.71% | (R) 2.93% |
>> +-----------------+--------------+-------------+---------------+
>> | syscall/getppid | mean (ns) | (R) 14.09% | (R) 2.12% |
>> | | p99 (ns) | (R) 152.81% | 1.55% |
>> | | p99.9 (ns) | (R) 153.67% | 1.77% |
>> +-----------------+--------------+-------------+---------------+
>> | syscall/invalid | mean (ns) | (R) 13.89% | (R) 3.32% |
>> | | p99 (ns) | (R) 165.82% | (R) 3.51% |
>> | | p99.9 (ns) | (R) 168.83% | (R) 3.77% |
>> +-----------------+--------------+-------------+---------------+
>>
>> Because arm64 was previously using get_random_u16(), it was expensive when it
>> didn't have any buffered bits and had to call into the crng. That's what caused
>> the enormous tail latency.
>>
>>
>> x86 (AWS Sapphire Rapids):
>> +-----------------+--------------+-------------+---------------+
>> | Benchmark | Result Class | v6.18-rc5 | per-task-prng |
>> | | | rndstack-on | |
>> | | | | |
>> +=================+==============+=============+===============+
>> | syscall/getpid | mean (ns) | (R) 13.32% | (R) 4.60% |
>> | | p99 (ns) | (R) 13.38% | (R) 18.08% |
>> | | p99.9 (ns) | 16.26% | (R) 19.38% |
>> +-----------------+--------------+-------------+---------------+
>> | syscall/getppid | mean (ns) | (R) 11.96% | (R) 5.26% |
>> | | p99 (ns) | (R) 11.83% | (R) 8.35% |
>> | | p99.9 (ns) | (R) 11.42% | (R) 22.37% |
>> +-----------------+--------------+-------------+---------------+
>> | syscall/invalid | mean (ns) | (R) 10.58% | (R) 2.91% |
>> | | p99 (ns) | (R) 10.51% | (R) 4.36% |
>> | | p99.9 (ns) | (R) 10.35% | (R) 21.97% |
>> +-----------------+--------------+-------------+---------------+
>>
>> I was surprised to see that the baseline cost on x86 is 10-12% since it is just
>> using rdtsc. But as I say, I believe the results are accurate.
>>
>>
>> Changes since v2 (RFC) [3]
>> ==========================
>>
>> - Moved late_initcall() to initialize kstack_rnd_state out of
>> randomize_kstack.h and into main.c. (issue noticed by kernel test robot)
>>
>> Changes since v1 (RFC) [2]
>> ==========================
>>
>> - Introduced patch 2 to make prandom_u32_state() __always_inline (needed since
>> its called from noinstr code)
>> - In patch 3, prng is now per-cpu instead of per-task (per Ard)
>>
>>
>> [1] https://lore.kernel.org/all/dd8c37bc-795f-4c7a-9086-69e584d8ab24@arm.com/
>> [2] https://lore.kernel.org/all/20251127105958.2427758-1-ryan.roberts@arm.com/
>> [3] https://lore.kernel.org/all/20251215163520.1144179-1-ryan.roberts@arm.com/
>>
>> Thanks,
>> Ryan
>>
>>
>> Ryan Roberts (3):
>> randomize_kstack: Maintain kstack_offset per task
>> prandom: Convert prandom_u32_state() to __always_inline
>> randomize_kstack: Unify random source across arches
>>
>> arch/Kconfig | 5 ++-
>> arch/arm64/kernel/syscall.c | 11 ------
>> arch/loongarch/kernel/syscall.c | 11 ------
>> arch/powerpc/kernel/syscall.c | 12 -------
>> arch/riscv/kernel/traps.c | 12 -------
>> arch/s390/include/asm/entry-common.h | 8 -----
>> arch/x86/include/asm/entry-common.h | 12 -------
>> include/linux/prandom.h | 19 +++++++++-
>> include/linux/randomize_kstack.h | 54 +++++++++++-----------------
>> init/main.c | 9 ++++-
>> kernel/fork.c | 1 +
>> lib/random32.c | 19 ----------
>> 12 files changed, 49 insertions(+), 124 deletions(-)
>>
>> --
>> 2.43.0
>>
Thread overview: 26+ messages
2026-01-02 13:11 [PATCH v3 0/3] Fix bugs and performance of kstack offset randomisation Ryan Roberts
2026-01-02 13:11 ` [PATCH v3 1/3] randomize_kstack: Maintain kstack_offset per task Ryan Roberts
2026-01-02 22:44 ` David Laight
2026-01-05 10:30 ` Ryan Roberts
2026-01-19 10:23 ` Mark Rutland
2026-01-02 13:11 ` [PATCH v3 2/3] prandom: Convert prandom_u32_state() to __always_inline Ryan Roberts
2026-01-02 13:39 ` Jason A. Donenfeld
2026-01-02 14:09 ` Ryan Roberts
2026-01-03 8:00 ` Christophe Leroy (CS GROUP)
2026-01-05 10:36 ` Ryan Roberts
2026-01-03 10:46 ` David Laight
2026-01-05 10:34 ` Ryan Roberts
2026-01-02 22:54 ` David Laight
2026-01-19 10:26 ` Mark Rutland
2026-01-02 13:11 ` [PATCH v3 3/3] randomize_kstack: Unify random source across arches Ryan Roberts
2026-01-04 23:01 ` David Laight
2026-01-05 11:05 ` Ryan Roberts
2026-01-05 14:45 ` David Laight
2026-01-07 14:05 ` David Laight
2026-01-12 12:26 ` Ryan Roberts
2026-01-12 13:36 ` David Laight
2026-01-19 10:48 ` Mark Rutland
2026-01-19 10:52 ` [PATCH v3 0/3] Fix bugs and performance of kstack offset randomisation Mark Rutland
2026-01-19 12:22 ` David Laight
2026-01-19 12:58 ` Ryan Roberts
2026-01-19 12:59 ` Ryan Roberts