* [PATCH v3 0/3] Fix bugs and performance of kstack offset randomisation
@ 2026-01-02 13:11 Ryan Roberts
2026-01-02 13:11 ` [PATCH v3 1/3] randomize_kstack: Maintain kstack_offset per task Ryan Roberts
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: Ryan Roberts @ 2026-01-02 13:11 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland,
Jason A. Donenfeld, Ard Biesheuvel, Jeremy Linton
Cc: Ryan Roberts, linux-kernel, linux-arm-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, linux-hardening
Hi All,
As I reported at [1], kstack offset randomisation suffers from a couple of bugs
and, on arm64 at least, the performance is poor. This series attempts to fix
both; patch 1 provides back-portable fixes for the functional bugs. Patches 2-3
propose a performance improvement approach.
I've looked at a few different options but ultimately decided that Jeremy's
original prng approach is the fastest. I made the argument that this approach is
secure "enough" in the RFC [2] and the responses indicated agreement.
More details in the commit logs.
Performance
===========
Mean and tail performance of 3 "small" syscalls was measured. Each syscall was
made 10 million times, with every call individually timed and binned. These
results have low noise so I'm confident that they are trustworthy.
The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
the performance cost of turning it on without any changes to the implementation,
then the reduced performance cost of turning it on with my changes applied.
**NOTE**: The below results were generated using the RFC patches but there is no
meaningful change, so the numbers are still valid.
arm64 (AWS Graviton3):
+-----------------+--------------+-------------+---------------+
| Benchmark       | Result Class |   v6.18-rc5 | per-task-prng |
|                 |              | rndstack-on |               |
|                 |              |             |               |
+=================+==============+=============+===============+
| syscall/getpid  |    mean (ns) |  (R) 15.62% |     (R) 3.43% |
|                 |     p99 (ns) | (R) 155.01% |     (R) 3.20% |
|                 |   p99.9 (ns) | (R) 156.71% |     (R) 2.93% |
+-----------------+--------------+-------------+---------------+
| syscall/getppid |    mean (ns) |  (R) 14.09% |     (R) 2.12% |
|                 |     p99 (ns) | (R) 152.81% |         1.55% |
|                 |   p99.9 (ns) | (R) 153.67% |         1.77% |
+-----------------+--------------+-------------+---------------+
| syscall/invalid |    mean (ns) |  (R) 13.89% |     (R) 3.32% |
|                 |     p99 (ns) | (R) 165.82% |     (R) 3.51% |
|                 |   p99.9 (ns) | (R) 168.83% |     (R) 3.77% |
+-----------------+--------------+-------------+---------------+
Because arm64 was previously using get_random_u16(), it was expensive when it
didn't have any buffered bits and had to call into the crng. That's what caused
the enormous tail latency.
x86 (AWS Sapphire Rapids):
+-----------------+--------------+-------------+---------------+
| Benchmark       | Result Class |   v6.18-rc5 | per-task-prng |
|                 |              | rndstack-on |               |
|                 |              |             |               |
+=================+==============+=============+===============+
| syscall/getpid  |    mean (ns) |  (R) 13.32% |     (R) 4.60% |
|                 |     p99 (ns) |  (R) 13.38% |    (R) 18.08% |
|                 |   p99.9 (ns) |      16.26% |    (R) 19.38% |
+-----------------+--------------+-------------+---------------+
| syscall/getppid |    mean (ns) |  (R) 11.96% |     (R) 5.26% |
|                 |     p99 (ns) |  (R) 11.83% |     (R) 8.35% |
|                 |   p99.9 (ns) |  (R) 11.42% |    (R) 22.37% |
+-----------------+--------------+-------------+---------------+
| syscall/invalid |    mean (ns) |  (R) 10.58% |     (R) 2.91% |
|                 |     p99 (ns) |  (R) 10.51% |     (R) 4.36% |
|                 |   p99.9 (ns) |  (R) 10.35% |    (R) 21.97% |
+-----------------+--------------+-------------+---------------+
I was surprised to see that the baseline cost on x86 is 10-12% since it is just
using rdtsc. But as I say, I believe the results are accurate.
Changes since v2 (RFC) [3]
==========================
- Moved late_initcall() to initialize kstack_rnd_state out of
randomize_kstack.h and into main.c. (issue noticed by kernel test robot)
Changes since v1 (RFC) [2]
==========================
- Introduced patch 2 to make prandom_u32_state() __always_inline (needed since
it's called from noinstr code)
- In patch 3, prng is now per-cpu instead of per-task (per Ard)
[1] https://lore.kernel.org/all/dd8c37bc-795f-4c7a-9086-69e584d8ab24@arm.com/
[2] https://lore.kernel.org/all/20251127105958.2427758-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/all/20251215163520.1144179-1-ryan.roberts@arm.com/
Thanks,
Ryan
Ryan Roberts (3):
randomize_kstack: Maintain kstack_offset per task
prandom: Convert prandom_u32_state() to __always_inline
randomize_kstack: Unify random source across arches
arch/Kconfig | 5 ++-
arch/arm64/kernel/syscall.c | 11 ------
arch/loongarch/kernel/syscall.c | 11 ------
arch/powerpc/kernel/syscall.c | 12 -------
arch/riscv/kernel/traps.c | 12 -------
arch/s390/include/asm/entry-common.h | 8 -----
arch/x86/include/asm/entry-common.h | 12 -------
include/linux/prandom.h | 19 +++++++++-
include/linux/randomize_kstack.h | 54 +++++++++++-----------------
init/main.c | 9 ++++-
kernel/fork.c | 1 +
lib/random32.c | 19 ----------
12 files changed, 49 insertions(+), 124 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH v3 1/3] randomize_kstack: Maintain kstack_offset per task
2026-01-02 13:11 [PATCH v3 0/3] Fix bugs and performance of kstack offset randomisation Ryan Roberts
@ 2026-01-02 13:11 ` Ryan Roberts
2026-01-02 22:44 ` David Laight
2026-01-02 13:11 ` [PATCH v3 2/3] prandom: Convert prandom_u32_state() to __always_inline Ryan Roberts
2026-01-02 13:11 ` [PATCH v3 3/3] randomize_kstack: Unify random source across arches Ryan Roberts
2 siblings, 1 reply; 9+ messages in thread
From: Ryan Roberts @ 2026-01-02 13:11 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland,
Jason A. Donenfeld, Ard Biesheuvel, Jeremy Linton
Cc: Ryan Roberts, linux-kernel, linux-arm-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, linux-hardening, stable

kstack_offset was previously maintained per-cpu, but this caused a
couple of issues. So let's instead make it per-task.

Issue 1: add_random_kstack_offset() and choose_random_kstack_offset()
expected and required to be called with interrupts and preemption
disabled so that it could manipulate per-cpu state. But arm64, loongarch
and risc-v are calling them with interrupts and preemption enabled. I
don't _think_ this causes any functional issues, but it's certainly
unexpected and could lead to manipulating the wrong cpu's state, which
could cause a minor performance degradation due to bouncing the cache
lines. By maintaining the state per-task those functions can safely be
called in preemptible context.

Issue 2: add_random_kstack_offset() is called before executing the
syscall and expands the stack using a previously chosen rnadom offset.
choose_random_kstack_offset() is called after executing the syscall and
chooses and stores a new random offset for the next syscall. With
per-cpu storage for this offset, an attacker could force cpu migration
during the execution of the syscall and prevent the offset from being
updated for the original cpu such that it is predictable for the next
syscall on that cpu. By maintaining the state per-task, this problem
goes away because the per-task random offset is updated after the
syscall regardless of which cpu it is executing on.

Fixes: 39218ff4c625 ("stack: Optionally randomize kernel stack offset each syscall")
Closes: https://lore.kernel.org/all/dd8c37bc-795f-4c7a-9086-69e584d8ab24@arm.com/
Cc: stable@vger.kernel.org
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/randomize_kstack.h | 26 +++++++++++++++-----------
 include/linux/sched.h            |  4 ++++
 init/main.c                      |  1 -
 kernel/fork.c                    |  2 ++
 4 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/include/linux/randomize_kstack.h b/include/linux/randomize_kstack.h
index 1d982dbdd0d0..5d3916ca747c 100644
--- a/include/linux/randomize_kstack.h
+++ b/include/linux/randomize_kstack.h
@@ -9,7 +9,6 @@
 
 DECLARE_STATIC_KEY_MAYBE(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
 			 randomize_kstack_offset);
-DECLARE_PER_CPU(u32, kstack_offset);
 
 /*
  * Do not use this anywhere else in the kernel. This is used here because
@@ -50,15 +49,14 @@ DECLARE_PER_CPU(u32, kstack_offset);
  * add_random_kstack_offset - Increase stack utilization by previously
  *                            chosen random offset
  *
- * This should be used in the syscall entry path when interrupts and
- * preempt are disabled, and after user registers have been stored to
- * the stack. For testing the resulting entropy, please see:
- * tools/testing/selftests/lkdtm/stack-entropy.sh
+ * This should be used in the syscall entry path after user registers have been
+ * stored to the stack. Preemption may be enabled. For testing the resulting
+ * entropy, please see: tools/testing/selftests/lkdtm/stack-entropy.sh
  */
 #define add_random_kstack_offset() do { \
 	if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
 				&randomize_kstack_offset)) { \
-		u32 offset = raw_cpu_read(kstack_offset); \
+		u32 offset = current->kstack_offset; \
 		u8 *ptr = __kstack_alloca(KSTACK_OFFSET_MAX(offset)); \
 		/* Keep allocation even after "ptr" loses scope. */ \
 		asm volatile("" :: "r"(ptr) : "memory"); \
@@ -69,9 +67,9 @@ DECLARE_PER_CPU(u32, kstack_offset);
  * choose_random_kstack_offset - Choose the random offset for the next
  *                               add_random_kstack_offset()
  *
- * This should only be used during syscall exit when interrupts and
- * preempt are disabled. This position in the syscall flow is done to
- * frustrate attacks from userspace attempting to learn the next offset:
+ * This should only be used during syscall exit. Preemption may be enabled. This
+ * position in the syscall flow is done to frustrate attacks from userspace
+ * attempting to learn the next offset:
  * - Maximize the timing uncertainty visible from userspace: if the
  *   offset is chosen at syscall entry, userspace has much more control
  *   over the timing between choosing offsets. "How long will we be in
@@ -85,14 +83,20 @@ DECLARE_PER_CPU(u32, kstack_offset);
 #define choose_random_kstack_offset(rand) do { \
 	if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
 				&randomize_kstack_offset)) { \
-		u32 offset = raw_cpu_read(kstack_offset); \
+		u32 offset = current->kstack_offset; \
 		offset = ror32(offset, 5) ^ (rand); \
-		raw_cpu_write(kstack_offset, offset); \
+		current->kstack_offset = offset; \
 	} \
 } while (0)
+
+static inline void random_kstack_task_init(struct task_struct *tsk)
+{
+	tsk->kstack_offset = 0;
+}
 #else /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
 #define add_random_kstack_offset() do { } while (0)
 #define choose_random_kstack_offset(rand) do { } while (0)
+#define random_kstack_task_init(tsk) do { } while (0)
 #endif /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
 
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d395f2810fac..9e0080ed1484 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1591,6 +1591,10 @@ struct task_struct {
 	unsigned long prev_lowest_stack;
 #endif
 
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+	u32 kstack_offset;
+#endif
+
 #ifdef CONFIG_X86_MCE
 	void __user *mce_vaddr;
 	__u64 mce_kflags;
diff --git a/init/main.c b/init/main.c
index b84818ad9685..27fcbbde933e 100644
--- a/init/main.c
+++ b/init/main.c
@@ -830,7 +830,6 @@ static inline void initcall_debug_enable(void)
 #ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
 DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
 			   randomize_kstack_offset);
-DEFINE_PER_CPU(u32, kstack_offset);
 
 static int __init early_randomize_kstack_offset(char *buf)
 {
diff --git a/kernel/fork.c b/kernel/fork.c
index b1f3915d5f8e..b061e1edbc43 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -95,6 +95,7 @@
 #include <linux/thread_info.h>
 #include <linux/kstack_erase.h>
 #include <linux/kasan.h>
+#include <linux/randomize_kstack.h>
 #include <linux/scs.h>
 #include <linux/io_uring.h>
 #include <linux/bpf.h>
@@ -2231,6 +2232,7 @@ __latent_entropy struct task_struct *copy_process(
 	if (retval)
 		goto bad_fork_cleanup_io;
 
+	random_kstack_task_init(p);
 	stackleak_task_init(p);
 
 	if (pid != &init_struct_pid) {
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH v3 1/3] randomize_kstack: Maintain kstack_offset per task
2026-01-02 13:11 ` [PATCH v3 1/3] randomize_kstack: Maintain kstack_offset per task Ryan Roberts
@ 2026-01-02 22:44 ` David Laight
0 siblings, 0 replies; 9+ messages in thread
From: David Laight @ 2026-01-02 22:44 UTC (permalink / raw)
To: Ryan Roberts
Cc: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland,
Jason A. Donenfeld, Ard Biesheuvel, Jeremy Linton, linux-kernel,
linux-arm-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390,
linux-hardening, stable

On Fri, 2 Jan 2026 13:11:52 +0000
Ryan Roberts <ryan.roberts@arm.com> wrote:

> kstack_offset was previously maintained per-cpu, but this caused a
> couple of issues. So let's instead make it per-task.
>
> Issue 1: add_random_kstack_offset() and choose_random_kstack_offset()
> expected and required to be called with interrupts and preemption
> disabled so that it could manipulate per-cpu state. But arm64, loongarch
> and risc-v are calling them with interrupts and preemption enabled. I
> don't _think_ this causes any functional issues, but it's certainly
> unexpected and could lead to manipulating the wrong cpu's state, which
> could cause a minor performance degradation due to bouncing the cache
> lines. By maintaining the state per-task those functions can safely be
> called in preemptible context.
>
> Issue 2: add_random_kstack_offset() is called before executing the
> syscall and expands the stack using a previously chosen rnadom offset.

<>

David

> choose_random_kstack_offset() is called after executing the syscall and
> chooses and stores a new random offset for the next syscall. With
> per-cpu storage for this offset, an attacker could force cpu migration
> during the execution of the syscall and prevent the offset from being
> updated for the original cpu such that it is predictable for the next
> syscall on that cpu. By maintaining the state per-task, this problem
> goes away because the per-task random offset is updated after the
> syscall regardless of which cpu it is executing on.
>
> Fixes: 39218ff4c625 ("stack: Optionally randomize kernel stack offset each syscall")
> Closes: https://lore.kernel.org/all/dd8c37bc-795f-4c7a-9086-69e584d8ab24@arm.com/
> Cc: stable@vger.kernel.org
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  include/linux/randomize_kstack.h | 26 +++++++++++++++-----------
>  include/linux/sched.h            |  4 ++++
>  init/main.c                      |  1 -
>  kernel/fork.c                    |  2 ++
>  4 files changed, 21 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/randomize_kstack.h b/include/linux/randomize_kstack.h
> index 1d982dbdd0d0..5d3916ca747c 100644
> --- a/include/linux/randomize_kstack.h
> +++ b/include/linux/randomize_kstack.h
> @@ -9,7 +9,6 @@
>
>  DECLARE_STATIC_KEY_MAYBE(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
>  			 randomize_kstack_offset);
> -DECLARE_PER_CPU(u32, kstack_offset);
>
>  /*
>   * Do not use this anywhere else in the kernel. This is used here because
> @@ -50,15 +49,14 @@ DECLARE_PER_CPU(u32, kstack_offset);
>   * add_random_kstack_offset - Increase stack utilization by previously
>   *                            chosen random offset
>   *
> - * This should be used in the syscall entry path when interrupts and
> - * preempt are disabled, and after user registers have been stored to
> - * the stack. For testing the resulting entropy, please see:
> - * tools/testing/selftests/lkdtm/stack-entropy.sh
> + * This should be used in the syscall entry path after user registers have been
> + * stored to the stack. Preemption may be enabled. For testing the resulting
> + * entropy, please see: tools/testing/selftests/lkdtm/stack-entropy.sh
>   */
>  #define add_random_kstack_offset() do { \
>  	if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
>  				&randomize_kstack_offset)) { \
> -		u32 offset = raw_cpu_read(kstack_offset); \
> +		u32 offset = current->kstack_offset; \
>  		u8 *ptr = __kstack_alloca(KSTACK_OFFSET_MAX(offset)); \
>  		/* Keep allocation even after "ptr" loses scope. */ \
>  		asm volatile("" :: "r"(ptr) : "memory"); \
> @@ -69,9 +67,9 @@ DECLARE_PER_CPU(u32, kstack_offset);
>   * choose_random_kstack_offset - Choose the random offset for the next
>   *                               add_random_kstack_offset()
>   *
> - * This should only be used during syscall exit when interrupts and
> - * preempt are disabled. This position in the syscall flow is done to
> - * frustrate attacks from userspace attempting to learn the next offset:
> + * This should only be used during syscall exit. Preemption may be enabled. This
> + * position in the syscall flow is done to frustrate attacks from userspace
> + * attempting to learn the next offset:
>   * - Maximize the timing uncertainty visible from userspace: if the
>   *   offset is chosen at syscall entry, userspace has much more control
>   *   over the timing between choosing offsets. "How long will we be in
> @@ -85,14 +83,20 @@ DECLARE_PER_CPU(u32, kstack_offset);
>  #define choose_random_kstack_offset(rand) do { \
>  	if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
>  				&randomize_kstack_offset)) { \
> -		u32 offset = raw_cpu_read(kstack_offset); \
> +		u32 offset = current->kstack_offset; \
>  		offset = ror32(offset, 5) ^ (rand); \
> -		raw_cpu_write(kstack_offset, offset); \
> +		current->kstack_offset = offset; \
>  	} \
>  } while (0)
> +
> +static inline void random_kstack_task_init(struct task_struct *tsk)
> +{
> +	tsk->kstack_offset = 0;
> +}
>  #else /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
>  #define add_random_kstack_offset() do { } while (0)
>  #define choose_random_kstack_offset(rand) do { } while (0)
> +#define random_kstack_task_init(tsk) do { } while (0)
>  #endif /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
>
>  #endif
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d395f2810fac..9e0080ed1484 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1591,6 +1591,10 @@ struct task_struct {
>  	unsigned long prev_lowest_stack;
>  #endif
>
> +#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
> +	u32 kstack_offset;
> +#endif
> +
>  #ifdef CONFIG_X86_MCE
>  	void __user *mce_vaddr;
>  	__u64 mce_kflags;
> diff --git a/init/main.c b/init/main.c
> index b84818ad9685..27fcbbde933e 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -830,7 +830,6 @@ static inline void initcall_debug_enable(void)
>  #ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
>  DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
>  			   randomize_kstack_offset);
> -DEFINE_PER_CPU(u32, kstack_offset);
>
>  static int __init early_randomize_kstack_offset(char *buf)
>  {
> diff --git a/kernel/fork.c b/kernel/fork.c
> index b1f3915d5f8e..b061e1edbc43 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -95,6 +95,7 @@
>  #include <linux/thread_info.h>
>  #include <linux/kstack_erase.h>
>  #include <linux/kasan.h>
> +#include <linux/randomize_kstack.h>
>  #include <linux/scs.h>
>  #include <linux/io_uring.h>
>  #include <linux/bpf.h>
> @@ -2231,6 +2232,7 @@ __latent_entropy struct task_struct *copy_process(
>  	if (retval)
>  		goto bad_fork_cleanup_io;
>
> +	random_kstack_task_init(p);
>  	stackleak_task_init(p);
>
>  	if (pid != &init_struct_pid) {

^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH v3 2/3] prandom: Convert prandom_u32_state() to __always_inline
2026-01-02 13:11 [PATCH v3 0/3] Fix bugs and performance of kstack offset randomisation Ryan Roberts
2026-01-02 13:11 ` [PATCH v3 1/3] randomize_kstack: Maintain kstack_offset per task Ryan Roberts
@ 2026-01-02 13:11 ` Ryan Roberts
2026-01-02 13:39 ` Jason A. Donenfeld
2026-01-02 13:11 ` [PATCH v3 3/3] randomize_kstack: Unify random source across arches Ryan Roberts
2 siblings, 1 reply; 9+ messages in thread
From: Ryan Roberts @ 2026-01-02 13:11 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland,
Jason A. Donenfeld, Ard Biesheuvel, Jeremy Linton
Cc: Ryan Roberts, linux-kernel, linux-arm-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, linux-hardening

We will shortly use prandom_u32_state() to implement kstack offset
randomization and some arches need to call it from non-instrumentable
context. Given the function is just a handful of operations and doesn't
call out to any other functions, let's take the easy path and make it
__always_inline.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/prandom.h | 19 ++++++++++++++++++-
 lib/random32.c          | 19 -------------------
 2 files changed, 18 insertions(+), 20 deletions(-)

diff --git a/include/linux/prandom.h b/include/linux/prandom.h
index ff7dcc3fa105..e797b3709f5c 100644
--- a/include/linux/prandom.h
+++ b/include/linux/prandom.h
@@ -17,7 +17,24 @@ struct rnd_state {
 	__u32 s1, s2, s3, s4;
 };
 
-u32 prandom_u32_state(struct rnd_state *state);
+/**
+ * prandom_u32_state - seeded pseudo-random number generator.
+ * @state: pointer to state structure holding seeded state.
+ *
+ * This is used for pseudo-randomness with no outside seeding.
+ * For more random results, use get_random_u32().
+ */
+static __always_inline u32 prandom_u32_state(struct rnd_state *state)
+{
+#define TAUSWORTHE(s, a, b, c, d) ((s & c) << d) ^ (((s << a) ^ s) >> b)
+	state->s1 = TAUSWORTHE(state->s1, 6U, 13U, 4294967294U, 18U);
+	state->s2 = TAUSWORTHE(state->s2, 2U, 27U, 4294967288U, 2U);
+	state->s3 = TAUSWORTHE(state->s3, 13U, 21U, 4294967280U, 7U);
+	state->s4 = TAUSWORTHE(state->s4, 3U, 12U, 4294967168U, 13U);
+
+	return (state->s1 ^ state->s2 ^ state->s3 ^ state->s4);
+}
+
 void prandom_bytes_state(struct rnd_state *state, void *buf, size_t nbytes);
 void prandom_seed_full_state(struct rnd_state __percpu *pcpu_state);
 
diff --git a/lib/random32.c b/lib/random32.c
index 24e7acd9343f..d57baf489d4a 100644
--- a/lib/random32.c
+++ b/lib/random32.c
@@ -42,25 +42,6 @@
 #include <linux/slab.h>
 #include <linux/unaligned.h>
 
-/**
- * prandom_u32_state - seeded pseudo-random number generator.
- * @state: pointer to state structure holding seeded state.
- *
- * This is used for pseudo-randomness with no outside seeding.
- * For more random results, use get_random_u32().
- */
-u32 prandom_u32_state(struct rnd_state *state)
-{
-#define TAUSWORTHE(s, a, b, c, d) ((s & c) << d) ^ (((s << a) ^ s) >> b)
-	state->s1 = TAUSWORTHE(state->s1, 6U, 13U, 4294967294U, 18U);
-	state->s2 = TAUSWORTHE(state->s2, 2U, 27U, 4294967288U, 2U);
-	state->s3 = TAUSWORTHE(state->s3, 13U, 21U, 4294967280U, 7U);
-	state->s4 = TAUSWORTHE(state->s4, 3U, 12U, 4294967168U, 13U);
-
-	return (state->s1 ^ state->s2 ^ state->s3 ^ state->s4);
-}
-EXPORT_SYMBOL(prandom_u32_state);
-
 /**
  * prandom_bytes_state - get the requested number of pseudo-random bytes
  *
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH v3 2/3] prandom: Convert prandom_u32_state() to __always_inline
2026-01-02 13:11 ` [PATCH v3 2/3] prandom: Convert prandom_u32_state() to __always_inline Ryan Roberts
@ 2026-01-02 13:39 ` Jason A. Donenfeld
2026-01-02 14:09 ` Ryan Roberts
2026-01-02 22:54 ` David Laight
0 siblings, 2 replies; 9+ messages in thread
From: Jason A. Donenfeld @ 2026-01-02 13:39 UTC (permalink / raw)
To: Ryan Roberts
Cc: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland, Ard Biesheuvel,
Jeremy Linton, linux-kernel, linux-arm-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, linux-hardening

Hi Ryan,

On Fri, Jan 2, 2026 at 2:12 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> context. Given the function is just a handful of operations and doesn't

How many? What's this looking like in terms of assembly? It'd also be
nice to have some brief analysis of other call sites to have
confirmation this isn't blowing up other users.

> +static __always_inline u32 prandom_u32_state(struct rnd_state *state)

Why not just normal `inline`? Is gcc disagreeing with the inlinability
of this function?

Jason

^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v3 2/3] prandom: Convert prandom_u32_state() to __always_inline
2026-01-02 13:39 ` Jason A. Donenfeld
@ 2026-01-02 14:09 ` Ryan Roberts
2026-01-03 8:00 ` Christophe Leroy (CS GROUP)
2026-01-02 22:54 ` David Laight
1 sibling, 1 reply; 9+ messages in thread
From: Ryan Roberts @ 2026-01-02 14:09 UTC (permalink / raw)
To: Jason A. Donenfeld
Cc: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland, Ard Biesheuvel,
Jeremy Linton, linux-kernel, linux-arm-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, linux-hardening

On 02/01/2026 13:39, Jason A. Donenfeld wrote:
> Hi Ryan,
>
> On Fri, Jan 2, 2026 at 2:12 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>> context. Given the function is just a handful of operations and doesn't
>
> How many? What's this looking like in terms of assembly?

25 instructions on arm64:

0000000000000000 <prandom_u32_state>:
   0:   29401403        ldp     w3, w5, [x0]
   4:   aa0003e1        mov     x1, x0
   8:   29410002        ldp     w2, w0, [x0, #8]
   c:   531e74a4        lsl     w4, w5, #2
  10:   530e3468        lsl     w8, w3, #18
  14:   4a0400a5        eor     w5, w5, w4
  18:   4a031863        eor     w3, w3, w3, lsl #6
  1c:   53196047        lsl     w7, w2, #7
  20:   53134806        lsl     w6, w0, #13
  24:   4a023442        eor     w2, w2, w2, lsl #13
  28:   4a000c00        eor     w0, w0, w0, lsl #3
  2c:   121b6884        and     w4, w4, #0xffffffe0
  30:   120d3108        and     w8, w8, #0xfff80000
  34:   121550e7        and     w7, w7, #0xfffff800
  38:   120c2cc6        and     w6, w6, #0xfff00000
  3c:   2a456c85        orr     w5, w4, w5, lsr #27
  40:   2a433504        orr     w4, w8, w3, lsr #13
  44:   2a4254e3        orr     w3, w7, w2, lsr #21
  48:   2a4030c2        orr     w2, w6, w0, lsr #12
  4c:   4a020066        eor     w6, w3, w2
  50:   4a050080        eor     w0, w4, w5
  54:   4a0000c0        eor     w0, w6, w0
  58:   29001424        stp     w4, w5, [x1]
  5c:   29010823        stp     w3, w2, [x1, #8]
  60:   d65f03c0        ret

> It'd also be
> nice to have some brief analysis of other call sites to have
> confirmation this isn't blowing up other users.

I compiled defconfig before and after this patch on arm64 and compared the text
sizes:

$ ./scripts/bloat-o-meter -t vmlinux.before vmlinux.after
add/remove: 3/4 grow/shrink: 4/1 up/down: 836/-128 (708)
Function                                     old     new   delta
prandom_seed_full_state                      364     932    +568
pick_next_task_fair                         1940    2036     +96
bpf_user_rnd_u32                             104     196     +92
prandom_bytes_state                          204     260     +56
e843419@0f2b_00012d69_e34                      -       8      +8
e843419@0db7_00010ec3_23ec                     -       8      +8
e843419@02cb_00003767_25c                      -       8      +8
bpf_prog_select_runtime                      448     444      -4
e843419@0aa3_0000cfd1_1580                     8       -      -8
e843419@0aa2_0000cfba_147c                     8       -      -8
e843419@075f_00008d8c_184                      8       -      -8
prandom_u32_state                            100       -    -100
Total: Before=19078072, After=19078780, chg +0.00%

So 708 bytes more after inlining. The main cost is prandom_seed_full_state(),
which calls prandom_u32_state() 10 times (via prandom_warmup()). I expect we
could turn that into a loop to reduce ~450 bytes overall.

I'm not really sure if 708 is good or bad...

>
>> +static __always_inline u32 prandom_u32_state(struct rnd_state *state)
>
> Why not just normal `inline`? Is gcc disagreeing with the inlinability
> of this function?

Given this needs to be called from a noinstr function, I didn't want to give the
compiler the opportunity to decide not to inline it, since in that case, some
instrumentation might end up being applied to the function body which would blow
up when called in the noinstr context.

I think the other 2 options are to keep prandom_u32_state() in the c file but
mark it noinstr or rearrange all the users so that they don't call it until
instrumentation is allowable. The latter is something I was trying to avoid.

There is some previous discussion of this at [1].

[1] https://lore.kernel.org/all/aS65LFUfdgRPKv1l@J2N7QTR9R3/

Perhaps keeping prandom_u32_state() in the c file and making it noinstr is the
best compromise?

Thanks,
Ryan

>
> Jason

^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v3 2/3] prandom: Convert prandom_u32_state() to __always_inline 2026-01-02 14:09 ` Ryan Roberts @ 2026-01-03 8:00 ` Christophe Leroy (CS GROUP) 0 siblings, 0 replies; 9+ messages in thread From: Christophe Leroy (CS GROUP) @ 2026-01-03 8:00 UTC (permalink / raw) To: Ryan Roberts, Jason A. Donenfeld Cc: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan, Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook, Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland, Ard Biesheuvel, Jeremy Linton, linux-kernel, linux-arm-kernel, loongarch, linuxppc-dev, linux-riscv, linux-s390, linux-hardening Le 02/01/2026 à 15:09, Ryan Roberts a écrit : > On 02/01/2026 13:39, Jason A. Donenfeld wrote: >> Hi Ryan, >> >> On Fri, Jan 2, 2026 at 2:12 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >>> context. Given the function is just a handful of operations and doesn't >> >> How many? What's this looking like in terms of assembly? 
>
> 25 instructions on arm64:

31 instructions on powerpc:

00000000 <prandom_u32_state>:
   0:   7c 69 1b 78     mr      r9,r3
   4:   80 63 00 00     lwz     r3,0(r3)
   8:   80 89 00 08     lwz     r4,8(r9)
   c:   81 69 00 04     lwz     r11,4(r9)
  10:   80 a9 00 0c     lwz     r5,12(r9)
  14:   54 67 30 32     slwi    r7,r3,6
  18:   7c e7 1a 78     xor     r7,r7,r3
  1c:   55 66 10 3a     slwi    r6,r11,2
  20:   54 88 68 24     slwi    r8,r4,13
  24:   54 63 90 18     rlwinm  r3,r3,18,0,12
  28:   7d 6b 32 78     xor     r11,r11,r6
  2c:   7d 08 22 78     xor     r8,r8,r4
  30:   54 aa 18 38     slwi    r10,r5,3
  34:   54 e7 9b 7e     srwi    r7,r7,13
  38:   7c e7 1a 78     xor     r7,r7,r3
  3c:   51 66 2e fe     rlwimi  r6,r11,5,27,31
  40:   54 84 38 28     rlwinm  r4,r4,7,0,20
  44:   7d 4a 2a 78     xor     r10,r10,r5
  48:   55 08 5d 7e     srwi    r8,r8,21
  4c:   7d 08 22 78     xor     r8,r8,r4
  50:   7c e3 32 78     xor     r3,r7,r6
  54:   54 a5 68 16     rlwinm  r5,r5,13,0,11
  58:   55 4a a3 3e     srwi    r10,r10,12
  5c:   7d 4a 2a 78     xor     r10,r10,r5
  60:   7c 63 42 78     xor     r3,r3,r8
  64:   90 e9 00 00     stw     r7,0(r9)
  68:   90 c9 00 04     stw     r6,4(r9)
  6c:   91 09 00 08     stw     r8,8(r9)
  70:   91 49 00 0c     stw     r10,12(r9)
  74:   7c 63 52 78     xor     r3,r3,r10
  78:   4e 80 00 20     blr

Among those, 8 instructions are for reading/writing the state on the
stack. They of course disappear when inlining.

>
>> It'd also be
>> nice to have some brief analysis of other call sites to have
>> confirmation this isn't blowing up other users.
>
> I compiled defconfig before and after this patch on arm64 and compared the text
> sizes:
>
> $ ./scripts/bloat-o-meter -t vmlinux.before vmlinux.after
> add/remove: 3/4 grow/shrink: 4/1 up/down: 836/-128 (708)
> Function                                     old     new   delta
> prandom_seed_full_state                      364     932    +568
> pick_next_task_fair                         1940    2036     +96
> bpf_user_rnd_u32                             104     196     +92
> prandom_bytes_state                          204     260     +56
> e843419@0f2b_00012d69_e34                      -       8      +8
> e843419@0db7_00010ec3_23ec                     -       8      +8
> e843419@02cb_00003767_25c                      -       8      +8
> bpf_prog_select_runtime                      448     444      -4
> e843419@0aa3_0000cfd1_1580                     8       -      -8
> e843419@0aa2_0000cfba_147c                     8       -      -8
> e843419@075f_00008d8c_184                      8       -      -8
> prandom_u32_state                            100       -    -100
> Total: Before=19078072, After=19078780, chg +0.00%
>
> So 708 bytes more after inlining. The main cost is prandom_seed_full_state(),
> which calls prandom_u32_state() 10 times (via prandom_warmup()). I expect we
> could turn that into a loop to reduce ~450 bytes overall.
>

With the following change the increase of prandom_seed_full_state() remains
reasonable, and performance-wise it is a lot better as it avoids the
read/write of the state via the stack:

diff --git a/lib/random32.c b/lib/random32.c
index 24e7acd9343f6..28a5b109c9018 100644
--- a/lib/random32.c
+++ b/lib/random32.c
@@ -94,17 +94,11 @@ EXPORT_SYMBOL(prandom_bytes_state);

 static void prandom_warmup(struct rnd_state *state)
 {
+	int i;
+
 	/* Calling RNG ten times to satisfy recurrence condition */
-	prandom_u32_state(state);
-	prandom_u32_state(state);
-	prandom_u32_state(state);
-	prandom_u32_state(state);
-	prandom_u32_state(state);
-	prandom_u32_state(state);
-	prandom_u32_state(state);
-	prandom_u32_state(state);
-	prandom_u32_state(state);
-	prandom_u32_state(state);
+	for (i = 0; i < 10; i++)
+		prandom_u32_state(state);
 }

 void prandom_seed_full_state(struct rnd_state __percpu *pcpu_state)

The loop is:

 248:   38 e0 00 0a     li      r7,10
 24c:   7c e9 03 a6     mtctr   r7
 250:   55 05 30 32     slwi    r5,r8,6
 254:   55 46 68 24     slwi    r6,r10,13
 258:   55 27 18 38     slwi    r7,r9,3
 25c:   7c a5 42 78     xor     r5,r5,r8
 260:   7c c6 52 78     xor     r6,r6,r10
 264:   7c e7 4a 78     xor     r7,r7,r9
 268:   54 8b 10 3a     slwi    r11,r4,2
 26c:   7d 60 22 78     xor     r0,r11,r4
 270:   54 a5 9b 7e     srwi    r5,r5,13
 274:   55 08 90 18     rlwinm  r8,r8,18,0,12
 278:   54 c6 5d 7e     srwi    r6,r6,21
 27c:   55 4a 38 28     rlwinm  r10,r10,7,0,20
 280:   54 e7 a3 3e     srwi    r7,r7,12
 284:   55 29 68 16     rlwinm  r9,r9,13,0,11
 288:   7d 64 5b 78     mr      r4,r11
 28c:   7c a8 42 78     xor     r8,r5,r8
 290:   7c ca 52 78     xor     r10,r6,r10
 294:   7c e9 4a 78     xor     r9,r7,r9
 298:   50 04 2e fe     rlwimi  r4,r0,5,27,31
 29c:   42 00 ff b4     bdnz    250 <prandom_seed_full_state+0x7c>

Which replaces the 10 calls to prandom_u32_state():

  fc:   91 3f 00 0c     stw     r9,12(r31)
 100:   7f e3 fb 78     mr      r3,r31
 104:   48 00 00 01     bl      104 <prandom_seed_full_state+0x88>
                        104: R_PPC_REL24        prandom_u32_state
 108:   7f e3 fb 78     mr      r3,r31
 10c:   48 00 00 01     bl      10c <prandom_seed_full_state+0x90>
                        10c: R_PPC_REL24        prandom_u32_state
 110:   7f e3 fb 78     mr      r3,r31
 114:   48 00 00 01     bl      114 <prandom_seed_full_state+0x98>
                        114: R_PPC_REL24        prandom_u32_state
 118:   7f e3 fb 78     mr      r3,r31
 11c:   48 00 00 01     bl      11c <prandom_seed_full_state+0xa0>
                        11c: R_PPC_REL24        prandom_u32_state
 120:   7f e3 fb 78     mr      r3,r31
 124:   48 00 00 01     bl      124 <prandom_seed_full_state+0xa8>
                        124: R_PPC_REL24        prandom_u32_state
 128:   7f e3 fb 78     mr      r3,r31
 12c:   48 00 00 01     bl      12c <prandom_seed_full_state+0xb0>
                        12c: R_PPC_REL24        prandom_u32_state
 130:   7f e3 fb 78     mr      r3,r31
 134:   48 00 00 01     bl      134 <prandom_seed_full_state+0xb8>
                        134: R_PPC_REL24        prandom_u32_state
 138:   7f e3 fb 78     mr      r3,r31
 13c:   48 00 00 01     bl      13c <prandom_seed_full_state+0xc0>
                        13c: R_PPC_REL24        prandom_u32_state
 140:   7f e3 fb 78     mr      r3,r31
 144:   48 00 00 01     bl      144 <prandom_seed_full_state+0xc8>
                        144: R_PPC_REL24        prandom_u32_state
 148:   80 01 00 24     lwz     r0,36(r1)
 14c:   7f e3 fb 78     mr      r3,r31
 150:   83 e1 00 1c     lwz     r31,28(r1)
 154:   7c 08 03 a6     mtlr    r0
 158:   38 21 00 20     addi    r1,r1,32
 15c:   48 00 00 00     b       15c <prandom_seed_full_state+0xe0>
                        15c: R_PPC_REL24        prandom_u32_state

So approximately the same size in instructions, with better performance.

> I'm not really sure if 708 is good or bad...

That's in the noise compared to the overall size of vmlinux, but if we
change it to a loop we also reduce pressure on the cache.

Christophe
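For readers who want to experiment with the loop-based warm-up outside the kernel, here is a minimal userspace sketch. It assumes the four-component Tausworthe step used by lib/random32.c; the `struct rnd_state` stand-in, the GCC/Clang `always_inline` attribute spelling, and the names `prandom_warmup_loop()`, `prandom_warmup_unrolled()` and `check_warmup_variants()` are illustrative and not part of the posted patch. It only checks that the loop leaves the state identical to ten unrolled calls; it says nothing about code size or speed.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's struct rnd_state (four 32-bit words). */
struct rnd_state {
	uint32_t s1, s2, s3, s4;
};

/* One Tausworthe component step (assumed to match lib/random32.c). */
#define TAUSWORTHE(s, a, b, c, d) \
	((((s) & (c)) << (d)) ^ ((((s) << (a)) ^ (s)) >> (b)))

static inline __attribute__((always_inline))
uint32_t prandom_u32_state(struct rnd_state *state)
{
	state->s1 = TAUSWORTHE(state->s1,  6U, 13U, 4294967294U, 18U);
	state->s2 = TAUSWORTHE(state->s2,  2U, 27U, 4294967288U,  2U);
	state->s3 = TAUSWORTHE(state->s3, 13U, 21U, 4294967280U,  7U);
	state->s4 = TAUSWORTHE(state->s4,  3U, 12U, 4294967168U, 13U);

	return state->s1 ^ state->s2 ^ state->s3 ^ state->s4;
}

/* Loop form, as in the diff above. */
static void prandom_warmup_loop(struct rnd_state *state)
{
	for (int i = 0; i < 10; i++)
		prandom_u32_state(state);
}

/* Original form: ten back-to-back calls. */
static void prandom_warmup_unrolled(struct rnd_state *state)
{
	prandom_u32_state(state);
	prandom_u32_state(state);
	prandom_u32_state(state);
	prandom_u32_state(state);
	prandom_u32_state(state);
	prandom_u32_state(state);
	prandom_u32_state(state);
	prandom_u32_state(state);
	prandom_u32_state(state);
	prandom_u32_state(state);
}

/* Hypothetical self-check: both variants must leave the same state. */
static void check_warmup_variants(void)
{
	/* Arbitrary seed; chosen to respect s1 >= 2, s2 >= 8, s3 >= 16, s4 >= 128. */
	struct rnd_state a = { 0x12345678U, 0x9abcdef0U, 0x0f1e2d3cU, 0x4b5a6978U };
	struct rnd_state b = a;

	prandom_warmup_loop(&a);
	prandom_warmup_unrolled(&b);

	assert(a.s1 == b.s1 && a.s2 == b.s2 && a.s3 == b.s3 && a.s4 == b.s4);
}
```

Calling `check_warmup_variants()` from a small `main()` is enough to exercise it.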
* Re: [PATCH v3 2/3] prandom: Convert prandom_u32_state() to __always_inline
2026-01-02 13:39 ` Jason A. Donenfeld
2026-01-02 14:09 ` Ryan Roberts
@ 2026-01-02 22:54 ` David Laight
1 sibling, 0 replies; 9+ messages in thread
From: David Laight @ 2026-01-02 22:54 UTC (permalink / raw)
To: Jason A. Donenfeld
Cc: Ryan Roberts, Catalin Marinas, Will Deacon, Huacai Chen,
Madhavan Srinivasan, Michael Ellerman, Paul Walmsley, Palmer Dabbelt,
Albert Ou, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland, Ard Biesheuvel,
Jeremy Linton, linux-kernel, linux-arm-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, linux-hardening

On Fri, 2 Jan 2026 14:39:21 +0100
"Jason A. Donenfeld" <Jason@zx2c4.com> wrote:

> Hi Ryan,
...
> > +static __always_inline u32 prandom_u32_state(struct rnd_state *state)
>
> Why not just normal `inline`? Is gcc disagreeing with the inlinability
> of this function?

gcc has a mind of its own when it comes to inlining.
If there weren't some massive functions marked 'inline' that should never
really be inlined then making 'inline' '__always_inline' would make sense.
But first an audit would be needed.
(This has come up several times in the past.)

But if you need a function to be inlined (for any reason) it needs to
be always_inline.

Whether there should be a non-inlined 'option' here is another matter.
There could be a normal function that calls the inlined version.

	David

>
> Jason
>
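The "normal function that calls the inlined version" idea can be sketched in plain C as below. This is only an illustration under stated assumptions: it uses the GCC/Clang attribute spellings directly rather than the kernel's `__always_inline` macro, and the generator is a simple xorshift32 stand-in, not the Tausworthe generator behind prandom_u32_state(). The names `demo_rng`, `demo_rng_next_inline()` and `demo_rng_next()` are hypothetical.

```c
#include <stdint.h>

/* Hypothetical one-word generator state, standing in for struct rnd_state. */
struct demo_rng {
	uint32_t x;
};

/*
 * Always-inlined body for hot paths (e.g. syscall entry), where the call
 * overhead and the state round-trip through memory matter.
 * xorshift32 is used here purely as a stand-in algorithm.
 */
static inline __attribute__((always_inline))
uint32_t demo_rng_next_inline(struct demo_rng *rng)
{
	uint32_t x = rng->x;

	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	rng->x = x;

	return x;
}

/*
 * Out-of-line wrapper for callers that prefer smaller text over speed.
 * There is exactly one copy of this function in the binary.
 */
__attribute__((noinline))
uint32_t demo_rng_next(struct demo_rng *rng)
{
	return demo_rng_next_inline(rng);
}
```

With this split, each call site chooses between the two explicitly instead of relying on the compiler's inlining heuristics; both entry points produce the same sequence for the same state.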
* [PATCH v3 3/3] randomize_kstack: Unify random source across arches
2026-01-02 13:11 [PATCH v3 0/3] Fix bugs and performance of kstack offset randomisation Ryan Roberts
2026-01-02 13:11 ` [PATCH v3 1/3] randomize_kstack: Maintain kstack_offset per task Ryan Roberts
2026-01-02 13:11 ` [PATCH v3 2/3] prandom: Convert prandom_u32_state() to __always_inline Ryan Roberts
@ 2026-01-02 13:11 ` Ryan Roberts
2 siblings, 0 replies; 9+ messages in thread
From: Ryan Roberts @ 2026-01-02 13:11 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland,
Jason A. Donenfeld, Ard Biesheuvel, Jeremy Linton
Cc: Ryan Roberts, linux-kernel, linux-arm-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, linux-hardening

Previously different architectures were using random sources of
differing strength and cost to decide the random kstack offset. A number
of architectures (loongarch, powerpc, s390, x86) were using their
timestamp counter, at whatever the frequency happened to be. Other
arches (arm64, riscv) were using entropy from the crng via
get_random_u16().

There have been concerns that in some cases the timestamp counters may
be too weak, because they can be easily guessed or influenced by user
space. And get_random_u16() has been shown to be too costly for the
level of protection kstack offset randomization provides.

So let's use a common, architecture-agnostic source of entropy; a
per-cpu prng, seeded at boot-time from the crng. This has a few
benefits:

- We can remove choose_random_kstack_offset(); That was only there to
  try to make the timestamp counter value a bit harder to influence
  from user space.

- The architecture code is simplified. All it has to do now is call
  add_random_kstack_offset() in the syscall path.

- The strength of the randomness can be reasoned about independently
  of the architecture.

- Arches previously using get_random_u16() now have much faster
  syscall paths, see below results.

There have been some claims that a prng may be less strong than the
timestamp counter if not regularly reseeded. But the prng has a period
of about 2^113. So as long as the prng state remains secret, it should
not be possible to guess. If the prng state can be accessed, we have
bigger problems.

Additionally, we are only consuming 6 bits to randomize the stack, so
there are only 64 possible random offsets. I assert that it would be
trivial for an attacker to brute force by repeating their attack and
waiting for the random stack offset to be the desired one. The prng
approach seems entirely proportional to this level of protection.

Performance data are provided below. The baseline is v6.18 with rndstack
on for each respective arch. (I)/(R) indicate statistically significant
improvement/regression. arm64 platform is AWS Graviton3 (m7g.metal).
x86_64 platform is AWS Sapphire Rapids (m7i.24xlarge):

+-----------------+--------------+---------------+---------------+
| Benchmark       | Result Class | per-task-prng | per-task-prng |
|                 |              | arm64 (metal) |   x86_64 (VM) |
+=================+==============+===============+===============+
| syscall/getpid  | mean (ns)    |    (I) -9.50% |   (I) -17.65% |
|                 | p99 (ns)     |   (I) -59.24% |   (I) -24.41% |
|                 | p99.9 (ns)   |   (I) -59.52% |   (I) -28.52% |
+-----------------+--------------+---------------+---------------+
| syscall/getppid | mean (ns)    |    (I) -9.52% |   (I) -19.24% |
|                 | p99 (ns)     |   (I) -59.25% |   (I) -25.03% |
|                 | p99.9 (ns)   |   (I) -59.50% |   (I) -28.17% |
+-----------------+--------------+---------------+---------------+
| syscall/invalid | mean (ns)    |   (I) -10.31% |   (I) -18.56% |
|                 | p99 (ns)     |   (I) -60.79% |   (I) -20.06% |
|                 | p99.9 (ns)   |   (I) -61.04% |   (I) -25.04% |
+-----------------+--------------+---------------+---------------+

I tested an earlier version of this change on x86 bare metal and it
showed a smaller but still significant improvement. The bare metal
system wasn't available this time around so testing was done in a VM
instance. I'm guessing the cost of rdtsc is higher for VMs.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/Kconfig                         |  5 ++-
 arch/arm64/kernel/syscall.c          | 11 ------
 arch/loongarch/kernel/syscall.c      | 11 ------
 arch/powerpc/kernel/syscall.c        | 12 -------
 arch/riscv/kernel/traps.c            | 12 -------
 arch/s390/include/asm/entry-common.h |  8 -----
 arch/x86/include/asm/entry-common.h  | 12 -------
 include/linux/randomize_kstack.h     | 52 +++++++++-------------------
 include/linux/sched.h                |  4 ---
 init/main.c                          |  8 +++++
 kernel/fork.c                        |  1 -
 11 files changed, 27 insertions(+), 109 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 31220f512b16..8591fe7b4ac1 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1516,9 +1516,8 @@ config HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
 	def_bool n
 	help
 	  An arch should select this symbol if it can support kernel stack
-	  offset randomization with calls to add_random_kstack_offset()
-	  during syscall entry and choose_random_kstack_offset() during
-	  syscall exit. Careful removal of -fstack-protector-strong and
+	  offset randomization with a call to add_random_kstack_offset()
+	  during syscall entry. Careful removal of -fstack-protector-strong and
 	  -fstack-protector should also be applied to the entry code and
 	  closely examined, as the artificial stack bump looks like an array
 	  to the compiler, so it will attempt to add canary checks regardless
diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
index c062badd1a56..358ddfbf1401 100644
--- a/arch/arm64/kernel/syscall.c
+++ b/arch/arm64/kernel/syscall.c
@@ -52,17 +52,6 @@ static void invoke_syscall(struct pt_regs *regs, unsigned int scno,
 	}

 	syscall_set_return_value(current, regs, 0, ret);
-
-	/*
-	 * This value will get limited by KSTACK_OFFSET_MAX(), which is 10
-	 * bits. The actual entropy will be further reduced by the compiler
-	 * when applying stack alignment constraints: the AAPCS mandates a
-	 * 16-byte aligned SP at function boundaries, which will remove the
-	 * 4 low bits from any entropy chosen here.
-	 *
-	 * The resulting 6 bits of entropy is seen in SP[9:4].
-	 */
-	choose_random_kstack_offset(get_random_u16());
 }

 static inline bool has_syscall_work(unsigned long flags)
diff --git a/arch/loongarch/kernel/syscall.c b/arch/loongarch/kernel/syscall.c
index 1249d82c1cd0..85da7e050d97 100644
--- a/arch/loongarch/kernel/syscall.c
+++ b/arch/loongarch/kernel/syscall.c
@@ -79,16 +79,5 @@ void noinstr __no_stack_protector do_syscall(struct pt_regs *regs)
 					   regs->regs[7], regs->regs[8], regs->regs[9]);
 	}

-	/*
-	 * This value will get limited by KSTACK_OFFSET_MAX(), which is 10
-	 * bits. The actual entropy will be further reduced by the compiler
-	 * when applying stack alignment constraints: 16-bytes (i.e. 4-bits)
-	 * aligned, which will remove the 4 low bits from any entropy chosen
-	 * here.
-	 *
-	 * The resulting 6 bits of entropy is seen in SP[9:4].
-	 */
-	choose_random_kstack_offset(get_cycles());
-
 	syscall_exit_to_user_mode(regs);
 }
diff --git a/arch/powerpc/kernel/syscall.c b/arch/powerpc/kernel/syscall.c
index be159ad4b77b..b3d8b0f9823b 100644
--- a/arch/powerpc/kernel/syscall.c
+++ b/arch/powerpc/kernel/syscall.c
@@ -173,17 +173,5 @@ notrace long system_call_exception(struct pt_regs *regs, unsigned long r0)
 	}
 #endif

-	/*
-	 * Ultimately, this value will get limited by KSTACK_OFFSET_MAX(),
-	 * so the maximum stack offset is 1k bytes (10 bits).
-	 *
-	 * The actual entropy will be further reduced by the compiler when
-	 * applying stack alignment constraints: the powerpc architecture
-	 * may have two kinds of stack alignment (16-bytes and 8-bytes).
-	 *
-	 * So the resulting 6 or 7 bits of entropy is seen in SP[9:4] or SP[9:3].
-	 */
-	choose_random_kstack_offset(mftb());
-
 	return ret;
 }
diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
index 80230de167de..79b285bdfd1a 100644
--- a/arch/riscv/kernel/traps.c
+++ b/arch/riscv/kernel/traps.c
@@ -342,18 +342,6 @@ void do_trap_ecall_u(struct pt_regs *regs)
 		if (syscall >= 0 && syscall < NR_syscalls)
 			syscall_handler(regs, syscall);

-		/*
-		 * Ultimately, this value will get limited by KSTACK_OFFSET_MAX(),
-		 * so the maximum stack offset is 1k bytes (10 bits).
-		 *
-		 * The actual entropy will be further reduced by the compiler when
-		 * applying stack alignment constraints: 16-byte (i.e. 4-bit) aligned
-		 * for RV32I or RV64I.
-		 *
-		 * The resulting 6 bits of entropy is seen in SP[9:4].
-		 */
-		choose_random_kstack_offset(get_random_u16());
-
 		syscall_exit_to_user_mode(regs);
 	} else {
 		irqentry_state_t state = irqentry_nmi_enter(regs);
diff --git a/arch/s390/include/asm/entry-common.h b/arch/s390/include/asm/entry-common.h
index 979af986a8fe..35450a485323 100644
--- a/arch/s390/include/asm/entry-common.h
+++ b/arch/s390/include/asm/entry-common.h
@@ -51,14 +51,6 @@ static __always_inline void arch_exit_to_user_mode(void)

 #define arch_exit_to_user_mode arch_exit_to_user_mode

-static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
-						  unsigned long ti_work)
-{
-	choose_random_kstack_offset(get_tod_clock_fast());
-}
-
-#define arch_exit_to_user_mode_prepare arch_exit_to_user_mode_prepare
-
 static __always_inline bool arch_in_rcu_eqs(void)
 {
 	if (IS_ENABLED(CONFIG_KVM))
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index ce3eb6d5fdf9..7535131c711b 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -82,18 +82,6 @@ static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
 	current_thread_info()->status &= ~(TS_COMPAT | TS_I386_REGS_POKED);
 #endif

-	/*
-	 * This value will get limited by KSTACK_OFFSET_MAX(), which is 10
-	 * bits. The actual entropy will be further reduced by the compiler
-	 * when applying stack alignment constraints (see cc_stack_align4/8 in
-	 * arch/x86/Makefile), which will remove the 3 (x86_64) or 2 (ia32)
-	 * low bits from any entropy chosen here.
-	 *
-	 * Therefore, final stack offset entropy will be 7 (x86_64) or
-	 * 8 (ia32) bits.
-	 */
-	choose_random_kstack_offset(rdtsc());
-
 	/* Avoid unnecessary reads of 'x86_ibpb_exit_to_user' */
 	if (cpu_feature_enabled(X86_FEATURE_IBPB_EXIT_TO_USER) &&
 	    this_cpu_read(x86_ibpb_exit_to_user)) {
diff --git a/include/linux/randomize_kstack.h b/include/linux/randomize_kstack.h
index 5d3916ca747c..024fc20e7762 100644
--- a/include/linux/randomize_kstack.h
+++ b/include/linux/randomize_kstack.h
@@ -6,6 +6,7 @@
 #include <linux/kernel.h>
 #include <linux/jump_label.h>
 #include <linux/percpu-defs.h>
+#include <linux/prandom.h>

 DECLARE_STATIC_KEY_MAYBE(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
 			 randomize_kstack_offset);
@@ -45,9 +46,22 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
 #define KSTACK_OFFSET_MAX(x)	((x) & 0b1111111100)
 #endif

+DECLARE_PER_CPU(struct rnd_state, kstack_rnd_state);
+
+static __always_inline u32 get_kstack_offset(void)
+{
+	struct rnd_state *state;
+	u32 rnd;
+
+	state = &get_cpu_var(kstack_rnd_state);
+	rnd = prandom_u32_state(state);
+	put_cpu_var(kstack_rnd_state);
+
+	return rnd;
+}
+
 /**
- * add_random_kstack_offset - Increase stack utilization by previously
- *			      chosen random offset
+ * add_random_kstack_offset - Increase stack utilization by a random offset.
  *
  * This should be used in the syscall entry path after user registers have been
  * stored to the stack. Preemption may be enabled. For testing the resulting
@@ -56,47 +70,15 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
 #define add_random_kstack_offset() do {					\
 	if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,	\
 				&randomize_kstack_offset)) {		\
-		u32 offset = current->kstack_offset;			\
+		u32 offset = get_kstack_offset();			\
 		u8 *ptr = __kstack_alloca(KSTACK_OFFSET_MAX(offset));	\
 		/* Keep allocation even after "ptr" loses scope. */	\
 		asm volatile("" :: "r"(ptr) : "memory");		\
 	}								\
 } while (0)

-/**
- * choose_random_kstack_offset - Choose the random offset for the next
- *				  add_random_kstack_offset()
- *
- * This should only be used during syscall exit. Preemption may be enabled. This
- * position in the syscall flow is done to frustrate attacks from userspace
- * attempting to learn the next offset:
- * - Maximize the timing uncertainty visible from userspace: if the
- *   offset is chosen at syscall entry, userspace has much more control
- *   over the timing between choosing offsets. "How long will we be in
- *   kernel mode?" tends to be more difficult to predict than "how long
- *   will we be in user mode?"
- * - Reduce the lifetime of the new offset sitting in memory during
- *   kernel mode execution. Exposure of "thread-local" memory content
- *   (e.g. current, percpu, etc) tends to be easier than arbitrary
- *   location memory exposure.
- */
-#define choose_random_kstack_offset(rand) do {				\
-	if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,	\
-				&randomize_kstack_offset)) {		\
-		u32 offset = current->kstack_offset;			\
-		offset = ror32(offset, 5) ^ (rand);			\
-		current->kstack_offset = offset;			\
-	}								\
-} while (0)
-
-static inline void random_kstack_task_init(struct task_struct *tsk)
-{
-	tsk->kstack_offset = 0;
-}
 #else /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
 #define add_random_kstack_offset()		do { } while (0)
-#define choose_random_kstack_offset(rand)	do { } while (0)
-#define random_kstack_task_init(tsk)		do { } while (0)
 #endif /* CONFIG_RANDOMIZE_KSTACK_OFFSET */

 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9e0080ed1484..d395f2810fac 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1591,10 +1591,6 @@ struct task_struct {
 	unsigned long			prev_lowest_stack;
 #endif

-#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
-	u32				kstack_offset;
-#endif
-
 #ifdef CONFIG_X86_MCE
 	void __user			*mce_vaddr;
 	__u64				mce_kflags;
diff --git a/init/main.c b/init/main.c
index 27fcbbde933e..8626e048095a 100644
--- a/init/main.c
+++ b/init/main.c
@@ -830,6 +830,14 @@ static inline void initcall_debug_enable(void)
 #ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
 DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
 			   randomize_kstack_offset);
+DEFINE_PER_CPU(struct rnd_state, kstack_rnd_state);
+
+static int __init random_kstack_init(void)
+{
+	prandom_seed_full_state(&kstack_rnd_state);
+	return 0;
+}
+late_initcall(random_kstack_init);

 static int __init early_randomize_kstack_offset(char *buf)
 {
diff --git a/kernel/fork.c b/kernel/fork.c
index b061e1edbc43..68d9766288fd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2232,7 +2232,6 @@ __latent_entropy struct task_struct *copy_process(
 	if (retval)
 		goto bad_fork_cleanup_io;

-	random_kstack_task_init(p);
 	stackleak_task_init(p);

 	if (pid != &init_struct_pid) {
--
2.43.0
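
The "6 bits, 64 possible offsets" arithmetic from the commit message can be checked with a small userspace sketch. It assumes the `KSTACK_OFFSET_MAX()` mask shown in the hunk above (0b1111111100, written here as 0x3FC) and models the compiler's stack alignment as simply clearing the low bits of the offset, which is how the removed arch comments describe it; a real compiler may round differently. `count_distinct_offsets()` is a hypothetical helper, not a kernel function.

```c
#include <stdint.h>
#include <string.h>

/* Mask from the quoted hunk: keeps bits [9:2]; 0x3FC == 0b1111111100. */
#define KSTACK_OFFSET_MAX(x)	((x) & 0x3FCu)

/*
 * Hypothetical helper: count how many distinct stack offsets survive once
 * the offset is aligned down to 'align' bytes (a power of two). Alignment
 * is modelled as clearing low bits, which is an assumption of this sketch.
 */
static unsigned int count_distinct_offsets(uint32_t align)
{
	unsigned char seen[1024];	/* masked offsets are at most 0x3FC */
	unsigned int count = 0;

	memset(seen, 0, sizeof(seen));

	/* Any 16-bit input range covers every value the mask can produce. */
	for (uint32_t rnd = 0; rnd < 0x10000u; rnd++) {
		uint32_t offset = KSTACK_OFFSET_MAX(rnd) & ~(align - 1);

		if (!seen[offset]) {
			seen[offset] = 1;
			count++;
		}
	}

	return count;
}
```

Under this model, `count_distinct_offsets(16)` gives 64 (6 bits, SP[9:4]) and `count_distinct_offsets(8)` gives 128 (7 bits, SP[9:3]), matching the figures in the removed arm64 and x86_64 comments.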

end of thread, other threads:[~2026-01-03 8:01 UTC | newest]

Thread overview: 9+ messages
2026-01-02 13:11 [PATCH v3 0/3] Fix bugs and performance of kstack offset randomisation Ryan Roberts
2026-01-02 13:11 ` [PATCH v3 1/3] randomize_kstack: Maintain kstack_offset per task Ryan Roberts
2026-01-02 22:44   ` David Laight
2026-01-02 13:11 ` [PATCH v3 2/3] prandom: Convert prandom_u32_state() to __always_inline Ryan Roberts
2026-01-02 13:39   ` Jason A. Donenfeld
2026-01-02 14:09     ` Ryan Roberts
2026-01-03 8:00       ` Christophe Leroy (CS GROUP)
2026-01-02 22:54     ` David Laight
2026-01-02 13:11 ` [PATCH v3 3/3] randomize_kstack: Unify random source across arches Ryan Roberts