* [PATCH v5 0/2] Fix bugs and performance of kstack offset randomisation
@ 2026-03-03 15:08 Ryan Roberts
2026-03-03 15:08 ` [PATCH v5 1/2] randomize_kstack: Maintain kstack_offset per task Ryan Roberts
` (3 more replies)
0 siblings, 4 replies; 5+ messages in thread
From: Ryan Roberts @ 2026-03-03 15:08 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland,
Jason A. Donenfeld, Ard Biesheuvel, Jeremy Linton, David Laight
Cc: Ryan Roberts, linux-kernel, linux-arm-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, linux-hardening
[Kees; I'm hoping this is now good-to-go via your hardening tree? It would be
good to get some linux-next testing.]
Hi All,
As I reported at [1], kstack offset randomisation suffers from a couple of bugs
and, on arm64 at least, the performance is poor. This series attempts to fix
both; patch 1 provides back-portable fixes for the functional bugs. Patch 2
proposes a performance improvement approach.
I've looked at a few different options but ultimately decided that Jeremy's
original prng approach is the fastest. I made the argument that this approach is
secure "enough" in the RFC [2] and the responses indicated agreement.
More details in the commit logs.
Performance
===========
Mean and tail performance of 3 "small" syscalls was measured. Each syscall was
made 10 million times, with every invocation individually measured and binned.
These results have low noise, so I'm confident that they are trustworthy.
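Roughly, the measurement loop looks like this (an illustrative sketch only, not
the exact harness I used; the iteration count and the use of raw syscall() are
assumptions):

  #define _GNU_SOURCE
  #include <stdint.h>
  #include <stdio.h>
  #include <sys/syscall.h>
  #include <time.h>
  #include <unistd.h>

  #define ITERS 10000000UL

  static inline uint64_t now_ns(void)
  {
          struct timespec ts;

          clock_gettime(CLOCK_MONOTONIC, &ts);
          return (uint64_t)ts.tv_sec * 1000000000UL + ts.tv_nsec;
  }

  int main(void)
  {
          static uint64_t sample[ITERS];  /* one measurement per call */

          for (unsigned long i = 0; i < ITERS; i++) {
                  uint64_t t0 = now_ns();

                  syscall(SYS_getpid);    /* raw syscall; avoids libc caching */
                  sample[i] = now_ns() - t0;
          }
          /* sort sample[] and report mean, p99 and p99.9 */
          printf("collected %lu samples\n", ITERS);
          return 0;
  }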
The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
the performance cost of turning it on without any changes to the
implementation, then the reduced performance cost of turning it on with my
changes applied.
**NOTE**: The results below were generated using the RFC patches, but there is
no meaningful change since then, so the numbers are still valid. I've also
rerun the tests with this version on top of v7.0-rc2 on arm64 and confirmed
similar results.
arm64 (AWS Graviton3):
+-----------------+--------------+-------------+---------------+
| Benchmark | Result Class | v6.18-rc5 | per-cpu-prng |
| | | rndstack-on | |
| | | | |
+=================+==============+=============+===============+
| syscall/getpid | mean (ns) | (R) 15.62% | (R) 3.43% |
| | p99 (ns) | (R) 155.01% | (R) 3.20% |
| | p99.9 (ns) | (R) 156.71% | (R) 2.93% |
+-----------------+--------------+-------------+---------------+
| syscall/getppid | mean (ns) | (R) 14.09% | (R) 2.12% |
| | p99 (ns) | (R) 152.81% | 1.55% |
| | p99.9 (ns) | (R) 153.67% | 1.77% |
+-----------------+--------------+-------------+---------------+
| syscall/invalid | mean (ns) | (R) 13.89% | (R) 3.32% |
| | p99 (ns) | (R) 165.82% | (R) 3.51% |
| | p99.9 (ns) | (R) 168.83% | (R) 3.77% |
+-----------------+--------------+-------------+---------------+
Because arm64 was previously using get_random_u16(), syscall exit was expensive
whenever there were no buffered bits left and it had to call into the crng to
refill. That's what caused the enormous tail latency.
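For context, the shape of the batched-entropy scheme behind get_random_u16() is
roughly the following (a simplified sketch from my reading of
drivers/char/random.c, not the real code; batch size and the crng_refill()
helper are stand-ins for the actual refill path):

  struct entropy_batch {
          u16 entropy[32];        /* buffered random values */
          unsigned int position;  /* next unused slot */
  };

  static u16 batched_get_random_u16(struct entropy_batch *batch)
  {
          if (batch->position >= ARRAY_SIZE(batch->entropy)) {
                  /* Slow path: refill from the crng. This is the
                   * expensive call behind the p99/p99.9 spikes. */
                  crng_refill(batch->entropy, sizeof(batch->entropy));
                  batch->position = 0;
          }
          /* Fast path: hand out a buffered value. */
          return batch->entropy[batch->position++];
  }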
x86 (AWS Sapphire Rapids):
+-----------------+--------------+-------------+---------------+
| Benchmark | Result Class | v6.18-rc5 | per-cpu-prng |
| | | rndstack-on | |
| | | | |
+=================+==============+=============+===============+
| syscall/getpid | mean (ns) | (R) 13.32% | (R) 4.60% |
| | p99 (ns) | (R) 13.38% | (R) 18.08% |
| | p99.9 (ns) | 16.26% | (R) 19.38% |
+-----------------+--------------+-------------+---------------+
| syscall/getppid | mean (ns) | (R) 11.96% | (R) 5.26% |
| | p99 (ns) | (R) 11.83% | (R) 8.35% |
| | p99.9 (ns) | (R) 11.42% | (R) 22.37% |
+-----------------+--------------+-------------+---------------+
| syscall/invalid | mean (ns) | (R) 10.58% | (R) 2.91% |
| | p99 (ns) | (R) 10.51% | (R) 4.36% |
| | p99.9 (ns) | (R) 10.35% | (R) 21.97% |
+-----------------+--------------+-------------+---------------+
I was surprised to see that the baseline cost on x86 is 10-13% since it is just
using rdtsc. But as I say, I believe the results are accurate.
Changes since v4 [5]
====================
- Moved add_random_kstack_offset() later in syscall entry code for powerpc, s390
and x86. On these platforms it was previously within noinstr sections but for
some exotic Kconfigs, [get|put]_cpu_var() was calling out to instrumentable
code. (reported by kernel test robot)
- Removed what was previously patch 2 (inline version of prandom_u32_state()).
With the above change, there is no longer an issue with calling the
out-of-line version.
Changes since v3 [4]
====================
- Patch 1: Fixed typo in commit log (per David L)
- Patch 2: Reinstated prandom_u32_state() as out-of-line function, which
forwards to inline version (per David L)
- Patch 3: Added supplementary info about benefits of removing
choose_random_kstack_offset() (per Mark R)
Changes since v2 [3]
====================
- Moved late_initcall() to initialize kstack_rnd_state out of
randomize_kstack.h and into main.c. (issue noticed by kernel test robot)
Changes since v1 (RFC) [2]
==========================
- Introduced patch 2 to make prandom_u32_state() __always_inline (needed since
it's called from noinstr code)
- In patch 3, prng is now per-cpu instead of per-task (per Ard)
[1] https://lore.kernel.org/all/dd8c37bc-795f-4c7a-9086-69e584d8ab24@arm.com/
[2] https://lore.kernel.org/all/20251127105958.2427758-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/all/20251215163520.1144179-1-ryan.roberts@arm.com/
[4] https://lore.kernel.org/all/20260102131156.3265118-1-ryan.roberts@arm.com/
[5] https://lore.kernel.org/all/20260119130122.1283821-1-ryan.roberts@arm.com
Thanks,
Ryan
Ryan Roberts (2):
randomize_kstack: Maintain kstack_offset per task
randomize_kstack: Unify random source across arches
arch/Kconfig | 5 ++-
arch/arm64/kernel/syscall.c | 11 ------
arch/loongarch/kernel/syscall.c | 11 ------
arch/powerpc/kernel/syscall.c | 16 ++-------
arch/riscv/kernel/traps.c | 12 -------
arch/s390/include/asm/entry-common.h | 8 -----
arch/s390/kernel/syscall.c | 2 +-
arch/x86/entry/syscall_32.c | 4 +--
arch/x86/entry/syscall_64.c | 2 +-
arch/x86/include/asm/entry-common.h | 12 -------
include/linux/randomize_kstack.h | 54 +++++++++++-----------------
init/main.c | 9 ++++-
kernel/fork.c | 1 +
13 files changed, 37 insertions(+), 110 deletions(-)
--
2.43.0
* [PATCH v5 1/2] randomize_kstack: Maintain kstack_offset per task
2026-03-03 15:08 [PATCH v5 0/2] Fix bugs and performance of kstack offset randomisation Ryan Roberts
@ 2026-03-03 15:08 ` Ryan Roberts
2026-03-03 15:08 ` [PATCH v5 2/2] randomize_kstack: Unify random source across arches Ryan Roberts
` (2 subsequent siblings)
3 siblings, 0 replies; 5+ messages in thread
From: Ryan Roberts @ 2026-03-03 15:08 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland,
Jason A. Donenfeld, Ard Biesheuvel, Jeremy Linton, David Laight
Cc: Ryan Roberts, linux-kernel, linux-arm-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, linux-hardening, stable
kstack_offset was previously maintained per-cpu, but this caused a
couple of issues. So let's instead make it per-task.
Issue 1: add_random_kstack_offset() and choose_random_kstack_offset()
were expected and required to be called with interrupts and preemption
disabled so that they could safely manipulate per-cpu state. But arm64,
loongarch and risc-v call them with interrupts and preemption enabled. I
don't _think_ this causes any functional issues, but it's certainly
unexpected and could lead to manipulating the wrong cpu's state, which
could cause a minor performance degradation due to bouncing the cache
lines. By maintaining the state per-task, those functions can safely be
called in preemptible context.
Issue 2: add_random_kstack_offset() is called before executing the
syscall and expands the stack using a previously chosen random offset.
choose_random_kstack_offset() is called after executing the syscall and
chooses and stores a new random offset for the next syscall. With
per-cpu storage for this offset, an attacker could force cpu migration
during the execution of the syscall and prevent the offset from being
updated for the original cpu such that it is predictable for the next
syscall on that cpu. By maintaining the state per-task, this problem
goes away because the per-task random offset is updated after the
syscall regardless of which cpu it is executing on.
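Concretely, the problematic interleaving with per-cpu state looks like
this (hypothetical timeline):

  CPU0: add_random_kstack_offset()      /* consumes CPU0's offset */
  CPU0: ... syscall body executes ...
        <attacker forces migration to CPU1>
  CPU1: choose_random_kstack_offset()   /* updates CPU1's offset only */

CPU0's offset is never refreshed, so the next syscall entered on CPU0
reuses a value that may already be known. With per-task state, the
offset updated at syscall exit is always the one consumed by the same
task at its next syscall entry, on whichever cpu that happens.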
Fixes: 39218ff4c625 ("stack: Optionally randomize kernel stack offset each syscall")
Closes: https://lore.kernel.org/all/dd8c37bc-795f-4c7a-9086-69e584d8ab24@arm.com/
Cc: stable@vger.kernel.org
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
include/linux/randomize_kstack.h | 26 +++++++++++++++-----------
include/linux/sched.h | 4 ++++
init/main.c | 1 -
kernel/fork.c | 2 ++
4 files changed, 21 insertions(+), 12 deletions(-)
diff --git a/include/linux/randomize_kstack.h b/include/linux/randomize_kstack.h
index 1d982dbdd0d0b..5d3916ca747cc 100644
--- a/include/linux/randomize_kstack.h
+++ b/include/linux/randomize_kstack.h
@@ -9,7 +9,6 @@
DECLARE_STATIC_KEY_MAYBE(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
randomize_kstack_offset);
-DECLARE_PER_CPU(u32, kstack_offset);
/*
* Do not use this anywhere else in the kernel. This is used here because
@@ -50,15 +49,14 @@ DECLARE_PER_CPU(u32, kstack_offset);
* add_random_kstack_offset - Increase stack utilization by previously
* chosen random offset
*
- * This should be used in the syscall entry path when interrupts and
- * preempt are disabled, and after user registers have been stored to
- * the stack. For testing the resulting entropy, please see:
- * tools/testing/selftests/lkdtm/stack-entropy.sh
+ * This should be used in the syscall entry path after user registers have been
+ * stored to the stack. Preemption may be enabled. For testing the resulting
+ * entropy, please see: tools/testing/selftests/lkdtm/stack-entropy.sh
*/
#define add_random_kstack_offset() do { \
if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
&randomize_kstack_offset)) { \
- u32 offset = raw_cpu_read(kstack_offset); \
+ u32 offset = current->kstack_offset; \
u8 *ptr = __kstack_alloca(KSTACK_OFFSET_MAX(offset)); \
/* Keep allocation even after "ptr" loses scope. */ \
asm volatile("" :: "r"(ptr) : "memory"); \
@@ -69,9 +67,9 @@ DECLARE_PER_CPU(u32, kstack_offset);
* choose_random_kstack_offset - Choose the random offset for the next
* add_random_kstack_offset()
*
- * This should only be used during syscall exit when interrupts and
- * preempt are disabled. This position in the syscall flow is done to
- * frustrate attacks from userspace attempting to learn the next offset:
+ * This should only be used during syscall exit. Preemption may be enabled. This
+ * position in the syscall flow is done to frustrate attacks from userspace
+ * attempting to learn the next offset:
* - Maximize the timing uncertainty visible from userspace: if the
* offset is chosen at syscall entry, userspace has much more control
* over the timing between choosing offsets. "How long will we be in
@@ -85,14 +83,20 @@ DECLARE_PER_CPU(u32, kstack_offset);
#define choose_random_kstack_offset(rand) do { \
if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
&randomize_kstack_offset)) { \
- u32 offset = raw_cpu_read(kstack_offset); \
+ u32 offset = current->kstack_offset; \
offset = ror32(offset, 5) ^ (rand); \
- raw_cpu_write(kstack_offset, offset); \
+ current->kstack_offset = offset; \
} \
} while (0)
+
+static inline void random_kstack_task_init(struct task_struct *tsk)
+{
+ tsk->kstack_offset = 0;
+}
#else /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
#define add_random_kstack_offset() do { } while (0)
#define choose_random_kstack_offset(rand) do { } while (0)
+#define random_kstack_task_init(tsk) do { } while (0)
#endif /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a7b4a980eb2f0..8358e430dd7fd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1592,6 +1592,10 @@ struct task_struct {
unsigned long prev_lowest_stack;
#endif
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+ u32 kstack_offset;
+#endif
+
#ifdef CONFIG_X86_MCE
void __user *mce_vaddr;
__u64 mce_kflags;
diff --git a/init/main.c b/init/main.c
index 1cb395dd94e43..0a1d8529212e9 100644
--- a/init/main.c
+++ b/init/main.c
@@ -833,7 +833,6 @@ static inline void initcall_debug_enable(void)
#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
randomize_kstack_offset);
-DEFINE_PER_CPU(u32, kstack_offset);
static int __init early_randomize_kstack_offset(char *buf)
{
diff --git a/kernel/fork.c b/kernel/fork.c
index 65113a304518a..5715adeb6adfe 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -95,6 +95,7 @@
#include <linux/thread_info.h>
#include <linux/kstack_erase.h>
#include <linux/kasan.h>
+#include <linux/randomize_kstack.h>
#include <linux/scs.h>
#include <linux/io_uring.h>
#include <linux/io_uring_types.h>
@@ -2233,6 +2234,7 @@ __latent_entropy struct task_struct *copy_process(
if (retval)
goto bad_fork_cleanup_io;
+ random_kstack_task_init(p);
stackleak_task_init(p);
if (pid != &init_struct_pid) {
--
2.43.0
* [PATCH v5 2/2] randomize_kstack: Unify random source across arches
2026-03-03 15:08 [PATCH v5 0/2] Fix bugs and performance of kstack offset randomisation Ryan Roberts
2026-03-03 15:08 ` [PATCH v5 1/2] randomize_kstack: Maintain kstack_offset per task Ryan Roberts
@ 2026-03-03 15:08 ` Ryan Roberts
2026-03-11 8:32 ` [PATCH v5 0/2] Fix bugs and performance of kstack offset randomisation Ryan Roberts
2026-03-25 4:14 ` Kees Cook
3 siblings, 0 replies; 5+ messages in thread
From: Ryan Roberts @ 2026-03-03 15:08 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland,
Jason A. Donenfeld, Ard Biesheuvel, Jeremy Linton, David Laight
Cc: Ryan Roberts, linux-kernel, linux-arm-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, linux-hardening
Previously, different architectures used random sources of differing
strength and cost to choose the random kstack offset. A number of
architectures (loongarch, powerpc, s390, x86) used their timestamp
counter, at whatever frequency it happened to run. Other arches (arm64,
riscv) used entropy from the crng via get_random_u16().
There have been concerns that in some cases the timestamp counters may
be too weak, because they can be easily guessed or influenced by user
space. And get_random_u16() has been shown to be too costly for the
level of protection kstack offset randomization provides.
So let's use a common, architecture-agnostic source of entropy: a
per-cpu prng, seeded at boot time from the crng. This has a few
benefits:
- We can remove choose_random_kstack_offset(), which was only there to
try to make the timestamp counter value a bit harder to influence
from user space [*].
- The architecture code is simplified. All it has to do now is call
add_random_kstack_offset() in the syscall path.
- The strength of the randomness can be reasoned about independently
of the architecture.
- Arches previously using get_random_u16() now have much faster
syscall paths, see below results.
[*] Additionally, this gets rid of some redundant work on s390 and x86.
Before this patch, those architectures called
choose_random_kstack_offset() under arch_exit_to_user_mode_prepare(),
which is also called for exception returns to userspace that are *not*
syscalls (e.g. regular interrupts). Getting rid of
choose_random_kstack_offset() avoids a small amount of redundant work
for the non-syscall cases.
In some configurations, add_random_kstack_offset() will now call
instrumentable code, so for a couple of arches, I have moved the call a
bit later to the first point where instrumentation is allowed. This
doesn't impact the efficacy of the mechanism.
There have been some claims that a prng may be less strong than the
timestamp counter if not regularly reseeded. But the prng has a period
of about 2^113, so as long as the prng state remains secret, it should
not be practical to guess future outputs. If the prng state can be
accessed, we have bigger problems.
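For reference, prandom_u32_state() is the four-tap Tausworthe generator
(LFSR113) from lib/random32.c; sketched from memory below, so treat
lib/random32.c as authoritative:

  struct rnd_state {
          u32 s1, s2, s3, s4;
  };

  /* Combined period of the four taps is ~2^113. */
  static u32 lfsr113_next(struct rnd_state *state)
  {
  #define TAUSWORTHE(s, a, b, c, d) (((s & c) << d) ^ (((s << a) ^ s) >> b))
          state->s1 = TAUSWORTHE(state->s1,  6U, 13U, 4294967294U, 18U);
          state->s2 = TAUSWORTHE(state->s2,  2U, 27U, 4294967288U,  2U);
          state->s3 = TAUSWORTHE(state->s3, 13U, 21U, 4294967280U,  7U);
          state->s4 = TAUSWORTHE(state->s4,  3U, 12U, 4294967168U, 13U);
  #undef TAUSWORTHE
          return state->s1 ^ state->s2 ^ state->s3 ^ state->s4;
  }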
Additionally, we are only consuming 6 bits to randomize the stack, so
there are only 64 possible random offsets. I assert that it would be
trivial for an attacker to brute-force this by repeating their attack
and waiting for the random stack offset to be the desired one. The prng
approach seems entirely proportional to this level of protection.
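To make the 6 bits concrete, here is a standalone sketch applying the
KSTACK_OFFSET_MAX() mask from randomize_kstack.h followed by the 16-byte
stack alignment the compiler imposes (modelling the alignment as masking
the low 4 bits is an assumption for illustration):

  #include <stdint.h>
  #include <stdio.h>

  #define KSTACK_OFFSET_MAX(x)    ((x) & 0b1111111100)    /* 10-bit mask */

  int main(void)
  {
          uint8_t seen[1024] = { 0 };
          unsigned int distinct = 0;

          for (uint32_t rnd = 0; rnd < 1024; rnd++) {
                  /* Mask as the kernel does, then drop the low 4 bits as
                   * 16-byte stack alignment effectively does. */
                  uint32_t off = KSTACK_OFFSET_MAX(rnd) & ~0xfU;

                  if (!seen[off]) {
                          seen[off] = 1;
                          distinct++;
                  }
          }
          printf("%u distinct offsets\n", distinct);      /* prints 64 */
          return 0;
  }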
Performance data are provided below. The baseline is v6.18 with rndstack
on for each respective arch. (I)/(R) indicate statistically significant
improvement/regression. arm64 platform is AWS Graviton3 (m7g.metal).
x86_64 platform is AWS Sapphire Rapids (m7i.24xlarge):
+-----------------+--------------+---------------+---------------+
| Benchmark | Result Class | per-cpu-prng | per-cpu-prng |
| | | arm64 (metal) | x86_64 (VM) |
+=================+==============+===============+===============+
| syscall/getpid | mean (ns) | (I) -9.50% | (I) -17.65% |
| | p99 (ns) | (I) -59.24% | (I) -24.41% |
| | p99.9 (ns) | (I) -59.52% | (I) -28.52% |
+-----------------+--------------+---------------+---------------+
| syscall/getppid | mean (ns) | (I) -9.52% | (I) -19.24% |
| | p99 (ns) | (I) -59.25% | (I) -25.03% |
| | p99.9 (ns) | (I) -59.50% | (I) -28.17% |
+-----------------+--------------+---------------+---------------+
| syscall/invalid | mean (ns) | (I) -10.31% | (I) -18.56% |
| | p99 (ns) | (I) -60.79% | (I) -20.06% |
| | p99.9 (ns) | (I) -61.04% | (I) -25.04% |
+-----------------+--------------+---------------+---------------+
I tested an earlier version of this change on x86 bare metal and it
showed a smaller but still significant improvement. The bare metal
system wasn't available this time around so testing was done in a VM
instance. I'm guessing the cost of rdtsc is higher for VMs.
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
arch/Kconfig | 5 ++-
arch/arm64/kernel/syscall.c | 11 ------
arch/loongarch/kernel/syscall.c | 11 ------
arch/powerpc/kernel/syscall.c | 16 ++-------
arch/riscv/kernel/traps.c | 12 -------
arch/s390/include/asm/entry-common.h | 8 -----
arch/s390/kernel/syscall.c | 2 +-
arch/x86/entry/syscall_32.c | 4 +--
arch/x86/entry/syscall_64.c | 2 +-
arch/x86/include/asm/entry-common.h | 12 -------
include/linux/randomize_kstack.h | 52 +++++++++-------------------
include/linux/sched.h | 4 ---
init/main.c | 8 +++++
kernel/fork.c | 1 -
14 files changed, 33 insertions(+), 115 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 102ddbd4298ef..f134527ace10e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1519,9 +1519,8 @@ config HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
def_bool n
help
An arch should select this symbol if it can support kernel stack
- offset randomization with calls to add_random_kstack_offset()
- during syscall entry and choose_random_kstack_offset() during
- syscall exit. Careful removal of -fstack-protector-strong and
+ offset randomization with a call to add_random_kstack_offset()
+ during syscall entry. Careful removal of -fstack-protector-strong and
-fstack-protector should also be applied to the entry code and
closely examined, as the artificial stack bump looks like an array
to the compiler, so it will attempt to add canary checks regardless
diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
index c062badd1a566..358ddfbf1401a 100644
--- a/arch/arm64/kernel/syscall.c
+++ b/arch/arm64/kernel/syscall.c
@@ -52,17 +52,6 @@ static void invoke_syscall(struct pt_regs *regs, unsigned int scno,
}
syscall_set_return_value(current, regs, 0, ret);
-
- /*
- * This value will get limited by KSTACK_OFFSET_MAX(), which is 10
- * bits. The actual entropy will be further reduced by the compiler
- * when applying stack alignment constraints: the AAPCS mandates a
- * 16-byte aligned SP at function boundaries, which will remove the
- * 4 low bits from any entropy chosen here.
- *
- * The resulting 6 bits of entropy is seen in SP[9:4].
- */
- choose_random_kstack_offset(get_random_u16());
}
static inline bool has_syscall_work(unsigned long flags)
diff --git a/arch/loongarch/kernel/syscall.c b/arch/loongarch/kernel/syscall.c
index 1249d82c1cd0a..85da7e050d977 100644
--- a/arch/loongarch/kernel/syscall.c
+++ b/arch/loongarch/kernel/syscall.c
@@ -79,16 +79,5 @@ void noinstr __no_stack_protector do_syscall(struct pt_regs *regs)
regs->regs[7], regs->regs[8], regs->regs[9]);
}
- /*
- * This value will get limited by KSTACK_OFFSET_MAX(), which is 10
- * bits. The actual entropy will be further reduced by the compiler
- * when applying stack alignment constraints: 16-bytes (i.e. 4-bits)
- * aligned, which will remove the 4 low bits from any entropy chosen
- * here.
- *
- * The resulting 6 bits of entropy is seen in SP[9:4].
- */
- choose_random_kstack_offset(get_cycles());
-
syscall_exit_to_user_mode(regs);
}
diff --git a/arch/powerpc/kernel/syscall.c b/arch/powerpc/kernel/syscall.c
index be159ad4b77bd..b762677f87371 100644
--- a/arch/powerpc/kernel/syscall.c
+++ b/arch/powerpc/kernel/syscall.c
@@ -20,8 +20,6 @@ notrace long system_call_exception(struct pt_regs *regs, unsigned long r0)
kuap_lock();
- add_random_kstack_offset();
-
if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
BUG_ON(irq_soft_mask_return() != IRQS_ALL_DISABLED);
@@ -30,6 +28,8 @@ notrace long system_call_exception(struct pt_regs *regs, unsigned long r0)
CT_WARN_ON(ct_state() == CT_STATE_KERNEL);
user_exit_irqoff();
+ add_random_kstack_offset();
+
BUG_ON(regs_is_unrecoverable(regs));
BUG_ON(!user_mode(regs));
BUG_ON(arch_irq_disabled_regs(regs));
@@ -173,17 +173,5 @@ notrace long system_call_exception(struct pt_regs *regs, unsigned long r0)
}
#endif
- /*
- * Ultimately, this value will get limited by KSTACK_OFFSET_MAX(),
- * so the maximum stack offset is 1k bytes (10 bits).
- *
- * The actual entropy will be further reduced by the compiler when
- * applying stack alignment constraints: the powerpc architecture
- * may have two kinds of stack alignment (16-bytes and 8-bytes).
- *
- * So the resulting 6 or 7 bits of entropy is seen in SP[9:4] or SP[9:3].
- */
- choose_random_kstack_offset(mftb());
-
return ret;
}
diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
index 5fb57fad188a9..461279a7bd864 100644
--- a/arch/riscv/kernel/traps.c
+++ b/arch/riscv/kernel/traps.c
@@ -344,18 +344,6 @@ void do_trap_ecall_u(struct pt_regs *regs)
syscall_handler(regs, syscall);
}
- /*
- * Ultimately, this value will get limited by KSTACK_OFFSET_MAX(),
- * so the maximum stack offset is 1k bytes (10 bits).
- *
- * The actual entropy will be further reduced by the compiler when
- * applying stack alignment constraints: 16-byte (i.e. 4-bit) aligned
- * for RV32I or RV64I.
- *
- * The resulting 6 bits of entropy is seen in SP[9:4].
- */
- choose_random_kstack_offset(get_random_u16());
-
syscall_exit_to_user_mode(regs);
} else {
irqentry_state_t state = irqentry_nmi_enter(regs);
diff --git a/arch/s390/include/asm/entry-common.h b/arch/s390/include/asm/entry-common.h
index 979af986a8feb..35450a4853239 100644
--- a/arch/s390/include/asm/entry-common.h
+++ b/arch/s390/include/asm/entry-common.h
@@ -51,14 +51,6 @@ static __always_inline void arch_exit_to_user_mode(void)
#define arch_exit_to_user_mode arch_exit_to_user_mode
-static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
- unsigned long ti_work)
-{
- choose_random_kstack_offset(get_tod_clock_fast());
-}
-
-#define arch_exit_to_user_mode_prepare arch_exit_to_user_mode_prepare
-
static __always_inline bool arch_in_rcu_eqs(void)
{
if (IS_ENABLED(CONFIG_KVM))
diff --git a/arch/s390/kernel/syscall.c b/arch/s390/kernel/syscall.c
index 795b6cca74c9b..1cf49af74a1df 100644
--- a/arch/s390/kernel/syscall.c
+++ b/arch/s390/kernel/syscall.c
@@ -96,8 +96,8 @@ void noinstr __do_syscall(struct pt_regs *regs, int per_trap)
{
unsigned long nr;
- add_random_kstack_offset();
enter_from_user_mode(regs);
+ add_random_kstack_offset();
regs->psw = get_lowcore()->svc_old_psw;
regs->int_code = get_lowcore()->svc_int_code;
update_timer_sys();
diff --git a/arch/x86/entry/syscall_32.c b/arch/x86/entry/syscall_32.c
index 8e829575e12f9..31b9492fe851d 100644
--- a/arch/x86/entry/syscall_32.c
+++ b/arch/x86/entry/syscall_32.c
@@ -247,7 +247,6 @@ __visible noinstr void do_int80_syscall_32(struct pt_regs *regs)
{
int nr = syscall_32_enter(regs);
- add_random_kstack_offset();
/*
* Subtlety here: if ptrace pokes something larger than 2^31-1 into
* orig_ax, the int return value truncates it. This matches
@@ -256,6 +255,7 @@ __visible noinstr void do_int80_syscall_32(struct pt_regs *regs)
nr = syscall_enter_from_user_mode(regs, nr);
instrumentation_begin();
+ add_random_kstack_offset();
do_syscall_32_irqs_on(regs, nr);
instrumentation_end();
@@ -268,7 +268,6 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
int nr = syscall_32_enter(regs);
int res;
- add_random_kstack_offset();
/*
* This cannot use syscall_enter_from_user_mode() as it has to
* fetch EBP before invoking any of the syscall entry work
@@ -277,6 +276,7 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
enter_from_user_mode(regs);
instrumentation_begin();
+ add_random_kstack_offset();
local_irq_enable();
/* Fetch EBP from where the vDSO stashed it. */
if (IS_ENABLED(CONFIG_X86_64)) {
diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c
index b6e68ea98b839..71f032504e731 100644
--- a/arch/x86/entry/syscall_64.c
+++ b/arch/x86/entry/syscall_64.c
@@ -86,10 +86,10 @@ static __always_inline bool do_syscall_x32(struct pt_regs *regs, int nr)
/* Returns true to return using SYSRET, or false to use IRET */
__visible noinstr bool do_syscall_64(struct pt_regs *regs, int nr)
{
- add_random_kstack_offset();
nr = syscall_enter_from_user_mode(regs, nr);
instrumentation_begin();
+ add_random_kstack_offset();
if (!do_syscall_x64(regs, nr) && !do_syscall_x32(regs, nr) && nr != -1) {
/* Invalid system call, but still a system call. */
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index ce3eb6d5fdf9f..7535131c711bb 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -82,18 +82,6 @@ static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
current_thread_info()->status &= ~(TS_COMPAT | TS_I386_REGS_POKED);
#endif
- /*
- * This value will get limited by KSTACK_OFFSET_MAX(), which is 10
- * bits. The actual entropy will be further reduced by the compiler
- * when applying stack alignment constraints (see cc_stack_align4/8 in
- * arch/x86/Makefile), which will remove the 3 (x86_64) or 2 (ia32)
- * low bits from any entropy chosen here.
- *
- * Therefore, final stack offset entropy will be 7 (x86_64) or
- * 8 (ia32) bits.
- */
- choose_random_kstack_offset(rdtsc());
-
/* Avoid unnecessary reads of 'x86_ibpb_exit_to_user' */
if (cpu_feature_enabled(X86_FEATURE_IBPB_EXIT_TO_USER) &&
this_cpu_read(x86_ibpb_exit_to_user)) {
diff --git a/include/linux/randomize_kstack.h b/include/linux/randomize_kstack.h
index 5d3916ca747cc..024fc20e77622 100644
--- a/include/linux/randomize_kstack.h
+++ b/include/linux/randomize_kstack.h
@@ -6,6 +6,7 @@
#include <linux/kernel.h>
#include <linux/jump_label.h>
#include <linux/percpu-defs.h>
+#include <linux/prandom.h>
DECLARE_STATIC_KEY_MAYBE(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
randomize_kstack_offset);
@@ -45,9 +46,22 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
#define KSTACK_OFFSET_MAX(x) ((x) & 0b1111111100)
#endif
+DECLARE_PER_CPU(struct rnd_state, kstack_rnd_state);
+
+static __always_inline u32 get_kstack_offset(void)
+{
+ struct rnd_state *state;
+ u32 rnd;
+
+ state = &get_cpu_var(kstack_rnd_state);
+ rnd = prandom_u32_state(state);
+ put_cpu_var(kstack_rnd_state);
+
+ return rnd;
+}
+
/**
- * add_random_kstack_offset - Increase stack utilization by previously
- * chosen random offset
+ * add_random_kstack_offset - Increase stack utilization by a random offset.
*
* This should be used in the syscall entry path after user registers have been
* stored to the stack. Preemption may be enabled. For testing the resulting
@@ -56,47 +70,15 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
#define add_random_kstack_offset() do { \
if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
&randomize_kstack_offset)) { \
- u32 offset = current->kstack_offset; \
+ u32 offset = get_kstack_offset(); \
u8 *ptr = __kstack_alloca(KSTACK_OFFSET_MAX(offset)); \
/* Keep allocation even after "ptr" loses scope. */ \
asm volatile("" :: "r"(ptr) : "memory"); \
} \
} while (0)
-/**
- * choose_random_kstack_offset - Choose the random offset for the next
- * add_random_kstack_offset()
- *
- * This should only be used during syscall exit. Preemption may be enabled. This
- * position in the syscall flow is done to frustrate attacks from userspace
- * attempting to learn the next offset:
- * - Maximize the timing uncertainty visible from userspace: if the
- * offset is chosen at syscall entry, userspace has much more control
- * over the timing between choosing offsets. "How long will we be in
- * kernel mode?" tends to be more difficult to predict than "how long
- * will we be in user mode?"
- * - Reduce the lifetime of the new offset sitting in memory during
- * kernel mode execution. Exposure of "thread-local" memory content
- * (e.g. current, percpu, etc) tends to be easier than arbitrary
- * location memory exposure.
- */
-#define choose_random_kstack_offset(rand) do { \
- if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
- &randomize_kstack_offset)) { \
- u32 offset = current->kstack_offset; \
- offset = ror32(offset, 5) ^ (rand); \
- current->kstack_offset = offset; \
- } \
-} while (0)
-
-static inline void random_kstack_task_init(struct task_struct *tsk)
-{
- tsk->kstack_offset = 0;
-}
#else /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
#define add_random_kstack_offset() do { } while (0)
-#define choose_random_kstack_offset(rand) do { } while (0)
-#define random_kstack_task_init(tsk) do { } while (0)
#endif /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8358e430dd7fd..a7b4a980eb2f0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1592,10 +1592,6 @@ struct task_struct {
unsigned long prev_lowest_stack;
#endif
-#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
- u32 kstack_offset;
-#endif
-
#ifdef CONFIG_X86_MCE
void __user *mce_vaddr;
__u64 mce_kflags;
diff --git a/init/main.c b/init/main.c
index 0a1d8529212e9..c9638a6946dca 100644
--- a/init/main.c
+++ b/init/main.c
@@ -833,6 +833,14 @@ static inline void initcall_debug_enable(void)
#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
randomize_kstack_offset);
+DEFINE_PER_CPU(struct rnd_state, kstack_rnd_state);
+
+static int __init random_kstack_init(void)
+{
+ prandom_seed_full_state(&kstack_rnd_state);
+ return 0;
+}
+late_initcall(random_kstack_init);
static int __init early_randomize_kstack_offset(char *buf)
{
diff --git a/kernel/fork.c b/kernel/fork.c
index 5715adeb6adfe..1f738c28ca07b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2234,7 +2234,6 @@ __latent_entropy struct task_struct *copy_process(
if (retval)
goto bad_fork_cleanup_io;
- random_kstack_task_init(p);
stackleak_task_init(p);
if (pid != &init_struct_pid) {
--
2.43.0
* Re: [PATCH v5 0/2] Fix bugs and performance of kstack offset randomisation
2026-03-03 15:08 [PATCH v5 0/2] Fix bugs and performance of kstack offset randomisation Ryan Roberts
2026-03-03 15:08 ` [PATCH v5 1/2] randomize_kstack: Maintain kstack_offset per task Ryan Roberts
2026-03-03 15:08 ` [PATCH v5 2/2] randomize_kstack: Unify random source across arches Ryan Roberts
@ 2026-03-11 8:32 ` Ryan Roberts
2026-03-25 4:14 ` Kees Cook
3 siblings, 0 replies; 5+ messages in thread
From: Ryan Roberts @ 2026-03-11 8:32 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Kees Cook,
Gustavo A. R. Silva, Arnd Bergmann, Mark Rutland,
Jason A. Donenfeld, Ard Biesheuvel, Jeremy Linton, David Laight
Cc: linux-kernel, linux-arm-kernel, loongarch, linuxppc-dev,
linux-riscv, linux-s390, linux-hardening
Hi Kees,
I'm keen to get some testing in linux-next and hopefully get this upstream for
v7.1 as we previously discussed. Are you willing/able to take this via your tree?
Thanks,
Ryan
On 03/03/2026 15:08, Ryan Roberts wrote:
> [Kees; I'm hoping this is now good-to-go via your hardening tree? It would be
> good to get some linux-next testing.]
>
> Hi All,
>
> As I reported at [1], kstack offset randomisation suffers from a couple of bugs
> and, on arm64 at least, the performance is poor. This series attempts to fix
> both; patch 1 provides back-portable fixes for the functional bugs. Patch 2
> proposes a performance improvement approach.
>
> [...]
* Re: [PATCH v5 0/2] Fix bugs and performance of kstack offset randomisation
2026-03-03 15:08 [PATCH v5 0/2] Fix bugs and performance of kstack offset randomisation Ryan Roberts
` (2 preceding siblings ...)
2026-03-11 8:32 ` [PATCH v5 0/2] Fix bugs and performance of kstack offset randomisation Ryan Roberts
@ 2026-03-25 4:14 ` Kees Cook
3 siblings, 0 replies; 5+ messages in thread
From: Kees Cook @ 2026-03-25 4:14 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Huacai Chen, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Ingo Molnar,
Borislav Petkov, Dave Hansen, Gustavo A. R. Silva, Arnd Bergmann,
Mark Rutland, Jason A. Donenfeld, Ard Biesheuvel, Jeremy Linton,
David Laight, Thomas Gleixner, Ryan Roberts
Cc: Kees Cook, linux-kernel, linux-arm-kernel, loongarch,
linuxppc-dev, linux-riscv, linux-s390, linux-hardening
On Tue, 03 Mar 2026 15:08:37 +0000, Ryan Roberts wrote:
> [Kees; I'm hoping this is now good-to-go via your hardening tree? It would be
> good to get some linux-next testing.]
>
> Hi All,
>
> As I reported at [1], kstack offset randomisation suffers from a couple of bugs
> and, on arm64 at least, the performance is poor. This series attempts to fix
> both; patch 1 provides back-portable fixes for the functional bugs. Patch 2
> proposes a performance improvement approach.
>
> [...]
Sorry for the delay! Applied to for-next/hardening, thanks. :)
[1/2] randomize_kstack: Maintain kstack_offset per task
https://git.kernel.org/kees/c/37beb4256016
[2/2] randomize_kstack: Unify random source across arches
https://git.kernel.org/kees/c/a96ef5848cb0
Take care,
--
Kees Cook