[RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack

linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

* [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
@ 2025-11-27  9:22 Ard Biesheuvel
  2025-11-27  9:22 ` [RFC/RFT PATCH 1/6] hexagon: Wire up cmpxchg64_local() to generic implementation Ard Biesheuvel
                   ` (6 more replies)
  0 siblings, 7 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2025-11-27  9:22 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-arm-kernel, linux-kernel, Ard Biesheuvel, Kees Cook,
	Ryan Roberts, Will Deacon, Arnd Bergmann, Jeremy Linton,
	Catalin Marinas, Mark Rutland, Jason A. Donenfeld

From: Ard Biesheuvel <ardb@kernel.org>

Ryan reports that get_random_u16() is dominant in the performance
profiling of syscall entry when kstack randomization is enabled [0].

This is the reason many architectures rely on a counter instead, and
that, in turn, is the reason for the convoluted way the (pseudo-)entropy
is gathered and recorded in a per-CPU variable.

Let's try to make the get_random_uXX() fast path faster, and switch to
get_random_u8() so that we'll hit the slow path 2x less often. Then,
wire it up in the syscall entry path, replacing the per-CPU variable,
making the logic at syscall exit redundant.

[0] https://lore.kernel.org/all/dd8c37bc-795f-4c7a-9086-69e584d8ab24@arm.com/

Cc: Kees Cook <kees@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jeremy Linton <jeremy.linton@arm.com>
Cc: Catalin Marinas <Catalin.Marinas@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>

Ard Biesheuvel (6):
  hexagon: Wire up cmpxchg64_local() to generic implementation
  arc: Wire up cmpxchg64_local() to generic implementation
  random: Use u32 to keep track of batched entropy generation
  random: Use a lockless fast path for get_random_uXX()
  random: Plug race in preceding patch
  randomize_kstack: Use get_random_u8() at entry for entropy

 arch/Kconfig                       |  9 ++--
 arch/arc/include/asm/cmpxchg.h     |  3 ++
 arch/hexagon/include/asm/cmpxchg.h |  4 ++
 drivers/char/random.c              | 49 ++++++++++++++------
 include/linux/randomize_kstack.h   | 36 ++------------
 init/main.c                        |  1 -
 6 files changed, 49 insertions(+), 53 deletions(-)

base-commit: ac3fd01e4c1efce8f2c054cdeb2ddd2fc0fb150d
-- 
2.52.0.107.ga0afd4fd5b-goog

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC/RFT PATCH 1/6] hexagon: Wire up cmpxchg64_local() to generic implementation
  2025-11-27  9:22 [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack Ard Biesheuvel
@ 2025-11-27  9:22 ` Ard Biesheuvel
  2025-11-27  9:22 ` [RFC/RFT PATCH 2/6] arc: " Ard Biesheuvel
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2025-11-27  9:22 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-arm-kernel, linux-kernel, Ard Biesheuvel, Kees Cook,
	Ryan Roberts, Will Deacon, Arnd Bergmann, Jeremy Linton,
	Catalin Marinas, Mark Rutland, Jason A. Donenfeld

From: Ard Biesheuvel <ardb@kernel.org>

Provide cmpxchg64_local() for hexagon so we can start using it in
generic code.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/hexagon/include/asm/cmpxchg.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/hexagon/include/asm/cmpxchg.h b/arch/hexagon/include/asm/cmpxchg.h
index 9c58fb81f7fd..05e426475d2a 100644
--- a/arch/hexagon/include/asm/cmpxchg.h
+++ b/arch/hexagon/include/asm/cmpxchg.h
@@ -8,6 +8,8 @@
 #ifndef _ASM_CMPXCHG_H
 #define _ASM_CMPXCHG_H
 
+#include <asm-generic/cmpxchg-local.h>
+
 /*
  * __arch_xchg - atomically exchange a register and a memory location
  * @x: value to swap
@@ -72,4 +74,6 @@ __arch_xchg(unsigned long x, volatile void *ptr, int size)
 	__oldval;						\
 })
 
+#define arch_cmpxchg64_local __generic_cmpxchg64_local
+
 #endif /* _ASM_CMPXCHG_H */
-- 
2.52.0.107.ga0afd4fd5b-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC/RFT PATCH 2/6] arc: Wire up cmpxchg64_local() to generic implementation
  2025-11-27  9:22 [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack Ard Biesheuvel
  2025-11-27  9:22 ` [RFC/RFT PATCH 1/6] hexagon: Wire up cmpxchg64_local() to generic implementation Ard Biesheuvel
@ 2025-11-27  9:22 ` Ard Biesheuvel
  2025-11-27 15:06   ` Joey Gouly
  2025-11-27  9:22 ` [RFC/RFT PATCH 3/6] random: Use u32 to keep track of batched entropy generation Ard Biesheuvel
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: Ard Biesheuvel @ 2025-11-27  9:22 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-arm-kernel, linux-kernel, Ard Biesheuvel, Kees Cook,
	Ryan Roberts, Will Deacon, Arnd Bergmann, Jeremy Linton,
	Catalin Marinas, Mark Rutland, Jason A. Donenfeld

From: Ard Biesheuvel <ardb@kernel.org>

Provide cmpxchg64_local() for hexagon so we can start using it in
generic code.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arc/include/asm/cmpxchg.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arc/include/asm/cmpxchg.h b/arch/arc/include/asm/cmpxchg.h
index 76f43db0890f..f2d55823645c 100644
--- a/arch/arc/include/asm/cmpxchg.h
+++ b/arch/arc/include/asm/cmpxchg.h
@@ -12,6 +12,7 @@
 
 #include <asm/barrier.h>
 #include <asm/smp.h>
+#include <asm-generic/cmpxchg-local.h>
 
 #ifdef CONFIG_ARC_HAS_LLSC
 
@@ -142,4 +143,6 @@
 
 #endif
 
+#define arch_cmpxchg64_local __generic_cmpxchg64_local
+
 #endif
-- 
2.52.0.107.ga0afd4fd5b-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC/RFT PATCH 3/6] random: Use u32 to keep track of batched entropy generation
  2025-11-27  9:22 [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack Ard Biesheuvel
  2025-11-27  9:22 ` [RFC/RFT PATCH 1/6] hexagon: Wire up cmpxchg64_local() to generic implementation Ard Biesheuvel
  2025-11-27  9:22 ` [RFC/RFT PATCH 2/6] arc: " Ard Biesheuvel
@ 2025-11-27  9:22 ` Ard Biesheuvel
  2025-11-27 10:11   ` david laight
  2025-11-27  9:22 ` [RFC/RFT PATCH 4/6] random: Use a lockless fast path for get_random_uXX() Ard Biesheuvel
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: Ard Biesheuvel @ 2025-11-27  9:22 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-arm-kernel, linux-kernel, Ard Biesheuvel, Kees Cook,
	Ryan Roberts, Will Deacon, Arnd Bergmann, Jeremy Linton,
	Catalin Marinas, Mark Rutland, Jason A. Donenfeld

From: Ard Biesheuvel <ardb@kernel.org>

The batched entropy containers each have a generation field, to keep
track of the base_crng generation from which it was last reseeded.

This use case does not require all bits of the unsigned long to be
stored: storing only 32 bits is sufficient to determine whether or not
we're at most 4 billion generations behind, which seems ample.

So use an unsigned int instead: this will allow a future patch to treat
the generation and position as a single 64-bit quantity, which can be
used locklessly in a compare-and-exchange() operation.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 drivers/char/random.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index b8b24b6ed3fe..0e04bc60d034 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -507,7 +507,7 @@ struct batch_ ##type {								\
 	 */									\
 	type entropy[CHACHA_BLOCK_SIZE * 3 / (2 * sizeof(type))];		\
 	local_lock_t lock;							\
-	unsigned long generation;						\
+	unsigned int generation;						\
 	unsigned int position;							\
 };										\
 										\
@@ -521,7 +521,7 @@ type get_random_ ##type(void)							\
 	type ret;								\
 	unsigned long flags;							\
 	struct batch_ ##type *batch;						\
-	unsigned long next_gen;							\
+	unsigned int next_gen;							\
 										\
 	warn_unseeded_randomness();						\
 										\
@@ -533,7 +533,7 @@ type get_random_ ##type(void)							\
 	local_lock_irqsave(&batched_entropy_ ##type.lock, flags);		\
 	batch = raw_cpu_ptr(&batched_entropy_##type);				\
 										\
-	next_gen = READ_ONCE(base_crng.generation);				\
+	next_gen = (unsigned int)READ_ONCE(base_crng.generation);		\
 	if (batch->position >= ARRAY_SIZE(batch->entropy) ||			\
 	    next_gen != batch->generation) {					\
 		_get_random_bytes(batch->entropy, sizeof(batch->entropy));	\
-- 
2.52.0.107.ga0afd4fd5b-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC/RFT PATCH 4/6] random: Use a lockless fast path for get_random_uXX()
  2025-11-27  9:22 [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack Ard Biesheuvel
                   ` (2 preceding siblings ...)
  2025-11-27  9:22 ` [RFC/RFT PATCH 3/6] random: Use u32 to keep track of batched entropy generation Ard Biesheuvel
@ 2025-11-27  9:22 ` Ard Biesheuvel
  2025-11-27 10:32   ` Ard Biesheuvel
  2025-11-27  9:22 ` [RFC/RFT PATCH 5/6] random: Plug race in preceding patch Ard Biesheuvel
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: Ard Biesheuvel @ 2025-11-27  9:22 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-arm-kernel, linux-kernel, Ard Biesheuvel, Kees Cook,
	Ryan Roberts, Will Deacon, Arnd Bergmann, Jeremy Linton,
	Catalin Marinas, Mark Rutland, Jason A. Donenfeld

From: Ard Biesheuvel <ardb@kernel.org>

Currently, the implementations of the get_random_uXX() API protect their
critical section with a local lock and disabling interrupts, to ensure
that the code does not race with itself when called from interrupt
context.

Given that the fast path does nothing more than read a single uXX
quantity from a linear buffer and bump the position pointer, poking the
hardware registers to disable and re-enable interrupts is
disproportionately costly, and best avoided.

There are two conditions under which the batched entropy buffer is
replenished, which is what forms the critical section:
- the buffer is exhausted
- the base_crng generation counter has incremented.

By combining the position and generation counters into a single u64, we
can use compare and exchange to implement the fast path without taking
the local lock or disabling interrupts. By constructing the expected and
next values carefully, the compare and exchange will only succeed if
- we did not race with ourselves, i.e., the compare and exchange
  increments the position counter by exactly 1;
- the buffer is not exhausted
- the generation counter equals the base_crng generation counter.

Only if the compare and exchange fails is the original slow path taken,
and only in that case do we take the local lock. This results in a
considerable speedup (3-5x) when benchmarking get_random_u8() in a tight
loop.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 drivers/char/random.c | 44 ++++++++++++++------
 1 file changed, 31 insertions(+), 13 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 0e04bc60d034..71bd74871540 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -496,6 +496,12 @@ static ssize_t get_random_bytes_user(struct iov_iter *iter)
  * should be called and return 0 at least once at any point prior.
  */
 
+#ifdef __LITTLE_ENDIAN
+#define LOHI(lo, hi)	lo, hi
+#else
+#define LOHI(lo, hi)	hi, lo
+#endif
+
 #define DEFINE_BATCHED_ENTROPY(type)						\
 struct batch_ ##type {								\
 	/*									\
@@ -507,8 +513,12 @@ struct batch_ ##type {								\
 	 */									\
 	type entropy[CHACHA_BLOCK_SIZE * 3 / (2 * sizeof(type))];		\
 	local_lock_t lock;							\
-	unsigned int generation;						\
-	unsigned int position;							\
+	union {									\
+		struct {							\
+			unsigned int LOHI(position, generation);		\
+		};								\
+		u64 posgen;							\
+	};									\
 };										\
 										\
 static DEFINE_PER_CPU(struct batch_ ##type, batched_entropy_ ##type) = {	\
@@ -522,6 +532,7 @@ type get_random_ ##type(void)							\
 	unsigned long flags;							\
 	struct batch_ ##type *batch;						\
 	unsigned int next_gen;							\
+	u64 next;								\
 										\
 	warn_unseeded_randomness();						\
 										\
@@ -530,21 +541,28 @@ type get_random_ ##type(void)							\
 		return ret;							\
 	}									\
 										\
-	local_lock_irqsave(&batched_entropy_ ##type.lock, flags);		\
-	batch = raw_cpu_ptr(&batched_entropy_##type);				\
+	batch = &get_cpu_var(batched_entropy_##type);				\
 										\
 	next_gen = (unsigned int)READ_ONCE(base_crng.generation);		\
-	if (batch->position >= ARRAY_SIZE(batch->entropy) ||			\
-	    next_gen != batch->generation) {					\
-		_get_random_bytes(batch->entropy, sizeof(batch->entropy));	\
-		batch->position = 0;						\
-		batch->generation = next_gen;					\
+	next = (u64)next_gen << 32;						\
+	if (likely(batch->position < ARRAY_SIZE(batch->entropy))) {		\
+		next |=	batch->position + 1; /* next-1 is bogus otherwise */	\
+		ret = batch->entropy[batch->position];				\
+	}									\
+	if (cmpxchg64_local(&batch->posgen, next, next - 1) != next - 1) {	\
+		local_lock_irqsave(&batched_entropy_ ##type.lock, flags);	\
+		if (batch->position >= ARRAY_SIZE(batch->entropy) ||		\
+		    next_gen != batch->generation) {				\
+			_get_random_bytes(batch->entropy, sizeof(batch->entropy));\
+			batch->position = 0;					\
+			batch->generation = next_gen;				\
+		}								\
+		ret = batch->entropy[batch->position++];			\
+		local_unlock_irqrestore(&batched_entropy_ ##type.lock, flags);	\
 	}									\
 										\
-	ret = batch->entropy[batch->position];					\
-	batch->entropy[batch->position] = 0;					\
-	++batch->position;							\
-	local_unlock_irqrestore(&batched_entropy_ ##type.lock, flags);		\
+	batch->entropy[batch->position - 1] = 0;				\
+	put_cpu_var(batched_entropy_##type);					\
 	return ret;								\
 }										\
 EXPORT_SYMBOL(get_random_ ##type);
-- 
2.52.0.107.ga0afd4fd5b-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC/RFT PATCH 5/6] random: Plug race in preceding patch
  2025-11-27  9:22 [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack Ard Biesheuvel
                   ` (3 preceding siblings ...)
  2025-11-27  9:22 ` [RFC/RFT PATCH 4/6] random: Use a lockless fast path for get_random_uXX() Ard Biesheuvel
@ 2025-11-27  9:22 ` Ard Biesheuvel
  2025-11-28 11:13   ` david laight
  2025-11-27  9:22 ` [RFC/RFT PATCH 6/6] randomize_kstack: Use get_random_u8() at entry for entropy Ard Biesheuvel
  2025-11-27 12:12 ` [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack Ryan Roberts
  6 siblings, 1 reply; 28+ messages in thread
From: Ard Biesheuvel @ 2025-11-27  9:22 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-arm-kernel, linux-kernel, Ard Biesheuvel, Kees Cook,
	Ryan Roberts, Will Deacon, Arnd Bergmann, Jeremy Linton,
	Catalin Marinas, Mark Rutland, Jason A. Donenfeld

From: Ard Biesheuvel <ardb@kernel.org>

The lockless get_random_uXX() reads the next value from the linear
buffer and then overwrites it with a 0x0 value. This is racy, as the
code might be re-entered by an interrupt handler, and so the store might
redundantly wipe the location accessed by the interrupt context rather
than the interrupted context.

To plug this race, wipe the preceding location when reading the next
value from the linear buffer. Given that the position is always non-zero
outside of the critical section, this is guaranteed to be safe, and
ensures that the produced values are always wiped from the buffer.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 drivers/char/random.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 71bd74871540..e8ba460c5c9c 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -547,6 +547,7 @@ type get_random_ ##type(void)							\
 	next = (u64)next_gen << 32;						\
 	if (likely(batch->position < ARRAY_SIZE(batch->entropy))) {		\
 		next |=	batch->position + 1; /* next-1 is bogus otherwise */	\
+		batch->entropy[batch->position - 1] = 0;			\
 		ret = batch->entropy[batch->position];				\
 	}									\
 	if (cmpxchg64_local(&batch->posgen, next, next - 1) != next - 1) {	\
-- 
2.52.0.107.ga0afd4fd5b-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC/RFT PATCH 6/6] randomize_kstack: Use get_random_u8() at entry for entropy
  2025-11-27  9:22 [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack Ard Biesheuvel
                   ` (4 preceding siblings ...)
  2025-11-27  9:22 ` [RFC/RFT PATCH 5/6] random: Plug race in preceding patch Ard Biesheuvel
@ 2025-11-27  9:22 ` Ard Biesheuvel
  2025-11-27 12:12 ` [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack Ryan Roberts
  6 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2025-11-27  9:22 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-arm-kernel, linux-kernel, Ard Biesheuvel, Kees Cook,
	Ryan Roberts, Will Deacon, Arnd Bergmann, Jeremy Linton,
	Catalin Marinas, Mark Rutland, Jason A. Donenfeld

From: Ard Biesheuvel <ardb@kernel.org>

The randomize kstack code has a couple of sharp edges, which are due to
the fact that a counter was considered more suitable for providing
entropy than the kernel's RNG.

Given that kstack randomization only requires 6-8 bits of entropy,
get_random_u8() is more suitable here, and its fast path has been made
lockless so that there is no overhead beyond a preempt_dis/enable and
a local 64-bit cmpxchg() to update the RNG's internal state.

This means there is no need to defer sampling the counter until syscall
exit, which also removes the need for a per-CPU variable to carry that
value over to the next syscall on the same CPU. That works around some
issues identified by Ryan in [0].

So replace the use of the per-CPU variable in add_random_kstack_offset()
with a direct use of get_random_u8() for all users, and turn
choose_kstack_random_offset() into an empty stub so that each arch can
get rid of it individually.

[0] https://lore.kernel.org/all/dd8c37bc-795f-4c7a-9086-69e584d8ab24@arm.com/

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/Kconfig                     |  9 +++--
 include/linux/randomize_kstack.h | 36 +++-----------------
 init/main.c                      |  1 -
 3 files changed, 8 insertions(+), 38 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 61130b88964b..baea154f833e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1520,12 +1520,11 @@ config HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
 	help
 	  An arch should select this symbol if it can support kernel stack
 	  offset randomization with calls to add_random_kstack_offset()
-	  during syscall entry and choose_random_kstack_offset() during
-	  syscall exit. Careful removal of -fstack-protector-strong and
+	  during syscall entry. Careful removal of -fstack-protector-strong and
 	  -fstack-protector should also be applied to the entry code and
-	  closely examined, as the artificial stack bump looks like an array
-	  to the compiler, so it will attempt to add canary checks regardless
-	  of the static branch state.
+	  closely examined, as the artificial stack bump looks like an array to
+	  the compiler, so it will attempt to add canary checks regardless of
+	  the static branch state.
 
 config RANDOMIZE_KSTACK_OFFSET
 	bool "Support for randomizing kernel stack offset on syscall entry" if EXPERT
diff --git a/include/linux/randomize_kstack.h b/include/linux/randomize_kstack.h
index 1d982dbdd0d0..db1f056e61ec 100644
--- a/include/linux/randomize_kstack.h
+++ b/include/linux/randomize_kstack.h
@@ -9,7 +9,6 @@
 
 DECLARE_STATIC_KEY_MAYBE(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
 			 randomize_kstack_offset);
-DECLARE_PER_CPU(u32, kstack_offset);
 
 /*
  * Do not use this anywhere else in the kernel. This is used here because
@@ -50,49 +49,22 @@ DECLARE_PER_CPU(u32, kstack_offset);
  * add_random_kstack_offset - Increase stack utilization by previously
  *			      chosen random offset
  *
- * This should be used in the syscall entry path when interrupts and
- * preempt are disabled, and after user registers have been stored to
- * the stack. For testing the resulting entropy, please see:
+ * This should be used in the syscall entry path after user registers have been
+ * stored to the stack. For testing the resulting entropy, please see:
  * tools/testing/selftests/lkdtm/stack-entropy.sh
  */
 #define add_random_kstack_offset() do {					\
 	if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,	\
 				&randomize_kstack_offset)) {		\
-		u32 offset = raw_cpu_read(kstack_offset);		\
+		u32 offset = get_random_u8() << 2;			\
 		u8 *ptr = __kstack_alloca(KSTACK_OFFSET_MAX(offset));	\
 		/* Keep allocation even after "ptr" loses scope. */	\
 		asm volatile("" :: "r"(ptr) : "memory");		\
 	}								\
 } while (0)
-
-/**
- * choose_random_kstack_offset - Choose the random offset for the next
- *				 add_random_kstack_offset()
- *
- * This should only be used during syscall exit when interrupts and
- * preempt are disabled. This position in the syscall flow is done to
- * frustrate attacks from userspace attempting to learn the next offset:
- * - Maximize the timing uncertainty visible from userspace: if the
- *   offset is chosen at syscall entry, userspace has much more control
- *   over the timing between choosing offsets. "How long will we be in
- *   kernel mode?" tends to be more difficult to predict than "how long
- *   will we be in user mode?"
- * - Reduce the lifetime of the new offset sitting in memory during
- *   kernel mode execution. Exposure of "thread-local" memory content
- *   (e.g. current, percpu, etc) tends to be easier than arbitrary
- *   location memory exposure.
- */
-#define choose_random_kstack_offset(rand) do {				\
-	if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,	\
-				&randomize_kstack_offset)) {		\
-		u32 offset = raw_cpu_read(kstack_offset);		\
-		offset = ror32(offset, 5) ^ (rand);			\
-		raw_cpu_write(kstack_offset, offset);			\
-	}								\
-} while (0)
 #else /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
 #define add_random_kstack_offset()		do { } while (0)
-#define choose_random_kstack_offset(rand)	do { } while (0)
 #endif /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
+#define choose_random_kstack_offset(rand)	do { } while (0)
 
 #endif
diff --git a/init/main.c b/init/main.c
index 07a3116811c5..048a62538242 100644
--- a/init/main.c
+++ b/init/main.c
@@ -830,7 +830,6 @@ static inline void initcall_debug_enable(void)
 #ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
 DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
 			   randomize_kstack_offset);
-DEFINE_PER_CPU(u32, kstack_offset);
 
 static int __init early_randomize_kstack_offset(char *buf)
 {
-- 
2.52.0.107.ga0afd4fd5b-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC/RFT PATCH 3/6] random: Use u32 to keep track of batched entropy generation
  2025-11-27  9:22 ` [RFC/RFT PATCH 3/6] random: Use u32 to keep track of batched entropy generation Ard Biesheuvel
@ 2025-11-27 10:11   ` david laight
  2025-11-27 10:15     ` Ard Biesheuvel
  0 siblings, 1 reply; 28+ messages in thread
From: david laight @ 2025-11-27 10:11 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-hardening, linux-arm-kernel, linux-kernel, Ard Biesheuvel,
	Kees Cook, Ryan Roberts, Will Deacon, Arnd Bergmann,
	Jeremy Linton, Catalin Marinas, Mark Rutland, Jason A. Donenfeld

On Thu, 27 Nov 2025 10:22:30 +0100
Ard Biesheuvel <ardb+git@google.com> wrote:

> From: Ard Biesheuvel <ardb@kernel.org>
> 
> The batched entropy containers each have a generation field, to keep
> track of the base_crng generation from which it was last reseeded.
> 
> This use case does not require all bits of the unsigned long to be
> stored: storing only 32 bits is sufficient to determine whether or not
> we're at most 4 billion generations behind, which seems ample.
> 
> So use an unsigned int instead: this will allow a future patch to treat
> the generation and position as a single 64-bit quantity, which can be
> used locklessly in a compare-and-exchange() operation.

Probably best to use a u32.
While it will always(?) be the same as 'unsigned int' it is more
descriptive.

> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  drivers/char/random.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index b8b24b6ed3fe..0e04bc60d034 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -507,7 +507,7 @@ struct batch_ ##type {								\
>  	 */									\
>  	type entropy[CHACHA_BLOCK_SIZE * 3 / (2 * sizeof(type))];		\
>  	local_lock_t lock;							\
> -	unsigned long generation;						\
> +	unsigned int generation;						\
>  	unsigned int position;							\
>  };										\
>  										\
> @@ -521,7 +521,7 @@ type get_random_ ##type(void)							\
>  	type ret;								\
>  	unsigned long flags;							\
>  	struct batch_ ##type *batch;						\
> -	unsigned long next_gen;							\
> +	unsigned int next_gen;							\
>  										\
>  	warn_unseeded_randomness();						\
>  										\
> @@ -533,7 +533,7 @@ type get_random_ ##type(void)							\
>  	local_lock_irqsave(&batched_entropy_ ##type.lock, flags);		\
>  	batch = raw_cpu_ptr(&batched_entropy_##type);				\
>  										\
> -	next_gen = READ_ONCE(base_crng.generation);				\
> +	next_gen = (unsigned int)READ_ONCE(base_crng.generation);		\

Isn't that cast pointless?

	David

>  	if (batch->position >= ARRAY_SIZE(batch->entropy) ||			\
>  	    next_gen != batch->generation) {					\
>  		_get_random_bytes(batch->entropy, sizeof(batch->entropy));	\



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/RFT PATCH 3/6] random: Use u32 to keep track of batched entropy generation
  2025-11-27 10:11   ` david laight
@ 2025-11-27 10:15     ` Ard Biesheuvel
  0 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2025-11-27 10:15 UTC (permalink / raw)
  To: david laight
  Cc: Ard Biesheuvel, linux-hardening, linux-arm-kernel, linux-kernel,
	Kees Cook, Ryan Roberts, Will Deacon, Arnd Bergmann,
	Jeremy Linton, Catalin Marinas, Mark Rutland, Jason A. Donenfeld

On Thu, 27 Nov 2025 at 11:11, david laight <david.laight@runbox.com> wrote:
>
> On Thu, 27 Nov 2025 10:22:30 +0100
> Ard Biesheuvel <ardb+git@google.com> wrote:
>
> > From: Ard Biesheuvel <ardb@kernel.org>
> >
> > The batched entropy containers each have a generation field, to keep
> > track of the base_crng generation from which it was last reseeded.
> >
> > This use case does not require all bits of the unsigned long to be
> > stored: storing only 32 bits is sufficient to determine whether or not
> > we're at most 4 billion generations behind, which seems ample.
> >
> > So use an unsigned int instead: this will allow a future patch to treat
> > the generation and position as a single 64-bit quantity, which can be
> > used locklessly in a compare-and-exchange() operation.
>
> Probably best to use a u32.
> While it will always(?) be the same as 'unsigned int' it is more
> descriptive.
>
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > ---
> >  drivers/char/random.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/char/random.c b/drivers/char/random.c
> > index b8b24b6ed3fe..0e04bc60d034 100644
> > --- a/drivers/char/random.c
> > +++ b/drivers/char/random.c
> > @@ -507,7 +507,7 @@ struct batch_ ##type {                                                            \
> >        */                                                                     \
> >       type entropy[CHACHA_BLOCK_SIZE * 3 / (2 * sizeof(type))];               \
> >       local_lock_t lock;                                                      \
> > -     unsigned long generation;                                               \
> > +     unsigned int generation;                                                \
> >       unsigned int position;                                                  \
> >  };                                                                           \
> >                                                                               \
> > @@ -521,7 +521,7 @@ type get_random_ ##type(void)                                                     \
> >       type ret;                                                               \
> >       unsigned long flags;                                                    \
> >       struct batch_ ##type *batch;                                            \
> > -     unsigned long next_gen;                                                 \
> > +     unsigned int next_gen;                                                  \
> >                                                                               \
> >       warn_unseeded_randomness();                                             \
> >                                                                               \
> > @@ -533,7 +533,7 @@ type get_random_ ##type(void)                                                     \
> >       local_lock_irqsave(&batched_entropy_ ##type.lock, flags);               \
> >       batch = raw_cpu_ptr(&batched_entropy_##type);                           \
> >                                                                               \
> > -     next_gen = READ_ONCE(base_crng.generation);                             \
> > +     next_gen = (unsigned int)READ_ONCE(base_crng.generation);               \
>
> Isn't that cast pointless?
>

Depends on your definition of pointless :-)

We're deliberately truncating here, and the cast makes that explicit.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/RFT PATCH 4/6] random: Use a lockless fast path for get_random_uXX()
  2025-11-27  9:22 ` [RFC/RFT PATCH 4/6] random: Use a lockless fast path for get_random_uXX() Ard Biesheuvel
@ 2025-11-27 10:32   ` Ard Biesheuvel
  0 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2025-11-27 10:32 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-hardening, linux-arm-kernel, linux-kernel, Kees Cook,
	Ryan Roberts, Will Deacon, Arnd Bergmann, Jeremy Linton,
	Catalin Marinas, Mark Rutland, Jason A. Donenfeld

On Thu, 27 Nov 2025 at 10:22, Ard Biesheuvel <ardb+git@google.com> wrote:
>
> From: Ard Biesheuvel <ardb@kernel.org>
>
> Currently, the implementations of the get_random_uXX() API protect their
> critical section with a local lock and disabling interrupts, to ensure
> that the code does not race with itself when called from interrupt
> context.
>
> Given that the fast path does nothing more than read a single uXX
> quantity from a linear buffer and bump the position pointer, poking the
> hardware registers to disable and re-enable interrupts is
> disproportionately costly, and best avoided.
>
> There are two conditions under which the batched entropy buffer is
> replenished, which is what forms the critical section:
> - the buffer is exhausted
> - the base_crng generation counter has incremented.
>
> By combining the position and generation counters into a single u64, we
> can use compare and exchange to implement the fast path without taking
> the local lock or disabling interrupts. By constructing the expected and
> next values carefully, the compare and exchange will only succeed if
> - we did not race with ourselves, i.e., the compare and exchange
>   increments the position counter by exactly 1;
> - the buffer is not exhausted
> - the generation counter equals the base_crng generation counter.
>
> Only if the compare and exchange fails is the original slow path taken,
> and only in that case do we take the local lock. This results in a
> considerable speedup (3-5x) when benchmarking get_random_u8() in a tight
> loop.
>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  drivers/char/random.c | 44 ++++++++++++++------
>  1 file changed, 31 insertions(+), 13 deletions(-)
>

This needs the following applied to ensure correct behavior when first
called after the base_crng generation has already incremented to 1.

diff --git a/drivers/char/random.c b/drivers/char/random.c
index e8ba460c5c9c..dddbec7cf856 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -523,7 +523,7 @@ struct batch_ ##type {
                                 \

         \
 static DEFINE_PER_CPU(struct batch_ ##type, batched_entropy_ ##type)
= {       \
        .lock = INIT_LOCAL_LOCK(batched_entropy_ ##type.lock),
         \
-       .position = UINT_MAX
         \
+       .position = ARRAY_SIZE(batched_entropy_ ##type.entropy),
         \
 };
         \

         \
 type get_random_ ##type(void)
         \


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
  2025-11-27  9:22 [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack Ard Biesheuvel
                   ` (5 preceding siblings ...)
  2025-11-27  9:22 ` [RFC/RFT PATCH 6/6] randomize_kstack: Use get_random_u8() at entry for entropy Ard Biesheuvel
@ 2025-11-27 12:12 ` Ryan Roberts
  2025-11-27 12:28   ` Ard Biesheuvel
  6 siblings, 1 reply; 28+ messages in thread
From: Ryan Roberts @ 2025-11-27 12:12 UTC (permalink / raw)
  To: Ard Biesheuvel, linux-hardening
  Cc: linux-arm-kernel, linux-kernel, Ard Biesheuvel, Kees Cook,
	Will Deacon, Arnd Bergmann, Jeremy Linton, Catalin Marinas,
	Mark Rutland, Jason A. Donenfeld

On 27/11/2025 09:22, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
> 
> Ryan reports that get_random_u16() is dominant in the performance
> profiling of syscall entry when kstack randomization is enabled [0].
> 
> This is the reason many architectures rely on a counter instead, and
> that, in turn, is the reason for the convoluted way the (pseudo-)entropy
> is gathered and recorded in a per-CPU variable.
> 
> Let's try to make the get_random_uXX() fast path faster, and switch to
> get_random_u8() so that we'll hit the slow path 2x less often. Then,
> wire it up in the syscall entry path, replacing the per-CPU variable,
> making the logic at syscall exit redundant.

I ran the same set of syscall benchmarks for this series as I've done for my 
series. 

The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
performance cost of turning it on without any changes to the implementation,
then the reduced performance cost of turning it on with my changes applied, and 
finally cost of turning it on with Ard's changes applied:

arm64 (AWS Graviton3):
+-----------------+--------------+-------------+---------------+-----------------+
| Benchmark       | Result Class |   v6.18-rc5 | per-task-prng | fast-get-random |
|                 |              | rndstack-on |               |                 |
+=================+==============+=============+===============+=================+
| syscall/getpid  | mean (ns)    |  (R) 15.62% |     (R) 3.43% |      (R) 11.93% |
|                 | p99 (ns)     | (R) 155.01% |     (R) 3.20% |      (R) 11.00% |
|                 | p99.9 (ns)   | (R) 156.71% |     (R) 2.93% |      (R) 11.39% |
+-----------------+--------------+-------------+---------------+-----------------+
| syscall/getppid | mean (ns)    |  (R) 14.09% |     (R) 2.12% |      (R) 10.44% |
|                 | p99 (ns)     | (R) 152.81% |         1.55% |       (R) 9.94% |
|                 | p99.9 (ns)   | (R) 153.67% |         1.77% |       (R) 9.83% |
+-----------------+--------------+-------------+---------------+-----------------+
| syscall/invalid | mean (ns)    |  (R) 13.89% |     (R) 3.32% |      (R) 10.39% |
|                 | p99 (ns)     | (R) 165.82% |     (R) 3.51% |      (R) 10.72% |
|                 | p99.9 (ns)   | (R) 168.83% |     (R) 3.77% |      (R) 11.03% |
+-----------------+--------------+-------------+---------------+-----------------+

So this fixes the tail problem. I guess get_random_u8() only takes the slow path 
every 768 calls, whereas get_random_u16() took it every 384 calls. I'm not sure 
that fully explains it though.

But it's still a 10% cost on average.

Personally I think 10% syscall cost is too much to pay for 6 bits of stack 
randomisation. 3% is better, but still higher than we would all prefer, I'm sure.

Thanks,
Ryan

> 
> [0] https://lore.kernel.org/all/dd8c37bc-795f-4c7a-9086-69e584d8ab24@arm.com/
> 
> Cc: Kees Cook <kees@kernel.org>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Jeremy Linton <jeremy.linton@arm.com>
> Cc: Catalin Marinas <Catalin.Marinas@arm.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Jason A. Donenfeld <Jason@zx2c4.com>
> 
> Ard Biesheuvel (6):
>   hexagon: Wire up cmpxchg64_local() to generic implementation
>   arc: Wire up cmpxchg64_local() to generic implementation
>   random: Use u32 to keep track of batched entropy generation
>   random: Use a lockless fast path for get_random_uXX()
>   random: Plug race in preceding patch
>   randomize_kstack: Use get_random_u8() at entry for entropy
> 
>  arch/Kconfig                       |  9 ++--
>  arch/arc/include/asm/cmpxchg.h     |  3 ++
>  arch/hexagon/include/asm/cmpxchg.h |  4 ++
>  drivers/char/random.c              | 49 ++++++++++++++------
>  include/linux/randomize_kstack.h   | 36 ++------------
>  init/main.c                        |  1 -
>  6 files changed, 49 insertions(+), 53 deletions(-)
> 
> 
> base-commit: ac3fd01e4c1efce8f2c054cdeb2ddd2fc0fb150d



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
  2025-11-27 12:12 ` [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack Ryan Roberts
@ 2025-11-27 12:28   ` Ard Biesheuvel
  2025-11-27 13:08     ` Ryan Roberts
  2025-11-27 14:18     ` Ryan Roberts
  0 siblings, 2 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2025-11-27 12:28 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Ard Biesheuvel, linux-hardening, linux-arm-kernel, linux-kernel,
	Kees Cook, Will Deacon, Arnd Bergmann, Jeremy Linton,
	Catalin Marinas, Mark Rutland, Jason A. Donenfeld

On Thu, 27 Nov 2025 at 13:12, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 27/11/2025 09:22, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel <ardb@kernel.org>
> >
> > Ryan reports that get_random_u16() is dominant in the performance
> > profiling of syscall entry when kstack randomization is enabled [0].
> >
> > This is the reason many architectures rely on a counter instead, and
> > that, in turn, is the reason for the convoluted way the (pseudo-)entropy
> > is gathered and recorded in a per-CPU variable.
> >
> > Let's try to make the get_random_uXX() fast path faster, and switch to
> > get_random_u8() so that we'll hit the slow path 2x less often. Then,
> > wire it up in the syscall entry path, replacing the per-CPU variable,
> > making the logic at syscall exit redundant.
>
> I ran the same set of syscall benchmarks for this series as I've done for my
> series.
>

Thanks!


> The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
> performance cost of turning it on without any changes to the implementation,
> then the reduced performance cost of turning it on with my changes applied, and
> finally cost of turning it on with Ard's changes applied:
>
> arm64 (AWS Graviton3):
> +-----------------+--------------+-------------+---------------+-----------------+
> | Benchmark       | Result Class |   v6.18-rc5 | per-task-prng | fast-get-random |
> |                 |              | rndstack-on |               |                 |
> +=================+==============+=============+===============+=================+
> | syscall/getpid  | mean (ns)    |  (R) 15.62% |     (R) 3.43% |      (R) 11.93% |
> |                 | p99 (ns)     | (R) 155.01% |     (R) 3.20% |      (R) 11.00% |
> |                 | p99.9 (ns)   | (R) 156.71% |     (R) 2.93% |      (R) 11.39% |
> +-----------------+--------------+-------------+---------------+-----------------+
> | syscall/getppid | mean (ns)    |  (R) 14.09% |     (R) 2.12% |      (R) 10.44% |
> |                 | p99 (ns)     | (R) 152.81% |         1.55% |       (R) 9.94% |
> |                 | p99.9 (ns)   | (R) 153.67% |         1.77% |       (R) 9.83% |
> +-----------------+--------------+-------------+---------------+-----------------+
> | syscall/invalid | mean (ns)    |  (R) 13.89% |     (R) 3.32% |      (R) 10.39% |
> |                 | p99 (ns)     | (R) 165.82% |     (R) 3.51% |      (R) 10.72% |
> |                 | p99.9 (ns)   | (R) 168.83% |     (R) 3.77% |      (R) 11.03% |
> +-----------------+--------------+-------------+---------------+-----------------+
>

What does the (R) mean?

> So this fixes the tail problem. I guess get_random_u8() only takes the slow path
> every 768 calls, whereas get_random_u16() took it every 384 calls. I'm not sure
> that fully explains it though.
>
> But it's still a 10% cost on average.
>
> Personally I think 10% syscall cost is too much to pay for 6 bits of stack
> randomisation. 3% is better, but still higher than we would all prefer, I'm sure.
>

Interesting!

So the only thing that get_random_u8() does that could explain the
delta is calling into the scheduler on preempt_enable(), given that it
does very little beyond that.

Would you mind repeating this experiment after changing the
put_cpu_var() to preempt_enable_no_resched(), to test this theory?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
  2025-11-27 12:28   ` Ard Biesheuvel
@ 2025-11-27 13:08     ` Ryan Roberts
  2025-11-27 14:18     ` Ryan Roberts
  1 sibling, 0 replies; 28+ messages in thread
From: Ryan Roberts @ 2025-11-27 13:08 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Ard Biesheuvel, linux-hardening, linux-arm-kernel, linux-kernel,
	Kees Cook, Will Deacon, Arnd Bergmann, Jeremy Linton,
	Catalin Marinas, Mark Rutland, Jason A. Donenfeld

On 27/11/2025 12:28, Ard Biesheuvel wrote:
> On Thu, 27 Nov 2025 at 13:12, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 27/11/2025 09:22, Ard Biesheuvel wrote:
>>> From: Ard Biesheuvel <ardb@kernel.org>
>>>
>>> Ryan reports that get_random_u16() is dominant in the performance
>>> profiling of syscall entry when kstack randomization is enabled [0].
>>>
>>> This is the reason many architectures rely on a counter instead, and
>>> that, in turn, is the reason for the convoluted way the (pseudo-)entropy
>>> is gathered and recorded in a per-CPU variable.
>>>
>>> Let's try to make the get_random_uXX() fast path faster, and switch to
>>> get_random_u8() so that we'll hit the slow path 2x less often. Then,
>>> wire it up in the syscall entry path, replacing the per-CPU variable,
>>> making the logic at syscall exit redundant.
>>
>> I ran the same set of syscall benchmarks for this series as I've done for my
>> series.
>>
> 
> Thanks!
> 
> 
>> The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
>> performance cost of turning it on without any changes to the implementation,
>> then the reduced performance cost of turning it on with my changes applied, and
>> finally cost of turning it on with Ard's changes applied:
>>
>> arm64 (AWS Graviton3):
>> +-----------------+--------------+-------------+---------------+-----------------+
>> | Benchmark       | Result Class |   v6.18-rc5 | per-task-prng | fast-get-random |
>> |                 |              | rndstack-on |               |                 |
>> +=================+==============+=============+===============+=================+
>> | syscall/getpid  | mean (ns)    |  (R) 15.62% |     (R) 3.43% |      (R) 11.93% |
>> |                 | p99 (ns)     | (R) 155.01% |     (R) 3.20% |      (R) 11.00% |
>> |                 | p99.9 (ns)   | (R) 156.71% |     (R) 2.93% |      (R) 11.39% |
>> +-----------------+--------------+-------------+---------------+-----------------+
>> | syscall/getppid | mean (ns)    |  (R) 14.09% |     (R) 2.12% |      (R) 10.44% |
>> |                 | p99 (ns)     | (R) 152.81% |         1.55% |       (R) 9.94% |
>> |                 | p99.9 (ns)   | (R) 153.67% |         1.77% |       (R) 9.83% |
>> +-----------------+--------------+-------------+---------------+-----------------+
>> | syscall/invalid | mean (ns)    |  (R) 13.89% |     (R) 3.32% |      (R) 10.39% |
>> |                 | p99 (ns)     | (R) 165.82% |     (R) 3.51% |      (R) 10.72% |
>> |                 | p99.9 (ns)   | (R) 168.83% |     (R) 3.77% |      (R) 11.03% |
>> +-----------------+--------------+-------------+---------------+-----------------+
>>
> 
> What does the (R) mean?

(R) is statistically significant regression
(I) is statistically significant improvement

Where "statistically significant" is where the 95% confidence intervals of the
baseline and comparson do not overlap.

> 
>> So this fixes the tail problem. I guess get_random_u8() only takes the slow path
>> every 768 calls, whereas get_random_u16() took it every 384 calls. I'm not sure
>> that fully explains it though.
>>
>> But it's still a 10% cost on average.
>>
>> Personally I think 10% syscall cost is too much to pay for 6 bits of stack
>> randomisation. 3% is better, but still higher than we would all prefer, I'm sure.
>>
> 
> Interesting!
> 
> So the only thing that get_random_u8() does that could explain the
> delta is calling into the scheduler on preempt_enable(), given that it
> does very little beyond that.
> 
> Would you mind repeating this experiment after changing the
> put_cpu_var() to preempt_enable_no_resched(), to test this theory?

Yep it's queued.




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
  2025-11-27 12:28   ` Ard Biesheuvel
  2025-11-27 13:08     ` Ryan Roberts
@ 2025-11-27 14:18     ` Ryan Roberts
  2025-11-27 15:03       ` Ard Biesheuvel
  1 sibling, 1 reply; 28+ messages in thread
From: Ryan Roberts @ 2025-11-27 14:18 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Ard Biesheuvel, linux-hardening, linux-arm-kernel, linux-kernel,
	Kees Cook, Will Deacon, Arnd Bergmann, Jeremy Linton,
	Catalin Marinas, Mark Rutland, Jason A. Donenfeld

On 27/11/2025 12:28, Ard Biesheuvel wrote:
> On Thu, 27 Nov 2025 at 13:12, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 27/11/2025 09:22, Ard Biesheuvel wrote:
>>> From: Ard Biesheuvel <ardb@kernel.org>
>>>
>>> Ryan reports that get_random_u16() is dominant in the performance
>>> profiling of syscall entry when kstack randomization is enabled [0].
>>>
>>> This is the reason many architectures rely on a counter instead, and
>>> that, in turn, is the reason for the convoluted way the (pseudo-)entropy
>>> is gathered and recorded in a per-CPU variable.
>>>
>>> Let's try to make the get_random_uXX() fast path faster, and switch to
>>> get_random_u8() so that we'll hit the slow path 2x less often. Then,
>>> wire it up in the syscall entry path, replacing the per-CPU variable,
>>> making the logic at syscall exit redundant.
>>
>> I ran the same set of syscall benchmarks for this series as I've done for my
>> series.
>>
> 
> Thanks!
> 
> 
>> The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
>> performance cost of turning it on without any changes to the implementation,
>> then the reduced performance cost of turning it on with my changes applied, and
>> finally cost of turning it on with Ard's changes applied:
>>
>> arm64 (AWS Graviton3):
>> +-----------------+--------------+-------------+---------------+-----------------+
>> | Benchmark       | Result Class |   v6.18-rc5 | per-task-prng | fast-get-random |
>> |                 |              | rndstack-on |               |                 |
>> +=================+==============+=============+===============+=================+
>> | syscall/getpid  | mean (ns)    |  (R) 15.62% |     (R) 3.43% |      (R) 11.93% |
>> |                 | p99 (ns)     | (R) 155.01% |     (R) 3.20% |      (R) 11.00% |
>> |                 | p99.9 (ns)   | (R) 156.71% |     (R) 2.93% |      (R) 11.39% |
>> +-----------------+--------------+-------------+---------------+-----------------+
>> | syscall/getppid | mean (ns)    |  (R) 14.09% |     (R) 2.12% |      (R) 10.44% |
>> |                 | p99 (ns)     | (R) 152.81% |         1.55% |       (R) 9.94% |
>> |                 | p99.9 (ns)   | (R) 153.67% |         1.77% |       (R) 9.83% |
>> +-----------------+--------------+-------------+---------------+-----------------+
>> | syscall/invalid | mean (ns)    |  (R) 13.89% |     (R) 3.32% |      (R) 10.39% |
>> |                 | p99 (ns)     | (R) 165.82% |     (R) 3.51% |      (R) 10.72% |
>> |                 | p99.9 (ns)   | (R) 168.83% |     (R) 3.77% |      (R) 11.03% |
>> +-----------------+--------------+-------------+---------------+-----------------+
>>
> 
> What does the (R) mean?
> 
>> So this fixes the tail problem. I guess get_random_u8() only takes the slow path
>> every 768 calls, whereas get_random_u16() took it every 384 calls. I'm not sure
>> that fully explains it though.
>>
>> But it's still a 10% cost on average.
>>
>> Personally I think 10% syscall cost is too much to pay for 6 bits of stack
>> randomisation. 3% is better, but still higher than we would all prefer, I'm sure.
>>
> 
> Interesting!
> 
> So the only thing that get_random_u8() does that could explain the
> delta is calling into the scheduler on preempt_enable(), given that it
> does very little beyond that.
> 
> Would you mind repeating this experiment after changing the
> put_cpu_var() to preempt_enable_no_resched(), to test this theory?

This has no impact on performance.




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
  2025-11-27 14:18     ` Ryan Roberts
@ 2025-11-27 15:03       ` Ard Biesheuvel
  2025-11-27 15:40         ` Ard Biesheuvel
  2025-11-27 15:56         ` Ryan Roberts
  0 siblings, 2 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2025-11-27 15:03 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Ard Biesheuvel, linux-hardening, linux-arm-kernel, linux-kernel,
	Kees Cook, Will Deacon, Arnd Bergmann, Jeremy Linton,
	Catalin Marinas, Mark Rutland, Jason A. Donenfeld

On Thu, 27 Nov 2025 at 15:18, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 27/11/2025 12:28, Ard Biesheuvel wrote:
> > On Thu, 27 Nov 2025 at 13:12, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 27/11/2025 09:22, Ard Biesheuvel wrote:
> >>> From: Ard Biesheuvel <ardb@kernel.org>
> >>>
> >>> Ryan reports that get_random_u16() is dominant in the performance
> >>> profiling of syscall entry when kstack randomization is enabled [0].
> >>>
> >>> This is the reason many architectures rely on a counter instead, and
> >>> that, in turn, is the reason for the convoluted way the (pseudo-)entropy
> >>> is gathered and recorded in a per-CPU variable.
> >>>
> >>> Let's try to make the get_random_uXX() fast path faster, and switch to
> >>> get_random_u8() so that we'll hit the slow path 2x less often. Then,
> >>> wire it up in the syscall entry path, replacing the per-CPU variable,
> >>> making the logic at syscall exit redundant.
> >>
> >> I ran the same set of syscall benchmarks for this series as I've done for my
> >> series.
> >>
> >
> > Thanks!
> >
> >
> >> The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
> >> performance cost of turning it on without any changes to the implementation,
> >> then the reduced performance cost of turning it on with my changes applied, and
> >> finally cost of turning it on with Ard's changes applied:
> >>
> >> arm64 (AWS Graviton3):
> >> +-----------------+--------------+-------------+---------------+-----------------+
> >> | Benchmark       | Result Class |   v6.18-rc5 | per-task-prng | fast-get-random |
> >> |                 |              | rndstack-on |               |                 |
> >> +=================+==============+=============+===============+=================+
> >> | syscall/getpid  | mean (ns)    |  (R) 15.62% |     (R) 3.43% |      (R) 11.93% |
> >> |                 | p99 (ns)     | (R) 155.01% |     (R) 3.20% |      (R) 11.00% |
> >> |                 | p99.9 (ns)   | (R) 156.71% |     (R) 2.93% |      (R) 11.39% |
> >> +-----------------+--------------+-------------+---------------+-----------------+
> >> | syscall/getppid | mean (ns)    |  (R) 14.09% |     (R) 2.12% |      (R) 10.44% |
> >> |                 | p99 (ns)     | (R) 152.81% |         1.55% |       (R) 9.94% |
> >> |                 | p99.9 (ns)   | (R) 153.67% |         1.77% |       (R) 9.83% |
> >> +-----------------+--------------+-------------+---------------+-----------------+
> >> | syscall/invalid | mean (ns)    |  (R) 13.89% |     (R) 3.32% |      (R) 10.39% |
> >> |                 | p99 (ns)     | (R) 165.82% |     (R) 3.51% |      (R) 10.72% |
> >> |                 | p99.9 (ns)   | (R) 168.83% |     (R) 3.77% |      (R) 11.03% |
> >> +-----------------+--------------+-------------+---------------+-----------------+
> >>
> >
> > What does the (R) mean?
> >
> >> So this fixes the tail problem. I guess get_random_u8() only takes the slow path
> >> every 768 calls, whereas get_random_u16() took it every 384 calls. I'm not sure
> >> that fully explains it though.
> >>
> >> But it's still a 10% cost on average.
> >>
> >> Personally I think 10% syscall cost is too much to pay for 6 bits of stack
> >> randomisation. 3% is better, but still higher than we would all prefer, I'm sure.
> >>
> >
> > Interesting!
> >
> > So the only thing that get_random_u8() does that could explain the
> > delta is calling into the scheduler on preempt_enable(), given that it
> > does very little beyond that.
> >
> > Would you mind repeating this experiment after changing the
> > put_cpu_var() to preempt_enable_no_resched(), to test this theory?
>
> This has no impact on performance.
>

Thanks. But this is really rather surprising: what else could be
taking up that time, given that on the fast path, there are only some
loads and stores to the buffer, and a cmpxchg64_local(). Could it be
the latter that is causing so much latency? I suppose the local
cmpxchg() semantics don't really exist on arm64, and this uses the
exact same LSE instruction that would be used for an ordinary
cmpxchg(), unlike on x86 where it appears to omit the LOCK prefix.

In any case, there is no debate that your code is faster on arm64. I
also think that using prandom for this purpose is perfectly fine, even
without reseeding: with a 2^113 period and only 6 observable bits per
32 bit sample, predicting the next value reliably is maybe not
impossible, but hardly worth the extensive effort, given that we're
not generating cryptographic keys here.

So the question is really whether we want to dedicate 16 bytes per
task for this. I wouldn't mind personally, but it is something our
internal QA engineers tend to obsess over.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/RFT PATCH 2/6] arc: Wire up cmpxchg64_local() to generic implementation
  2025-11-27  9:22 ` [RFC/RFT PATCH 2/6] arc: " Ard Biesheuvel
@ 2025-11-27 15:06   ` Joey Gouly
  2025-12-03 16:59     ` Ard Biesheuvel
  0 siblings, 1 reply; 28+ messages in thread
From: Joey Gouly @ 2025-11-27 15:06 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-hardening, linux-arm-kernel, linux-kernel, Ard Biesheuvel,
	Kees Cook, Ryan Roberts, Will Deacon, Arnd Bergmann,
	Jeremy Linton, Catalin Marinas, Mark Rutland, Jason A. Donenfeld

On Thu, Nov 27, 2025 at 10:22:29AM +0100, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
> 
> Provide cmpxchg64_local() for hexagon so we can start using it in
> generic code.

nit: this is for arc, not hexagon.

Thanks,
Joey

> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  arch/arc/include/asm/cmpxchg.h | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/arc/include/asm/cmpxchg.h b/arch/arc/include/asm/cmpxchg.h
> index 76f43db0890f..f2d55823645c 100644
> --- a/arch/arc/include/asm/cmpxchg.h
> +++ b/arch/arc/include/asm/cmpxchg.h
> @@ -12,6 +12,7 @@
>  
>  #include <asm/barrier.h>
>  #include <asm/smp.h>
> +#include <asm-generic/cmpxchg-local.h>
>  
>  #ifdef CONFIG_ARC_HAS_LLSC
>  
> @@ -142,4 +143,6 @@
>  
>  #endif
>  
> +#define arch_cmpxchg64_local __generic_cmpxchg64_local
> +
>  #endif
> -- 
> 2.52.0.107.ga0afd4fd5b-goog
> 
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
  2025-11-27 15:03       ` Ard Biesheuvel
@ 2025-11-27 15:40         ` Ard Biesheuvel
  2025-11-27 15:56         ` Ryan Roberts
  1 sibling, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2025-11-27 15:40 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Ard Biesheuvel, linux-hardening, linux-arm-kernel, linux-kernel,
	Kees Cook, Will Deacon, Arnd Bergmann, Jeremy Linton,
	Catalin Marinas, Mark Rutland, Jason A. Donenfeld

On Thu, 27 Nov 2025 at 16:03, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Thu, 27 Nov 2025 at 15:18, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >
> > On 27/11/2025 12:28, Ard Biesheuvel wrote:
> > > On Thu, 27 Nov 2025 at 13:12, Ryan Roberts <ryan.roberts@arm.com> wrote:
> > >>
> > >> On 27/11/2025 09:22, Ard Biesheuvel wrote:
> > >>> From: Ard Biesheuvel <ardb@kernel.org>
> > >>>
> > >>> Ryan reports that get_random_u16() is dominant in the performance
> > >>> profiling of syscall entry when kstack randomization is enabled [0].
> > >>>
> > >>> This is the reason many architectures rely on a counter instead, and
> > >>> that, in turn, is the reason for the convoluted way the (pseudo-)entropy
> > >>> is gathered and recorded in a per-CPU variable.
> > >>>
> > >>> Let's try to make the get_random_uXX() fast path faster, and switch to
> > >>> get_random_u8() so that we'll hit the slow path 2x less often. Then,
> > >>> wire it up in the syscall entry path, replacing the per-CPU variable,
> > >>> making the logic at syscall exit redundant.
> > >>
> > >> I ran the same set of syscall benchmarks for this series as I've done for my
> > >> series.
> > >>
> > >
> > > Thanks!
> > >
> > >
> > >> The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
> > >> performance cost of turning it on without any changes to the implementation,
> > >> then the reduced performance cost of turning it on with my changes applied, and
> > >> finally cost of turning it on with Ard's changes applied:
> > >>
> > >> arm64 (AWS Graviton3):
> > >> +-----------------+--------------+-------------+---------------+-----------------+
> > >> | Benchmark       | Result Class |   v6.18-rc5 | per-task-prng | fast-get-random |
> > >> |                 |              | rndstack-on |               |                 |
> > >> +=================+==============+=============+===============+=================+
> > >> | syscall/getpid  | mean (ns)    |  (R) 15.62% |     (R) 3.43% |      (R) 11.93% |
> > >> |                 | p99 (ns)     | (R) 155.01% |     (R) 3.20% |      (R) 11.00% |
> > >> |                 | p99.9 (ns)   | (R) 156.71% |     (R) 2.93% |      (R) 11.39% |
> > >> +-----------------+--------------+-------------+---------------+-----------------+
> > >> | syscall/getppid | mean (ns)    |  (R) 14.09% |     (R) 2.12% |      (R) 10.44% |
> > >> |                 | p99 (ns)     | (R) 152.81% |         1.55% |       (R) 9.94% |
> > >> |                 | p99.9 (ns)   | (R) 153.67% |         1.77% |       (R) 9.83% |
> > >> +-----------------+--------------+-------------+---------------+-----------------+
> > >> | syscall/invalid | mean (ns)    |  (R) 13.89% |     (R) 3.32% |      (R) 10.39% |
> > >> |                 | p99 (ns)     | (R) 165.82% |     (R) 3.51% |      (R) 10.72% |
> > >> |                 | p99.9 (ns)   | (R) 168.83% |     (R) 3.77% |      (R) 11.03% |
> > >> +-----------------+--------------+-------------+---------------+-----------------+
> > >>
> > >
> > > What does the (R) mean?
> > >
> > >> So this fixes the tail problem. I guess get_random_u8() only takes the slow path
> > >> every 768 calls, whereas get_random_u16() took it every 384 calls. I'm not sure
> > >> that fully explains it though.
> > >>
> > >> But it's still a 10% cost on average.
> > >>
> > >> Personally I think 10% syscall cost is too much to pay for 6 bits of stack
> > >> randomisation. 3% is better, but still higher than we would all prefer, I'm sure.
> > >>
> > >
> > > Interesting!
> > >
> > > So the only thing that get_random_u8() does that could explain the
> > > delta is calling into the scheduler on preempt_enable(), given that it
> > > does very little beyond that.
> > >
> > > Would you mind repeating this experiment after changing the
> > > put_cpu_var() to preempt_enable_no_resched(), to test this theory?
> >
> > This has no impact on performance.
> >
>
> Thanks. But this is really rather surprising: what else could be
> taking up that time, given that on the fast path, there are only some
> loads and stores to the buffer, and a cmpxchg64_local(). Could it be
> the latter that is causing so much latency? I suppose the local
> cmpxchg() semantics don't really exist on arm64, and this uses the
> exact same LSE instruction that would be used for an ordinary
> cmpxchg(), unlike on x86 where it appears to omit the LOCK prefix.
>

FWIW, my naive get_random_u8() benchmark slows down by 3x on x86 if I
replace cmpxchg64_local() with cmpxchg64(), so I suspect the above
comparison will look different on x86 too.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
  2025-11-27 15:03       ` Ard Biesheuvel
  2025-11-27 15:40         ` Ard Biesheuvel
@ 2025-11-27 15:56         ` Ryan Roberts
  2025-11-27 16:58           ` Mark Rutland
  2025-11-28 10:07           ` Ard Biesheuvel
  1 sibling, 2 replies; 28+ messages in thread
From: Ryan Roberts @ 2025-11-27 15:56 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Ard Biesheuvel, linux-hardening, linux-arm-kernel, linux-kernel,
	Kees Cook, Will Deacon, Arnd Bergmann, Jeremy Linton,
	Catalin Marinas, Mark Rutland, Jason A. Donenfeld

On 27/11/2025 15:03, Ard Biesheuvel wrote:
> On Thu, 27 Nov 2025 at 15:18, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 27/11/2025 12:28, Ard Biesheuvel wrote:
>>> On Thu, 27 Nov 2025 at 13:12, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 27/11/2025 09:22, Ard Biesheuvel wrote:
>>>>> From: Ard Biesheuvel <ardb@kernel.org>
>>>>>
>>>>> Ryan reports that get_random_u16() is dominant in the performance
>>>>> profiling of syscall entry when kstack randomization is enabled [0].
>>>>>
>>>>> This is the reason many architectures rely on a counter instead, and
>>>>> that, in turn, is the reason for the convoluted way the (pseudo-)entropy
>>>>> is gathered and recorded in a per-CPU variable.
>>>>>
>>>>> Let's try to make the get_random_uXX() fast path faster, and switch to
>>>>> get_random_u8() so that we'll hit the slow path 2x less often. Then,
>>>>> wire it up in the syscall entry path, replacing the per-CPU variable,
>>>>> making the logic at syscall exit redundant.
>>>>
>>>> I ran the same set of syscall benchmarks for this series as I've done for my
>>>> series.
>>>>
>>>
>>> Thanks!
>>>
>>>
>>>> The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
>>>> performance cost of turning it on without any changes to the implementation,
>>>> then the reduced performance cost of turning it on with my changes applied, and
>>>> finally cost of turning it on with Ard's changes applied:
>>>>
>>>> arm64 (AWS Graviton3):
>>>> +-----------------+--------------+-------------+---------------+-----------------+
>>>> | Benchmark       | Result Class |   v6.18-rc5 | per-task-prng | fast-get-random |
>>>> |                 |              | rndstack-on |               |                 |
>>>> +=================+==============+=============+===============+=================+
>>>> | syscall/getpid  | mean (ns)    |  (R) 15.62% |     (R) 3.43% |      (R) 11.93% |
>>>> |                 | p99 (ns)     | (R) 155.01% |     (R) 3.20% |      (R) 11.00% |
>>>> |                 | p99.9 (ns)   | (R) 156.71% |     (R) 2.93% |      (R) 11.39% |
>>>> +-----------------+--------------+-------------+---------------+-----------------+
>>>> | syscall/getppid | mean (ns)    |  (R) 14.09% |     (R) 2.12% |      (R) 10.44% |
>>>> |                 | p99 (ns)     | (R) 152.81% |         1.55% |       (R) 9.94% |
>>>> |                 | p99.9 (ns)   | (R) 153.67% |         1.77% |       (R) 9.83% |
>>>> +-----------------+--------------+-------------+---------------+-----------------+
>>>> | syscall/invalid | mean (ns)    |  (R) 13.89% |     (R) 3.32% |      (R) 10.39% |
>>>> |                 | p99 (ns)     | (R) 165.82% |     (R) 3.51% |      (R) 10.72% |
>>>> |                 | p99.9 (ns)   | (R) 168.83% |     (R) 3.77% |      (R) 11.03% |
>>>> +-----------------+--------------+-------------+---------------+-----------------+
>>>>
>>>
>>> What does the (R) mean?
>>>
>>>> So this fixes the tail problem. I guess get_random_u8() only takes the slow path
>>>> every 768 calls, whereas get_random_u16() took it every 384 calls. I'm not sure
>>>> that fully explains it though.
>>>>
>>>> But it's still a 10% cost on average.
>>>>
>>>> Personally I think 10% syscall cost is too much to pay for 6 bits of stack
>>>> randomisation. 3% is better, but still higher than we would all prefer, I'm sure.
>>>>
>>>
>>> Interesting!
>>>
>>> So the only thing that get_random_u8() does that could explain the
>>> delta is calling into the scheduler on preempt_enable(), given that it
>>> does very little beyond that.
>>>
>>> Would you mind repeating this experiment after changing the
>>> put_cpu_var() to preempt_enable_no_resched(), to test this theory?
>>
>> This has no impact on performance.
>>
> 
> Thanks. But this is really rather surprising: what else could be
> taking up that time, given that on the fast path, there are only some
> loads and stores to the buffer, and a cmpxchg64_local(). Could it be
> the latter that is causing so much latency? I suppose the local
> cmpxchg() semantics don't really exist on arm64, and this uses the
> exact same LSE instruction that would be used for an ordinary
> cmpxchg(), unlike on x86 where it appears to omit the LOCK prefix.
> 
> In any case, there is no debate that your code is faster on arm64. 

The results I have for x86 show it's faster than the rdtsc too, although that's
also somewhat surprising. I'll run your series on x86 to get the equivalent data.

> I
> also think that using prandom for this purpose is perfectly fine, even
> without reseeding: with a 2^113 period and only 6 observable bits per
> 32 bit sample, predicting the next value reliably is maybe not
> impossible, but hardly worth the extensive effort, given that we're
> not generating cryptographic keys here.
> 
> So the question is really whether we want to dedicate 16 bytes per
> task for this. I wouldn't mind personally, but it is something our
> internal QA engineers tend to obsess over.

Yeah that's a good point. Is this something we could potentially keep at the
start of the kstack? Is there any precident for keeping state there at the
moment? For arm64, I know there is a general feeling that 16K for the stack more
than enough (but we are stuck with it because 8K isn't quite enough). So it
would be "for free". I guess it would be tricky to do this in an arch-agnostic
way though...

Thanks,
Ryan



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
  2025-11-27 15:56         ` Ryan Roberts
@ 2025-11-27 16:58           ` Mark Rutland
  2025-11-27 19:01             ` Ard Biesheuvel
  2025-11-28 10:07           ` Ard Biesheuvel
  1 sibling, 1 reply; 28+ messages in thread
From: Mark Rutland @ 2025-11-27 16:58 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Ard Biesheuvel, Ard Biesheuvel, linux-hardening, linux-arm-kernel,
	linux-kernel, Kees Cook, Will Deacon, Arnd Bergmann,
	Jeremy Linton, Catalin Marinas, Jason A. Donenfeld

On Thu, Nov 27, 2025 at 03:56:59PM +0000, Ryan Roberts wrote:
> On 27/11/2025 15:03, Ard Biesheuvel wrote:
> > So the question is really whether we want to dedicate 16 bytes per
> > task for this. I wouldn't mind personally, but it is something our
> > internal QA engineers tend to obsess over.
> 
> Yeah that's a good point.

I think it's a fair point that some people will obsesses over this, but
I think the concern is misplaced.

I know that people were very happy for the kernel FPSIMD context to
disappear from task_struct, but 16 bytes is a fair amount smaller, and
I'm pretty sure we can offset that with a small/moderate amount of work.

AFAICT there are extant holes in task_struct that could easily account
for 16 bytes. I can also see a few ways to rework arm64's thread_info
and thread_struct (which are both embedded within task_struct) to save
some space.

> Is this something we could potentially keep at the start of the
> kstack? Is there any precident for keeping state there at the moment?
> For arm64, I know there is a general feeling that 16K for the stack
> more than enough (but we are stuck with it because 8K isn't quite
> enough). So it would be "for free". I guess it would be tricky to do
> this in an arch-agnostic way though...

We went out of our way to stop playing silly games like that when we
moved thread_info into task_struct; please let's not bring that back.

Mark.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
  2025-11-27 16:58           ` Mark Rutland
@ 2025-11-27 19:01             ` Ard Biesheuvel
  2025-11-28 10:36               ` Ryan Roberts
  0 siblings, 1 reply; 28+ messages in thread
From: Ard Biesheuvel @ 2025-11-27 19:01 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Ryan Roberts, Ard Biesheuvel, linux-hardening, linux-arm-kernel,
	linux-kernel, Kees Cook, Will Deacon, Arnd Bergmann,
	Jeremy Linton, Catalin Marinas, Jason A. Donenfeld

On Thu, 27 Nov 2025 at 17:58, Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Thu, Nov 27, 2025 at 03:56:59PM +0000, Ryan Roberts wrote:
> > On 27/11/2025 15:03, Ard Biesheuvel wrote:
> > > So the question is really whether we want to dedicate 16 bytes per
> > > task for this. I wouldn't mind personally, but it is something our
> > > internal QA engineers tend to obsess over.
> >
> > Yeah that's a good point.
>
> I think it's a fair point that some people will obsesses over this, but
> I think the concern is misplaced.
>
> I know that people were very happy for the kernel FPSIMD context to
> disappear from task_struct, but 16 bytes is a fair amount smaller, and
> I'm pretty sure we can offset that with a small/moderate amount of work.
>
> AFAICT there are extant holes in task_struct that could easily account
> for 16 bytes. I can also see a few ways to rework arm64's thread_info
> and thread_struct (which are both embedded within task_struct) to save
> some space.
>

Oh, I completely agree. But it is going to come up one way or the other.

> > Is this something we could potentially keep at the start of the
> > kstack? Is there any precident for keeping state there at the moment?
> > For arm64, I know there is a general feeling that 16K for the stack
> > more than enough (but we are stuck with it because 8K isn't quite
> > enough). So it would be "for free". I guess it would be tricky to do
> > this in an arch-agnostic way though...
>
> We went out of our way to stop playing silly games like that when we
> moved thread_info into task_struct; please let's not bring that back.
>

Agreed. (after just having moved the kernel mode FP/SIMD buffer from
task_struct to the kernel mode stack :-))


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
  2025-11-27 15:56         ` Ryan Roberts
  2025-11-27 16:58           ` Mark Rutland
@ 2025-11-28 10:07           ` Ard Biesheuvel
  2025-11-28 10:32             ` Ryan Roberts
  1 sibling, 1 reply; 28+ messages in thread
From: Ard Biesheuvel @ 2025-11-28 10:07 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: linux-hardening, linux-arm-kernel, linux-kernel, Kees Cook,
	Will Deacon, Arnd Bergmann, Jeremy Linton, Catalin Marinas,
	Mark Rutland, Jason A. Donenfeld

On Thu, 27 Nov 2025 at 16:57, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 27/11/2025 15:03, Ard Biesheuvel wrote:
> > On Thu, 27 Nov 2025 at 15:18, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 27/11/2025 12:28, Ard Biesheuvel wrote:
> >>> On Thu, 27 Nov 2025 at 13:12, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> On 27/11/2025 09:22, Ard Biesheuvel wrote:
> >>>>> From: Ard Biesheuvel <ardb@kernel.org>
> >>>>>
> >>>>> Ryan reports that get_random_u16() is dominant in the performance
> >>>>> profiling of syscall entry when kstack randomization is enabled [0].
> >>>>>
> >>>>> This is the reason many architectures rely on a counter instead, and
> >>>>> that, in turn, is the reason for the convoluted way the (pseudo-)entropy
> >>>>> is gathered and recorded in a per-CPU variable.
> >>>>>
> >>>>> Let's try to make the get_random_uXX() fast path faster, and switch to
> >>>>> get_random_u8() so that we'll hit the slow path 2x less often. Then,
> >>>>> wire it up in the syscall entry path, replacing the per-CPU variable,
> >>>>> making the logic at syscall exit redundant.
> >>>>
> >>>> I ran the same set of syscall benchmarks for this series as I've done for my
> >>>> series.
> >>>>
> >>>
> >>> Thanks!
> >>>
> >>>
> >>>> The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
> >>>> performance cost of turning it on without any changes to the implementation,
> >>>> then the reduced performance cost of turning it on with my changes applied, and
> >>>> finally cost of turning it on with Ard's changes applied:
> >>>>
> >>>> arm64 (AWS Graviton3):
> >>>> +-----------------+--------------+-------------+---------------+-----------------+
> >>>> | Benchmark       | Result Class |   v6.18-rc5 | per-task-prng | fast-get-random |
> >>>> |                 |              | rndstack-on |               |                 |
> >>>> +=================+==============+=============+===============+=================+
> >>>> | syscall/getpid  | mean (ns)    |  (R) 15.62% |     (R) 3.43% |      (R) 11.93% |
> >>>> |                 | p99 (ns)     | (R) 155.01% |     (R) 3.20% |      (R) 11.00% |
> >>>> |                 | p99.9 (ns)   | (R) 156.71% |     (R) 2.93% |      (R) 11.39% |
> >>>> +-----------------+--------------+-------------+---------------+-----------------+
> >>>> | syscall/getppid | mean (ns)    |  (R) 14.09% |     (R) 2.12% |      (R) 10.44% |
> >>>> |                 | p99 (ns)     | (R) 152.81% |         1.55% |       (R) 9.94% |
> >>>> |                 | p99.9 (ns)   | (R) 153.67% |         1.77% |       (R) 9.83% |
> >>>> +-----------------+--------------+-------------+---------------+-----------------+
> >>>> | syscall/invalid | mean (ns)    |  (R) 13.89% |     (R) 3.32% |      (R) 10.39% |
> >>>> |                 | p99 (ns)     | (R) 165.82% |     (R) 3.51% |      (R) 10.72% |
> >>>> |                 | p99.9 (ns)   | (R) 168.83% |     (R) 3.77% |      (R) 11.03% |
> >>>> +-----------------+--------------+-------------+---------------+-----------------+
> >>>>
> >>>
> >>> What does the (R) mean?
> >>>
> >>>> So this fixes the tail problem. I guess get_random_u8() only takes the slow path
> >>>> every 768 calls, whereas get_random_u16() took it every 384 calls. I'm not sure
> >>>> that fully explains it though.
> >>>>
> >>>> But it's still a 10% cost on average.
> >>>>
> >>>> Personally I think 10% syscall cost is too much to pay for 6 bits of stack
> >>>> randomisation. 3% is better, but still higher than we would all prefer, I'm sure.
> >>>>
> >>>
> >>> Interesting!
> >>>
> >>> So the only thing that get_random_u8() does that could explain the
> >>> delta is calling into the scheduler on preempt_enable(), given that it
> >>> does very little beyond that.
> >>>
> >>> Would you mind repeating this experiment after changing the
> >>> put_cpu_var() to preempt_enable_no_resched(), to test this theory?
> >>
> >> This has no impact on performance.
> >>
> >
> > Thanks. But this is really rather surprising: what else could be
> > taking up that time, given that on the fast path, there are only some
> > loads and stores to the buffer, and a cmpxchg64_local(). Could it be
> > the latter that is causing so much latency? I suppose the local
> > cmpxchg() semantics don't really exist on arm64, and this uses the
> > exact same LSE instruction that would be used for an ordinary
> > cmpxchg(), unlike on x86 where it appears to omit the LOCK prefix.
> >
> > In any case, there is no debate that your code is faster on arm64.
>
> The results I have for x86 show it's faster than the rdtsc too, although that's
> also somewhat surprising. I'll run your series on x86 to get the equivalent data.
>

OK, brown paper bag time ...

I swapped the order of the 'old' and 'new' cmpxchg64_local()
arguments, resulting in some very odd behavior. I think this explains
why the tail latency was eliminated entirely, which is bizarre.

The speedup is also more modest now (~2x), which may still be
worthwhile, but likely insufficient for the kstack randomization case.

https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=lockless-random-v2


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
  2025-11-28 10:07           ` Ard Biesheuvel
@ 2025-11-28 10:32             ` Ryan Roberts
  2025-11-28 10:36               ` Ard Biesheuvel
  0 siblings, 1 reply; 28+ messages in thread
From: Ryan Roberts @ 2025-11-28 10:32 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-hardening, linux-arm-kernel, linux-kernel, Kees Cook,
	Will Deacon, Arnd Bergmann, Jeremy Linton, Catalin Marinas,
	Mark Rutland, Jason A. Donenfeld

On 28/11/2025 10:07, Ard Biesheuvel wrote:
> On Thu, 27 Nov 2025 at 16:57, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 27/11/2025 15:03, Ard Biesheuvel wrote:
>>> On Thu, 27 Nov 2025 at 15:18, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 27/11/2025 12:28, Ard Biesheuvel wrote:
>>>>> On Thu, 27 Nov 2025 at 13:12, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> On 27/11/2025 09:22, Ard Biesheuvel wrote:
>>>>>>> From: Ard Biesheuvel <ardb@kernel.org>
>>>>>>>
>>>>>>> Ryan reports that get_random_u16() is dominant in the performance
>>>>>>> profiling of syscall entry when kstack randomization is enabled [0].
>>>>>>>
>>>>>>> This is the reason many architectures rely on a counter instead, and
>>>>>>> that, in turn, is the reason for the convoluted way the (pseudo-)entropy
>>>>>>> is gathered and recorded in a per-CPU variable.
>>>>>>>
>>>>>>> Let's try to make the get_random_uXX() fast path faster, and switch to
>>>>>>> get_random_u8() so that we'll hit the slow path 2x less often. Then,
>>>>>>> wire it up in the syscall entry path, replacing the per-CPU variable,
>>>>>>> making the logic at syscall exit redundant.
>>>>>>
>>>>>> I ran the same set of syscall benchmarks for this series as I've done for my
>>>>>> series.
>>>>>>
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>> The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
>>>>>> performance cost of turning it on without any changes to the implementation,
>>>>>> then the reduced performance cost of turning it on with my changes applied, and
>>>>>> finally cost of turning it on with Ard's changes applied:
>>>>>>
>>>>>> arm64 (AWS Graviton3):
>>>>>> +-----------------+--------------+-------------+---------------+-----------------+
>>>>>> | Benchmark       | Result Class |   v6.18-rc5 | per-task-prng | fast-get-random |
>>>>>> |                 |              | rndstack-on |               |                 |
>>>>>> +=================+==============+=============+===============+=================+
>>>>>> | syscall/getpid  | mean (ns)    |  (R) 15.62% |     (R) 3.43% |      (R) 11.93% |
>>>>>> |                 | p99 (ns)     | (R) 155.01% |     (R) 3.20% |      (R) 11.00% |
>>>>>> |                 | p99.9 (ns)   | (R) 156.71% |     (R) 2.93% |      (R) 11.39% |
>>>>>> +-----------------+--------------+-------------+---------------+-----------------+
>>>>>> | syscall/getppid | mean (ns)    |  (R) 14.09% |     (R) 2.12% |      (R) 10.44% |
>>>>>> |                 | p99 (ns)     | (R) 152.81% |         1.55% |       (R) 9.94% |
>>>>>> |                 | p99.9 (ns)   | (R) 153.67% |         1.77% |       (R) 9.83% |
>>>>>> +-----------------+--------------+-------------+---------------+-----------------+
>>>>>> | syscall/invalid | mean (ns)    |  (R) 13.89% |     (R) 3.32% |      (R) 10.39% |
>>>>>> |                 | p99 (ns)     | (R) 165.82% |     (R) 3.51% |      (R) 10.72% |
>>>>>> |                 | p99.9 (ns)   | (R) 168.83% |     (R) 3.77% |      (R) 11.03% |
>>>>>> +-----------------+--------------+-------------+---------------+-----------------+
>>>>>>
>>>>>
>>>>> What does the (R) mean?
>>>>>
>>>>>> So this fixes the tail problem. I guess get_random_u8() only takes the slow path
>>>>>> every 768 calls, whereas get_random_u16() took it every 384 calls. I'm not sure
>>>>>> that fully explains it though.
>>>>>>
>>>>>> But it's still a 10% cost on average.
>>>>>>
>>>>>> Personally I think 10% syscall cost is too much to pay for 6 bits of stack
>>>>>> randomisation. 3% is better, but still higher than we would all prefer, I'm sure.
>>>>>>
>>>>>
>>>>> Interesting!
>>>>>
>>>>> So the only thing that get_random_u8() does that could explain the
>>>>> delta is calling into the scheduler on preempt_enable(), given that it
>>>>> does very little beyond that.
>>>>>
>>>>> Would you mind repeating this experiment after changing the
>>>>> put_cpu_var() to preempt_enable_no_resched(), to test this theory?
>>>>
>>>> This has no impact on performance.
>>>>
>>>
>>> Thanks. But this is really rather surprising: what else could be
>>> taking up that time, given that on the fast path, there are only some
>>> loads and stores to the buffer, and a cmpxchg64_local(). Could it be
>>> the latter that is causing so much latency? I suppose the local
>>> cmpxchg() semantics don't really exist on arm64, and this uses the
>>> exact same LSE instruction that would be used for an ordinary
>>> cmpxchg(), unlike on x86 where it appears to omit the LOCK prefix.
>>>
>>> In any case, there is no debate that your code is faster on arm64.
>>
>> The results I have for x86 show it's faster than the rdtsc too, although that's
>> also somewhat surprising. I'll run your series on x86 to get the equivalent data.
>>
> 
> OK, brown paper bag time ...
> 
> I swapped the order of the 'old' and 'new' cmpxchg64_local()
> arguments, resulting in some very odd behavior. I think this explains
> why the tail latency was eliminated entirely, which is bizarre.
> 
> The speedup is also more modest now (~2x), which may still be
> worthwhile, but likely insufficient for the kstack randomization case.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=lockless-random-v2

Ahh, so we were never taking the slow path. That would definitely explain it.

I had a go at running this on x86 but couldn't even get the kernel to boot on my
AWS Sapphire Rapids instance. Unfortunately I don't have access to the serial
console so can't tell why it failed. But I used the exact same procedure and
baseline for other runs so the only difference is your change.

I wonder if this issue somehow breaks the boot on that platform?

Anyway, I'll chuck this update at the benchmarks, but probably won't be until
next week...



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
  2025-11-27 19:01             ` Ard Biesheuvel
@ 2025-11-28 10:36               ` Ryan Roberts
  2025-11-28 11:44                 ` Mark Rutland
  0 siblings, 1 reply; 28+ messages in thread
From: Ryan Roberts @ 2025-11-28 10:36 UTC (permalink / raw)
  To: Ard Biesheuvel, Mark Rutland
  Cc: Ard Biesheuvel, linux-hardening, linux-arm-kernel, linux-kernel,
	Kees Cook, Will Deacon, Arnd Bergmann, Jeremy Linton,
	Catalin Marinas, Jason A. Donenfeld

On 27/11/2025 19:01, Ard Biesheuvel wrote:
> On Thu, 27 Nov 2025 at 17:58, Mark Rutland <mark.rutland@arm.com> wrote:
>>
>> On Thu, Nov 27, 2025 at 03:56:59PM +0000, Ryan Roberts wrote:
>>> On 27/11/2025 15:03, Ard Biesheuvel wrote:
>>>> So the question is really whether we want to dedicate 16 bytes per
>>>> task for this. I wouldn't mind personally, but it is something our
>>>> internal QA engineers tend to obsess over.
>>>
>>> Yeah that's a good point.
>>
>> I think it's a fair point that some people will obsesses over this, but
>> I think the concern is misplaced.
>>
>> I know that people were very happy for the kernel FPSIMD context to
>> disappear from task_struct, but 16 bytes is a fair amount smaller, and
>> I'm pretty sure we can offset that with a small/moderate amount of work.
>>
>> AFAICT there are extant holes in task_struct that could easily account
>> for 16 bytes. I can also see a few ways to rework arm64's thread_info
>> and thread_struct (which are both embedded within task_struct) to save
>> some space.
>>
> 
> Oh, I completely agree. But it is going to come up one way or the other.

I'm always terrified of changing the layout of those god structs for fear of
accidentally breaking some cacheline clustering-based micro optimization.
Putting new variables into existing holes is one thing, but rearranging existing
data scares me - perhaps I'm being too cautious. I assumed there wouldn't be an
existing hole big enough for 16 bytes.

> 
>>> Is this something we could potentially keep at the start of the
>>> kstack? Is there any precident for keeping state there at the moment?
>>> For arm64, I know there is a general feeling that 16K for the stack
>>> more than enough (but we are stuck with it because 8K isn't quite
>>> enough). So it would be "for free". I guess it would be tricky to do
>>> this in an arch-agnostic way though...
>>
>> We went out of our way to stop playing silly games like that when we
>> moved thread_info into task_struct; please let's not bring that back.
>>

OK fair enough.

> 
> Agreed. (after just having moved the kernel mode FP/SIMD buffer from
> task_struct to the kernel mode stack :-))



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
  2025-11-28 10:32             ` Ryan Roberts
@ 2025-11-28 10:36               ` Ard Biesheuvel
  0 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2025-11-28 10:36 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: linux-hardening, linux-arm-kernel, linux-kernel, Kees Cook,
	Will Deacon, Arnd Bergmann, Jeremy Linton, Catalin Marinas,
	Mark Rutland, Jason A. Donenfeld

On Fri, 28 Nov 2025 at 11:32, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 28/11/2025 10:07, Ard Biesheuvel wrote:
> > On Thu, 27 Nov 2025 at 16:57, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 27/11/2025 15:03, Ard Biesheuvel wrote:
> >>> On Thu, 27 Nov 2025 at 15:18, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> On 27/11/2025 12:28, Ard Biesheuvel wrote:
> >>>>> On Thu, 27 Nov 2025 at 13:12, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>
> >>>>>> On 27/11/2025 09:22, Ard Biesheuvel wrote:
> >>>>>>> From: Ard Biesheuvel <ardb@kernel.org>
> >>>>>>>
> >>>>>>> Ryan reports that get_random_u16() is dominant in the performance
> >>>>>>> profiling of syscall entry when kstack randomization is enabled [0].
> >>>>>>>
> >>>>>>> This is the reason many architectures rely on a counter instead, and
> >>>>>>> that, in turn, is the reason for the convoluted way the (pseudo-)entropy
> >>>>>>> is gathered and recorded in a per-CPU variable.
> >>>>>>>
> >>>>>>> Let's try to make the get_random_uXX() fast path faster, and switch to
> >>>>>>> get_random_u8() so that we'll hit the slow path 2x less often. Then,
> >>>>>>> wire it up in the syscall entry path, replacing the per-CPU variable,
> >>>>>>> making the logic at syscall exit redundant.
> >>>>>>
> >>>>>> I ran the same set of syscall benchmarks for this series as I've done for my
> >>>>>> series.
> >>>>>>
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>>
> >>>>>> The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
> >>>>>> performance cost of turning it on without any changes to the implementation,
> >>>>>> then the reduced performance cost of turning it on with my changes applied, and
> >>>>>> finally cost of turning it on with Ard's changes applied:
> >>>>>>
> >>>>>> arm64 (AWS Graviton3):
> >>>>>> +-----------------+--------------+-------------+---------------+-----------------+
> >>>>>> | Benchmark       | Result Class |   v6.18-rc5 | per-task-prng | fast-get-random |
> >>>>>> |                 |              | rndstack-on |               |                 |
> >>>>>> +=================+==============+=============+===============+=================+
> >>>>>> | syscall/getpid  | mean (ns)    |  (R) 15.62% |     (R) 3.43% |      (R) 11.93% |
> >>>>>> |                 | p99 (ns)     | (R) 155.01% |     (R) 3.20% |      (R) 11.00% |
> >>>>>> |                 | p99.9 (ns)   | (R) 156.71% |     (R) 2.93% |      (R) 11.39% |
> >>>>>> +-----------------+--------------+-------------+---------------+-----------------+
> >>>>>> | syscall/getppid | mean (ns)    |  (R) 14.09% |     (R) 2.12% |      (R) 10.44% |
> >>>>>> |                 | p99 (ns)     | (R) 152.81% |         1.55% |       (R) 9.94% |
> >>>>>> |                 | p99.9 (ns)   | (R) 153.67% |         1.77% |       (R) 9.83% |
> >>>>>> +-----------------+--------------+-------------+---------------+-----------------+
> >>>>>> | syscall/invalid | mean (ns)    |  (R) 13.89% |     (R) 3.32% |      (R) 10.39% |
> >>>>>> |                 | p99 (ns)     | (R) 165.82% |     (R) 3.51% |      (R) 10.72% |
> >>>>>> |                 | p99.9 (ns)   | (R) 168.83% |     (R) 3.77% |      (R) 11.03% |
> >>>>>> +-----------------+--------------+-------------+---------------+-----------------+
> >>>>>>
> >>>>>
> >>>>> What does the (R) mean?
> >>>>>
> >>>>>> So this fixes the tail problem. I guess get_random_u8() only takes the slow path
> >>>>>> every 768 calls, whereas get_random_u16() took it every 384 calls. I'm not sure
> >>>>>> that fully explains it though.
> >>>>>>
> >>>>>> But it's still a 10% cost on average.
> >>>>>>
> >>>>>> Personally I think 10% syscall cost is too much to pay for 6 bits of stack
> >>>>>> randomisation. 3% is better, but still higher than we would all prefer, I'm sure.
> >>>>>>
> >>>>>
> >>>>> Interesting!
> >>>>>
> >>>>> So the only thing that get_random_u8() does that could explain the
> >>>>> delta is calling into the scheduler on preempt_enable(), given that it
> >>>>> does very little beyond that.
> >>>>>
> >>>>> Would you mind repeating this experiment after changing the
> >>>>> put_cpu_var() to preempt_enable_no_resched(), to test this theory?
> >>>>
> >>>> This has no impact on performance.
> >>>>
> >>>
> >>> Thanks. But this is really rather surprising: what else could be
> >>> taking up that time, given that on the fast path, there are only some
> >>> loads and stores to the buffer, and a cmpxchg64_local(). Could it be
> >>> the latter that is causing so much latency? I suppose the local
> >>> cmpxchg() semantics don't really exist on arm64, and this uses the
> >>> exact same LSE instruction that would be used for an ordinary
> >>> cmpxchg(), unlike on x86 where it appears to omit the LOCK prefix.
> >>>
> >>> In any case, there is no debate that your code is faster on arm64.
> >>
> >> The results I have for x86 show it's faster than the rdtsc too, although that's
> >> also somewhat surprising. I'll run your series on x86 to get the equivalent data.
> >>
> >
> > OK, brown paper bag time ...
> >
> > I swapped the order of the 'old' and 'new' cmpxchg64_local()
> > arguments, resulting in some very odd behavior. I think this explains
> > why the tail latency was eliminated entirely, which is bizarre.
> >
> > The speedup is also more modest now (~2x), which may still be
> > worthwhile, but likely insufficient for the kstack randomization case.
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=lockless-random-v2
>
> Ahh, so we were never taking the slow path. That would definitely explain it.
>
> I had a go at running this on x86 but couldn't even get the kernel to boot on my
> AWS Sapphire Rapids instance. Unfortunately I don't have access to the serial
> console so can't tell why it failed. But I used the exact same procedure and
> baseline for other runs so the only difference is your change.
>
> I wonder if this issue somehow breaks the boot on that platform?
>

It does. That is how I noticed myself :-)

init_cea_offsets() fails to make progress because
get_random_u32_below() returns the same value every time.
Interestingly, it didn't trigger with KASLR disabled, which made the
debugging sessions all the more fun ...

> Anyway, I'll chuck this update at the benchmarks, but probably won't be until
> next week...
>

Sure. thanks.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/RFT PATCH 5/6] random: Plug race in preceding patch
  2025-11-27  9:22 ` [RFC/RFT PATCH 5/6] random: Plug race in preceding patch Ard Biesheuvel
@ 2025-11-28 11:13   ` david laight
  2025-11-28 11:18     ` Ard Biesheuvel
  0 siblings, 1 reply; 28+ messages in thread
From: david laight @ 2025-11-28 11:13 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-hardening, linux-arm-kernel, linux-kernel, Ard Biesheuvel,
	Kees Cook, Ryan Roberts, Will Deacon, Arnd Bergmann,
	Jeremy Linton, Catalin Marinas, Mark Rutland, Jason A. Donenfeld

On Thu, 27 Nov 2025 10:22:32 +0100
Ard Biesheuvel <ardb+git@google.com> wrote:

> From: Ard Biesheuvel <ardb@kernel.org>
> 
> The lockless get_random_uXX() reads the next value from the linear
> buffer and then overwrites it with a 0x0 value. This is racy, as the
> code might be re-entered by an interrupt handler, and so the store might
> redundantly wipe the location accessed by the interrupt context rather
> than the interrupted context.

Is overwriting the used value even useful?
If someone manages to read the 'last' value, then they can equally read
the 'next' one - which is likely to be just as useful.
Moreover the zeros tell anyone who has managed the access the buffer
which entry will be used next - without having to find 'batch->position'.

There is probably more to gain from putting the control data in a
completely different piece of memory from the buffer.

	David

> 
> To plug this race, wipe the preceding location when reading the next
> value from the linear buffer. Given that the position is always non-zero
> outside of the critical section, this is guaranteed to be safe, and
> ensures that the produced values are always wiped from the buffer.
> 
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  drivers/char/random.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index 71bd74871540..e8ba460c5c9c 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -547,6 +547,7 @@ type get_random_ ##type(void)							\
>  	next = (u64)next_gen << 32;						\
>  	if (likely(batch->position < ARRAY_SIZE(batch->entropy))) {		\
>  		next |=	batch->position + 1; /* next-1 is bogus otherwise */	\
> +		batch->entropy[batch->position - 1] = 0;			\
>  		ret = batch->entropy[batch->position];				\
>  	}									\
>  	if (cmpxchg64_local(&batch->posgen, next, next - 1) != next - 1) {	\



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/RFT PATCH 5/6] random: Plug race in preceding patch
  2025-11-28 11:13   ` david laight
@ 2025-11-28 11:18     ` Ard Biesheuvel
  0 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2025-11-28 11:18 UTC (permalink / raw)
  To: david laight
  Cc: Ard Biesheuvel, linux-hardening, linux-arm-kernel, linux-kernel,
	Kees Cook, Ryan Roberts, Will Deacon, Arnd Bergmann,
	Jeremy Linton, Catalin Marinas, Mark Rutland, Jason A. Donenfeld

On Fri, 28 Nov 2025 at 12:13, david laight <david.laight@runbox.com> wrote:
>
> On Thu, 27 Nov 2025 10:22:32 +0100
> Ard Biesheuvel <ardb+git@google.com> wrote:
>
> > From: Ard Biesheuvel <ardb@kernel.org>
> >
> > The lockless get_random_uXX() reads the next value from the linear
> > buffer and then overwrites it with a 0x0 value. This is racy, as the
> > code might be re-entered by an interrupt handler, and so the store might
> > redundantly wipe the location accessed by the interrupt context rather
> > than the interrupted context.
>
> Is overwriting the used value even useful?

That is an interesting question but it is orthogonal to this series.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack
  2025-11-28 10:36               ` Ryan Roberts
@ 2025-11-28 11:44                 ` Mark Rutland
  0 siblings, 0 replies; 28+ messages in thread
From: Mark Rutland @ 2025-11-28 11:44 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Ard Biesheuvel, Ard Biesheuvel, linux-hardening, linux-arm-kernel,
	linux-kernel, Kees Cook, Will Deacon, Arnd Bergmann,
	Jeremy Linton, Catalin Marinas, Jason A. Donenfeld

On Fri, Nov 28, 2025 at 10:36:13AM +0000, Ryan Roberts wrote:
> On 27/11/2025 19:01, Ard Biesheuvel wrote:
> > On Thu, 27 Nov 2025 at 17:58, Mark Rutland <mark.rutland@arm.com> wrote:
> >>
> >> On Thu, Nov 27, 2025 at 03:56:59PM +0000, Ryan Roberts wrote:
> >>> On 27/11/2025 15:03, Ard Biesheuvel wrote:
> >>>> So the question is really whether we want to dedicate 16 bytes per
> >>>> task for this. I wouldn't mind personally, but it is something our
> >>>> internal QA engineers tend to obsess over.
> >>>
> >>> Yeah that's a good point.
> >>
> >> I think it's a fair point that some people will obsesses over this, but
> >> I think the concern is misplaced.
> >>
> >> I know that people were very happy for the kernel FPSIMD context to
> >> disappear from task_struct, but 16 bytes is a fair amount smaller, and
> >> I'm pretty sure we can offset that with a small/moderate amount of work.
> >>
> >> AFAICT there are extant holes in task_struct that could easily account
> >> for 16 bytes. I can also see a few ways to rework arm64's thread_info
> >> and thread_struct (which are both embedded within task_struct) to save
> >> some space.
> > 
> > Oh, I completely agree. But it is going to come up one way or the other.
> 
> I'm always terrified of changing the layout of those god structs for fear of
> accidentally breaking some cacheline clustering-based micro optimization.
> Putting new variables into existing holes is one thing, but rearranging existing
> data scares me - perhaps I'm being too cautious. I assumed there wouldn't be an
> existing hole big enough for 16 bytes.

FWIW, ignoring holes, the trailing padding is pretty big. In v6.18-rc1
defconfig task_struct appears to have ~40 bytes of padding due to
64-byte alignment. So (in that configuration) adding 16 bytes doesn't
actually increase the size of the structure.

I have a few specific changes in mind which could ammortize 16 bytes, so
even if this turns out to be an issue, we can make good.

For example, I'd like to change arm64's FPSIMD/SVE/SME context switch
to remove the opportunistic reuse of context after A->B->A migration.
That would remove the need for 'fpsimd_cpu' and 'kernel_fpsimd_cpu' in
thread struct (which is embedded within task struct, at the end), saving
8 bytes.

If we change the way we encod the 'vl' and 'vl_onexec' array elements,
we can shrink those from unsigned int down to u8, which would sve 12
bytes.

Mark.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC/RFT PATCH 2/6] arc: Wire up cmpxchg64_local() to generic implementation
  2025-11-27 15:06   ` Joey Gouly
@ 2025-12-03 16:59     ` Ard Biesheuvel
  0 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2025-12-03 16:59 UTC (permalink / raw)
  To: Joey Gouly
  Cc: Ard Biesheuvel, linux-hardening, linux-arm-kernel, linux-kernel,
	Kees Cook, Ryan Roberts, Will Deacon, Arnd Bergmann,
	Jeremy Linton, Catalin Marinas, Mark Rutland, Jason A. Donenfeld

On Thu, 27 Nov 2025 at 16:06, Joey Gouly <joey.gouly@arm.com> wrote:
>
> On Thu, Nov 27, 2025 at 10:22:29AM +0100, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel <ardb@kernel.org>
> >
> > Provide cmpxchg64_local() for hexagon so we can start using it in
> > generic code.
>
> nit: this is for arc, not hexagon.
>

Indeed, thanks!


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2025-12-03 16:59 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-27  9:22 [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack Ard Biesheuvel
2025-11-27  9:22 ` [RFC/RFT PATCH 1/6] hexagon: Wire up cmpxchg64_local() to generic implementation Ard Biesheuvel
2025-11-27  9:22 ` [RFC/RFT PATCH 2/6] arc: " Ard Biesheuvel
2025-11-27 15:06   ` Joey Gouly
2025-12-03 16:59     ` Ard Biesheuvel
2025-11-27  9:22 ` [RFC/RFT PATCH 3/6] random: Use u32 to keep track of batched entropy generation Ard Biesheuvel
2025-11-27 10:11   ` david laight
2025-11-27 10:15     ` Ard Biesheuvel
2025-11-27  9:22 ` [RFC/RFT PATCH 4/6] random: Use a lockless fast path for get_random_uXX() Ard Biesheuvel
2025-11-27 10:32   ` Ard Biesheuvel
2025-11-27  9:22 ` [RFC/RFT PATCH 5/6] random: Plug race in preceding patch Ard Biesheuvel
2025-11-28 11:13   ` david laight
2025-11-28 11:18     ` Ard Biesheuvel
2025-11-27  9:22 ` [RFC/RFT PATCH 6/6] randomize_kstack: Use get_random_u8() at entry for entropy Ard Biesheuvel
2025-11-27 12:12 ` [RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack Ryan Roberts
2025-11-27 12:28   ` Ard Biesheuvel
2025-11-27 13:08     ` Ryan Roberts
2025-11-27 14:18     ` Ryan Roberts
2025-11-27 15:03       ` Ard Biesheuvel
2025-11-27 15:40         ` Ard Biesheuvel
2025-11-27 15:56         ` Ryan Roberts
2025-11-27 16:58           ` Mark Rutland
2025-11-27 19:01             ` Ard Biesheuvel
2025-11-28 10:36               ` Ryan Roberts
2025-11-28 11:44                 ` Mark Rutland
2025-11-28 10:07           ` Ard Biesheuvel
2025-11-28 10:32             ` Ryan Roberts
2025-11-28 10:36               ` Ard Biesheuvel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).