* Re: [DISCUSSION] kstack offset randomization: bugs and performance
[not found] <66c4e2a0-c7fb-46c2-acce-8a040a71cd8e@arm.com>
@ 2025-11-17 11:31 ` Ryan Roberts
2025-11-17 16:47 ` Arnd Bergmann
` (3 more replies)
0 siblings, 4 replies; 27+ messages in thread
From: Ryan Roberts @ 2025-11-17 11:31 UTC (permalink / raw)
To: Kees Cook, Arnd Bergmann, Ard Biesheuvel, Jeremy Linton,
Will Deacon, Catalin Marinas, Mark Rutland
Cc: linux-arm-kernel@lists.infradead.org, Linux Kernel Mailing List
Sorry; forgot to add Mark and the lists!
On 17/11/2025 11:30, Ryan Roberts wrote:
> Hi All,
>
> Over the last few years we had a few complaints that syscall performance on
> arm64 is slower than x86. Most recently, it was observed that a certain Java
> benchmark that does a lot of fstat and lseek is spending ~10% of its time in
> get_random_u16(). Cue a bit of digging, which led me to [1] and also to some new
> ideas about how performance could be improved.
>
> But I'll get to the performance angle in a bit. First, I want to discuss some
> bugs that I believe I have uncovered during code review...
>
>
> Bug 1: We have the following pattern:
>
> add_random_kstack_offset()
> enable_interrupts_and_preemption()
> do_syscall()
> disable_interrupts_and_preemption()
> choose_random_kstack_offset(random)
>
> Where add_random_kstack_offset() adds an offset to the stack that was chosen by
> a previous call to choose_random_kstack_offset() and stored in a per-cpu
> variable. But since preemption is enabled during the syscall, surely an attacker
> could defeat this by arranging for the thread to be preempted or migrated while
> executing the syscall? That way the new offset is calculated for a different CPU
> and a subsequent syscall on the original CPU will use the original offset?
>
> I think we could just pass the random seed to add_random_kstack_offset() so that
> we consume the old and buffer the new atomically? We would still buffer it
> across syscalls to avoid the guessability issue that's documented. Then
> choose_random_kstack_offset() could be removed. Or we could store the value
> per-task_struct, given it is only 32 bits?
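As a minimal sketch of the per-task variant (assuming a hypothetical 32-bit
kstack_offset field in task_struct; names and details are illustrative, not a
proposed patch), consuming the old offset and buffering the new one becomes
naturally atomic with respect to preemption and migration, because the state
travels with the task:

/* Hypothetical field added to struct task_struct: */
	u32			kstack_offset;

#define add_random_kstack_offset(rand) do {				\
	if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,\
				&randomize_kstack_offset)) {		\
		u32 offset = current->kstack_offset;			\
		u8 *ptr;						\
									\
		/* Consume the old offset for this syscall... */	\
		ptr = __kstack_alloca(KSTACK_OFFSET_MAX(offset));	\
		/* Keep allocation even after "ptr" loses scope. */	\
		asm volatile("" :: "r"(ptr) : "memory");		\
		/* ...and buffer the new one for the next syscall. */	\
		current->kstack_offset = ror32(offset, 5) ^ (rand);	\
	}								\
} while (0)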
>
>
> Bug 2: add_random_kstack_offset() and choose_random_kstack_offset() both
> document their requirement to be called with interrupts and preemption disabled.
> They use raw_cpu_*(), which require this. But on arm64, they are called from
> invoke_syscall(), where interrupts and preemption are _enabled_. In practice, I
> don't think this will cause functional harm for arm64's implementations of
> raw_cpu_*(), but it means that it's possible the wrong per-cpu structure is
> being referred to. Perhaps there is a way for user code to exploit this to
> defeat the purpose of the feature.
>
> This should be straightforward to fix; if we take the task_struct approach for
> bug 1, then that would fix this issue too, because the requirement to be in
> atomic context goes away. Otherwise it can be moved earlier in the callchain,
> before interrupts are enabled.
>
>
> Then we get to the performance aspect...
>
> arm64 uses get_random_u16() to get 16 bits from a per-cpu entropy buffer that
> originally came from the crng. get_random_u16() does
> local_lock_irqsave()/local_unlock_irqrestore() inside every call (both the
> fastpath and the slow path). It turns out that this locking/unlocking accounts
> for 30%-50% of the total cost of kstack offset randomization. By introducing a
> new raw_try_get_random_uX() helper that's called from a context where irqs are
> disabled, I can eliminate that cost. (I also plan to dig into exactly why it's
> costing so much).
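As a rough illustration of what such a helper might look like (a sketch only:
the struct and per-cpu variable names follow the style of the batched-entropy
code in drivers/char/random.c, but this is not taken from an actual patch), the
fast path would consume one byte from the per-cpu batch with no local_lock,
relying on the caller already having IRQs disabled, and return false to send
the caller to the slow path:

bool raw_try_get_random_u8(u8 *out)
{
	struct batch_u8 *batch;

	/* Caller guarantees IRQs are off, so no local_lock is needed. */
	lockdep_assert_irqs_disabled();

	batch = raw_cpu_ptr(&batched_entropy_u8);
	if (batch->position >= ARRAY_SIZE(batch->entropy) ||
	    batch->generation != READ_ONCE(base_crng.generation))
		return false;	/* empty or stale: fall back to the slow path */

	*out = batch->entropy[batch->position];
	batch->entropy[batch->position++] = 0;	/* erase consumed entropy */
	return true;
}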
>
> Furthermore, given we are actually only using 6 bits of entropy per syscall, we
> could instead just request a u8 instead of a u16 and only throw away 2 bits
> instead of 10 bits. This means we drain the entropy buffer half as quickly and
> make half as many slow calls into the crng:
>
> +-----------+---------------+--------------+-----------------+-----------------+
> | Benchmark | randomize=off | randomize=on | + no local_lock | + get_random_u8 |
> | | (baseline) | | | |
> +===========+===============+==============+=================+=================+
> | getpid | 0.19 | (R) -11.43% | (R) -8.41% | (R) -5.97% |
> +-----------+---------------+--------------+-----------------+-----------------+
> | getppid | 0.19 | (R) -13.81% | (R) -7.83% | (R) -6.14% |
> +-----------+---------------+--------------+-----------------+-----------------+
> | invalid | 0.18 | (R) -12.22% | (R) -5.55% | (R) -3.70% |
> +-----------+---------------+--------------+-----------------+-----------------+
>
> I expect we could even choose to re-buffer and save those 2 bits so we call the
> slow path even less often.
>
> I believe this helps the mean latency significantly without sacrificing any
> strength. But it doesn't reduce the tail latency because we still have to call
> into the crng eventually.
>
> So here's another idea: Could we use siphash to generate some random bits? We
> would generate the secret key at boot using the crng. Then generate a 64 bit
> siphash of (cntvct_el0 ^ tweak) (where tweak increments every time we generate a
> new hash). As long as the key remains secret, the hash is unpredictable.
> (perhaps we don't even need the timer value). For every hash we get 64 bits, so
> that would last for 10 syscalls at 6 bits per call. We would still have to
> call siphash every 10 syscalls, so there would still be a tail, but from my
> experiments, it's much less than the crng:
>
> +-----------+---------------+--------------+-----------------+-----------------+
> | Benchmark | randomize=off | randomize=on | siphash | Jeremy's prng |
> | | (baseline) | | | |
> +===========+===============+==============+=================+=================+
> | getpid | 0.19 | (R) -11.43% | (R) -5.74% | (R) -2.06% |
> +-----------+---------------+--------------+-----------------+-----------------+
> | getppid | 0.19 | (R) -13.81% | (R) -3.39% | (R) -2.59% |
> +-----------+---------------+--------------+-----------------+-----------------+
> | invalid | 0.18 | (R) -12.22% | (R) -2.43% | -1.31% |
> +-----------+---------------+--------------+-----------------+-----------------+
>
> Could this give us a middle ground between strong-crng and
> weak-timestamp-counter? Perhaps the main issue is that we need to store the
> secret key for a long period?
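For illustration, a sketch of the buffered-siphash idea described above (the
function, struct and variable names here are made up; siphash_1u64(),
get_random_bytes() and read_sysreg() are the existing kernel APIs assumed):

#include <linux/init.h>
#include <linux/percpu.h>
#include <linux/random.h>
#include <linux/siphash.h>

static siphash_key_t kstack_sip_key __ro_after_init;

struct kstack_sip_state {
	u64 buf;	/* buffered siphash output */
	u32 bits;	/* unused bits left in buf */
	u64 tweak;	/* incremented per rehash */
};
static DEFINE_PER_CPU(struct kstack_sip_state, kstack_sip);

static int __init kstack_sip_init(void)
{
	get_random_bytes(&kstack_sip_key, sizeof(kstack_sip_key));
	return 0;
}
early_initcall(kstack_sip_init);

/* Caller is assumed to have preemption/IRQs disabled. */
static u8 kstack_sip_rand6(void)
{
	struct kstack_sip_state *s = raw_cpu_ptr(&kstack_sip);
	u8 rand;

	if (s->bits < 6) {
		/* cntvct_el0 read via read_sysreg() from <asm/sysreg.h>. */
		u64 in = read_sysreg(cntvct_el0) ^ s->tweak++;

		s->buf = siphash_1u64(in, &kstack_sip_key);
		s->bits = 64;
	}

	rand = s->buf & 0x3f;
	s->buf >>= 6;
	s->bits -= 6;
	return rand;
}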
>
>
> Anyway, I plan to work up a series with the bugfixes and performance
> improvements. I'll add the siphash approach as an experimental addition and get
> some more detailed numbers for all the options. But wanted to raise it all here
> first to get any early feedback.
>
>
> [1]
> https://lore.kernel.org/linux-arm-kernel/20240305221824.3300322-1-jeremy.linton@arm.com/
>
> Thanks,
> Ryan
>
* Re: [DISCUSSION] kstack offset randomization: bugs and performance
2025-11-17 11:31 ` [DISCUSSION] kstack offset randomization: bugs and performance Ryan Roberts
@ 2025-11-17 16:47 ` Arnd Bergmann
2025-11-17 17:23 ` Ryan Roberts
` (2 more replies)
2025-11-17 20:27 ` Kees Cook
` (2 subsequent siblings)
3 siblings, 3 replies; 27+ messages in thread
From: Arnd Bergmann @ 2025-11-17 16:47 UTC (permalink / raw)
To: Ryan Roberts, Kees Cook, Ard Biesheuvel, Jeremy Linton,
Will Deacon, Catalin Marinas, Mark Rutland
Cc: linux-arm-kernel@lists.infradead.org, Linux Kernel Mailing List,
Jason A. Donenfeld
On Mon, Nov 17, 2025, at 12:31, Ryan Roberts wrote:
> On 17/11/2025 11:30, Ryan Roberts wrote:
>> Hi All,
>>
>> Over the last few years we had a few complaints that syscall performance on
>> arm64 is slower than x86. Most recently, it was observed that a certain Java
>> benchmark that does a lot of fstat and lseek is spending ~10% of it's time in
>> get_random_u16(). Cue a bit of digging, which led me to [1] and also to some new
>> ideas about how performance could be improved.
>> I believe this helps the mean latency significantly without sacrificing any
>> strength. But it doesn't reduce the tail latency because we still have to call
>> into the crng eventually.
>>
>> So here's another idea: Could we use siphash to generate some random bits? We
>> would generate the secret key at boot using the crng. Then generate a 64 bit
>> siphash of (cntvct_el0 ^ tweak) (where tweak increments every time we generate a
>> new hash). As long as the key remains secret, the hash is unpredictable.
>> (perhaps we don't even need the timer value). For every hash we get 64 bits, so
>> that would last for 10 syscalls at 6 bits per call. So we would still have to
>> call siphash every 10 syscalls, so there would still be a tail, but from my
>> experiements, it's much less than the crng:
IIRC, Jason argued against creating another type of prng inside of the
kernel for a special purpose.
As I understand, the other architectures already just use the cycle counter
because that is random enough, but for arm64 the cntvct runs at an
unspecified frequency that is often too low.
However, most future machines are ARMv9.1 or higher and require a 1GHz
timer frequency. I also checked Graviton-3 (Neoverse-V1, ARMv8.4), which
is running its timer at 1.05GHz.
My M2 Mac is running at a slower 24MHz timer. Between two getpid()
syscalls, I see cntvct_el0 advance between 20 and 70 cycles, which
still gives a few bits of entropy but not the six bits we actually
want to use.
How about we just check the timer frequency at boot and patch out the
get_random_u16 call for a cntvct read if it gets updated fast enough?
That would at least take care of the overhead on most new designs and
hopefully on a large subset of the servers that are in active use.
Arnd
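A sketch of that suggestion, selecting the entropy source at boot behind a
static key (the 1GHz threshold, the names and the placement are all
illustrative; as Mark points out in his reply, the reported frequency is only
the effective frequency, not the real resolution):

#include <linux/init.h>
#include <linux/jump_label.h>
#include <linux/random.h>
#include <asm/arch_timer.h>

static DEFINE_STATIC_KEY_FALSE(kstack_rand_from_cntvct);

static int __init kstack_rand_source_init(void)
{
	/* Illustrative threshold: trust the counter at or above 1GHz. */
	if (arch_timer_get_cntfrq() >= 1000U * 1000 * 1000)
		static_branch_enable(&kstack_rand_from_cntvct);
	return 0;
}
early_initcall(kstack_rand_source_init);

static inline u16 kstack_rand16(void)
{
	if (static_branch_likely(&kstack_rand_from_cntvct))
		return (u16)__arch_counter_get_cntvct();
	return get_random_u16();
}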
* Re: [DISCUSSION] kstack offset randomization: bugs and performance
2025-11-17 16:47 ` Arnd Bergmann
@ 2025-11-17 17:23 ` Ryan Roberts
2025-11-17 17:46 ` Mark Rutland
2025-11-18 17:15 ` Jason A. Donenfeld
2 siblings, 0 replies; 27+ messages in thread
From: Ryan Roberts @ 2025-11-17 17:23 UTC (permalink / raw)
To: Arnd Bergmann, Kees Cook, Ard Biesheuvel, Jeremy Linton,
Will Deacon, Catalin Marinas, Mark Rutland
Cc: linux-arm-kernel@lists.infradead.org, Linux Kernel Mailing List,
Jason A. Donenfeld
On 17/11/2025 16:47, Arnd Bergmann wrote:
> On Mon, Nov 17, 2025, at 12:31, Ryan Roberts wrote:
>> On 17/11/2025 11:30, Ryan Roberts wrote:
>>> Hi All,
>>>
>>> Over the last few years we had a few complaints that syscall performance on
>>> arm64 is slower than x86. Most recently, it was observed that a certain Java
>>> benchmark that does a lot of fstat and lseek is spending ~10% of it's time in
>>> get_random_u16(). Cue a bit of digging, which led me to [1] and also to some new
>>> ideas about how performance could be improved.
>
>
>>> I believe this helps the mean latency significantly without sacrificing any
>>> strength. But it doesn't reduce the tail latency because we still have to call
>>> into the crng eventually.
>>>
>>> So here's another idea: Could we use siphash to generate some random bits? We
>>> would generate the secret key at boot using the crng. Then generate a 64 bit
>>> siphash of (cntvct_el0 ^ tweak) (where tweak increments every time we generate a
>>> new hash). As long as the key remains secret, the hash is unpredictable.
>>> (perhaps we don't even need the timer value). For every hash we get 64 bits, so
>>> that would last for 10 syscalls at 6 bits per call. So we would still have to
>>> call siphash every 10 syscalls, so there would still be a tail, but from my
>>> experiements, it's much less than the crng:
>
> IIRC, Jason argued against creating another type of prng inside of the
> kernel for a special purpose.
siphash is already supported by the kernel; siphash_1u64() and friends. So I
think you could argue this is just creative use of an existing crypto primitive? ;-)
>
> As I understand, the other architectures already just use the cycle counter
> because that is random enough, but for arm64 the cntvct runs on an
> unspecified frequency that is often too low.
>
> However, most future machines are ARMv9.1 or higher and require a 1GHz
> timer frequency. I also checked Graviton-3 (Neoverse-V1, ARMv8.4), which
> is running its timer at 1.05GHz.
>
> My M2 Mac is running at a slower 24MHz timer. Between two getpid()
> syscalls, I see cntvct_el0 advance between 20 and 70 cycles, which
> still gives a few bits of entropy but not the six bits we actually
> want to use.
>
> How about we just check the timer frequency at boot and patch out the
> get_random_u16 call for a cntvct read if it gets updated fast enough?
> That would at least take care of the overhead on most new designs and
> hopefully on a large subset of the servers that are in active use.
That certainly sounds simple and reasonable to me (as a non-security guy). My
earlier optimizations would still help the non-conformant systems with mean latency.
We would then end up where arm64 can be just as performant as the other arches
for the same level of security. If that level of security is deemed
insufficient, then a new option to use get_random_u16() (or siphash) could be
introduced, which would be arch-independent.
Thanks,
Ryan
>
> Arnd
* Re: [DISCUSSION] kstack offset randomization: bugs and performance
2025-11-17 16:47 ` Arnd Bergmann
2025-11-17 17:23 ` Ryan Roberts
@ 2025-11-17 17:46 ` Mark Rutland
2025-11-17 23:04 ` Arnd Bergmann
2025-11-18 17:15 ` Jason A. Donenfeld
2 siblings, 1 reply; 27+ messages in thread
From: Mark Rutland @ 2025-11-17 17:46 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Ryan Roberts, Kees Cook, Ard Biesheuvel, Jeremy Linton,
Will Deacon, Catalin Marinas,
linux-arm-kernel@lists.infradead.org, Linux Kernel Mailing List,
Jason A. Donenfeld
On Mon, Nov 17, 2025 at 05:47:05PM +0100, Arnd Bergmann wrote:
> As I understand, the other architectures already just use the cycle counter
> because that is random enough, but for arm64 the cntvct runs on an
> unspecified frequency that is often too low.
>
> However, most future machines are ARMv9.1 or higher and require a 1GHz
> timer frequency. I also checked Graviton-3 (Neoverse-V1, ARMv8.4), which
> is running its timer at 1.05GHz.
Note that the 1GHz requirement is for the *effective frequency*, not the
underlying counter resolution. The 1GHz requirement means that the
counter must increment by 10^9 per second, but it doesn't mean that it
actually increments by 1 every 1 ns.
See ARM DDI 0487 L.b, page D12-6793, which says:
| Counter resolution
|
| The counter resolution is a representation of how frequently the
| counter is updated.
|
| For example, a counter might have an effective frequency of 1GHz, but
| the actual clock runs at 125MHz and therefore the counter resolution
| is 125Mhz.
|
| From Armv8.6, Arm recommends the counter resolution is not less than
| 50MHz in normal running operation.
... and note that (unfortunately) the latter point is a recommendation,
not a requirement.
> My M2 Mac is running at a slower 24MHz timer. Between two getpid()
> syscalls, I see cntvct_el0 advance between 20 and 70 cycles, which
> still gives a few bits of entropy but not the six bits we actually
> want to use.
>
> How about we just check the timer frequency at boot and patch out the
> get_random_u16 call for a cntvct read if it gets updated fast enough?
> That would at least take care of the overhead on most new designs and
> hopefully on a large subset of the servers that are in active use.
As above, we cannot rely on the frequency for this, and I don't think we
should use the timer in this way.
To be clear, my objection here is purely about the timer. I don't want
us to rely upon something that doesn't actually provide the guarantee we
need. I'm more than happy with the mechanism for randomization being
changed.
Mark.
* Re: [DISCUSSION] kstack offset randomization: bugs and performance
2025-11-17 11:31 ` [DISCUSSION] kstack offset randomization: bugs and performance Ryan Roberts
2025-11-17 16:47 ` Arnd Bergmann
@ 2025-11-17 20:27 ` Kees Cook
2025-11-18 10:28 ` Ryan Roberts
2025-11-18 11:05 ` Mark Rutland
2025-11-17 20:56 ` Jeremy Linton
2025-11-24 14:36 ` Will Deacon
3 siblings, 2 replies; 27+ messages in thread
From: Kees Cook @ 2025-11-17 20:27 UTC (permalink / raw)
To: Ryan Roberts
Cc: Arnd Bergmann, Ard Biesheuvel, Jeremy Linton, Will Deacon,
Catalin Marinas, Mark Rutland,
linux-arm-kernel@lists.infradead.org, Linux Kernel Mailing List
On Mon, Nov 17, 2025 at 11:31:22AM +0000, Ryan Roberts wrote:
> Sorry; forgot to add Mark and the lists!
>
>
> On 17/11/2025 11:30, Ryan Roberts wrote:
> > Hi All,
> >
> > Over the last few years we had a few complaints that syscall performance on
> > arm64 is slower than x86. Most recently, it was observed that a certain Java
> > benchmark that does a lot of fstat and lseek is spending ~10% of it's time in
> > get_random_u16(). Cue a bit of digging, which led me to [1] and also to some new
> > ideas about how performance could be improved.
> >
> > But I'll get to the performance angle in a bit. First, I want to discuss some
> > bugs that I believe I have uncovered during code review...
> >
> >
> > Bug 1: We have the following pattern:
> >
> > add_random_kstack_offset()
> > enable_interrupts_and_preemption()
> > do_syscall()
> > disable_interrupts_and_preemption()
> > choose_random_kstack_offset(random)
> >
> > Where add_random_kstack_offset() adds an offset to the stack that was chosen by
> > a previous call to choose_random_kstack_offset() and stored in a per-cpu
> > variable. But since preemption is enabled during the syscall, surely an attacker
> > could defeat this by arranging for the thread to be preempted or migrated while
> > executing the syscall? That way the new offset is calculated for a different CPU
> > and a subsequent syscall on the original CPU will use the original offset?
Oh, interesting -- I hadn't considered that the entire thread would be
moved between CPUs while it is *IN* the syscall. Yeah, that would
effectively cause the offset to never change for the moved-from CPU. Ew.
> > I think we could just pass the random seed to add_random_kstack_offset() so that
> > we consume the old and buffer the new atomically? We would still buffer it
> > across syscalls to avoid the guessability issue that's documented. Then
> > choose_random_kstack_offset() could be removed. Or we could store
> > per-task_struct given it is only 32-bits?
I had wanted to avoid both growing task_struct and tying the
randomness to a given task -- then unpredictability could be higher
(barring the bug above), and could only be reduced to per-thread by
pinning a thread exclusively to a single CPU.
> > Bug 2: add_random_kstack_offset() and choose_random_kstack_offset() both
> > document their requirement to be called with interrupts and preemption disabled.
> > They use raw_cpu_*(), which require this. But on arm64, they are called from
> > invoke_syscall(), where interrupts and preemption are _enabled_. In practice, I
> > don't think this will cause functional harm for arm64's implementations of
> > raw_cpu_*(), but means that it's possible that the wrong per-cpu structure is
> > being referred to. Perhaps there is a way for user code to exploit this to
> > defeat the purpose of the feature.
> >
> > This should be straightforward to fix; if we take the task_struct approach for
> > bug 1, then that would also fix this issue too because the requirement to be in
> > atomic context goes away. Otherwsise it can be moved earlier in the callchain,
> > before interrupts are enabled.
I can't speak to these internals, just that I'd hope to avoid forcing
the randomness down to the thread-level.
> >
> >
> > Then we get to the performance aspect...
> >
> > arm64 uses get_random_u16() to get 16 bits from a per-cpu entropy buffer that
> > originally came from the crng. get_random_u16() does
> > local_lock_irqsave()/local_unlock_irqrestore() inside every call (both the
> > fastpath and the slow path). It turns out that this locking/unlocking accounts
> > for 30%-50% of the total cost of kstack offset randomization. By introducing a
> > new raw_try_get_random_uX() helper that's called from a context where irqs are
> > disabled, I can eliminate that cost. (I also plan to dig into exactly why it's
> > costing so much).
> >
> > Furthermore, given we are actually only using 6 bits of entropy per syscall, we
> > could instead just request a u8 instead of a u16 and only throw away 2 bits
> > instead of 10 bits. This means we drain the entropy buffer half as quickly and
> > make half as many slow calls into the crng:
> >
> > +-----------+---------------+--------------+-----------------+-----------------+
> > | Benchmark | randomize=off | randomize=on | + no local_lock | + get_random_u8 |
> > | | (baseline) | | | |
> > +===========+===============+==============+=================+=================+
> > | getpid | 0.19 | (R) -11.43% | (R) -8.41% | (R) -5.97% |
> > +-----------+---------------+--------------+-----------------+-----------------+
> > | getppid | 0.19 | (R) -13.81% | (R) -7.83% | (R) -6.14% |
> > +-----------+---------------+--------------+-----------------+-----------------+
> > | invalid | 0.18 | (R) -12.22% | (R) -5.55% | (R) -3.70% |
> > +-----------+---------------+--------------+-----------------+-----------------+
> >
> > I expect we could even choose to re-buffer and save those 2 bits so we call the
> > slow path even less often.
> >
> > I believe this helps the mean latency significantly without sacrificing any
> > strength. But it doesn't reduce the tail latency because we still have to call
> > into the crng eventually.
> >
> > So here's another idea: Could we use siphash to generate some random bits? We
> > would generate the secret key at boot using the crng. Then generate a 64 bit
> > siphash of (cntvct_el0 ^ tweak) (where tweak increments every time we generate a
> > new hash). As long as the key remains secret, the hash is unpredictable.
> > (perhaps we don't even need the timer value). For every hash we get 64 bits, so
> > that would last for 10 syscalls at 6 bits per call. So we would still have to
> > call siphash every 10 syscalls, so there would still be a tail, but from my
> > experiements, it's much less than the crng:
> >
> > +-----------+---------------+--------------+-----------------+-----------------+
> > | Benchmark | randomize=off | randomize=on | siphash | Jeremy's prng |
> > | | (baseline) | | | |
> > +===========+===============+==============+=================+=================+
> > | getpid | 0.19 | (R) -11.43% | (R) -5.74% | (R) -2.06% |
> > +-----------+---------------+--------------+-----------------+-----------------+
> > | getppid | 0.19 | (R) -13.81% | (R) -3.39% | (R) -2.59% |
> > +-----------+---------------+--------------+-----------------+-----------------+
> > | invalid | 0.18 | (R) -12.22% | (R) -2.43% | -1.31% |
> > +-----------+---------------+--------------+-----------------+-----------------+
> >
> > Could this give us a middle ground between strong-crng and
> > weak-timestamp-counter? Perhaps the main issue is that we need to store the
> > secret key for a long period?
All these ideas seem fine to me. I agree about only needing 6 bits, so a
u8 is good. Since we already use siphash extensively, that also seems
fine, though it'd be nice if we could find a solution that avoided
intermittent reseeding.
> > Anyway, I plan to work up a series with the bugfixes and performance
> > improvements. I'll add the siphash approach as an experimental addition and get
> > some more detailed numbers for all the options. But wanted to raise it all here
> > first to get any early feedback.
Thanks for tackling this!
-Kees
--
Kees Cook
* Re: [DISCUSSION] kstack offset randomization: bugs and performance
2025-11-17 11:31 ` [DISCUSSION] kstack offset randomization: bugs and performance Ryan Roberts
2025-11-17 16:47 ` Arnd Bergmann
2025-11-17 20:27 ` Kees Cook
@ 2025-11-17 20:56 ` Jeremy Linton
2025-11-18 11:05 ` Ryan Roberts
2025-11-24 14:36 ` Will Deacon
3 siblings, 1 reply; 27+ messages in thread
From: Jeremy Linton @ 2025-11-17 20:56 UTC (permalink / raw)
To: Ryan Roberts, Kees Cook, Arnd Bergmann, Ard Biesheuvel,
Will Deacon, Catalin Marinas, Mark Rutland
Cc: linux-arm-kernel@lists.infradead.org, Linux Kernel Mailing List
Hi,
On 11/17/25 5:31 AM, Ryan Roberts wrote:
> Sorry; forgot to add Mark and the lists!
>
>
> On 17/11/2025 11:30, Ryan Roberts wrote:
>> Hi All,
>>
>> Over the last few years we had a few complaints that syscall performance on
>> arm64 is slower than x86. Most recently, it was observed that a certain Java
>> benchmark that does a lot of fstat and lseek is spending ~10% of it's time in
>> get_random_u16(). Cue a bit of digging, which led me to [1] and also to some new
>> ideas about how performance could be improved.
>>
>> But I'll get to the performance angle in a bit. First, I want to discuss some
>> bugs that I believe I have uncovered during code review...
>>
>>
>> Bug 1: We have the following pattern:
>>
>> add_random_kstack_offset()
>> enable_interrupts_and_preemption()
>> do_syscall()
>> disable_interrupts_and_preemption()
>> choose_random_kstack_offset(random)
>>
>> Where add_random_kstack_offset() adds an offset to the stack that was chosen by
>> a previous call to choose_random_kstack_offset() and stored in a per-cpu
>> variable. But since preemption is enabled during the syscall, surely an attacker
>> could defeat this by arranging for the thread to be preempted or migrated while
>> executing the syscall? That way the new offset is calculated for a different CPU
>> and a subsequent syscall on the original CPU will use the original offset?
>>
>> I think we could just pass the random seed to add_random_kstack_offset() so that
>> we consume the old and buffer the new atomically? We would still buffer it
>> across syscalls to avoid the guessability issue that's documented. Then
>> choose_random_kstack_offset() could be removed. Or we could store
>> per-task_struct given it is only 32-bits?
>>
>>
>> Bug 2: add_random_kstack_offset() and choose_random_kstack_offset() both
>> document their requirement to be called with interrupts and preemption disabled.
>> They use raw_cpu_*(), which require this. But on arm64, they are called from
>> invoke_syscall(), where interrupts and preemption are _enabled_. In practice, I
>> don't think this will cause functional harm for arm64's implementations of
>> raw_cpu_*(), but means that it's possible that the wrong per-cpu structure is
>> being referred to. Perhaps there is a way for user code to exploit this to
>> defeat the purpose of the feature.
>>
>> This should be straightforward to fix; if we take the task_struct approach for
>> bug 1, then that would also fix this issue too because the requirement to be in
>> atomic context goes away. Otherwsise it can be moved earlier in the callchain,
>> before interrupts are enabled.
>>
>>
>> Then we get to the performance aspect...
>>
>> arm64 uses get_random_u16() to get 16 bits from a per-cpu entropy buffer that
>> originally came from the crng. get_random_u16() does
>> local_lock_irqsave()/local_unlock_irqrestore() inside every call (both the
>> fastpath and the slow path). It turns out that this locking/unlocking accounts
>> for 30%-50% of the total cost of kstack offset randomization. By introducing a
>> new raw_try_get_random_uX() helper that's called from a context where irqs are
>> disabled, I can eliminate that cost. (I also plan to dig into exactly why it's
>> costing so much).
IIRC it was because there is another thread filling the batch of random numbers.
>>
>> Furthermore, given we are actually only using 6 bits of entropy per syscall, we
>> could instead just request a u8 instead of a u16 and only throw away 2 bits
>> instead of 10 bits. This means we drain the entropy buffer half as quickly and
>> make half as many slow calls into the crng:
>>
>> +-----------+---------------+--------------+-----------------+-----------------+
>> | Benchmark | randomize=off | randomize=on | + no local_lock | + get_random_u8 |
>> | | (baseline) | | | |
>> +===========+===============+==============+=================+=================+
>> | getpid | 0.19 | (R) -11.43% | (R) -8.41% | (R) -5.97% |
>> +-----------+---------------+--------------+-----------------+-----------------+
>> | getppid | 0.19 | (R) -13.81% | (R) -7.83% | (R) -6.14% |
>> +-----------+---------------+--------------+-----------------+-----------------+
>> | invalid | 0.18 | (R) -12.22% | (R) -5.55% | (R) -3.70% |
>> +-----------+---------------+--------------+-----------------+-----------------+
>>
>> I expect we could even choose to re-buffer and save those 2 bits so we call the
>> slow path even less often.
>>
>> I believe this helps the mean latency significantly without sacrificing any
>> strength. But it doesn't reduce the tail latency because we still have to call
>> into the crng eventually.
>>
>> So here's another idea: Could we use siphash to generate some random bits? We
>> would generate the secret key at boot using the crng. Then generate a 64 bit
>> siphash of (cntvct_el0 ^ tweak) (where tweak increments every time we generate a
>> new hash). As long as the key remains secret, the hash is unpredictable.
>> (perhaps we don't even need the timer value). For every hash we get 64 bits, so
>> that would last for 10 syscalls at 6 bits per call. So we would still have to
>> call siphash every 10 syscalls, so there would still be a tail, but from my
>> experiements, it's much less than the crng:
>>
>> +-----------+---------------+--------------+-----------------+-----------------+
>> | Benchmark | randomize=off | randomize=on | siphash | Jeremy's prng |
>> | | (baseline) | | | |
>> +===========+===============+==============+=================+=================+
>> | getpid | 0.19 | (R) -11.43% | (R) -5.74% | (R) -2.06% |
>> +-----------+---------------+--------------+-----------------+-----------------+
>> | getppid | 0.19 | (R) -13.81% | (R) -3.39% | (R) -2.59% |
>> +-----------+---------------+--------------+-----------------+-----------------+
>> | invalid | 0.18 | (R) -12.22% | (R) -2.43% | -1.31% |
>> +-----------+---------------+--------------+-----------------+-----------------+
>>
>> Could this give us a middle ground between strong-crng and
>> weak-timestamp-counter? Perhaps the main issue is that we need to store the
>> secret key for a long period?
This is sorta fundamental with (C)PRNGs. There is an estimated
cycle (2^112, it says, for prandom), and the more state you keep hidden to
increase that value, the slower the generation becomes.
If you're not worried about tail latency, then just use prandom per the
patch below and xor in some data from the get_random_* calls into the local
state on some interval, say every million calls. One might argue that a
couple of bits of a counter on some interval would be enough if the tsc is
good enough for x86, but frankly any counter must be independent from
the core cpu clocks to have any hope of those values being
unpredictable. Or just reseed it only on a subset of syscalls known to
have a lot of variance, or some other per-cpu work; that at least will
hide when it's being reseeded from someone measuring the latency variance.
I guess it depends on how strong you believe the existing system is. My
perspective is that reseeding it is probably a waste of time since it's
being quantized down to 2^5 bits anyway.
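As a minimal sketch of the prandom-plus-occasional-reseed shape described
above (per-cpu prandom state periodically refreshed from the crng; the
interval is arbitrary, prandom_seed_state() is used rather than a raw xor so
the taus113 state words stay in their required ranges, and boot-time seeding
of each CPU's state is omitted):

#include <linux/percpu.h>
#include <linux/prandom.h>
#include <linux/random.h>

#define KSTACK_RESEED_INTERVAL	(1U << 20)	/* roughly every million calls */

struct kstack_prng {
	struct rnd_state state;
	u32 calls;
};
static DEFINE_PER_CPU(struct kstack_prng, kstack_prng);

/* Caller is assumed to have preemption/IRQs disabled. */
static u32 kstack_prng_u32(void)
{
	struct kstack_prng *p = raw_cpu_ptr(&kstack_prng);

	if (++p->calls % KSTACK_RESEED_INTERVAL == 0)
		prandom_seed_state(&p->state, get_random_u64());

	return prandom_u32_state(&p->state);
}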
>>
>>
>> Anyway, I plan to work up a series with the bugfixes and performance
>> improvements. I'll add the siphash approach as an experimental addition and get
>> some more detailed numbers for all the options. But wanted to raise it all here
>> first to get any early feedback.
>>
>>
>> [1]
>> https://lore.kernel.org/linux-arm-kernel/20240305221824.3300322-1-jeremy.linton@arm.com/
>>
>> Thanks,
>> Ryan
>>
>
* Re: [DISCUSSION] kstack offset randomization: bugs and performance
2025-11-17 17:46 ` Mark Rutland
@ 2025-11-17 23:04 ` Arnd Bergmann
0 siblings, 0 replies; 27+ messages in thread
From: Arnd Bergmann @ 2025-11-17 23:04 UTC (permalink / raw)
To: Mark Rutland
Cc: Ryan Roberts, Kees Cook, Ard Biesheuvel, Jeremy Linton,
Will Deacon, Catalin Marinas,
linux-arm-kernel@lists.infradead.org, Linux Kernel Mailing List,
Jason A . Donenfeld
On Mon, Nov 17, 2025, at 18:46, Mark Rutland wrote:
> On Mon, Nov 17, 2025 at 05:47:05PM +0100, Arnd Bergmann wrote:
>> As I understand, the other architectures already just use the cycle counter
>> because that is random enough, but for arm64 the cntvct runs on an
>> unspecified frequency that is often too low.
>>
>> However, most future machines are ARMv9.1 or higher and require a 1GHz
>> timer frequency. I also checked Graviton-3 (Neoverse-V1, ARMv8.4), which
>> is running its timer at 1.05GHz.
>
> Note that 1GHz requirement is for the *effective frequency*, not the
> underlying counter resolution. The 1GHz requirement means that the
> counter must increment by 10^9 per second, but it doesn't mean that it
> actually increments by 1 every 1 ns.
>
> See ARM DDI 0487 L.b, page D12-6793, which says:
>
> | Counter resolution
> |
> | The counter resolution is a representation of how frequently the
> | counter is updated.
> |
> | For example, a counter might have an effective frequency of 1GHz, but
> | the actual clock runs at 125MHz and therefore the counter resolution
> | is 125Mhz.
> |
> | From Armv8.6, Arm recommends the counter resolution is not less than
> | 50MHz in normal running operation.
>
> ... and note that (unfortunately) that latter point is a recommendation,
> not a requirement.
Right, I see. As far as I can tell, the actual resolution on s390,
x86 and powerpc is also lower than the effective frequency, but
it does indeed seem to be high enough that there are usually more
than six bits of entropy per syscall there.
Arnd
* Re: [DISCUSSION] kstack offset randomization: bugs and performance
2025-11-17 20:27 ` Kees Cook
@ 2025-11-18 10:28 ` Ryan Roberts
2025-11-18 11:25 ` Mark Rutland
2025-11-18 11:05 ` Mark Rutland
1 sibling, 1 reply; 27+ messages in thread
From: Ryan Roberts @ 2025-11-18 10:28 UTC (permalink / raw)
To: Kees Cook
Cc: Arnd Bergmann, Ard Biesheuvel, Jeremy Linton, Will Deacon,
Catalin Marinas, Mark Rutland,
linux-arm-kernel@lists.infradead.org, Linux Kernel Mailing List
On 17/11/2025 20:27, Kees Cook wrote:
> On Mon, Nov 17, 2025 at 11:31:22AM +0000, Ryan Roberts wrote:
>> Sorry; forgot to add Mark and the lists!
>>
>>
>> On 17/11/2025 11:30, Ryan Roberts wrote:
>>> Hi All,
>>>
>>> Over the last few years we had a few complaints that syscall performance on
>>> arm64 is slower than x86. Most recently, it was observed that a certain Java
>>> benchmark that does a lot of fstat and lseek is spending ~10% of it's time in
>>> get_random_u16(). Cue a bit of digging, which led me to [1] and also to some new
>>> ideas about how performance could be improved.
>>>
>>> But I'll get to the performance angle in a bit. First, I want to discuss some
>>> bugs that I believe I have uncovered during code review...
>>>
>>>
>>> Bug 1: We have the following pattern:
>>>
>>> add_random_kstack_offset()
>>> enable_interrupts_and_preemption()
>>> do_syscall()
>>> disable_interrupts_and_preemption()
>>> choose_random_kstack_offset(random)
>>>
>>> Where add_random_kstack_offset() adds an offset to the stack that was chosen by
>>> a previous call to choose_random_kstack_offset() and stored in a per-cpu
>>> variable. But since preemption is enabled during the syscall, surely an attacker
>>> could defeat this by arranging for the thread to be preempted or migrated while
>>> executing the syscall? That way the new offset is calculated for a different CPU
>>> and a subsequent syscall on the original CPU will use the original offset?
>
> Oh, interesting -- I hadn't considered that the entire thread would be
> moved between CPUs while it is *IN* the syscall. Yeah, that would
> effectively cause the offset to never change for the moved-from CPU. Ew.
>
>>> I think we could just pass the random seed to add_random_kstack_offset() so that
>>> we consume the old and buffer the new atomically? We would still buffer it
>>> across syscalls to avoid the guessability issue that's documented. Then
>>> choose_random_kstack_offset() could be removed. Or we could store
>>> per-task_struct given it is only 32-bits?
>
> I had wanted to avoid both growing task_struct and to avoid tying the
> randomness to a given task -- then unpredictability could be higher
> (barring the bug above), and could be only reduced to per-thread by
> pinning a thread exclusively to a single CPU.
OK, so if we want to keep it per-cpu and close this gap, there are 2 options I
can think of: either get rid of choose_random_kstack_offset() and just pass the
random value to add_random_kstack_offset(), or have a hook in the context switch
path for the case where the thread gets preempted while in the syscall.
Personally I think the first option is sufficient and much simpler.
The original rationale for a separate choose_random_kstack_offset() at the end
of the syscall is described as:
* This position in the syscall flow is done to
* frustrate attacks from userspace attempting to learn the next offset:
* - Maximize the timing uncertainty visible from userspace: if the
* offset is chosen at syscall entry, userspace has much more control
* over the timing between choosing offsets. "How long will we be in
* kernel mode?" tends to be more difficult to predict than "how long
* will we be in user mode?"
* - Reduce the lifetime of the new offset sitting in memory during
* kernel mode execution. Exposure of "thread-local" memory content
* (e.g. current, percpu, etc) tends to be easier than arbitrary
* location memory exposure.
I'm not totally convinced by the first argument; for arches that use the tsc,
sampling the tsc at syscall entry would mean that userspace can figure out the
random value that will be used for syscall N by sampling the tsc and adding a
bit just before calling syscall N. Sampling the tsc at syscall exit would mean
that userspace can figure out the random value that will be used for syscall N
by sampling the tsc and subtracting a bit just after syscall N-1 returns. I
don't really see any difference in protection?
If you're trying to force the kernel-sampled tsc to be a specific value, then for
the sample-on-exit case, userspace can just make a syscall with an invalid id as
its syscall N-1; in that case the duration between entry and exit is tiny
and fixed, so it's still pretty simple to force the value.
So what do you think of this approach? :
#define add_random_kstack_offset(rand) do {				\
	if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,\
				&randomize_kstack_offset)) {		\
		u32 offset = raw_cpu_read(kstack_offset);		\
		u8 *ptr;						\
									\
		offset = ror32(offset, 5) ^ (rand);			\
		raw_cpu_write(kstack_offset, offset);			\
		ptr = __kstack_alloca(KSTACK_OFFSET_MAX(offset));	\
		/* Keep allocation even after "ptr" loses scope. */	\
		asm volatile("" :: "r"(ptr) : "memory");		\
	}								\
} while (0)
This ignores "Maximize the timing uncertainty" (but that's ok because the
current version doesn't really do that either), but strengthens "Reduce the
lifetime of the new offset sitting in memory".
>
>>> Bug 2: add_random_kstack_offset() and choose_random_kstack_offset() both
>>> document their requirement to be called with interrupts and preemption disabled.
>>> They use raw_cpu_*(), which require this. But on arm64, they are called from
>>> invoke_syscall(), where interrupts and preemption are _enabled_. In practice, I
>>> don't think this will cause functional harm for arm64's implementations of
>>> raw_cpu_*(), but means that it's possible that the wrong per-cpu structure is
>>> being referred to. Perhaps there is a way for user code to exploit this to
>>> defeat the purpose of the feature.
>>>
>>> This should be straightforward to fix; if we take the task_struct approach for
>>> bug 1, then that would also fix this issue too because the requirement to be in
>>> atomic context goes away. Otherwsise it can be moved earlier in the callchain,
>>> before interrupts are enabled.
>
> I can't speak to these internals, just that I'd hope to avoid forcing
> the randomness down to the thread-level.
ACK.
>
>>>
>>>
>>> Then we get to the performance aspect...
>>>
>>> arm64 uses get_random_u16() to get 16 bits from a per-cpu entropy buffer that
>>> originally came from the crng. get_random_u16() does
>>> local_lock_irqsave()/local_unlock_irqrestore() inside every call (both the
>>> fastpath and the slow path). It turns out that this locking/unlocking accounts
>>> for 30%-50% of the total cost of kstack offset randomization. By introducing a
>>> new raw_try_get_random_uX() helper that's called from a context where irqs are
>>> disabled, I can eliminate that cost. (I also plan to dig into exactly why it's
>>> costing so much).
>>>
>>> Furthermore, given we are actually only using 6 bits of entropy per syscall, we
>>> could instead just request a u8 instead of a u16 and only throw away 2 bits
>>> instead of 10 bits. This means we drain the entropy buffer half as quickly and
>>> make half as many slow calls into the crng:
>>>
>>> +-----------+---------------+--------------+-----------------+-----------------+
>>> | Benchmark | randomize=off | randomize=on | + no local_lock | + get_random_u8 |
>>> | | (baseline) | | | |
>>> +===========+===============+==============+=================+=================+
>>> | getpid | 0.19 | (R) -11.43% | (R) -8.41% | (R) -5.97% |
>>> +-----------+---------------+--------------+-----------------+-----------------+
>>> | getppid | 0.19 | (R) -13.81% | (R) -7.83% | (R) -6.14% |
>>> +-----------+---------------+--------------+-----------------+-----------------+
>>> | invalid | 0.18 | (R) -12.22% | (R) -5.55% | (R) -3.70% |
>>> +-----------+---------------+--------------+-----------------+-----------------+
>>>
>>> I expect we could even choose to re-buffer and save those 2 bits so we call the
>>> slow path even less often.
>>>
>>> I believe this helps the mean latency significantly without sacrificing any
>>> strength. But it doesn't reduce the tail latency because we still have to call
>>> into the crng eventually.
>>>
>>> So here's another idea: Could we use siphash to generate some random bits? We
>>> would generate the secret key at boot using the crng. Then generate a 64 bit
>>> siphash of (cntvct_el0 ^ tweak) (where tweak increments every time we generate a
>>> new hash). As long as the key remains secret, the hash is unpredictable.
>>> (perhaps we don't even need the timer value). For every hash we get 64 bits, so
>>> that would last for 10 syscalls at 6 bits per call. So we would still have to
>>> call siphash every 10 syscalls, so there would still be a tail, but from my
>>> experiements, it's much less than the crng:
>>>
>>> +-----------+---------------+--------------+-----------------+-----------------+
>>> | Benchmark | randomize=off | randomize=on | siphash | Jeremy's prng |
>>> | | (baseline) | | | |
>>> +===========+===============+==============+=================+=================+
>>> | getpid | 0.19 | (R) -11.43% | (R) -5.74% | (R) -2.06% |
>>> +-----------+---------------+--------------+-----------------+-----------------+
>>> | getppid | 0.19 | (R) -13.81% | (R) -3.39% | (R) -2.59% |
>>> +-----------+---------------+--------------+-----------------+-----------------+
>>> | invalid | 0.18 | (R) -12.22% | (R) -2.43% | -1.31% |
>>> +-----------+---------------+--------------+-----------------+-----------------+
>>>
>>> Could this give us a middle ground between strong-crng and
>>> weak-timestamp-counter? Perhaps the main issue is that we need to store the
>>> secret key for a long period?
>
> All these ideas seem fine to me. I agree about only needed 6 bits, so a
> u8 is good. Since we already use siphash extensively, that also seems
> fine, though it'd be nice if we could find a solution that avoided
> intermittent reseeding.
ACK.
Thanks,
Ryan
>
>>> Anyway, I plan to work up a series with the bugfixes and performance
>>> improvements. I'll add the siphash approach as an experimental addition and get
>>> some more detailed numbers for all the options. But wanted to raise it all here
>>> first to get any early feedback.
>
> Thanks for tackling this!
>
> -Kees
>
* Re: [DISCUSSION] kstack offset randomization: bugs and performance
2025-11-17 20:27 ` Kees Cook
2025-11-18 10:28 ` Ryan Roberts
@ 2025-11-18 11:05 ` Mark Rutland
1 sibling, 0 replies; 27+ messages in thread
From: Mark Rutland @ 2025-11-18 11:05 UTC (permalink / raw)
To: Kees Cook
Cc: Ryan Roberts, Arnd Bergmann, Ard Biesheuvel, Jeremy Linton,
Will Deacon, Catalin Marinas,
linux-arm-kernel@lists.infradead.org, Linux Kernel Mailing List
On Mon, Nov 17, 2025 at 12:27:36PM -0800, Kees Cook wrote:
> On Mon, Nov 17, 2025 at 11:31:22AM +0000, Ryan Roberts wrote:
> > On 17/11/2025 11:30, Ryan Roberts wrote:
> > > I think we could just pass the random seed to add_random_kstack_offset() so that
> > > we consume the old and buffer the new atomically? We would still buffer it
> > > across syscalls to avoid the guessability issue that's documented. Then
> > > choose_random_kstack_offset() could be removed. Or we could store
> > > per-task_struct given it is only 32-bits?
>
> I had wanted to avoid both growing task_struct and to avoid tying the
> randomness to a given task -- then unpredictability could be higher
> (barring the bug above), and could be only reduced to per-thread by
> pinning a thread exclusively to a single CPU.
I appreciate the rationale of not growing task struct, but I think the
"unpredictability could be higher" rationale doesn't hold any water.
If an adversary has userspace code execution, they can pin the thread to
a CPU. Regardless of whether they have that capability, in common cases
the thread isn't going to migrate for a number of syscalls if those are
non-blocking. The unpredictability gained by CPU migrations is slim to
none.
From a threat model perspective, we need this to be unpredictable
*regardless* of any pinning.
From a quick look at task struct, it appears that (on 64-bit platforms)
there are several opportunities for rearranging fields to pack things
together and save bytes. I think that if the state is sufficiently
small, we don't actually have to grow task struct. Practically speaking,
if this is small enough no-one will notice.
> > > Bug 2: add_random_kstack_offset() and choose_random_kstack_offset() both
> > > document their requirement to be called with interrupts and preemption disabled.
> > > They use raw_cpu_*(), which require this. But on arm64, they are called from
> > > invoke_syscall(), where interrupts and preemption are _enabled_. In practice, I
> > > don't think this will cause functional harm for arm64's implementations of
> > > raw_cpu_*(), but means that it's possible that the wrong per-cpu structure is
> > > being referred to. Perhaps there is a way for user code to exploit this to
> > > defeat the purpose of the feature.
> > >
> > > This should be straightforward to fix; if we take the task_struct approach for
> > > bug 1, then that would also fix this issue too because the requirement to be in
> > > atomic context goes away. Otherwsise it can be moved earlier in the callchain,
> > > before interrupts are enabled.
>
> I can't speak to these internals, just that I'd hope to avoid forcing
> the randomness down to the thread-level.
From my perspective this would be *much* nicer as a per-thread property,
since it wouldn't have any problematic interactions with preemption, and
wouldn't force awkward requirements on architecture code.
i.e. I completely disagree -- I think this would be much better per
thread.
Using siphash per thread sounds fine to me.
Mark.
* Re: [DISCUSSION] kstack offset randomization: bugs and performance
2025-11-17 20:56 ` Jeremy Linton
@ 2025-11-18 11:05 ` Ryan Roberts
0 siblings, 0 replies; 27+ messages in thread
From: Ryan Roberts @ 2025-11-18 11:05 UTC (permalink / raw)
To: Jeremy Linton, Kees Cook, Arnd Bergmann, Ard Biesheuvel,
Will Deacon, Catalin Marinas, Mark Rutland
Cc: linux-arm-kernel@lists.infradead.org, Linux Kernel Mailing List
On 17/11/2025 20:56, Jeremy Linton wrote:
> Hi,
>
> On 11/17/25 5:31 AM, Ryan Roberts wrote:
>> Sorry; forgot to add Mark and the lists!
>>
>>
>> On 17/11/2025 11:30, Ryan Roberts wrote:
>>> Hi All,
>>>
>>> Over the last few years we had a few complaints that syscall performance on
>>> arm64 is slower than x86. Most recently, it was observed that a certain Java
>>> benchmark that does a lot of fstat and lseek is spending ~10% of it's time in
>>> get_random_u16(). Cue a bit of digging, which led me to [1] and also to some new
>>> ideas about how performance could be improved.
>>>
>>> But I'll get to the performance angle in a bit. First, I want to discuss some
>>> bugs that I believe I have uncovered during code review...
>>>
>>>
>>> Bug 1: We have the following pattern:
>>>
>>> add_random_kstack_offset()
>>> enable_interrupts_and_preemption()
>>> do_syscall()
>>> disable_interrupts_and_preemption()
>>> choose_random_kstack_offset(random)
>>>
>>> Where add_random_kstack_offset() adds an offset to the stack that was chosen by
>>> a previous call to choose_random_kstack_offset() and stored in a per-cpu
>>> variable. But since preemption is enabled during the syscall, surely an attacker
>>> could defeat this by arranging for the thread to be preempted or migrated while
>>> executing the syscall? That way the new offset is calculated for a different CPU
>>> and a subsequent syscall on the original CPU will use the original offset?
>>>
>>> I think we could just pass the random seed to add_random_kstack_offset() so that
>>> we consume the old and buffer the new atomically? We would still buffer it
>>> across syscalls to avoid the guessability issue that's documented. Then
>>> choose_random_kstack_offset() could be removed. Or we could store
>>> per-task_struct given it is only 32-bits?
>>>
>>>
>>> Bug 2: add_random_kstack_offset() and choose_random_kstack_offset() both
>>> document their requirement to be called with interrupts and preemption disabled.
>>> They use raw_cpu_*(), which require this. But on arm64, they are called from
>>> invoke_syscall(), where interrupts and preemption are _enabled_. In practice, I
>>> don't think this will cause functional harm for arm64's implementations of
>>> raw_cpu_*(), but means that it's possible that the wrong per-cpu structure is
>>> being referred to. Perhaps there is a way for user code to exploit this to
>>> defeat the purpose of the feature.
>>>
>>> This should be straightforward to fix; if we take the task_struct approach for
>>> bug 1, then that would also fix this issue too because the requirement to be in
>>> atomic context goes away. Otherwsise it can be moved earlier in the callchain,
>>> before interrupts are enabled.
>>>
>>>
>>> Then we get to the performance aspect...
>>>
>>> arm64 uses get_random_u16() to get 16 bits from a per-cpu entropy buffer that
>>> originally came from the crng. get_random_u16() does
>>> local_lock_irqsave()/local_unlock_irqrestore() inside every call (both the
>>> fastpath and the slow path). It turns out that this locking/unlocking accounts
>>> for 30%-50% of the total cost of kstack offset randomization. By introducing a
>>> new raw_try_get_random_uX() helper that's called from a context where irqs are
>>> disabled, I can eliminate that cost. (I also plan to dig into exactly why it's
>>> costing so much).
>
> IIRC it was because there is another thread filling batch of random numbers.
Err, why should that cause a slowdown? The lock is a local_lock, so it shouldn't
cause contention between 2 cpus. Conceptually each CPU has a different lock, right?
For non-RT:
#define local_lock_irqsave(lock, flags)				\
	__local_lock_irqsave(this_cpu_ptr(lock), flags)

#define __local_lock_irqsave(lock, flags)			\
	do {							\
		local_irq_save(flags);				\
		__local_lock_acquire(lock);			\
	} while (0)
local_irq_save() resolves to these 2 functions, which don't "look" expensive:
static __always_inline unsigned long __daif_local_save_flags(void)
{
	return read_sysreg(daif);
}

static __always_inline void __daif_local_irq_disable(void)
{
	barrier();
	asm volatile("msr daifset, #3");
	barrier();
}
__local_lock_acquire() resolves to this, where the lock is a local_lock_t, so this
doesn't do anything:
#define __local_lock_acquire(lock)					\
	do {								\
		local_trylock_t *tl;					\
		local_lock_t *l;					\
									\
		l = (local_lock_t *)(lock);				\
		tl = (local_trylock_t *)l;				\
		_Generic((lock),					\
			local_trylock_t *: ({				\
				lockdep_assert(tl->acquired == 0);	\
				WRITE_ONCE(tl->acquired, 1);		\
			}),						\
			local_lock_t *: (void)0);			\
		local_lock_acquire(l);					\
	} while (0)
>
>
>>>
>>> Furthermore, given we are actually only using 6 bits of entropy per syscall, we
>>> could instead just request a u8 instead of a u16 and only throw away 2 bits
>>> instead of 10 bits. This means we drain the entropy buffer half as quickly and
>>> make half as many slow calls into the crng:
>>>
>>> +-----------+---------------+--------------+-----------------+-----------------+
>>> | Benchmark | randomize=off | randomize=on | + no local_lock | + get_random_u8 |
>>> | | (baseline) | | | |
>>> +===========+===============+==============+=================+=================+
>>> | getpid | 0.19 | (R) -11.43% | (R) -8.41% | (R) -5.97% |
>>> +-----------+---------------+--------------+-----------------+-----------------+
>>> | getppid | 0.19 | (R) -13.81% | (R) -7.83% | (R) -6.14% |
>>> +-----------+---------------+--------------+-----------------+-----------------+
>>> | invalid | 0.18 | (R) -12.22% | (R) -5.55% | (R) -3.70% |
>>> +-----------+---------------+--------------+-----------------+-----------------+
>>>
>>> I expect we could even choose to re-buffer and save those 2 bits so we call the
>>> slow path even less often.
>>>
>>> I believe this helps the mean latency significantly without sacrificing any
>>> strength. But it doesn't reduce the tail latency because we still have to call
>>> into the crng eventually.
>>>
>>> So here's another idea: Could we use siphash to generate some random bits? We
>>> would generate the secret key at boot using the crng. Then generate a 64 bit
>>> siphash of (cntvct_el0 ^ tweak) (where tweak increments every time we generate a
>>> new hash). As long as the key remains secret, the hash is unpredictable.
>>> (perhaps we don't even need the timer value). For every hash we get 64 bits, so
>>> that would last for 10 syscalls at 6 bits per call. So we would still have to
>>> call siphash every 10 syscalls, so there would still be a tail, but from my
>>> experiements, it's much less than the crng:
>>>
>>> +-----------+---------------+--------------+-----------------+-----------------+
>>> | Benchmark | randomize=off | randomize=on | siphash | Jeremy's prng |
>>> | | (baseline) | | | |
>>> +===========+===============+==============+=================+=================+
>>> | getpid | 0.19 | (R) -11.43% | (R) -5.74% | (R) -2.06% |
>>> +-----------+---------------+--------------+-----------------+-----------------+
>>> | getppid | 0.19 | (R) -13.81% | (R) -3.39% | (R) -2.59% |
>>> +-----------+---------------+--------------+-----------------+-----------------+
>>> | invalid | 0.18 | (R) -12.22% | (R) -2.43% | -1.31% |
>>> +-----------+---------------+--------------+-----------------+-----------------+
>>>
>>> Could this give us a middle ground between strong-crng and
>>> weak-timestamp-counter? Perhaps the main issue is that we need to store the
>>> secret key for a long period?
>
> This is sorta fundamental with the (C)PRNG's. There is an estimated cycle (2^112
> it says for prandom) and the more state you keep hidden to increase that value
> slower the generation becomes.
>
> If your not worried about tail latency, then just use prandom per the patch
> below and xor in some data from the get_random_* into the local state on some
> interval. Say every million calls. One might argue for a couple bits of a
> counter on some interval would be enough if the tsc is good enough for x86, but
> frankly any counter must be independent from the core cpu clocks to have any
> hope of those values being unpredictable. Or just reseed it only on a subset of
> syscalls known to have a lot of variance, or some other per-cpu work, that at
> least will hide when its being reseeded from someone measuring the latency
> variance.
All sound reasonable to me. Lemme get some data for different options.
>
> I guess it depends on how strong you believe the existing system is. My
> perspective is that resseding it is probably a waste of time since its being
> quantized down to 2^5 bits anyway.
nit: it's 6 bits so 2^6 = 64 values.
I tend to agree; if you have some stack attack where you need to know the
offset, I guess you can just pick one offset and repeat the attack until it
succeeds, which on average takes ~64 attempts? It doesn't sound to me like we
should be paying such a performance price for that level of protection.
Thanks,
Ryan
>
>>>
>>>
>>> Anyway, I plan to work up a series with the bugfixes and performance
>>> improvements. I'll add the siphash approach as an experimental addition and get
>>> some more detailed numbers for all the options. But wanted to raise it all here
>>> first to get any early feedback.
>>>
>>>
>>> [1]
>>> https://lore.kernel.org/linux-arm-kernel/20240305221824.3300322-1-
>>> jeremy.linton@arm.com/
>>>
>>> Thanks,
>>> Ryan
>>>
>>
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [DISCUSSION] kstack offset randomization: bugs and performance
2025-11-18 10:28 ` Ryan Roberts
@ 2025-11-18 11:25 ` Mark Rutland
2025-11-18 12:16 ` Ryan Roberts
0 siblings, 1 reply; 27+ messages in thread
From: Mark Rutland @ 2025-11-18 11:25 UTC (permalink / raw)
To: Ryan Roberts
Cc: Kees Cook, Arnd Bergmann, Ard Biesheuvel, Jeremy Linton,
Will Deacon, Catalin Marinas,
linux-arm-kernel@lists.infradead.org, Linux Kernel Mailing List
On Tue, Nov 18, 2025 at 10:28:29AM +0000, Ryan Roberts wrote:
> On 17/11/2025 20:27, Kees Cook wrote:
> > On Mon, Nov 17, 2025 at 11:31:22AM +0000, Ryan Roberts wrote:
> >> On 17/11/2025 11:30, Ryan Roberts wrote:
> The original rationale for a separate choose_random_kstack_offset() at the end
> of the syscall is described as:
>
> * This position in the syscall flow is done to
> * frustrate attacks from userspace attempting to learn the next offset:
> * - Maximize the timing uncertainty visible from userspace: if the
> * offset is chosen at syscall entry, userspace has much more control
> * over the timing between choosing offsets. "How long will we be in
> * kernel mode?" tends to be more difficult to predict than "how long
> * will we be in user mode?"
> * - Reduce the lifetime of the new offset sitting in memory during
> * kernel mode execution. Exposure of "thread-local" memory content
> * (e.g. current, percpu, etc) tends to be easier than arbitrary
> * location memory exposure.
>
> I'm not totally convinced by the first argument; for arches that use the tsc,
> sampling the tsc at syscall entry would mean that userspace can figure out the
> random value that will be used for syscall N by sampling the tsc and adding a
> bit just before calling syscall N. Sampling the tsc at syscall exit would mean
> that userspace can figure out the random value that will be used for syscall N
> by sampling the tsc and subtracting a bit just after syscall N-1 returns. I
> don't really see any difference in protection?
>
> If you're trying force the kernel-sampled tsc to be a specific value, then for
> the sample-on-exit case, userspace can just make a syscall with an invalid id as
> it's syscall N-1 and in that case the duration between entry and exit is tiny
> and fixed so it's still pretty simple to force the value.
FWIW, I agree. I don't think we're gaining much based on the placement
of choose_random_kstack_offset() at the start/end of the entry/exit
sequences.
As an aside, it looks like x86 calls choose_random_kstack_offset() for
*any* return to userspace, including non-syscall returns (e.g. from
IRQ), in arch_exit_to_user_mode_prepare(). That causes some additional
randomness/perturbation, but logically it's not necessary to do it for
*all* returns to userspace.
> So what do you think of this approach? :
>
> #define add_random_kstack_offset(rand) do { \
> if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
> &randomize_kstack_offset)) { \
> u32 offset = raw_cpu_read(kstack_offset); \
> \
> offset = ror32(offset, 5) ^ (rand); \
> raw_cpu_write(kstack_offset, offset); \
> u8 *ptr = __kstack_alloca(KSTACK_OFFSET_MAX(offset)); \
> /* Keep allocation even after "ptr" loses scope. */ \
> asm volatile("" :: "r"(ptr) : "memory"); \
> } \
> } while (0)
>
> This ignores "Maximize the timing uncertainty" (but that's ok because the
> current version doesn't really do that either), but strengthens "Reduce the
> lifetime of the new offset sitting in memory".
Is this assuming that 'rand' can be generated in a non-preemptible
context? If so (and this is non-preemptible), that's fine.
I'm not sure whether that was the intent, or this was ignoring the
rescheduling problem.
If we do this per-task, then that concern disappears, and this can all
be preemptible.
Mark.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [DISCUSSION] kstack offset randomization: bugs and performance
2025-11-18 11:25 ` Mark Rutland
@ 2025-11-18 12:16 ` Ryan Roberts
0 siblings, 0 replies; 27+ messages in thread
From: Ryan Roberts @ 2025-11-18 12:16 UTC (permalink / raw)
To: Mark Rutland
Cc: Kees Cook, Arnd Bergmann, Ard Biesheuvel, Jeremy Linton,
Will Deacon, Catalin Marinas,
linux-arm-kernel@lists.infradead.org, Linux Kernel Mailing List
On 18/11/2025 11:25, Mark Rutland wrote:
> On Tue, Nov 18, 2025 at 10:28:29AM +0000, Ryan Roberts wrote:
>> On 17/11/2025 20:27, Kees Cook wrote:
>>> On Mon, Nov 17, 2025 at 11:31:22AM +0000, Ryan Roberts wrote:
>>>> On 17/11/2025 11:30, Ryan Roberts wrote:
>> The original rationale for a separate choose_random_kstack_offset() at the end
>> of the syscall is described as:
>>
>> * This position in the syscall flow is done to
>> * frustrate attacks from userspace attempting to learn the next offset:
>> * - Maximize the timing uncertainty visible from userspace: if the
>> * offset is chosen at syscall entry, userspace has much more control
>> * over the timing between choosing offsets. "How long will we be in
>> * kernel mode?" tends to be more difficult to predict than "how long
>> * will we be in user mode?"
>> * - Reduce the lifetime of the new offset sitting in memory during
>> * kernel mode execution. Exposure of "thread-local" memory content
>> * (e.g. current, percpu, etc) tends to be easier than arbitrary
>> * location memory exposure.
>>
>> I'm not totally convinced by the first argument; for arches that use the tsc,
>> sampling the tsc at syscall entry would mean that userspace can figure out the
>> random value that will be used for syscall N by sampling the tsc and adding a
>> bit just before calling syscall N. Sampling the tsc at syscall exit would mean
>> that userspace can figure out the random value that will be used for syscall N
>> by sampling the tsc and subtracting a bit just after syscall N-1 returns. I
>> don't really see any difference in protection?
>>
>> If you're trying force the kernel-sampled tsc to be a specific value, then for
>> the sample-on-exit case, userspace can just make a syscall with an invalid id as
>> it's syscall N-1 and in that case the duration between entry and exit is tiny
>> and fixed so it's still pretty simple to force the value.
>
> FWIW, I agree. I don't think we're gaining much based on the placement
> of choose_random_kstack_offset() at the start/end of the entry/exit
> sequences.
>
> As an aside, it looks like x86 calls choose_random_kstack_offset() for
> *any* return to userspace, including non-syscall returns (e.g. from
> IRQ), in arch_exit_to_user_mode_prepare(). There's some additional
> randomness/perturbation that'll cause, but logically it's not necessary
> to do that for *all* returns to userspace.
(as does s390)
Hmm, that's interesting; that will defeat the attack where a task is migrated
away from the cpu mid-syscall, since any future return to user space will still
stir the per-cpu pot.
So getting rid of choose_random_kstack_offset() would likely reduce security for
x86 and s390 because we would only be sampling the tsc on entry to syscalls.
But similarly, I think changing to keeping the offset per-task has the potential
to reduce the security here too, because each return to user space would only
mix entropy into the current task's value.
Perhaps this is actually an argument to keep it per-cpu.
For arm64 performance, the interrupt timing provides a somewhat-random source
that an attacker can't control or guess. So could we take a similar route: fold
in the timer value at every return to user (or even every interrupt), then use
that with Jeremy's per-cpu prng, which is seeded by the crng only once, at
cpu-online time?
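Something like this hand-wavy sketch is what I have in mind (all names invented,
untested, and exactly where the stir hook would live is an open question):

#include <linux/percpu.h>
#include <linux/prandom.h>
#include <asm/sysreg.h>

static DEFINE_PER_CPU(struct rnd_state, kstack_prng); /* prandom_seed_state() at cpu-online */
static DEFINE_PER_CPU(u32, kstack_pot);

/* Return-to-user (or irq) path: stir the per-cpu pot with the arch timer. */
static __always_inline void kstack_stir(void)
{
        u32 pot = raw_cpu_read(kstack_pot);

        raw_cpu_write(kstack_pot, pot ^ (u32)read_sysreg(cntvct_el0));
}

/* Syscall entry: combine the prng stream with the stirred pot. */
static __always_inline u32 kstack_rand(void)
{
        return prandom_u32_state(raw_cpu_ptr(&kstack_prng)) ^ raw_cpu_read(kstack_pot);
}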
I know you don't like relying on the timer, Mark, but sampling it per interrupt
feels a bit stronger to me?
>
>> So what do you think of this approach? :
>>
>> #define add_random_kstack_offset(rand) do { \
>> if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
>> &randomize_kstack_offset)) { \
>> u32 offset = raw_cpu_read(kstack_offset); \
>> \
>> offset = ror32(offset, 5) ^ (rand); \
>> raw_cpu_write(kstack_offset, offset); \
>> u8 *ptr = __kstack_alloca(KSTACK_OFFSET_MAX(offset)); \
>> /* Keep allocation even after "ptr" loses scope. */ \
>> asm volatile("" :: "r"(ptr) : "memory"); \
>> } \
>> } while (0)
>>
>> This ignores "Maximize the timing uncertainty" (but that's ok because the
>> current version doesn't really do that either), but strengthens "Reduce the
>> lifetime of the new offset sitting in memory".
>
> Is this assuming that 'rand' can be generated in a non-preemptible
> context? If so (and this is non-preemptible), that's fine.
Yes, this needs to be called with preemption disabled, and yes, it assumes that
rand can be generated in a non-preemptible context - which is true for all
arches' rand sources today.
Although, given the cost of get_random_u16() (or u8) on arm64 when it has to
call into the crng, I was considering that we would first "try to get rand"
before enabling preemption and, if that fails, enable preemption and then do the
slow path. That would at least allow preemption on RT kernels (due to
get_random_u16()'s local_lock).
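i.e. something with this rough shape (the signature of the proposed raw helper
is an assumption, and the enable/disable calls are just placeholders for
whatever the entry code actually uses):

/* Sketch only: try the lockless fast path with irqs/preemption disabled,
 * and fall back to the normal (possibly locking) path with them enabled. */
static u16 kstack_get_rand(void)
{
        u16 rand;

        if (raw_try_get_random_u16(&rand)) /* fast path: batch still has bits */
                return rand;

        local_irq_enable();                /* let RT preempt the crng slow path */
        rand = get_random_u16();
        local_irq_disable();

        return rand;
}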
But given the above discussion about return-to-user, perhaps this is not the way
to go anyway. I suspect it makes sense to keep the entropy collection separate
from its usage.
>
> I'm not sure whether that was the intent, or this was ignoring the
> rescheduling problem.
>
> If we do this per-task, then that concern disappears, and this can all
> be preemptible.
>
> Mark.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [DISCUSSION] kstack offset randomization: bugs and performance
2025-11-17 16:47 ` Arnd Bergmann
2025-11-17 17:23 ` Ryan Roberts
2025-11-17 17:46 ` Mark Rutland
@ 2025-11-18 17:15 ` Jason A. Donenfeld
2025-11-18 17:21 ` Ryan Roberts
2 siblings, 1 reply; 27+ messages in thread
From: Jason A. Donenfeld @ 2025-11-18 17:15 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Ryan Roberts, Kees Cook, Ard Biesheuvel, Jeremy Linton,
Will Deacon, Catalin Marinas, Mark Rutland,
linux-arm-kernel@lists.infradead.org, Linux Kernel Mailing List
On Mon, Nov 17, 2025 at 05:47:05PM +0100, Arnd Bergmann wrote:
> On Mon, Nov 17, 2025, at 12:31, Ryan Roberts wrote:
> > On 17/11/2025 11:30, Ryan Roberts wrote:
> >> Hi All,
> >>
> >> Over the last few years we had a few complaints that syscall performance on
> >> arm64 is slower than x86. Most recently, it was observed that a certain Java
> >> benchmark that does a lot of fstat and lseek is spending ~10% of it's time in
> >> get_random_u16(). Cue a bit of digging, which led me to [1] and also to some new
> >> ideas about how performance could be improved.
>
>
> >> I believe this helps the mean latency significantly without sacrificing any
> >> strength. But it doesn't reduce the tail latency because we still have to call
> >> into the crng eventually.
> >>
> >> So here's another idea: Could we use siphash to generate some random bits? We
> >> would generate the secret key at boot using the crng. Then generate a 64 bit
> >> siphash of (cntvct_el0 ^ tweak) (where tweak increments every time we generate a
> >> new hash). As long as the key remains secret, the hash is unpredictable.
> >> (perhaps we don't even need the timer value). For every hash we get 64 bits, so
> >> that would last for 10 syscalls at 6 bits per call. So we would still have to
> >> call siphash every 10 syscalls, so there would still be a tail, but from my
> >> experiements, it's much less than the crng:
>
> IIRC, Jason argued against creating another type of prng inside of the
> kernel for a special purpose.
Yes indeed... I'm really not a fan of adding bespoke crypto willynilly
like that. Let's make get_random_u*() faster. If you're finding that the
issue with it is the locking, and that you're calling this from irq
context anyway, then your proposal (if I read this discussion correctly)
to add a raw_get_random_u*() seems like it could be sensible. Those
functions are generated via macro anyway, so it wouldn't be too much to
add the raw overloads. Feel free to send a patch to my random.git tree
if you'd like to give that a try.
Jason
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [DISCUSSION] kstack offset randomization: bugs and performance
2025-11-18 17:15 ` Jason A. Donenfeld
@ 2025-11-18 17:21 ` Ryan Roberts
2025-11-18 17:28 ` Jason A. Donenfeld
0 siblings, 1 reply; 27+ messages in thread
From: Ryan Roberts @ 2025-11-18 17:21 UTC (permalink / raw)
To: Jason A. Donenfeld, Arnd Bergmann
Cc: Kees Cook, Ard Biesheuvel, Jeremy Linton, Will Deacon,
Catalin Marinas, Mark Rutland,
linux-arm-kernel@lists.infradead.org, Linux Kernel Mailing List
On 18/11/2025 17:15, Jason A. Donenfeld wrote:
> On Mon, Nov 17, 2025 at 05:47:05PM +0100, Arnd Bergmann wrote:
>> On Mon, Nov 17, 2025, at 12:31, Ryan Roberts wrote:
>>> On 17/11/2025 11:30, Ryan Roberts wrote:
>>>> Hi All,
>>>>
>>>> Over the last few years we had a few complaints that syscall performance on
>>>> arm64 is slower than x86. Most recently, it was observed that a certain Java
>>>> benchmark that does a lot of fstat and lseek is spending ~10% of it's time in
>>>> get_random_u16(). Cue a bit of digging, which led me to [1] and also to some new
>>>> ideas about how performance could be improved.
>>
>>
>>>> I believe this helps the mean latency significantly without sacrificing any
>>>> strength. But it doesn't reduce the tail latency because we still have to call
>>>> into the crng eventually.
>>>>
>>>> So here's another idea: Could we use siphash to generate some random bits? We
>>>> would generate the secret key at boot using the crng. Then generate a 64 bit
>>>> siphash of (cntvct_el0 ^ tweak) (where tweak increments every time we generate a
>>>> new hash). As long as the key remains secret, the hash is unpredictable.
>>>> (perhaps we don't even need the timer value). For every hash we get 64 bits, so
>>>> that would last for 10 syscalls at 6 bits per call. So we would still have to
>>>> call siphash every 10 syscalls, so there would still be a tail, but from my
>>>> experiements, it's much less than the crng:
>>
>> IIRC, Jason argued against creating another type of prng inside of the
>> kernel for a special purpose.
>
> Yes indeed... I'm really not a fan of adding bespoke crypto willynilly
> like that. Let's make get_random_u*() faster. If you're finding that the
> issue with it is the locking, and that you're calling this from irq
> context anyway, then your proposal (if I read this discussion correctly)
> to add a raw_get_random_u*() seems like it could be sensible. Those
> functions are generated via macro anyway, so it wouldn't be too much to
> add the raw overloads. Feel free to send a patch to my random.git tree
> if you'd like to give that a try.
Thanks Jason; that's exactly what I did, and it helps. But I think ultimately
the get_random_uXX() slow path is too slow; that's the part that causes the tail
latency problem. I doubt there are options for speeding that up?
Anyway, I'm currently prototyping a few options and getting clear performance
numbers. I'll be back in a couple of days and we can continue the discussion in
light of the data.
Thanks,
Ryan
>
> Jason
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [DISCUSSION] kstack offset randomization: bugs and performance
2025-11-18 17:21 ` Ryan Roberts
@ 2025-11-18 17:28 ` Jason A. Donenfeld
0 siblings, 0 replies; 27+ messages in thread
From: Jason A. Donenfeld @ 2025-11-18 17:28 UTC (permalink / raw)
To: Ryan Roberts
Cc: Arnd Bergmann, Kees Cook, Ard Biesheuvel, Jeremy Linton,
Will Deacon, Catalin Marinas, Mark Rutland,
linux-arm-kernel@lists.infradead.org, Linux Kernel Mailing List,
ebiggers
On Tue, Nov 18, 2025 at 05:21:17PM +0000, Ryan Roberts wrote:
> On 18/11/2025 17:15, Jason A. Donenfeld wrote:
> > On Mon, Nov 17, 2025 at 05:47:05PM +0100, Arnd Bergmann wrote:
> >> On Mon, Nov 17, 2025, at 12:31, Ryan Roberts wrote:
> >>> On 17/11/2025 11:30, Ryan Roberts wrote:
> >>>> Hi All,
> >>>>
> >>>> Over the last few years we had a few complaints that syscall performance on
> >>>> arm64 is slower than x86. Most recently, it was observed that a certain Java
> >>>> benchmark that does a lot of fstat and lseek is spending ~10% of it's time in
> >>>> get_random_u16(). Cue a bit of digging, which led me to [1] and also to some new
> >>>> ideas about how performance could be improved.
> >>
> >>
> >>>> I believe this helps the mean latency significantly without sacrificing any
> >>>> strength. But it doesn't reduce the tail latency because we still have to call
> >>>> into the crng eventually.
> >>>>
> >>>> So here's another idea: Could we use siphash to generate some random bits? We
> >>>> would generate the secret key at boot using the crng. Then generate a 64 bit
> >>>> siphash of (cntvct_el0 ^ tweak) (where tweak increments every time we generate a
> >>>> new hash). As long as the key remains secret, the hash is unpredictable.
> >>>> (perhaps we don't even need the timer value). For every hash we get 64 bits, so
> >>>> that would last for 10 syscalls at 6 bits per call. So we would still have to
> >>>> call siphash every 10 syscalls, so there would still be a tail, but from my
> >>>> experiements, it's much less than the crng:
> >>
> >> IIRC, Jason argued against creating another type of prng inside of the
> >> kernel for a special purpose.
> >
> > Yes indeed... I'm really not a fan of adding bespoke crypto willynilly
> > like that. Let's make get_random_u*() faster. If you're finding that the
> > issue with it is the locking, and that you're calling this from irq
> > context anyway, then your proposal (if I read this discussion correctly)
> > to add a raw_get_random_u*() seems like it could be sensible. Those
> > functions are generated via macro anyway, so it wouldn't be too much to
> > add the raw overloads. Feel free to send a patch to my random.git tree
> > if you'd like to give that a try.
>
> Thanks Jason; that's exactly what I did, and it helps. But I think ultimately
> the get_random_uXX() slow path is too slow; that's the part that causes the tail
> latency problem. I doubt there are options for speeding that up?
>
> Anyway, I'm currently prototyping a few options and getting clear performance
> numbers. I'll be back in a couple of days and we can continue the discussion in
> light of the data.
Interesting... I would be curious to see what sorts of stable numbers
you find. Because most of the time, get_random_uXX() should just be
copying memory. Does the unlikely slower case really matter that much? I
suspect it doesn't matter for anything real. On the other hand, it's
probably possible to improve the slow path on ARM a bit by using the
pure-ARM assembly chacha implementation that we use in the vDSO:
https://git.kernel.org/pub/scm/linux/kernel/git/crng/random.git/tree/arch/arm64/kernel/vdso/vgetrandom-chacha.S
Or by using the non-generic code already provided by libcrypto from
random.c.
Jason
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [DISCUSSION] kstack offset randomization: bugs and performance
2025-11-17 11:31 ` [DISCUSSION] kstack offset randomization: bugs and performance Ryan Roberts
` (2 preceding siblings ...)
2025-11-17 20:56 ` Jeremy Linton
@ 2025-11-24 14:36 ` Will Deacon
2025-11-24 17:11 ` Kees Cook
3 siblings, 1 reply; 27+ messages in thread
From: Will Deacon @ 2025-11-24 14:36 UTC (permalink / raw)
To: Ryan Roberts
Cc: Kees Cook, Arnd Bergmann, Ard Biesheuvel, Jeremy Linton,
Catalin Marinas, Mark Rutland,
linux-arm-kernel@lists.infradead.org, Linux Kernel Mailing List
On Mon, Nov 17, 2025 at 11:31:22AM +0000, Ryan Roberts wrote:
> On 17/11/2025 11:30, Ryan Roberts wrote:
> > Could this give us a middle ground between strong-crng and
> > weak-timestamp-counter? Perhaps the main issue is that we need to store the
> > secret key for a long period?
> >
> >
> > Anyway, I plan to work up a series with the bugfixes and performance
> > improvements. I'll add the siphash approach as an experimental addition and get
> > some more detailed numbers for all the options. But wanted to raise it all here
> > first to get any early feedback.
FWIW, I share Mark's concerns about using a counter for this. Given that
the feature currently appears to be both slow _and_ broken I'd vote for
either removing it or switching over to per-thread offsets as a first
step. We already have a per-task stack canary with
CONFIG_STACKPROTECTOR_PER_TASK so I don't understand the reluctance to
do something similar here.
Speeding up the crypto feels like something that could happen separately.
Will
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [DISCUSSION] kstack offset randomization: bugs and performance
2025-11-24 14:36 ` Will Deacon
@ 2025-11-24 17:11 ` Kees Cook
2025-11-24 17:50 ` Ryan Roberts
2025-11-24 19:08 ` Will Deacon
0 siblings, 2 replies; 27+ messages in thread
From: Kees Cook @ 2025-11-24 17:11 UTC (permalink / raw)
To: Will Deacon, Ryan Roberts
Cc: Arnd Bergmann, Ard Biesheuvel, Jeremy Linton, Catalin Marinas,
Mark Rutland, linux-arm-kernel@lists.infradead.org,
Linux Kernel Mailing List
On November 24, 2025 6:36:25 AM PST, Will Deacon <will@kernel.org> wrote:
>On Mon, Nov 17, 2025 at 11:31:22AM +0000, Ryan Roberts wrote:
>> On 17/11/2025 11:30, Ryan Roberts wrote:
>> > Could this give us a middle ground between strong-crng and
>> > weak-timestamp-counter? Perhaps the main issue is that we need to store the
>> > secret key for a long period?
>> >
>> >
>> > Anyway, I plan to work up a series with the bugfixes and performance
>> > improvements. I'll add the siphash approach as an experimental addition and get
>> > some more detailed numbers for all the options. But wanted to raise it all here
>> > first to get any early feedback.
>
>FWIW, I share Mark's concerns about using a counter for this. Given that
>the feature currently appears to be both slow _and_ broken I'd vote for
>either removing it or switching over to per-thread offsets as a first
>step.
That it has potential weaknesses doesn't mean it should be entirely removed.
> We already have a per-task stack canary with
>CONFIG_STACKPROTECTOR_PER_TASK so I don't understand the reluctance to
>do something similar here.
That's not a reasonable comparison: the stack canary cannot change arbitrarily for a task or it would immediately crash on a function return. :)
>Speeding up the crypto feels like something that could happen separately.
Sure. But let's see what Ryan's patches look like. The suggested changes sound good to me.
-Kees
--
Kees Cook
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [DISCUSSION] kstack offset randomization: bugs and performance
2025-11-24 17:11 ` Kees Cook
@ 2025-11-24 17:50 ` Ryan Roberts
2025-11-24 20:51 ` Kees Cook
2025-11-24 19:08 ` Will Deacon
1 sibling, 1 reply; 27+ messages in thread
From: Ryan Roberts @ 2025-11-24 17:50 UTC (permalink / raw)
To: Kees Cook, Will Deacon
Cc: Arnd Bergmann, Ard Biesheuvel, Jeremy Linton, Catalin Marinas,
Mark Rutland, linux-arm-kernel@lists.infradead.org,
Linux Kernel Mailing List
On 24/11/2025 17:11, Kees Cook wrote:
>
>
> On November 24, 2025 6:36:25 AM PST, Will Deacon <will@kernel.org> wrote:
>> On Mon, Nov 17, 2025 at 11:31:22AM +0000, Ryan Roberts wrote:
>>> On 17/11/2025 11:30, Ryan Roberts wrote:
>>>> Could this give us a middle ground between strong-crng and
>>>> weak-timestamp-counter? Perhaps the main issue is that we need to store the
>>>> secret key for a long period?
>>>>
>>>>
>>>> Anyway, I plan to work up a series with the bugfixes and performance
>>>> improvements. I'll add the siphash approach as an experimental addition and get
>>>> some more detailed numbers for all the options. But wanted to raise it all here
>>>> first to get any early feedback.
>>
>> FWIW, I share Mark's concerns about using a counter for this. Given that
>> the feature currently appears to be both slow _and_ broken I'd vote for
>> either removing it or switching over to per-thread offsets as a first
>> step.
>
> That it has potential weaknesses doesn't mean it should be entirely removed.
>
>> We already have a per-task stack canary with
>> CONFIG_STACKPROTECTOR_PER_TASK so I don't understand the reluctance to
>> do something similar here.
>
> That's not a reasonable comparison: the stack canary cannot change arbitrarily for a task or it would immediately crash on a function return. :)
>
>> Speeding up the crypto feels like something that could happen separately.
>
> Sure. But let's see what Ryan's patches look like. The suggested changes sound good to me.
Just to say I haven't forgotten about this; I ended up having to switch to
something more urgent. Hoping to get back to it later this week. I don't think
this is an urgent issue, so hopefully folks are ok waiting.
I propose to post whatever I end up with, then we can all discuss from there.
But the rough shape so far:
Fixes:
- Remove choose_random_kstack_offset()
- arch passes random into add_random_kstack_offset() (fixes migration bypass)
- Move add_random_kstack_offset() to el0_svc()/el0_svc_compat() (before
enabling interrupts) to fix non-preemption requirement (arm64)
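For that last fix, the rough shape on the arm64 side would be something like
this (approximate; it elides the errata/FP handling in the real el0_svc(), and
kstack_get_rand() is a placeholder for whatever arch rand source we settle on):

static void noinstr el0_svc(struct pt_regs *regs)
{
        enter_from_user_mode(regs);
        /* irqs/preemption still disabled: consume and re-buffer in one step */
        add_random_kstack_offset(kstack_get_rand());
        local_daif_restore(DAIF_PROCCTX);
        do_el0_svc(regs);
        exit_to_user_mode(regs);
}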
Perf Improvements:
- Based on Jeremy's prng, but buffer the 32 bits and use 6 bits per syscall, so
the cost of prng generation is amortized over 5 syscalls (rough sketch below)
- Reseed prng using get_random_u64() every 64K prng invocations (so cost of
get_random_u64() is amortized over 64K*5 syscalls)
- So while get_random_u64() still has a latency spike, it's so infrequent that
it doesn't show up in p99.9 for my benchmarks.
- If we want to change it to per-task, I think it's all amenable.
- I'll leave the timer off limits for arm64.
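Roughly like this sketch (whether the state ends up per-cpu or per-task is TBD;
the names are invented and this isn't the actual patch):

#include <linux/prandom.h>
#include <linux/random.h>
#include <linux/sizes.h>

/* Hypothetical state; could live per-cpu or in task_struct. */
struct kstack_rnd {
        struct rnd_state prng;  /* seeded from the crng */
        u32 bits;               /* buffered prng output */
        u8 nbits;               /* bits left in 'bits' */
        u32 calls;              /* prng draws since last reseed */
};

static u16 kstack_rand6(struct kstack_rnd *k)
{
        u16 r;

        if (k->nbits < 6) {
                if (++k->calls >= SZ_64K) {
                        /* get_random_u64() cost amortized over 64K * 5 syscalls */
                        prandom_seed_state(&k->prng, get_random_u64());
                        k->calls = 0;
                }
                /* prng cost amortized over 5 syscalls (30 of the 32 bits used) */
                k->bits = prandom_u32_state(&k->prng);
                k->nbits = 32;
        }

        r = k->bits & 0x3f;
        k->bits >>= 6;
        k->nbits -= 6;

        return r;
}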
Although I'm seeing some inconsistencies in the performance measurements, so I
need to get that understood properly first.
Thanks,
Ryan
>
> -Kees
>
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [DISCUSSION] kstack offset randomization: bugs and performance
2025-11-24 17:11 ` Kees Cook
2025-11-24 17:50 ` Ryan Roberts
@ 2025-11-24 19:08 ` Will Deacon
1 sibling, 0 replies; 27+ messages in thread
From: Will Deacon @ 2025-11-24 19:08 UTC (permalink / raw)
To: Kees Cook
Cc: Ryan Roberts, Arnd Bergmann, Ard Biesheuvel, Jeremy Linton,
Catalin Marinas, Mark Rutland,
linux-arm-kernel@lists.infradead.org, Linux Kernel Mailing List
On Mon, Nov 24, 2025 at 09:11:23AM -0800, Kees Cook wrote:
> On November 24, 2025 6:36:25 AM PST, Will Deacon <will@kernel.org> wrote:
> >On Mon, Nov 17, 2025 at 11:31:22AM +0000, Ryan Roberts wrote:
> >> On 17/11/2025 11:30, Ryan Roberts wrote:
> >> > Could this give us a middle ground between strong-crng and
> >> > weak-timestamp-counter? Perhaps the main issue is that we need to store the
> >> > secret key for a long period?
> >> >
> >> >
> >> > Anyway, I plan to work up a series with the bugfixes and performance
> >> > improvements. I'll add the siphash approach as an experimental addition and get
> >> > some more detailed numbers for all the options. But wanted to raise it all here
> >> > first to get any early feedback.
> >
> >FWIW, I share Mark's concerns about using a counter for this. Given that
> >the feature currently appears to be both slow _and_ broken I'd vote for
> >either removing it or switching over to per-thread offsets as a first
> >step.
>
> That it has potential weaknesses doesn't mean it should be entirely
> removed.
Well, we can always bring it back when it does something useful :)
> > We already have a per-task stack canary with
> >CONFIG_STACKPROTECTOR_PER_TASK so I don't understand the reluctance to
> >do something similar here.
>
> That's not a reasonable comparison: the stack canary cannot change
> arbitrarily for a task or it would immediately crash on a function return.
> :)
Fair enough, but I was thinking more about concerns relating to the
size of task struct. I don't think that's a huge concern in this case
and we already have tonnes of junk in thread_struct if you want to put
it there instead. Certainly, persevering with per-cpu data just feels
like the wrong approach to me based on Ryan's report.
> >Speeding up the crypto feels like something that could happen separately.
>
> Sure. But let's see what Ryan's patches look like. The suggested changes
> sound good to me.
I guess we'll have to wait and see but some of the ideas in this thread
(e.g. using the counter and interrupt timing) seem pretty flawed to me
so I was trying to avoid Ryan wasting his time.
Will
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [DISCUSSION] kstack offset randomization: bugs and performance
2025-11-24 17:50 ` Ryan Roberts
@ 2025-11-24 20:51 ` Kees Cook
2025-11-25 11:14 ` Ryan Roberts
0 siblings, 1 reply; 27+ messages in thread
From: Kees Cook @ 2025-11-24 20:51 UTC (permalink / raw)
To: Ryan Roberts
Cc: Will Deacon, Arnd Bergmann, Ard Biesheuvel, Jeremy Linton,
Catalin Marinas, Mark Rutland,
linux-arm-kernel@lists.infradead.org, Linux Kernel Mailing List
On Mon, Nov 24, 2025 at 05:50:14PM +0000, Ryan Roberts wrote:
> On 24/11/2025 17:11, Kees Cook wrote:
> >
> >
> > On November 24, 2025 6:36:25 AM PST, Will Deacon <will@kernel.org> wrote:
> >> On Mon, Nov 17, 2025 at 11:31:22AM +0000, Ryan Roberts wrote:
> >>> On 17/11/2025 11:30, Ryan Roberts wrote:
> >>>> Could this give us a middle ground between strong-crng and
> >>>> weak-timestamp-counter? Perhaps the main issue is that we need to store the
> >>>> secret key for a long period?
> >>>>
> >>>>
> >>>> Anyway, I plan to work up a series with the bugfixes and performance
> >>>> improvements. I'll add the siphash approach as an experimental addition and get
> >>>> some more detailed numbers for all the options. But wanted to raise it all here
> >>>> first to get any early feedback.
> >>
> >> FWIW, I share Mark's concerns about using a counter for this. Given that
> >> the feature currently appears to be both slow _and_ broken I'd vote for
> >> either removing it or switching over to per-thread offsets as a first
> >> step.
> >
> > That it has potential weaknesses doesn't mean it should be entirely removed.
> >
> >> We already have a per-task stack canary with
> >> CONFIG_STACKPROTECTOR_PER_TASK so I don't understand the reluctance to
> >> do something similar here.
> >
> > That's not a reasonable comparison: the stack canary cannot change arbitrarily for a task or it would immediately crash on a function return. :)
> >
> >> Speeding up the crypto feels like something that could happen separately.
> >
> > Sure. But let's see what Ryan's patches look like. The suggested changes sound good to me.
>
> Just to say I haven't forgotten about this; I ended up having to switch to
> something more urgent. Hoping to get back to it later this week. I don't think
> this is an urgent issue, so hopefully folks are ok waiting.
>
> I propose to post whatever I end up with then we can all disscuss from there.
> But the rough shape so far:
>
> Fixes:
> - Remove choose_random_kstack_offset()
> - arch passes random into add_random_kstack_offset() (fixes migration bypass)
> - Move add_random_kstack_offset() to el0_svc()/el0_svc_compat() (before
> enabling interrupts) to fix non-preemption requirement (arm64)
I thought we'd keep choose_random_kstack_offset() and just move
everything into a per-task location? (And for arm64 only)
--
Kees Cook
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [DISCUSSION] kstack offset randomization: bugs and performance
2025-11-24 20:51 ` Kees Cook
@ 2025-11-25 11:14 ` Ryan Roberts
2025-11-26 22:58 ` Ard Biesheuvel
0 siblings, 1 reply; 27+ messages in thread
From: Ryan Roberts @ 2025-11-25 11:14 UTC (permalink / raw)
To: Kees Cook
Cc: Will Deacon, Arnd Bergmann, Ard Biesheuvel, Jeremy Linton,
Catalin Marinas, Mark Rutland,
linux-arm-kernel@lists.infradead.org, Linux Kernel Mailing List
On 24/11/2025 20:51, Kees Cook wrote:
> On Mon, Nov 24, 2025 at 05:50:14PM +0000, Ryan Roberts wrote:
>> On 24/11/2025 17:11, Kees Cook wrote:
>>>
>>>
>>> On November 24, 2025 6:36:25 AM PST, Will Deacon <will@kernel.org> wrote:
>>>> On Mon, Nov 17, 2025 at 11:31:22AM +0000, Ryan Roberts wrote:
>>>>> On 17/11/2025 11:30, Ryan Roberts wrote:
>>>>>> Could this give us a middle ground between strong-crng and
>>>>>> weak-timestamp-counter? Perhaps the main issue is that we need to store the
>>>>>> secret key for a long period?
>>>>>>
>>>>>>
>>>>>> Anyway, I plan to work up a series with the bugfixes and performance
>>>>>> improvements. I'll add the siphash approach as an experimental addition and get
>>>>>> some more detailed numbers for all the options. But wanted to raise it all here
>>>>>> first to get any early feedback.
>>>>
>>>> FWIW, I share Mark's concerns about using a counter for this. Given that
>>>> the feature currently appears to be both slow _and_ broken I'd vote for
>>>> either removing it or switching over to per-thread offsets as a first
>>>> step.
>>>
>>> That it has potential weaknesses doesn't mean it should be entirely removed.
>>>
>>>> We already have a per-task stack canary with
>>>> CONFIG_STACKPROTECTOR_PER_TASK so I don't understand the reluctance to
>>>> do something similar here.
>>>
>>> That's not a reasonable comparison: the stack canary cannot change arbitrarily for a task or it would immediately crash on a function return. :)
>>>
>>>> Speeding up the crypto feels like something that could happen separately.
>>>
>>> Sure. But let's see what Ryan's patches look like. The suggested changes sound good to me.
>>
>> Just to say I haven't forgotten about this; I ended up having to switch to
>> something more urgent. Hoping to get back to it later this week. I don't think
>> this is an urgent issue, so hopefully folks are ok waiting.
>>
>> I propose to post whatever I end up with then we can all disscuss from there.
>> But the rough shape so far:
>>
>> Fixes:
>> - Remove choose_random_kstack_offset()
>> - arch passes random into add_random_kstack_offset() (fixes migration bypass)
>> - Move add_random_kstack_offset() to el0_svc()/el0_svc_compat() (before
>> enabling interrupts) to fix non-preemption requirement (arm64)
>
> I thought we'd keep choose_random_kstack_offset() and just move
> everything into a per-task location? (And for arm64 only)
Err... I thought you were the one arguing against per-task state? I'm not really
keen on having arm64 do a completely different thing to everyone else; it seems
reasonable to me that we only need to (continue to) abstract the random source
per-arch and the rest should remain common?
Per my previous mails, I'm not really sure what choose_random_kstack_offset() is
giving us in practice. Why not simplify?
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [DISCUSSION] kstack offset randomization: bugs and performance
2025-11-25 11:14 ` Ryan Roberts
@ 2025-11-26 22:58 ` Ard Biesheuvel
2025-11-27 8:00 ` Kees Cook
0 siblings, 1 reply; 27+ messages in thread
From: Ard Biesheuvel @ 2025-11-26 22:58 UTC (permalink / raw)
To: Ryan Roberts
Cc: Kees Cook, Will Deacon, Arnd Bergmann, Jeremy Linton,
Catalin Marinas, Mark Rutland,
linux-arm-kernel@lists.infradead.org, Linux Kernel Mailing List
On Tue, 25 Nov 2025 at 12:15, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 24/11/2025 20:51, Kees Cook wrote:
> > On Mon, Nov 24, 2025 at 05:50:14PM +0000, Ryan Roberts wrote:
> >> On 24/11/2025 17:11, Kees Cook wrote:
> >>>
> >>>
> >>> On November 24, 2025 6:36:25 AM PST, Will Deacon <will@kernel.org> wrote:
> >>>> On Mon, Nov 17, 2025 at 11:31:22AM +0000, Ryan Roberts wrote:
> >>>>> On 17/11/2025 11:30, Ryan Roberts wrote:
> >>>>>> Could this give us a middle ground between strong-crng and
> >>>>>> weak-timestamp-counter? Perhaps the main issue is that we need to store the
> >>>>>> secret key for a long period?
> >>>>>>
> >>>>>>
> >>>>>> Anyway, I plan to work up a series with the bugfixes and performance
> >>>>>> improvements. I'll add the siphash approach as an experimental addition and get
> >>>>>> some more detailed numbers for all the options. But wanted to raise it all here
> >>>>>> first to get any early feedback.
> >>>>
> >>>> FWIW, I share Mark's concerns about using a counter for this. Given that
> >>>> the feature currently appears to be both slow _and_ broken I'd vote for
> >>>> either removing it or switching over to per-thread offsets as a first
> >>>> step.
> >>>
> >>> That it has potential weaknesses doesn't mean it should be entirely removed.
> >>>
> >>>> We already have a per-task stack canary with
> >>>> CONFIG_STACKPROTECTOR_PER_TASK so I don't understand the reluctance to
> >>>> do something similar here.
> >>>
> >>> That's not a reasonable comparison: the stack canary cannot change arbitrarily for a task or it would immediately crash on a function return. :)
> >>>
> >>>> Speeding up the crypto feels like something that could happen separately.
> >>>
> >>> Sure. But let's see what Ryan's patches look like. The suggested changes sound good to me.
> >>
> >> Just to say I haven't forgotten about this; I ended up having to switch to
> >> something more urgent. Hoping to get back to it later this week. I don't think
> >> this is an urgent issue, so hopefully folks are ok waiting.
> >>
> >> I propose to post whatever I end up with then we can all disscuss from there.
> >> But the rough shape so far:
> >>
> >> Fixes:
> >> - Remove choose_random_kstack_offset()
> >> - arch passes random into add_random_kstack_offset() (fixes migration bypass)
> >> - Move add_random_kstack_offset() to el0_svc()/el0_svc_compat() (before
> >> enabling interrupts) to fix non-preemption requirement (arm64)
> >
> > I thought we'd keep choose_random_kstack_offset() and just move
> > everything into a per-task location? (And for arm64 only)
>
> Err... I thought you were the one arguing against per-task state? I'm not really
> keen on having arm64 do a completely different thing to everyone else; It seems
> reasonable to me that we only need to (continue to) abstract the random source
> per-arch and the rest should remain common?
>
> Per my previous mails, I'm not really sure what choose_random_kstack_offset() is
> giving us in practice. Why not simplify?
>
It seems to me that arm64 inherited a lot of complexity from other
architectures that rely on a counter. AIUI, the 'non-preemption
requirement' and the 'migration bypass' issues are both a result of
the fact that the syscall duration is relied upon to provide more
pseudo-randomness, which is recorded in a per-CPU counter when the
syscall completes and carried over to the next invocation. (I don't
really buy the idea that having the next offset in memory for a
shorter time buys us anything in terms of robustness tbh)
I have some patches that make get_random_uXX()'s fast path lockless, which I
will send out shortly (assuming no hate mail from the bots in the meantime).
That should make it suitable for general use as the entropy source for kstack
randomization on all architectures, modulo the tail latency issue; but I'm not
sure I understand why that is a problem to begin with if it occurs sufficiently
rarely. Is that a PREEMPT_RT issue? Would it be better if the refill of the
per-CPU batched entropy buffers was relegated to some kind of kthread so it
can be scheduled independently? (Those buffers are all the same size so we
could easily keep a few hot spares.)
TL;DR I agree with Ryan but I'd like to fix it across the board, i.e.,
get rid of the per-CPU variable entirely and all the dodgy uses of
counters.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [DISCUSSION] kstack offset randomization: bugs and performance
2025-11-26 22:58 ` Ard Biesheuvel
@ 2025-11-27 8:00 ` Kees Cook
2025-11-27 11:50 ` Ryan Roberts
0 siblings, 1 reply; 27+ messages in thread
From: Kees Cook @ 2025-11-27 8:00 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Ryan Roberts, Will Deacon, Arnd Bergmann, Jeremy Linton,
Catalin Marinas, Mark Rutland,
linux-arm-kernel@lists.infradead.org, Linux Kernel Mailing List
On Wed, Nov 26, 2025 at 11:58:40PM +0100, Ard Biesheuvel wrote:
> On Tue, 25 Nov 2025 at 12:15, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >
> > On 24/11/2025 20:51, Kees Cook wrote:
> > > On Mon, Nov 24, 2025 at 05:50:14PM +0000, Ryan Roberts wrote:
> > >> On 24/11/2025 17:11, Kees Cook wrote:
> > >>>
> > >>>
> > >>> On November 24, 2025 6:36:25 AM PST, Will Deacon <will@kernel.org> wrote:
> > >>>> On Mon, Nov 17, 2025 at 11:31:22AM +0000, Ryan Roberts wrote:
> > >>>>> On 17/11/2025 11:30, Ryan Roberts wrote:
> > >>>>>> Could this give us a middle ground between strong-crng and
> > >>>>>> weak-timestamp-counter? Perhaps the main issue is that we need to store the
> > >>>>>> secret key for a long period?
> > >>>>>>
> > >>>>>>
> > >>>>>> Anyway, I plan to work up a series with the bugfixes and performance
> > >>>>>> improvements. I'll add the siphash approach as an experimental addition and get
> > >>>>>> some more detailed numbers for all the options. But wanted to raise it all here
> > >>>>>> first to get any early feedback.
> > >>>>
> > >>>> FWIW, I share Mark's concerns about using a counter for this. Given that
> > >>>> the feature currently appears to be both slow _and_ broken I'd vote for
> > >>>> either removing it or switching over to per-thread offsets as a first
> > >>>> step.
> > >>>
> > >>> That it has potential weaknesses doesn't mean it should be entirely removed.
> > >>>
> > >>>> We already have a per-task stack canary with
> > >>>> CONFIG_STACKPROTECTOR_PER_TASK so I don't understand the reluctance to
> > >>>> do something similar here.
> > >>>
> > >>> That's not a reasonable comparison: the stack canary cannot change arbitrarily for a task or it would immediately crash on a function return. :)
> > >>>
> > >>>> Speeding up the crypto feels like something that could happen separately.
> > >>>
> > >>> Sure. But let's see what Ryan's patches look like. The suggested changes sound good to me.
> > >>
> > >> Just to say I haven't forgotten about this; I ended up having to switch to
> > >> something more urgent. Hoping to get back to it later this week. I don't think
> > >> this is an urgent issue, so hopefully folks are ok waiting.
> > >>
> > >> I propose to post whatever I end up with then we can all disscuss from there.
> > >> But the rough shape so far:
> > >>
> > >> Fixes:
> > >> - Remove choose_random_kstack_offset()
> > >> - arch passes random into add_random_kstack_offset() (fixes migration bypass)
> > >> - Move add_random_kstack_offset() to el0_svc()/el0_svc_compat() (before
> > >> enabling interrupts) to fix non-preemption requirement (arm64)
> > >
> > > I thought we'd keep choose_random_kstack_offset() and just move
> > > everything into a per-task location? (And for arm64 only)
> >
> > Err... I thought you were the one arguing against per-task state? I'm not really
> > keen on having arm64 do a completely different thing to everyone else; It seems
> > reasonable to me that we only need to (continue to) abstract the random source
> > per-arch and the rest should remain common?
> >
> > Per my previous mails, I'm not really sure what choose_random_kstack_offset() is
> > giving us in practice. Why not simplify?
> >
>
> It seems to me that arm64 inherited a lot of complexity from other
> architectures that rely on a counter. AIUI, the 'non-preemption
> requirement' and the 'migration bypass' issues are both a result of
> the fact that the syscall duration is relied upon to provide more
> pseudo-randomness, which is recorded in a per-CPU counter when the
> syscall completes and carried over to the next invocation. (I don't
> really buy the idea that having the next offset in memory for a
> shorter time buys us anything in terms of robustness tbh)
>
> I have some patches that make get_random_uXX()'s fast path lockless
> that I will send out shortly (assuming no hate mail from the bots in
> the mean time), which should make it suitable for general use as the
> entropy source for kstack randomization on all architectures, modulo
> the tail latency issue, but I'm not sure I understand why that is a
> problem to begin with if it occurs sufficiently rarely. Is that a
> PREEMPT_RT issue? Would it be better if the refill of the per-CPU
> batched entropy buffers was relegated to some kind of kthread so it
> can be scheduled independently? (Those buffers are all the same size
> so we could easily keep a few hot spares)
>
> TL;DR I agree with Ryan but I'd like to fix it across the board, i.e.,
> get rid of the per-CPU variable entirely and all the dodgy uses of
> counters.
Right, the whole design of picking the next random number after the
syscall is to try to reduce the control an attacker would have, given that
an instruction counter is the randomness source in most cases (during the
original design, no one could agree on a fast enough randomness source).
If we can fix that then we can fix this for all architectures and all
randomness users generally.
-Kees
--
Kees Cook
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [DISCUSSION] kstack offset randomization: bugs and performance
2025-11-27 8:00 ` Kees Cook
@ 2025-11-27 11:50 ` Ryan Roberts
2025-11-27 12:19 ` Ard Biesheuvel
0 siblings, 1 reply; 27+ messages in thread
From: Ryan Roberts @ 2025-11-27 11:50 UTC (permalink / raw)
To: Kees Cook, Ard Biesheuvel
Cc: Will Deacon, Arnd Bergmann, Jeremy Linton, Catalin Marinas,
Mark Rutland, linux-arm-kernel@lists.infradead.org,
Linux Kernel Mailing List
On 27/11/2025 08:00, Kees Cook wrote:
> On Wed, Nov 26, 2025 at 11:58:40PM +0100, Ard Biesheuvel wrote:
>> On Tue, 25 Nov 2025 at 12:15, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> On 24/11/2025 20:51, Kees Cook wrote:
>>>> On Mon, Nov 24, 2025 at 05:50:14PM +0000, Ryan Roberts wrote:
>>>>> On 24/11/2025 17:11, Kees Cook wrote:
>>>>>>
>>>>>>
>>>>>> On November 24, 2025 6:36:25 AM PST, Will Deacon <will@kernel.org> wrote:
>>>>>>> On Mon, Nov 17, 2025 at 11:31:22AM +0000, Ryan Roberts wrote:
>>>>>>>> On 17/11/2025 11:30, Ryan Roberts wrote:
>>>>>>>>> Could this give us a middle ground between strong-crng and
>>>>>>>>> weak-timestamp-counter? Perhaps the main issue is that we need to store the
>>>>>>>>> secret key for a long period?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Anyway, I plan to work up a series with the bugfixes and performance
>>>>>>>>> improvements. I'll add the siphash approach as an experimental addition and get
>>>>>>>>> some more detailed numbers for all the options. But wanted to raise it all here
>>>>>>>>> first to get any early feedback.
>>>>>>>
>>>>>>> FWIW, I share Mark's concerns about using a counter for this. Given that
>>>>>>> the feature currently appears to be both slow _and_ broken I'd vote for
>>>>>>> either removing it or switching over to per-thread offsets as a first
>>>>>>> step.
>>>>>>
>>>>>> That it has potential weaknesses doesn't mean it should be entirely removed.
>>>>>>
>>>>>>> We already have a per-task stack canary with
>>>>>>> CONFIG_STACKPROTECTOR_PER_TASK so I don't understand the reluctance to
>>>>>>> do something similar here.
>>>>>>
>>>>>> That's not a reasonable comparison: the stack canary cannot change arbitrarily for a task or it would immediately crash on a function return. :)
>>>>>>
>>>>>>> Speeding up the crypto feels like something that could happen separately.
>>>>>>
>>>>>> Sure. But let's see what Ryan's patches look like. The suggested changes sound good to me.
>>>>>
>>>>> Just to say I haven't forgotten about this; I ended up having to switch to
>>>>> something more urgent. Hoping to get back to it later this week. I don't think
>>>>> this is an urgent issue, so hopefully folks are ok waiting.
>>>>>
>>>>> I propose to post whatever I end up with then we can all disscuss from there.
>>>>> But the rough shape so far:
>>>>>
>>>>> Fixes:
>>>>> - Remove choose_random_kstack_offset()
>>>>> - arch passes random into add_random_kstack_offset() (fixes migration bypass)
>>>>> - Move add_random_kstack_offset() to el0_svc()/el0_svc_compat() (before
>>>>> enabling interrupts) to fix non-preemption requirement (arm64)
>>>>
>>>> I thought we'd keep choose_random_kstack_offset() and just move
>>>> everything into a per-task location? (And for arm64 only)
>>>
>>> Err... I thought you were the one arguing against per-task state? I'm not really
>>> keen on having arm64 do a completely different thing to everyone else; It seems
>>> reasonable to me that we only need to (continue to) abstract the random source
>>> per-arch and the rest should remain common?
>>>
>>> Per my previous mails, I'm not really sure what choose_random_kstack_offset() is
>>> giving us in practice. Why not simplify?
>>>
>>
>> It seems to me that arm64 inherited a lot of complexity from other
>> architectures that rely on a counter. AIUI, the 'non-preemption
>> requirement' and the 'migration bypass' issues are both a result of
>> the fact that the syscall duration is relied upon to provide more
>> pseudo-randomness, which is recorded in a per-CPU counter when the
>> syscall completes and carried over to the next invocation. (I don't
>> really buy the idea that having the next offset in memory for a
>> shorter time buys us anything in terms of robustness tbh)
Agreed.
>>
>> I have some patches that make get_random_uXX()'s fast path lockless
>> that I will send out shortly (assuming no hate mail from the bots in
>> the mean time), which should make it suitable for general use as the
>> entropy source for kstack randomization on all architectures, modulo
>> the tail latency issue, but I'm not sure I understand why that is a
>> problem to begin with if it occurs sufficiently rarely. Is that a
>> PREEMPT_RT issue?
Yes; RT was Jeremy's original motivation for looking at the prng approach.
For the issue I see, improving the mean would be sufficient, but improving the
tail too is a bonus.
>> Would it be better if the refill of the per-CPU
>> batched entropy buffers was relegated to some kind of kthread so it
>> can be scheduled independently? (Those buffers are all the same size
>> so we could easily keep a few hot spares)
That came up in Jeremy's thread last year. My understanding was that this would
not help because either the thread is lower priority, in which case you can't
guarantee it will run, or it is higher priority, in which case the RT thread
still takes the glitch. (But I'm hand-waving - I'm not an expert on the details.)
>>
>> TL;DR I agree with Ryan but I'd like to fix it across the board, i.e.,
>> get rid of the per-CPU variable entirely and all the dodgy uses of
>> counters.
>
> Right, the whole design of picking the next random number after the
> syscall is to try to reduce the control an attacker would have given an
> instruction counter is the randomness source in most cases (during the
> original design no one could agree on a fast enough randomness source).
>
> If we can fix that then we can fix this for all architectures and all
> randomness users generally.
>
> -Kees
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [DISCUSSION] kstack offset randomization: bugs and performance
2025-11-27 11:50 ` Ryan Roberts
@ 2025-11-27 12:19 ` Ard Biesheuvel
2025-11-27 14:09 ` Ryan Roberts
0 siblings, 1 reply; 27+ messages in thread
From: Ard Biesheuvel @ 2025-11-27 12:19 UTC (permalink / raw)
To: Ryan Roberts
Cc: Kees Cook, Will Deacon, Arnd Bergmann, Jeremy Linton,
Catalin Marinas, Mark Rutland,
linux-arm-kernel@lists.infradead.org, Linux Kernel Mailing List
On Thu, 27 Nov 2025 at 12:50, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 27/11/2025 08:00, Kees Cook wrote:
> > On Wed, Nov 26, 2025 at 11:58:40PM +0100, Ard Biesheuvel wrote:
...
> >> the tail latency issue, but I'm not sure I understand why that is a
> >> problem to begin with if it occurs sufficiently rarely. Is that a
> >> PREEMPT_RT issue?
>
> Yes; RT was Jeremy's original motivation for looking at the prng approach.
>
> For the issue I see, improving the mean would be sufficient, but improving the
> tail too is a bonus.
>
> >> Would it be better if the refill of the per-CPU
> >> batched entropy buffers was relegated to some kind of kthread so it
> >> can be scheduled independently? (Those buffers are all the same size
> >> so we could easily keep a few hot spares)
>
> That came up in Jeremy's thread last year. My understanding was that this would
> not help because either the thread is lower priority, in which case you can't
> guarrantee it will run, or it is higher priority, in which case the RT thread
> still takes the glitch. (But I'm hand waving - I'm not expert on the details).
>
PREEMPT_RT is generally more concerned with the worst-case latency being
bounded than with it being as low as possible.
The get_random fallback runs a few rounds of chacha20, which takes
more time than just reading the next value and bumping the position
counter. But that does not imply it fails to meet RT constraints.
And if a thread running ChaCha20 in the background fails to get enough
cycles, that is not an RT problem but an ordinary starvation problem, which
can only be addressed by doing less work in total. But cranking
prandom_u32_state() on every syscall is not free either.
In summary, it would be good to have a better problem statement wrt RT
constraints before assuming that the 99th-percentile tail latency is something
to obsess about, especially given the fact that getpid() is not that
representative a syscall to begin with.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [DISCUSSION] kstack offset randomization: bugs and performance
2025-11-27 12:19 ` Ard Biesheuvel
@ 2025-11-27 14:09 ` Ryan Roberts
2025-11-27 19:17 ` Kees Cook
0 siblings, 1 reply; 27+ messages in thread
From: Ryan Roberts @ 2025-11-27 14:09 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Kees Cook, Will Deacon, Arnd Bergmann, Jeremy Linton,
Catalin Marinas, Mark Rutland,
linux-arm-kernel@lists.infradead.org, Linux Kernel Mailing List
On 27/11/2025 12:19, Ard Biesheuvel wrote:
> On Thu, 27 Nov 2025 at 12:50, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 27/11/2025 08:00, Kees Cook wrote:
>>> On Wed, Nov 26, 2025 at 11:58:40PM +0100, Ard Biesheuvel wrote:
> ...
>>>> the tail latency issue, but I'm not sure I understand why that is a
>>>> problem to begin with if it occurs sufficiently rarely. Is that a
>>>> PREEMPT_RT issue?
>>
>> Yes; RT was Jeremy's original motivation for looking at the prng approach.
>>
>> For the issue I see, improving the mean would be sufficient, but improving the
>> tail too is a bonus.
>>
>>>> Would it be better if the refill of the per-CPU
>>>> batched entropy buffers was relegated to some kind of kthread so it
>>>> can be scheduled independently? (Those buffers are all the same size
>>>> so we could easily keep a few hot spares)
>>
>> That came up in Jeremy's thread last year. My understanding was that this would
>> not help because either the thread is lower priority, in which case you can't
>> guarrantee it will run, or it is higher priority, in which case the RT thread
>> still takes the glitch. (But I'm hand waving - I'm not expert on the details).
>>
>
> PREEMPT_RT is generally more concerned about the worst case latency
> being bounded rather than being as low as possible.
Sure, but if you can reduce the tail, that's still "better", right?
>
> The get_random fallback runs a few rounds of chacha20, which takes
> more time than just reading the next value and bumping the position
> counter. But that does not imply it fails to meet RT constraints.
>
> And if a thread running ChaCha20 in the background fails to get enough
> cycles, it is not an RT problem, it is an ordinary starvation problem,
> which can only be achieved by doing less work in total. But cranking
> prandom_u32_state() on every syscall is not free either.
Indeed, but it's a lot cheaper than get_random. See:
https://lore.kernel.org/all/20251127105958.2427758-1-ryan.roberts@arm.com/
>
> In summary, it would be good to have a better problem statement wrt RT
> constraints before assuming that 99% tail latency is something to
> obsess about, especially given the fact that getpid() is not that
> representative a syscall to begin with.
I think that's a fair point. But I also think the results I link above show very
clearly that one approach is more performant than the other, in terms of the
overhead of syscall entry and exit. And as I said when starting this thread,
that is something we have had complaints about from partners.
Personally, based on that data, I think we could reduce it to this decision tree:
is a prng good enough for kstack offset randomization?
  yes: is 3% syscall entry/exit overhead a reasonable price?
    yes: Land my series
    no: rip out kstack offset randomization
  no: is 10% syscall entry/exit overhead a reasonable price?
    yes: Land Ard's series
    no: rip out kstack offset randomization
For the avoidance of doubt, my opinion is that prng is good enough for 6 bits.
By the way, my sense is that we won't get much below 3% no matter what we do. It
looks to me like it could be bottlenecked on __alloca(), which forces any
speculation using the incorrect stack address to be abandoned. So I don't think
offloading to a thread will end up helping us much. I don't have data that shows
that conclusively, but that's my intuition from some earlier benchmarking.
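For reference, the offset application boils down to roughly the below (a
simplified userspace sketch; the real in-kernel code differs in detail, and
the 10-bit mask plus 16-byte stack alignment giving ~6 effective bits is the
arm64 case as I understand it):

#include <alloca.h>
#include <stdint.h>
#include <stdio.h>

/* 10-bit mask; with 16-byte stack alignment ~6 bits end up effective. */
#define KSTACK_OFFSET_MASK	0x3ffU

static void do_syscall_body(void)
{
	int local;

	printf("frame at %p\n", (void *)&local);
}

static void syscall_entry(uint32_t random)
{
	uint32_t offset = random & KSTACK_OFFSET_MASK;
	/*
	 * Shift the rest of the call chain down by 'offset' bytes. The data
	 * dependency on the random value is what stops speculation of later
	 * stack accesses from running ahead.
	 */
	uint8_t *pad = alloca(offset);

	asm volatile("" :: "r"(pad) : "memory");	/* keep the allocation live */
	do_syscall_body();
}

int main(void)
{
	/* Fixed "random" inputs, just to show the frame moving around. */
	syscall_entry(0x123);
	syscall_entry(0x3c0);
	return 0;
}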
Thanks,
Ryan
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [DISCUSSION] kstack offset randomization: bugs and performance
2025-11-27 14:09 ` Ryan Roberts
@ 2025-11-27 19:17 ` Kees Cook
0 siblings, 0 replies; 27+ messages in thread
From: Kees Cook @ 2025-11-27 19:17 UTC (permalink / raw)
To: Ryan Roberts
Cc: Ard Biesheuvel, Will Deacon, Arnd Bergmann, Jeremy Linton,
Catalin Marinas, Mark Rutland,
linux-arm-kernel@lists.infradead.org, Linux Kernel Mailing List
On Thu, Nov 27, 2025 at 02:09:04PM +0000, Ryan Roberts wrote:
> no: rip out kstack offset randomization
Uh, no. There are system builders that have no problem with the current
performance. The _feature_ must stay because it has users. Whether it is
default-enabled, etc, sure, I have no problem with making changes there.
Whatever we can do to make it better, I obviously welcome, but we can't
remove it.
--
Kees Cook
^ permalink raw reply [flat|nested] 27+ messages in thread
Thread overview: 27+ messages
[not found] <66c4e2a0-c7fb-46c2-acce-8a040a71cd8e@arm.com>
2025-11-17 11:31 ` [DISCUSSION] kstack offset randomization: bugs and performance Ryan Roberts
2025-11-17 16:47 ` Arnd Bergmann
2025-11-17 17:23 ` Ryan Roberts
2025-11-17 17:46 ` Mark Rutland
2025-11-17 23:04 ` Arnd Bergmann
2025-11-18 17:15 ` Jason A. Donenfeld
2025-11-18 17:21 ` Ryan Roberts
2025-11-18 17:28 ` Jason A. Donenfeld
2025-11-17 20:27 ` Kees Cook
2025-11-18 10:28 ` Ryan Roberts
2025-11-18 11:25 ` Mark Rutland
2025-11-18 12:16 ` Ryan Roberts
2025-11-18 11:05 ` Mark Rutland
2025-11-17 20:56 ` Jeremy Linton
2025-11-18 11:05 ` Ryan Roberts
2025-11-24 14:36 ` Will Deacon
2025-11-24 17:11 ` Kees Cook
2025-11-24 17:50 ` Ryan Roberts
2025-11-24 20:51 ` Kees Cook
2025-11-25 11:14 ` Ryan Roberts
2025-11-26 22:58 ` Ard Biesheuvel
2025-11-27 8:00 ` Kees Cook
2025-11-27 11:50 ` Ryan Roberts
2025-11-27 12:19 ` Ard Biesheuvel
2025-11-27 14:09 ` Ryan Roberts
2025-11-27 19:17 ` Kees Cook
2025-11-24 19:08 ` Will Deacon