Re: [DISCUSSION] kstack offset randomization: bugs and performance

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Kees Cook <kees@kernel.org>
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>, Ard Biesheuvel <ardb@kernel.org>,
	Jeremy Linton <jeremy.linton@arm.com>,
	Will Deacon <will@kernel.org>,
	Catalin Marinas <Catalin.Marinas@arm.com>,
	Mark Rutland <mark.rutland@arm.com>,
	"linux-arm-kernel@lists.infradead.org"
	<linux-arm-kernel@lists.infradead.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [DISCUSSION] kstack offset randomization: bugs and performance
Date: Mon, 17 Nov 2025 12:27:36 -0800	[thread overview]
Message-ID: <202511171221.517FC4F@keescook> (raw)
In-Reply-To: <dd8c37bc-795f-4c7a-9086-69e584d8ab24@arm.com>

On Mon, Nov 17, 2025 at 11:31:22AM +0000, Ryan Roberts wrote:
> Sorry; forgot to add Mark and the lists!
> 
> 
> On 17/11/2025 11:30, Ryan Roberts wrote:
> > Hi All,
> > 
> > Over the last few years we had a few complaints that syscall performance on
> > arm64 is slower than x86. Most recently, it was observed that a certain Java
> > benchmark that does a lot of fstat and lseek is spending ~10% of it's time in
> > get_random_u16(). Cue a bit of digging, which led me to [1] and also to some new
> > ideas about how performance could be improved.
> > 
> > But I'll get to the performance angle in a bit. First, I want to discuss some
> > bugs that I believe I have uncovered during code review...
> > 
> > 
> > Bug 1: We have the following pattern:
> > 
> > add_random_kstack_offset()
> >   enable_interrupts_and_preemption()
> >     do_syscall()
> >   disable_interrupts_and_preemption()
> > choose_random_kstack_offset(random)
> > 
> > Where add_random_kstack_offset() adds an offset to the stack that was chosen by
> > a previous call to choose_random_kstack_offset() and stored in a per-cpu
> > variable. But since preemption is enabled during the syscall, surely an attacker
> > could defeat this by arranging for the thread to be preempted or migrated while
> > executing the syscall? That way the new offset is calculated for a different CPU
> > and a subsequent syscall on the original CPU will use the original offset?

Oh, interesting -- I hadn't considered that the entire thread would be
moved between CPUs while it is *IN* the syscall. Yeah, that would
effectively cause the offset to never change for the moved-from CPU. Ew.

> > I think we could just pass the random seed to add_random_kstack_offset() so that
> > we consume the old and buffer the new atomically? We would still buffer it
> > across syscalls to avoid the guessability issue that's documented. Then
> > choose_random_kstack_offset() could be removed. Or we could store
> > per-task_struct given it is only 32-bits?

I had wanted to avoid both growing task_struct and to avoid tying the
randomness to a given task -- then unpredictability could be higher
(barring the bug above), and could be only reduced to per-thread by
pinning a thread exclusively to a single CPU.

> > Bug 2: add_random_kstack_offset() and choose_random_kstack_offset() both
> > document their requirement to be called with interrupts and preemption disabled.
> > They use raw_cpu_*(), which require this. But on arm64, they are called from
> > invoke_syscall(), where interrupts and preemption are _enabled_. In practice, I
> > don't think this will cause functional harm for arm64's implementations of
> > raw_cpu_*(), but means that it's possible that the wrong per-cpu structure is
> > being referred to. Perhaps there is a way for user code to exploit this to
> > defeat the purpose of the feature.
> > 
> > This should be straightforward to fix; if we take the task_struct approach for
> > bug 1, then that would also fix this issue too because the requirement to be in
> > atomic context goes away. Otherwsise it can be moved earlier in the callchain,
> > before interrupts are enabled.

I can't speak to these internals, just that I'd hope to avoid forcing
the randomness down to the thread-level.

> > 
> > 
> > Then we get to the performance aspect...
> > 
> > arm64 uses get_random_u16() to get 16 bits from a per-cpu entropy buffer that
> > originally came from the crng. get_random_u16() does
> > local_lock_irqsave()/local_unlock_irqrestore() inside every call (both the
> > fastpath and the slow path). It turns out that this locking/unlocking accounts
> > for 30%-50% of the total cost of kstack offset randomization. By introducing a
> > new raw_try_get_random_uX() helper that's called from a context where irqs are
> > disabled, I can eliminate that cost. (I also plan to dig into exactly why it's
> > costing so much).
> > 
> > Furthermore, given we are actually only using 6 bits of entropy per syscall, we
> > could instead just request a u8 instead of a u16 and only throw away 2 bits
> > instead of 10 bits. This means we drain the entropy buffer half as quickly and
> > make half as many slow calls into the crng:
> > 
> > +-----------+---------------+--------------+-----------------+-----------------+
> > | Benchmark | randomize=off | randomize=on | + no local_lock | + get_random_u8 |
> > |           |    (baseline) |              |                 |                 |
> > +===========+===============+==============+=================+=================+
> > | getpid    |          0.19 |  (R) -11.43% |      (R) -8.41% |      (R) -5.97% |
> > +-----------+---------------+--------------+-----------------+-----------------+
> > | getppid   |          0.19 |  (R) -13.81% |      (R) -7.83% |      (R) -6.14% |
> > +-----------+---------------+--------------+-----------------+-----------------+
> > | invalid   |          0.18 |  (R) -12.22% |      (R) -5.55% |      (R) -3.70% |
> > +-----------+---------------+--------------+-----------------+-----------------+
> > 
> > I expect we could even choose to re-buffer and save those 2 bits so we call the
> > slow path even less often.
> > 
> > I believe this helps the mean latency significantly without sacrificing any
> > strength. But it doesn't reduce the tail latency because we still have to call
> > into the crng eventually.
> > 
> > So here's another idea: Could we use siphash to generate some random bits? We
> > would generate the secret key at boot using the crng. Then generate a 64 bit
> > siphash of (cntvct_el0 ^ tweak) (where tweak increments every time we generate a
> > new hash). As long as the key remains secret, the hash is unpredictable.
> > (perhaps we don't even need the timer value). For every hash we get 64 bits, so
> > that would last for 10 syscalls at 6 bits per call. So we would still have to
> > call siphash every 10 syscalls, so there would still be a tail, but from my
> > experiements, it's much less than the crng:
> > 
> > +-----------+---------------+--------------+-----------------+-----------------+
> > | Benchmark | randomize=off | randomize=on |         siphash |   Jeremy's prng |
> > |           |    (baseline) |              |                 |                 |
> > +===========+===============+==============+=================+=================+
> > | getpid    |          0.19 |  (R) -11.43% |      (R) -5.74% |      (R) -2.06% |
> > +-----------+---------------+--------------+-----------------+-----------------+
> > | getppid   |          0.19 |  (R) -13.81% |      (R) -3.39% |      (R) -2.59% |
> > +-----------+---------------+--------------+-----------------+-----------------+
> > | invalid   |          0.18 |  (R) -12.22% |      (R) -2.43% |          -1.31% |
> > +-----------+---------------+--------------+-----------------+-----------------+
> > 
> > Could this give us a middle ground between strong-crng and
> > weak-timestamp-counter? Perhaps the main issue is that we need to store the
> > secret key for a long period?

All these ideas seem fine to me. I agree about only needed 6 bits, so a
u8 is good. Since we already use siphash extensively, that also seems
fine, though it'd be nice if we could find a solution that avoided
intermittent reseeding.

> > Anyway, I plan to work up a series with the bugfixes and performance
> > improvements. I'll add the siphash approach as an experimental addition and get
> > some more detailed numbers for all the options. But wanted to raise it all here
> > first to get any early feedback.

Thanks for tackling this!

-Kees

-- 
Kees Cook

next prev parent reply	other threads:[~2025-11-17 20:27 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <66c4e2a0-c7fb-46c2-acce-8a040a71cd8e@arm.com>
2025-11-17 11:31 ` [DISCUSSION] kstack offset randomization: bugs and performance Ryan Roberts
2025-11-17 16:47   ` Arnd Bergmann
2025-11-17 17:23     ` Ryan Roberts
2025-11-17 17:46     ` Mark Rutland
2025-11-17 23:04       ` Arnd Bergmann
2025-11-18 17:15     ` Jason A. Donenfeld
2025-11-18 17:21       ` Ryan Roberts
2025-11-18 17:28         ` Jason A. Donenfeld
2025-11-17 20:27   ` Kees Cook [this message]
2025-11-18 10:28     ` Ryan Roberts
2025-11-18 11:25       ` Mark Rutland
2025-11-18 12:16         ` Ryan Roberts
2025-11-18 11:05     ` Mark Rutland
2025-11-17 20:56   ` Jeremy Linton
2025-11-18 11:05     ` Ryan Roberts
2025-11-24 14:36   ` Will Deacon
2025-11-24 17:11     ` Kees Cook
2025-11-24 17:50       ` Ryan Roberts
2025-11-24 20:51         ` Kees Cook
2025-11-25 11:14           ` Ryan Roberts
2025-11-26 22:58             ` Ard Biesheuvel
2025-11-27  8:00               ` Kees Cook
2025-11-27 11:50                 ` Ryan Roberts
2025-11-27 12:19                   ` Ard Biesheuvel
2025-11-27 14:09                     ` Ryan Roberts
2025-11-27 19:17                       ` Kees Cook
2025-11-24 19:08       ` Will Deacon

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=202511171221.517FC4F@keescook \
    --to=kees@kernel.org \
    --cc=Catalin.Marinas@arm.com \
    --cc=ardb@kernel.org \
    --cc=arnd@arndb.de \
    --cc=jeremy.linton@arm.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mark.rutland@arm.com \
    --cc=ryan.roberts@arm.com \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.