public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Re: [RFC] Circumventing FineIBT Via Entrypoints
       [not found] <Z60NwR4w/28Z7XUa@ubun>
@ 2025-02-12 22:29 ` Jann Horn
  2025-02-13  1:31   ` Andrew Cooper
  2025-02-13  6:15   ` Jennifer Miller
  0 siblings, 2 replies; 40+ messages in thread
From: Jann Horn @ 2025-02-12 22:29 UTC (permalink / raw)
  To: Jennifer Miller, Andy Lutomirski
  Cc: linux-hardening, kees, joao, samitolvanen, kernel list

+Andy Lutomirski (X86 entry code maintainer)

On Wed, Feb 12, 2025 at 10:08 PM Jennifer Miller <jmill@asu.edu> wrote:
> As part of a recently accepted paper we demonstrated that syscall
> entrypoints can be misused on x86-64 systems to generically bypass
> FineIBT/KERNEL_IBT from forwards-edge control flow hijacking. We
> communicated this finding to s@k.o before submitting the paper and were
> encouraged to bring the issue to hardening after the paper was accepted to
> have a discussion on how to address the issue.
>
> The bypass takes advantage of the architectural requirement of entrypoints
> to begin with the endbr64 instruction and the ability to control GS_BASE
> from userspace via wrgsbase, from to the FSGSBASE extension, in order to
> perform a stack pivot to a ROP-chain.

Oh, fun, that's a gnarly quirk.

> Here is a snippet of the 64-bit entrypoint code:
> ```
> entry_SYSCALL_64:
> <+0>:     endbr64
> <+4>:     swapgs
> <+7>:     mov    QWORD PTR gs:0x6014,rsp
> <+16>:    jmp    <entry_SYSCALL_64+36>
> <+18>:    mov    rsp,cr3
> <+21>:    nop
> <+26>:    and    rsp,0xffffffffffffe7ff
> <+33>:    mov    cr3,rsp
> <+36>:    mov    rsp,QWORD PTR gs:0x32c98
> ```
>
> This is a valid target from any indirect callsite under FineIBT due to the
> endbr64 instruction and the lack of a software CFI check. After hijacking
> control flow to the entrypoint, executing swapgs will swap to the user
> controlled GS_BASE, which will be used to set the stack pointer, leading to
> a stack pivot. The rest of the entrypoint will execute with a hijacked
> GS_BASE on a user controlled stack. The stack page we use is one mapped in
> the user address space and from another thread we race overwriting returns
> addresses on the stack to pivot a second time to a ROP-chain. For this to
> succeed we required a large area of user-controlled kernel memory that can
> serve as the forged GS_BASE address, we did this by spraying 2MB
> Transparent Huge Pages to fill the kernel physical memory map with
> controlled 2MB allocations and guessing relative to the base address of the
> area to hit a page we control.
>
> We evaluated an approach to patching the issue in the paper but it touched
> the userspace API a bit, added an error code returned by syscalls if they
> are invoked with a kernel address in GS_BASE, which is not a great
> solution.
>
> Linus provided some thoughts on how to potentially address this issue
> in our communication with s@k.o, suggesting the kernel could make the
> KERNEL_GS_BASE match the GS_BASE value so both registers always contain a
> valid kernel address and a confusion induced by executing swapgs an extra
> time cannot occur, and restore the value of KERNEL_GS_BASE ahead of
> executing swapgs in the exit path.
>
> I started working on a patch based on the approach suggested by Linus but I
> haven't been able to get it passing the relevant x86 selftests yet. It
> turned out that it's more than the entrypoint code that needs to be
> modified for it to work, we need to correctly save and restore the user's
> GS_BASE across task switches and ensure it is updated correctly when set
> via arch_prctl and ptrace. Unfortunately, I lack familiarity with those
> parts of the kernel, and my understanding is that the paper will be made
> public in a couple weeks so I didn't want to delay too long on bringing the
> issue to this list.
>
> Assuming this is an issue you all feel is worth addressing, I will continue
> working on providing a patch. I'm concerned though that the overhead from
> adding a wrmsr on both syscall entry and exit to overwrite and restore the
> KERNEL_GS_BASE MSR may be quite high, so any feedback in regards to the
> approach or suggestions of alternate approaches to patching are welcome :)

Since the kernel, as far as I understand, uses FineIBT without
backwards control flow protection (in other words, I think we assume
that the kernel stack is trusted?), could we build a cheaper
check on that basis somehow? For example, maybe we could do something like:

```
endbr64
test rsp, rsp
js slowpath
swapgs
```

So we'd have the fast normal case where RSP points to userspace
(meaning we can't be coming from the kernel unless our stack has
already been pivoted, in which case forward edge protection alone
can't help anymore), and the slow case where RSP points to kernel
memory - in that case we'd then have to do some slower checks to
figure out whether weird userspace is making a syscall with RSP
pointing to the kernel, or whether we're coming from hijacked kernel
control flow.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-12 22:29 ` [RFC] Circumventing FineIBT Via Entrypoints Jann Horn
@ 2025-02-13  1:31   ` Andrew Cooper
  2025-02-13  2:09     ` Jann Horn
                       ` (2 more replies)
  2025-02-13  6:15   ` Jennifer Miller
  1 sibling, 3 replies; 40+ messages in thread
From: Andrew Cooper @ 2025-02-13  1:31 UTC (permalink / raw)
  To: jannh
  Cc: jmill, joao, kees, linux-hardening, linux-kernel, luto,
	samitolvanen, Peter Zijlstra (Intel)

>> Assuming this is an issue you all feel is worth addressing, I will
>> continue working on providing a patch. I'm concerned though that the
>> overhead from adding a wrmsr on both syscall entry and exit to
>> overwrite and restore the KERNEL_GS_BASE MSR may be quite high, so
>> any feedback in regards to the approach or suggestions of alternate
>> approaches to patching are welcome :) 
>
> Since the kernel, as far as I understand, uses FineIBT without
> backwards control flow protection (in other words, I think we assume
> that the kernel stack is trusted?),

This is fun indeed.  Linux cannot use supervisor shadow stacks because
the mess around NMI re-entrancy (and IST more generally) requires ROP
gadgets in order to function safely.  Implementing this with shadow
stacks active, while not impossible, is deemed to be prohibitively
complicated.

Linux's supervisor shadow stack support is waiting for FRED support,
which fixes both the NMI re-entrancy problem, and other exceptions
nesting within NMIs, as well as prohibiting the use of the SWAPGS
instruction as FRED tries to make sure that the correct GS is always in
context.

But, FRED support is slated for PantherLake/DiamondRapids which haven't
shipped yet, so are no use to the problem right now.

> could we build a cheaper
> check on that basis somehow? For example, maybe we could do something like:
>
> ```
> endbr64
> test rsp, rsp
> js slowpath
> swapgs
> ```

I presume it's been pointed out already, but there are 3 related
entrypoints here.  SYSCALL64 which is discussed, SYSCALL32 and SYSENTER
which are related.

But, any other IDT entry is in a similar bucket.  If we're corrupting a
function pointer or return address to redirect here, then the check of
CS(%rsp) to control the conditional SWAPGS is an OoB read in the callers
stack frame.

For IDT entries, checking %rsp is reasonable, because userspace can't
forge a kernel-like %rsp.  However, SYSCALL64 specifically leaves %rsp
entirely attacker controlled (and even potentially non-canonical), so
I'm wondering what you hand in mind for the slowpath to truly
distinguish kernel context from user context?

~Andrew

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-13  1:31   ` Andrew Cooper
@ 2025-02-13  2:09     ` Jann Horn
  2025-02-13  2:42       ` Andrew Cooper
  2025-02-13 20:28     ` Kees Cook
  2025-02-14  9:54     ` Peter Zijlstra
  2 siblings, 1 reply; 40+ messages in thread
From: Jann Horn @ 2025-02-13  2:09 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: jmill, joao, kees, linux-hardening, linux-kernel, luto,
	samitolvanen, Peter Zijlstra (Intel)

On Thu, Feb 13, 2025 at 2:31 AM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> >> Assuming this is an issue you all feel is worth addressing, I will
> >> continue working on providing a patch. I'm concerned though that the
> >> overhead from adding a wrmsr on both syscall entry and exit to
> >> overwrite and restore the KERNEL_GS_BASE MSR may be quite high, so
> >> any feedback in regards to the approach or suggestions of alternate
> >> approaches to patching are welcome :)
> >
> > Since the kernel, as far as I understand, uses FineIBT without
> > backwards control flow protection (in other words, I think we assume
> > that the kernel stack is trusted?),
>
> This is fun indeed.  Linux cannot use supervisor shadow stacks because
> the mess around NMI re-entrancy (and IST more generally) requires ROP
> gadgets in order to function safely.  Implementing this with shadow
> stacks active, while not impossible, is deemed to be prohibitively
> complicated.
>
> Linux's supervisor shadow stack support is waiting for FRED support,
> which fixes both the NMI re-entrancy problem, and other exceptions
> nesting within NMIs, as well as prohibiting the use of the SWAPGS
> instruction as FRED tries to make sure that the correct GS is always in
> context.
>
> But, FRED support is slated for PantherLake/DiamondRapids which haven't
> shipped yet, so are no use to the problem right now.
>
> > could we build a cheaper
> > check on that basis somehow? For example, maybe we could do something like:
> >
> > ```
> > endbr64
> > test rsp, rsp
> > js slowpath
> > swapgs
> > ```
>
> I presume it's been pointed out already, but there are 3 related
> entrypoints here.  SYSCALL64 which is discussed, SYSCALL32 and SYSENTER
> which are related.
>
> But, any other IDT entry is in a similar bucket.  If we're corrupting a
> function pointer or return address to redirect here, then the check of
> CS(%rsp) to control the conditional SWAPGS is an OoB read in the callers
> stack frame.
>
> For IDT entries, checking %rsp is reasonable, because userspace can't
> forge a kernel-like %rsp.  However, SYSCALL64 specifically leaves %rsp
> entirely attacker controlled (and even potentially non-canonical), so
> I'm wondering what you hand in mind for the slowpath to truly
> distinguish kernel context from user context?

Hm, yeah, that seems hard - maybe the best we could do is to make sure
that the inactive gsbase has the correct value for our CPU's kernel
gsbase? Kinda like a paranoid_entry, except more painful because we'd
first have to figure out a place to spill registers to before we can
start using stuff like rdmsr... Then a function pointer overwrite
might still turn into returning to userspace with a sysret with GPRs
full of kernel pointers, but at least we wouldn't run off of a bogus
gsbase anymore?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-13  2:09     ` Jann Horn
@ 2025-02-13  2:42       ` Andrew Cooper
  2025-02-22 20:43         ` Rudolf Marek
  2025-02-28 12:13         ` Florian Weimer
  0 siblings, 2 replies; 40+ messages in thread
From: Andrew Cooper @ 2025-02-13  2:42 UTC (permalink / raw)
  To: Jann Horn
  Cc: jmill, joao, kees, linux-hardening, linux-kernel, luto,
	samitolvanen, Peter Zijlstra (Intel)

On 13/02/2025 2:09 am, Jann Horn wrote:
> On Thu, Feb 13, 2025 at 2:31 AM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>>> Assuming this is an issue you all feel is worth addressing, I will
>>>> continue working on providing a patch. I'm concerned though that the
>>>> overhead from adding a wrmsr on both syscall entry and exit to
>>>> overwrite and restore the KERNEL_GS_BASE MSR may be quite high, so
>>>> any feedback in regards to the approach or suggestions of alternate
>>>> approaches to patching are welcome :)
>>> Since the kernel, as far as I understand, uses FineIBT without
>>> backwards control flow protection (in other words, I think we assume
>>> that the kernel stack is trusted?),
>> This is fun indeed.  Linux cannot use supervisor shadow stacks because
>> the mess around NMI re-entrancy (and IST more generally) requires ROP
>> gadgets in order to function safely.  Implementing this with shadow
>> stacks active, while not impossible, is deemed to be prohibitively
>> complicated.
>>
>> Linux's supervisor shadow stack support is waiting for FRED support,
>> which fixes both the NMI re-entrancy problem, and other exceptions
>> nesting within NMIs, as well as prohibiting the use of the SWAPGS
>> instruction as FRED tries to make sure that the correct GS is always in
>> context.
>>
>> But, FRED support is slated for PantherLake/DiamondRapids which haven't
>> shipped yet, so are no use to the problem right now.
>>
>>> could we build a cheaper
>>> check on that basis somehow? For example, maybe we could do something like:
>>>
>>> ```
>>> endbr64
>>> test rsp, rsp
>>> js slowpath
>>> swapgs
>>> ```
>> I presume it's been pointed out already, but there are 3 related
>> entrypoints here.  SYSCALL64 which is discussed, SYSCALL32 and SYSENTER
>> which are related.
>>
>> But, any other IDT entry is in a similar bucket.  If we're corrupting a
>> function pointer or return address to redirect here, then the check of
>> CS(%rsp) to control the conditional SWAPGS is an OoB read in the callers
>> stack frame.
>>
>> For IDT entries, checking %rsp is reasonable, because userspace can't
>> forge a kernel-like %rsp.  However, SYSCALL64 specifically leaves %rsp
>> entirely attacker controlled (and even potentially non-canonical), so
>> I'm wondering what you hand in mind for the slowpath to truly
>> distinguish kernel context from user context?
> Hm, yeah, that seems hard - maybe the best we could do is to make sure
> that the inactive gsbase has the correct value for our CPU's kernel
> gsbase? Kinda like a paranoid_entry, except more painful because we'd
> first have to figure out a place to spill registers to before we can
> start using stuff like rdmsr... Then a function pointer overwrite
> might still turn into returning to userspace with a sysret with GPRs
> full of kernel pointers, but at least we wouldn't run off of a bogus
> gsbase anymore?

Thinking about this some more, I think it's impossible to distinguish.

One of the many sharp edges of SYSCALL (and SYSENTER for that matter) is
that they're instructions expected to be only be used by userspace, but
that be executed in supervisor too[1].  They're asymmetric with their
SYSRET (and SYSEXIT) counterparts which are CPL0 instructions that
strictly transition into CPL3.

The SYSCALL behaviour TLDR is:

    %rcx = %rip
    %r11 = %eflags
    %cs = fixed attr
    %ss = fixed attr
    %rip = MSR_LSTAR

which means that %rcx (old rip) is the only piece of state which
userspace can't feasibly forge (and therefore could distinguish a
SYSCALL from user vs kernel mode), yet if we're talking about a JOP
chain to get here, then %rcx is under attacker control too.

There are a variety of solutions to this problem that involve not using
%gs for per-cpu data.  I also expect that to be wholly unpopular and
dismissed as an approach.

~Andrew

[1] No-one back then was brave enough to design CPL3-only instructions.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-12 22:29 ` [RFC] Circumventing FineIBT Via Entrypoints Jann Horn
  2025-02-13  1:31   ` Andrew Cooper
@ 2025-02-13  6:15   ` Jennifer Miller
  2025-02-13 19:23     ` Jann Horn
  1 sibling, 1 reply; 40+ messages in thread
From: Jennifer Miller @ 2025-02-13  6:15 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andy Lutomirski, linux-hardening, kees, joao, samitolvanen,
	kernel list, Andrew Cooper

On Wed, Feb 12, 2025 at 11:29:02PM +0100, Jann Horn wrote:
> +Andy Lutomirski (X86 entry code maintainer)
> 
> On Wed, Feb 12, 2025 at 10:08 PM Jennifer Miller <jmill@asu.edu> wrote:
> > As part of a recently accepted paper we demonstrated that syscall
> > entrypoints can be misused on x86-64 systems to generically bypass
> > FineIBT/KERNEL_IBT from forwards-edge control flow hijacking. We
> > communicated this finding to s@k.o before submitting the paper and were
> > encouraged to bring the issue to hardening after the paper was accepted to
> > have a discussion on how to address the issue.
> >
> > The bypass takes advantage of the architectural requirement of entrypoints
> > to begin with the endbr64 instruction and the ability to control GS_BASE
> > from userspace via wrgsbase, from to the FSGSBASE extension, in order to
> > perform a stack pivot to a ROP-chain.
> 
> Oh, fun, that's a gnarly quirk.

yeag :)

> Since the kernel, as far as I understand, uses FineIBT without
> backwards control flow protection (in other words, I think we assume
> that the kernel stack is trusted?), could we build a cheaper
> check on that basis somehow? For example, maybe we could do something like:
> 
> ```
> endbr64
> test rsp, rsp
> js slowpath
> swapgs
> ```
> 
> So we'd have the fast normal case where RSP points to userspace
> (meaning we can't be coming from the kernel unless our stack has
> already been pivoted, in which case forward edge protection alone
> can't help anymore), and the slow case where RSP points to kernel
> memory - in that case we'd then have to do some slower checks to
> figure out whether weird userspace is making a syscall with RSP
> pointing to the kernel, or whether we're coming from hijacked kernel
> control flow.

I've been tinkering this idea a bit and came with something.

In short, we could have the slowpath branch as you suggested, in the 
slowpath permit the stack switch and preserving of the registers on the
stack, but then do a sanity check according to the __per_cpu_offset array
and decide from there whether we should continue executing the entrypoint
or die/attempt to recover.

Here is some napkin asm for this I wrote for the 64-bit syscall entrypoint, 
I think more or less the same could be done for the other entrypoints.

```
    endbr64
    test rsp, rsp
    js slowpath

    swapgs
    ~~fastpath continues~~

; path taken when rsp was a kernel address
; we have no choice really but to switch to the stack from the untrusted
; gsbase but after doing so we have to be careful about what we put on the
; stack
slowpath:
    swapgs

; swap stacks as normal
    mov    QWORD PTR gs:[rip+0x7f005f85],rsp       # 0x6014 <cpu_tss_rw+20>
    mov    rsp,QWORD PTR gs:[rip+0x7f02c56d]       # 0x2c618 <pcpu_hot+24>

    ~~normal push and clear GPRs sequence here~~

; we entered with an rsp in the kernel address range.
; we already did swapgs but we don't know if we can trust our gsbase yet.
; we should be able to trust the ro_after_init __per_cpu_offset array
; though.

; check that gsbase is the expected value for our current cpu
    rdtscp
    mov rax, QWORD PTR [8*ecx-0x7d7be460] <__per_cpu_offset>

    rdgsbase rbx

    cmp rbx, rax
    je fastpath_after_regs_preserved

    wrgsbase rax

; if we reach here we are being exploited and should explode or attempt
; to recover
```

The unfortunate part is that it would still result in the register state
being dumped on top of some attacker controlled address, so if the error
path is recoverable someone could still use entrypoints to convert control
flow hijacking into memory corruption via register dump. So it would kill
the ability to get ROP but it would still be possible to dump regs over
modprobe_path, core_pattern, etc.

Does this seem feasible and any better than the alternative of overwriting
and restoring KERNEL_GS_BASE?

~Jennifer

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-13  6:15   ` Jennifer Miller
@ 2025-02-13 19:23     ` Jann Horn
  2025-02-13 21:24       ` Andrew Cooper
  2025-02-14 22:25       ` Josh Poimboeuf
  0 siblings, 2 replies; 40+ messages in thread
From: Jann Horn @ 2025-02-13 19:23 UTC (permalink / raw)
  To: Jennifer Miller
  Cc: Andy Lutomirski, linux-hardening, kees, joao, samitolvanen,
	kernel list, Andrew Cooper

On Thu, Feb 13, 2025 at 7:15 AM Jennifer Miller <jmill@asu.edu> wrote:
> On Wed, Feb 12, 2025 at 11:29:02PM +0100, Jann Horn wrote:
> > +Andy Lutomirski (X86 entry code maintainer)
> >
> > On Wed, Feb 12, 2025 at 10:08 PM Jennifer Miller <jmill@asu.edu> wrote:
> > > As part of a recently accepted paper we demonstrated that syscall
> > > entrypoints can be misused on x86-64 systems to generically bypass
> > > FineIBT/KERNEL_IBT from forwards-edge control flow hijacking. We
> > > communicated this finding to s@k.o before submitting the paper and were
> > > encouraged to bring the issue to hardening after the paper was accepted to
> > > have a discussion on how to address the issue.
> > >
> > > The bypass takes advantage of the architectural requirement of entrypoints
> > > to begin with the endbr64 instruction and the ability to control GS_BASE
> > > from userspace via wrgsbase, from to the FSGSBASE extension, in order to
> > > perform a stack pivot to a ROP-chain.
> >
> > Oh, fun, that's a gnarly quirk.
>
> yeag :)
>
> > Since the kernel, as far as I understand, uses FineIBT without
> > backwards control flow protection (in other words, I think we assume
> > that the kernel stack is trusted?), could we build a cheaper
> > check on that basis somehow? For example, maybe we could do something like:
> >
> > ```
> > endbr64
> > test rsp, rsp
> > js slowpath
> > swapgs
> > ```
> >
> > So we'd have the fast normal case where RSP points to userspace
> > (meaning we can't be coming from the kernel unless our stack has
> > already been pivoted, in which case forward edge protection alone
> > can't help anymore), and the slow case where RSP points to kernel
> > memory - in that case we'd then have to do some slower checks to
> > figure out whether weird userspace is making a syscall with RSP
> > pointing to the kernel, or whether we're coming from hijacked kernel
> > control flow.
>
> I've been tinkering this idea a bit and came with something.
>
> In short, we could have the slowpath branch as you suggested, in the
> slowpath permit the stack switch and preserving of the registers on the
> stack, but then do a sanity check according to the __per_cpu_offset array
> and decide from there whether we should continue executing the entrypoint
> or die/attempt to recover.

One ugly option to avoid the register spilling might be to say
"userspace is not allowed to execute a SYSCALL instruction while RSP
is a kernel address, and if userspace does it anyway, the kernel can
kill the process". Then the slowpath could immediately start using the
GPRs without having to worry about where to save their old values, and
it could read the correct gsbase with the GET_PERCPU_BASE macro. It
would be an ABI change, but one that is probably fairly unlikely to
actually break stuff? But it would require a bit of extra kernel code
on the slowpath, which is kinda annoying...

> Here is some napkin asm for this I wrote for the 64-bit syscall entrypoint,
> I think more or less the same could be done for the other entrypoints.
>
> ```
>     endbr64
>     test rsp, rsp
>     js slowpath
>
>     swapgs
>     ~~fastpath continues~~
>
> ; path taken when rsp was a kernel address
> ; we have no choice really but to switch to the stack from the untrusted
> ; gsbase but after doing so we have to be careful about what we put on the
> ; stack
> slowpath:
>     swapgs
>
> ; swap stacks as normal
>     mov    QWORD PTR gs:[rip+0x7f005f85],rsp       # 0x6014 <cpu_tss_rw+20>
>     mov    rsp,QWORD PTR gs:[rip+0x7f02c56d]       # 0x2c618 <pcpu_hot+24>
>
>     ~~normal push and clear GPRs sequence here~~
>
> ; we entered with an rsp in the kernel address range.
> ; we already did swapgs but we don't know if we can trust our gsbase yet.
> ; we should be able to trust the ro_after_init __per_cpu_offset array
> ; though.
>
> ; check that gsbase is the expected value for our current cpu
>     rdtscp
>     mov rax, QWORD PTR [8*ecx-0x7d7be460] <__per_cpu_offset>
>
>     rdgsbase rbx
>
>     cmp rbx, rax
>     je fastpath_after_regs_preserved
>
>     wrgsbase rax
>
> ; if we reach here we are being exploited and should explode or attempt
> ; to recover
> ```
>
> The unfortunate part is that it would still result in the register state
> being dumped on top of some attacker controlled address, so if the error
> path is recoverable someone could still use entrypoints to convert control
> flow hijacking into memory corruption via register dump. So it would kill
> the ability to get ROP but it would still be possible to dump regs over
> modprobe_path, core_pattern, etc.

It is annoying that we (as far as I know) don't have a nice clear
security model for what exactly CFI in the kernel is supposed to
achieve - though I guess that's partly because in its current version,
it only happens to protect against cases where an attacker gets a
function pointer overwrite, but not the probably more common cases
where the attacker (also?) gets an object pointer overwrite...

> Does this seem feasible and any better than the alternative of overwriting
> and restoring KERNEL_GS_BASE?

The syscall entry point is a hot path; my main reason for suggesting
the RSP check is that I'm worried about the performance impact of the
gsbase-overwriting approach, but I don't actually have numbers on
that. I figure a test + conditional jump is about the cheapest we can
do... Do we know how many cycles wrgsbase takes, and how serializing
is it? Sadly Agner Fog's tables don't seem to list it...

How would we actually do that overwriting and restoring of
KERNEL_GS_BASE? Would we need a scratch register for that?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-13  1:31   ` Andrew Cooper
  2025-02-13  2:09     ` Jann Horn
@ 2025-02-13 20:28     ` Kees Cook
  2025-02-13 20:41       ` Andrew Cooper
  2025-02-14  9:54     ` Peter Zijlstra
  2 siblings, 1 reply; 40+ messages in thread
From: Kees Cook @ 2025-02-13 20:28 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: jannh, jmill, joao, linux-hardening, linux-kernel, luto,
	samitolvanen, Peter Zijlstra (Intel)

On Thu, Feb 13, 2025 at 01:31:30AM +0000, Andrew Cooper wrote:
> >> Assuming this is an issue you all feel is worth addressing, I will
> >> continue working on providing a patch. I'm concerned though that the
> >> overhead from adding a wrmsr on both syscall entry and exit to
> >> overwrite and restore the KERNEL_GS_BASE MSR may be quite high, so
> >> any feedback in regards to the approach or suggestions of alternate
> >> approaches to patching are welcome :) 
> >
> > Since the kernel, as far as I understand, uses FineIBT without
> > backwards control flow protection (in other words, I think we assume
> > that the kernel stack is trusted?),
> 
> This is fun indeed.  Linux cannot use supervisor shadow stacks because
> the mess around NMI re-entrancy (and IST more generally) requires ROP
> gadgets in order to function safely.  Implementing this with shadow
> stacks active, while not impossible, is deemed to be prohibitively
> complicated.

And just validate my understanding here, this attack is fundamentally
about FineIBT, not regular CFI (IBT or not), as the validation of target
addresses is done at indirect call time, yes?

-Kees

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-13 20:28     ` Kees Cook
@ 2025-02-13 20:41       ` Andrew Cooper
  2025-02-13 20:53         ` Kees Cook
  2025-02-14 10:05         ` Peter Zijlstra
  0 siblings, 2 replies; 40+ messages in thread
From: Andrew Cooper @ 2025-02-13 20:41 UTC (permalink / raw)
  To: Kees Cook
  Cc: jannh, jmill, joao, linux-hardening, linux-kernel, luto,
	samitolvanen, Peter Zijlstra (Intel)

On 13/02/2025 8:28 pm, Kees Cook wrote:
> On Thu, Feb 13, 2025 at 01:31:30AM +0000, Andrew Cooper wrote:
>>>> Assuming this is an issue you all feel is worth addressing, I will
>>>> continue working on providing a patch. I'm concerned though that the
>>>> overhead from adding a wrmsr on both syscall entry and exit to
>>>> overwrite and restore the KERNEL_GS_BASE MSR may be quite high, so
>>>> any feedback in regards to the approach or suggestions of alternate
>>>> approaches to patching are welcome :) 
>>> Since the kernel, as far as I understand, uses FineIBT without
>>> backwards control flow protection (in other words, I think we assume
>>> that the kernel stack is trusted?),
>> This is fun indeed.  Linux cannot use supervisor shadow stacks because
>> the mess around NMI re-entrancy (and IST more generally) requires ROP
>> gadgets in order to function safely.  Implementing this with shadow
>> stacks active, while not impossible, is deemed to be prohibitively
>> complicated.
> And just validate my understanding here, this attack is fundamentally
> about FineIBT, not regular CFI (IBT or not), as the validation of target
> addresses is done at indirect call time, yes?

I'm not sure I'd classify it like that.  As a pivot primitive, it works
very widely.

FineIBT (more specifically any hybrid CFI scheme which includes CET-IBT)
relies on hardware to do the course grain violation detection, and some
software hash for fine grain violation detection.

In this case, the requirement for the SYSCALL entrypoint to have an
ENDBR64 instruction means it passes the CET-IBT check (does not yield
#CP), and then lacks the software hash check as well.

i.e. this renders FineIBT (and other hybrid CFI schemes) rather moot,
because one hole is all the attacker needs to win, if they can control a
function pointer / return address.  At which point it's a large overhead
for no security benefit over simple CET-IBT.

The problem is that SYSCALL entry/exit is a toxic operating mode,
because you only have to think about sneezing and another user->kernel
priv-esc appears.

~Andrew

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-13 20:41       ` Andrew Cooper
@ 2025-02-13 20:53         ` Kees Cook
  2025-02-13 20:57           ` Jann Horn
  2025-02-14  9:57           ` Peter Zijlstra
  2025-02-14 10:05         ` Peter Zijlstra
  1 sibling, 2 replies; 40+ messages in thread
From: Kees Cook @ 2025-02-13 20:53 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: jannh, jmill, joao, linux-hardening, linux-kernel, luto,
	samitolvanen, Peter Zijlstra (Intel)

On Thu, Feb 13, 2025 at 08:41:16PM +0000, Andrew Cooper wrote:
> On 13/02/2025 8:28 pm, Kees Cook wrote:
> > On Thu, Feb 13, 2025 at 01:31:30AM +0000, Andrew Cooper wrote:
> >>>> Assuming this is an issue you all feel is worth addressing, I will
> >>>> continue working on providing a patch. I'm concerned though that the
> >>>> overhead from adding a wrmsr on both syscall entry and exit to
> >>>> overwrite and restore the KERNEL_GS_BASE MSR may be quite high, so
> >>>> any feedback in regards to the approach or suggestions of alternate
> >>>> approaches to patching are welcome :) 
> >>> Since the kernel, as far as I understand, uses FineIBT without
> >>> backwards control flow protection (in other words, I think we assume
> >>> that the kernel stack is trusted?),
> >> This is fun indeed.  Linux cannot use supervisor shadow stacks because
> >> the mess around NMI re-entrancy (and IST more generally) requires ROP
> >> gadgets in order to function safely.  Implementing this with shadow
> >> stacks active, while not impossible, is deemed to be prohibitively
> >> complicated.
> > And just validate my understanding here, this attack is fundamentally
> > about FineIBT, not regular CFI (IBT or not), as the validation of target
> > addresses is done at indirect call time, yes?
> 
> I'm not sure I'd classify it like that.  As a pivot primitive, it works
> very widely.
> 
> FineIBT (more specifically any hybrid CFI scheme which includes CET-IBT)
> relies on hardware to do the course grain violation detection, and some
> software hash for fine grain violation detection.
> 
> In this case, the requirement for the SYSCALL entrypoint to have an
> ENDBR64 instruction means it passes the CET-IBT check (does not yield
> #CP), and then lacks the software hash check as well.
> 
> i.e. this renders FineIBT (and other hybrid CFI schemes) rather moot,
> because one hole is all the attacker needs to win, if they can control a
> function pointer / return address.  At which point it's a large overhead
> for no security benefit over simple CET-IBT.

Right, the "if they can control a function pointer" is the part I'm
focusing on. This attack depends on making an indirect call with a
controlled pointer. Non-FineIBT CFI will protect against that step,
so I think this is only an issue for IBT-only and FineIBT, but not CFI
nor CFI+IBT.

> The problem is that SYSCALL entry/exit is a toxic operating mode,
> because you only have to think about sneezing and another user->kernel
> priv-esc appears.

Yeah, once an attacker can make an indirect call to a controlled
address, everything falls apart. And using the entry just makes the
pivot all that much easier to find/use.

-Kees

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-13 20:53         ` Kees Cook
@ 2025-02-13 20:57           ` Jann Horn
  2025-02-16 23:42             ` Kees Cook
  2025-02-14  9:57           ` Peter Zijlstra
  1 sibling, 1 reply; 40+ messages in thread
From: Jann Horn @ 2025-02-13 20:57 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Cooper, jmill, joao, linux-hardening, linux-kernel, luto,
	samitolvanen, Peter Zijlstra (Intel)

On Thu, Feb 13, 2025 at 9:53 PM Kees Cook <kees@kernel.org> wrote:
> On Thu, Feb 13, 2025 at 08:41:16PM +0000, Andrew Cooper wrote:
> > On 13/02/2025 8:28 pm, Kees Cook wrote:
> > > On Thu, Feb 13, 2025 at 01:31:30AM +0000, Andrew Cooper wrote:
> > >>>> Assuming this is an issue you all feel is worth addressing, I will
> > >>>> continue working on providing a patch. I'm concerned though that the
> > >>>> overhead from adding a wrmsr on both syscall entry and exit to
> > >>>> overwrite and restore the KERNEL_GS_BASE MSR may be quite high, so
> > >>>> any feedback in regards to the approach or suggestions of alternate
> > >>>> approaches to patching are welcome :)
> > >>> Since the kernel, as far as I understand, uses FineIBT without
> > >>> backwards control flow protection (in other words, I think we assume
> > >>> that the kernel stack is trusted?),
> > >> This is fun indeed.  Linux cannot use supervisor shadow stacks because
> > >> the mess around NMI re-entrancy (and IST more generally) requires ROP
> > >> gadgets in order to function safely.  Implementing this with shadow
> > >> stacks active, while not impossible, is deemed to be prohibitively
> > >> complicated.
> > > And just validate my understanding here, this attack is fundamentally
> > > about FineIBT, not regular CFI (IBT or not), as the validation of target
> > > addresses is done at indirect call time, yes?
> >
> > I'm not sure I'd classify it like that.  As a pivot primitive, it works
> > very widely.
> >
> > FineIBT (more specifically any hybrid CFI scheme which includes CET-IBT)
> > relies on hardware to do the course grain violation detection, and some
> > software hash for fine grain violation detection.
> >
> > In this case, the requirement for the SYSCALL entrypoint to have an
> > ENDBR64 instruction means it passes the CET-IBT check (does not yield
> > #CP), and then lacks the software hash check as well.
> >
> > i.e. this renders FineIBT (and other hybrid CFI schemes) rather moot,
> > because one hole is all the attacker needs to win, if they can control a
> > function pointer / return address.  At which point it's a large overhead
> > for no security benefit over simple CET-IBT.
>
> Right, the "if they can control a function pointer" is the part I'm
> focusing on. This attack depends on making an indirect call with a
> controlled pointer. Non-FineIBT CFI will protect against that step,
> so I think this is only an issue for IBT-only and FineIBT, but not CFI
> nor CFI+IBT.

To me, "CFI" is really just a fairly abstract concept; are you talking
specifically about the Clang scheme from
<https://clang.llvm.org/docs/ControlFlowIntegrityDesign.html>, or
something else?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-13 19:23     ` Jann Horn
@ 2025-02-13 21:24       ` Andrew Cooper
  2025-02-13 23:24         ` Jennifer Miller
  2025-02-14 22:25       ` Josh Poimboeuf
  1 sibling, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2025-02-13 21:24 UTC (permalink / raw)
  To: Jann Horn, Jennifer Miller
  Cc: Andy Lutomirski, linux-hardening, kees, joao, samitolvanen,
	kernel list

On 13/02/2025 7:23 pm, Jann Horn wrote:
> On Thu, Feb 13, 2025 at 7:15 AM Jennifer Miller <jmill@asu.edu> wrote:
>> Here is some napkin asm for this I wrote for the 64-bit syscall entrypoint,
>> I think more or less the same could be done for the other entrypoints.
>>
>> ```
>>     endbr64
>>     test rsp, rsp
>>     js slowpath
>>
>>     swapgs
>>     ~~fastpath continues~~
>>
>> ; path taken when rsp was a kernel address
>> ; we have no choice really but to switch to the stack from the untrusted
>> ; gsbase but after doing so we have to be careful about what we put on the
>> ; stack
>> slowpath:
>>     swapgs

I'm afraid I don't follow.  By this point, both basic blocks are the
same (a single swapgs).

Malicious userspace can get onto the slowpath by loading a kernel
pointer into %rsp.  Furthermore, if the origin of this really was in the
kernel, then ...

>>
>> ; swap stacks as normal
>>     mov    QWORD PTR gs:[rip+0x7f005f85],rsp       # 0x6014 <cpu_tss_rw+20>
>>     mov    rsp,QWORD PTR gs:[rip+0x7f02c56d]       # 0x2c618 <pcpu_hot+24>

... these are memory accesses using the user %gs.  As you note a few
lines lower, %gs isn't safe at this point.

A cunning attacker can make gs:[rip+0x7f02c56d] be a read-only mapping,
at point we'll have loaded an attacker controlled %rsp, then take #PF
trying to spill %rsp into pcpu_hot, and now we're running the pagefault
handler on an attacker controlled stack and gsbase.

>>     ~~normal push and clear GPRs sequence here~~
>>
>> ; we entered with an rsp in the kernel address range.
>> ; we already did swapgs but we don't know if we can trust our gsbase yet.
>> ; we should be able to trust the ro_after_init __per_cpu_offset array
>> ; though.
>>
>> ; check that gsbase is the expected value for our current cpu
>>     rdtscp
>>     mov rax, QWORD PTR [8*ecx-0x7d7be460] <__per_cpu_offset>
>>
>>     rdgsbase rbx
>>
>>     cmp rbx, rax
>>     je fastpath_after_regs_preserved
>>
>>     wrgsbase rax

Irrespective of other things, you'll need some compatibility strategy
for the fact that RDTSCP and {RD,WR}{FS,GS}BASE cannot be used
unconditionally in 64bit mode.  It might be as simple as making FineIBT
depend on their presence to activate, but taking a #UD exception in this
path is also a priv-esc vulnerability.

While all CET-IBT capable CPUs ought to have RDTSCP/*BASE, there are
virt environments where this implication does not hold.

>>
>> ; if we reach here we are being exploited and should explode or attempt
>> ; to recover
>> ```
>>
>> The unfortunate part is that it would still result in the register state
>> being dumped on top of some attacker controlled address, so if the error
>> path is recoverable someone could still use entrypoints to convert control
>> flow hijacking into memory corruption via register dump. So it would kill
>> the ability to get ROP but it would still be possible to dump regs over
>> modprobe_path, core_pattern, etc.
> It is annoying that we (as far as I know) don't have a nice clear
> security model for what exactly CFI in the kernel is supposed to
> achieve - though I guess that's partly because in its current version,
> it only happens to protect against cases where an attacker gets a
> function pointer overwrite, but not the probably more common cases
> where the attacker (also?) gets an object pointer overwrite...
>
>> Does this seem feasible and any better than the alternative of overwriting
>> and restoring KERNEL_GS_BASE?
> The syscall entry point is a hot path; my main reason for suggesting
> the RSP check is that I'm worried about the performance impact of the
> gsbase-overwriting approach, but I don't actually have numbers on
> that. I figure a test + conditional jump is about the cheapest we can
> do...

Yeah, this is the cheapest I can think of too.  TEST+JS has been able to
macrofuse since the Core2 era.

> Do we know how many cycles wrgsbase takes, and how serializing
> is it? Sadly Agner Fog's tables don't seem to list it...

Not (architecturally) serialising, and pretty quick IIRC.  It is
microcoded, but the segment registers are renamed so it can execute
speculatively.

~Andrew

>
> How would we actually do that overwriting and restoring of
> KERNEL_GS_BASE? Would we need a scratch register for that?


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-13 21:24       ` Andrew Cooper
@ 2025-02-13 23:24         ` Jennifer Miller
  2025-02-13 23:43           ` Jann Horn
  2025-02-14 23:06           ` Andrew Cooper
  0 siblings, 2 replies; 40+ messages in thread
From: Jennifer Miller @ 2025-02-13 23:24 UTC (permalink / raw)
  To: Andrew Cooper, Jann Horn
  Cc: Andy Lutomirski, linux-hardening, kees, joao, samitolvanen,
	kernel list

On Thu, Feb 13, 2025 at 09:24:18PM +0000, Andrew Cooper wrote:
> On 13/02/2025 7:23 pm, Jann Horn wrote:
> > On Thu, Feb 13, 2025 at 7:15 AM Jennifer Miller <jmill@asu.edu> wrote:
> >> Here is some napkin asm for this I wrote for the 64-bit syscall entrypoint,
> >> I think more or less the same could be done for the other entrypoints.
> >>
> >> ```
> >>     endbr64
> >>     test rsp, rsp
> >>     js slowpath
> >>
> >>     swapgs
> >>     ~~fastpath continues~~
> >>
> >> ; path taken when rsp was a kernel address
> >> ; we have no choice really but to switch to the stack from the untrusted
> >> ; gsbase but after doing so we have to be careful about what we put on the
> >> ; stack
> >> slowpath:
> >>     swapgs
> 
> I'm afraid I don't follow.  By this point, both basic blocks are the
> same (a single swapgs).

Ah sure, the test/js could be moved occur after swapgs to save an 
instruction.

>
> Malicious userspace can get onto the slowpath by loading a kernel
> pointer into %rsp.  Furthermore, if the origin of this really was in the
> kernel, then ...
> 
> >>
> >> ; swap stacks as normal
> >>     mov    QWORD PTR gs:[rip+0x7f005f85],rsp       # 0x6014 <cpu_tss_rw+20>
> >>     mov    rsp,QWORD PTR gs:[rip+0x7f02c56d]       # 0x2c618 <pcpu_hot+24>
> 
> ... these are memory accesses using the user %gs.  As you note a few
> lines lower, %gs isn't safe at this point.
> 
> A cunning attacker can make gs:[rip+0x7f02c56d] be a read-only mapping,
> at point we'll have loaded an attacker controlled %rsp, then take #PF
> trying to spill %rsp into pcpu_hot, and now we're running the pagefault
> handler on an attacker controlled stack and gsbase.
> 

I don't follow, the spill of %rsp into pcpu_hot occurs first, before we
would move to the attacker controlled stack. This is Intel asm syntax,
sorry if that was unclear.

Still, I hadn't considered misusing readonly/unmapped pages on the GPR
register spill that follows. Could we enforce that the stack pointer we get
be page aligned to prevent this vector? So that if one were to attempt to
point the stack to readonly or unmapped memory they should be guaranteed to
double fault?

> >>     ~~normal push and clear GPRs sequence here~~
> >>
> >> ; we entered with an rsp in the kernel address range.
> >> ; we already did swapgs but we don't know if we can trust our gsbase yet.
> >> ; we should be able to trust the ro_after_init __per_cpu_offset array
> >> ; though.
> >>
> >> ; check that gsbase is the expected value for our current cpu
> >>     rdtscp
> >>     mov rax, QWORD PTR [8*ecx-0x7d7be460] <__per_cpu_offset>
> >>
> >>     rdgsbase rbx
> >>
> >>     cmp rbx, rax
> >>     je fastpath_after_regs_preserved
> >>
> >>     wrgsbase rax
> 
> Irrespective of other things, you'll need some compatibility strategy
> for the fact that RDTSCP and {RD,WR}{FS,GS}BASE cannot be used
> unconditionally in 64bit mode.  It might be as simple as making FineIBT
> depend on their presence to activate, but taking a #UD exception in this
> path is also a priv-esc vulnerability.

Sure, we could rdmsr IA32_TSC_AUX in place of rdtscsp. After the wrgsbase 
we could switch to the expected kernel stack now that gsbase is fixed 
before taking any #UD.

> 
> While all CET-IBT capable CPUs ought to have RDTSCP/*BASE, there are
> virt environments where this implication does not hold.
> 
> >>
> >> ; if we reach here we are being exploited and should explode or attempt
> >> ; to recover
> >> ```
> >>
> >> The unfortunate part is that it would still result in the register state
> >> being dumped on top of some attacker controlled address, so if the error
> >> path is recoverable someone could still use entrypoints to convert control
> >> flow hijacking into memory corruption via register dump. So it would kill
> >> the ability to get ROP but it would still be possible to dump regs over
> >> modprobe_path, core_pattern, etc.
> > It is annoying that we (as far as I know) don't have a nice clear
> > security model for what exactly CFI in the kernel is supposed to
> > achieve - though I guess that's partly because in its current version,
> > it only happens to protect against cases where an attacker gets a
> > function pointer overwrite, but not the probably more common cases
> > where the attacker (also?) gets an object pointer overwrite...
> >
> >> Does this seem feasible and any better than the alternative of overwriting
> >> and restoring KERNEL_GS_BASE?
> > The syscall entry point is a hot path; my main reason for suggesting
> > the RSP check is that I'm worried about the performance impact of the
> > gsbase-overwriting approach, but I don't actually have numbers on
> > that. I figure a test + conditional jump is about the cheapest we can
> > do...
> 
> Yeah, this is the cheapest I can think of too.  TEST+JS has been able to
> macrofuse since the Core2 era.
> 
> > Do we know how many cycles wrgsbase takes, and how serializing
> > is it? Sadly Agner Fog's tables don't seem to list it...
> 
> Not (architecturally) serialising, and pretty quick IIRC.  It is
> microcoded, but the segment registers are renamed so it can execute
> speculatively.
> 
> ~Andrew
> 
> >
> > How would we actually do that overwriting and restoring of
> > KERNEL_GS_BASE? Would we need a scratch register for that?
> 

I think we can do the overwrite at any point before actually calling into 
the individual syscall handlers, really anywhere before potentially 
hijacked indirect control flow can occur and then restore it just after 
those return e.g., for the 64-bit path I am currently overwriting it at the
start of do_syscall_64 and then restoring it just before 
syscall_exit_to_user_mode. I'm not sure if there is any reason to do it
sooner while we'd still be register constrained.

~Jennifer

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-13 23:24         ` Jennifer Miller
@ 2025-02-13 23:43           ` Jann Horn
  2025-02-14 23:06           ` Andrew Cooper
  1 sibling, 0 replies; 40+ messages in thread
From: Jann Horn @ 2025-02-13 23:43 UTC (permalink / raw)
  To: Jennifer Miller
  Cc: Andrew Cooper, Andy Lutomirski, linux-hardening, kees, joao,
	samitolvanen, kernel list

On Fri, Feb 14, 2025 at 12:24 AM Jennifer Miller <jmill@asu.edu> wrote:
> On Thu, Feb 13, 2025 at 09:24:18PM +0000, Andrew Cooper wrote:
> > On 13/02/2025 7:23 pm, Jann Horn wrote:
> > > How would we actually do that overwriting and restoring of
> > > KERNEL_GS_BASE? Would we need a scratch register for that?
> >
>
> I think we can do the overwrite at any point before actually calling into
> the individual syscall handlers, really anywhere before potentially
> hijacked indirect control flow can occur and then restore it just after
> those return e.g., for the 64-bit path I am currently overwriting it at the
> start of do_syscall_64 and then restoring it just before
> syscall_exit_to_user_mode. I'm not sure if there is any reason to do it
> sooner while we'd still be register constrained.

Right, makes sense - sorry, I misremembered the details of the
KERNEL_GS_BASE overwrite proposal, I had to re-read your first mail.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-13  1:31   ` Andrew Cooper
  2025-02-13  2:09     ` Jann Horn
  2025-02-13 20:28     ` Kees Cook
@ 2025-02-14  9:54     ` Peter Zijlstra
  2 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2025-02-14  9:54 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: jannh, jmill, joao, kees, linux-hardening, linux-kernel, luto,
	samitolvanen

On Thu, Feb 13, 2025 at 01:31:30AM +0000, Andrew Cooper wrote:

> But, FRED support is slated for PantherLake/DiamondRapids which haven't
> shipped yet, so are no use to the problem right now.

FRED also fixes this IBT 'oopsie' IIRC.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-13 20:53         ` Kees Cook
  2025-02-13 20:57           ` Jann Horn
@ 2025-02-14  9:57           ` Peter Zijlstra
  2025-02-15 21:07             ` Peter Zijlstra
  1 sibling, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2025-02-14  9:57 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Cooper, jannh, jmill, joao, linux-hardening, linux-kernel,
	luto, samitolvanen

On Thu, Feb 13, 2025 at 12:53:28PM -0800, Kees Cook wrote:

> Right, the "if they can control a function pointer" is the part I'm
> focusing on. This attack depends on making an indirect call with a
> controlled pointer. Non-FineIBT CFI will protect against that step,
> so I think this is only an issue for IBT-only and FineIBT, but not CFI
> nor CFI+IBT.

Yes, the whole caller side validation should stop this.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-13 20:41       ` Andrew Cooper
  2025-02-13 20:53         ` Kees Cook
@ 2025-02-14 10:05         ` Peter Zijlstra
  1 sibling, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2025-02-14 10:05 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Kees Cook, jannh, jmill, joao, linux-hardening, linux-kernel,
	luto, samitolvanen

On Thu, Feb 13, 2025 at 08:41:16PM +0000, Andrew Cooper wrote:

> The problem is that SYSCALL entry/exit is a toxic operating mode,
> because you only have to think about sneezing and another user->kernel
> priv-esc appears.

For a very brief moment I thought we could leave out the ENDBR there and
eat the #CP, but 1) slow, and 2) then #CP needs to be an IST and ARGHH.

So yeah, I didn't just suggest anything at all.

I hate all this.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-13 19:23     ` Jann Horn
  2025-02-13 21:24       ` Andrew Cooper
@ 2025-02-14 22:25       ` Josh Poimboeuf
  1 sibling, 0 replies; 40+ messages in thread
From: Josh Poimboeuf @ 2025-02-14 22:25 UTC (permalink / raw)
  To: Jann Horn
  Cc: Jennifer Miller, Andy Lutomirski, linux-hardening, kees, joao,
	samitolvanen, kernel list, Andrew Cooper

On Thu, Feb 13, 2025 at 08:23:34PM +0100, Jann Horn wrote:
> On Thu, Feb 13, 2025 at 7:15 AM Jennifer Miller <jmill@asu.edu> wrote:
> > In short, we could have the slowpath branch as you suggested, in the
> > slowpath permit the stack switch and preserving of the registers on the
> > stack, but then do a sanity check according to the __per_cpu_offset array
> > and decide from there whether we should continue executing the entrypoint
> > or die/attempt to recover.
> 
> One ugly option to avoid the register spilling might be to say
> "userspace is not allowed to execute a SYSCALL instruction while RSP
> is a kernel address, and if userspace does it anyway, the kernel can
> kill the process". Then the slowpath could immediately start using the
> GPRs without having to worry about where to save their old values, and
> it could read the correct gsbase with the GET_PERCPU_BASE macro. It
> would be an ABI change, but one that is probably fairly unlikely to
> actually break stuff? But it would require a bit of extra kernel code
> on the slowpath, which is kinda annoying...

Could all this be made easier if we went back to having percpu entry
trampolines?  Then the trampoline could just use a PC-relative access to
get the kernel stack pointer without needing %gs.

I think the main reason the entry trampolines were removed was because
they needed an indirect branch to jump back to the global text.  But
they could be allocated within 2GB of the entry text and do a direct
jump.

-- 
Josh

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-13 23:24         ` Jennifer Miller
  2025-02-13 23:43           ` Jann Horn
@ 2025-02-14 23:06           ` Andrew Cooper
  2025-02-15  0:07             ` Jennifer Miller
  1 sibling, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2025-02-14 23:06 UTC (permalink / raw)
  To: Jennifer Miller, Jann Horn
  Cc: Andy Lutomirski, linux-hardening, kees, joao, samitolvanen,
	kernel list

On 13/02/2025 11:24 pm, Jennifer Miller wrote:
> On Thu, Feb 13, 2025 at 09:24:18PM +0000, Andrew Cooper wrote:
>>>> ; swap stacks as normal
>>>>     mov    QWORD PTR gs:[rip+0x7f005f85],rsp       # 0x6014 <cpu_tss_rw+20>
>>>>     mov    rsp,QWORD PTR gs:[rip+0x7f02c56d]       # 0x2c618 <pcpu_hot+24>
>> ... these are memory accesses using the user %gs.  As you note a few
>> lines lower, %gs isn't safe at this point.
>>
>> A cunning attacker can make gs:[rip+0x7f02c56d] be a read-only mapping,
>> at point we'll have loaded an attacker controlled %rsp, then take #PF
>> trying to spill %rsp into pcpu_hot, and now we're running the pagefault
>> handler on an attacker controlled stack and gsbase.
>>
> I don't follow, the spill of %rsp into pcpu_hot occurs first, before we
> would move to the attacker controlled stack. This is Intel asm syntax,
> sorry if that was unclear.

No, sorry.  It's clearly written; I simply wasn't paying enough attention.

> Still, I hadn't considered misusing readonly/unmapped pages on the GPR
> register spill that follows. Could we enforce that the stack pointer we get
> be page aligned to prevent this vector? So that if one were to attempt to
> point the stack to readonly or unmapped memory they should be guaranteed to
> double fault?

Hmm.

Espfix64 does involve #DF recovering from a write to a read-only stack. 
(This broken corner of x86 is also fixed in FRED.   We fixed a *lot* of
thing.)

As long the #DF handler can be updated to safely distinguish espfix64
from this entrypoint attack, this seems like it might mitigate the
read-only case.
> I think we can do the overwrite at any point before actually calling into 
> the individual syscall handlers, really anywhere before potentially 
> hijacked indirect control flow can occur and then restore it just after 
> those return e.g., for the 64-bit path I am currently overwriting it at the
> start of do_syscall_64 and then restoring it just before 
> syscall_exit_to_user_mode. I'm not sure if there is any reason to do it
> sooner while we'd still be register constrained.

I don't follow.  If any "bad" execution is found in an entrypoint, Linux
needs to panic().  Detecting the malice involves clobbering an in-use
stack, and there's no ability to safely recover.

~Andrew

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-14 23:06           ` Andrew Cooper
@ 2025-02-15  0:07             ` Jennifer Miller
  2025-02-15  0:11               ` Andrew Cooper
  0 siblings, 1 reply; 40+ messages in thread
From: Jennifer Miller @ 2025-02-15  0:07 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Jann Horn, Andy Lutomirski, linux-hardening, kees, joao,
	samitolvanen, kernel list

On Fri, Feb 14, 2025 at 11:06:50PM +0000, Andrew Cooper wrote:
> On 13/02/2025 11:24 pm, Jennifer Miller wrote:
> > On Thu, Feb 13, 2025 at 09:24:18PM +0000, Andrew Cooper wrote:
> >>>> ; swap stacks as normal
> >>>>     mov    QWORD PTR gs:[rip+0x7f005f85],rsp       # 0x6014 <cpu_tss_rw+20>
> >>>>     mov    rsp,QWORD PTR gs:[rip+0x7f02c56d]       # 0x2c618 <pcpu_hot+24>
> >> ... these are memory accesses using the user %gs.  As you note a few
> >> lines lower, %gs isn't safe at this point.
> >>
> >> A cunning attacker can make gs:[rip+0x7f02c56d] be a read-only mapping,
> >> at point we'll have loaded an attacker controlled %rsp, then take #PF
> >> trying to spill %rsp into pcpu_hot, and now we're running the pagefault
> >> handler on an attacker controlled stack and gsbase.
> >>
> > I don't follow, the spill of %rsp into pcpu_hot occurs first, before we
> > would move to the attacker controlled stack. This is Intel asm syntax,
> > sorry if that was unclear.
> 
> No, sorry.  It's clearly written; I simply wasn't paying enough attention.
> 
> > Still, I hadn't considered misusing readonly/unmapped pages on the GPR
> > register spill that follows. Could we enforce that the stack pointer we get
> > be page aligned to prevent this vector? So that if one were to attempt to
> > point the stack to readonly or unmapped memory they should be guaranteed to
> > double fault?
> 
> Hmm.
> 
> Espfix64 does involve #DF recovering from a write to a read-only stack. 
> (This broken corner of x86 is also fixed in FRED.   We fixed a *lot* of
> thing.)

Interesting, I haven't gotten around to reading into how FRED works, it
sounds neat.

> 
> As long the #DF handler can be updated to safely distinguish espfix64
> from this entrypoint attack, this seems like it might mitigate the
> read-only case.
> > I think we can do the overwrite at any point before actually calling into 
> > the individual syscall handlers, really anywhere before potentially 
> > hijacked indirect control flow can occur and then restore it just after 
> > those return e.g., for the 64-bit path I am currently overwriting it at the
> > start of do_syscall_64 and then restoring it just before 
> > syscall_exit_to_user_mode. I'm not sure if there is any reason to do it
> > sooner while we'd still be register constrained.
> 
> I don't follow.  If any "bad" execution is found in an entrypoint, Linux
> needs to panic().  Detecting the malice involves clobbering an in-use
> stack, and there's no ability to safely recover.

Sorry, this was in response to Jann's question about the mitigation
strategy proposed in my initial email.

> 
> ~Andrew

~Jennifer

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-15  0:07             ` Jennifer Miller
@ 2025-02-15  0:11               ` Andrew Cooper
  2025-02-15  0:19                 ` Jennifer Miller
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2025-02-15  0:11 UTC (permalink / raw)
  To: Jennifer Miller
  Cc: Jann Horn, Andy Lutomirski, linux-hardening, kees, joao,
	samitolvanen, kernel list

On 15/02/2025 12:07 am, Jennifer Miller wrote:
> On Fri, Feb 14, 2025 at 11:06:50PM +0000, Andrew Cooper wrote:
>> On 13/02/2025 11:24 pm, Jennifer Miller wrote:
>>> On Thu, Feb 13, 2025 at 09:24:18PM +0000, Andrew Cooper wrote:
>>> Still, I hadn't considered misusing readonly/unmapped pages on the GPR
>>> register spill that follows. Could we enforce that the stack pointer we get
>>> be page aligned to prevent this vector? So that if one were to attempt to
>>> point the stack to readonly or unmapped memory they should be guaranteed to
>>> double fault?
>> Hmm.
>>
>> Espfix64 does involve #DF recovering from a write to a read-only stack. 
>> (This broken corner of x86 is also fixed in FRED.   We fixed a *lot* of
>> thing.)
> Interesting, I haven't gotten around to reading into how FRED works, it
> sounds neat.

Start with
https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit?usp=sharing


Then
https://www.intel.com/content/www/us/en/content-details/779982/flexible-return-and-event-delivery-fred-specification.html

~Andrew

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-15  0:11               ` Andrew Cooper
@ 2025-02-15  0:19                 ` Jennifer Miller
  0 siblings, 0 replies; 40+ messages in thread
From: Jennifer Miller @ 2025-02-15  0:19 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Jann Horn, Andy Lutomirski, linux-hardening, kees, joao,
	samitolvanen, kernel list

On Sat, Feb 15, 2025 at 12:11:17AM +0000, Andrew Cooper wrote:
> On 15/02/2025 12:07 am, Jennifer Miller wrote:
> > On Fri, Feb 14, 2025 at 11:06:50PM +0000, Andrew Cooper wrote:
> >> On 13/02/2025 11:24 pm, Jennifer Miller wrote:
> >>> On Thu, Feb 13, 2025 at 09:24:18PM +0000, Andrew Cooper wrote:
> >>> Still, I hadn't considered misusing readonly/unmapped pages on the GPR
> >>> register spill that follows. Could we enforce that the stack pointer we get
> >>> be page aligned to prevent this vector? So that if one were to attempt to
> >>> point the stack to readonly or unmapped memory they should be guaranteed to
> >>> double fault?
> >> Hmm.
> >>
> >> Espfix64 does involve #DF recovering from a write to a read-only stack. 
> >> (This broken corner of x86 is also fixed in FRED.   We fixed a *lot* of
> >> thing.)
> > Interesting, I haven't gotten around to reading into how FRED works, it
> > sounds neat.
> 
> Start with
> https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit?usp=sharing
> 
> 
> Then
> https://www.intel.com/content/www/us/en/content-details/779982/flexible-return-and-event-delivery-fred-specification.html
> 
> ~Andrew

Thanks, I'll give those a read!

~Jennifer

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-14  9:57           ` Peter Zijlstra
@ 2025-02-15 21:07             ` Peter Zijlstra
  2025-02-16 23:51               ` Kees Cook
  2025-02-17 13:06               ` David Laight
  0 siblings, 2 replies; 40+ messages in thread
From: Peter Zijlstra @ 2025-02-15 21:07 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Cooper, jannh, jmill, joao, linux-hardening, linux-kernel,
	luto, samitolvanen, scott.d.constable, x86

On Fri, Feb 14, 2025 at 10:57:51AM +0100, Peter Zijlstra wrote:
> On Thu, Feb 13, 2025 at 12:53:28PM -0800, Kees Cook wrote:
> 
> > Right, the "if they can control a function pointer" is the part I'm
> > focusing on. This attack depends on making an indirect call with a
> > controlled pointer. Non-FineIBT CFI will protect against that step,
> > so I think this is only an issue for IBT-only and FineIBT, but not CFI
> > nor CFI+IBT.
> 
> Yes, the whole caller side validation should stop this.

And I think we can retro-fit that in FineIBT. Notably the current call
sites look like:

0000000000000060 <fineibt_caller>:
  60:   41 ba 78 56 34 12       mov    $0x12345678,%r10d
  66:   49 83 eb 10             sub    $0x10,%r11
  6a:   0f 1f 40 00             nopl   0x0(%rax)
  6e:   41 ff d3                call   *%r11
  71:   0f 1f 00                nopl   (%rax)

Of which the last 6 bytes are the retpoline site (starting at 0x6e). It
is trivially possible to re-arrange things to have both nops next to one
another, giving us 7 bytes to muck about with.

And I think we can just about manage to do a caller side hash validation
in them bytes like:

0000000000000080 <fineibt_paranoid>:
  80:   41 ba 78 56 34 12       mov    $0x12345678,%r10d
  86:   49 83 eb 10             sub    $0x10,%r11
  8a:   45 3b 53 07             cmp    0x7(%r11),%r10d
  8e:   74 01                   je     91 <fineibt_paranoid+0x11>
  90:   ea                      (bad)
  91:   41 ff d3                call   *%r11

And while this is somewhat daft, it would close the hole vs this entry
point swizzle afaict, no?

Patch against tip/x86/core (which includes the latest ibt bits as per
this morning).

Boots and builds the next kernel on my ADL.

---
 arch/x86/include/asm/bug.h    |   1 +
 arch/x86/include/asm/cfi.h    |   8 ++--
 arch/x86/kernel/alternative.c | 107 +++++++++++++++++++++++++++++++++++++++---
 arch/x86/kernel/cfi.c         |   4 +-
 arch/x86/kernel/traps.c       |  13 ++++-
 5 files changed, 120 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/bug.h b/arch/x86/include/asm/bug.h
index 1a5e4b372694..bc8a2ca3c82e 100644
--- a/arch/x86/include/asm/bug.h
+++ b/arch/x86/include/asm/bug.h
@@ -25,6 +25,7 @@
 #define BUG_UD2			0xfffe
 #define BUG_UD1			0xfffd
 #define BUG_UD1_UBSAN		0xfffc
+#define BUG_EA			0xffea
 
 #ifdef CONFIG_GENERIC_BUG
 
diff --git a/arch/x86/include/asm/cfi.h b/arch/x86/include/asm/cfi.h
index 7dd5ab239c87..550f75450e43 100644
--- a/arch/x86/include/asm/cfi.h
+++ b/arch/x86/include/asm/cfi.h
@@ -104,7 +104,7 @@ extern enum cfi_mode cfi_mode;
 struct pt_regs;
 
 #ifdef CONFIG_CFI_CLANG
-enum bug_trap_type handle_cfi_failure(struct pt_regs *regs);
+enum bug_trap_type handle_cfi_failure(int ud_type, struct pt_regs *regs);
 #define __bpfcall
 extern u32 cfi_bpf_hash;
 extern u32 cfi_bpf_subprog_hash;
@@ -127,10 +127,10 @@ static inline int cfi_get_offset(void)
 extern u32 cfi_get_func_hash(void *func);
 
 #ifdef CONFIG_FINEIBT
-extern bool decode_fineibt_insn(struct pt_regs *regs, unsigned long *target, u32 *type);
+extern bool decode_fineibt_insn(int ud_type, struct pt_regs *regs, unsigned long *target, u32 *type);
 #else
 static inline bool
-decode_fineibt_insn(struct pt_regs *regs, unsigned long *target, u32 *type)
+decode_fineibt_insn(int ud_type, struct pt_regs *regs, unsigned long *target, u32 *type)
 {
 	return false;
 }
@@ -138,7 +138,7 @@ decode_fineibt_insn(struct pt_regs *regs, unsigned long *target, u32 *type)
 #endif
 
 #else
-static inline enum bug_trap_type handle_cfi_failure(struct pt_regs *regs)
+static inline enum bug_trap_type handle_cfi_failure(int ud_type, struct pt_regs *regs)
 {
 	return BUG_TRAP_TYPE_NONE;
 }
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 247ee5ffbff4..9e327b5e9f75 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -741,6 +741,11 @@ void __init_or_module noinline apply_retpolines(s32 *start, s32 *end)
 		op2 = insn.opcode.bytes[1];
 
 		switch (op1) {
+		case 0x70 ... 0x7f:	/* Jcc.d8 */
+			/* See cfi_paranoid. */
+			WARN_ON_ONCE(cfi_mode != CFI_FINEIBT);
+			continue;
+
 		case CALL_INSN_OPCODE:
 		case JMP32_INSN_OPCODE:
 			break;
@@ -983,6 +988,8 @@ u32 cfi_get_func_hash(void *func)
 static bool cfi_rand __ro_after_init = true;
 static u32  cfi_seed __ro_after_init;
 
+static bool cfi_paranoid __ro_after_init = false;
+
 /*
  * Re-hash the CFI hash with a boot-time seed while making sure the result is
  * not a valid ENDBR instruction.
@@ -1022,6 +1029,8 @@ static __init int cfi_parse_cmdline(char *str)
 			cfi_mode = CFI_FINEIBT;
 		} else if (!strcmp(str, "norand")) {
 			cfi_rand = false;
+		} else if (!strcmp(str, "paranoid")) {
+			cfi_paranoid = true;
 		} else {
 			pr_err("Ignoring unknown cfi option (%s).", str);
 		}
@@ -1097,6 +1106,29 @@ extern u8 fineibt_caller_end[];
 
 #define fineibt_caller_jmp (fineibt_caller_size - 2)
 
+asm(	".pushsection .rodata			\n"
+	"fineibt_paranoid_start:		\n"
+	"	movl	$0x12345678, %r10d	\n"
+	"	sub	$16, %r11		\n"
+	"	cmpl	7(%r11), %r10d		\n"
+	"	je	fineibt_paranoid_call	\n"
+	"fineibt_paranoid_trap:			\n"
+	"	.byte	0xea			\n"
+	"fineibt_paranoid_call:			\n"
+	"	call	*%r11			\n"
+	"fineibt_paranoid_end:			\n"
+	".popsection				\n"
+);
+
+extern u8 fineibt_paranoid_start[];
+extern u8 fineibt_paranoid_trap[];
+extern u8 fineibt_paranoid_call[];
+extern u8 fineibt_paranoid_end[];
+
+#define fineibt_paranoid_size (fineibt_paranoid_end - fineibt_paranoid_start)
+#define fineibt_paranoid_ud   (fineibt_paranoid_trap - fineibt_paranoid_start)
+#define fineibt_paranoid_ind  (fineibt_paranoid_call - fineibt_paranoid_start)
+
 static u32 decode_preamble_hash(void *addr)
 {
 	u8 *p = addr;
@@ -1260,18 +1292,48 @@ static int cfi_rewrite_callers(s32 *start, s32 *end)
 {
 	s32 *s;
 
+	BUG_ON(fineibt_paranoid_size != 20);
+
 	for (s = start; s < end; s++) {
 		void *addr = (void *)s + *s;
+		struct insn insn;
+		u8 bytes[20];
 		u32 hash;
+		int ret;
+		u8 op;
 
 		addr -= fineibt_caller_size;
 		hash = decode_caller_hash(addr);
-		if (hash) {
+		if (!hash)
+			continue;
+
+		if (!cfi_paranoid) {
 			text_poke_early(addr, fineibt_caller_start, fineibt_caller_size);
 			WARN_ON(*(u32 *)(addr + fineibt_caller_hash) != 0x12345678);
 			text_poke_early(addr + fineibt_caller_hash, &hash, 4);
+			/* rely on apply_retpolines() */
+			continue;
 		}
-		/* rely on apply_retpolines() */
+
+		/* cfi_paranoid */
+		ret = insn_decode_kernel(&insn, addr + fineibt_caller_size);
+		if (WARN_ON_ONCE(ret < 0))
+			continue;
+
+		op = insn.opcode.bytes[0];
+		if (op != CALL_INSN_OPCODE && op != JMP32_INSN_OPCODE) {
+			WARN_ON_ONCE(1);
+			continue;
+		}
+
+		memcpy(bytes, fineibt_paranoid_start, fineibt_paranoid_size);
+		memcpy(bytes + fineibt_caller_hash, &hash, 4);
+
+		ret = emit_indirect(op, 11, bytes + fineibt_paranoid_ind);
+		if (WARN_ON_ONCE(ret != 3))
+			continue;
+
+		text_poke_early(addr, bytes, fineibt_paranoid_size);
 	}
 
 	return 0;
@@ -1288,8 +1350,11 @@ static void __apply_fineibt(s32 *start_retpoline, s32 *end_retpoline,
 
 	if (cfi_mode == CFI_AUTO) {
 		cfi_mode = CFI_KCFI;
-		if (HAS_KERNEL_IBT && cpu_feature_enabled(X86_FEATURE_IBT))
+		if (HAS_KERNEL_IBT && cpu_feature_enabled(X86_FEATURE_IBT)) {
+			if (!cpu_feature_enabled(X86_FEATURE_FRED))
+				cfi_paranoid = true;
 			cfi_mode = CFI_FINEIBT;
+		}
 	}
 
 	/*
@@ -1346,8 +1411,10 @@ static void __apply_fineibt(s32 *start_retpoline, s32 *end_retpoline,
 		/* now that nobody targets func()+0, remove ENDBR there */
 		cfi_rewrite_endbr(start_cfi, end_cfi);
 
-		if (builtin)
-			pr_info("Using FineIBT CFI\n");
+		if (builtin) {
+			pr_info("Using FineIBT %s CFI\n",
+				cfi_paranoid ? "paranoid" : "");
+		}
 		return;
 
 	default:
@@ -1420,7 +1487,8 @@ static void poison_cfi(void *addr)
  * We check the preamble by checking for the ENDBR instruction relative to the
  * UD2 instruction.
  */
-bool decode_fineibt_insn(struct pt_regs *regs, unsigned long *target, u32 *type)
+static bool decode_fineibt_preamble(int ud_type, struct pt_regs *regs,
+				    unsigned long *target, u32 *type)
 {
 	unsigned long addr = regs->ip - fineibt_preamble_ud2;
 	u32 endbr, hash;
@@ -1440,6 +1508,33 @@ bool decode_fineibt_insn(struct pt_regs *regs, unsigned long *target, u32 *type)
 	return false;
 }
 
+/*
+ * regs->ip points to a 0xea instruction from the fineibt_paranoid_start[]
+ * sequence.
+ */
+static bool decode_fineibt_paranoid(int ud_type, struct pt_regs *regs,
+				    unsigned long *target, u32 *type)
+{
+	unsigned long addr = regs->ip - fineibt_paranoid_ud;
+	u32 hash;
+
+	__get_kernel_nofault(&hash, addr + fineibt_caller_hash, u32, Efault);
+	*target = regs->r11 + 16;
+	*type = regs->r10;
+	return true;
+
+Efault:
+	return false;
+}
+
+bool decode_fineibt_insn(int ud_type, struct pt_regs *regs,
+			 unsigned long *target, u32 *type)
+{
+	if (ud_type == BUG_EA)
+		return decode_fineibt_paranoid(ud_type, regs, target, type);
+	return decode_fineibt_preamble(ud_type, regs, target, type);
+}
+
 #else
 
 static void __apply_fineibt(s32 *start_retpoline, s32 *end_retpoline,
diff --git a/arch/x86/kernel/cfi.c b/arch/x86/kernel/cfi.c
index f6905bef0af8..f9eb7465eec6 100644
--- a/arch/x86/kernel/cfi.c
+++ b/arch/x86/kernel/cfi.c
@@ -65,7 +65,7 @@ static bool decode_cfi_insn(struct pt_regs *regs, unsigned long *target,
  * Checks if a ud2 trap is because of a CFI failure, and handles the trap
  * if needed. Returns a bug_trap_type value similarly to report_bug.
  */
-enum bug_trap_type handle_cfi_failure(struct pt_regs *regs)
+enum bug_trap_type handle_cfi_failure(int ud_type, struct pt_regs *regs)
 {
 	unsigned long target;
 	u32 type;
@@ -81,7 +81,7 @@ enum bug_trap_type handle_cfi_failure(struct pt_regs *regs)
 		break;
 
 	case CFI_FINEIBT:
-		if (!decode_fineibt_insn(regs, &target, &type))
+		if (!decode_fineibt_insn(ud_type, regs, &target, &type))
 			return BUG_TRAP_TYPE_NONE;
 
 		break;
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 05b86c05e446..500030ab8036 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -113,6 +113,10 @@ __always_inline int decode_bug(unsigned long addr, s32 *imm, int *len)
 	v = *(u8 *)(addr++);
 	if (v == INSN_ASOP)
 		v = *(u8 *)(addr++);
+	if (v == 0xea) {
+		*len = addr - start;
+		return BUG_EA;
+	}
 	if (v != OPCODE_ESCAPE)
 		return BUG_NONE;
 
@@ -308,9 +312,16 @@ static noinstr bool handle_bug(struct pt_regs *regs)
 		raw_local_irq_enable();
 
 	switch (ud_type) {
+	case BUG_EA:
+		if (handle_cfi_failure(ud_type, regs) == BUG_TRAP_TYPE_WARN) {
+			regs->ip += ud_len;
+			handled = true;
+		}
+		break;
+
 	case BUG_UD2:
 		if (report_bug(regs->ip, regs) == BUG_TRAP_TYPE_WARN ||
-		    handle_cfi_failure(regs) == BUG_TRAP_TYPE_WARN) {
+		    handle_cfi_failure(ud_type, regs) == BUG_TRAP_TYPE_WARN) {
 			regs->ip += ud_len;
 			handled = true;
 		}

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-13 20:57           ` Jann Horn
@ 2025-02-16 23:42             ` Kees Cook
  0 siblings, 0 replies; 40+ messages in thread
From: Kees Cook @ 2025-02-16 23:42 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Cooper, jmill, joao, linux-hardening, linux-kernel, luto,
	samitolvanen, Peter Zijlstra (Intel)

On Thu, Feb 13, 2025 at 09:57:37PM +0100, Jann Horn wrote:
> On Thu, Feb 13, 2025 at 9:53 PM Kees Cook <kees@kernel.org> wrote:
> > On Thu, Feb 13, 2025 at 08:41:16PM +0000, Andrew Cooper wrote:
> > > On 13/02/2025 8:28 pm, Kees Cook wrote:
> > > > On Thu, Feb 13, 2025 at 01:31:30AM +0000, Andrew Cooper wrote:
> > > >>>> Assuming this is an issue you all feel is worth addressing, I will
> > > >>>> continue working on providing a patch. I'm concerned though that the
> > > >>>> overhead from adding a wrmsr on both syscall entry and exit to
> > > >>>> overwrite and restore the KERNEL_GS_BASE MSR may be quite high, so
> > > >>>> any feedback in regards to the approach or suggestions of alternate
> > > >>>> approaches to patching are welcome :)
> > > >>> Since the kernel, as far as I understand, uses FineIBT without
> > > >>> backwards control flow protection (in other words, I think we assume
> > > >>> that the kernel stack is trusted?),
> > > >> This is fun indeed.  Linux cannot use supervisor shadow stacks because
> > > >> the mess around NMI re-entrancy (and IST more generally) requires ROP
> > > >> gadgets in order to function safely.  Implementing this with shadow
> > > >> stacks active, while not impossible, is deemed to be prohibitively
> > > >> complicated.
> > > > And just validate my understanding here, this attack is fundamentally
> > > > about FineIBT, not regular CFI (IBT or not), as the validation of target
> > > > addresses is done at indirect call time, yes?
> > >
> > > I'm not sure I'd classify it like that.  As a pivot primitive, it works
> > > very widely.
> > >
> > > FineIBT (more specifically any hybrid CFI scheme which includes CET-IBT)
> > > relies on hardware to do the course grain violation detection, and some
> > > software hash for fine grain violation detection.
> > >
> > > In this case, the requirement for the SYSCALL entrypoint to have an
> > > ENDBR64 instruction means it passes the CET-IBT check (does not yield
> > > #CP), and then lacks the software hash check as well.
> > >
> > > i.e. this renders FineIBT (and other hybrid CFI schemes) rather moot,
> > > because one hole is all the attacker needs to win, if they can control a
> > > function pointer / return address.  At which point it's a large overhead
> > > for no security benefit over simple CET-IBT.
> >
> > Right, the "if they can control a function pointer" is the part I'm
> > focusing on. This attack depends on making an indirect call with a
> > controlled pointer. Non-FineIBT CFI will protect against that step,
> > so I think this is only an issue for IBT-only and FineIBT, but not CFI
> > nor CFI+IBT.
> 
> To me, "CFI" is really just a fairly abstract concept; are you talking
> specifically about the Clang scheme from
> <https://clang.llvm.org/docs/ControlFlowIntegrityDesign.html>, or
> something else?

Ah, sorry, I mean KCFI (and note that FineIBT is a run-time alternatives
pass that transforms the "stock" KCFI):

https://clang.llvm.org/docs/ControlFlowIntegrity.html#fsanitize-kcfi
https://lpc.events/event/16/contributions/1315/
https://www.youtube.com/watch?v=bmv6blX_F_g

-Kees

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-15 21:07             ` Peter Zijlstra
@ 2025-02-16 23:51               ` Kees Cook
  2025-02-17 10:39                 ` Peter Zijlstra
  2025-02-17 13:06               ` David Laight
  1 sibling, 1 reply; 40+ messages in thread
From: Kees Cook @ 2025-02-16 23:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Cooper, jannh, jmill, joao, linux-hardening, linux-kernel,
	luto, samitolvanen, scott.d.constable, x86

On Sat, Feb 15, 2025 at 10:07:29PM +0100, Peter Zijlstra wrote:
> On Fri, Feb 14, 2025 at 10:57:51AM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 13, 2025 at 12:53:28PM -0800, Kees Cook wrote:
> > 
> > > Right, the "if they can control a function pointer" is the part I'm
> > > focusing on. This attack depends on making an indirect call with a
> > > controlled pointer. Non-FineIBT CFI will protect against that step,
> > > so I think this is only an issue for IBT-only and FineIBT, but not CFI
> > > nor CFI+IBT.
> > 
> > Yes, the whole caller side validation should stop this.
> 
> And I think we can retro-fit that in FineIBT. Notably the current call
> sites look like:
> 
> 0000000000000060 <fineibt_caller>:
>   60:   41 ba 78 56 34 12       mov    $0x12345678,%r10d
>   66:   49 83 eb 10             sub    $0x10,%r11
>   6a:   0f 1f 40 00             nopl   0x0(%rax)
>   6e:   41 ff d3                call   *%r11
>   71:   0f 1f 00                nopl   (%rax)
> 
> Of which the last 6 bytes are the retpoline site (starting at 0x6e). It
> is trivially possible to re-arrange things to have both nops next to one
> another, giving us 7 bytes to muck about with.
> 
> And I think we can just about manage to do a caller side hash validation
> in them bytes like:
> 
> 0000000000000080 <fineibt_paranoid>:
>   80:   41 ba 78 56 34 12       mov    $0x12345678,%r10d
>   86:   49 83 eb 10             sub    $0x10,%r11
>   8a:   45 3b 53 07             cmp    0x7(%r11),%r10d
>   8e:   74 01                   je     91 <fineibt_paranoid+0x11>
>   90:   ea                      (bad)
>   91:   41 ff d3                call   *%r11

Ah nice! Yes, that would be great and removes all my concerns about
FineIBT. :) (And you went with EA just to distinguish it more easily?
Can't we still use the UD2 bug tables to find this like normal?)

> And while this is somewhat daft, it would close the hole vs this entry
> point swizzle afaict, no?
> 
> Patch against tip/x86/core (which includes the latest ibt bits as per
> this morning).
> 
> Boots and builds the next kernel on my ADL.

Lovely! Based on the patch, I assume you were testing CFI crash location
reporting too?

I'll try to get this spun up for testing here too.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-16 23:51               ` Kees Cook
@ 2025-02-17 10:39                 ` Peter Zijlstra
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2025-02-17 10:39 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Cooper, jannh, jmill, joao, linux-hardening, linux-kernel,
	luto, samitolvanen, scott.d.constable, x86

On Sun, Feb 16, 2025 at 03:51:27PM -0800, Kees Cook wrote:
> On Sat, Feb 15, 2025 at 10:07:29PM +0100, Peter Zijlstra wrote:
> > On Fri, Feb 14, 2025 at 10:57:51AM +0100, Peter Zijlstra wrote:
> > > On Thu, Feb 13, 2025 at 12:53:28PM -0800, Kees Cook wrote:
> > > 
> > > > Right, the "if they can control a function pointer" is the part I'm
> > > > focusing on. This attack depends on making an indirect call with a
> > > > controlled pointer. Non-FineIBT CFI will protect against that step,
> > > > so I think this is only an issue for IBT-only and FineIBT, but not CFI
> > > > nor CFI+IBT.
> > > 
> > > Yes, the whole caller side validation should stop this.
> > 
> > And I think we can retro-fit that in FineIBT. Notably the current call
> > sites look like:
> > 
> > 0000000000000060 <fineibt_caller>:
> >   60:   41 ba 78 56 34 12       mov    $0x12345678,%r10d
> >   66:   49 83 eb 10             sub    $0x10,%r11
> >   6a:   0f 1f 40 00             nopl   0x0(%rax)
> >   6e:   41 ff d3                call   *%r11
> >   71:   0f 1f 00                nopl   (%rax)
> > 
> > Of which the last 6 bytes are the retpoline site (starting at 0x6e). It
> > is trivially possible to re-arrange things to have both nops next to one
> > another, giving us 7 bytes to muck about with.
> > 
> > And I think we can just about manage to do a caller side hash validation
> > in them bytes like:
> > 
> > 0000000000000080 <fineibt_paranoid>:
> >   80:   41 ba 78 56 34 12       mov    $0x12345678,%r10d
> >   86:   49 83 eb 10             sub    $0x10,%r11
> >   8a:   45 3b 53 07             cmp    0x7(%r11),%r10d
> >   8e:   74 01                   je     91 <fineibt_paranoid+0x11>
> >   90:   ea                      (bad)
> >   91:   41 ff d3                call   *%r11
> 
> Ah nice! Yes, that would be great and removes all my concerns about
> FineIBT. :) 

Excellent!

> (And you went with EA just to distinguish it more easily?
> Can't we still use the UD2 bug tables to find this like normal?)

No space; UD2 is a 2 byte instruction. IIUC all the single byte
instructions that trip #UD are more or less 'reserved' and we shouldn't
be using them, but I think we can use 0xEA here since it is specific to
the paranoid FineIBT thing -- and if people want to reclaim the usage,
all they need to do is fix IBT :-) -- which as I said before should be
done once FRED happens.

(/me makes note to go read the very latest FRED spec -- its been a
while).

> > And while this is somewhat daft, it would close the hole vs this entry
> > point swizzle afaict, no?
> > 
> > Patch against tip/x86/core (which includes the latest ibt bits as per
> > this morning).
> > 
> > Boots and builds the next kernel on my ADL.
> 
> Lovely! Based on the patch, I assume you were testing CFI crash location
> reporting too?

Sami was, he reminded me I forgot to hook up FineIBT, so I did :-)

> I'll try to get this spun up for testing here too.

Thanks!

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-15 21:07             ` Peter Zijlstra
  2025-02-16 23:51               ` Kees Cook
@ 2025-02-17 13:06               ` David Laight
  2025-02-17 13:13                 ` Peter Zijlstra
  1 sibling, 1 reply; 40+ messages in thread
From: David Laight @ 2025-02-17 13:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kees Cook, Andrew Cooper, jannh, jmill, joao, linux-hardening,
	linux-kernel, luto, samitolvanen, scott.d.constable, x86

On Sat, 15 Feb 2025 22:07:29 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Feb 14, 2025 at 10:57:51AM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 13, 2025 at 12:53:28PM -0800, Kees Cook wrote:
> >   
> > > Right, the "if they can control a function pointer" is the part I'm
> > > focusing on. This attack depends on making an indirect call with a
> > > controlled pointer. Non-FineIBT CFI will protect against that step,
> > > so I think this is only an issue for IBT-only and FineIBT, but not CFI
> > > nor CFI+IBT.  
> > 
> > Yes, the whole caller side validation should stop this.  
> 
> And I think we can retro-fit that in FineIBT. Notably the current call
> sites look like:
> 
> 0000000000000060 <fineibt_caller>:
>   60:   41 ba 78 56 34 12       mov    $0x12345678,%r10d
>   66:   49 83 eb 10             sub    $0x10,%r11
>   6a:   0f 1f 40 00             nopl   0x0(%rax)
>   6e:   41 ff d3                call   *%r11
>   71:   0f 1f 00                nopl   (%rax)

I tried building a fineibt kernel (without LTO) and that isn't what I
see in the object files.
(I not trying to run it, just do some analysis.)
While the call targets have a 16 byte preamble it is all nops apart
from a final 'mov $hash,%rax'.
The call site loads $-hash and adds -4(target) and checks for zero.
It is too small to be patchable into the above.

There are far too many TLA (and ETLA) to follow all the options. 

I did notice that although objtool seems to have code to remove 'spare'
endbra, the 'mov %rax,$hash' was present on all external functions.
Some 1600 are void fn(void) - there are high counts of others.


> Of which the last 6 bytes are the retpoline site (starting at 0x6e). It
> is trivially possible to re-arrange things to have both nops next to one
> another, giving us 7 bytes to muck about with.
> 
> And I think we can just about manage to do a caller side hash validation
> in them bytes like:
> 
> 0000000000000080 <fineibt_paranoid>:
>   80:   41 ba 78 56 34 12       mov    $0x12345678,%r10d
>   86:   49 83 eb 10             sub    $0x10,%r11
>   8a:   45 3b 53 07             cmp    0x7(%r11),%r10d
>   8e:   74 01                   je     91 <fineibt_paranoid+0x11>
>   90:   ea                      (bad)
>   91:   41 ff d3                call   *%r11
> 
> And while this is somewhat daft, it would close the hole vs this entry
> point swizzle afaict, no?

Doesn't it have the problem that it includes the value of the hash?
So you can arrange to jump directly into the sequence itself.

	David


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-17 13:06               ` David Laight
@ 2025-02-17 13:13                 ` Peter Zijlstra
  2025-02-17 18:38                   ` David Laight
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2025-02-17 13:13 UTC (permalink / raw)
  To: David Laight
  Cc: Kees Cook, Andrew Cooper, jannh, jmill, joao, linux-hardening,
	linux-kernel, luto, samitolvanen, scott.d.constable, x86

On Mon, Feb 17, 2025 at 01:06:29PM +0000, David Laight wrote:
> On Sat, 15 Feb 2025 22:07:29 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Fri, Feb 14, 2025 at 10:57:51AM +0100, Peter Zijlstra wrote:
> > > On Thu, Feb 13, 2025 at 12:53:28PM -0800, Kees Cook wrote:
> > >   
> > > > Right, the "if they can control a function pointer" is the part I'm
> > > > focusing on. This attack depends on making an indirect call with a
> > > > controlled pointer. Non-FineIBT CFI will protect against that step,
> > > > so I think this is only an issue for IBT-only and FineIBT, but not CFI
> > > > nor CFI+IBT.  
> > > 
> > > Yes, the whole caller side validation should stop this.  
> > 
> > And I think we can retro-fit that in FineIBT. Notably the current call
> > sites look like:
> > 
> > 0000000000000060 <fineibt_caller>:
> >   60:   41 ba 78 56 34 12       mov    $0x12345678,%r10d
> >   66:   49 83 eb 10             sub    $0x10,%r11
> >   6a:   0f 1f 40 00             nopl   0x0(%rax)
> >   6e:   41 ff d3                call   *%r11
> >   71:   0f 1f 00                nopl   (%rax)
> 
> I tried building a fineibt kernel (without LTO) and that isn't what I
> see in the object files.
> (I not trying to run it, just do some analysis.)
> While the call targets have a 16 byte preamble it is all nops apart
> from a final 'mov $hash,%rax'.
> The call site loads $-hash and adds -4(target) and checks for zero.
> It is too small to be patchable into the above.

Right after that comes the retpoline site, which is another 6 bytes
(assuming you have indirect-branch-cs-prefix, which all kCFI enabled
compilers should have).

You need to go read arch/x86/kernel/alternative.c search for FineIBT


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-17 13:13                 ` Peter Zijlstra
@ 2025-02-17 18:38                   ` David Laight
  2025-02-17 18:54                     ` Peter Zijlstra
  0 siblings, 1 reply; 40+ messages in thread
From: David Laight @ 2025-02-17 18:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kees Cook, Andrew Cooper, jannh, jmill, joao, linux-hardening,
	linux-kernel, luto, samitolvanen, scott.d.constable, x86

On Mon, 17 Feb 2025 14:13:21 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Feb 17, 2025 at 01:06:29PM +0000, David Laight wrote:
> > On Sat, 15 Feb 2025 22:07:29 +0100
> > Peter Zijlstra <peterz@infradead.org> wrote:
> >   
> > > On Fri, Feb 14, 2025 at 10:57:51AM +0100, Peter Zijlstra wrote:  
> > > > On Thu, Feb 13, 2025 at 12:53:28PM -0800, Kees Cook wrote:
> > > >     
> > > > > Right, the "if they can control a function pointer" is the part I'm
> > > > > focusing on. This attack depends on making an indirect call with a
> > > > > controlled pointer. Non-FineIBT CFI will protect against that step,
> > > > > so I think this is only an issue for IBT-only and FineIBT, but not CFI
> > > > > nor CFI+IBT.    
> > > > 
> > > > Yes, the whole caller side validation should stop this.    
> > > 
> > > And I think we can retro-fit that in FineIBT. Notably the current call
> > > sites look like:
> > > 
> > > 0000000000000060 <fineibt_caller>:
> > >   60:   41 ba 78 56 34 12       mov    $0x12345678,%r10d
> > >   66:   49 83 eb 10             sub    $0x10,%r11
> > >   6a:   0f 1f 40 00             nopl   0x0(%rax)
> > >   6e:   41 ff d3                call   *%r11
> > >   71:   0f 1f 00                nopl   (%rax)  
> > 
> > I tried building a fineibt kernel (without LTO) and that isn't what I
> > see in the object files.
> > (I not trying to run it, just do some analysis.)
> > While the call targets have a 16 byte preamble it is all nops apart
> > from a final 'mov $hash,%rax'.
> > The call site loads $-hash and adds -4(target) and checks for zero.
> > It is too small to be patchable into the above.  
> 
> Right after that comes the retpoline site, which is another 6 bytes
> (assuming you have indirect-branch-cs-prefix, which all kCFI enabled
> compilers should have).

I'm building with clang 18.1.18 - should be new enough.
I may not have retpolines enabled, a typical call site is (from vmlinux.o):
    3628:       48 89 c6                mov    %rax,%rsi
    362b:       41 ba 83 c5 2c af       mov    $0xaf2cc583,%r10d
    3631:       44 03 51 fc             add    -0x4(%rcx),%r10d
    3635:       74 02                   je     3639 <vc_handle_exitcode+0x739>
    3637:       0f 0b                   ud2
    3639:       ff d1                   call   *%rcx
    363b:       4c 89 f6                mov    %r14,%rsi

That one has three targets, one is:
000000000008a5c0 <__cfi_kvm_sev_es_hcall_prepare>:
   8a5c0:       90                      nop
   8a5c1:       90                      nop
   8a5c2:       90                      nop    
   8a5c3:       90                      nop    
   8a5c4:       90                      nop    
   8a5c5:       90                      nop    
   8a5c6:       90                      nop    
   8a5c7:       90                      nop    
   8a5c8:       90                      nop    
   8a5c9:       90                      nop    
   8a5ca:       90                      nop
   8a5cb:       b8 7d 3a d3 50          mov    $0x50d33a7d,%eax
    
000000000008a5d0 <kvm_sev_es_hcall_prepare>:
   8a5d0:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1) 8a5d1: R_X86_64_NONE    __fentry__-0x4
   8a5d5:       48 8b 46 28             mov    0x28(%rsi),%rax

I think that if I had endbra enabled objtool would remove them from non-exported
functions whose address isn't taken.
But none of the 'mov $hash,%eax' get removed - and I think they should suffer
the same fate.

I'm not sure why I don't have endbra though.
I did remove a lot of the mitigations from the config I copied to add the caller
side fineibt (I think) hash checks.
After all this is a local system I want to run fast, not a semi-public one
someone might try to hack.

> You need to go read arch/x86/kernel/alternative.c search for FineIBT

I found some stuff in one of the docs.
Didn't read that bit of source.

What I was hoping to obtain was a list of the valid target functions for
each indirect call site.
With the stack offset of the call (which objtool knows) and a lot of 'shaking'
an real estimate of max stack depth can be determined.
(and recursive loops found.)

	David
 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-17 18:38                   ` David Laight
@ 2025-02-17 18:54                     ` Peter Zijlstra
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2025-02-17 18:54 UTC (permalink / raw)
  To: David Laight
  Cc: Kees Cook, Andrew Cooper, jannh, jmill, joao, linux-hardening,
	linux-kernel, luto, samitolvanen, scott.d.constable, x86

On Mon, Feb 17, 2025 at 06:38:27PM +0000, David Laight wrote:

> I may not have retpolines enabled, a typical call site is (from vmlinux.o):

Make sure CONFIG_FINEIBT=y, otherwise there is no point in talking about
this. This requires KERNEL_IBT=y RETPOLINE=y CALL_PADDING=y CFI_CLANG=y.

Then look at arch/x86/include/asm/cfi.h and make sure to read the
comment, and then read arch/x86/kernel/alternative.c:__apply_fineibt().

Which ever way around you're going to turn this, you'll never find the
fineibt code in the object files.



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-13  2:42       ` Andrew Cooper
@ 2025-02-22 20:43         ` Rudolf Marek
  2025-02-25 18:10           ` Andrew Cooper
  2025-02-28 12:13         ` Florian Weimer
  1 sibling, 1 reply; 40+ messages in thread
From: Rudolf Marek @ 2025-02-22 20:43 UTC (permalink / raw)
  To: Andrew Cooper, Jann Horn
  Cc: jmill, joao, luto, samitolvanen, Peter Zijlstra (Intel),
	linux-hardening, lkml, x86 maintainers

Hi,

Dne 13. 02. 25 v 3:42 Andrew Cooper napsal(a):
> The SYSCALL behaviour TLDR is:
> 
>      %rcx = %rip
>      %r11 = %eflags
>      %cs = fixed attr
>      %ss = fixed attr
>      %rip = MSR_LSTAR
> 
> which means that %rcx (old rip) is the only piece of state which
> userspace can't feasibly forge (and therefore could distinguish a
> SYSCALL from user vs kernel mode), yet if we're talking about a JOP
> chain to get here, then %rcx is under attacker control too.

The SYSCALL instruction also provides means to create "incoherent" state of the processor selectors
where the value of selector do not match pre-loaded values in the descriptor caches.

Would it work to have KERNEL_CS as last entry in the GDT table? Therefore executing SYSCALL would set the CS as usual,
but the numeric value of SS selector would be larger than GDT limit?

That would mean that "impossible" selector is loaded into SS if we came from usermode,
but operation with stack would still work as the descriptor caches will be sane.
The "impossible" selector value can be fixed by loading SS with NULL which is cheap.

The check in hotpath could maybe use VERR %SS which would fail because of GDT limit is reached. The VERR with mem operand
does not use any GPR!

Or simply check for "impossible" selector would work if we misuse zeros in high32 of R11 (usermode rflags) maybe like:

entry:
endbr64
rol $32, %r11
movw %ss, %r11w
cmpw $IMPOSSIBLE_SEL, %r11w
jnz panic
; load null to SS, fix R11 and pretend above never happened

If attacker would execute SYSCALL in the kernel, likely we could check the %RCX if it is OK or not?

Bit variation to this "theme" would be to have SYSCALL SS GDT entry still in the GDT but set as "not present".
Another brainstorm idea would be to misuse RFLAGS.ID and clear it in MSR FMASK but run kernel or most of it with RFLAGS.ID set.
I don't know what is the threat model you are trying to fix.

Lets fight x86 insanity with yet another x86 insanity - I think it is fair.

I hope above helps or at least I will learn why not if I overseen something obvious!

I tried to CC all the lists. I'm not subscribed.

Thanks,
Rudolf

PS: I'm leaving as an exercise to a reader NMI and #MC handling!


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-22 20:43         ` Rudolf Marek
@ 2025-02-25 18:10           ` Andrew Cooper
  2025-02-25 20:06             ` Rudolf Marek
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2025-02-25 18:10 UTC (permalink / raw)
  To: Rudolf Marek, Jann Horn
  Cc: jmill, joao, luto, samitolvanen, Peter Zijlstra (Intel),
	linux-hardening, lkml, x86 maintainers

On 22/02/2025 8:43 pm, Rudolf Marek wrote:
> Hi,
>
> Dne 13. 02. 25 v 3:42 Andrew Cooper napsal(a):
>> The SYSCALL behaviour TLDR is:
>>
>>      %rcx = %rip
>>      %r11 = %eflags
>>      %cs = fixed attr
>>      %ss = fixed attr
>>      %rip = MSR_LSTAR
>>
>> which means that %rcx (old rip) is the only piece of state which
>> userspace can't feasibly forge (and therefore could distinguish a
>> SYSCALL from user vs kernel mode), yet if we're talking about a JOP
>> chain to get here, then %rcx is under attacker control too.
>
> The SYSCALL instruction also provides means to create "incoherent"
> state of the processor selectors
> where the value of selector do not match pre-loaded values in the
> descriptor caches.

Very cunning.  Yes it does, but the state needs to be safe to IRET back
to, and ...

> Would it work to have KERNEL_CS as last entry in the GDT table?
> Therefore executing SYSCALL would set the CS as usual,
> but the numeric value of SS selector would be larger than GDT limit?

... this isn't safe.  Any exception/interrupt will yield #SS when trying
to load an out-of-limit %ss.

i.e. a wrongly-timed NMI will take out the system with a very bizarre
looking oops.


You can do this in a less fatal way by e.g. having in-GDT form have a
segment limit, but any exception/interrupt will resync the out-of-sync
state, and break detection.  Also it would make the segment unusable for
compatibility userspace, where the limit would take effect.

Finally, while this potentially gives us an option for SYSCALL and maybe
SYSENTER, it doesn't help with any of the main IDT entrypoints which can
also be attacked.

~Andrew

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-25 18:10           ` Andrew Cooper
@ 2025-02-25 20:06             ` Rudolf Marek
  2025-02-25 21:14               ` Andrew Cooper
  0 siblings, 1 reply; 40+ messages in thread
From: Rudolf Marek @ 2025-02-25 20:06 UTC (permalink / raw)
  To: Andrew Cooper, Jann Horn
  Cc: jmill, joao, luto, samitolvanen, Peter Zijlstra (Intel),
	linux-hardening, lkml, x86 maintainers

Hi Andrew,

Dne 25. 02. 25 v 19:10 Andrew Cooper napsal(a):
> Very cunning.  Yes it does, but the state needs to be safe to IRET back
> to, and ...

... And intellectually very pleasing!

>> Would it work to have KERNEL_CS as last entry in the GDT table?
>> Therefore executing SYSCALL would set the CS as usual,
>> but the numeric value of SS selector would be larger than GDT limit?
> 
> ... this isn't safe.  Any exception/interrupt will yield #SS when trying
> to load an out-of-limit %ss.> i.e. a wrongly-timed NMI will take out the system with a very bizarre
> looking oops.

Hmm I was hoping that "the reader" will perform this NMI/#MC exercise :)

The SYSCALL/SYSENTER startup has interrupts disabled, so it is the problem of NMI/#MC
handler which would need deal with the normal case and attack case.

It would need to check if it was executing that critical part of syscall64 entry
from endbr64 to checkselector section, and if yes, the saved %ss needs to be
"impossible" one. If it isn't -> panic.

For non-attack case it just needs to forward RIP after the check...

> You can do this in a less fatal way by e.g. having in-GDT form have a
> segment limit, but any exception/interrupt will resync the out-of-sync
> state, and break detection.  Also it would make the segment unusable for
> compatibility userspace, where the limit would take effect.

Yeah couldn't figure out what else could work "vice-versa" :(
  
> Finally, while this potentially gives us an option for SYSCALL and maybe
> SYSENTER, it doesn't help with any of the main IDT entrypoints which can
> also be attacked.

I see, sorry I wasn't aware of this. But if I recall correctly only "paranoid"
IDT entries do something with swapgs. But is there also some stack pivot where
it would depend on GS? Or is it somewhat unrelated issue, that you might just
redirect to "any endbr64" which are IDT entrypoints?

Maybe you can share some details of how the attack would work in this case,
or point me somewhere where I can read about it.

If it is "any endbr64" case, would it work to just do "sanity check" of the exception stackframe?

I mean if it is real or some random kernel stack state?

1) check %RSP alignment if it was ok
2) check if %ss and %cs for all possible valid values (16 bit)

Unfortunately I think intel is not clearing high 48 bits of saved selector, AMD is.

3) check if %rip is kernel range
4) check if %rflags is sane (bit1 is 1)

Because if the attacker has no or limited control on the stack content, it would be difficult to fake it.

Thanks,
Rudolf


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-25 20:06             ` Rudolf Marek
@ 2025-02-25 21:14               ` Andrew Cooper
  2025-02-26  2:55                 ` Kees Cook
  2025-02-26 22:48                 ` Rudolf Marek
  0 siblings, 2 replies; 40+ messages in thread
From: Andrew Cooper @ 2025-02-25 21:14 UTC (permalink / raw)
  To: Rudolf Marek, Jann Horn
  Cc: jmill, joao, luto, samitolvanen, Peter Zijlstra (Intel),
	linux-hardening, lkml, x86 maintainers

On 25/02/2025 8:06 pm, Rudolf Marek wrote:
> Hi Andrew,
>
> Dne 25. 02. 25 v 19:10 Andrew Cooper napsal(a):
>> Very cunning.  Yes it does, but the state needs to be safe to IRET back
>> to, and ...
>
> ... And intellectually very pleasing!
>
>>> Would it work to have KERNEL_CS as last entry in the GDT table?
>>> Therefore executing SYSCALL would set the CS as usual,
>>> but the numeric value of SS selector would be larger than GDT limit?
>>
>> ... this isn't safe.  Any exception/interrupt will yield #SS when trying
>> to load an out-of-limit %ss.> i.e. a wrongly-timed NMI will take out
>> the system with a very bizarre
>> looking oops.
>
> Hmm I was hoping that "the reader" will perform this NMI/#MC exercise :)

As stand-in for "the reader", I'll point out that you need to add #DB to
that list or you're in for a rude surprise when running the x86 selftests.

>
> The SYSCALL/SYSENTER startup has interrupts disabled, so it is the
> problem of NMI/#MC
> handler which would need deal with the normal case and attack case.

Right, but in the case of the attack, regular interrupts are most likely
enabled too.  And writing this has just caused me to realise a
yet-more-fun case.

An interrupt hitting the syscall entry path (prior to SWAPGS) will cause
the interrupt handler's CPL check and conditional SWAPGS to do the wrong
thing and switch onto the user GS base too.  (Prior research e.g.
GhostRace has shown how to get an hrtimer to reliably hit an instruction
boundary.)

i.e. you'd need paranoid_entry on every vector, not just the IST ones.

>
> It would need to check if it was executing that critical part of
> syscall64 entry
> from endbr64 to checkselector section, and if yes, the saved %ss needs
> to be
> "impossible" one. If it isn't -> panic.
>
> For non-attack case it just needs to forward RIP after the check...
>
>> You can do this in a less fatal way by e.g. having in-GDT form have a
>> segment limit, but any exception/interrupt will resync the out-of-sync
>> state, and break detection.  Also it would make the segment unusable for
>> compatibility userspace, where the limit would take effect.
>
> Yeah couldn't figure out what else could work "vice-versa" :(
>  
>> Finally, while this potentially gives us an option for SYSCALL and maybe
>> SYSENTER, it doesn't help with any of the main IDT entrypoints which can
>> also be attacked.
>
> I see, sorry I wasn't aware of this. But if I recall correctly only
> "paranoid"
> IDT entries do something with swapgs. But is there also some stack
> pivot where
> it would depend on GS? Or is it somewhat unrelated issue, that you
> might just
> redirect to "any endbr64" which are IDT entrypoints?
>
> Maybe you can share some details of how the attack would work in this
> case,
> or point me somewhere where I can read about it.
>
> If it is "any endbr64" case, would it work to just do "sanity check"
> of the exception stackframe?

The problem is type confusion.  Because ENDBR marks both the regular
function callees, and the system entrypoints (256*IDT + 2*SYSCALL +
SYSENTER), a function pointer corrupted to refer to a system entrypoint
will pass the CET-IBT check and not yield #CP.

All entrypoints then conditionally (IDT) or unconditionally
(SYSCALL/SYSENTER) SWAPGS.  For the attack case, this switches back onto
the user gs base.

Interrupts and exceptions look at %cs in the IRET frame to judge whether
to SWAPGS or not (and this is one of the main things that paranoid_entry
does differently).  In the case of the attack, there's no IRET frame
pushed on the stack and the read of %cs is out-of-bounds, most likely
the stack frame of the function which followed the corrupt function pointer.

The SYSCALL entrypoint is simply the easiest to pivot on, but all can be
attacked in this manner.  Fixing only the SYSCALL entrypoint doesn't
improve things much.

Peter Zijlstra has added a FineIBT=paranoid mode which performs the hash
check ahead of calling the function pointer, which ought to mitigate
this but at even higher overhead.

~Andrew

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-25 21:14               ` Andrew Cooper
@ 2025-02-26  2:55                 ` Kees Cook
  2025-02-26 22:48                 ` Rudolf Marek
  1 sibling, 0 replies; 40+ messages in thread
From: Kees Cook @ 2025-02-26  2:55 UTC (permalink / raw)
  To: Andrew Cooper, Rudolf Marek, Jann Horn
  Cc: jmill, joao, luto, samitolvanen, Peter Zijlstra (Intel),
	linux-hardening, lkml, x86 maintainers



On February 25, 2025 1:14:01 PM PST, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>Peter Zijlstra has added a FineIBT=paranoid mode which performs the hash
>check ahead of calling the function pointer, which ought to mitigate
>this but at even higher overhead.

Was kCFI vs FineIBT perf ever measured? Is the assumption of higher overhead based on kCFI filling dcache in addition to icache, whereas FineIBT only fills icache?

-Kees

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-25 21:14               ` Andrew Cooper
  2025-02-26  2:55                 ` Kees Cook
@ 2025-02-26 22:48                 ` Rudolf Marek
  2025-02-27  0:41                   ` Andrew Cooper
  1 sibling, 1 reply; 40+ messages in thread
From: Rudolf Marek @ 2025-02-26 22:48 UTC (permalink / raw)
  To: Andrew Cooper, Jann Horn
  Cc: jmill, joao, luto, samitolvanen, Peter Zijlstra (Intel),
	linux-hardening, lkml, x86 maintainers

Hi Andrew,

Dne 25. 02. 25 v 22:14 Andrew Cooper napsal(a):
> As stand-in for "the reader", I'll point out that you need to add #DB to
> that list or you're in for a rude surprise when running the x86 selftests.

Thanks for pointing this out. I forgot about the interrupt shadow on SYSCALL
and possibly some breakpoints possibilities in the kernel.

>> The SYSCALL/SYSENTER startup has interrupts disabled, so it is the
>> problem of NMI/#MC
>> handler which would need deal with the normal case and attack case.
> 
> Right, but in the case of the attack, regular interrupts are most likely
> enabled too.  And writing this has just caused me to realise a
> yet-more-fun case.
> An interrupt hitting the syscall entry path (prior to SWAPGS) will cause
> the interrupt handler's CPL check and conditional SWAPGS to do the wrong
> thing and switch onto the user GS base too.  (Prior research e.g.
> GhostRace has shown how to get an hrtimer to reliably hit an instruction
> boundary.)

I don't see it, because if attacker starts at syscall entry and interrupts are 
enabled and the interrupt happens right there the handler will just see proper 
IRET frame with %cs of kernel and will not perform swapgs. I will try to think 
about it again tomorrow I likely missed something.

> Interrupts and exceptions look at %cs in the IRET frame to judge whether
> to SWAPGS or not (and this is one of the main things that paranoid_entry
> does differently).  In the case of the attack, there's no IRET frame
> pushed on the stack and the read of %cs is out-of-bounds, most likely
> the stack frame of the function which followed the corrupt function pointer.

Thank you for your detailed explanation.

> The SYSCALL entrypoint is simply the easiest to pivot on, but all can be
> attacked in this manner.  Fixing only the SYSCALL entrypoint doesn't
> improve things much.

Maybe more elegant and cheap check on IDT entry "authenticity" would be to check 
for current %ss which needs to be NULL and possibly check the %CS on stack frame
by checking kernel %cs and not just two CPL bits and/or perform more checks.

Another ideas if you think it is still worth to discuss this topic:

What about to use completely different %CS selector for all entry code? The 
early entry code would check the %cs selector and panic if it is wrong one.

After swapgs dance, we need to perform far jump to normal kernel %CS, which 
might cost something.

To fix the interrupt on fake entry problem, we could check in relevant IDT 
handlers that we never come from "completely different" %CS used above for the 
early entry code.

And very last idea would be to somehow persuade the Last Branch Recording to 
record exception entries only and just check it from MSR. But maybe it is too 
costly and/or not possible.

Thanks,
Rudolf



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-26 22:48                 ` Rudolf Marek
@ 2025-02-27  0:41                   ` Andrew Cooper
  2025-03-01 22:48                     ` Rudolf Marek
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2025-02-27  0:41 UTC (permalink / raw)
  To: Rudolf Marek, Jann Horn
  Cc: jmill, joao, luto, samitolvanen, Peter Zijlstra (Intel),
	linux-hardening, lkml, x86 maintainers

On 26/02/2025 10:48 pm, Rudolf Marek wrote:
> Hi Andrew,
>
> Dne 25. 02. 25 v 22:14 Andrew Cooper napsal(a):
>> As stand-in for "the reader", I'll point out that you need to add #DB to
>> that list or you're in for a rude surprise when running the x86
>> selftests.
>
> Thanks for pointing this out. I forgot about the interrupt shadow on
> SYSCALL
> and possibly some breakpoints possibilities in the kernel.

Isn't x86 lovely.  This is yet another thing fixed in FRED; a CPL change
cancels pending_dbg.

>
>>> The SYSCALL/SYSENTER startup has interrupts disabled, so it is the
>>> problem of NMI/#MC
>>> handler which would need deal with the normal case and attack case.
>>
>> Right, but in the case of the attack, regular interrupts are most likely
>> enabled too.  And writing this has just caused me to realise a
>> yet-more-fun case.
>> An interrupt hitting the syscall entry path (prior to SWAPGS) will cause
>> the interrupt handler's CPL check and conditional SWAPGS to do the wrong
>> thing and switch onto the user GS base too.  (Prior research e.g.
>> GhostRace has shown how to get an hrtimer to reliably hit an instruction
>> boundary.)
>
> I don't see it, because if attacker starts at syscall entry and
> interrupts are enabled and the interrupt happens right there the
> handler will just see proper IRET frame with %cs of kernel and will
> not perform swapgs. I will try to think about it again tomorrow I
> likely missed something.

Nope, you're correct.  I meant (after the SWAPGS).

The linear sequence of actions is:

* Follow bad fnptr to the SYSCALL entry
* SWAPGS (now on user gs)
* Interrupt. Handler sees %cs == kernel, so doesn't SWAPGS again
* Interrupt handler runs fully on user gs.

>
>> Interrupts and exceptions look at %cs in the IRET frame to judge whether
>> to SWAPGS or not (and this is one of the main things that paranoid_entry
>> does differently).  In the case of the attack, there's no IRET frame
>> pushed on the stack and the read of %cs is out-of-bounds, most likely
>> the stack frame of the function which followed the corrupt function
>> pointer.
>
> Thank you for your detailed explanation.
>
>> The SYSCALL entrypoint is simply the easiest to pivot on, but all can be
>> attacked in this manner.  Fixing only the SYSCALL entrypoint doesn't
>> improve things much.
>
> Maybe more elegant and cheap check on IDT entry "authenticity" would
> be to check for current %ss which needs to be NULL and possibly check
> the %CS on stack frame
> by checking kernel %cs and not just two CPL bits and/or perform more
> checks.
>
> Another ideas if you think it is still worth to discuss this topic:
>
> What about to use completely different %CS selector for all entry
> code? The early entry code would check the %cs selector and panic if
> it is wrong one.
>
> After swapgs dance, we need to perform far jump to normal kernel %CS,
> which might cost something.
>
> To fix the interrupt on fake entry problem, we could check in relevant
> IDT handlers that we never come from "completely different" %CS used
> above for the early entry code.

Ooh, this looks promising.

For IDT it's quite easy.  Have a separate DPL0 %cs in the GDT, and write
it into the IDT.

For SYSCALL/SYSENTER it's a little more complicated.  I think you want
to move the selectors so they don't alias __KERN_CS directly, so you can
then move back to __KERN_CS in a similar way.

Give or take paranoid_entry for the IST vectors, any entrypoint that
finds itself on __KERN_CS did not get there through the CPU loading a
new context.

It would depend on an attacker not being able to include a FAR CALL into
their exploit chain, or be able toe write the IDT.  I don't know how
reasonable that would be if we're ruling out all architectural paths not
beginning with an ENDBR, but FAR CALLs are rare in general owing to them
being dog slow in general, and an attacker who can write the IDT doesn't
need these kinds of games to pivot.

We do need at least one scratch register to check %cs.  For IDT and
SYSENTER entries, we can reasonably well spill to the stack (again, an
attacker that can modify the stack has won without playing these games),
and for SYSCALL, we can use the low part of %r11 as you already
demonstrated.

Anyone fancy doing a prototype of this?

>
> And very last idea would be to somehow persuade the Last Branch
> Recording to record exception entries only and just check it from MSR.
> But maybe it is too costly and/or not possible.

This doesn't cover all cases, I don't think.  It also won't work under
virt, where LBR isn't reliably available.  Also LBR is reasonably full
of errata, and quite slow.

Also VMX clears it unilaterally on vmexit, and at least we don't have an
ENDBR in that path to worry about.

~Andrew

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-13  2:42       ` Andrew Cooper
  2025-02-22 20:43         ` Rudolf Marek
@ 2025-02-28 12:13         ` Florian Weimer
  1 sibling, 0 replies; 40+ messages in thread
From: Florian Weimer @ 2025-02-28 12:13 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Jann Horn, jmill, joao, kees, linux-hardening, linux-kernel, luto,
	samitolvanen, Peter Zijlstra (Intel)

* Andrew Cooper:

> The SYSCALL behaviour TLDR is:
>
>     %rcx = %rip
>     %r11 = %eflags
>     %cs = fixed attr
>     %ss = fixed attr
>     %rip = MSR_LSTAR
>
> which means that %rcx (old rip) is the only piece of state which
> userspace can't feasibly forge (and therefore could distinguish a
> SYSCALL from user vs kernel mode), yet if we're talking about a JOP
> chain to get here, then %rcx is under attacker control too.

Will the syscall handler do anything useful if called with an invalid
system call number?

If not, and if you can changed the FineIBT cookie register to %rax,
would that address this particular gap?  As long as the cookies do not
overlap with valid system call numbers?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-02-27  0:41                   ` Andrew Cooper
@ 2025-03-01 22:48                     ` Rudolf Marek
  2025-03-02 19:16                       ` Rudolf Marek
  0 siblings, 1 reply; 40+ messages in thread
From: Rudolf Marek @ 2025-03-01 22:48 UTC (permalink / raw)
  To: Andrew Cooper, Jann Horn
  Cc: jmill, joao, luto, samitolvanen, Peter Zijlstra (Intel),
	linux-hardening, lkml, x86 maintainers

Hi Andrew,

Dne 27. 02. 25 v 1:41 Andrew Cooper napsal(a):
> For SYSCALL/SYSENTER it's a little more complicated.  I think you want
> to move the selectors so they don't alias __KERN_CS directly, so you can
> then move back to __KERN_CS in a similar way

Yes I thought the CHECK_CS could be right before KERN_DS so at least kernel SS 
is right.

> Give or take paranoid_entry for the IST vectors, any entrypoint that
> finds itself on __KERN_CS did not get there through the CPU loading a
> new context.

Yes

> It would depend on an attacker not being able to include a FAR CALL into
> their exploit chain, or be able toe write the IDT.  I don't know how
> reasonable that would be if we're ruling out all architectural paths not
> beginning with an ENDBR, but FAR CALLs are rare in general owing to them
> being dog slow in general, and an attacker who can write the IDT doesn't
> need these kinds of games to pivot.

In fact I wanted to use far jump, but is it OK? On 64-bit architecture, there is 
no absolute direct jump with CS change, only indirect one. Do all CPUs with 
FineIBT somehow reasonably handle all the spectre v2 and various other indirect 
branch speculation problems? To speed it up we can use "fallthrough" speculation 
to our advantage and include the target right after the instruction.

> Anyone fancy doing a prototype of this?

Maybe we can discuss following before, if you find this conversation still 
entertaining :)

1) Implement the different %cs for entry points

Looks non-trivial for an attacker to obtain right %cs before landing on the 
IDT/SYSCALL entrypoints.

Each entrypoint would check if current %cs is __KERN_CHECK_CS, and if not panic. 
Then it would change the %CS back to __KERN_CS via far jump.

I don't know how slow is to do the jump back via far jump.

2) Implement some weaker version of what I was proposing and mostly checking the 
%ss. The attacker would need to control/load %SS before jumping to endbr64 or 
provide a reasonable exception stack

SYSCALL:
- maybe do "cli" to avoid issues with interrupts/nesting
- would use valid but different %ss selector from __KERN_DS
- would check if %ss == __KERN_CHECK_DS, if not panic
- reload %SS with __KERN_DS selector

IDT entrypoints:
- maybe do "cli" to avoid issues with interrupts/nesting
- if %SS == 0, skip other checks because CPL changed (maybe too weak?)
- perform more sanity checks on exception stack maybe in a direction what I 
proposed in other email - depends if it makes attacker life miserable or not
- reload %SS with __KERN_DS selector if CPL changed (maybe needed?)

>> And very last idea would be to somehow persuade the Last Branch
>> Recording to record exception entries only and just check it from MSR.
>> But maybe it is too costly and/or not possible.
> 
> This doesn't cover all cases, I don't think.  It also won't work under
> virt, where LBR isn't reliably available.  Also LBR is reasonably full
> of errata, and quite slow.

OK thanks, it was just an idea.

Thanks,
Rudolf


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-03-01 22:48                     ` Rudolf Marek
@ 2025-03-02 19:16                       ` Rudolf Marek
  2025-03-02 22:31                         ` Andrew Cooper
  0 siblings, 1 reply; 40+ messages in thread
From: Rudolf Marek @ 2025-03-02 19:16 UTC (permalink / raw)
  To: Andrew Cooper, Jann Horn
  Cc: jmill, joao, luto, samitolvanen, Peter Zijlstra (Intel),
	linux-hardening, lkml, x86 maintainers

Dne 01. 03. 25 v 23:48 Rudolf Marek napsal(a):
> I don't know how slow is to do the jump back via far jump.

I did some micro benchmark on Raptorlake platform using other operating system I'm very familiar with.

I added following sequence to the SYSCALL64 entrypoint:

  .balign 16
syscallentry64:
     .byte 0x48
     ljmp *jmpaddr(%rip)
continuehere:
      swapgs
<...>

jmpaddr:
.quad continuehere
.word KERN_OTHER_CS << 3

And well, it is  1.5x slower. Unmodified syscall benchmark took on avg 261 cycles / 104 ns and the one with the indirect jump with %cs change took
386 cycles/ 154 ns.

This whole thing is quite literally a trap next to a trap, because GAS wasn't adding REX.W prefix and somehow complained about ljmpq.

Thanks,
Rudolf


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC] Circumventing FineIBT Via Entrypoints
  2025-03-02 19:16                       ` Rudolf Marek
@ 2025-03-02 22:31                         ` Andrew Cooper
  0 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2025-03-02 22:31 UTC (permalink / raw)
  To: Rudolf Marek, Jann Horn
  Cc: jmill, joao, luto, samitolvanen, Peter Zijlstra (Intel),
	linux-hardening, lkml, x86 maintainers

On 02/03/2025 7:16 pm, Rudolf Marek wrote:
> Dne 01. 03. 25 v 23:48 Rudolf Marek napsal(a):
>> I don't know how slow is to do the jump back via far jump.
>
> I did some micro benchmark on Raptorlake platform using other
> operating system I'm very familiar with.
>
> I added following sequence to the SYSCALL64 entrypoint:
>
>  .balign 16
> syscallentry64:
>     .byte 0x48
>     ljmp *jmpaddr(%rip)
> continuehere:
>      swapgs
> <...>
>
> jmpaddr:
> .quad continuehere
> .word KERN_OTHER_CS << 3
>
> And well, it is  1.5x slower. Unmodified syscall benchmark took on avg
> 261 cycles / 104 ns and the one with the indirect jump with %cs change
> took
> 386 cycles/ 154 ns.
>
> This whole thing is quite literally a trap next to a trap, because GAS
> wasn't adding REX.W prefix and somehow complained about ljmpq.

(I've not finished replying to your other email, but here's one bit
brought forward)

Sadly far jumps and calls are where Intel and AMD CPUs disagree on how
to decode the instruction stream.  Intel CPUs obey REX prefix for
operand size, while AMD do not.  i.e. AMD CPUs cannot far transfer to
kernel addresses, at all.

This is why you only see far returns generally, which do behave the same
between vendors but require a stack.

~Andrew

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2025-03-02 22:31 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <Z60NwR4w/28Z7XUa@ubun>
2025-02-12 22:29 ` [RFC] Circumventing FineIBT Via Entrypoints Jann Horn
2025-02-13  1:31   ` Andrew Cooper
2025-02-13  2:09     ` Jann Horn
2025-02-13  2:42       ` Andrew Cooper
2025-02-22 20:43         ` Rudolf Marek
2025-02-25 18:10           ` Andrew Cooper
2025-02-25 20:06             ` Rudolf Marek
2025-02-25 21:14               ` Andrew Cooper
2025-02-26  2:55                 ` Kees Cook
2025-02-26 22:48                 ` Rudolf Marek
2025-02-27  0:41                   ` Andrew Cooper
2025-03-01 22:48                     ` Rudolf Marek
2025-03-02 19:16                       ` Rudolf Marek
2025-03-02 22:31                         ` Andrew Cooper
2025-02-28 12:13         ` Florian Weimer
2025-02-13 20:28     ` Kees Cook
2025-02-13 20:41       ` Andrew Cooper
2025-02-13 20:53         ` Kees Cook
2025-02-13 20:57           ` Jann Horn
2025-02-16 23:42             ` Kees Cook
2025-02-14  9:57           ` Peter Zijlstra
2025-02-15 21:07             ` Peter Zijlstra
2025-02-16 23:51               ` Kees Cook
2025-02-17 10:39                 ` Peter Zijlstra
2025-02-17 13:06               ` David Laight
2025-02-17 13:13                 ` Peter Zijlstra
2025-02-17 18:38                   ` David Laight
2025-02-17 18:54                     ` Peter Zijlstra
2025-02-14 10:05         ` Peter Zijlstra
2025-02-14  9:54     ` Peter Zijlstra
2025-02-13  6:15   ` Jennifer Miller
2025-02-13 19:23     ` Jann Horn
2025-02-13 21:24       ` Andrew Cooper
2025-02-13 23:24         ` Jennifer Miller
2025-02-13 23:43           ` Jann Horn
2025-02-14 23:06           ` Andrew Cooper
2025-02-15  0:07             ` Jennifer Miller
2025-02-15  0:11               ` Andrew Cooper
2025-02-15  0:19                 ` Jennifer Miller
2025-02-14 22:25       ` Josh Poimboeuf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox