[PATCH 00/14] x86/ibt: FineIBT-BHI

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 00/14] x86/ibt: FineIBT-BHI
@ 2024-09-27 19:48 Peter Zijlstra
  2024-09-27 19:48 ` [PATCH 01/14] x86/cfi: Wreck things Peter Zijlstra
                   ` (13 more replies)
  0 siblings, 14 replies; 60+ messages in thread
From: Peter Zijlstra @ 2024-09-27 19:48 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, peterz, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jpoimboe, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

Hi all,

Since I got back from Vienna sick as a dog, I figured I'd do something simple
like rebase some previous work...

The thing I picked was FineIBT-BHI, an alternative mitigation for the
native-BHI issue, something that I implemented somewhere late last year while
the whole thing was still embargoed.

Specifically the paper mentions that FineIBT should avoid the problem, but
showed that the IBT implementation of current chips falls somewhat short of
the desired semantics and there's still a hole.

This here scheme (by Alyssa?) closes the hole and performs favourably
(Scott?).

The major stumbling block is that it relies on a compiler extension of the
kCFI infrastructure -- and while I did me a llvm POC, a real compiler person
should very much take the idea and do it right (Joao?).

Notably, GCC doesn't currently implement even basic kCFI:

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107048
  https://lore.kernel.org/lkml/20230325081117.93245-1-ashimida.1990@gmail.com/

(an interested GCC person wanting to help finish that patchset and add x86-64
support would be much appreciated)

Anyway.... obviously this wasn't at all simple, nothing ever is.

The first many (10) patches are massaging/hardening the existing (Fine)IBT
implementation and all that should be mergable without fancy new compiler
stuff.

The last few (4) patches, starting with the llvm patch, is the actual
FineIBT-BHI mitigation.

The kernel patches are also available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git x86/ibt
  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git x86/fineibt-bhi

The whole series boots and builds the next kernel on my ADL -- using the custom LLVM.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 01/14] x86/cfi: Wreck things...
  2024-09-27 19:48 [PATCH 00/14] x86/ibt: FineIBT-BHI Peter Zijlstra
@ 2024-09-27 19:48 ` Peter Zijlstra
  2024-09-27 23:15   ` Josh Poimboeuf
  2024-09-27 19:48 ` [PATCH 02/14] x86/boot: Mark start_secondary() with __noendbr Peter Zijlstra
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2024-09-27 19:48 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, peterz, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jpoimboe, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

With the introduction of kCFI the addition of ENDBR to
SYM_FUNC_START* no longer suffices to make the function indirectly
callable. This now requires the use of SYM_TYPED_FUNC_START.

As such, remove the implicit ENDBR from SYM_FUNC_START* and add some
explicit annotations to (mostly) fix things up again.

This leaves:

vmlinux.o: warning: objtool: .export_symbol+0xc8: data relocation to !ENDBR: entry_ibpb+0x0
vmlinux.o: warning: objtool: .export_symbol+0xe8: data relocation to !ENDBR: asm_load_gs_index+0x0
vmlinux.o: warning: objtool: .export_symbol+0xf8: data relocation to !ENDBR: clear_bhb_loop+0x0
vmlinux.o: warning: objtool: .export_symbol+0x16218: data relocation to !ENDBR: rdmsr_safe_regs+0x0
vmlinux.o: warning: objtool: .export_symbol+0x16228: data relocation to !ENDBR: wrmsr_safe_regs+0x0
vmlinux.o: warning: objtool: .export_symbol+0x16238: data relocation to !ENDBR: __sw_hweight32+0x0
vmlinux.o: warning: objtool: .export_symbol+0x16248: data relocation to !ENDBR: __sw_hweight64+0x0
vmlinux.o: warning: objtool: .export_symbol+0x3c450: data relocation to !ENDBR: clear_page_rep+0x0
vmlinux.o: warning: objtool: .export_symbol+0x3c460: data relocation to !ENDBR: clear_page_orig+0x0
vmlinux.o: warning: objtool: .export_symbol+0x3c470: data relocation to !ENDBR: clear_page_erms+0x0
vmlinux.o: warning: objtool: .export_symbol+0x3c480: data relocation to !ENDBR: rep_stos_alternative+0x0
vmlinux.o: warning: objtool: .export_symbol+0x3c4a0: data relocation to !ENDBR: copy_page+0x0
vmlinux.o: warning: objtool: .export_symbol+0x3c4b0: data relocation to !ENDBR: rep_movs_alternative+0x0
vmlinux.o: warning: objtool: .export_symbol+0x3c4c0: data relocation to !ENDBR: __copy_user_nocache+0x0
vmlinux.o: warning: objtool: .export_symbol+0x3c580: data relocation to !ENDBR: __get_user_1+0x0
vmlinux.o: warning: objtool: .export_symbol+0x3c590: data relocation to !ENDBR: __get_user_2+0x0
vmlinux.o: warning: objtool: .export_symbol+0x3c5a0: data relocation to !ENDBR: __get_user_4+0x0
vmlinux.o: warning: objtool: .export_symbol+0x3c5b0: data relocation to !ENDBR: __get_user_8+0x0
vmlinux.o: warning: objtool: .export_symbol+0x3c5c0: data relocation to !ENDBR: __get_user_nocheck_1+0x0
vmlinux.o: warning: objtool: .export_symbol+0x3c5d0: data relocation to !ENDBR: __get_user_nocheck_2+0x0
vmlinux.o: warning: objtool: .export_symbol+0x3c5e0: data relocation to !ENDBR: __get_user_nocheck_4+0x0
vmlinux.o: warning: objtool: .export_symbol+0x3c5f0: data relocation to !ENDBR: __get_user_nocheck_8+0x0
vmlinux.o: warning: objtool: .export_symbol+0x3c620: data relocation to !ENDBR: memmove+0x0
vmlinux.o: warning: objtool: .export_symbol+0x3c630: data relocation to !ENDBR: memmove+0x0
vmlinux.o: warning: objtool: .export_symbol+0x3c640: data relocation to !ENDBR: memset+0x0
vmlinux.o: warning: objtool: .export_symbol+0x3c650: data relocation to !ENDBR: memset+0x0
vmlinux.o: warning: objtool: .export_symbol+0x3c660: data relocation to !ENDBR: __put_user_1+0x0
vmlinux.o: warning: objtool: .export_symbol+0x3c670: data relocation to !ENDBR: __put_user_nocheck_1+0x0
vmlinux.o: warning: objtool: .export_symbol+0x3c680: data relocation to !ENDBR: __put_user_2+0x0
vmlinux.o: warning: objtool: .export_symbol+0x3c690: data relocation to !ENDBR: __put_user_nocheck_2+0x0
vmlinux.o: warning: objtool: .export_symbol+0x3c6a0: data relocation to !ENDBR: __put_user_4+0x0
vmlinux.o: warning: objtool: .export_symbol+0x3c6b0: data relocation to !ENDBR: __put_user_nocheck_4+0x0
vmlinux.o: warning: objtool: .export_symbol+0x3c6c0: data relocation to !ENDBR: __put_user_8+0x0
vmlinux.o: warning: objtool: .export_symbol+0x3c6d0: data relocation to !ENDBR: __put_user_nocheck_8+0x0
vmlinux.o: warning: objtool: .export_symbol+0x3c9f0: data relocation to !ENDBR: entry_untrain_ret+0x0

Which states that while these functions are exported and (directly)
callable, they cannot be called indirectly. There are two solutions:

 - exclude the .export_symbol section from validation; effectively
   saying that having linkable but not indirectly callable exports are
   fine by default, or

 - make all of those use SYM_TYPED_FUNC_START to restore the
   traditional (and expected, but less secure?) behaviour.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/crypto/aesni-intel_asm.S     |    2 ++
 arch/x86/entry/calling.h              |    1 +
 arch/x86/entry/entry_64.S             |    1 +
 arch/x86/include/asm/linkage.h        |   18 ++++++------------
 arch/x86/include/asm/paravirt_types.h |   12 +++++++++++-
 arch/x86/kernel/acpi/madt_playdead.S  |    1 +
 arch/x86/kernel/acpi/wakeup_64.S      |    1 +
 arch/x86/kernel/ftrace_64.S           |    4 ++++
 arch/x86/kernel/irqflags.S            |    1 +
 arch/x86/kernel/paravirt.c            |   14 ++++++++++++--
 arch/x86/mm/mem_encrypt_boot.S        |    1 +
 arch/x86/power/hibernate_asm_64.S     |    2 ++
 arch/x86/xen/xen-asm.S                |    5 +++++
 13 files changed, 48 insertions(+), 15 deletions(-)

--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -17,6 +17,7 @@
  */
 
 #include <linux/linkage.h>
+#include <linux/objtool.h>
 #include <asm/frame.h>
 
 #define STATE1	%xmm0
@@ -1071,6 +1072,7 @@ SYM_FUNC_END(_aesni_inc)
  *		      size_t len, u8 *iv)
  */
 SYM_FUNC_START(aesni_ctr_enc)
+	ANNOTATE_NOENDBR
 	FRAME_BEGIN
 	cmp $16, LEN
 	jb .Lctr_enc_just_ret
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -431,6 +431,7 @@ For 32-bit we have the following convent
 /* rdi:	arg1 ... normal C conventions. rax is saved/restored. */
 .macro THUNK name, func
 SYM_FUNC_START(\name)
+	ANNOTATE_NOENDBR
 	pushq %rbp
 	movq %rsp, %rbp
 
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -175,6 +175,7 @@ SYM_CODE_END(entry_SYSCALL_64)
  */
 .pushsection .text, "ax"
 SYM_FUNC_START(__switch_to_asm)
+	ANNOTATE_NOENDBR
 	/*
 	 * Save callee-saved registers
 	 * This must match the order in inactive_task_frame
--- a/arch/x86/include/asm/linkage.h
+++ b/arch/x86/include/asm/linkage.h
@@ -119,33 +119,27 @@
 
 /* SYM_FUNC_START -- use for global functions */
 #define SYM_FUNC_START(name)				\
-	SYM_START(name, SYM_L_GLOBAL, SYM_F_ALIGN)	\
-	ENDBR
+	SYM_START(name, SYM_L_GLOBAL, SYM_F_ALIGN)
 
 /* SYM_FUNC_START_NOALIGN -- use for global functions, w/o alignment */
 #define SYM_FUNC_START_NOALIGN(name)			\
-	SYM_START(name, SYM_L_GLOBAL, SYM_A_NONE)	\
-	ENDBR
+	SYM_START(name, SYM_L_GLOBAL, SYM_A_NONE)
 
 /* SYM_FUNC_START_LOCAL -- use for local functions */
 #define SYM_FUNC_START_LOCAL(name)			\
-	SYM_START(name, SYM_L_LOCAL, SYM_F_ALIGN)	\
-	ENDBR
+	SYM_START(name, SYM_L_LOCAL, SYM_F_ALIGN)
 
 /* SYM_FUNC_START_LOCAL_NOALIGN -- use for local functions, w/o alignment */
 #define SYM_FUNC_START_LOCAL_NOALIGN(name)		\
-	SYM_START(name, SYM_L_LOCAL, SYM_A_NONE)	\
-	ENDBR
+	SYM_START(name, SYM_L_LOCAL, SYM_A_NONE)
 
 /* SYM_FUNC_START_WEAK -- use for weak functions */
 #define SYM_FUNC_START_WEAK(name)			\
-	SYM_START(name, SYM_L_WEAK, SYM_F_ALIGN)	\
-	ENDBR
+	SYM_START(name, SYM_L_WEAK, SYM_F_ALIGN)
 
 /* SYM_FUNC_START_WEAK_NOALIGN -- use for weak functions, w/o alignment */
 #define SYM_FUNC_START_WEAK_NOALIGN(name)		\
-	SYM_START(name, SYM_L_WEAK, SYM_A_NONE)		\
-	ENDBR
+	SYM_START(name, SYM_L_WEAK, SYM_A_NONE)
 
 #endif /* _ASM_X86_LINKAGE_H */
 
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -246,7 +246,17 @@ extern struct paravirt_patch_template pv
 
 int paravirt_disable_iospace(void);
 
-/* This generates an indirect call based on the operation type number. */
+/*
+ * This generates an indirect call based on the operation type number.
+ *
+ * Since alternatives run after enabling CET/IBT -- the latter setting/clearing
+ * capabilities and the former requiring all capabilities being finalized --
+ * these indirect calls are subject to IBT and the paravirt stubs should have
+ * ENDBR on.
+ *
+ * OTOH since this is effectively a __nocfi indirect call, the paravirt stubs
+ * don't need to bother with CFI prefixes.
+ */
 #define PARAVIRT_CALL					\
 	ANNOTATE_RETPOLINE_SAFE				\
 	"call *%[paravirt_opptr];"
--- a/arch/x86/kernel/acpi/madt_playdead.S
+++ b/arch/x86/kernel/acpi/madt_playdead.S
@@ -14,6 +14,7 @@
  * rsi: PGD of the identity mapping
  */
 SYM_FUNC_START(asm_acpi_mp_play_dead)
+	ANNOTATE_NOENDBR
 	/* Turn off global entries. Following CR3 write will flush them. */
 	movq	%cr4, %rdx
 	andq	$~(X86_CR4_PGE), %rdx
--- a/arch/x86/kernel/acpi/wakeup_64.S
+++ b/arch/x86/kernel/acpi/wakeup_64.S
@@ -17,6 +17,7 @@
 	 * Hooray, we are in Long 64-bit mode (but still running in low memory)
 	 */
 SYM_FUNC_START(wakeup_long64)
+	ANNOTATE_NOENDBR
 	movq	saved_magic(%rip), %rax
 	movq	$0x123456789abcdef0, %rdx
 	cmpq	%rdx, %rax
--- a/arch/x86/kernel/ftrace_64.S
+++ b/arch/x86/kernel/ftrace_64.S
@@ -146,12 +146,14 @@ SYM_FUNC_END(ftrace_stub_graph)
 #ifdef CONFIG_DYNAMIC_FTRACE
 
 SYM_FUNC_START(__fentry__)
+	ANNOTATE_NOENDBR
 	CALL_DEPTH_ACCOUNT
 	RET
 SYM_FUNC_END(__fentry__)
 EXPORT_SYMBOL(__fentry__)
 
 SYM_FUNC_START(ftrace_caller)
+	ANNOTATE_NOENDBR
 	/* save_mcount_regs fills in first two parameters */
 	save_mcount_regs
 
@@ -197,6 +199,7 @@ SYM_FUNC_END(ftrace_caller);
 STACK_FRAME_NON_STANDARD_FP(ftrace_caller)
 
 SYM_FUNC_START(ftrace_regs_caller)
+	ANNOTATE_NOENDBR
 	/* Save the current flags before any operations that can change them */
 	pushfq
 
@@ -317,6 +320,7 @@ SYM_FUNC_END(ftrace_stub_direct_tramp)
 #else /* ! CONFIG_DYNAMIC_FTRACE */
 
 SYM_FUNC_START(__fentry__)
+	ANNOTATE_NOENDBR
 	CALL_DEPTH_ACCOUNT
 
 	cmpq $ftrace_stub, ftrace_trace_function
--- a/arch/x86/kernel/irqflags.S
+++ b/arch/x86/kernel/irqflags.S
@@ -9,6 +9,7 @@
  */
 .pushsection .noinstr.text, "ax"
 SYM_FUNC_START(native_save_fl)
+	ENDBR
 	pushf
 	pop %_ASM_AX
 	RET
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -106,6 +106,16 @@ static noinstr void pv_native_write_cr2(
 	native_write_cr2(val);
 }
 
+static noinstr unsigned long pv_native_read_cr3(void)
+{
+	return __native_read_cr3();
+}
+
+static noinstr void pv_native_write_cr3(unsigned long cr3)
+{
+	native_write_cr3(cr3);
+}
+
 static noinstr unsigned long pv_native_get_debugreg(int regno)
 {
 	return native_get_debugreg(regno);
@@ -199,8 +209,8 @@ struct paravirt_patch_template pv_ops =
 #ifdef CONFIG_PARAVIRT_XXL
 	.mmu.read_cr2		= __PV_IS_CALLEE_SAVE(pv_native_read_cr2),
 	.mmu.write_cr2		= pv_native_write_cr2,
-	.mmu.read_cr3		= __native_read_cr3,
-	.mmu.write_cr3		= native_write_cr3,
+	.mmu.read_cr3		= pv_native_read_cr3,
+	.mmu.write_cr3		= pv_native_write_cr3,
 
 	.mmu.pgd_alloc		= __paravirt_pgd_alloc,
 	.mmu.pgd_free		= paravirt_nop,
--- a/arch/x86/mm/mem_encrypt_boot.S
+++ b/arch/x86/mm/mem_encrypt_boot.S
@@ -72,6 +72,7 @@ SYM_FUNC_START(sme_encrypt_execute)
 SYM_FUNC_END(sme_encrypt_execute)
 
 SYM_FUNC_START(__enc_copy)
+	ANNOTATE_NOENDBR
 /*
  * Routine used to encrypt memory in place.
  *   This routine must be run outside of the kernel proper since
--- a/arch/x86/power/hibernate_asm_64.S
+++ b/arch/x86/power/hibernate_asm_64.S
@@ -26,6 +26,7 @@
 	 /* code below belongs to the image kernel */
 	.align PAGE_SIZE
 SYM_FUNC_START(restore_registers)
+	ANNOTATE_NOENDBR
 	/* go back to the original page tables */
 	movq    %r9, %cr3
 
@@ -119,6 +120,7 @@ SYM_FUNC_END(restore_image)
 
 	/* code below has been relocated to a safe page */
 SYM_FUNC_START(core_restore_code)
+	ANNOTATE_NOENDBR
 	/* switch to temporary page tables */
 	movq	%rax, %cr3
 	/* flush TLB */
--- a/arch/x86/xen/xen-asm.S
+++ b/arch/x86/xen/xen-asm.S
@@ -28,6 +28,7 @@
  * non-zero.
  */
 SYM_FUNC_START(xen_irq_disable_direct)
+	ENDBR
 	movb $1, PER_CPU_VAR(xen_vcpu_info + XEN_vcpu_info_mask)
 	RET
 SYM_FUNC_END(xen_irq_disable_direct)
@@ -67,6 +68,7 @@ SYM_FUNC_END(check_events)
  * then enter the hypervisor to get them handled.
  */
 SYM_FUNC_START(xen_irq_enable_direct)
+	ENDBR
 	FRAME_BEGIN
 	/* Unmask events */
 	movb $0, PER_CPU_VAR(xen_vcpu_info + XEN_vcpu_info_mask)
@@ -97,6 +99,7 @@ SYM_FUNC_END(xen_irq_enable_direct)
  * x86 use opposite senses (mask vs enable).
  */
 SYM_FUNC_START(xen_save_fl_direct)
+	ENDBR
 	testb $0xff, PER_CPU_VAR(xen_vcpu_info + XEN_vcpu_info_mask)
 	setz %ah
 	addb %ah, %ah
@@ -104,6 +107,7 @@ SYM_FUNC_START(xen_save_fl_direct)
 SYM_FUNC_END(xen_save_fl_direct)
 
 SYM_FUNC_START(xen_read_cr2)
+	ENDBR
 	FRAME_BEGIN
 	_ASM_MOV PER_CPU_VAR(xen_vcpu), %_ASM_AX
 	_ASM_MOV XEN_vcpu_info_arch_cr2(%_ASM_AX), %_ASM_AX
@@ -112,6 +116,7 @@ SYM_FUNC_START(xen_read_cr2)
 SYM_FUNC_END(xen_read_cr2);
 
 SYM_FUNC_START(xen_read_cr2_direct)
+	ENDBR
 	FRAME_BEGIN
 	_ASM_MOV PER_CPU_VAR(xen_vcpu_info + XEN_vcpu_info_arch_cr2), %_ASM_AX
 	FRAME_END



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 01/14] x86/cfi: Wreck things...
  2024-09-27 19:48 ` [PATCH 01/14] x86/cfi: Wreck things Peter Zijlstra
@ 2024-09-27 23:15   ` Josh Poimboeuf
  2024-09-28 13:31     ` Peter Zijlstra
  0 siblings, 1 reply; 60+ messages in thread
From: Josh Poimboeuf @ 2024-09-27 23:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

On Fri, Sep 27, 2024 at 09:48:57PM +0200, Peter Zijlstra wrote:
> vmlinux.o: warning: objtool: .export_symbol+0x3c9f0: data relocation to !ENDBR: entry_untrain_ret+0x0
> 
> Which states that while these functions are exported and (directly)
> callable, they cannot be called indirectly. There are two solutions:

IIRC, exported symbols are by far the most common "need" for ENDBR.  But
presumably the vast majority of them aren't being indirect called.

>  - exclude the .export_symbol section from validation; effectively
>    saying that having linkable but not indirectly callable exports are
>    fine by default, or

This is confusingly inconsistent IMO.

>  - make all of those use SYM_TYPED_FUNC_START to restore the
>    traditional (and expected, but less secure?) behaviour.

Why not just make SYM_FUNC_START imply "typed"?  That's consistent with
what the compiler does anyway right?

Even better, require exported+indirect-called symbols to use
EXPORT_SYMBOL_TYPED, otherwise they get sealed.  I suppose we'd need to
add some module-to-vmlinux ENDBR validation to make sure modules don't
get broken by this.

-- 
Josh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 01/14] x86/cfi: Wreck things...
  2024-09-27 23:15   ` Josh Poimboeuf
@ 2024-09-28 13:31     ` Peter Zijlstra
  2024-09-30 21:42       ` Josh Poimboeuf
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2024-09-28 13:31 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: x86, linux-kernel, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

On Fri, Sep 27, 2024 at 04:15:27PM -0700, Josh Poimboeuf wrote:
> On Fri, Sep 27, 2024 at 09:48:57PM +0200, Peter Zijlstra wrote:
> > vmlinux.o: warning: objtool: .export_symbol+0x3c9f0: data relocation to !ENDBR: entry_untrain_ret+0x0
> > 
> > Which states that while these functions are exported and (directly)
> > callable, they cannot be called indirectly. There are two solutions:
> 
> IIRC, exported symbols are by far the most common "need" for ENDBR.  But
> presumably the vast majority of them aren't being indirect called.
> 
> >  - exclude the .export_symbol section from validation; effectively
> >    saying that having linkable but not indirectly callable exports are
> >    fine by default, or
> 
> This is confusingly inconsistent IMO.

Yes. OTOH less indirectly callable functions is more better, no?

> >  - make all of those use SYM_TYPED_FUNC_START to restore the
> >    traditional (and expected, but less secure?) behaviour.
> 
> Why not just make SYM_FUNC_START imply "typed"?  That's consistent with
> what the compiler does anyway right?

It is indeed what the compiler does (unless __nocfi attribute is
employed), but it requires that there is a C declaration of the function
-- which is true for all exported functions but not necessary for all
SYM_FUNC_START() symbols per-se.

Also, it would make all ASM functions indirectly callable by default --
which I'm not sure is a good idea, I would much rather we keep this
explicit.

> Even better, require exported+indirect-called symbols to use
> EXPORT_SYMBOL_TYPED, otherwise they get sealed.  I suppose we'd need to
> add some module-to-vmlinux ENDBR validation to make sure modules don't
> get broken by this.

So I like this idea. but yeah, this is going to be a bit tricky to
validate.

Anyway, I think I'll do an initial patch moving all the EXPORT'ed
symbols over to SYM_TYPED_FUNC_START() for now, and we can look at
adding extra EXPORT magic on top of all that later on.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 01/14] x86/cfi: Wreck things...
  2024-09-28 13:31     ` Peter Zijlstra
@ 2024-09-30 21:42       ` Josh Poimboeuf
  0 siblings, 0 replies; 60+ messages in thread
From: Josh Poimboeuf @ 2024-09-30 21:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

On Sat, Sep 28, 2024 at 03:31:14PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 27, 2024 at 04:15:27PM -0700, Josh Poimboeuf wrote:
> > Even better, require exported+indirect-called symbols to use
> > EXPORT_SYMBOL_TYPED, otherwise they get sealed.  I suppose we'd need to
> > add some module-to-vmlinux ENDBR validation to make sure modules don't
> > get broken by this.
> 
> So I like this idea. but yeah, this is going to be a bit tricky to
> validate.

If Module.symvers had EXPORT_SYMBOL[_GPL][_TYPED], objtool could read
that to decide whether a given module indirect branch is allowed.

Objtool is going to be getting support for reading Module.symvers anyway
for klp-build so it should actually be pretty easy.

-- 
Josh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 02/14] x86/boot: Mark start_secondary() with __noendbr
  2024-09-27 19:48 [PATCH 00/14] x86/ibt: FineIBT-BHI Peter Zijlstra
  2024-09-27 19:48 ` [PATCH 01/14] x86/cfi: Wreck things Peter Zijlstra
@ 2024-09-27 19:48 ` Peter Zijlstra
  2024-09-27 19:48 ` [PATCH 03/14] x86/alternative: Simplify callthunk patching Peter Zijlstra
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 60+ messages in thread
From: Peter Zijlstra @ 2024-09-27 19:48 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, peterz, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jpoimboe, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

The handoff between the boot stubs and start_secondary() are before IBT is
enabled and is definitely not subject to kCFI. As such, suppress all that for
this function.

Notably when the ENDBR poison would become fatal (ud1 instead of nop) this will
trigger a tripple fault because we haven't set up the IDT to handle #UD yet.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/smpboot.c |    3 ++-
 include/linux/objtool.h   |   13 ++++++++++---
 2 files changed, 12 insertions(+), 4 deletions(-)

--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -228,7 +228,7 @@ static void ap_calibrate_delay(void)
 /*
  * Activate a secondary processor.
  */
-static void notrace start_secondary(void *unused)
+static void notrace __noendbr start_secondary(void *unused)
 {
 	/*
 	 * Don't put *anything* except direct CPU state initialization
@@ -313,6 +313,7 @@ static void notrace start_secondary(void
 	wmb();
 	cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);
 }
+NOENDBR_SYMBOL(start_secondary);
 
 /*
  * The bootstrap kernel entry code has set these up. Save them for
--- a/include/linux/objtool.h
+++ b/include/linux/objtool.h
@@ -45,12 +45,18 @@
 #define STACK_FRAME_NON_STANDARD_FP(func)
 #endif
 
-#define ANNOTATE_NOENDBR					\
-	"986: \n\t"						\
+#define __ANNOTATE_NOENDBR(label)				\
 	".pushsection .discard.noendbr\n\t"			\
-	".long 986b\n\t"					\
+	".long " #label "\n\t"					\
 	".popsection\n\t"
 
+#define NOENDBR_SYMBOL(func)					\
+	asm(__ANNOTATE_NOENDBR(func))
+
+#define ANNOTATE_NOENDBR					\
+	"986: \n\t"						\
+	__ANNOTATE_NOENDBR(986b)
+
 #define ASM_REACHABLE							\
 	"998:\n\t"							\
 	".pushsection .discard.reachable\n\t"				\
@@ -157,6 +163,7 @@
 #define STACK_FRAME_NON_STANDARD_FP(func)
 #define ANNOTATE_NOENDBR
 #define ASM_REACHABLE
+#define NOENDBR_SYMBOL(func)
 #else
 #define ANNOTATE_INTRA_FUNCTION_CALL
 .macro UNWIND_HINT type:req sp_reg=0 sp_offset=0 signal=0



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 03/14] x86/alternative: Simplify callthunk patching
  2024-09-27 19:48 [PATCH 00/14] x86/ibt: FineIBT-BHI Peter Zijlstra
  2024-09-27 19:48 ` [PATCH 01/14] x86/cfi: Wreck things Peter Zijlstra
  2024-09-27 19:48 ` [PATCH 02/14] x86/boot: Mark start_secondary() with __noendbr Peter Zijlstra
@ 2024-09-27 19:48 ` Peter Zijlstra
  2024-09-27 23:27   ` Josh Poimboeuf
  2024-09-27 19:49 ` [PATCH 04/14] objtool/x86: Add .tail_call_sites Peter Zijlstra
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2024-09-27 19:48 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, peterz, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jpoimboe, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

Now that paravirt call patching is implemented using alternatives, it
is possible to avoid having to patch the alternative sites by
including the altinstr_replacement calls in the call_sites list.

This means we're now stacking relative adjustments like so:

  callthunks_patch_builtin_calls():
    patches all function calls to target: func() -> func()-10
    since the CALL accounting lives in the CALL_PADDING.

    This explicitly includes .altinstr_replacement

  alt_replace_call():
    patches: x86_BUG() -> target()

    this patching is done in a relative manner, and will preserve
    the above adjustment, meaning that with calldepth patching it
    will do: x86_BUG()-10 -> target()-10

  apply_relocation():
    does code relocation, and adjusts all RIP-relative instructions
    to the new location, also in a relative manner.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/alternative.h |    1 -
 arch/x86/kernel/alternative.c      |    8 ++++----
 arch/x86/kernel/callthunks.c       |   13 -------------
 arch/x86/kernel/module.c           |   17 ++++++-----------
 tools/objtool/arch/x86/decode.c    |    1 +
 tools/objtool/check.c              |   12 ++----------
 6 files changed, 13 insertions(+), 39 deletions(-)

--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -109,7 +109,6 @@ struct module;
 
 struct callthunk_sites {
 	s32				*call_start, *call_end;
-	struct alt_instr		*alt_start, *alt_end;
 };
 
 #ifdef CONFIG_CALL_THUNKS
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1701,14 +1701,14 @@ void __init alternative_instructions(voi
 	apply_retpolines(__retpoline_sites, __retpoline_sites_end);
 	apply_returns(__return_sites, __return_sites_end);
 
-	apply_alternatives(__alt_instructions, __alt_instructions_end);
-
 	/*
-	 * Now all calls are established. Apply the call thunks if
-	 * required.
+	 * Adjust call CALL instructions to point to func()-10, including those
+	 * in .altinstr_replacement.
 	 */
 	callthunks_patch_builtin_calls();
 
+	apply_alternatives(__alt_instructions, __alt_instructions_end);
+
 	/*
 	 * Seal all functions that do not have their address taken.
 	 */
--- a/arch/x86/kernel/callthunks.c
+++ b/arch/x86/kernel/callthunks.c
@@ -239,21 +239,10 @@ patch_call_sites(s32 *start, s32 *end, c
 }
 
 static __init_or_module void
-patch_alt_call_sites(struct alt_instr *start, struct alt_instr *end,
-		     const struct core_text *ct)
-{
-	struct alt_instr *a;
-
-	for (a = start; a < end; a++)
-		patch_call((void *)&a->instr_offset + a->instr_offset, ct);
-}
-
-static __init_or_module void
 callthunks_setup(struct callthunk_sites *cs, const struct core_text *ct)
 {
 	prdbg("Patching call sites %s\n", ct->name);
 	patch_call_sites(cs->call_start, cs->call_end, ct);
-	patch_alt_call_sites(cs->alt_start, cs->alt_end, ct);
 	prdbg("Patching call sites done%s\n", ct->name);
 }
 
@@ -262,8 +251,6 @@ void __init callthunks_patch_builtin_cal
 	struct callthunk_sites cs = {
 		.call_start	= __call_sites,
 		.call_end	= __call_sites_end,
-		.alt_start	= __alt_instructions,
-		.alt_end	= __alt_instructions_end
 	};
 
 	if (!cpu_feature_enabled(X86_FEATURE_CALL_DEPTH))
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -275,12 +275,7 @@ int module_finalize(const Elf_Ehdr *hdr,
 		void *rseg = (void *)returns->sh_addr;
 		apply_returns(rseg, rseg + returns->sh_size);
 	}
-	if (alt) {
-		/* patch .altinstructions */
-		void *aseg = (void *)alt->sh_addr;
-		apply_alternatives(aseg, aseg + alt->sh_size);
-	}
-	if (calls || alt) {
+	if (calls) {
 		struct callthunk_sites cs = {};
 
 		if (calls) {
@@ -288,13 +283,13 @@ int module_finalize(const Elf_Ehdr *hdr,
 			cs.call_end = (void *)calls->sh_addr + calls->sh_size;
 		}
 
-		if (alt) {
-			cs.alt_start = (void *)alt->sh_addr;
-			cs.alt_end = (void *)alt->sh_addr + alt->sh_size;
-		}
-
 		callthunks_patch_module_calls(&cs, me);
 	}
+	if (alt) {
+		/* patch .altinstructions */
+		void *aseg = (void *)alt->sh_addr;
+		apply_alternatives(aseg, aseg + alt->sh_size);
+	}
 	if (ibt_endbr) {
 		void *iseg = (void *)ibt_endbr->sh_addr;
 		apply_seal_endbr(iseg, iseg + ibt_endbr->sh_size);
--- a/tools/objtool/arch/x86/decode.c
+++ b/tools/objtool/arch/x86/decode.c
@@ -845,5 +845,6 @@ bool arch_is_rethunk(struct symbol *sym)
 bool arch_is_embedded_insn(struct symbol *sym)
 {
 	return !strcmp(sym->name, "retbleed_return_thunk") ||
+	       !strcmp(sym->name, "srso_alias_safe_ret") ||
 	       !strcmp(sym->name, "srso_safe_ret");
 }
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -1360,15 +1360,6 @@ static void annotate_call_site(struct ob
 	if (!sym)
 		sym = reloc->sym;
 
-	/*
-	 * Alternative replacement code is just template code which is
-	 * sometimes copied to the original instruction. For now, don't
-	 * annotate it. (In the future we might consider annotating the
-	 * original instruction if/when it ever makes sense to do so.)
-	 */
-	if (!strcmp(insn->sec->name, ".altinstr_replacement"))
-		return;
-
 	if (sym->static_call_tramp) {
 		list_add_tail(&insn->call_node, &file->static_call_list);
 		return;
@@ -1426,7 +1417,8 @@ static void annotate_call_site(struct ob
 		return;
 	}
 
-	if (insn->type == INSN_CALL && !insn->sec->init)
+	if (insn->type == INSN_CALL && !insn->sec->init &&
+	    !insn->_call_dest->embedded_insn)
 		list_add_tail(&insn->call_node, &file->call_list);
 
 	if (!sibling && dead_end_function(file, sym))



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 03/14] x86/alternative: Simplify callthunk patching
  2024-09-27 19:48 ` [PATCH 03/14] x86/alternative: Simplify callthunk patching Peter Zijlstra
@ 2024-09-27 23:27   ` Josh Poimboeuf
  0 siblings, 0 replies; 60+ messages in thread
From: Josh Poimboeuf @ 2024-09-27 23:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

On Fri, Sep 27, 2024 at 09:48:59PM +0200, Peter Zijlstra wrote:
> -	 * Now all calls are established. Apply the call thunks if
> -	 * required.
> +	 * Adjust call CALL instructions to point to func()-10, including those
> +	 * in .altinstr_replacement.

A couple of tweaks:

"Adjust all CALL instructions to point to func()-0x10..."

-- 
Josh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 04/14] objtool/x86: Add .tail_call_sites
  2024-09-27 19:48 [PATCH 00/14] x86/ibt: FineIBT-BHI Peter Zijlstra
                   ` (2 preceding siblings ...)
  2024-09-27 19:48 ` [PATCH 03/14] x86/alternative: Simplify callthunk patching Peter Zijlstra
@ 2024-09-27 19:49 ` Peter Zijlstra
  2024-09-27 23:42   ` Josh Poimboeuf
  2024-09-27 19:49 ` [PATCH 05/14] objtool: Rename the skylake hack to --direct-call Peter Zijlstra
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2024-09-27 19:49 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, peterz, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jpoimboe, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

In order to allow rewriting all direct (tail) calls to ENDBR to go +4, also
generate a tail-call sites section.

This must be different from the .call_sites, because call depth tracking
specifically cares about the CALL instruction, but not JMP instructions.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/vmlinux.lds.S           |    9 ++++-
 tools/objtool/check.c                   |   52 +++++++++++++++++++++++++++++---
 tools/objtool/include/objtool/objtool.h |    1 
 tools/objtool/objtool.c                 |    1 
 4 files changed, 58 insertions(+), 5 deletions(-)

--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -285,6 +285,7 @@ SECTIONS
 		*(.return_sites)
 		__return_sites_end = .;
 	}
+#endif
 
 	. = ALIGN(8);
 	.call_sites : AT(ADDR(.call_sites) - LOAD_OFFSET) {
@@ -292,7 +293,13 @@ SECTIONS
 		*(.call_sites)
 		__call_sites_end = .;
 	}
-#endif
+
+	. = ALIGN(8);
+	.tail_call_sites : AT(ADDR(.tail_call_sites) - LOAD_OFFSET) {
+		__tail_call_sites = .;
+		*(.tail_call_sites)
+		__tail_call_sites_end = .;
+	}
 
 #ifdef CONFIG_X86_KERNEL_IBT
 	. = ALIGN(8);
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -893,7 +893,6 @@ static int create_cfi_sections(struct ob
 
 	sec = find_section_by_name(file->elf, ".cfi_sites");
 	if (sec) {
-		INIT_LIST_HEAD(&file->call_list);
 		WARN("file already has .cfi_sites section, skipping");
 		return 0;
 	}
@@ -1018,6 +1017,45 @@ static int create_direct_call_sections(s
 	return 0;
 }
 
+static int create_direct_tail_call_sections(struct objtool_file *file)
+{
+	struct instruction *insn;
+	struct section *sec;
+	int idx;
+
+	sec = find_section_by_name(file->elf, ".tail_call_sites");
+	if (sec) {
+		INIT_LIST_HEAD(&file->tail_call_list);
+		WARN("file already has .tail_call_sites section, skipping");
+		return 0;
+	}
+
+	if (list_empty(&file->tail_call_list))
+		return 0;
+
+	idx = 0;
+	list_for_each_entry(insn, &file->tail_call_list, call_node)
+		idx++;
+
+	sec = elf_create_section_pair(file->elf, ".tail_call_sites",
+				      sizeof(unsigned int), idx, idx);
+	if (!sec)
+		return -1;
+
+	idx = 0;
+	list_for_each_entry(insn, &file->tail_call_list, call_node) {
+
+		if (!elf_init_reloc_text_sym(file->elf, sec,
+					     idx * sizeof(unsigned int), idx,
+					     insn->sec, insn->offset))
+			return -1;
+
+		idx++;
+	}
+
+	return 0;
+}
+
 /*
  * Warnings shouldn't be reported for ignored functions.
  */
@@ -1417,9 +1455,12 @@ static void annotate_call_site(struct ob
 		return;
 	}
 
-	if (insn->type == INSN_CALL && !insn->sec->init &&
-	    !insn->_call_dest->embedded_insn)
-		list_add_tail(&insn->call_node, &file->call_list);
+	if (!insn->sec->init && !insn->_call_dest->embedded_insn) {
+		if (insn->type == INSN_CALL)
+			list_add_tail(&insn->call_node, &file->call_list);
+		else
+			list_add_tail(&insn->call_node, &file->tail_call_list);
+	}
 
 	if (!sibling && dead_end_function(file, sym))
 		insn->dead_end = true;
@@ -4802,6 +4843,9 @@ int check(struct objtool_file *file)
 			ret = create_direct_call_sections(file);
 			if (ret < 0)
 				goto out;
+			ret = create_direct_tail_call_sections(file);
+			if (ret < 0)
+				goto out;
 			warnings += ret;
 		}
 	}
--- a/tools/objtool/include/objtool/objtool.h
+++ b/tools/objtool/include/objtool/objtool.h
@@ -28,6 +28,7 @@ struct objtool_file {
 	struct list_head mcount_loc_list;
 	struct list_head endbr_list;
 	struct list_head call_list;
+	struct list_head tail_call_list;
 	bool ignore_unreachables, hints, rodata;
 
 	unsigned int nr_endbr;
--- a/tools/objtool/objtool.c
+++ b/tools/objtool/objtool.c
@@ -106,6 +106,7 @@ struct objtool_file *objtool_open_read(c
 	INIT_LIST_HEAD(&file.mcount_loc_list);
 	INIT_LIST_HEAD(&file.endbr_list);
 	INIT_LIST_HEAD(&file.call_list);
+	INIT_LIST_HEAD(&file.tail_call_list);
 	file.ignore_unreachables = opts.no_unreachable;
 	file.hints = false;
 



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 04/14] objtool/x86: Add .tail_call_sites
  2024-09-27 19:49 ` [PATCH 04/14] objtool/x86: Add .tail_call_sites Peter Zijlstra
@ 2024-09-27 23:42   ` Josh Poimboeuf
  2024-10-09 15:25     ` Peter Zijlstra
  0 siblings, 1 reply; 60+ messages in thread
From: Josh Poimboeuf @ 2024-09-27 23:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

On Fri, Sep 27, 2024 at 09:49:00PM +0200, Peter Zijlstra wrote:
> @@ -893,7 +893,6 @@ static int create_cfi_sections(struct ob
>  
>  	sec = find_section_by_name(file->elf, ".cfi_sites");
>  	if (sec) {
> -		INIT_LIST_HEAD(&file->call_list);

Hm, why exactly are we re-initing the list head anyway in these
boilerplate create_*_sections() functions?  I'm guessing that backfired
here.  I can't figure out a reason why we'd doing that anyway.

I'm also wondering why we made these boilerplate function names plural
"sections" when they only create a single section :-)

-- 
Josh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 04/14] objtool/x86: Add .tail_call_sites
  2024-09-27 23:42   ` Josh Poimboeuf
@ 2024-10-09 15:25     ` Peter Zijlstra
  2024-10-10  4:55       ` Josh Poimboeuf
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2024-10-09 15:25 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: x86, linux-kernel, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

On Fri, Sep 27, 2024 at 04:42:47PM -0700, Josh Poimboeuf wrote:
> On Fri, Sep 27, 2024 at 09:49:00PM +0200, Peter Zijlstra wrote:
> > @@ -893,7 +893,6 @@ static int create_cfi_sections(struct ob
> >  
> >  	sec = find_section_by_name(file->elf, ".cfi_sites");
> >  	if (sec) {
> > -		INIT_LIST_HEAD(&file->call_list);
> 
> Hm, why exactly are we re-initing the list head anyway in these
> boilerplate create_*_sections() functions?  I'm guessing that backfired
> here.  I can't figure out a reason why we'd doing that anyway.

Yeah, I can't remember where that came from, nor why I removed this
particular one :/

> I'm also wondering why we made these boilerplate function names plural
> "sections" when they only create a single section :-)

Because elf_create_section_pair() creates two section_s_, right?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 04/14] objtool/x86: Add .tail_call_sites
  2024-10-09 15:25     ` Peter Zijlstra
@ 2024-10-10  4:55       ` Josh Poimboeuf
  0 siblings, 0 replies; 60+ messages in thread
From: Josh Poimboeuf @ 2024-10-10  4:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

On Wed, Oct 09, 2024 at 05:25:03PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 27, 2024 at 04:42:47PM -0700, Josh Poimboeuf wrote:
> > On Fri, Sep 27, 2024 at 09:49:00PM +0200, Peter Zijlstra wrote:
> > > @@ -893,7 +893,6 @@ static int create_cfi_sections(struct ob
> > >  
> > >  	sec = find_section_by_name(file->elf, ".cfi_sites");
> > >  	if (sec) {
> > > -		INIT_LIST_HEAD(&file->call_list);
> > 
> > Hm, why exactly are we re-initing the list head anyway in these
> > boilerplate create_*_sections() functions?  I'm guessing that backfired
> > here.  I can't figure out a reason why we'd doing that anyway.
> 
> Yeah, I can't remember where that came from, nor why I removed this
> particular one :/

I think you removed this one because file->call_list is also used by
create_direct_call_sections() so it's not wise to clear it if it was
already initalized and needed for another purpose.

> 
> > I'm also wondering why we made these boilerplate function names plural
> > "sections" when they only create a single section :-)
> 
> Because elf_create_section_pair() creates two section_s_, right?

Ah right.

-- 
Josh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 05/14] objtool: Rename the skylake hack to --direct-call
  2024-09-27 19:48 [PATCH 00/14] x86/ibt: FineIBT-BHI Peter Zijlstra
                   ` (3 preceding siblings ...)
  2024-09-27 19:49 ` [PATCH 04/14] objtool/x86: Add .tail_call_sites Peter Zijlstra
@ 2024-09-27 19:49 ` Peter Zijlstra
  2024-09-27 19:49 ` [PATCH 06/14] x86/traps: Prepare for ENDBR poison UD1 usage Peter Zijlstra
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 60+ messages in thread
From: Peter Zijlstra @ 2024-09-27 19:49 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, peterz, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jpoimboe, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

In order to use the .call_sites and .tail_call_sites sections beyond the
call-depth-tracking skylake thing, give it a more generic name.

Specifically it will be used to 'relink' all direct calls to avoid ENDBR.

A compiler could only hope to do this using an LTO pass, and even then the
dynamic linker would need to get involved.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 scripts/Makefile.lib                    |    2 +-
 tools/objtool/builtin-check.c           |    8 ++------
 tools/objtool/check.c                   |   18 +++++++++---------
 tools/objtool/include/objtool/builtin.h |    2 +-
 4 files changed, 13 insertions(+), 17 deletions(-)

--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -267,7 +267,7 @@ objtool := $(objtree)/tools/objtool/objt
 
 objtool-args-$(CONFIG_HAVE_JUMP_LABEL_HACK)		+= --hacks=jump_label
 objtool-args-$(CONFIG_HAVE_NOINSTR_HACK)		+= --hacks=noinstr
-objtool-args-$(CONFIG_MITIGATION_CALL_DEPTH_TRACKING)	+= --hacks=skylake
+objtool-args-$(CONFIG_MITIGATION_CALL_DEPTH_TRACKING)	+= --direct-call
 objtool-args-$(CONFIG_X86_KERNEL_IBT)			+= --ibt
 objtool-args-$(CONFIG_FINEIBT)				+= --cfi
 objtool-args-$(CONFIG_FTRACE_MCOUNT_USE_OBJTOOL)	+= --mcount
--- a/tools/objtool/builtin-check.c
+++ b/tools/objtool/builtin-check.c
@@ -57,17 +57,13 @@ static int parse_hacks(const struct opti
 		found = true;
 	}
 
-	if (!str || strstr(str, "skylake")) {
-		opts.hack_skylake = true;
-		found = true;
-	}
-
 	return found ? 0 : -1;
 }
 
 static const struct option check_options[] = {
 	OPT_GROUP("Actions:"),
-	OPT_CALLBACK_OPTARG('h', "hacks", NULL, NULL, "jump_label,noinstr,skylake", "patch toolchain bugs/limitations", parse_hacks),
+	OPT_CALLBACK_OPTARG('h', "hacks", NULL, NULL, "jump_label,noinstr", "patch toolchain bugs/limitations", parse_hacks),
+	OPT_BOOLEAN(0,   "direct-call", &opts.direct_call, "annotate direct calls"),
 	OPT_BOOLEAN('i', "ibt", &opts.ibt, "validate and annotate IBT"),
 	OPT_BOOLEAN('m', "mcount", &opts.mcount, "annotate mcount/fentry calls for ftrace"),
 	OPT_BOOLEAN('n', "noinstr", &opts.noinstr, "validate noinstr rules"),
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -4838,16 +4838,16 @@ int check(struct objtool_file *file)
 		if (ret < 0)
 			goto out;
 		warnings += ret;
+	}
 
-		if (opts.hack_skylake) {
-			ret = create_direct_call_sections(file);
-			if (ret < 0)
-				goto out;
-			ret = create_direct_tail_call_sections(file);
-			if (ret < 0)
-				goto out;
-			warnings += ret;
-		}
+	if (opts.direct_call) {
+		ret = create_direct_call_sections(file);
+		if (ret < 0)
+			goto out;
+		ret = create_direct_tail_call_sections(file);
+		if (ret < 0)
+			goto out;
+		warnings += ret;
 	}
 
 	if (opts.mcount) {
--- a/tools/objtool/include/objtool/builtin.h
+++ b/tools/objtool/include/objtool/builtin.h
@@ -12,7 +12,7 @@ struct opts {
 	bool dump_orc;
 	bool hack_jump_label;
 	bool hack_noinstr;
-	bool hack_skylake;
+	bool direct_call;
 	bool ibt;
 	bool mcount;
 	bool noinstr;



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 06/14] x86/traps: Prepare for ENDBR poison UD1 usage
  2024-09-27 19:48 [PATCH 00/14] x86/ibt: FineIBT-BHI Peter Zijlstra
                   ` (4 preceding siblings ...)
  2024-09-27 19:49 ` [PATCH 05/14] objtool: Rename the skylake hack to --direct-call Peter Zijlstra
@ 2024-09-27 19:49 ` Peter Zijlstra
  2024-09-27 19:49 ` [PATCH 07/14] x86/ibt: Clean up is_endbr() Peter Zijlstra
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 60+ messages in thread
From: Peter Zijlstra @ 2024-09-27 19:49 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, peterz, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jpoimboe, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

Prepare for ENDBR poison to become UD1 by making #UD recognise it as a valid
BUG address so we get a nice BUG splat when we trip on one.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/bug.h |    6 ++-
 arch/x86/include/asm/ibt.h |    8 +++-
 arch/x86/kernel/cet.c      |   18 ++++++---
 arch/x86/kernel/traps.c    |   89 +++++++++++++++++++++++++++++++++------------
 4 files changed, 90 insertions(+), 31 deletions(-)

--- a/arch/x86/include/asm/bug.h
+++ b/arch/x86/include/asm/bug.h
@@ -22,8 +22,10 @@
 #define SECOND_BYTE_OPCODE_UD2	0x0b
 
 #define BUG_NONE		0xffff
-#define BUG_UD1			0xfffe
-#define BUG_UD2			0xfffd
+#define BUG_UD2			0xfffe
+#define BUG_UD1			0xfffd
+#define BUG_UD1_UBSAN		0xfffc
+#define BUG_UD1_ENDBR		0xfffb
 
 #ifdef CONFIG_GENERIC_BUG
 
--- a/arch/x86/include/asm/ibt.h
+++ b/arch/x86/include/asm/ibt.h
@@ -41,7 +41,7 @@
 	_ASM_PTR fname "\n\t"				\
 	".popsection\n\t"
 
-static inline __attribute_const__ u32 gen_endbr(void)
+static __always_inline __attribute_const__ u32 gen_endbr(void)
 {
 	u32 endbr;
 
@@ -56,7 +56,7 @@ static inline __attribute_const__ u32 ge
 	return endbr;
 }
 
-static inline __attribute_const__ u32 gen_endbr_poison(void)
+static __always_inline __attribute_const__ u32 gen_endbr_poison(void)
 {
 	/*
 	 * 4 byte NOP that isn't NOP4 (in fact it is OSP NOP3), such that it
@@ -77,6 +77,10 @@ static inline bool is_endbr(u32 val)
 extern __noendbr u64 ibt_save(bool disable);
 extern __noendbr void ibt_restore(u64 save);
 
+struct pt_regs;
+
+extern bool __do_kernel_cp_fault(struct pt_regs *regs);
+
 #else /* __ASSEMBLY__ */
 
 #ifdef CONFIG_X86_64
--- a/arch/x86/kernel/cet.c
+++ b/arch/x86/kernel/cet.c
@@ -81,6 +81,17 @@ static void do_user_cp_fault(struct pt_r
 
 static __ro_after_init bool ibt_fatal = true;
 
+bool __do_kernel_cp_fault(struct pt_regs *regs)
+{
+	pr_err("Missing ENDBR: %pS\n", (void *)instruction_pointer(regs));
+	if (!ibt_fatal) {
+		printk(KERN_DEFAULT CUT_HERE);
+		__warn(__FILE__, __LINE__, (void *)regs->ip, TAINT_WARN, regs, NULL);
+		return true;
+	}
+	return false;
+}
+
 static void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code)
 {
 	if ((error_code & CP_EC) != CP_ENDBR) {
@@ -93,12 +104,9 @@ static void do_kernel_cp_fault(struct pt
 		return;
 	}
 
-	pr_err("Missing ENDBR: %pS\n", (void *)instruction_pointer(regs));
-	if (!ibt_fatal) {
-		printk(KERN_DEFAULT CUT_HERE);
-		__warn(__FILE__, __LINE__, (void *)regs->ip, TAINT_WARN, regs, NULL);
+	if (__do_kernel_cp_fault(regs))
 		return;
-	}
+
 	BUG();
 }
 
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -94,10 +94,18 @@ __always_inline int is_valid_bugaddr(uns
 
 /*
  * Check for UD1 or UD2, accounting for Address Size Override Prefixes.
- * If it's a UD1, get the ModRM byte to pass along to UBSan.
+ * If it's a UD1, further decode to determine its use:
+ *
+ * UBSan{0}:     67 0f b9 00             ud1    (%eax),%eax
+ * UBSan{10}:    67 0f b9 40 10          ud1    0x10(%eax),%eax
+ * static_call:  0f b9 cc                ud1    %esp,%ecx
+ * ENDBR Poison: 0f b9 48 00             ud1    0(%eax),%edx
+ *
+ * Notably UBSAN uses EAX, static_call uses ECX and ENDBR uses EDX.
  */
-__always_inline int decode_bug(unsigned long addr, u32 *imm)
+__always_inline int decode_bug(unsigned long addr, s32 *imm, int *len)
 {
+	unsigned long start = addr;
 	u8 v;
 
 	if (addr < TASK_SIZE_MAX)
@@ -110,24 +118,41 @@ __always_inline int decode_bug(unsigned
 		return BUG_NONE;
 
 	v = *(u8 *)(addr++);
-	if (v == SECOND_BYTE_OPCODE_UD2)
+	if (v == SECOND_BYTE_OPCODE_UD2) {
+		*len = addr - start;
 		return BUG_UD2;
+	}
 
-	if (!IS_ENABLED(CONFIG_UBSAN_TRAP) || v != SECOND_BYTE_OPCODE_UD1)
+	if (v != SECOND_BYTE_OPCODE_UD1)
 		return BUG_NONE;
 
-	/* Retrieve the immediate (type value) for the UBSAN UD1 */
-	v = *(u8 *)(addr++);
-	if (X86_MODRM_RM(v) == 4)
-		addr++;
-
 	*imm = 0;
-	if (X86_MODRM_MOD(v) == 1)
-		*imm = *(u8 *)addr;
-	else if (X86_MODRM_MOD(v) == 2)
-		*imm = *(u32 *)addr;
-	else
-		WARN_ONCE(1, "Unexpected MODRM_MOD: %u\n", X86_MODRM_MOD(v));
+	v = *(u8 *)(addr++);		/* ModRM */
+
+	/* Decode immediate, if present */
+	if (X86_MODRM_MOD(v) != 3) {
+		if (X86_MODRM_RM(v) == 4)
+			addr++;		/* Skip SIB byte */
+
+		if (X86_MODRM_MOD(v) == 1) {
+			*imm = *(s8 *)addr;
+			addr += 1;
+
+		} else if (X86_MODRM_MOD(v) == 2) {
+			*imm = *(s32 *)addr;
+			addr += 4;
+		}
+	}
+
+	/* record instruction length */
+	*len = addr - start;
+
+	if (X86_MODRM_REG(v) == 0)	/* EAX */
+		return BUG_UD1_UBSAN;
+
+	if (X86_MODRM_REG(v) == 2 &&	/* RDX */
+	    *len == ENDBR_INSN_SIZE)
+		return BUG_UD1_ENDBR;
 
 	return BUG_UD1;
 }
@@ -258,8 +283,8 @@ static inline void handle_invalid_op(str
 static noinstr bool handle_bug(struct pt_regs *regs)
 {
 	bool handled = false;
-	int ud_type;
-	u32 imm;
+	int ud_type, ud_len;
+	s32 ud_imm;
 
 	/*
 	 * Normally @regs are unpoisoned by irqentry_enter(), but handle_bug()
@@ -267,7 +292,7 @@ static noinstr bool handle_bug(struct pt
 	 * irqentry_enter().
 	 */
 	kmsan_unpoison_entry_regs(regs);
-	ud_type = decode_bug(regs->ip, &imm);
+	ud_type = decode_bug(regs->ip, &ud_imm, &ud_len);
 	if (ud_type == BUG_NONE)
 		return handled;
 
@@ -281,15 +306,35 @@ static noinstr bool handle_bug(struct pt
 	 */
 	if (regs->flags & X86_EFLAGS_IF)
 		raw_local_irq_enable();
-	if (ud_type == BUG_UD2) {
+
+	switch (ud_type) {
+	case BUG_UD2:
 		if (report_bug(regs->ip, regs) == BUG_TRAP_TYPE_WARN ||
 		    handle_cfi_failure(regs) == BUG_TRAP_TYPE_WARN) {
-			regs->ip += LEN_UD2;
+			regs->ip += ud_len;
+			handled = true;
+		}
+		break;
+
+	case BUG_UD1_UBSAN:
+		if (IS_ENABLED(CONFIG_UBSAN_TRAP)) {
+			pr_crit("%s at %pS\n",
+				report_ubsan_failure(regs, ud_imm),
+				(void *)regs->ip);
+		}
+		break;
+
+	case BUG_UD1_ENDBR:
+		if (__do_kernel_cp_fault(regs)) {
+			regs->ip += ud_len;
 			handled = true;
 		}
-	} else if (IS_ENABLED(CONFIG_UBSAN_TRAP)) {
-		pr_crit("%s at %pS\n", report_ubsan_failure(regs, imm), (void *)regs->ip);
+		break;
+
+	default:
+		break;
 	}
+
 	if (regs->flags & X86_EFLAGS_IF)
 		raw_local_irq_disable();
 	instrumentation_end();



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 07/14] x86/ibt: Clean up is_endbr()
  2024-09-27 19:48 [PATCH 00/14] x86/ibt: FineIBT-BHI Peter Zijlstra
                   ` (5 preceding siblings ...)
  2024-09-27 19:49 ` [PATCH 06/14] x86/traps: Prepare for ENDBR poison UD1 usage Peter Zijlstra
@ 2024-09-27 19:49 ` Peter Zijlstra
  2024-09-28  0:04   ` Josh Poimboeuf
  2024-09-29 17:32   ` Alexei Starovoitov
  2024-09-27 19:49 ` [PATCH 08/14] x86/ibt: Clean up poison_endbr() Peter Zijlstra
                   ` (6 subsequent siblings)
  13 siblings, 2 replies; 60+ messages in thread
From: Peter Zijlstra @ 2024-09-27 19:49 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, peterz, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jpoimboe, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov,
	Andrii Nakryiko

Pretty much every caller of is_endbr() actually wants to test something at an
address and ends up doing get_kernel_nofault(). Fold the lot into a more
convenient helper.

Note: this effectively reverts commit a8497506cd2c ("bpf: Avoid
get_kernel_nofault() to fetch kprobe entry IP") which was entirely the
wrong way to go about doing things. The right solution is to optimize
get_kernel_nofault() itself, it really doesn't need STAC/CLAC nor the
speculation barrier. Using __get_user is a historical hack, not a
requirement.

Cc: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/events/core.c         |    2 +-
 arch/x86/include/asm/ibt.h     |    5 +++--
 arch/x86/kernel/alternative.c  |   19 +++++++++++++------
 arch/x86/kernel/kprobes/core.c |   11 +----------
 arch/x86/net/bpf_jit_comp.c    |    4 ++--
 kernel/trace/bpf_trace.c       |   14 ++------------
 6 files changed, 22 insertions(+), 33 deletions(-)

--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2845,7 +2845,7 @@ static bool is_uprobe_at_func_entry(stru
 		return true;
 
 	/* endbr64 (64-bit only) */
-	if (user_64bit_mode(regs) && is_endbr(*(u32 *)auprobe->insn))
+	if (user_64bit_mode(regs) && is_endbr((u32 *)auprobe->insn))
 		return true;
 
 	return false;
--- a/arch/x86/include/asm/ibt.h
+++ b/arch/x86/include/asm/ibt.h
@@ -65,7 +65,7 @@ static __always_inline __attribute_const
 	return 0x001f0f66; /* osp nopl (%rax) */
 }
 
-static inline bool is_endbr(u32 val)
+static inline bool __is_endbr(u32 val)
 {
 	if (val == gen_endbr_poison())
 		return true;
@@ -74,6 +74,7 @@ static inline bool is_endbr(u32 val)
 	return val == gen_endbr();
 }
 
+extern __noendbr bool is_endbr(u32 *val);
 extern __noendbr u64 ibt_save(bool disable);
 extern __noendbr void ibt_restore(u64 save);
 
@@ -102,7 +103,7 @@ extern bool __do_kernel_cp_fault(struct
 
 #define __noendbr
 
-static inline bool is_endbr(u32 val) { return false; }
+static inline bool is_endbr(u32 *val) { return false; }
 
 static inline u64 ibt_save(bool disable) { return 0; }
 static inline void ibt_restore(u64 save) { }
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -852,16 +852,23 @@ void __init_or_module noinline apply_ret
 
 #ifdef CONFIG_X86_KERNEL_IBT
 
+__noendbr bool is_endbr(u32 *val)
+{
+	u32 endbr;
+
+	if (get_kernel_nofault(endbr, val))
+		return false;
+
+	return __is_endbr(endbr);
+}
+
 static void poison_cfi(void *addr);
 
 static void __init_or_module poison_endbr(void *addr, bool warn)
 {
-	u32 endbr, poison = gen_endbr_poison();
-
-	if (WARN_ON_ONCE(get_kernel_nofault(endbr, addr)))
-		return;
+	u32 poison = gen_endbr_poison();
 
-	if (!is_endbr(endbr)) {
+	if (!is_endbr(addr)) {
 		WARN_ON_ONCE(warn);
 		return;
 	}
@@ -988,7 +995,7 @@ static u32  cfi_seed __ro_after_init;
 static u32 cfi_rehash(u32 hash)
 {
 	hash ^= cfi_seed;
-	while (unlikely(is_endbr(hash) || is_endbr(-hash))) {
+	while (unlikely(__is_endbr(hash) || __is_endbr(-hash))) {
 		bool lsb = hash & 1;
 		hash >>= 1;
 		if (lsb)
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -373,16 +373,7 @@ static bool can_probe(unsigned long padd
 kprobe_opcode_t *arch_adjust_kprobe_addr(unsigned long addr, unsigned long offset,
 					 bool *on_func_entry)
 {
-	u32 insn;
-
-	/*
-	 * Since 'addr' is not guaranteed to be safe to access, use
-	 * copy_from_kernel_nofault() to read the instruction:
-	 */
-	if (copy_from_kernel_nofault(&insn, (void *)addr, sizeof(u32)))
-		return NULL;
-
-	if (is_endbr(insn)) {
+	if (is_endbr((u32 *)addr)) {
 		*on_func_entry = !offset || offset == 4;
 		if (*on_func_entry)
 			offset = 4;
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -625,7 +625,7 @@ int bpf_arch_text_poke(void *ip, enum bp
 	 * See emit_prologue(), for IBT builds the trampoline hook is preceded
 	 * with an ENDBR instruction.
 	 */
-	if (is_endbr(*(u32 *)ip))
+	if (is_endbr(ip))
 		ip += ENDBR_INSN_SIZE;
 
 	return __bpf_arch_text_poke(ip, t, old_addr, new_addr);
@@ -2971,7 +2971,7 @@ static int __arch_prepare_bpf_trampoline
 		/* skip patched call instruction and point orig_call to actual
 		 * body of the kernel function.
 		 */
-		if (is_endbr(*(u32 *)orig_call))
+		if (is_endbr(orig_call))
 			orig_call += ENDBR_INSN_SIZE;
 		orig_call += X86_PATCH_SIZE;
 	}
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1027,19 +1027,9 @@ static const struct bpf_func_proto bpf_g
 #ifdef CONFIG_X86_KERNEL_IBT
 static unsigned long get_entry_ip(unsigned long fentry_ip)
 {
-	u32 instr;
+	if (is_endbr((void *)(fentry_ip - ENDBR_INSN_SIZE)))
+		return fentry_ip - ENDBR_INSN_SIZE;
 
-	/* We want to be extra safe in case entry ip is on the page edge,
-	 * but otherwise we need to avoid get_kernel_nofault()'s overhead.
-	 */
-	if ((fentry_ip & ~PAGE_MASK) < ENDBR_INSN_SIZE) {
-		if (get_kernel_nofault(instr, (u32 *)(fentry_ip - ENDBR_INSN_SIZE)))
-			return fentry_ip;
-	} else {
-		instr = *(u32 *)(fentry_ip - ENDBR_INSN_SIZE);
-	}
-	if (is_endbr(instr))
-		fentry_ip -= ENDBR_INSN_SIZE;
 	return fentry_ip;
 }
 #else



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 07/14] x86/ibt: Clean up is_endbr()
  2024-09-27 19:49 ` [PATCH 07/14] x86/ibt: Clean up is_endbr() Peter Zijlstra
@ 2024-09-28  0:04   ` Josh Poimboeuf
  2024-09-28 13:08     ` Peter Zijlstra
  2024-09-29 17:32   ` Alexei Starovoitov
  1 sibling, 1 reply; 60+ messages in thread
From: Josh Poimboeuf @ 2024-09-28  0:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov,
	Andrii Nakryiko

On Fri, Sep 27, 2024 at 09:49:03PM +0200, Peter Zijlstra wrote:
> Pretty much every caller of is_endbr() actually wants to test something at an
> address and ends up doing get_kernel_nofault(). Fold the lot into a more
> convenient helper.
> 
> Note: this effectively reverts commit a8497506cd2c ("bpf: Avoid
> get_kernel_nofault() to fetch kprobe entry IP") which was entirely the
> wrong way to go about doing things. The right solution is to optimize
> get_kernel_nofault() itself, it really doesn't need STAC/CLAC nor the
> speculation barrier. Using __get_user is a historical hack, not a
> requirement.

But these patches don't actually optimize get_kernel_nofault()?

-- 
Josh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 07/14] x86/ibt: Clean up is_endbr()
  2024-09-28  0:04   ` Josh Poimboeuf
@ 2024-09-28 13:08     ` Peter Zijlstra
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Zijlstra @ 2024-09-28 13:08 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: x86, linux-kernel, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov,
	Andrii Nakryiko

On Fri, Sep 27, 2024 at 05:04:44PM -0700, Josh Poimboeuf wrote:
> On Fri, Sep 27, 2024 at 09:49:03PM +0200, Peter Zijlstra wrote:
> > Pretty much every caller of is_endbr() actually wants to test something at an
> > address and ends up doing get_kernel_nofault(). Fold the lot into a more
> > convenient helper.
> > 
> > Note: this effectively reverts commit a8497506cd2c ("bpf: Avoid
> > get_kernel_nofault() to fetch kprobe entry IP") which was entirely the
> > wrong way to go about doing things. The right solution is to optimize
> > get_kernel_nofault() itself, it really doesn't need STAC/CLAC nor the
> > speculation barrier. Using __get_user is a historical hack, not a
> > requirement.
> 
> But these patches don't actually optimize get_kernel_nofault()?

No, I figured there was enough there already. Also, given the state I
was in, I'd probably get it wrong.

I have it on a todo list somewhere though. It shouldn't be too hard.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 07/14] x86/ibt: Clean up is_endbr()
  2024-09-27 19:49 ` [PATCH 07/14] x86/ibt: Clean up is_endbr() Peter Zijlstra
  2024-09-28  0:04   ` Josh Poimboeuf
@ 2024-09-29 17:32   ` Alexei Starovoitov
  2024-09-30  8:30     ` Peter Zijlstra
  1 sibling, 1 reply; 60+ messages in thread
From: Alexei Starovoitov @ 2024-09-29 17:32 UTC (permalink / raw)
  To: Peter Zijlstra, Andrii Nakryiko, Jiri Olsa
  Cc: X86 ML, LKML, alyssa.milburn, scott.d.constable, Joao Moreira,
	Andrew Cooper, Josh Poimboeuf, Jose E. Marchesi, H.J. Lu,
	Nick Desaulniers, Sami Tolvanen, Nathan Chancellor, ojeda,
	Kees Cook

On Fri, Sep 27, 2024 at 12:50 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Pretty much every caller of is_endbr() actually wants to test something at an
> address and ends up doing get_kernel_nofault(). Fold the lot into a more
> convenient helper.
>
> Note: this effectively reverts commit a8497506cd2c ("bpf: Avoid
> get_kernel_nofault() to fetch kprobe entry IP") which was entirely the
> wrong way to go about doing things. The right solution is to optimize
> get_kernel_nofault() itself, it really doesn't need STAC/CLAC nor the
> speculation barrier. Using __get_user is a historical hack, not a
> requirement.
>
> Cc: Andrii Nakryiko <andrii@kernel.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/x86/events/core.c         |    2 +-
>  arch/x86/include/asm/ibt.h     |    5 +++--
>  arch/x86/kernel/alternative.c  |   19 +++++++++++++------
>  arch/x86/kernel/kprobes/core.c |   11 +----------
>  arch/x86/net/bpf_jit_comp.c    |    4 ++--
>  kernel/trace/bpf_trace.c       |   14 ++------------
>  6 files changed, 22 insertions(+), 33 deletions(-)
>
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2845,7 +2845,7 @@ static bool is_uprobe_at_func_entry(stru
>                 return true;
>
>         /* endbr64 (64-bit only) */
> -       if (user_64bit_mode(regs) && is_endbr(*(u32 *)auprobe->insn))
> +       if (user_64bit_mode(regs) && is_endbr((u32 *)auprobe->insn))
>                 return true;
>
>         return false;
> --- a/arch/x86/include/asm/ibt.h
> +++ b/arch/x86/include/asm/ibt.h
> @@ -65,7 +65,7 @@ static __always_inline __attribute_const
>         return 0x001f0f66; /* osp nopl (%rax) */
>  }
>
> -static inline bool is_endbr(u32 val)
> +static inline bool __is_endbr(u32 val)
>  {
>         if (val == gen_endbr_poison())
>                 return true;
> @@ -74,6 +74,7 @@ static inline bool is_endbr(u32 val)
>         return val == gen_endbr();
>  }
>
> +extern __noendbr bool is_endbr(u32 *val);
>  extern __noendbr u64 ibt_save(bool disable);
>  extern __noendbr void ibt_restore(u64 save);
>
> @@ -102,7 +103,7 @@ extern bool __do_kernel_cp_fault(struct
>
>  #define __noendbr
>
> -static inline bool is_endbr(u32 val) { return false; }
> +static inline bool is_endbr(u32 *val) { return false; }
>
>  static inline u64 ibt_save(bool disable) { return 0; }
>  static inline void ibt_restore(u64 save) { }
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -852,16 +852,23 @@ void __init_or_module noinline apply_ret
>
>  #ifdef CONFIG_X86_KERNEL_IBT
>
> +__noendbr bool is_endbr(u32 *val)
> +{
> +       u32 endbr;
> +
> +       if (get_kernel_nofault(endbr, val))
> +               return false;
> +
> +       return __is_endbr(endbr);
> +}
> +
>  static void poison_cfi(void *addr);
>
>  static void __init_or_module poison_endbr(void *addr, bool warn)
>  {
> -       u32 endbr, poison = gen_endbr_poison();
> -
> -       if (WARN_ON_ONCE(get_kernel_nofault(endbr, addr)))
> -               return;
> +       u32 poison = gen_endbr_poison();
>
> -       if (!is_endbr(endbr)) {
> +       if (!is_endbr(addr)) {
>                 WARN_ON_ONCE(warn);
>                 return;
>         }
> @@ -988,7 +995,7 @@ static u32  cfi_seed __ro_after_init;
>  static u32 cfi_rehash(u32 hash)
>  {
>         hash ^= cfi_seed;
> -       while (unlikely(is_endbr(hash) || is_endbr(-hash))) {
> +       while (unlikely(__is_endbr(hash) || __is_endbr(-hash))) {
>                 bool lsb = hash & 1;
>                 hash >>= 1;
>                 if (lsb)
> --- a/arch/x86/kernel/kprobes/core.c
> +++ b/arch/x86/kernel/kprobes/core.c
> @@ -373,16 +373,7 @@ static bool can_probe(unsigned long padd
>  kprobe_opcode_t *arch_adjust_kprobe_addr(unsigned long addr, unsigned long offset,
>                                          bool *on_func_entry)
>  {
> -       u32 insn;
> -
> -       /*
> -        * Since 'addr' is not guaranteed to be safe to access, use
> -        * copy_from_kernel_nofault() to read the instruction:
> -        */
> -       if (copy_from_kernel_nofault(&insn, (void *)addr, sizeof(u32)))
> -               return NULL;
> -
> -       if (is_endbr(insn)) {
> +       if (is_endbr((u32 *)addr)) {
>                 *on_func_entry = !offset || offset == 4;
>                 if (*on_func_entry)
>                         offset = 4;
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -625,7 +625,7 @@ int bpf_arch_text_poke(void *ip, enum bp
>          * See emit_prologue(), for IBT builds the trampoline hook is preceded
>          * with an ENDBR instruction.
>          */
> -       if (is_endbr(*(u32 *)ip))
> +       if (is_endbr(ip))
>                 ip += ENDBR_INSN_SIZE;
>
>         return __bpf_arch_text_poke(ip, t, old_addr, new_addr);
> @@ -2971,7 +2971,7 @@ static int __arch_prepare_bpf_trampoline
>                 /* skip patched call instruction and point orig_call to actual
>                  * body of the kernel function.
>                  */
> -               if (is_endbr(*(u32 *)orig_call))
> +               if (is_endbr(orig_call))
>                         orig_call += ENDBR_INSN_SIZE;
>                 orig_call += X86_PATCH_SIZE;
>         }
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -1027,19 +1027,9 @@ static const struct bpf_func_proto bpf_g
>  #ifdef CONFIG_X86_KERNEL_IBT
>  static unsigned long get_entry_ip(unsigned long fentry_ip)
>  {
> -       u32 instr;
> +       if (is_endbr((void *)(fentry_ip - ENDBR_INSN_SIZE)))
> +               return fentry_ip - ENDBR_INSN_SIZE;
>
> -       /* We want to be extra safe in case entry ip is on the page edge,
> -        * but otherwise we need to avoid get_kernel_nofault()'s overhead.
> -        */
> -       if ((fentry_ip & ~PAGE_MASK) < ENDBR_INSN_SIZE) {
> -               if (get_kernel_nofault(instr, (u32 *)(fentry_ip - ENDBR_INSN_SIZE)))
> -                       return fentry_ip;
> -       } else {
> -               instr = *(u32 *)(fentry_ip - ENDBR_INSN_SIZE);
> -       }
> -       if (is_endbr(instr))
> -               fentry_ip -= ENDBR_INSN_SIZE;
>         return fentry_ip;

Pls don't.

This re-introduces the overhead that we want to avoid.

Just call __is_endbr() here and keep the optimization.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 07/14] x86/ibt: Clean up is_endbr()
  2024-09-29 17:32   ` Alexei Starovoitov
@ 2024-09-30  8:30     ` Peter Zijlstra
  2024-09-30  9:33       ` Peter Zijlstra
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2024-09-30  8:30 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, Jiri Olsa, X86 ML, LKML, alyssa.milburn,
	scott.d.constable, Joao Moreira, Andrew Cooper, Josh Poimboeuf,
	Jose E. Marchesi, H.J. Lu, Nick Desaulniers, Sami Tolvanen,
	Nathan Chancellor, ojeda, Kees Cook

On Sun, Sep 29, 2024 at 10:32:38AM -0700, Alexei Starovoitov wrote:
> On Fri, Sep 27, 2024 at 12:50 PM Peter Zijlstra <peterz@infradead.org> wrote:

> > --- a/kernel/trace/bpf_trace.c
> > +++ b/kernel/trace/bpf_trace.c
> > @@ -1027,19 +1027,9 @@ static const struct bpf_func_proto bpf_g
> >  #ifdef CONFIG_X86_KERNEL_IBT
> >  static unsigned long get_entry_ip(unsigned long fentry_ip)
> >  {
> > -       u32 instr;
> > +       if (is_endbr((void *)(fentry_ip - ENDBR_INSN_SIZE)))
> > +               return fentry_ip - ENDBR_INSN_SIZE;
> >
> > -       /* We want to be extra safe in case entry ip is on the page edge,
> > -        * but otherwise we need to avoid get_kernel_nofault()'s overhead.
> > -        */
> > -       if ((fentry_ip & ~PAGE_MASK) < ENDBR_INSN_SIZE) {
> > -               if (get_kernel_nofault(instr, (u32 *)(fentry_ip - ENDBR_INSN_SIZE)))
> > -                       return fentry_ip;
> > -       } else {
> > -               instr = *(u32 *)(fentry_ip - ENDBR_INSN_SIZE);
> > -       }
> > -       if (is_endbr(instr))
> > -               fentry_ip -= ENDBR_INSN_SIZE;
> >         return fentry_ip;
> 
> Pls don't.
> 
> This re-introduces the overhead that we want to avoid.
> 
> Just call __is_endbr() here and keep the optimization.

Well, I could do that ofcourse, but as I wrote elsewhere, the right
thing to do is to optimize get_kernel_nofault(), its current
implementation is needlessly expensive. All we really need is a load
with an exception entry, the STAC/CLAC and speculation nonsense should
not be needed.

Fixing get_kernel_nofault() benefits all users.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 07/14] x86/ibt: Clean up is_endbr()
  2024-09-30  8:30     ` Peter Zijlstra
@ 2024-09-30  9:33       ` Peter Zijlstra
  2024-09-30 16:43         ` Alexei Starovoitov
  2024-09-30 20:58         ` Andrii Nakryiko
  0 siblings, 2 replies; 60+ messages in thread
From: Peter Zijlstra @ 2024-09-30  9:33 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, Jiri Olsa, X86 ML, LKML, alyssa.milburn,
	scott.d.constable, Joao Moreira, Andrew Cooper, Josh Poimboeuf,
	Jose E. Marchesi, H.J. Lu, Nick Desaulniers, Sami Tolvanen,
	Nathan Chancellor, ojeda, Kees Cook

On Mon, Sep 30, 2024 at 10:30:26AM +0200, Peter Zijlstra wrote:
> On Sun, Sep 29, 2024 at 10:32:38AM -0700, Alexei Starovoitov wrote:
> > On Fri, Sep 27, 2024 at 12:50 PM Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > --- a/kernel/trace/bpf_trace.c
> > > +++ b/kernel/trace/bpf_trace.c
> > > @@ -1027,19 +1027,9 @@ static const struct bpf_func_proto bpf_g
> > >  #ifdef CONFIG_X86_KERNEL_IBT
> > >  static unsigned long get_entry_ip(unsigned long fentry_ip)
> > >  {
> > > -       u32 instr;
> > > +       if (is_endbr((void *)(fentry_ip - ENDBR_INSN_SIZE)))
> > > +               return fentry_ip - ENDBR_INSN_SIZE;
> > >
> > > -       /* We want to be extra safe in case entry ip is on the page edge,
> > > -        * but otherwise we need to avoid get_kernel_nofault()'s overhead.
> > > -        */
> > > -       if ((fentry_ip & ~PAGE_MASK) < ENDBR_INSN_SIZE) {
> > > -               if (get_kernel_nofault(instr, (u32 *)(fentry_ip - ENDBR_INSN_SIZE)))
> > > -                       return fentry_ip;
> > > -       } else {
> > > -               instr = *(u32 *)(fentry_ip - ENDBR_INSN_SIZE);
> > > -       }
> > > -       if (is_endbr(instr))
> > > -               fentry_ip -= ENDBR_INSN_SIZE;
> > >         return fentry_ip;
> > 
> > Pls don't.
> > 
> > This re-introduces the overhead that we want to avoid.
> > 
> > Just call __is_endbr() here and keep the optimization.
> 
> Well, I could do that ofcourse, but as I wrote elsewhere, the right
> thing to do is to optimize get_kernel_nofault(), its current
> implementation is needlessly expensive. All we really need is a load
> with an exception entry, the STAC/CLAC and speculation nonsense should
> not be needed.

Looking at things, something like the below actually generates sane
code:

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index a582cd25ca87..84f65ee9736c 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1029,17 +1029,10 @@ static unsigned long get_entry_ip(unsigned long fentry_ip)
 {
 	u32 instr;
 
-	/* We want to be extra safe in case entry ip is on the page edge,
-	 * but otherwise we need to avoid get_kernel_nofault()'s overhead.
-	 */
-	if ((fentry_ip & ~PAGE_MASK) < ENDBR_INSN_SIZE) {
-		if (get_kernel_nofault(instr, (u32 *)(fentry_ip - ENDBR_INSN_SIZE)))
-			return fentry_ip;
-	} else {
-		instr = *(u32 *)(fentry_ip - ENDBR_INSN_SIZE);
-	}
+	__get_kernel_nofault(&instr, (u32 *)(fentry_ip - ENDBR_INSN_SIZE), u32, Efault);
 	if (is_endbr(instr))
 		fentry_ip -= ENDBR_INSN_SIZE;
+Efault:
 	return fentry_ip;
 }
 #else


Which then leads to me rewriting the proposed patch as...

---
Subject: x86/ibt: Clean up is_endbr()
From: Peter Zijlstra <peterz@infradead.org>

Pretty much every caller of is_endbr() actually wants to test something at an
address and ends up doing get_kernel_nofault(). Fold the lot into a more
convenient helper.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/events/core.c         |    2 +-
 arch/x86/include/asm/ibt.h     |    5 +++--
 arch/x86/kernel/alternative.c  |   20 ++++++++++++++------
 arch/x86/kernel/kprobes/core.c |   11 +----------
 arch/x86/net/bpf_jit_comp.c    |    4 ++--
 kernel/trace/bpf_trace.c       |   14 ++------------
 6 files changed, 23 insertions(+), 33 deletions(-)

--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2845,7 +2845,7 @@ static bool is_uprobe_at_func_entry(stru
 		return true;
 
 	/* endbr64 (64-bit only) */
-	if (user_64bit_mode(regs) && is_endbr(*(u32 *)auprobe->insn))
+	if (user_64bit_mode(regs) && is_endbr((u32 *)auprobe->insn))
 		return true;
 
 	return false;
--- a/arch/x86/include/asm/ibt.h
+++ b/arch/x86/include/asm/ibt.h
@@ -65,7 +65,7 @@ static __always_inline __attribute_const
 	return 0x001f0f66; /* osp nopl (%rax) */
 }
 
-static inline bool is_endbr(u32 val)
+static inline bool __is_endbr(u32 val)
 {
 	if (val == gen_endbr_poison())
 		return true;
@@ -74,6 +74,7 @@ static inline bool is_endbr(u32 val)
 	return val == gen_endbr();
 }
 
+extern __noendbr bool is_endbr(u32 *val);
 extern __noendbr u64 ibt_save(bool disable);
 extern __noendbr void ibt_restore(u64 save);
 
@@ -102,7 +103,7 @@ extern bool __do_kernel_cp_fault(struct
 
 #define __noendbr
 
-static inline bool is_endbr(u32 val) { return false; }
+static inline bool is_endbr(u32 *val) { return false; }
 
 static inline u64 ibt_save(bool disable) { return 0; }
 static inline void ibt_restore(u64 save) { }
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -852,16 +852,24 @@ void __init_or_module noinline apply_ret
 
 #ifdef CONFIG_X86_KERNEL_IBT
 
+__noendbr bool is_endbr(u32 *val)
+{
+	u32 endbr;
+
+	__get_kernel_nofault(&endbr, val, u32, Efault);
+	return __is_endbr(endbr);
+
+Efault:
+	return false;
+}
+
 static void poison_cfi(void *addr);
 
 static void __init_or_module poison_endbr(void *addr, bool warn)
 {
-	u32 endbr, poison = gen_endbr_poison();
-
-	if (WARN_ON_ONCE(get_kernel_nofault(endbr, addr)))
-		return;
+	u32 poison = gen_endbr_poison();
 
-	if (!is_endbr(endbr)) {
+	if (!is_endbr(addr)) {
 		WARN_ON_ONCE(warn);
 		return;
 	}
@@ -988,7 +996,7 @@ static u32  cfi_seed __ro_after_init;
 static u32 cfi_rehash(u32 hash)
 {
 	hash ^= cfi_seed;
-	while (unlikely(is_endbr(hash) || is_endbr(-hash))) {
+	while (unlikely(__is_endbr(hash) || __is_endbr(-hash))) {
 		bool lsb = hash & 1;
 		hash >>= 1;
 		if (lsb)
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -373,16 +373,7 @@ static bool can_probe(unsigned long padd
 kprobe_opcode_t *arch_adjust_kprobe_addr(unsigned long addr, unsigned long offset,
 					 bool *on_func_entry)
 {
-	u32 insn;
-
-	/*
-	 * Since 'addr' is not guaranteed to be safe to access, use
-	 * copy_from_kernel_nofault() to read the instruction:
-	 */
-	if (copy_from_kernel_nofault(&insn, (void *)addr, sizeof(u32)))
-		return NULL;
-
-	if (is_endbr(insn)) {
+	if (is_endbr((u32 *)addr)) {
 		*on_func_entry = !offset || offset == 4;
 		if (*on_func_entry)
 			offset = 4;
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -625,7 +625,7 @@ int bpf_arch_text_poke(void *ip, enum bp
 	 * See emit_prologue(), for IBT builds the trampoline hook is preceded
 	 * with an ENDBR instruction.
 	 */
-	if (is_endbr(*(u32 *)ip))
+	if (is_endbr(ip))
 		ip += ENDBR_INSN_SIZE;
 
 	return __bpf_arch_text_poke(ip, t, old_addr, new_addr);
@@ -2971,7 +2971,7 @@ static int __arch_prepare_bpf_trampoline
 		/* skip patched call instruction and point orig_call to actual
 		 * body of the kernel function.
 		 */
-		if (is_endbr(*(u32 *)orig_call))
+		if (is_endbr(orig_call))
 			orig_call += ENDBR_INSN_SIZE;
 		orig_call += X86_PATCH_SIZE;
 	}
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1027,19 +1027,9 @@ static const struct bpf_func_proto bpf_g
 #ifdef CONFIG_X86_KERNEL_IBT
 static unsigned long get_entry_ip(unsigned long fentry_ip)
 {
-	u32 instr;
+	if (is_endbr((void *)(fentry_ip - ENDBR_INSN_SIZE)))
+		return fentry_ip - ENDBR_INSN_SIZE;
 
-	/* We want to be extra safe in case entry ip is on the page edge,
-	 * but otherwise we need to avoid get_kernel_nofault()'s overhead.
-	 */
-	if ((fentry_ip & ~PAGE_MASK) < ENDBR_INSN_SIZE) {
-		if (get_kernel_nofault(instr, (u32 *)(fentry_ip - ENDBR_INSN_SIZE)))
-			return fentry_ip;
-	} else {
-		instr = *(u32 *)(fentry_ip - ENDBR_INSN_SIZE);
-	}
-	if (is_endbr(instr))
-		fentry_ip -= ENDBR_INSN_SIZE;
 	return fentry_ip;
 }
 #else

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 07/14] x86/ibt: Clean up is_endbr()
  2024-09-30  9:33       ` Peter Zijlstra
@ 2024-09-30 16:43         ` Alexei Starovoitov
  2024-09-30 20:58         ` Andrii Nakryiko
  1 sibling, 0 replies; 60+ messages in thread
From: Alexei Starovoitov @ 2024-09-30 16:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrii Nakryiko, Jiri Olsa, X86 ML, LKML, alyssa.milburn,
	scott.d.constable, Joao Moreira, Andrew Cooper, Josh Poimboeuf,
	Jose E. Marchesi, H.J. Lu, Nick Desaulniers, Sami Tolvanen,
	Nathan Chancellor, ojeda, Kees Cook

On Mon, Sep 30, 2024 at 2:33 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
>
> Which then leads to me rewriting the proposed patch as...

...

> +__noendbr bool is_endbr(u32 *val)
> +{
> +       u32 endbr;
> +
> +       __get_kernel_nofault(&endbr, val, u32, Efault);
> +       return __is_endbr(endbr);
> +
> +Efault:
> +       return false;
> +}

That looks much better.

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 07/14] x86/ibt: Clean up is_endbr()
  2024-09-30  9:33       ` Peter Zijlstra
  2024-09-30 16:43         ` Alexei Starovoitov
@ 2024-09-30 20:58         ` Andrii Nakryiko
  1 sibling, 0 replies; 60+ messages in thread
From: Andrii Nakryiko @ 2024-09-30 20:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexei Starovoitov, Andrii Nakryiko, Jiri Olsa, X86 ML, LKML,
	alyssa.milburn, scott.d.constable, Joao Moreira, Andrew Cooper,
	Josh Poimboeuf, Jose E. Marchesi, H.J. Lu, Nick Desaulniers,
	Sami Tolvanen, Nathan Chancellor, ojeda, Kees Cook

On Mon, Sep 30, 2024 at 2:33 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Sep 30, 2024 at 10:30:26AM +0200, Peter Zijlstra wrote:
> > On Sun, Sep 29, 2024 at 10:32:38AM -0700, Alexei Starovoitov wrote:
> > > On Fri, Sep 27, 2024 at 12:50 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > > --- a/kernel/trace/bpf_trace.c
> > > > +++ b/kernel/trace/bpf_trace.c
> > > > @@ -1027,19 +1027,9 @@ static const struct bpf_func_proto bpf_g
> > > >  #ifdef CONFIG_X86_KERNEL_IBT
> > > >  static unsigned long get_entry_ip(unsigned long fentry_ip)
> > > >  {
> > > > -       u32 instr;
> > > > +       if (is_endbr((void *)(fentry_ip - ENDBR_INSN_SIZE)))
> > > > +               return fentry_ip - ENDBR_INSN_SIZE;
> > > >
> > > > -       /* We want to be extra safe in case entry ip is on the page edge,
> > > > -        * but otherwise we need to avoid get_kernel_nofault()'s overhead.
> > > > -        */
> > > > -       if ((fentry_ip & ~PAGE_MASK) < ENDBR_INSN_SIZE) {
> > > > -               if (get_kernel_nofault(instr, (u32 *)(fentry_ip - ENDBR_INSN_SIZE)))
> > > > -                       return fentry_ip;
> > > > -       } else {
> > > > -               instr = *(u32 *)(fentry_ip - ENDBR_INSN_SIZE);
> > > > -       }
> > > > -       if (is_endbr(instr))
> > > > -               fentry_ip -= ENDBR_INSN_SIZE;
> > > >         return fentry_ip;
> > >
> > > Pls don't.
> > >
> > > This re-introduces the overhead that we want to avoid.
> > >
> > > Just call __is_endbr() here and keep the optimization.
> >
> > Well, I could do that ofcourse, but as I wrote elsewhere, the right
> > thing to do is to optimize get_kernel_nofault(), its current
> > implementation is needlessly expensive. All we really need is a load
> > with an exception entry, the STAC/CLAC and speculation nonsense should
> > not be needed.
>
> Looking at things, something like the below actually generates sane
> code:
>
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index a582cd25ca87..84f65ee9736c 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -1029,17 +1029,10 @@ static unsigned long get_entry_ip(unsigned long fentry_ip)
>  {
>         u32 instr;
>
> -       /* We want to be extra safe in case entry ip is on the page edge,
> -        * but otherwise we need to avoid get_kernel_nofault()'s overhead.
> -        */
> -       if ((fentry_ip & ~PAGE_MASK) < ENDBR_INSN_SIZE) {
> -               if (get_kernel_nofault(instr, (u32 *)(fentry_ip - ENDBR_INSN_SIZE)))
> -                       return fentry_ip;
> -       } else {
> -               instr = *(u32 *)(fentry_ip - ENDBR_INSN_SIZE);
> -       }
> +       __get_kernel_nofault(&instr, (u32 *)(fentry_ip - ENDBR_INSN_SIZE), u32, Efault);
>         if (is_endbr(instr))
>                 fentry_ip -= ENDBR_INSN_SIZE;
> +Efault:
>         return fentry_ip;
>  }
>  #else
>
>
> Which then leads to me rewriting the proposed patch as...
>
> ---
> Subject: x86/ibt: Clean up is_endbr()
> From: Peter Zijlstra <peterz@infradead.org>
>
> Pretty much every caller of is_endbr() actually wants to test something at an
> address and ends up doing get_kernel_nofault(). Fold the lot into a more
> convenient helper.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/x86/events/core.c         |    2 +-
>  arch/x86/include/asm/ibt.h     |    5 +++--
>  arch/x86/kernel/alternative.c  |   20 ++++++++++++++------
>  arch/x86/kernel/kprobes/core.c |   11 +----------
>  arch/x86/net/bpf_jit_comp.c    |    4 ++--
>  kernel/trace/bpf_trace.c       |   14 ++------------
>  6 files changed, 23 insertions(+), 33 deletions(-)
>
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2845,7 +2845,7 @@ static bool is_uprobe_at_func_entry(stru
>                 return true;
>
>         /* endbr64 (64-bit only) */
> -       if (user_64bit_mode(regs) && is_endbr(*(u32 *)auprobe->insn))
> +       if (user_64bit_mode(regs) && is_endbr((u32 *)auprobe->insn))
>                 return true;
>
>         return false;
> --- a/arch/x86/include/asm/ibt.h
> +++ b/arch/x86/include/asm/ibt.h
> @@ -65,7 +65,7 @@ static __always_inline __attribute_const
>         return 0x001f0f66; /* osp nopl (%rax) */
>  }
>
> -static inline bool is_endbr(u32 val)
> +static inline bool __is_endbr(u32 val)
>  {
>         if (val == gen_endbr_poison())
>                 return true;
> @@ -74,6 +74,7 @@ static inline bool is_endbr(u32 val)
>         return val == gen_endbr();
>  }
>
> +extern __noendbr bool is_endbr(u32 *val);
>  extern __noendbr u64 ibt_save(bool disable);
>  extern __noendbr void ibt_restore(u64 save);
>
> @@ -102,7 +103,7 @@ extern bool __do_kernel_cp_fault(struct
>
>  #define __noendbr
>
> -static inline bool is_endbr(u32 val) { return false; }
> +static inline bool is_endbr(u32 *val) { return false; }
>
>  static inline u64 ibt_save(bool disable) { return 0; }
>  static inline void ibt_restore(u64 save) { }
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -852,16 +852,24 @@ void __init_or_module noinline apply_ret
>
>  #ifdef CONFIG_X86_KERNEL_IBT
>
> +__noendbr bool is_endbr(u32 *val)
> +{
> +       u32 endbr;
> +
> +       __get_kernel_nofault(&endbr, val, u32, Efault);
> +       return __is_endbr(endbr);
> +
> +Efault:
> +       return false;
> +}
> +
>  static void poison_cfi(void *addr);
>
>  static void __init_or_module poison_endbr(void *addr, bool warn)
>  {
> -       u32 endbr, poison = gen_endbr_poison();
> -
> -       if (WARN_ON_ONCE(get_kernel_nofault(endbr, addr)))
> -               return;
> +       u32 poison = gen_endbr_poison();
>
> -       if (!is_endbr(endbr)) {
> +       if (!is_endbr(addr)) {
>                 WARN_ON_ONCE(warn);
>                 return;
>         }
> @@ -988,7 +996,7 @@ static u32  cfi_seed __ro_after_init;
>  static u32 cfi_rehash(u32 hash)
>  {
>         hash ^= cfi_seed;
> -       while (unlikely(is_endbr(hash) || is_endbr(-hash))) {
> +       while (unlikely(__is_endbr(hash) || __is_endbr(-hash))) {
>                 bool lsb = hash & 1;
>                 hash >>= 1;
>                 if (lsb)
> --- a/arch/x86/kernel/kprobes/core.c
> +++ b/arch/x86/kernel/kprobes/core.c
> @@ -373,16 +373,7 @@ static bool can_probe(unsigned long padd
>  kprobe_opcode_t *arch_adjust_kprobe_addr(unsigned long addr, unsigned long offset,
>                                          bool *on_func_entry)
>  {
> -       u32 insn;
> -
> -       /*
> -        * Since 'addr' is not guaranteed to be safe to access, use
> -        * copy_from_kernel_nofault() to read the instruction:
> -        */
> -       if (copy_from_kernel_nofault(&insn, (void *)addr, sizeof(u32)))
> -               return NULL;
> -
> -       if (is_endbr(insn)) {
> +       if (is_endbr((u32 *)addr)) {
>                 *on_func_entry = !offset || offset == 4;
>                 if (*on_func_entry)
>                         offset = 4;
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -625,7 +625,7 @@ int bpf_arch_text_poke(void *ip, enum bp
>          * See emit_prologue(), for IBT builds the trampoline hook is preceded
>          * with an ENDBR instruction.
>          */
> -       if (is_endbr(*(u32 *)ip))
> +       if (is_endbr(ip))
>                 ip += ENDBR_INSN_SIZE;
>
>         return __bpf_arch_text_poke(ip, t, old_addr, new_addr);
> @@ -2971,7 +2971,7 @@ static int __arch_prepare_bpf_trampoline
>                 /* skip patched call instruction and point orig_call to actual
>                  * body of the kernel function.
>                  */
> -               if (is_endbr(*(u32 *)orig_call))
> +               if (is_endbr(orig_call))
>                         orig_call += ENDBR_INSN_SIZE;
>                 orig_call += X86_PATCH_SIZE;
>         }
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -1027,19 +1027,9 @@ static const struct bpf_func_proto bpf_g
>  #ifdef CONFIG_X86_KERNEL_IBT
>  static unsigned long get_entry_ip(unsigned long fentry_ip)
>  {
> -       u32 instr;
> +       if (is_endbr((void *)(fentry_ip - ENDBR_INSN_SIZE)))
> +               return fentry_ip - ENDBR_INSN_SIZE;
>
> -       /* We want to be extra safe in case entry ip is on the page edge,
> -        * but otherwise we need to avoid get_kernel_nofault()'s overhead.
> -        */
> -       if ((fentry_ip & ~PAGE_MASK) < ENDBR_INSN_SIZE) {
> -               if (get_kernel_nofault(instr, (u32 *)(fentry_ip - ENDBR_INSN_SIZE)))
> -                       return fentry_ip;
> -       } else {
> -               instr = *(u32 *)(fentry_ip - ENDBR_INSN_SIZE);
> -       }
> -       if (is_endbr(instr))
> -               fentry_ip -= ENDBR_INSN_SIZE;
>         return fentry_ip;
>  }
>  #else

LGTM.

Acked-by: Andrii Nakryiko <andrii@kernel.org>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 08/14] x86/ibt: Clean up poison_endbr()
  2024-09-27 19:48 [PATCH 00/14] x86/ibt: FineIBT-BHI Peter Zijlstra
                   ` (6 preceding siblings ...)
  2024-09-27 19:49 ` [PATCH 07/14] x86/ibt: Clean up is_endbr() Peter Zijlstra
@ 2024-09-27 19:49 ` Peter Zijlstra
  2024-09-27 19:49 ` [PATCH 09/14] x86/ibt: Implement IBT+ Peter Zijlstra
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 60+ messages in thread
From: Peter Zijlstra @ 2024-09-27 19:49 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, peterz, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jpoimboe, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

Basically, get rid of the .warn argument and explicitly don't call the
function when we know there isn't an endbr. This makes the calling
code clearer.

Nota: perhaps don't add functions to .cfi_sites when the function
doesn't have endbr -- OTOH why would the compiler emit the prefix if
it has already determined there are no indirect callers and has
omitted the ENDBR instruction.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/alternative.c |   42 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 35 insertions(+), 7 deletions(-)

--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -864,14 +864,12 @@ __noendbr bool is_endbr(u32 *val)
 
 static void poison_cfi(void *addr);
 
-static void __init_or_module poison_endbr(void *addr, bool warn)
+static void __init_or_module poison_endbr(void *addr)
 {
 	u32 poison = gen_endbr_poison();
 
-	if (!is_endbr(addr)) {
-		WARN_ON_ONCE(warn);
+	if (WARN_ON_ONCE(!is_endbr(addr)))
 		return;
-	}
 
 	DPRINTK(ENDBR, "ENDBR at: %pS (%px)", addr, addr);
 
@@ -896,7 +894,7 @@ void __init_or_module noinline apply_sea
 	for (s = start; s < end; s++) {
 		void *addr = (void *)s + *s;
 
-		poison_endbr(addr, true);
+		poison_endbr(addr);
 		if (IS_ENABLED(CONFIG_FINEIBT))
 			poison_cfi(addr - 16);
 	}
@@ -1203,6 +1201,14 @@ static int cfi_rewrite_preamble(s32 *sta
 		void *addr = (void *)s + *s;
 		u32 hash;
 
+		/*
+		 * When the function doesn't start with ENDBR the compiler will
+		 * have determined there are no indirect calls to it and we
+		 * don't need no CFI either.
+		 */
+		if (!is_endbr(addr + 16))
+			continue;
+
 		hash = decode_preamble_hash(addr);
 		if (WARN(!hash, "no CFI hash found at: %pS %px %*ph\n",
 			 addr, addr, 5, addr))
@@ -1223,7 +1229,10 @@ static void cfi_rewrite_endbr(s32 *start
 	for (s = start; s < end; s++) {
 		void *addr = (void *)s + *s;
 
-		poison_endbr(addr+16, false);
+		if (!is_endbr(addr + 16))
+			continue;
+
+		poison_endbr(addr + 16);
 	}
 }
 
@@ -1356,9 +1365,23 @@ static inline void poison_hash(void *add
 
 static void poison_cfi(void *addr)
 {
+	/*
+	 * Compilers manage to be inconsistent with ENDBR vs __cfi prefixes,
+	 * some (static) functions for which they can determine the address
+	 * is never taken do not get a __cfi prefix, but *DO* get an ENDBR.
+	 *
+	 * As such, these functions will get sealed, but we need to be careful
+	 * to not unconditionally scribble the previous function.
+	 */
 	switch (cfi_mode) {
 	case CFI_FINEIBT:
 		/*
+		 * FineIBT prefix should start with an ENDBR.
+		 */
+		if (!is_endbr(addr))
+			break;
+
+		/*
 		 * __cfi_\func:
 		 *	osp nopl (%rax)
 		 *	subl	$0, %r10d
@@ -1366,12 +1389,17 @@ static void poison_cfi(void *addr)
 		 *	ud2
 		 * 1:	nop
 		 */
-		poison_endbr(addr, false);
+		poison_endbr(addr);
 		poison_hash(addr + fineibt_preamble_hash);
 		break;
 
 	case CFI_KCFI:
 		/*
+		 * kCFI prefix should start with a valid hash.
+		 */
+		if (!decode_preamble_hash(addr))
+			break;
+		/*
 		 * __cfi_\func:
 		 *	movl	$0, %eax
 		 *	.skip	11, 0x90



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 09/14] x86/ibt: Implement IBT+
  2024-09-27 19:48 [PATCH 00/14] x86/ibt: FineIBT-BHI Peter Zijlstra
                   ` (7 preceding siblings ...)
  2024-09-27 19:49 ` [PATCH 08/14] x86/ibt: Clean up poison_endbr() Peter Zijlstra
@ 2024-09-27 19:49 ` Peter Zijlstra
  2024-09-28  1:07   ` Josh Poimboeuf
                     ` (2 more replies)
  2024-09-27 19:49 ` [PATCH 10/14] x86/early_printk: Harden early_serial Peter Zijlstra
                   ` (4 subsequent siblings)
  13 siblings, 3 replies; 60+ messages in thread
From: Peter Zijlstra @ 2024-09-27 19:49 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, peterz, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jpoimboe, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

As originally conceived, use UD1 based poison to seal IBT. This is also fatal
on !IBT hardware.

This requires all direct (tail) calls avoid ever touching ENDBR. To that
purpose rewrite them using the .call_sites and .tail_call_sites sections from
objtool --direct-call.

Since this is a wee bit tricky, stick this in a 'def_bool y' config option.

This again stacks 3 layers of relocation, just like the earlier callthunk patch.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/Kconfig                   |    4 +
 arch/x86/include/asm/alternative.h |    1 
 arch/x86/include/asm/cfi.h         |   14 ++---
 arch/x86/include/asm/ibt.h         |    9 +++
 arch/x86/kernel/alternative.c      |   89 ++++++++++++++++++++++++++++++++++++-
 arch/x86/kernel/module.c           |    9 +++
 arch/x86/kernel/static_call.c      |   23 ++++++++-
 arch/x86/net/bpf_jit_comp.c        |    6 ++
 scripts/Makefile.lib               |    1 
 tools/objtool/check.c              |    2 
 10 files changed, 144 insertions(+), 14 deletions(-)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1866,6 +1866,10 @@ config X86_KERNEL_IBT
 	  does significantly reduce the number of ENDBR instructions in the
 	  kernel image.
 
+config X86_KERNEL_IBT_PLUS
+	depends on X86_KERNEL_IBT
+	def_bool y
+
 config X86_INTEL_MEMORY_PROTECTION_KEYS
 	prompt "Memory Protection Keys"
 	def_bool y
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -98,6 +98,7 @@ extern struct alt_instr __alt_instructio
 extern int alternatives_patched;
 
 extern void alternative_instructions(void);
+extern void apply_direct_call_offset(s32 *start, s32 *end);
 extern void apply_alternatives(struct alt_instr *start, struct alt_instr *end);
 extern void apply_retpolines(s32 *start, s32 *end);
 extern void apply_returns(s32 *start, s32 *end);
--- a/arch/x86/include/asm/cfi.h
+++ b/arch/x86/include/asm/cfi.h
@@ -31,12 +31,12 @@
  * IBT:
  *
  * foo:
- *   endbr64
+ *   endbr64 / osp nop3 / ud1 0x0(%eax), %edx
  *   ... code here ...
  *   ret
  *
  * direct caller:
- *   call foo / call foo+4
+ *   call foo / call foo+4	# must be +4 when IBT+
  *
  * indirect caller:
  *   lea foo(%rip), %r11
@@ -50,12 +50,12 @@
  *   movl $0x12345678, %eax
  *				# 11 nops when CONFIG_CALL_PADDING
  * foo:
- *   endbr64			# when IBT
+ *   endbr64 / osp nop3 / ud1	# when IBT
  *   ... code here ...
  *   ret
  *
  * direct call:
- *   call foo			# / call foo+4 when IBT
+ *   call foo / call foo+4	# +4 possible with IBT, mandatory with IBT+
  *
  * indirect call:
  *   lea foo(%rip), %r11
@@ -72,16 +72,16 @@
  * __cfi_foo:
  *   endbr64
  *   subl 0x12345678, %r10d
- *   jz   foo
+ *   jz   foo+4
  *   ud2
  *   nop
  * foo:
- *   osp nop3			# was endbr64
+ *   osp nop3 / ud1 0x0(%eax), %edx	# was endbr64
  *   ... code here ...
  *   ret
  *
  * direct caller:
- *   call foo / call foo+4
+ *   call foo / call foo+4	# must be +4 when IBT+
  *
  * indirect caller:
  *   lea foo(%rip), %r11
--- a/arch/x86/include/asm/ibt.h
+++ b/arch/x86/include/asm/ibt.h
@@ -58,11 +58,20 @@ static __always_inline __attribute_const
 
 static __always_inline __attribute_const__ u32 gen_endbr_poison(void)
 {
+#ifdef CONFIG_X86_KERNEL_IBT_PLUS
+	/*
+	 * When we rewrite direct calls to +4, the endbr at +0 becomes unused,
+	 * poisong it with a UD1 to trip !IBT hardware and to ensure these
+	 * bytes are really unused.
+	 */
+	return 0x0048b90f; /* ud1 0x0(%eax), %edx */
+#else
 	/*
 	 * 4 byte NOP that isn't NOP4 (in fact it is OSP NOP3), such that it
 	 * will be unique to (former) ENDBR sites.
 	 */
 	return 0x001f0f66; /* osp nopl (%rax) */
+#endif
 }
 
 static inline bool __is_endbr(u32 val)
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -171,6 +171,8 @@ static void add_nop(u8 *buf, unsigned in
 		*buf = INT3_INSN_OPCODE;
 }
 
+extern s32 __call_sites[], __call_sites_end[];
+extern s32 __tail_call_sites[], __tail_call_sites_end[];
 extern s32 __retpoline_sites[], __retpoline_sites_end[];
 extern s32 __return_sites[], __return_sites_end[];
 extern s32 __cfi_sites[], __cfi_sites_end[];
@@ -423,6 +425,19 @@ static int alt_replace_call(u8 *instr, u
 	if (!target)
 		target = bug;
 
+#ifdef CONFIG_X86_KERNEL_IBT_PLUS
+	/*
+	 * apply_direct_call_offset() would have patched the alternative to
+	 * "CALL BUG_func+4" *if* that function has an ENDBR. And *if* target
+	 * also has an ENDBR this all works out. Except if BUG_func() and target()
+	 * do not agree on the having of ENDBR, then things go sideways.
+	 */
+	if (is_endbr(bug))
+		bug += ENDBR_INSN_SIZE;
+	if (is_endbr(target))
+		target += ENDBR_INSN_SIZE;
+#endif
+
 	/* (BUG_func - .) + (target - BUG_func) := target - . */
 	*(s32 *)(insn_buff + 1) += target - bug;
 
@@ -850,6 +865,64 @@ void __init_or_module noinline apply_ret
 
 #endif /* CONFIG_MITIGATION_RETPOLINE && CONFIG_OBJTOOL */
 
+#ifdef CONFIG_X86_KERNEL_IBT_PLUS
+__init_or_module void apply_direct_call_offset(s32 *start, s32 *end)
+{
+	s32 *s;
+
+	/*
+	 * incompatible with call depth tracking
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_CALL_DEPTH))
+		return;
+
+	for (s = start; s < end; s++) {
+		void *dest, *addr = (void *)s + *s;
+		struct insn insn;
+		int ret;
+
+		ret = insn_decode_kernel(&insn, addr);
+		if (WARN_ON_ONCE(ret < 0))
+			continue;
+
+		dest = addr + insn.length + insn.immediate.value;
+		if (!is_endbr(dest))
+			continue;
+
+		switch (insn.opcode.bytes[0]) {
+		case CALL_INSN_OPCODE:
+		case JMP32_INSN_OPCODE:
+			apply_reloc(4, addr+1, 4);
+			continue;
+
+		case JMP8_INSN_OPCODE:
+		case 0x70 ... 0x7f: /* Jcc.d8 */
+			apply_reloc(1, addr+1, 4);
+			continue;
+
+		case 0x0f:
+			switch (insn.opcode.bytes[1]) {
+			case 0x80 ... 0x8f:
+				apply_reloc(4, addr+2, 4);
+				continue;
+
+			default:
+				break;
+			}
+			break;
+
+		default:
+			break;
+		}
+
+		printk("at: %pS, instruction: %*ph\n", addr, insn.length, addr);
+		BUG();
+	}
+}
+#else
+__init_or_module void apply_direct_call_offset(s32 *start, s32 *end) { }
+#endif
+
 #ifdef CONFIG_X86_KERNEL_IBT
 
 __noendbr bool is_endbr(u32 *val)
@@ -1067,6 +1140,7 @@ asm(	".pushsection .rodata			\n"
 	"fineibt_preamble_start:		\n"
 	"	endbr64				\n"
 	"	subl	$0x12345678, %r10d	\n"
+	"fineibt_preamble_jcc:			\n"
 	"	je	fineibt_preamble_end	\n"
 	"	ud2				\n"
 	"	nop				\n"
@@ -1075,10 +1149,13 @@ asm(	".pushsection .rodata			\n"
 );
 
 extern u8 fineibt_preamble_start[];
+extern u8 fineibt_preamble_jcc[];
 extern u8 fineibt_preamble_end[];
 
-#define fineibt_preamble_size (fineibt_preamble_end - fineibt_preamble_start)
-#define fineibt_preamble_hash 7
+#define fineibt_preamble_size	(fineibt_preamble_end - fineibt_preamble_start)
+#define fineibt_preamble_offset	(fineibt_preamble_jcc - fineibt_preamble_start)
+#define fineibt_preamble_hash	(fineibt_preamble_offset - 4)
+#define fineibt_preamble_jccd8	(fineibt_preamble_offset + 1)
 
 asm(	".pushsection .rodata			\n"
 	"fineibt_caller_start:			\n"
@@ -1217,6 +1294,8 @@ static int cfi_rewrite_preamble(s32 *sta
 		text_poke_early(addr, fineibt_preamble_start, fineibt_preamble_size);
 		WARN_ON(*(u32 *)(addr + fineibt_preamble_hash) != 0x12345678);
 		text_poke_early(addr + fineibt_preamble_hash, &hash, 4);
+
+		*(u8 *)(addr + fineibt_preamble_jccd8) += 4;
 	}
 
 	return 0;
@@ -1726,6 +1805,12 @@ void __init alternative_instructions(voi
 	 */
 	paravirt_set_cap();
 
+	/*
+	 * Adjust all (tail) calls to func()+4 to avoid ENDBR.
+	 */
+	apply_direct_call_offset(__call_sites, __call_sites_end);
+	apply_direct_call_offset(__tail_call_sites, __tail_call_sites_end);
+
 	__apply_fineibt(__retpoline_sites, __retpoline_sites_end,
 			__cfi_sites, __cfi_sites_end, true);
 
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -227,7 +227,7 @@ int module_finalize(const Elf_Ehdr *hdr,
 	const Elf_Shdr *s, *alt = NULL, *locks = NULL,
 		*orc = NULL, *orc_ip = NULL,
 		*retpolines = NULL, *returns = NULL, *ibt_endbr = NULL,
-		*calls = NULL, *cfi = NULL;
+		*calls = NULL, *tails = NULL, *cfi = NULL;
 	char *secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset;
 
 	for (s = sechdrs; s < sechdrs + hdr->e_shnum; s++) {
@@ -245,6 +245,8 @@ int module_finalize(const Elf_Ehdr *hdr,
 			returns = s;
 		if (!strcmp(".call_sites", secstrings + s->sh_name))
 			calls = s;
+		if (!strcmp(".tail_call_sites", secstrings + s->sh_name))
+			tails = s;
 		if (!strcmp(".cfi_sites", secstrings + s->sh_name))
 			cfi = s;
 		if (!strcmp(".ibt_endbr_seal", secstrings + s->sh_name))
@@ -284,6 +286,11 @@ int module_finalize(const Elf_Ehdr *hdr,
 		}
 
 		callthunks_patch_module_calls(&cs, me);
+		apply_direct_call_offset(cs.call_start, cs.call_end);
+	}
+	if (tails) {
+		void *tseg = (void *)tails->sh_addr;
+		apply_direct_call_offset(tseg, tseg + tails->sh_size);
 	}
 	if (alt) {
 		/* patch .altinstructions */
--- a/arch/x86/kernel/static_call.c
+++ b/arch/x86/kernel/static_call.c
@@ -50,6 +50,23 @@ asm (".global __static_call_return\n\t"
      "ret; int3\n\t"
      ".size __static_call_return, . - __static_call_return \n\t");
 
+static void *translate_call_dest(void *dest, bool call)
+{
+	if (cpu_feature_enabled(X86_FEATURE_CALL_DEPTH)) {
+		if (!call)
+			return dest;
+
+		return callthunks_translate_call_dest(dest);
+	}
+
+	if (IS_ENABLED(CONFIG_X86_KERNEL_IBT_PLUS)) {
+		if (is_endbr(dest))
+			dest += 4;
+	}
+
+	return dest;
+}
+
 static void __ref __static_call_transform(void *insn, enum insn_type type,
 					  void *func, bool modinit)
 {
@@ -63,7 +80,7 @@ static void __ref __static_call_transfor
 
 	switch (type) {
 	case CALL:
-		func = callthunks_translate_call_dest(func);
+		func = translate_call_dest(func, true);
 		code = text_gen_insn(CALL_INSN_OPCODE, insn, func);
 		if (func == &__static_call_return0) {
 			emulate = code;
@@ -77,6 +94,7 @@ static void __ref __static_call_transfor
 		break;
 
 	case JMP:
+		func = translate_call_dest(func, false);
 		code = text_gen_insn(JMP32_INSN_OPCODE, insn, func);
 		break;
 
@@ -92,7 +110,8 @@ static void __ref __static_call_transfor
 			func = __static_call_return;
 			if (cpu_feature_enabled(X86_FEATURE_RETHUNK))
 				func = x86_return_thunk;
-		}
+
+		} else func = translate_call_dest(func, false);
 
 		buf[0] = 0x0f;
 		__text_gen_insn(buf+1, op, insn+1, func, 5);
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -555,6 +555,8 @@ static int emit_patch(u8 **pprog, void *
 
 static int emit_call(u8 **pprog, void *func, void *ip)
 {
+	if (is_endbr(func))
+		func += ENDBR_INSN_SIZE;
 	return emit_patch(pprog, func, ip, 0xE8);
 }
 
@@ -562,11 +564,13 @@ static int emit_rsb_call(u8 **pprog, voi
 {
 	OPTIMIZER_HIDE_VAR(func);
 	ip += x86_call_depth_emit_accounting(pprog, func, ip);
-	return emit_patch(pprog, func, ip, 0xE8);
+	return emit_call(pprog, func, ip);
 }
 
 static int emit_jump(u8 **pprog, void *func, void *ip)
 {
+	if (is_endbr(func))
+		func += ENDBR_INSN_SIZE;
 	return emit_patch(pprog, func, ip, 0xE9);
 }
 
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -269,6 +269,7 @@ objtool-args-$(CONFIG_HAVE_JUMP_LABEL_HA
 objtool-args-$(CONFIG_HAVE_NOINSTR_HACK)		+= --hacks=noinstr
 objtool-args-$(CONFIG_MITIGATION_CALL_DEPTH_TRACKING)	+= --direct-call
 objtool-args-$(CONFIG_X86_KERNEL_IBT)			+= --ibt
+objtool-args-$(CONFIG_X86_KERNEL_IBT_PLUS)		+= --direct-call
 objtool-args-$(CONFIG_FINEIBT)				+= --cfi
 objtool-args-$(CONFIG_FTRACE_MCOUNT_USE_OBJTOOL)	+= --mcount
 ifdef CONFIG_FTRACE_MCOUNT_USE_OBJTOOL
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -1455,7 +1455,7 @@ static void annotate_call_site(struct ob
 		return;
 	}
 
-	if (!insn->sec->init && !insn->_call_dest->embedded_insn) {
+	if (!insn->_call_dest->embedded_insn) {
 		if (insn->type == INSN_CALL)
 			list_add_tail(&insn->call_node, &file->call_list);
 		else



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 09/14] x86/ibt: Implement IBT+
  2024-09-27 19:49 ` [PATCH 09/14] x86/ibt: Implement IBT+ Peter Zijlstra
@ 2024-09-28  1:07   ` Josh Poimboeuf
  2024-09-28 13:12     ` Peter Zijlstra
  2024-09-29 17:38   ` Alexei Starovoitov
  2024-11-05 10:40   ` Peter Zijlstra
  2 siblings, 1 reply; 60+ messages in thread
From: Josh Poimboeuf @ 2024-09-28  1:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

On Fri, Sep 27, 2024 at 09:49:05PM +0200, Peter Zijlstra wrote:
> +static void *translate_call_dest(void *dest, bool call)
> +{
> +	if (cpu_feature_enabled(X86_FEATURE_CALL_DEPTH)) {
> +		if (!call)
> +			return dest;

A tail call is considered a call by virtue of the function name.  Change
the "call" arg to "tail" to make it clearer.

> +++ b/scripts/Makefile.lib
> @@ -269,6 +269,7 @@ objtool-args-$(CONFIG_HAVE_JUMP_LABEL_HA
>  objtool-args-$(CONFIG_HAVE_NOINSTR_HACK)		+= --hacks=noinstr
>  objtool-args-$(CONFIG_MITIGATION_CALL_DEPTH_TRACKING)	+= --direct-call
>  objtool-args-$(CONFIG_X86_KERNEL_IBT)			+= --ibt
> +objtool-args-$(CONFIG_X86_KERNEL_IBT_PLUS)		+= --direct-call

Only add '--direct-call' once:

objtool-args-$(or $(CONFIG_MITIGATION_CALL_DEPTH_TRACKING),$(CONFIG_X86_KERNEL_IBT_PLUS)) += --direct-call

>  objtool-args-$(CONFIG_FINEIBT)				+= --cfi
>  objtool-args-$(CONFIG_FTRACE_MCOUNT_USE_OBJTOOL)	+= --mcount
>  ifdef CONFIG_FTRACE_MCOUNT_USE_OBJTOOL

> --- a/tools/objtool/check.c
> +++ b/tools/objtool/check.c
> @@ -1455,7 +1455,7 @@ static void annotate_call_site(struct ob
>  		return;
>  	}
>  
> -	if (!insn->sec->init && !insn->_call_dest->embedded_insn) {
> +	if (!insn->_call_dest->embedded_insn) {

Why did we have the 'init' check to start with?  Presumably init memory
gets freed after the call dest patching so this is not a problem?

-- 
Josh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 09/14] x86/ibt: Implement IBT+
  2024-09-28  1:07   ` Josh Poimboeuf
@ 2024-09-28 13:12     ` Peter Zijlstra
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Zijlstra @ 2024-09-28 13:12 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: x86, linux-kernel, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

On Fri, Sep 27, 2024 at 06:07:33PM -0700, Josh Poimboeuf wrote:
> On Fri, Sep 27, 2024 at 09:49:05PM +0200, Peter Zijlstra wrote:
> > +static void *translate_call_dest(void *dest, bool call)
> > +{
> > +	if (cpu_feature_enabled(X86_FEATURE_CALL_DEPTH)) {
> > +		if (!call)
> > +			return dest;
> 
> A tail call is considered a call by virtue of the function name.  Change
> the "call" arg to "tail" to make it clearer.

Sure.

> > +++ b/scripts/Makefile.lib
> > @@ -269,6 +269,7 @@ objtool-args-$(CONFIG_HAVE_JUMP_LABEL_HA
> >  objtool-args-$(CONFIG_HAVE_NOINSTR_HACK)		+= --hacks=noinstr
> >  objtool-args-$(CONFIG_MITIGATION_CALL_DEPTH_TRACKING)	+= --direct-call
> >  objtool-args-$(CONFIG_X86_KERNEL_IBT)			+= --ibt
> > +objtool-args-$(CONFIG_X86_KERNEL_IBT_PLUS)		+= --direct-call
> 
> Only add '--direct-call' once:
> 
> objtool-args-$(or $(CONFIG_MITIGATION_CALL_DEPTH_TRACKING),$(CONFIG_X86_KERNEL_IBT_PLUS)) += --direct-call

Heh. I don't do enough Makefile to ever remember how they work :/

> >  objtool-args-$(CONFIG_FINEIBT)				+= --cfi
> >  objtool-args-$(CONFIG_FTRACE_MCOUNT_USE_OBJTOOL)	+= --mcount
> >  ifdef CONFIG_FTRACE_MCOUNT_USE_OBJTOOL
> 
> > --- a/tools/objtool/check.c
> > +++ b/tools/objtool/check.c
> > @@ -1455,7 +1455,7 @@ static void annotate_call_site(struct ob
> >  		return;
> >  	}
> >  
> > -	if (!insn->sec->init && !insn->_call_dest->embedded_insn) {
> > +	if (!insn->_call_dest->embedded_insn) {
> 
> Why did we have the 'init' check to start with?  Presumably init memory
> gets freed after the call dest patching so this is not a problem?

Ah, all this came from the calldepth tracking stuff. And we didn't care
about init code, that runs before userspace and so RSB overflows aren't
much of a problem.

But with this here thing, we very much also want to patch the init code
to not land on an ENDBR that will be turned into a UD1 during init :-)

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 09/14] x86/ibt: Implement IBT+
  2024-09-27 19:49 ` [PATCH 09/14] x86/ibt: Implement IBT+ Peter Zijlstra
  2024-09-28  1:07   ` Josh Poimboeuf
@ 2024-09-29 17:38   ` Alexei Starovoitov
  2024-09-30  8:23     ` Peter Zijlstra
  2024-11-05 10:40   ` Peter Zijlstra
  2 siblings, 1 reply; 60+ messages in thread
From: Alexei Starovoitov @ 2024-09-29 17:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: X86 ML, LKML, alyssa.milburn, scott.d.constable, Joao Moreira,
	Andrew Cooper, Josh Poimboeuf, Jose E. Marchesi, H.J. Lu,
	Nick Desaulniers, Sami Tolvanen, Nathan Chancellor, ojeda,
	Kees Cook

On Fri, Sep 27, 2024 at 12:50 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -555,6 +555,8 @@ static int emit_patch(u8 **pprog, void *
>
>  static int emit_call(u8 **pprog, void *func, void *ip)
>  {
> +       if (is_endbr(func))
> +               func += ENDBR_INSN_SIZE;
>         return emit_patch(pprog, func, ip, 0xE8);
>  }
>
> @@ -562,11 +564,13 @@ static int emit_rsb_call(u8 **pprog, voi
>  {
>         OPTIMIZER_HIDE_VAR(func);
>         ip += x86_call_depth_emit_accounting(pprog, func, ip);
> -       return emit_patch(pprog, func, ip, 0xE8);
> +       return emit_call(pprog, func, ip);
>  }
>
>  static int emit_jump(u8 **pprog, void *func, void *ip)
>  {
> +       if (is_endbr(func))
> +               func += ENDBR_INSN_SIZE;
>         return emit_patch(pprog, func, ip, 0xE9);
>  }

Makes sense, but it feels like it's fixing the existing bug
that we somehow didn't notice earlier?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 09/14] x86/ibt: Implement IBT+
  2024-09-29 17:38   ` Alexei Starovoitov
@ 2024-09-30  8:23     ` Peter Zijlstra
  2024-09-30 17:00       ` Alexei Starovoitov
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2024-09-30  8:23 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: X86 ML, LKML, alyssa.milburn, scott.d.constable, Joao Moreira,
	Andrew Cooper, Josh Poimboeuf, Jose E. Marchesi, H.J. Lu,
	Nick Desaulniers, Sami Tolvanen, Nathan Chancellor, ojeda,
	Kees Cook

On Sun, Sep 29, 2024 at 10:38:58AM -0700, Alexei Starovoitov wrote:
> On Fri, Sep 27, 2024 at 12:50 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > --- a/arch/x86/net/bpf_jit_comp.c
> > +++ b/arch/x86/net/bpf_jit_comp.c
> > @@ -555,6 +555,8 @@ static int emit_patch(u8 **pprog, void *
> >
> >  static int emit_call(u8 **pprog, void *func, void *ip)
> >  {
> > +       if (is_endbr(func))
> > +               func += ENDBR_INSN_SIZE;
> >         return emit_patch(pprog, func, ip, 0xE8);
> >  }
> >
> > @@ -562,11 +564,13 @@ static int emit_rsb_call(u8 **pprog, voi
> >  {
> >         OPTIMIZER_HIDE_VAR(func);
> >         ip += x86_call_depth_emit_accounting(pprog, func, ip);
> > -       return emit_patch(pprog, func, ip, 0xE8);
> > +       return emit_call(pprog, func, ip);
> >  }
> >
> >  static int emit_jump(u8 **pprog, void *func, void *ip)
> >  {
> > +       if (is_endbr(func))
> > +               func += ENDBR_INSN_SIZE;
> >         return emit_patch(pprog, func, ip, 0xE9);
> >  }
> 
> Makes sense, but it feels like it's fixing the existing bug
> that we somehow didn't notice earlier?

Before all this func()+0 was a valid call address -- as it's been
forever.

While it is true that with the introduction of ENDBR some compilers will
do direct calls to func()+4 to avoid the ENDBR (less instructions, more
faster etc..) this was very much an optional thing.

Notably, up until this point we would use a 4 byte NOP to seal(*)
functions, specifically so that anybody doing direct calls to func()+0
would continue to work.

These patches however change all that by sealing with a 4 byte UD1
instruction, which makes direct calls to func()+0 fatal. As such, we
must guarantee all direct calls are to func()+4. So what used to be an
optimization is now a strict requirement.

Indirect calls still go to func()+0 (or func()-16 for FineIBT) and will
go *bang* if !ENDBR or UD1 (depending on the hardware having CET/IBT
support).

(*) with sealing we mean the explicit action of disallowing indirect
calls to a given function.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 09/14] x86/ibt: Implement IBT+
  2024-09-30  8:23     ` Peter Zijlstra
@ 2024-09-30 17:00       ` Alexei Starovoitov
  0 siblings, 0 replies; 60+ messages in thread
From: Alexei Starovoitov @ 2024-09-30 17:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: X86 ML, LKML, alyssa.milburn, scott.d.constable, Joao Moreira,
	Andrew Cooper, Josh Poimboeuf, Jose E. Marchesi, H.J. Lu,
	Nick Desaulniers, Sami Tolvanen, Nathan Chancellor, ojeda,
	Kees Cook

On Mon, Sep 30, 2024 at 1:23 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Sep 29, 2024 at 10:38:58AM -0700, Alexei Starovoitov wrote:
> > On Fri, Sep 27, 2024 at 12:50 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > --- a/arch/x86/net/bpf_jit_comp.c
> > > +++ b/arch/x86/net/bpf_jit_comp.c
> > > @@ -555,6 +555,8 @@ static int emit_patch(u8 **pprog, void *
> > >
> > >  static int emit_call(u8 **pprog, void *func, void *ip)
> > >  {
> > > +       if (is_endbr(func))
> > > +               func += ENDBR_INSN_SIZE;
> > >         return emit_patch(pprog, func, ip, 0xE8);
> > >  }
> > >
> > > @@ -562,11 +564,13 @@ static int emit_rsb_call(u8 **pprog, voi
> > >  {
> > >         OPTIMIZER_HIDE_VAR(func);
> > >         ip += x86_call_depth_emit_accounting(pprog, func, ip);
> > > -       return emit_patch(pprog, func, ip, 0xE8);
> > > +       return emit_call(pprog, func, ip);
> > >  }
> > >
> > >  static int emit_jump(u8 **pprog, void *func, void *ip)
> > >  {
> > > +       if (is_endbr(func))
> > > +               func += ENDBR_INSN_SIZE;
> > >         return emit_patch(pprog, func, ip, 0xE9);
> > >  }
> >
> > Makes sense, but it feels like it's fixing the existing bug
> > that we somehow didn't notice earlier?
>
> Before all this func()+0 was a valid call address -- as it's been
> forever.
>
> While it is true that with the introduction of ENDBR some compilers will
> do direct calls to func()+4 to avoid the ENDBR (less instructions, more
> faster etc..) this was very much an optional thing.
>
> Notably, up until this point we would use a 4 byte NOP to seal(*)
> functions, specifically so that anybody doing direct calls to func()+0
> would continue to work.
>
> These patches however change all that by sealing with a 4 byte UD1
> instruction, which makes direct calls to func()+0 fatal. As such, we
> must guarantee all direct calls are to func()+4. So what used to be an
> optimization is now a strict requirement.
>
> Indirect calls still go to func()+0 (or func()-16 for FineIBT) and will
> go *bang* if !ENDBR or UD1 (depending on the hardware having CET/IBT
> support).
>
> (*) with sealing we mean the explicit action of disallowing indirect
> calls to a given function.

I see. Thanks for explaining. I would copy paste the above details
into the commit log.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 09/14] x86/ibt: Implement IBT+
  2024-09-27 19:49 ` [PATCH 09/14] x86/ibt: Implement IBT+ Peter Zijlstra
  2024-09-28  1:07   ` Josh Poimboeuf
  2024-09-29 17:38   ` Alexei Starovoitov
@ 2024-11-05 10:40   ` Peter Zijlstra
  2 siblings, 0 replies; 60+ messages in thread
From: Peter Zijlstra @ 2024-11-05 10:40 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jpoimboe, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

On Fri, Sep 27, 2024 at 09:49:05PM +0200, Peter Zijlstra wrote:

> +#ifdef CONFIG_X86_KERNEL_IBT_PLUS
> +__init_or_module void apply_direct_call_offset(s32 *start, s32 *end)
> +{
> +	s32 *s;
> +
> +	/*
> +	 * incompatible with call depth tracking
> +	 */
> +	if (cpu_feature_enabled(X86_FEATURE_CALL_DEPTH))
> +		return;
> +
> +	for (s = start; s < end; s++) {
> +		void *dest, *addr = (void *)s + *s;
> +		struct insn insn;
> +		int ret;
> +
> +		ret = insn_decode_kernel(&insn, addr);
> +		if (WARN_ON_ONCE(ret < 0))
> +			continue;
> +
> +		dest = addr + insn.length + insn.immediate.value;
> +		if (!is_endbr(dest))
> +			continue;
> +
> +		switch (insn.opcode.bytes[0]) {
> +		case CALL_INSN_OPCODE:
> +		case JMP32_INSN_OPCODE:
> +			apply_reloc(4, addr+1, 4);
> +			continue;
> +
> +		case JMP8_INSN_OPCODE:
> +		case 0x70 ... 0x7f: /* Jcc.d8 */
> +			apply_reloc(1, addr+1, 4);
> +			continue;

*sigh*... I have a clang-19 build (thanks 0day) that uses a jmp.d8 +0x7e
as a tail-call, guess how well it goes adding 4 to that :-(

Luckily the next instruction is a giant (alignment) NOP, so I *could* go
fix that up, but perhaps this is pushing things too far ...

> +
> +		case 0x0f:
> +			switch (insn.opcode.bytes[1]) {
> +			case 0x80 ... 0x8f:
> +				apply_reloc(4, addr+2, 4);
> +				continue;
> +
> +			default:
> +				break;
> +			}
> +			break;
> +
> +		default:
> +			break;
> +		}
> +
> +		printk("at: %pS, instruction: %*ph\n", addr, insn.length, addr);
> +		BUG();
> +	}
> +}
> +#else
> +__init_or_module void apply_direct_call_offset(s32 *start, s32 *end) { }
> +#endif

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 10/14] x86/early_printk: Harden early_serial
  2024-09-27 19:48 [PATCH 00/14] x86/ibt: FineIBT-BHI Peter Zijlstra
                   ` (8 preceding siblings ...)
  2024-09-27 19:49 ` [PATCH 09/14] x86/ibt: Implement IBT+ Peter Zijlstra
@ 2024-09-27 19:49 ` Peter Zijlstra
  2024-09-27 19:49 ` [PATCH 11/14] llvm: kCFI pointer stuff Peter Zijlstra
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 60+ messages in thread
From: Peter Zijlstra @ 2024-09-27 19:49 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, peterz, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jpoimboe, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

Scott found that mem32_serial_in() is an ideal speculation gadget, an
indirectly callable function that takes an adddress and offset and
immediately does a load.

Use static_call() to take away the need for indirect calls and
explicitly seal the functions to ensure they're not callable on IBT
enabled parts.

Reported-by: Scott Constable <scott.d.constable@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/early_printk.c |   51 ++++++++++++++++++++---------------------
 1 file changed, 25 insertions(+), 26 deletions(-)

--- a/arch/x86/kernel/early_printk.c
+++ b/arch/x86/kernel/early_printk.c
@@ -19,6 +19,7 @@
 #include <linux/usb/ehci_def.h>
 #include <linux/usb/xhci-dbgp.h>
 #include <asm/pci_x86.h>
+#include <linux/static_call.h>
 
 /* Simple VGA output */
 #define VGABASE		(__ISA_IO_base + 0xb8000)
@@ -94,26 +95,28 @@ static unsigned long early_serial_base =
 #define DLL             0       /*  Divisor Latch Low         */
 #define DLH             1       /*  Divisor latch High        */
 
-static unsigned int io_serial_in(unsigned long addr, int offset)
+static __noendbr unsigned int io_serial_in(unsigned long addr, int offset)
 {
 	return inb(addr + offset);
 }
+NOENDBR_SYMBOL(io_serial_in);
 
-static void io_serial_out(unsigned long addr, int offset, int value)
+static __noendbr void io_serial_out(unsigned long addr, int offset, int value)
 {
 	outb(value, addr + offset);
 }
+NOENDBR_SYMBOL(io_serial_out);
 
-static unsigned int (*serial_in)(unsigned long addr, int offset) = io_serial_in;
-static void (*serial_out)(unsigned long addr, int offset, int value) = io_serial_out;
+DEFINE_STATIC_CALL(serial_in, io_serial_in);
+DEFINE_STATIC_CALL(serial_out, io_serial_out);
 
 static int early_serial_putc(unsigned char ch)
 {
 	unsigned timeout = 0xffff;
 
-	while ((serial_in(early_serial_base, LSR) & XMTRDY) == 0 && --timeout)
+	while ((static_call(serial_in)(early_serial_base, LSR) & XMTRDY) == 0 && --timeout)
 		cpu_relax();
-	serial_out(early_serial_base, TXR, ch);
+	static_call(serial_out)(early_serial_base, TXR, ch);
 	return timeout ? 0 : -1;
 }
 
@@ -131,16 +134,16 @@ static __init void early_serial_hw_init(
 {
 	unsigned char c;
 
-	serial_out(early_serial_base, LCR, 0x3);	/* 8n1 */
-	serial_out(early_serial_base, IER, 0);	/* no interrupt */
-	serial_out(early_serial_base, FCR, 0);	/* no fifo */
-	serial_out(early_serial_base, MCR, 0x3);	/* DTR + RTS */
-
-	c = serial_in(early_serial_base, LCR);
-	serial_out(early_serial_base, LCR, c | DLAB);
-	serial_out(early_serial_base, DLL, divisor & 0xff);
-	serial_out(early_serial_base, DLH, (divisor >> 8) & 0xff);
-	serial_out(early_serial_base, LCR, c & ~DLAB);
+	static_call(serial_out)(early_serial_base, LCR, 0x3);	/* 8n1 */
+	static_call(serial_out)(early_serial_base, IER, 0);	/* no interrupt */
+	static_call(serial_out)(early_serial_base, FCR, 0);	/* no fifo */
+	static_call(serial_out)(early_serial_base, MCR, 0x3);	/* DTR + RTS */
+
+	c = static_call(serial_in)(early_serial_base, LCR);
+	static_call(serial_out)(early_serial_base, LCR, c | DLAB);
+	static_call(serial_out)(early_serial_base, DLL, divisor & 0xff);
+	static_call(serial_out)(early_serial_base, DLH, (divisor >> 8) & 0xff);
+	static_call(serial_out)(early_serial_base, LCR, c & ~DLAB);
 }
 
 #define DEFAULT_BAUD 9600
@@ -183,28 +186,26 @@ static __init void early_serial_init(cha
 	/* Convert from baud to divisor value */
 	divisor = 115200 / baud;
 
-	/* These will always be IO based ports */
-	serial_in = io_serial_in;
-	serial_out = io_serial_out;
-
 	/* Set up the HW */
 	early_serial_hw_init(divisor);
 }
 
 #ifdef CONFIG_PCI
-static void mem32_serial_out(unsigned long addr, int offset, int value)
+static __noendbr void mem32_serial_out(unsigned long addr, int offset, int value)
 {
 	u32 __iomem *vaddr = (u32 __iomem *)addr;
 	/* shift implied by pointer type */
 	writel(value, vaddr + offset);
 }
+NOENDBR_SYMBOL(mem32_serial_out);
 
-static unsigned int mem32_serial_in(unsigned long addr, int offset)
+static __noendbr unsigned int mem32_serial_in(unsigned long addr, int offset)
 {
 	u32 __iomem *vaddr = (u32 __iomem *)addr;
 	/* shift implied by pointer type */
 	return readl(vaddr + offset);
 }
+NOENDBR_SYMBOL(mem32_serial_in);
 
 /*
  * early_pci_serial_init()
@@ -278,15 +279,13 @@ static __init void early_pci_serial_init
 	 */
 	if ((bar0 & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO) {
 		/* it is IO mapped */
-		serial_in = io_serial_in;
-		serial_out = io_serial_out;
 		early_serial_base = bar0 & PCI_BASE_ADDRESS_IO_MASK;
 		write_pci_config(bus, slot, func, PCI_COMMAND,
 				 cmdreg|PCI_COMMAND_IO);
 	} else {
 		/* It is memory mapped - assume 32-bit alignment */
-		serial_in = mem32_serial_in;
-		serial_out = mem32_serial_out;
+		static_call_update(serial_in, mem32_serial_in);
+		static_call_update(serial_out, mem32_serial_out);
 		/* WARNING! assuming the address is always in the first 4G */
 		early_serial_base =
 			(unsigned long)early_ioremap(bar0 & PCI_BASE_ADDRESS_MEM_MASK, 0x10);



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 11/14] llvm: kCFI pointer stuff
  2024-09-27 19:48 [PATCH 00/14] x86/ibt: FineIBT-BHI Peter Zijlstra
                   ` (9 preceding siblings ...)
  2024-09-27 19:49 ` [PATCH 10/14] x86/early_printk: Harden early_serial Peter Zijlstra
@ 2024-09-27 19:49 ` Peter Zijlstra
  2024-09-29 17:53   ` Alexei Starovoitov
  2024-10-30  6:29   ` Constable, Scott D
  2024-09-27 19:49 ` [PATCH 12/14] x86: Hacks for hacked up llvm Peter Zijlstra
                   ` (2 subsequent siblings)
  13 siblings, 2 replies; 60+ messages in thread
From: Peter Zijlstra @ 2024-09-27 19:49 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, peterz, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jpoimboe, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

Quick hack to extend the Clang-kCFI function meta-data (u32 hash) with
a u8 bitmask of pointer arguments. This should really be under a new
compiler flag, dependent on both x86_64 and kCFI.

Per the comment, the bitmask represents the register based arguments
as the first 6 bits and bit 6 is used to cover all stack based
arguments. The high bit is used for invalid values.

The purpose is to put a store dependency on the set registers, thereby
blocking speculation paths that would otherwise expoit their value.


Note1:

This implementation simply sets the bit for any pointer type. A better
implementation would only set the bit for any argument that is
dereferenced in the function body.

This better implementation would also capture things like:

  void foo(unsigned long addr, void *args)
  {
    u32 t = *(u32 *)addr;
    bar(t, args);
  }

Which, in contrast to the implementation below, would set bit0 while
leaving bit1 unset -- the exact opposite of this implementation.

Notably, addr *is* dereferenced, even though it is not a pointer on
entry, while args is a pointer, but is not derefereced but passed on
to bar -- if bar uses it, it gets to deal with it.

Note2:

Do we want to make this a u32 to keep room for all registers? AFAICT
the current use is only concerned with the argument registers and
those are limited to 6 for the C ABI, but custom (assembly) functions
could use things outside of that.

---
diff --git a/llvm/lib/Target/X86/X86AsmPrinter.cpp b/llvm/lib/Target/X86/X86AsmPrinter.cpp
index 73c745062096..42dcbc40ab4b 100644
--- a/llvm/lib/Target/X86/X86AsmPrinter.cpp
+++ b/llvm/lib/Target/X86/X86AsmPrinter.cpp
@@ -143,11 +143,28 @@ void X86AsmPrinter::EmitKCFITypePadding(const MachineFunction &MF,
   // one. Otherwise, just pad with nops. The X86::MOV32ri instruction emitted
   // in X86AsmPrinter::emitKCFITypeId is 5 bytes long.
   if (HasType)
-    PrefixBytes += 5;
+    PrefixBytes += 7;
 
   emitNops(offsetToAlignment(PrefixBytes, MF.getAlignment()));
 }
 
+static uint8_t getKCFIPointerArgs(const Function &F)
+{
+  uint8_t val = 0;
+
+  if (F.isVarArg())
+    return 0x7f;
+
+  for (int i = 0; i < F.arg_size() ; i++) {
+    Argument *A = F.getArg(i);
+    Type *T = A->getType();
+    if (T->getTypeID() == Type::PointerTyID)
+      val |= 1 << std::min(i, 6);
+  }
+
+  return val;
+}
+
 /// emitKCFITypeId - Emit the KCFI type information in architecture specific
 /// format.
 void X86AsmPrinter::emitKCFITypeId(const MachineFunction &MF) {
@@ -183,6 +200,26 @@ void X86AsmPrinter::emitKCFITypeId(const MachineFunction &MF) {
                               .addReg(X86::EAX)
                               .addImm(MaskKCFIType(Type->getZExtValue())));
 
+  // Extend the kCFI meta-data with a byte that has a bit set for each argument
+  // register that contains a pointer. Specifically for x86_64, which has 6
+  // argument registers:
+  //
+  //   bit0 - rdi
+  //   bit1 - rsi
+  //   bit2 - rdx
+  //   bit3 - rcx
+  //   bit4 - r8
+  //   bit5 - r9
+  //
+  // bit6 will denote any pointer on stack (%rsp), and all 7 bits set will
+  // indicate vararg or any other 'unknown' configuration. Leaving bit7 for
+  // error states.
+  //
+  // XXX: should be conditional on some new x86_64 specific 'bhi' argument.
+  EmitAndCountInstruction(MCInstBuilder(X86::MOV8ri)
+		  .addReg(X86::AL)
+		  .addImm(getKCFIPointerArgs(F)));
+
   if (MAI->hasDotTypeDotSizeDirective()) {
     MCSymbol *EndSym = OutContext.createTempSymbol("cfi_func_end");
     OutStreamer->emitLabel(EndSym);
diff --git a/llvm/lib/Target/X86/X86MCInstLower.cpp b/llvm/lib/Target/X86/X86MCInstLower.cpp
index cbb012161524..c0776ef78153 100644
--- a/llvm/lib/Target/X86/X86MCInstLower.cpp
+++ b/llvm/lib/Target/X86/X86MCInstLower.cpp
@@ -897,7 +897,7 @@ void X86AsmPrinter::LowerKCFI_CHECK(const MachineInstr &MI) {
                               .addReg(AddrReg)
                               .addImm(1)
                               .addReg(X86::NoRegister)
-                              .addImm(-(PrefixNops + 4))
+                              .addImm(-(PrefixNops + 6))
                               .addReg(X86::NoRegister));
 
   MCSymbol *Pass = OutContext.createTempSymbol();



^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 11/14] llvm: kCFI pointer stuff
  2024-09-27 19:49 ` [PATCH 11/14] llvm: kCFI pointer stuff Peter Zijlstra
@ 2024-09-29 17:53   ` Alexei Starovoitov
  2024-09-30  8:27     ` Peter Zijlstra
  2024-10-30  6:29   ` Constable, Scott D
  1 sibling, 1 reply; 60+ messages in thread
From: Alexei Starovoitov @ 2024-09-29 17:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: X86 ML, LKML, alyssa.milburn, scott.d.constable, Joao Moreira,
	Andrew Cooper, Josh Poimboeuf, Jose E. Marchesi, H.J. Lu,
	Nick Desaulniers, Sami Tolvanen, Nathan Chancellor, ojeda,
	Kees Cook

On Fri, Sep 27, 2024 at 12:50 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Quick hack to extend the Clang-kCFI function meta-data (u32 hash) with
> a u8 bitmask of pointer arguments. This should really be under a new
> compiler flag, dependent on both x86_64 and kCFI.
>
> Per the comment, the bitmask represents the register based arguments
> as the first 6 bits and bit 6 is used to cover all stack based
> arguments. The high bit is used for invalid values.
>
> The purpose is to put a store dependency on the set registers, thereby
> blocking speculation paths that would otherwise expoit their value.
>
>
> Note1:
>
> This implementation simply sets the bit for any pointer type. A better
> implementation would only set the bit for any argument that is
> dereferenced in the function body.
>
> This better implementation would also capture things like:
>
>   void foo(unsigned long addr, void *args)
>   {
>     u32 t = *(u32 *)addr;
>     bar(t, args);
>   }
>
> Which, in contrast to the implementation below, would set bit0 while
> leaving bit1 unset -- the exact opposite of this implementation.
>
> Notably, addr *is* dereferenced, even though it is not a pointer on
> entry, while args is a pointer, but is not derefereced but passed on
> to bar -- if bar uses it, it gets to deal with it.
>
> Note2:
>
> Do we want to make this a u32 to keep room for all registers? AFAICT
> the current use is only concerned with the argument registers and
> those are limited to 6 for the C ABI, but custom (assembly) functions
> could use things outside of that.
>
> ---
> diff --git a/llvm/lib/Target/X86/X86AsmPrinter.cpp b/llvm/lib/Target/X86/X86AsmPrinter.cpp
> index 73c745062096..42dcbc40ab4b 100644
> --- a/llvm/lib/Target/X86/X86AsmPrinter.cpp
> +++ b/llvm/lib/Target/X86/X86AsmPrinter.cpp
> @@ -143,11 +143,28 @@ void X86AsmPrinter::EmitKCFITypePadding(const MachineFunction &MF,
>    // one. Otherwise, just pad with nops. The X86::MOV32ri instruction emitted
>    // in X86AsmPrinter::emitKCFITypeId is 5 bytes long.
>    if (HasType)
> -    PrefixBytes += 5;
> +    PrefixBytes += 7;
>
>    emitNops(offsetToAlignment(PrefixBytes, MF.getAlignment()));
>  }
>
> +static uint8_t getKCFIPointerArgs(const Function &F)
> +{
> +  uint8_t val = 0;
> +
> +  if (F.isVarArg())
> +    return 0x7f;
> +
> +  for (int i = 0; i < F.arg_size() ; i++) {
> +    Argument *A = F.getArg(i);
> +    Type *T = A->getType();
> +    if (T->getTypeID() == Type::PointerTyID)
> +      val |= 1 << std::min(i, 6);
> +  }
> +
> +  return val;
> +}
> +
>  /// emitKCFITypeId - Emit the KCFI type information in architecture specific
>  /// format.
>  void X86AsmPrinter::emitKCFITypeId(const MachineFunction &MF) {
> @@ -183,6 +200,26 @@ void X86AsmPrinter::emitKCFITypeId(const MachineFunction &MF) {
>                                .addReg(X86::EAX)
>                                .addImm(MaskKCFIType(Type->getZExtValue())));
>
> +  // Extend the kCFI meta-data with a byte that has a bit set for each argument
> +  // register that contains a pointer. Specifically for x86_64, which has 6
> +  // argument registers:
> +  //
> +  //   bit0 - rdi
> +  //   bit1 - rsi
> +  //   bit2 - rdx
> +  //   bit3 - rcx
> +  //   bit4 - r8
> +  //   bit5 - r9
> +  //
> +  // bit6 will denote any pointer on stack (%rsp), and all 7 bits set will
> +  // indicate vararg or any other 'unknown' configuration. Leaving bit7 for
> +  // error states.
> +  //
> +  // XXX: should be conditional on some new x86_64 specific 'bhi' argument.
> +  EmitAndCountInstruction(MCInstBuilder(X86::MOV8ri)
> +                 .addReg(X86::AL)
> +                 .addImm(getKCFIPointerArgs(F)));

If I'm reading this correctly it will be an 8-bit move which
doesn't clear upper bits.
If consumer is in assembly it's ok-ish,
but it's an argument to __bhi_args_foo functions,
so should be properly zero extended per call convention.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 11/14] llvm: kCFI pointer stuff
  2024-09-29 17:53   ` Alexei Starovoitov
@ 2024-09-30  8:27     ` Peter Zijlstra
  2024-09-30 16:59       ` Alexei Starovoitov
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2024-09-30  8:27 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: X86 ML, LKML, alyssa.milburn, scott.d.constable, Joao Moreira,
	Andrew Cooper, Josh Poimboeuf, Jose E. Marchesi, H.J. Lu,
	Nick Desaulniers, Sami Tolvanen, Nathan Chancellor, ojeda,
	Kees Cook

On Sun, Sep 29, 2024 at 10:53:05AM -0700, Alexei Starovoitov wrote:

> > +  // Extend the kCFI meta-data with a byte that has a bit set for each argument
> > +  // register that contains a pointer. Specifically for x86_64, which has 6
> > +  // argument registers:
> > +  //
> > +  //   bit0 - rdi
> > +  //   bit1 - rsi
> > +  //   bit2 - rdx
> > +  //   bit3 - rcx
> > +  //   bit4 - r8
> > +  //   bit5 - r9
> > +  //
> > +  // bit6 will denote any pointer on stack (%rsp), and all 7 bits set will
> > +  // indicate vararg or any other 'unknown' configuration. Leaving bit7 for
> > +  // error states.
> > +  //
> > +  // XXX: should be conditional on some new x86_64 specific 'bhi' argument.
> > +  EmitAndCountInstruction(MCInstBuilder(X86::MOV8ri)
> > +                 .addReg(X86::AL)
> > +                 .addImm(getKCFIPointerArgs(F)));
> 
> If I'm reading this correctly it will be an 8-bit move which
> doesn't clear upper bits.
> If consumer is in assembly it's ok-ish,
> but it's an argument to __bhi_args_foo functions,
> so should be properly zero extended per call convention.

These kCFI 'instructions' are never executed. Their sole purpose is to
encode the immediates. They are instructions because they live in .text
and having them this way makes disassemly work nicely. As such, we've
taken to using the 1 byte move instruction to carry them with the least
amounts of bytes.

The consumer is the kernel instruction decoder, we take the immediate
and use that.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 11/14] llvm: kCFI pointer stuff
  2024-09-30  8:27     ` Peter Zijlstra
@ 2024-09-30 16:59       ` Alexei Starovoitov
  2024-10-01 10:21         ` Peter Zijlstra
  0 siblings, 1 reply; 60+ messages in thread
From: Alexei Starovoitov @ 2024-09-30 16:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: X86 ML, LKML, alyssa.milburn, scott.d.constable, Joao Moreira,
	Andrew Cooper, Josh Poimboeuf, Jose E. Marchesi, H.J. Lu,
	Nick Desaulniers, Sami Tolvanen, Nathan Chancellor, ojeda,
	Kees Cook

On Mon, Sep 30, 2024 at 1:27 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Sep 29, 2024 at 10:53:05AM -0700, Alexei Starovoitov wrote:
>
> > > +  // Extend the kCFI meta-data with a byte that has a bit set for each argument
> > > +  // register that contains a pointer. Specifically for x86_64, which has 6
> > > +  // argument registers:
> > > +  //
> > > +  //   bit0 - rdi
> > > +  //   bit1 - rsi
> > > +  //   bit2 - rdx
> > > +  //   bit3 - rcx
> > > +  //   bit4 - r8
> > > +  //   bit5 - r9
> > > +  //
> > > +  // bit6 will denote any pointer on stack (%rsp), and all 7 bits set will
> > > +  // indicate vararg or any other 'unknown' configuration. Leaving bit7 for
> > > +  // error states.
> > > +  //
> > > +  // XXX: should be conditional on some new x86_64 specific 'bhi' argument.
> > > +  EmitAndCountInstruction(MCInstBuilder(X86::MOV8ri)
> > > +                 .addReg(X86::AL)
> > > +                 .addImm(getKCFIPointerArgs(F)));
> >
> > If I'm reading this correctly it will be an 8-bit move which
> > doesn't clear upper bits.
> > If consumer is in assembly it's ok-ish,
> > but it's an argument to __bhi_args_foo functions,
> > so should be properly zero extended per call convention.
>
> These kCFI 'instructions' are never executed. Their sole purpose is to
> encode the immediates. They are instructions because they live in .text
> and having them this way makes disassemly work nicely. As such, we've
> taken to using the 1 byte move instruction to carry them with the least
> amounts of bytes.
>
> The consumer is the kernel instruction decoder, we take the immediate
> and use that.

I see... and after decoding imm bits in mov %al insn the kernel will
insert a call to corresponding __bhi_args_* stub that will use
cmovne on corresponding register(s) to sanitize the value?
That was difficult to grasp.
A design doc would have helped.

I wonder whether this whole complexity is worth it vs
always calling __bhi_args_all()

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 11/14] llvm: kCFI pointer stuff
  2024-09-30 16:59       ` Alexei Starovoitov
@ 2024-10-01 10:21         ` Peter Zijlstra
  2024-10-02 16:48           ` Alexei Starovoitov
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2024-10-01 10:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: X86 ML, LKML, alyssa.milburn, scott.d.constable, Joao Moreira,
	Andrew Cooper, Josh Poimboeuf, Jose E. Marchesi, H.J. Lu,
	Nick Desaulniers, Sami Tolvanen, Nathan Chancellor, ojeda,
	Kees Cook

On Mon, Sep 30, 2024 at 09:59:11AM -0700, Alexei Starovoitov wrote:
> On Mon, Sep 30, 2024 at 1:27 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Sun, Sep 29, 2024 at 10:53:05AM -0700, Alexei Starovoitov wrote:
> >
> > > > +  // Extend the kCFI meta-data with a byte that has a bit set for each argument
> > > > +  // register that contains a pointer. Specifically for x86_64, which has 6
> > > > +  // argument registers:
> > > > +  //
> > > > +  //   bit0 - rdi
> > > > +  //   bit1 - rsi
> > > > +  //   bit2 - rdx
> > > > +  //   bit3 - rcx
> > > > +  //   bit4 - r8
> > > > +  //   bit5 - r9
> > > > +  //
> > > > +  // bit6 will denote any pointer on stack (%rsp), and all 7 bits set will
> > > > +  // indicate vararg or any other 'unknown' configuration. Leaving bit7 for
> > > > +  // error states.
> > > > +  //
> > > > +  // XXX: should be conditional on some new x86_64 specific 'bhi' argument.
> > > > +  EmitAndCountInstruction(MCInstBuilder(X86::MOV8ri)
> > > > +                 .addReg(X86::AL)
> > > > +                 .addImm(getKCFIPointerArgs(F)));
> > >
> > > If I'm reading this correctly it will be an 8-bit move which
> > > doesn't clear upper bits.
> > > If consumer is in assembly it's ok-ish,
> > > but it's an argument to __bhi_args_foo functions,
> > > so should be properly zero extended per call convention.
> >
> > These kCFI 'instructions' are never executed. Their sole purpose is to
> > encode the immediates. They are instructions because they live in .text
> > and having them this way makes disassemly work nicely. As such, we've
> > taken to using the 1 byte move instruction to carry them with the least
> > amounts of bytes.
> >
> > The consumer is the kernel instruction decoder, we take the immediate
> > and use that.
> 
> I see... and after decoding imm bits in mov %al insn the kernel will
> insert a call to corresponding __bhi_args_* stub that will use
> cmovne on corresponding register(s) to sanitize the value?
> That was difficult to grasp.
> A design doc would have helped.

Does something like this help?

diff --git a/arch/x86/include/asm/cfi.h b/arch/x86/include/asm/cfi.h
index 31d19c815f99..b6e7e79e79c6 100644
--- a/arch/x86/include/asm/cfi.h
+++ b/arch/x86/include/asm/cfi.h
@@ -44,11 +44,28 @@
  *   call *%r11
  *
  *
+ * IBT+:
+ *
+ * foo:
+ *   endbr64 / ud1 0(%eax), %edx
+ *   ... code here ...
+ *   ret
+ *
+ * direct caller:
+ *   call foo+4
+ *
+ * indirect caller:
+ *   lea foo(%rip), %r11
+ *   ...
+ *   call *%r11
+ *
+ *
  * kCFI:
  *
  * __cfi_foo:
  *   movl $0x12345678, %eax	# kCFI signature hash
- *				# 11 nops when CONFIG_CALL_PADDING
+ *   movb $0x12, %al		# kCFI pointer argument mask
+ *				# 9 nops when CONFIG_CALL_PADDING
  * foo:
  *   endbr64			# when IBT
  *   ... code here ...
@@ -91,6 +108,57 @@
  *   nop4
  *   call *%r11
  *
+ *
+ * FineIBT+:
+ *
+ * __cfi_foo:
+ *   endbr64
+ *   subl 0x12345678, %r10d
+ *   jz   foo
+ *   ud2
+ *   nop
+ * foo:
+ *   ud1 0(%eax), %edx		# was endbr64
+ * foo_4:
+ *   ... code here ...
+ *   ret
+ *
+ * direct caller:
+ *   call foo+4
+ *
+ * indirect caller:
+ *   lea foo(%rip), %r11
+ *   ...
+ *   movl $0x12345678, %r10d
+ *   subl $16, %r11
+ *   nop4
+ *   call *%r11
+ *
+ *
+ * FineIBT-BHI:
+ *
+ * __cfi_foo:
+ *   endbr64
+ *   subl 0x12345678, %r10d
+ *   jz   foo-1
+ *   ud2
+ * foo-1:
+ *   call __bhi_args_XXX	# depends on kCFI pointer argument mask
+ * foo+4:
+ *   ... code here ...
+ *   ret
+ *
+ * direct caller:
+ *   call foo+4
+ *
+ * indirect caller:
+ *   lea foo(%rip), %r11
+ *   ...
+ *   movl $0x12345678, %r10d
+ *   subl $16, %r11
+ *   nop4
+ *   call *%r11
+ *
  */
 enum cfi_mode {
 	CFI_AUTO,	/* FineIBT if hardware has IBT, otherwise kCFI */




> I wonder whether this whole complexity is worth it vs
> always calling __bhi_args_all()

That's one for Scott to answer; I think always doing _all will hurt
especially bad because it includes rsp.



^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 11/14] llvm: kCFI pointer stuff
  2024-10-01 10:21         ` Peter Zijlstra
@ 2024-10-02 16:48           ` Alexei Starovoitov
  0 siblings, 0 replies; 60+ messages in thread
From: Alexei Starovoitov @ 2024-10-02 16:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: X86 ML, LKML, alyssa.milburn, scott.d.constable, Joao Moreira,
	Andrew Cooper, Josh Poimboeuf, Jose E. Marchesi, H.J. Lu,
	Nick Desaulniers, Sami Tolvanen, Nathan Chancellor, ojeda,
	Kees Cook

On Tue, Oct 1, 2024 at 3:21 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
>
> Does something like this help?

Yep. Thanks.

> diff --git a/arch/x86/include/asm/cfi.h b/arch/x86/include/asm/cfi.h
> index 31d19c815f99..b6e7e79e79c6 100644
> --- a/arch/x86/include/asm/cfi.h
> +++ b/arch/x86/include/asm/cfi.h
> @@ -44,11 +44,28 @@
>   *   call *%r11
>   *
>   *
> + * IBT+:
> + *
> + * foo:
> + *   endbr64 / ud1 0(%eax), %edx
> + *   ... code here ...
> + *   ret
> + *
> + * direct caller:
> + *   call foo+4
> + *
> + * indirect caller:
> + *   lea foo(%rip), %r11
> + *   ...
> + *   call *%r11
> + *
> + *
>   * kCFI:
>   *
>   * __cfi_foo:
>   *   movl $0x12345678, %eax    # kCFI signature hash
> - *                             # 11 nops when CONFIG_CALL_PADDING
> + *   movb $0x12, %al           # kCFI pointer argument mask
> + *                             # 9 nops when CONFIG_CALL_PADDING
>   * foo:
>   *   endbr64                   # when IBT
>   *   ... code here ...
> @@ -91,6 +108,57 @@
>   *   nop4
>   *   call *%r11
>   *
> + *
> + * FineIBT+:
> + *
> + * __cfi_foo:
> + *   endbr64
> + *   subl 0x12345678, %r10d
> + *   jz   foo

should it be 'jz foo_4' ?
Otherwise it will trap after endbr64 sealing.

> + *   ud2
> + *   nop
> + * foo:
> + *   ud1 0(%eax), %edx         # was endbr64
> + * foo_4:
> + *   ... code here ...
> + *   ret
> + *
> + * direct caller:
> + *   call foo+4
> + *
> + * indirect caller:
> + *   lea foo(%rip), %r11
> + *   ...
> + *   movl $0x12345678, %r10d
> + *   subl $16, %r11
> + *   nop4
> + *   call *%r11
> + *
> + *
> + * FineIBT-BHI:
> + *
> + * __cfi_foo:
> + *   endbr64
> + *   subl 0x12345678, %r10d
> + *   jz   foo-1
> + *   ud2
> + * foo-1:
> + *   call __bhi_args_XXX       # depends on kCFI pointer argument mask
> + * foo+4:
> + *   ... code here ...
> + *   ret
> + *
> + * direct caller:
> + *   call foo+4
> + *
> + * indirect caller:
> + *   lea foo(%rip), %r11
> + *   ...
> + *   movl $0x12345678, %r10d
> + *   subl $16, %r11
> + *   nop4
> + *   call *%r11
> + *
>   */
>  enum cfi_mode {
>         CFI_AUTO,       /* FineIBT if hardware has IBT, otherwise kCFI */
>
>
>
>
> > I wonder whether this whole complexity is worth it vs
> > always calling __bhi_args_all()
>
> That's one for Scott to answer; I think always doing _all will hurt
> especially bad because it includes rsp.

But why cmovne %rsp ?
Because some pointers are passed on the stack ?
but %rsp itself isn't involved in speculation.
load/store from stack can still speculate regardless of cmovne %rsp ?
Or is it acting like a barrier for all subsequent access from stack
including things like 'push rbp' in the function prologue?

Overall it feels like a lot of complexity for a 'security checkbox'.
If it hurts perf so much regardless the extra % here and there will
be ignored by security obsessed people and folks who care about
performance won't be enabling it anyway.

btw the alternative to hacking compilers is to get information
about pointers in function arguments from BTF.
It's all there. No need to encode it via movb $0x12, %al.

$ bpftool btf dump file vmlinux format c|grep pick_next_task
    struct task_struct * (*pick_next_task)(struct rq *, struct task_struct *);

^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: [PATCH 11/14] llvm: kCFI pointer stuff
  2024-09-27 19:49 ` [PATCH 11/14] llvm: kCFI pointer stuff Peter Zijlstra
  2024-09-29 17:53   ` Alexei Starovoitov
@ 2024-10-30  6:29   ` Constable, Scott D
  2024-10-30 20:07     ` Constable, Scott D
  1 sibling, 1 reply; 60+ messages in thread
From: Constable, Scott D @ 2024-10-30  6:29 UTC (permalink / raw)
  To: Peter Zijlstra, x86@kernel.org
  Cc: linux-kernel@vger.kernel.org, Milburn, Alyssa,
	joao@overdrivepizza.com, andrew.cooper3@citrix.com,
	jpoimboe@kernel.org, jose.marchesi@oracle.com,
	hjl.tools@gmail.com, ndesaulniers@google.com,
	samitolvanen@google.com, nathan@kernel.org, ojeda@kernel.org,
	kees@kernel.org, alexei.starovoitov@gmail.com

> Quick hack to extend the Clang-kCFI function meta-data (u32 hash) with a u8 bitmask of pointer arguments. This should really be under a new compiler flag, dependent on both x86_64 and kCFI.
> 
> Per the comment, the bitmask represents the register based arguments as the first 6 bits and bit 6 is used to cover all stack based arguments. The high bit is used for invalid values.
> 
> The purpose is to put a store dependency on the set registers, thereby blocking speculation paths that would otherwise expoit their value.

Given the ongoing discussion on [PATCH 13/14] where there is a growing consensus that all arguments (not just pointers) should be poisoned after a misprediction, a different encoding scheme would be needed. I believe there are 8 possibilities, which correspond to the function's arity:

0: Function takes 0 args
1: Function takes 1 arg
2: Function takes 2 args
3: Function takes 3 args
4: Function takes 4 args
5: Function takes 5 args
6: Function takes 6 args
7: Function takes >6 args

These possibilities can be encoded with 3 bits. I suspect that it might actually be beneficial to steal 3 bits from the u32 kCFI hash (either by using a smaller 29-bit hash or by truncating the 32-bit hash down to 29 bits). This scheme would arguably strengthen both kCFI and FineIBT by partitioning CFI edges such that a j-arity function cannot call a k-arity function unless j=k (or unless j>6 and k>6); the current 32-bit kCFI hash does not prevent, for example, a 2-arity fptr from calling a 3-arity target if the kCFI hashes collide. The disadvantage of the 29-bit hash is that it would increase the probability of collisions within each arity, but on the other hand the total number of functions of each given arity is much smaller than the total number of functions of all arities.

Regards,

Scott Constable

^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: [PATCH 11/14] llvm: kCFI pointer stuff
  2024-10-30  6:29   ` Constable, Scott D
@ 2024-10-30 20:07     ` Constable, Scott D
  0 siblings, 0 replies; 60+ messages in thread
From: Constable, Scott D @ 2024-10-30 20:07 UTC (permalink / raw)
  To: Peter Zijlstra, x86@kernel.org
  Cc: linux-kernel@vger.kernel.org, Milburn, Alyssa,
	joao@overdrivepizza.com, andrew.cooper3@citrix.com,
	jpoimboe@kernel.org, jose.marchesi@oracle.com,
	hjl.tools@gmail.com, ndesaulniers@google.com,
	samitolvanen@google.com, nathan@kernel.org, ojeda@kernel.org,
	kees@kernel.org, alexei.starovoitov@gmail.com

> > Quick hack to extend the Clang-kCFI function meta-data (u32 hash) with a u8 bitmask of pointer arguments. This should really be under a new compiler flag, dependent on both x86_64 and kCFI.
> > 
> > Per the comment, the bitmask represents the register based arguments as the first 6 bits and bit 6 is used to cover all stack based arguments. The high bit is used for invalid values.
> > 
> > The purpose is to put a store dependency on the set registers, thereby blocking speculation paths that would otherwise expoit their value.
> 
> Given the ongoing discussion on [PATCH 13/14] where there is a growing consensus that all arguments (not just pointers) should be poisoned after a misprediction, a different encoding scheme would be needed. I believe there are 8 possibilities, which correspond to the function's arity:
> 
> 0: Function takes 0 args
> 1: Function takes 1 arg
> 2: Function takes 2 args
> 3: Function takes 3 args
> 4: Function takes 4 args
> 5: Function takes 5 args
> 6: Function takes 6 args
> 7: Function takes >6 args
>
> These possibilities can be encoded with 3 bits. I suspect that it might actually be beneficial to steal 3 bits from the u32 kCFI hash (either by using a smaller 29-bit hash or by truncating the 32-bit hash down to 29 bits). This scheme would arguably strengthen both kCFI and FineIBT by partitioning CFI edges such that a j-arity function cannot call a k-arity function unless j=k (or unless j>6 and k>6); the current 32-bit kCFI hash does not prevent, for example, a 2-arity fptr from calling a 3-arity target if the kCFI hashes collide. The disadvantage of the 29-bit hash is that it would increase the probability of collisions within each arity, but on the other hand the total number of functions of each given arity is much smaller than the total number of functions of all arities.

I have done some additional analysis on my Noble kernel, which suggests that the proposed 29-bit hash with 3-bit arity will only be more secure than the existing 32-bit hash. Consider: My kernel has 141,617 total indirect call targets, with 10,903 unique function types. With a 32-bit kCFI hash, the expected number of collisions is 2^-32 * (10903 C 2) = 0.01383765. Then I scanned the kernel to identify the number of unique function types for each arity, and computed the corresponding expected number of collisions within each arity, and assuming a 29-bit hash:

# Args	total targets	unique types	Expected collisions
0	12682	32	0.00000092
1	42981	2492	0.00578125
2	37657	3775	0.01326841
3	29436	2547	0.00603931
4	12343	1169	0.00127162
5	4137	519	0.00025038
6	1700	221	0.00004528
more	681	148	0.00002026

(Sorry if the formatting became weird after copying from Excel)

Hence, even the arity (2) with the largest number of unique function types (3775) has a lower expected value for 29-bit collisions (0.01326841) than the expected value for 32-bit collisions (0.01383765).

Regards,

Scott Constable

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 12/14] x86: Hacks for hacked up llvm
  2024-09-27 19:48 [PATCH 00/14] x86/ibt: FineIBT-BHI Peter Zijlstra
                   ` (10 preceding siblings ...)
  2024-09-27 19:49 ` [PATCH 11/14] llvm: kCFI pointer stuff Peter Zijlstra
@ 2024-09-27 19:49 ` Peter Zijlstra
  2024-09-27 19:49 ` [PATCH 13/14] x86: BHI stubs Peter Zijlstra
  2024-09-27 19:49 ` [PATCH 14/14] x86/fineibt: Add FineIBT+BHI mitigation Peter Zijlstra
  13 siblings, 0 replies; 60+ messages in thread
From: Peter Zijlstra @ 2024-09-27 19:49 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, peterz, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jpoimboe, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

XXX do this nicely once the llvm hacks are done nice

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/Kconfig               |    8 ++++----
 arch/x86/include/asm/linkage.h |    2 ++
 2 files changed, 6 insertions(+), 4 deletions(-)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2444,10 +2444,10 @@ config CC_HAS_ENTRY_PADDING
 
 config FUNCTION_PADDING_CFI
 	int
-	default 59 if FUNCTION_ALIGNMENT_64B
-	default 27 if FUNCTION_ALIGNMENT_32B
-	default 11 if FUNCTION_ALIGNMENT_16B
-	default  3 if FUNCTION_ALIGNMENT_8B
+	default 57 if FUNCTION_ALIGNMENT_64B
+	default 25 if FUNCTION_ALIGNMENT_32B
+	default  9 if FUNCTION_ALIGNMENT_16B
+	default  1 if FUNCTION_ALIGNMENT_8B
 	default  0
 
 # Basically: FUNCTION_ALIGNMENT - 5*CFI_CLANG
--- a/arch/x86/include/asm/linkage.h
+++ b/arch/x86/include/asm/linkage.h
@@ -102,6 +102,8 @@
 	CFI_PRE_PADDING						\
 	.byte 0xb8 ASM_NL					\
 	.long __kcfi_typeid_##name ASM_NL			\
+	.byte 0xb0 ASM_NL					\
+	.byte 0x7f ASM_NL					\
 	CFI_POST_PADDING					\
 	SYM_FUNC_END(__cfi_##name)
 



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 13/14] x86: BHI stubs
  2024-09-27 19:48 [PATCH 00/14] x86/ibt: FineIBT-BHI Peter Zijlstra
                   ` (11 preceding siblings ...)
  2024-09-27 19:49 ` [PATCH 12/14] x86: Hacks for hacked up llvm Peter Zijlstra
@ 2024-09-27 19:49 ` Peter Zijlstra
  2024-09-28  1:37   ` Josh Poimboeuf
  2024-09-30 21:30   ` Josh Poimboeuf
  2024-09-27 19:49 ` [PATCH 14/14] x86/fineibt: Add FineIBT+BHI mitigation Peter Zijlstra
  13 siblings, 2 replies; 60+ messages in thread
From: Peter Zijlstra @ 2024-09-27 19:49 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, peterz, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jpoimboe, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

Mostly generated combinatorical stubs used to poison function
argument pointers.

Notably, since this targets eIBRS parts which do not suffer from
retbleed, use normal return instructions to save some space.

In total: 6c1 + 6c2 + 6c3 + 6c4 + 1 = 6 + 15 + 20 + 15 + 1 = 57 stubs.

Note: Scott said only 0.6% of the kernel functions take 5 or more
pointer arguments, if any of those turns out to be performance
critical, we can add more stubs.

Note: the nested for loops are horrid, should be fixed.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/cfi.h    |   11 
 arch/x86/kernel/alternative.c |   94 ++++++
 arch/x86/lib/Makefile         |    1 
 arch/x86/lib/bhi.S            |  602 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 708 insertions(+)

--- a/arch/x86/include/asm/cfi.h
+++ b/arch/x86/include/asm/cfi.h
@@ -101,6 +101,17 @@ enum cfi_mode {
 
 extern enum cfi_mode cfi_mode;
 
+typedef u8 bhi_thunk_8[8];
+typedef u8 bhi_thunk_16[16];
+typedef u8 bhi_thunk_32[32];
+
+extern bhi_thunk_8  __bhi_args_6c1[];
+extern bhi_thunk_16 __bhi_args_6c2[];
+extern bhi_thunk_16 __bhi_args_6c3[];
+extern bhi_thunk_32 __bhi_args_6c4[];
+
+extern u8 __bhi_args_all[];
+
 struct pt_regs;
 
 #ifdef CONFIG_CFI_CLANG
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1039,10 +1039,104 @@ u32 cfi_get_func_hash(void *func)
 
 	return hash;
 }
+
 #endif
 
 #ifdef CONFIG_FINEIBT
 
+static void *bhi_args_1(u8 args, void *addr)
+{
+	u8 bytes[5];
+
+	for (int i = 0; i < 6; i++) {
+		if (args != BIT(i))
+			continue;
+
+		bytes[0] = 0x2e;
+		memcpy(&bytes[1], &__bhi_args_6c1[i], 4);
+
+		text_poke_early(addr, bytes, 5);
+
+		return NULL;
+	}
+
+	return __bhi_args_all;
+}
+
+static void *bhi_args_2(u8 args, void *addr)
+{
+	int x = 0;
+
+	for (int i = 0; i < 6; i++) {
+		for (int j = i + 1; j < 6; j++) {
+			if (args == (BIT(i) | BIT(j)))
+				return &__bhi_args_6c2[x];
+			x++;
+		}
+	}
+
+	return __bhi_args_all;
+}
+
+static void *bhi_args_3(u8 args, void *addr)
+{
+	int x = 0;
+
+	for (int i = 0; i < 6; i++) {
+		for (int j = i + 1; j < 6; j++) {
+			for (int k = j + 1; k < 6; k++) {
+				if (args == (BIT(i) | BIT(j) | BIT(k)))
+					return &__bhi_args_6c3[x];
+				x++;
+			}
+		}
+	}
+
+	return __bhi_args_all;
+}
+
+static void *bhi_args_4(u8 args, void *addr)
+{
+	int x = 0;
+
+	for (int i = 0; i < 6; i++) {
+		for (int j = i + 1; j < 6; j++) {
+			for (int k = j + 1; k < 6; k++) {
+				for (int l = k + 1; l < 6; l++) {
+					if (args == (BIT(i) | BIT(j) | BIT(k) | BIT(l)))
+						return &__bhi_args_6c4[x];
+					x++;
+				}
+			}
+		}
+	}
+
+	return __bhi_args_all;
+}
+
+static void bhi_args(u8 args, void *addr)
+{
+	void *dest = __bhi_args_all;
+
+	if (WARN_ON_ONCE(!args))
+		return;
+
+	switch(hweight8(args)) {
+	case 1: if (bhi_args_1(args, addr) == dest)
+			break;
+		return;
+
+	case 2: dest = bhi_args_2(args, addr); break;
+	case 3: dest = bhi_args_3(args, addr); break;
+	case 4: dest = bhi_args_4(args, addr); break;
+
+	default:
+		break;
+	}
+
+	text_poke_early(addr, text_gen_insn(CALL_INSN_OPCODE, addr, dest), 5);
+}
+
 static bool cfi_rand __ro_after_init = true;
 static u32  cfi_seed __ro_after_init;
 
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -60,4 +60,5 @@ endif
         lib-y += memmove_64.o memset_64.o
         lib-y += copy_user_64.o copy_user_uncached_64.o
 	lib-y += cmpxchg16b_emu.o
+	lib-y += bhi.o
 endif
--- /dev/null
+++ b/arch/x86/lib/bhi.S
@@ -0,0 +1,602 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/linkage.h>
+#include <asm/unwind_hints.h>
+#include <asm/nospec-branch.h>
+
+/*
+ * At the function start, launder function arguments that are a pointer through
+ * CMOVcc, this will create a write dependency in the speculation flow.
+ *
+ * Notably, the CFI preambles calling these will have ZF set and r10 zero.
+ */
+
+.pushsection .noinstr.text, "ax"
+
+	.align 8
+SYM_CODE_START(__bhi_args_6c1)
+	ANNOTATE_NOENDBR
+	.align 8
+SYM_INNER_LABEL(__bhi_args_0, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdi
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 8
+SYM_INNER_LABEL(__bhi_args_1, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rsi
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 8
+SYM_INNER_LABEL(__bhi_args_2, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdx
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 8
+SYM_INNER_LABEL(__bhi_args_3, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rcx
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 8
+SYM_INNER_LABEL(__bhi_args_4, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %r8
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 8
+SYM_INNER_LABEL(__bhi_args_5, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %r9
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+SYM_CODE_END(__bhi_args_6c1)
+
+
+	.align 16
+SYM_CODE_START(__bhi_args_6c2)
+	ANNOTATE_NOENDBR
+	.align 16
+SYM_INNER_LABEL(__bhi_args_0_1, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdi
+	cmovne %r10, %rsi
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_0_2, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdi
+	cmovne %r10, %rdx
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_0_3, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdi
+	cmovne %r10, %rcx
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_0_4, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdi
+	cmovne %r10, %r8
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_0_5, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdi
+	cmovne %r10, %r9
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_1_2, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rsi
+	cmovne %r10, %rdx
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_1_3, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rsi
+	cmovne %r10, %rcx
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_1_4, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rsi
+	cmovne %r10, %r8
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_1_5, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rsi
+	cmovne %r10, %r9
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_2_3, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdx
+	cmovne %r10, %rcx
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_2_4, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdx
+	cmovne %r10, %r8
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_2_5, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdx
+	cmovne %r10, %r9
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_3_4, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rcx
+	cmovne %r10, %r8
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_3_5, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rcx
+	cmovne %r10, %r9
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_4_5, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %r8
+	cmovne %r10, %r9
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+SYM_CODE_END(__bhi_args_6c2)
+
+
+	.align 16
+SYM_CODE_START(__bhi_args_6c3)
+	ANNOTATE_NOENDBR
+	.align 16
+SYM_INNER_LABEL(__bhi_args_0_1_2, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdi
+	cmovne %r10, %rsi
+	cmovne %r10, %rdx
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_0_1_3, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdi
+	cmovne %r10, %rsi
+	cmovne %r10, %rcx
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_0_1_4, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdi
+	cmovne %r10, %rsi
+	cmovne %r10, %r8
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_0_1_5, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdi
+	cmovne %r10, %rsi
+	cmovne %r10, %r9
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_0_2_3, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdi
+	cmovne %r10, %rdx
+	cmovne %r10, %rcx
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_0_2_4, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdi
+	cmovne %r10, %rdx
+	cmovne %r10, %r8
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_0_2_5, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdi
+	cmovne %r10, %rdx
+	cmovne %r10, %r9
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_0_3_4, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdi
+	cmovne %r10, %rcx
+	cmovne %r10, %r8
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_0_3_5, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdi
+	cmovne %r10, %rcx
+	cmovne %r10, %r9
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_0_4_5, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdi
+	cmovne %r10, %r8
+	cmovne %r10, %r9
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_1_2_3, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rsi
+	cmovne %r10, %rdx
+	cmovne %r10, %rcx
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_1_2_4, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rsi
+	cmovne %r10, %rdx
+	cmovne %r10, %r8
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_1_2_5, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rsi
+	cmovne %r10, %rdx
+	cmovne %r10, %r9
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_1_3_4, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rsi
+	cmovne %r10, %rcx
+	cmovne %r10, %r8
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_1_3_5, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rsi
+	cmovne %r10, %rcx
+	cmovne %r10, %r9
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_1_4_5, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rsi
+	cmovne %r10, %r8
+	cmovne %r10, %r9
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_2_3_4, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdx
+	cmovne %r10, %rcx
+	cmovne %r10, %r8
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_2_3_5, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdx
+	cmovne %r10, %rcx
+	cmovne %r10, %r9
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_2_4_5, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdx
+	cmovne %r10, %r8
+	cmovne %r10, %r9
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 16
+SYM_INNER_LABEL(__bhi_args_3_4_5, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rcx
+	cmovne %r10, %r8
+	cmovne %r10, %r9
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+SYM_CODE_END(__bhi_args_6c3)
+
+
+	.align 32
+SYM_CODE_START(__bhi_args_6c4)
+	ANNOTATE_NOENDBR
+	.align 32
+SYM_INNER_LABEL(__bhi_args_0_1_2_3, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdi
+	cmovne %r10, %rsi
+	cmovne %r10, %rdx
+	cmovne %r10, %rcx
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 32
+SYM_INNER_LABEL(__bhi_args_0_1_2_4, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdi
+	cmovne %r10, %rsi
+	cmovne %r10, %rdx
+	cmovne %r10, %r8
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 32
+SYM_INNER_LABEL(__bhi_args_0_1_2_5, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdi
+	cmovne %r10, %rsi
+	cmovne %r10, %rdx
+	cmovne %r10, %r9
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 32
+SYM_INNER_LABEL(__bhi_args_0_1_3_4, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdi
+	cmovne %r10, %rsi
+	cmovne %r10, %rcx
+	cmovne %r10, %r8
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 32
+SYM_INNER_LABEL(__bhi_args_0_1_3_5, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdi
+	cmovne %r10, %rsi
+	cmovne %r10, %rcx
+	cmovne %r10, %r9
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 32
+SYM_INNER_LABEL(__bhi_args_0_1_4_5, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdi
+	cmovne %r10, %rsi
+	cmovne %r10, %r8
+	cmovne %r10, %r9
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 32
+SYM_INNER_LABEL(__bhi_args_0_2_3_4, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdi
+	cmovne %r10, %rdx
+	cmovne %r10, %rcx
+	cmovne %r10, %r8
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 32
+SYM_INNER_LABEL(__bhi_args_0_2_3_5, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdi
+	cmovne %r10, %rdx
+	cmovne %r10, %rcx
+	cmovne %r10, %r9
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 32
+SYM_INNER_LABEL(__bhi_args_0_2_4_5, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdi
+	cmovne %r10, %rdx
+	cmovne %r10, %r8
+	cmovne %r10, %r9
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 32
+SYM_INNER_LABEL(__bhi_args_0_3_4_5, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdi
+	cmovne %r10, %rcx
+	cmovne %r10, %r8
+	cmovne %r10, %r9
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 32
+SYM_INNER_LABEL(__bhi_args_1_2_3_4, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rsi
+	cmovne %r10, %rdx
+	cmovne %r10, %rcx
+	cmovne %r10, %r8
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 32
+SYM_INNER_LABEL(__bhi_args_1_2_3_5, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rsi
+	cmovne %r10, %rdx
+	cmovne %r10, %rcx
+	cmovne %r10, %r9
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 32
+SYM_INNER_LABEL(__bhi_args_1_2_4_5, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rsi
+	cmovne %r10, %rdx
+	cmovne %r10, %r8
+	cmovne %r10, %r9
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 32
+SYM_INNER_LABEL(__bhi_args_1_3_4_5, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rsi
+	cmovne %r10, %rcx
+	cmovne %r10, %r8
+	cmovne %r10, %r9
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+	.align 32
+SYM_INNER_LABEL(__bhi_args_2_3_4_5, SYM_L_LOCAL)
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdx
+	cmovne %r10, %rcx
+	cmovne %r10, %r8
+	cmovne %r10, %r9
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+
+SYM_CODE_END(__bhi_args_6c4)
+
+SYM_CODE_START(__bhi_args_all)
+	ANNOTATE_NOENDBR
+	UNWIND_HINT_FUNC
+	cmovne %r10, %rdi
+	cmovne %r10, %rsi
+	cmovne %r10, %rdx
+	cmovne %r10, %rcx
+	cmovne %r10, %r8
+	cmovne %r10, %r9
+	cmovne %r10, %rsp
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
+SYM_CODE_END(__bhi_args_all)
+
+.popsection



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 13/14] x86: BHI stubs
  2024-09-27 19:49 ` [PATCH 13/14] x86: BHI stubs Peter Zijlstra
@ 2024-09-28  1:37   ` Josh Poimboeuf
  2024-09-28 13:23     ` Peter Zijlstra
  2024-09-30 21:30   ` Josh Poimboeuf
  1 sibling, 1 reply; 60+ messages in thread
From: Josh Poimboeuf @ 2024-09-28  1:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

On Fri, Sep 27, 2024 at 09:49:09PM +0200, Peter Zijlstra wrote:
> +static void *bhi_args_1(u8 args, void *addr)
> +{
> +	u8 bytes[5];
> +
> +	for (int i = 0; i < 6; i++) {
> +		if (args != BIT(i))
> +			continue;
> +
> +		bytes[0] = 0x2e;
> +		memcpy(&bytes[1], &__bhi_args_6c1[i], 4);
> +
> +		text_poke_early(addr, bytes, 5);
> +
> +		return NULL;

I assume there's some good reason this doesn't return a pointer to the
code like the others?

-- 
Josh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 13/14] x86: BHI stubs
  2024-09-28  1:37   ` Josh Poimboeuf
@ 2024-09-28 13:23     ` Peter Zijlstra
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Zijlstra @ 2024-09-28 13:23 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: x86, linux-kernel, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

On Fri, Sep 27, 2024 at 06:37:36PM -0700, Josh Poimboeuf wrote:
> On Fri, Sep 27, 2024 at 09:49:09PM +0200, Peter Zijlstra wrote:
> > +static void *bhi_args_1(u8 args, void *addr)
> > +{
> > +	u8 bytes[5];
> > +
> > +	for (int i = 0; i < 6; i++) {
> > +		if (args != BIT(i))
> > +			continue;
> > +
> > +		bytes[0] = 0x2e;
> > +		memcpy(&bytes[1], &__bhi_args_6c1[i], 4);
> > +
> > +		text_poke_early(addr, bytes, 5);
> > +
> > +		return NULL;
> 
> I assume there's some good reason this doesn't return a pointer to the
> code like the others?

The 1 bit case is different in that it does in-place CMOVcc while the
others do a CALL to an external stub.

Not saying this is the best way to do it, but it's what I ended up with
back then.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 13/14] x86: BHI stubs
  2024-09-27 19:49 ` [PATCH 13/14] x86: BHI stubs Peter Zijlstra
  2024-09-28  1:37   ` Josh Poimboeuf
@ 2024-09-30 21:30   ` Josh Poimboeuf
  2024-09-30 21:46     ` Josh Poimboeuf
  2024-09-30 22:23     ` Andrew Cooper
  1 sibling, 2 replies; 60+ messages in thread
From: Josh Poimboeuf @ 2024-09-30 21:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

On Fri, Sep 27, 2024 at 09:49:09PM +0200, Peter Zijlstra wrote:
> +/*
> + * At the function start, launder function arguments that are a pointer through
> + * CMOVcc, this will create a write dependency in the speculation flow.
> + *
> + * Notably, the CFI preambles calling these will have ZF set and r10 zero.
> + */
> +
> +.pushsection .noinstr.text, "ax"
> +
> +	.align 8
> +SYM_CODE_START(__bhi_args_6c1)
> +	ANNOTATE_NOENDBR
> +	.align 8
> +SYM_INNER_LABEL(__bhi_args_0, SYM_L_LOCAL)
> +	UNWIND_HINT_FUNC
> +	cmovne %r10, %rdi

IIUC, this works because if the "jz" in the CFI preamble mispredicts to
the __bhi_args_* code, "cmovne" will zero out the speculative value of
rdi.

Why use %r10 instead of a literal $0?  Also how do you know %r10 is 0?

-- 
Josh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 13/14] x86: BHI stubs
  2024-09-30 21:30   ` Josh Poimboeuf
@ 2024-09-30 21:46     ` Josh Poimboeuf
  2024-09-30 22:23     ` Andrew Cooper
  1 sibling, 0 replies; 60+ messages in thread
From: Josh Poimboeuf @ 2024-09-30 21:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

On Mon, Sep 30, 2024 at 02:30:32PM -0700, Josh Poimboeuf wrote:
> On Fri, Sep 27, 2024 at 09:49:09PM +0200, Peter Zijlstra wrote:
> > +/*
> > + * At the function start, launder function arguments that are a pointer through
> > + * CMOVcc, this will create a write dependency in the speculation flow.
> > + *
> > + * Notably, the CFI preambles calling these will have ZF set and r10 zero.
> > + */
> > +
> > +.pushsection .noinstr.text, "ax"
> > +
> > +	.align 8
> > +SYM_CODE_START(__bhi_args_6c1)
> > +	ANNOTATE_NOENDBR
> > +	.align 8
> > +SYM_INNER_LABEL(__bhi_args_0, SYM_L_LOCAL)
> > +	UNWIND_HINT_FUNC
> > +	cmovne %r10, %rdi
> 
> IIUC, this works because if the "jz" in the CFI preamble mispredicts to
> the __bhi_args_* code, "cmovne" will zero out the speculative value of
> rdi.
> 
> Why use %r10 instead of a literal $0?  Also how do you know %r10 is 0?

BTW, this "speculative pointer clearing" feature is broader than just
BHI so I wouldn't call it that.  It's really just a logical extension of
FineIBT IMO.

-- 
Josh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 13/14] x86: BHI stubs
  2024-09-30 21:30   ` Josh Poimboeuf
  2024-09-30 21:46     ` Josh Poimboeuf
@ 2024-09-30 22:23     ` Andrew Cooper
  2024-09-30 22:38       ` Josh Poimboeuf
  1 sibling, 1 reply; 60+ messages in thread
From: Andrew Cooper @ 2024-09-30 22:23 UTC (permalink / raw)
  To: Josh Poimboeuf, Peter Zijlstra
  Cc: x86, linux-kernel, alyssa.milburn, scott.d.constable, joao,
	jose.marchesi, hjl.tools, ndesaulniers, samitolvanen, nathan,
	ojeda, kees, alexei.starovoitov

On 30/09/2024 10:30 pm, Josh Poimboeuf wrote:
> On Fri, Sep 27, 2024 at 09:49:09PM +0200, Peter Zijlstra wrote:
>> +/*
>> + * At the function start, launder function arguments that are a pointer through
>> + * CMOVcc, this will create a write dependency in the speculation flow.
>> + *
>> + * Notably, the CFI preambles calling these will have ZF set and r10 zero.
>> + */
>> +
>> +.pushsection .noinstr.text, "ax"
>> +
>> +	.align 8
>> +SYM_CODE_START(__bhi_args_6c1)
>> +	ANNOTATE_NOENDBR
>> +	.align 8
>> +SYM_INNER_LABEL(__bhi_args_0, SYM_L_LOCAL)
>> +	UNWIND_HINT_FUNC
>> +	cmovne %r10, %rdi
> IIUC, this works because if the "jz" in the CFI preamble mispredicts to
> the __bhi_args_* code, "cmovne" will zero out the speculative value of
> rdi.
>
> Why use %r10 instead of a literal $0?  Also how do you know %r10 is 0?

There's no encoding for CMOVcc which takes an $imm.

%r10 is guaranteed zero after the FineIBT prologue, but I don't see
anything in patch 11 which makes this true in the !FineIBT case.

~Andrew

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 13/14] x86: BHI stubs
  2024-09-30 22:23     ` Andrew Cooper
@ 2024-09-30 22:38       ` Josh Poimboeuf
  2024-09-30 22:52         ` Andrew Cooper
  2024-10-01 11:03         ` Peter Zijlstra
  0 siblings, 2 replies; 60+ messages in thread
From: Josh Poimboeuf @ 2024-09-30 22:38 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Peter Zijlstra, x86, linux-kernel, alyssa.milburn,
	scott.d.constable, joao, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

On Mon, Sep 30, 2024 at 11:23:38PM +0100, Andrew Cooper wrote:
> On 30/09/2024 10:30 pm, Josh Poimboeuf wrote:
> > On Fri, Sep 27, 2024 at 09:49:09PM +0200, Peter Zijlstra wrote:
> >> +SYM_INNER_LABEL(__bhi_args_0, SYM_L_LOCAL)
> >> +	UNWIND_HINT_FUNC
> >> +	cmovne %r10, %rdi
> > IIUC, this works because if the "jz" in the CFI preamble mispredicts to
> > the __bhi_args_* code, "cmovne" will zero out the speculative value of
> > rdi.
> >
> > Why use %r10 instead of a literal $0?  Also how do you know %r10 is 0?
> 
> There's no encoding for CMOVcc which takes an $imm.

Ah.

> %r10 is guaranteed zero after the FineIBT prologue

If the "jz" in the FineIBT prologue mispredicts, isn't %r10 non-zero by
definition?

> , but I don't see
> anything in patch 11 which makes this true in the !FineIBT case.

I thought this code is only used by FineIBT?

-- 
Josh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 13/14] x86: BHI stubs
  2024-09-30 22:38       ` Josh Poimboeuf
@ 2024-09-30 22:52         ` Andrew Cooper
  2024-10-01 11:03         ` Peter Zijlstra
  1 sibling, 0 replies; 60+ messages in thread
From: Andrew Cooper @ 2024-09-30 22:52 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Peter Zijlstra, x86, linux-kernel, alyssa.milburn,
	scott.d.constable, joao, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

On 30/09/2024 11:38 pm, Josh Poimboeuf wrote:
> On Mon, Sep 30, 2024 at 11:23:38PM +0100, Andrew Cooper wrote:
>> On 30/09/2024 10:30 pm, Josh Poimboeuf wrote:
>>> On Fri, Sep 27, 2024 at 09:49:09PM +0200, Peter Zijlstra wrote:
>>>> +SYM_INNER_LABEL(__bhi_args_0, SYM_L_LOCAL)
>>>> +	UNWIND_HINT_FUNC
>>>> +	cmovne %r10, %rdi
>>> IIUC, this works because if the "jz" in the CFI preamble mispredicts to
>>> the __bhi_args_* code, "cmovne" will zero out the speculative value of
>>> rdi.
>>>
>>> Why use %r10 instead of a literal $0?  Also how do you know %r10 is 0?
>> There's no encoding for CMOVcc which takes an $imm.
> Ah.
>
>> %r10 is guaranteed zero after the FineIBT prologue
> If the "jz" in the FineIBT prologue mispredicts, isn't %r10 non-zero by
> definition?

FineIBT uses SUB (and not CMP) to destroy the hash in %r10.

This makes it marginally harder to leak an unknown hash; you can't
trivially deference it, but there is a linear function if you know the
hash of the caller side.

In the bad speculation path, you're still overwriting pointers with a
number that is < 2^32, so will be stalled by SMAP if dereferenced.

>
>> , but I don't see
>> anything in patch 11 which makes this true in the !FineIBT case.
> I thought this code is only used by FineIBT?

Erm pass.  Peter?

~Andrew

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 13/14] x86: BHI stubs
  2024-09-30 22:38       ` Josh Poimboeuf
  2024-09-30 22:52         ` Andrew Cooper
@ 2024-10-01 11:03         ` Peter Zijlstra
  2024-10-01 11:20           ` Andrew Cooper
  1 sibling, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2024-10-01 11:03 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andrew Cooper, x86, linux-kernel, alyssa.milburn,
	scott.d.constable, joao, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

On Mon, Sep 30, 2024 at 03:38:48PM -0700, Josh Poimboeuf wrote:
> On Mon, Sep 30, 2024 at 11:23:38PM +0100, Andrew Cooper wrote:
> > On 30/09/2024 10:30 pm, Josh Poimboeuf wrote:
> > > On Fri, Sep 27, 2024 at 09:49:09PM +0200, Peter Zijlstra wrote:
> > >> +SYM_INNER_LABEL(__bhi_args_0, SYM_L_LOCAL)
> > >> +	UNWIND_HINT_FUNC
> > >> +	cmovne %r10, %rdi
> > > IIUC, this works because if the "jz" in the CFI preamble mispredicts to
> > > the __bhi_args_* code, "cmovne" will zero out the speculative value of
> > > rdi.
> > >
> > > Why use %r10 instead of a literal $0?  Also how do you know %r10 is 0?
> > 
> > There's no encoding for CMOVcc which takes an $imm.
> 
> Ah.
> 
> > %r10 is guaranteed zero after the FineIBT prologue
> 
> If the "jz" in the FineIBT prologue mispredicts, isn't %r10 non-zero by
> definition?

Since I just wrote the comment...

 * FineIBT-BHI:
 *
 * __cfi_foo:
 *   endbr64
 *   subl 0x12345678, %r10d
 *   jz   foo-1
 *   ud2
 * foo-1:
 *   call __bhi_args_XXX
 * foo+4:
 *   ... code here ...
 *   ret
 *
 * direct caller:
 *   call foo+4
 *
 * indirect caller:
 *   lea foo(%rip), %r11
 *   ...
 *   movl $0x12345678, %r10d
 *   subl $16, %r11
 *   nop4
 *   call *%r11

And lets take a random bhi function:

+       .align 16
+SYM_INNER_LABEL(__bhi_args_0_1, SYM_L_LOCAL)
+       UNWIND_HINT_FUNC
+       cmovne %r10, %rdi
+       cmovne %r10, %rsi
+       ANNOTATE_UNRET_SAFE
+       ret
+       int3

So the case you worry about is SUBL does *not* result in 0, but we
speculate JZ true and end up in CALL, and do CMOVne.

Since we speculated Z, we must then also not do the CMOV, so the value
of R10 is irrelevant, it will not be used. The thing however is that
CMOV will unconditionally put a store dependency on the target register
(RDI, RSI in the above sequence) and as such any further speculative
code trying to use those registers will stall.

> > , but I don't see
> > anything in patch 11 which makes this true in the !FineIBT case.
> 
> I thought this code is only used by FineIBT?

Right, so I do have me a patch that adds it to regular KCFI as well, but
I dropped it for now, since I don't have a strong rationale for it and
it requires yet more compiler tinkering.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 13/14] x86: BHI stubs
  2024-10-01 11:03         ` Peter Zijlstra
@ 2024-10-01 11:20           ` Andrew Cooper
  2024-10-03 12:17             ` Peter Zijlstra
  0 siblings, 1 reply; 60+ messages in thread
From: Andrew Cooper @ 2024-10-01 11:20 UTC (permalink / raw)
  To: Peter Zijlstra, Josh Poimboeuf
  Cc: x86, linux-kernel, alyssa.milburn, scott.d.constable, joao,
	jose.marchesi, hjl.tools, ndesaulniers, samitolvanen, nathan,
	ojeda, kees, alexei.starovoitov

On 01/10/2024 12:03 pm, Peter Zijlstra wrote:
> On Mon, Sep 30, 2024 at 03:38:48PM -0700, Josh Poimboeuf wrote:
>> On Mon, Sep 30, 2024 at 11:23:38PM +0100, Andrew Cooper wrote:
>>> On 30/09/2024 10:30 pm, Josh Poimboeuf wrote:
>>>> On Fri, Sep 27, 2024 at 09:49:09PM +0200, Peter Zijlstra wrote:
>>>>> +SYM_INNER_LABEL(__bhi_args_0, SYM_L_LOCAL)
>>>>> +	UNWIND_HINT_FUNC
>>>>> +	cmovne %r10, %rdi
>>>> IIUC, this works because if the "jz" in the CFI preamble mispredicts to
>>>> the __bhi_args_* code, "cmovne" will zero out the speculative value of
>>>> rdi.
>>>>
>>>> Why use %r10 instead of a literal $0?  Also how do you know %r10 is 0?
>>> There's no encoding for CMOVcc which takes an $imm.
>> Ah.
>>
>>> %r10 is guaranteed zero after the FineIBT prologue
>> If the "jz" in the FineIBT prologue mispredicts, isn't %r10 non-zero by
>> definition?
> Since I just wrote the comment...
>
>  * FineIBT-BHI:
>  *
>  * __cfi_foo:
>  *   endbr64
>  *   subl 0x12345678, %r10d
>  *   jz   foo-1
>  *   ud2
>  * foo-1:
>  *   call __bhi_args_XXX
>  * foo+4:
>  *   ... code here ...
>  *   ret
>  *
>  * direct caller:
>  *   call foo+4
>  *
>  * indirect caller:
>  *   lea foo(%rip), %r11
>  *   ...
>  *   movl $0x12345678, %r10d
>  *   subl $16, %r11

A compiler would normally spell this:

    lea foo-16(%rip), %r11

>  *   nop4
>  *   call *%r11
>
> And lets take a random bhi function:
>
> +       .align 16
> +SYM_INNER_LABEL(__bhi_args_0_1, SYM_L_LOCAL)
> +       UNWIND_HINT_FUNC
> +       cmovne %r10, %rdi
> +       cmovne %r10, %rsi
> +       ANNOTATE_UNRET_SAFE
> +       ret
> +       int3
>
> So the case you worry about is SUBL does *not* result in 0, but we
> speculate JZ true and end up in CALL, and do CMOVne.
>
> Since we speculated Z, we must then also not do the CMOV, so the value
> of R10 is irrelevant, it will not be used. The thing however is that
> CMOV will unconditionally put a store dependency on the target register
> (RDI, RSI in the above sequence) and as such any further speculative
> code trying to use those registers will stall.

How does that help?

The write dependency doesn't stop a dependent load from executing in the
shadow of a mispredicted branch.

This is why SLH forces the pointer to 0 on the bad speculation path.

~Andrew

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 13/14] x86: BHI stubs
  2024-10-01 11:20           ` Andrew Cooper
@ 2024-10-03 12:17             ` Peter Zijlstra
  2024-10-03 13:59               ` Andrew Cooper
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2024-10-03 12:17 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Josh Poimboeuf, x86, linux-kernel, alyssa.milburn,
	scott.d.constable, joao, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

On Tue, Oct 01, 2024 at 12:20:02PM +0100, Andrew Cooper wrote:
> On 01/10/2024 12:03 pm, Peter Zijlstra wrote:
> > On Mon, Sep 30, 2024 at 03:38:48PM -0700, Josh Poimboeuf wrote:
> >> On Mon, Sep 30, 2024 at 11:23:38PM +0100, Andrew Cooper wrote:
> >>> On 30/09/2024 10:30 pm, Josh Poimboeuf wrote:
> >>>> On Fri, Sep 27, 2024 at 09:49:09PM +0200, Peter Zijlstra wrote:
> >>>>> +SYM_INNER_LABEL(__bhi_args_0, SYM_L_LOCAL)
> >>>>> +	UNWIND_HINT_FUNC
> >>>>> +	cmovne %r10, %rdi
> >>>> IIUC, this works because if the "jz" in the CFI preamble mispredicts to
> >>>> the __bhi_args_* code, "cmovne" will zero out the speculative value of
> >>>> rdi.
> >>>>
> >>>> Why use %r10 instead of a literal $0?  Also how do you know %r10 is 0?
> >>> There's no encoding for CMOVcc which takes an $imm.
> >> Ah.
> >>
> >>> %r10 is guaranteed zero after the FineIBT prologue
> >> If the "jz" in the FineIBT prologue mispredicts, isn't %r10 non-zero by
> >> definition?
> > Since I just wrote the comment...
> >
> >  * FineIBT-BHI:
> >  *
> >  * __cfi_foo:
> >  *   endbr64
> >  *   subl 0x12345678, %r10d
> >  *   jz   foo-1
> >  *   ud2
> >  * foo-1:
> >  *   call __bhi_args_XXX
> >  * foo+4:
> >  *   ... code here ...
> >  *   ret
> >  *
> >  * direct caller:
> >  *   call foo+4
> >  *
> >  * indirect caller:
> >  *   lea foo(%rip), %r11
> >  *   ...
> >  *   movl $0x12345678, %r10d
> >  *   subl $16, %r11
> 
> A compiler would normally spell this:
> 
>     lea foo-16(%rip), %r11

Yeah, but the original lea is outside of the code we control, the kcfi
caller section we're rewriting gets to have %r11 as is.

> >  *   nop4
> >  *   call *%r11
> >
> > And lets take a random bhi function:
> >
> > +       .align 16
> > +SYM_INNER_LABEL(__bhi_args_0_1, SYM_L_LOCAL)
> > +       UNWIND_HINT_FUNC
> > +       cmovne %r10, %rdi
> > +       cmovne %r10, %rsi
> > +       ANNOTATE_UNRET_SAFE
> > +       ret
> > +       int3
> >
> > So the case you worry about is SUBL does *not* result in 0, but we
> > speculate JZ true and end up in CALL, and do CMOVne.
> >
> > Since we speculated Z, we must then also not do the CMOV, so the value
> > of R10 is irrelevant, it will not be used. The thing however is that
> > CMOV will unconditionally put a store dependency on the target register
> > (RDI, RSI in the above sequence) and as such any further speculative
> > code trying to use those registers will stall.
> 
> How does that help?
> 
> The write dependency doesn't stop a dependent load from executing in the
> shadow of a mispredicted branch.

I've been given to understand CMOVcc will kill any further speculation
using the target register. So by 'poisoning' all argument registers that
are involved with loads, we avoid any such load from happening during
speculation.



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 13/14] x86: BHI stubs
  2024-10-03 12:17             ` Peter Zijlstra
@ 2024-10-03 13:59               ` Andrew Cooper
  2024-10-14 17:50                 ` Constable, Scott D
  0 siblings, 1 reply; 60+ messages in thread
From: Andrew Cooper @ 2024-10-03 13:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Josh Poimboeuf, x86, linux-kernel, alyssa.milburn,
	scott.d.constable, joao, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

On 03/10/2024 1:17 pm, Peter Zijlstra wrote:
> On Tue, Oct 01, 2024 at 12:20:02PM +0100, Andrew Cooper wrote:
>> On 01/10/2024 12:03 pm, Peter Zijlstra wrote:
>>>  *   nop4
>>>  *   call *%r11
>>>
>>> And lets take a random bhi function:
>>>
>>> +       .align 16
>>> +SYM_INNER_LABEL(__bhi_args_0_1, SYM_L_LOCAL)
>>> +       UNWIND_HINT_FUNC
>>> +       cmovne %r10, %rdi
>>> +       cmovne %r10, %rsi
>>> +       ANNOTATE_UNRET_SAFE
>>> +       ret
>>> +       int3
>>>
>>> So the case you worry about is SUBL does *not* result in 0, but we
>>> speculate JZ true and end up in CALL, and do CMOVne.
>>>
>>> Since we speculated Z, we must then also not do the CMOV, so the value
>>> of R10 is irrelevant, it will not be used. The thing however is that
>>> CMOV will unconditionally put a store dependency on the target register
>>> (RDI, RSI in the above sequence) and as such any further speculative
>>> code trying to use those registers will stall.
>> How does that help?
>>
>> The write dependency doesn't stop a dependent load from executing in the
>> shadow of a mispredicted branch.
> I've been given to understand CMOVcc will kill any further speculation
> using the target register. So by 'poisoning' all argument registers that
> are involved with loads, we avoid any such load from happening during
> speculation.

IANAPA (I am not a pipeline architect), but AIUI,

CMOVcc establishes a data dependency between flags and the destination
register that doesn't exist in the pipeline if you'd used a conditional
branch instead.

It does prevent a dependent load from executing before the CMOVcc has
executed.  But it does not stop that load from executing speculatively
eventually.

So, given the following case:

* SUB is/will results nonzero (ZF=0, %r10=nonzero)
* JZ predicted taken, despite (ZF=0)

we call __bhi_args_XXX wherein:

* CMOVNZ blocks until SUB executes (flags dependency)
* CMOVNZ eventually executes, and because ZF=0, it really does write
%r10 over the target registers

and then we enter the function with all pointers containing the nonzero
residual from the hash check.

Now, because it's a SUBL, the result is < 2^32, a straight deference of
one of these pointers will be blocked by SMAP (noone cares about 32bit,
or pre-SMAP hardware, right?)

Forward references from the pointers will be safe (assuming SIB doesn't
reach the canonical boundary), but backward references may wrap around
back into the kernel space.  These will not be blocked by SMAP and will
spill their secrets if suitably provoked.

~Andrew

^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: [PATCH 13/14] x86: BHI stubs
  2024-10-03 13:59               ` Andrew Cooper
@ 2024-10-14 17:50                 ` Constable, Scott D
  2024-10-14 21:54                   ` Andrew Cooper
  0 siblings, 1 reply; 60+ messages in thread
From: Constable, Scott D @ 2024-10-14 17:50 UTC (permalink / raw)
  To: andrew.cooper3@citrix.com, Peter Zijlstra
  Cc: Josh Poimboeuf, x86@kernel.org, linux-kernel@vger.kernel.org,
	Milburn, Alyssa, joao@overdrivepizza.com,
	jose.marchesi@oracle.com, hjl.tools@gmail.com,
	ndesaulniers@google.com, samitolvanen@google.com,
	nathan@kernel.org, ojeda@kernel.org, kees@kernel.org,
	alexei.starovoitov@gmail.com

Hello Andrew,

Your observation is valid. If we assume that the hashing function used by FineIBT is uniformly distributed, then the distribution of hashes at the call site and at the call target is [0,2^32-1]. The difference of the two hashes computed in R10 will have the same distribution because of wrap-around, and the mean of this distribution is 2^31-1. Therefore, to reasonably bypass the proposed mitigation, I believe an attacker would need the hardened pointer to be added/subtracted to/from an attacker-controlled 64-bit value, or an attacker-controlled 32-bit value scaled by 2, 4, or 8. Therefore, I think it would be reasonable to additionally apply the CMOV hardening to any 32-/64-bit integral parameters, including enums. I scanned the kernel (Ubuntu noble 6.8 config) and found that 77% of parameters to indirect call targets are pointers (which we already harden) and less than 20% are 32-/64-bit integrals and enums.

I think that this proposal would also address some other potential corner cases, such as:
- an attacker-controlled 32-/64-bit attacker-controlled integral parameter is used to index into a fixed-address array
- an attacker-controlled 64-bit attacker-controlled integral parameter is cast into a pointer

Does this proposal address your concern?

Thanks and regards,

Scott Constable

>On 03/10/2024 1:17 pm, Peter Zijlstra wrote:
>> On Tue, Oct 01, 2024 at 12:20:02PM +0100, Andrew Cooper wrote:
>>> On 01/10/2024 12:03 pm, Peter Zijlstra wrote:
>>>>  *   nop4
>>>>  *   call *%r11
>>>>
>>>> And lets take a random bhi function:
>>>>
>>>> +       .align 16
>>>> +SYM_INNER_LABEL(__bhi_args_0_1, SYM_L_LOCAL)
>>>> +       UNWIND_HINT_FUNC
>>>> +       cmovne %r10, %rdi
>>>> +       cmovne %r10, %rsi
>>>> +       ANNOTATE_UNRET_SAFE
>>>> +       ret
>>>> +       int3
>>>>
>>>> So the case you worry about is SUBL does *not* result in 0, but we 
>>>> speculate JZ true and end up in CALL, and do CMOVne.
>>>>
>>>> Since we speculated Z, we must then also not do the CMOV, so the 
>>>> value of R10 is irrelevant, it will not be used. The thing however 
>>>> is that CMOV will unconditionally put a store dependency on the 
>>>> target register (RDI, RSI in the above sequence) and as such any 
>>>> further speculative code trying to use those registers will stall.
>>> How does that help?
>>>
>>> The write dependency doesn't stop a dependent load from executing in 
>>> the shadow of a mispredicted branch.
>> I've been given to understand CMOVcc will kill any further speculation 
>> using the target register. So by 'poisoning' all argument registers 
>> that are involved with loads, we avoid any such load from happening 
>> during speculation.

> IANAPA (I am not a pipeline architect), but AIUI,

> CMOVcc establishes a data dependency between flags and the destination register that doesn't exist in the pipeline if you'd used a conditional branch instead.

> It does prevent a dependent load from executing before the CMOVcc has executed.  But it does not stop that load from executing speculatively eventually.

> So, given the following case:

> * SUB is/will results nonzero (ZF=0, %r10=nonzero)
> * JZ predicted taken, despite (ZF=0)

> we call __bhi_args_XXX wherein:

> * CMOVNZ blocks until SUB executes (flags dependency)
> * CMOVNZ eventually executes, and because ZF=0, it really does write
> %r10 over the target registers

> and then we enter the function with all pointers containing the nonzero residual from the hash check.

> Now, because it's a SUBL, the result is < 2^32, a straight deference of one of these pointers will be blocked by SMAP (noone cares about 32bit, or pre-SMAP hardware, right?)

> Forward references from the pointers will be safe (assuming SIB doesn't reach the canonical boundary), but backward references may wrap around back into the kernel space.  These will not be blocked by SMAP and will spill their secrets if suitably provoked.

> ~Andrew

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 13/14] x86: BHI stubs
  2024-10-14 17:50                 ` Constable, Scott D
@ 2024-10-14 21:54                   ` Andrew Cooper
  2024-10-21 15:06                     ` Constable, Scott D
  0 siblings, 1 reply; 60+ messages in thread
From: Andrew Cooper @ 2024-10-14 21:54 UTC (permalink / raw)
  To: Constable, Scott D, Peter Zijlstra
  Cc: Josh Poimboeuf, x86@kernel.org, linux-kernel@vger.kernel.org,
	Milburn, Alyssa, joao@overdrivepizza.com,
	jose.marchesi@oracle.com, hjl.tools@gmail.com,
	ndesaulniers@google.com, samitolvanen@google.com,
	nathan@kernel.org, ojeda@kernel.org, kees@kernel.org,
	alexei.starovoitov@gmail.com

On 14/10/2024 6:50 pm, Constable, Scott D wrote:
> Hello Andrew,
>
> Your observation is valid. If we assume that the hashing function used by FineIBT is uniformly distributed, then the distribution of hashes at the call site and at the call target is [0,2^32-1]. The difference of the two hashes computed in R10 will have the same distribution because of wrap-around, and the mean of this distribution is 2^31-1. Therefore, to reasonably bypass the proposed mitigation, I believe an attacker would need the hardened pointer to be added/subtracted to/from an attacker-controlled 64-bit value, or an attacker-controlled 32-bit value scaled by 2, 4, or 8. Therefore, I think it would be reasonable to additionally apply the CMOV hardening to any 32-/64-bit integral parameters, including enums. I scanned the kernel (Ubuntu noble 6.8 config) and found that 77% of parameters to indirect call targets are pointers (which we already harden) and less than 20% are 32-/64-bit integrals and enums.
>
> I think that this proposal would also address some other potential corner cases, such as:
> - an attacker-controlled 32-/64-bit attacker-controlled integral parameter is used to index into a fixed-address array
> - an attacker-controlled 64-bit attacker-controlled integral parameter is cast into a pointer
>
> Does this proposal address your concern?

Hello,

Thankyou for the analysis, and I'm glad I'm not just clutching at straws.

However, I'm not sure if extending this to other cases works very well. 
While the second point is probably easy for the compiler to figure out,
the former is looking a little bit more like a halting problem.

One key aspect is "how far can speculation continue beyond a
mispredicted Jcc", but it occurs to me since the last email that there
is no answer that Intel will give here.  It is uarch dependent and
expected to increase on future parts, so safety wise we must assume
infinite.

And infinite is no good, so we must reason about "good enough".

My gut feeling is that blindly using the residual from the hash check
isn't good enough.  7 years of  speculation fixes have shown that the
researchers are constantly proving "this will be good enough" wrong.

So, instead of simply using the residual, why don't we explicitly set
%r10 to a known value?

Because we need to preserve flags from original hash check, we can't use
any of the simple zeroing idioms, but we could use MOV $0, %r10 before
the CMOVs targetting the pointer parameters.

But, if we're using a long-ish encoding anyway, why not MOV $GB(2)-1, %r10 ?

This way, in the bad speculation path we'll set all pointers to 2G,
which removes most of the risk with backwards references, and makes the
behaviour invariant of the hash residual (which itself reduces the
opportunities to leak the hash value).

So I suppose the real question is whether one extra MOV is acceptable,
and is it good enough?  My gut feeling is yes to both.

To the extra cases, they can of course be added if the compiler support
isn't too horrible, independently of the extra MOV.  But, if 77% of
parameters to indirect functions are pointers anyway, isn't it work
considering CMOV-ing all parameter registers and turning the 57 stubs
into just 6, and improve I$ locality?

~Andrew

^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: [PATCH 13/14] x86: BHI stubs
  2024-10-14 21:54                   ` Andrew Cooper
@ 2024-10-21 15:06                     ` Constable, Scott D
  2024-10-29  5:59                       ` Joao Moreira
  0 siblings, 1 reply; 60+ messages in thread
From: Constable, Scott D @ 2024-10-21 15:06 UTC (permalink / raw)
  To: andrew.cooper3@citrix.com, Peter Zijlstra
  Cc: Josh Poimboeuf, x86@kernel.org, linux-kernel@vger.kernel.org,
	Milburn, Alyssa, joao@overdrivepizza.com,
	jose.marchesi@oracle.com, hjl.tools@gmail.com,
	ndesaulniers@google.com, samitolvanen@google.com,
	nathan@kernel.org, ojeda@kernel.org, kees@kernel.org,
	alexei.starovoitov@gmail.com

Hello Andrew,

My understanding of the FineIBT approach is that the hash values are not intended to be secret, and therefore leaking these hash values would not expose a new risk. Joao is also on this thread and knows a lot more about FineIBT than I do, so he can chime in if my reasoning is unsound.

Our intent in using the hash check residual as the poison is to overwrite a potentially attacker-controllable value with a fixed value that the attacker cannot control. But the points you raise have made us revisit this assumption. The residual from the hash check does give an attacker a small degree of control, because the attacker may be able to trigger a speculative mis-prediction from several different indirect call sites to a given indirect call target. If those call sites have different hash values, then the residual will differ accordingly. But this does not amount to arbitrary control over the residual. For example, if the attacker wants to poison the target's registers with a specific value (e.g., 0xdeadbeef) that will cause an out-of-bounds read to a desired location in kernel memory (e.g., fixed_addr+0xdeadbeef), then the attacker will need to identify an indirect call site with a 32-bit hash H1 such that the target's hash H2 satisfies H1-H2=0xdeadbeef. This scenario does not seem to be more hazardous than other existing Spectre v1 gadgets that likely exist throughout the kernel, which may leak some data at some specific addresses but without giving an attacker arbitrary control to leak gobs of kernel memory.

We do agree with your suggestion that all of the arguments should be poisoned, not just the pointers. For example, there might be some indirect call targets that look like:

void foo(unsigned short i) {
    unsigned long x = fixed_address_u64_array[i];
    // code that leaks x
}

In a previous email I had suggested that it might not be necessary to poison some arguments, including 16-bit values. Then code such as this (if it exists) could expose up to 512K of kernel memory above fixed_address_u64_array. So your concern about poisoning a proper subset of the arguments is valid, and we appreciate that your suggestion would also simplify the implementation by requiring fewer stubs.

Regards,

Scott Constable

> Hello,
>
> Thankyou for the analysis, and I'm glad I'm not just clutching at straws.
>
> However, I'm not sure if extending this to other cases works very well. While the second point is probably easy for the compiler to figure out, the former is looking a little bit more like a halting problem.
>
> One key aspect is "how far can speculation continue beyond a mispredicted Jcc", but it occurs to me since the last email that there is no answer that Intel will give here.  It is uarch dependent and expected to increase on future parts, so safety wise we must assume infinite.
>
> And infinite is no good, so we must reason about "good enough".
>
> My gut feeling is that blindly using the residual from the hash check isn't good enough.  7 years of  speculation fixes have shown that the researchers are constantly proving "this will be good enough" wrong.
>
> So, instead of simply using the residual, why don't we explicitly set
> %r10 to a known value?
>
> Because we need to preserve flags from original hash check, we can't use any of the simple zeroing idioms, but we could use MOV $0, %r10 before the CMOVs targetting the pointer parameters.
>
> But, if we're using a long-ish encoding anyway, why not MOV $GB(2)-1, %r10 ?
>
> This way, in the bad speculation path we'll set all pointers to 2G, which removes most of the risk with backwards references, and makes the behaviour invariant of the hash residual (which itself reduces the opportunities to leak the hash value).
>
> So I suppose the real question is whether one extra MOV is acceptable, and is it good enough?  My gut feeling is yes to both.
>
>
> To the extra cases, they can of course be added if the compiler support isn't too horrible, independently of the extra MOV.  But, if 77% of parameters to indirect functions are pointers anyway, isn't it work considering CMOV-ing all parameter registers and turning the 57 stubs into just 6, and improve I$ locality?
>
> ~Andrew

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 13/14] x86: BHI stubs
  2024-10-21 15:06                     ` Constable, Scott D
@ 2024-10-29  5:59                       ` Joao Moreira
  0 siblings, 0 replies; 60+ messages in thread
From: Joao Moreira @ 2024-10-29  5:59 UTC (permalink / raw)
  To: Constable, Scott D
  Cc: andrew.cooper3@citrix.com, Peter Zijlstra, Josh Poimboeuf,
	x86@kernel.org, linux-kernel@vger.kernel.org, Milburn, Alyssa,
	jose.marchesi@oracle.com, hjl.tools@gmail.com,
	ndesaulniers@google.com, samitolvanen@google.com,
	nathan@kernel.org, ojeda@kernel.org, kees@kernel.org,
	alexei.starovoitov@gmail.com

On Mon, Oct 21, 2024 at 8:06 AM Constable, Scott D
<scott.d.constable@intel.com> wrote:
>
> Hello Andrew,
>
> My understanding of the FineIBT approach is that the hash values are not intended to be secret, and therefore leaking these hash values would not expose a new risk. Joao is also on this thread and knows a lot more about FineIBT than I do, so he can chime in if my reasoning is unsound.

Technically the hashes are not a secret indeed. With that said, I
think it was Kees who submitted a patch that randomizes the hash
values at boot time, as a security in depth / probabilistic measure
against an attacker being able to generate executable valid targets.

Tks,
Joao

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 14/14] x86/fineibt: Add FineIBT+BHI mitigation
  2024-09-27 19:48 [PATCH 00/14] x86/ibt: FineIBT-BHI Peter Zijlstra
                   ` (12 preceding siblings ...)
  2024-09-27 19:49 ` [PATCH 13/14] x86: BHI stubs Peter Zijlstra
@ 2024-09-27 19:49 ` Peter Zijlstra
  2024-09-28  1:50   ` Josh Poimboeuf
  13 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2024-09-27 19:49 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, peterz, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jpoimboe, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

Due to FineIBT weakness, add an additional mitigation for BHI.

Use the 5 bytes of the nop at -1 and the 4 byte poison to squirrel in
a BHI mitigation.

Relies on clang-cfi to emit an additional piece of magic in the kCFI
pre-amble, identifying which function arguments are pointers.

An additional u8 (next to the existing u32) is emitted like:

	movl	0x12345678, %eax	// CFI hash
	movb	0x12, %al		// CFI args

This u8 is a bitmask, where BIT(n) indicates the n'th argument is a
pointer, notably the 6 possible argument registers are:

	rdi, rsi, rdx, rcx, r8 and r9

Single bit can be inlined, while 2-4 bits have combinatoric stubs with
the required magic in. Anything more will fall back to
__bhi_args_all which additionally poisons %rsp for good measure, in
case things overflowed to the stack.


  FineIBT+				FineIBT+BHI

  __cfi_foo:				__cfi_foo:
    endbr64				  endbr64
    subl	$0x12345678, %r10d        subl	$0x12345678, %r10d
    jz	foo+4                             jz	+2
    ud2                                   ud2
    nop                                   call	__bhi_args_foo
  foo:                                  foo+4:
    ud1 0x0(%eax), %eax
    ...                                   ...

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/cfi.h    |    1 
 arch/x86/kernel/alternative.c |   82 ++++++++++++++++++++++++++++++++++++++----
 arch/x86/net/bpf_jit_comp.c   |   16 ++++++--
 tools/objtool/check.c         |   16 ++++----
 4 files changed, 98 insertions(+), 17 deletions(-)

--- a/arch/x86/include/asm/cfi.h
+++ b/arch/x86/include/asm/cfi.h
@@ -97,6 +97,7 @@ enum cfi_mode {
 	CFI_OFF,	/* Taditional / IBT depending on .config */
 	CFI_KCFI,	/* Optionally CALL_PADDING, IBT, RETPOLINE */
 	CFI_FINEIBT,	/* see arch/x86/kernel/alternative.c */
+	CFI_FINEIBT_BHI,
 };
 
 extern enum cfi_mode cfi_mode;
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -932,7 +932,31 @@ __noendbr bool is_endbr(u32 *val)
 	if (get_kernel_nofault(endbr, val))
 		return false;
 
-	return __is_endbr(endbr);
+	if (__is_endbr(endbr))
+		return true;
+
+#if defined(CONFIG_FINEIBT) && defined(CONFIG_X86_KERNEL_IBT_PLUS)
+	if (cfi_mode != CFI_FINEIBT_BHI)
+		return false;
+
+	/*
+	 * The BHI clobbers 'replace' the ENDBR poison, but dynamic call
+	 * patching (static_call, kprobes, etc..) still need to be able
+	 * to identify and skip the foo()+0 'endbr'.
+	 */
+
+	/* REX CMOVNE, see bhi_args_1() */
+	if ((endbr & 0xc2fffff9) == 0xc2450f49)
+		return true;
+
+	/* CALL __bhi_args_* */
+	void *dest = (void *)val + 4 + (s32)endbr;
+	if (dest >= (void *)__bhi_args_6c1 &&
+	    dest <= (void *)__bhi_args_all)
+		return true;
+#endif
+
+	return false;
 }
 
 static void poison_cfi(void *addr);
@@ -1190,6 +1214,8 @@ static __init int cfi_parse_cmdline(char
 			cfi_mode = CFI_KCFI;
 		} else if (!strcmp(str, "fineibt")) {
 			cfi_mode = CFI_FINEIBT;
+		} else if (IS_ENABLED(CONFIG_X86_KERNEL_IBT_PLUS) && !strcmp(str, "fineibt+bhi")) {
+			cfi_mode = CFI_FINEIBT_BHI;
 		} else if (!strcmp(str, "norand")) {
 			cfi_rand = false;
 		} else {
@@ -1208,10 +1234,9 @@ early_param("cfi", cfi_parse_cmdline);
  *
  * __cfi_\func:					__cfi_\func:
  *	movl   $0x12345678,%eax		// 5	     endbr64			// 4
- *	nop					     subl   $0x12345678,%r10d   // 7
+ *	movb   $0x12,%al		// 2	     subl   $0x12345678,%r10d   // 7
  *	nop					     jz     1f			// 2
  *	nop					     ud2			// 2
- *	nop					1:   nop			// 1
  *	nop
  *	nop
  *	nop
@@ -1279,6 +1304,17 @@ static u32 decode_preamble_hash(void *ad
 	return 0; /* invalid hash value */
 }
 
+static u8 decode_preamble_args(void *addr)
+{
+	u8 *p = addr;
+
+	/* b0 12	movb $0x12, %al */
+	if (p[5] == 0xb0)
+		return p[6];
+
+	return 0xff; /* invalid args */
+}
+
 static u32 decode_caller_hash(void *addr)
 {
 	u8 *p = addr;
@@ -1371,6 +1407,7 @@ static int cfi_rewrite_preamble(s32 *sta
 	for (s = start; s < end; s++) {
 		void *addr = (void *)s + *s;
 		u32 hash;
+		u8 args;
 
 		/*
 		 * When the function doesn't start with ENDBR the compiler will
@@ -1385,11 +1422,25 @@ static int cfi_rewrite_preamble(s32 *sta
 			 addr, addr, 5, addr))
 			return -EINVAL;
 
+		args = decode_preamble_args(addr);
+
 		text_poke_early(addr, fineibt_preamble_start, fineibt_preamble_size);
 		WARN_ON(*(u32 *)(addr + fineibt_preamble_hash) != 0x12345678);
 		text_poke_early(addr + fineibt_preamble_hash, &hash, 4);
 
 		*(u8 *)(addr + fineibt_preamble_jccd8) += 4;
+
+		if (cfi_mode != CFI_FINEIBT_BHI)
+			continue;
+
+		WARN_ONCE(args == 0xff, "no CFI args found at %pS %px %*ph\n",
+			  addr, addr, 7, addr);
+
+		/*
+		 * Stash the ARGS byte in the NOP at __cfi_foo+15, see
+		 * cfi_rewrite_endbr().
+		 */
+		*(u8 *)(addr + fineibt_preamble_size - 1) = args;
 	}
 
 	return 0;
@@ -1401,11 +1452,26 @@ static void cfi_rewrite_endbr(s32 *start
 
 	for (s = start; s < end; s++) {
 		void *addr = (void *)s + *s;
+		u8 args;
 
 		if (!is_endbr(addr + 16))
 			continue;
 
-		poison_endbr(addr + 16);
+		if (cfi_mode != CFI_FINEIBT_BHI) {
+			poison_endbr(addr + 16);
+			continue;
+		}
+
+		/* recover the args byte */
+		args = *(u8 *)(addr + fineibt_preamble_size - 1);
+		*(u8 *)(addr + fineibt_preamble_size - 1) = BYTES_NOP1;
+		if (args) {
+			/* only skip the UD2 */
+			*(u8 *)(addr + fineibt_preamble_jccd8) = 2;
+
+			/* write BHI clobber in the 5 bytes that hold: nop + poison */
+			bhi_args(args, addr + fineibt_preamble_size - 1);
+		}
 	}
 }
 
@@ -1506,6 +1572,7 @@ static void __apply_fineibt(s32 *start_r
 		return;
 
 	case CFI_FINEIBT:
+	case CFI_FINEIBT_BHI:
 		/* place the FineIBT preamble at func()-16 */
 		ret = cfi_rewrite_preamble(start_cfi, end_cfi);
 		if (ret)
@@ -1519,8 +1586,10 @@ static void __apply_fineibt(s32 *start_r
 		/* now that nobody targets func()+0, remove ENDBR there */
 		cfi_rewrite_endbr(start_cfi, end_cfi);
 
-		if (builtin)
-			pr_info("Using FineIBT CFI\n");
+		if (builtin) {
+			pr_info("Using FineIBT%s CFI\n",
+				cfi_mode == CFI_FINEIBT_BHI ? "+BHI" : "");
+		}
 		return;
 
 	default:
@@ -1548,6 +1617,7 @@ static void poison_cfi(void *addr)
 	 */
 	switch (cfi_mode) {
 	case CFI_FINEIBT:
+	case CFI_FINEIBT_BHI:
 		/*
 		 * FineIBT prefix should start with an ENDBR.
 		 */
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -401,10 +401,17 @@ static void emit_fineibt(u8 **pprog, u32
 
 	EMIT_ENDBR();
 	EMIT3_off32(0x41, 0x81, 0xea, hash);		/* subl $hash, %r10d	*/
-	EMIT2(0x74, 0x07);				/* jz.d8 +7		*/
-	EMIT2(0x0f, 0x0b);				/* ud2			*/
-	EMIT1(0x90);					/* nop			*/
-	EMIT_ENDBR_POISON();
+	if (cfi_mode == CFI_FINEIBT_BHI) {
+		EMIT2(0x74, 0x02);			/* jz.d8 +2		*/
+		EMIT2(0x0f, 0x0b);			/* ud2			*/
+		EMIT1(0x2e);				/* cs			*/
+		EMIT4(0x49, 0x0f, 0x45, 0xfa);		/* cmovne %r10, %rdi	*/
+	} else {
+		EMIT2(0x74, 0x07);			/* jz.d8 +7		*/
+		EMIT2(0x0f, 0x0b);			/* ud2			*/
+		EMIT1(0x90);				/* nop			*/
+		EMIT_ENDBR_POISON();
+	}
 
 	*pprog = prog;
 }
@@ -438,6 +445,7 @@ static void emit_cfi(u8 **pprog, u32 has
 
 	switch (cfi_mode) {
 	case CFI_FINEIBT:
+	case CFI_FINEIBT_BHI:
 		emit_fineibt(&prog, hash);
 		break;
 
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -4397,11 +4397,9 @@ static int validate_ibt_insn(struct objt
 			continue;
 
 		off = reloc->sym->offset;
-		if (reloc_type(reloc) == R_X86_64_PC32 ||
-		    reloc_type(reloc) == R_X86_64_PLT32)
-			off += arch_dest_reloc_offset(reloc_addend(reloc));
-		else
-			off += reloc_addend(reloc);
+		off += reloc_addend(reloc);
+		if (arch_pc_relative_reloc(reloc))
+			off = arch_dest_reloc_offset(off);
 
 		dest = find_insn(file, reloc->sym->sec, off);
 		if (!dest)
@@ -4456,10 +4454,14 @@ static int validate_ibt_insn(struct objt
 static int validate_ibt_data_reloc(struct objtool_file *file,
 				   struct reloc *reloc)
 {
+	long offset = reloc->sym->offset;
 	struct instruction *dest;
 
-	dest = find_insn(file, reloc->sym->sec,
-			 reloc->sym->offset + reloc_addend(reloc));
+	offset += reloc_addend(reloc);
+	if (reloc_type(reloc) == R_X86_64_PLT32) // the fuck ?!
+		offset = arch_dest_reloc_offset(offset);
+
+	dest = find_insn(file, reloc->sym->sec, offset);
 	if (!dest)
 		return 0;
 



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 14/14] x86/fineibt: Add FineIBT+BHI mitigation
  2024-09-27 19:49 ` [PATCH 14/14] x86/fineibt: Add FineIBT+BHI mitigation Peter Zijlstra
@ 2024-09-28  1:50   ` Josh Poimboeuf
  2024-09-28 13:16     ` Peter Zijlstra
  0 siblings, 1 reply; 60+ messages in thread
From: Josh Poimboeuf @ 2024-09-28  1:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

On Fri, Sep 27, 2024 at 09:49:10PM +0200, Peter Zijlstra wrote:
> @@ -1190,6 +1214,8 @@ static __init int cfi_parse_cmdline(char
>  			cfi_mode = CFI_KCFI;
>  		} else if (!strcmp(str, "fineibt")) {
>  			cfi_mode = CFI_FINEIBT;
> +		} else if (IS_ENABLED(CONFIG_X86_KERNEL_IBT_PLUS) && !strcmp(str, "fineibt+bhi")) {
> +			cfi_mode = CFI_FINEIBT_BHI;
>  		} else if (!strcmp(str, "norand")) {
>  			cfi_rand = false;
>  		} else {

Do we need to hook this in with bugs.c somehow so it skips the other BHI
mitigations?

-- 
Josh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 14/14] x86/fineibt: Add FineIBT+BHI mitigation
  2024-09-28  1:50   ` Josh Poimboeuf
@ 2024-09-28 13:16     ` Peter Zijlstra
  2024-10-28  5:45       ` Constable, Scott D
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2024-09-28 13:16 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: x86, linux-kernel, alyssa.milburn, scott.d.constable, joao,
	andrew.cooper3, jose.marchesi, hjl.tools, ndesaulniers,
	samitolvanen, nathan, ojeda, kees, alexei.starovoitov

On Fri, Sep 27, 2024 at 06:50:06PM -0700, Josh Poimboeuf wrote:
> On Fri, Sep 27, 2024 at 09:49:10PM +0200, Peter Zijlstra wrote:
> > @@ -1190,6 +1214,8 @@ static __init int cfi_parse_cmdline(char
> >  			cfi_mode = CFI_KCFI;
> >  		} else if (!strcmp(str, "fineibt")) {
> >  			cfi_mode = CFI_FINEIBT;
> > +		} else if (IS_ENABLED(CONFIG_X86_KERNEL_IBT_PLUS) && !strcmp(str, "fineibt+bhi")) {
> > +			cfi_mode = CFI_FINEIBT_BHI;
> >  		} else if (!strcmp(str, "norand")) {
> >  			cfi_rand = false;
> >  		} else {
> 
> Do we need to hook this in with bugs.c somehow so it skips the other BHI
> mitigations?

Yeah.. those didn't exist when I started this code :-) But yeah, once we
get to the point of doing this patch for real -- the compiler(s) have
the required features implemented properly and everyrhing, this should
be hooked up better.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: [PATCH 14/14] x86/fineibt: Add FineIBT+BHI mitigation
  2024-09-28 13:16     ` Peter Zijlstra
@ 2024-10-28  5:45       ` Constable, Scott D
  0 siblings, 0 replies; 60+ messages in thread
From: Constable, Scott D @ 2024-10-28  5:45 UTC (permalink / raw)
  To: Peter Zijlstra, Josh Poimboeuf
  Cc: x86@kernel.org, linux-kernel@vger.kernel.org, Milburn, Alyssa,
	joao@overdrivepizza.com, andrew.cooper3@citrix.com,
	jose.marchesi@oracle.com, hjl.tools@gmail.com,
	ndesaulniers@google.com, samitolvanen@google.com,
	nathan@kernel.org, ojeda@kernel.org, kees@kernel.org,
	alexei.starovoitov@gmail.com

> > On Fri, Sep 27, 2024 at 09:49:10PM +0200, Peter Zijlstra wrote:
> > > Due to FineIBT weakness, add an additional mitigation for BHI.
> > >
> > > Use the 5 bytes of the nop at -1 and the 4 byte poison to squirrel in a BHI mitigation.
> > >
> > > Relies on clang-cfi to emit an additional piece of magic in the kCFI pre-amble, identifying which function arguments are pointers.
> > >
> > > An additional u8 (next to the existing u32) is emitted like:
> > >
> > >	movl	0x12345678, %eax	// CFI hash
> > >	movb	0x12, %al		// CFI args
> > >
> > > This u8 is a bitmask, where BIT(n) indicates the n'th argument is a pointer, notably the 6 possible argument registers are:
> > >
> > >	rdi, rsi, rdx, rcx, r8 and r9
> > >
> > > Single bit can be inlined, while 2-4 bits have combinatoric stubs with the required magic in. Anything more will fall back to __bhi_args_all which additionally poisons %rsp for good measure, in case things overflowed to the stack.

Hi Peter,

Instead of:

	movl	0x12345678, %eax	// CFI hash
	movb	0x12, %al		// CFI args

Can we do this?

	movb	0x12, %al		// CFI args
	movl	0x12345678, %eax	// CFI hash

The latter ordering does not affect the kCFI hash's location, whereas the former ordering shifts the location of the kCFI hash backward by two bytes, which creates more work for the compiler.

Also, when I build LLVM and Linux with these patches, my kernel crashes. The root cause appears to be a call from bpf_dispatcher_nop_func:

   0xffffffff813e69f1 <+209>:   mov    %rbx,%rdi
   0xffffffff813e69f4 <+212>:   mov    $0x9b5328da,%r10d
   0xffffffff813e69fa <+218>:   sub    $0x10,%r11
   0xffffffff813e69fe <+222>:   nopl   0x0(%rax)
   0xffffffff813e6a02 <+226>:   call   *%r11     # This is the problematic call

to some function that I don't see in the kernel image, and that I suspect is being generated at runtime:

   0xffffffffa0000794:  int3                # The call instruction lands here...
   0xffffffffa0000795:  int3
   0xffffffffa0000796:  int3
   0xffffffffa0000797:  int3
   0xffffffffa0000798:  int3
   0xffffffffa0000799:  int3
   0xffffffffa000079a:  int3
   0xffffffffa000079b:  int3
   0xffffffffa000079c:  int3
   0xffffffffa000079d:  int3
   0xffffffffa000079e:  int3
   0xffffffffa000079f:  int3
   0xffffffffa00007a0:  int3
   0xffffffffa00007a1:  int3
   0xffffffffa00007a2:  int3
   0xffffffffa00007a3:  int3
   0xffffffffa00007a4:  endbr64                # ... but it should land here
   0xffffffffa00007a8:  sub    $0x9b5328da,%r10d
   0xffffffffa00007af:  je     0xffffffffa00007b3
   0xffffffffa00007b1:  ud2
   0xffffffffa00007b3:  cs cmovne %r10,%rdi
   0xffffffffa00007b8:  nopl   0x0(%rax,%rax,1)
   0xffffffffa00007bd:  nopl   (%rax)
   0xffffffffa00007c0:  push   %rbp
   0xffffffffa00007c1:  mov    %rsp,%rbp
   0xffffffffa00007c4:  endbr64

Regards,

Scott Constable

> On Fri, Sep 27, 2024 at 06:50:06PM -0700, Josh Poimboeuf wrote:
> > On Fri, Sep 27, 2024 at 09:49:10PM +0200, Peter Zijlstra wrote:
> > > @@ -1190,6 +1214,8 @@ static __init int cfi_parse_cmdline(char
> > >  			cfi_mode = CFI_KCFI;
> > >  		} else if (!strcmp(str, "fineibt")) {
> > >  			cfi_mode = CFI_FINEIBT;
> > > +		} else if (IS_ENABLED(CONFIG_X86_KERNEL_IBT_PLUS) && !strcmp(str, "fineibt+bhi")) {
> > > +			cfi_mode = CFI_FINEIBT_BHI;
> > >  		} else if (!strcmp(str, "norand")) {
> > >  			cfi_rand = false;
> > >  		} else {
> > 
> > Do we need to hook this in with bugs.c somehow so it skips the other 
> > BHI mitigations?
>
> Yeah.. those didn't exist when I started this code :-) But yeah, once we get to the point of doing this patch for real -- the compiler(s) have the required features implemented properly and everyrhing, this should be hooked up better.

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2024-11-05 10:40 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-27 19:48 [PATCH 00/14] x86/ibt: FineIBT-BHI Peter Zijlstra
2024-09-27 19:48 ` [PATCH 01/14] x86/cfi: Wreck things Peter Zijlstra
2024-09-27 23:15   ` Josh Poimboeuf
2024-09-28 13:31     ` Peter Zijlstra
2024-09-30 21:42       ` Josh Poimboeuf
2024-09-27 19:48 ` [PATCH 02/14] x86/boot: Mark start_secondary() with __noendbr Peter Zijlstra
2024-09-27 19:48 ` [PATCH 03/14] x86/alternative: Simplify callthunk patching Peter Zijlstra
2024-09-27 23:27   ` Josh Poimboeuf
2024-09-27 19:49 ` [PATCH 04/14] objtool/x86: Add .tail_call_sites Peter Zijlstra
2024-09-27 23:42   ` Josh Poimboeuf
2024-10-09 15:25     ` Peter Zijlstra
2024-10-10  4:55       ` Josh Poimboeuf
2024-09-27 19:49 ` [PATCH 05/14] objtool: Rename the skylake hack to --direct-call Peter Zijlstra
2024-09-27 19:49 ` [PATCH 06/14] x86/traps: Prepare for ENDBR poison UD1 usage Peter Zijlstra
2024-09-27 19:49 ` [PATCH 07/14] x86/ibt: Clean up is_endbr() Peter Zijlstra
2024-09-28  0:04   ` Josh Poimboeuf
2024-09-28 13:08     ` Peter Zijlstra
2024-09-29 17:32   ` Alexei Starovoitov
2024-09-30  8:30     ` Peter Zijlstra
2024-09-30  9:33       ` Peter Zijlstra
2024-09-30 16:43         ` Alexei Starovoitov
2024-09-30 20:58         ` Andrii Nakryiko
2024-09-27 19:49 ` [PATCH 08/14] x86/ibt: Clean up poison_endbr() Peter Zijlstra
2024-09-27 19:49 ` [PATCH 09/14] x86/ibt: Implement IBT+ Peter Zijlstra
2024-09-28  1:07   ` Josh Poimboeuf
2024-09-28 13:12     ` Peter Zijlstra
2024-09-29 17:38   ` Alexei Starovoitov
2024-09-30  8:23     ` Peter Zijlstra
2024-09-30 17:00       ` Alexei Starovoitov
2024-11-05 10:40   ` Peter Zijlstra
2024-09-27 19:49 ` [PATCH 10/14] x86/early_printk: Harden early_serial Peter Zijlstra
2024-09-27 19:49 ` [PATCH 11/14] llvm: kCFI pointer stuff Peter Zijlstra
2024-09-29 17:53   ` Alexei Starovoitov
2024-09-30  8:27     ` Peter Zijlstra
2024-09-30 16:59       ` Alexei Starovoitov
2024-10-01 10:21         ` Peter Zijlstra
2024-10-02 16:48           ` Alexei Starovoitov
2024-10-30  6:29   ` Constable, Scott D
2024-10-30 20:07     ` Constable, Scott D
2024-09-27 19:49 ` [PATCH 12/14] x86: Hacks for hacked up llvm Peter Zijlstra
2024-09-27 19:49 ` [PATCH 13/14] x86: BHI stubs Peter Zijlstra
2024-09-28  1:37   ` Josh Poimboeuf
2024-09-28 13:23     ` Peter Zijlstra
2024-09-30 21:30   ` Josh Poimboeuf
2024-09-30 21:46     ` Josh Poimboeuf
2024-09-30 22:23     ` Andrew Cooper
2024-09-30 22:38       ` Josh Poimboeuf
2024-09-30 22:52         ` Andrew Cooper
2024-10-01 11:03         ` Peter Zijlstra
2024-10-01 11:20           ` Andrew Cooper
2024-10-03 12:17             ` Peter Zijlstra
2024-10-03 13:59               ` Andrew Cooper
2024-10-14 17:50                 ` Constable, Scott D
2024-10-14 21:54                   ` Andrew Cooper
2024-10-21 15:06                     ` Constable, Scott D
2024-10-29  5:59                       ` Joao Moreira
2024-09-27 19:49 ` [PATCH 14/14] x86/fineibt: Add FineIBT+BHI mitigation Peter Zijlstra
2024-09-28  1:50   ` Josh Poimboeuf
2024-09-28 13:16     ` Peter Zijlstra
2024-10-28  5:45       ` Constable, Scott D

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox