[PATCH] x86/kcfi: Optimize call sequence

* [PATCH] x86/kcfi: Optimize call sequence
@ 2026-06-12  7:15 Peter Zijlstra
  2026-06-16 18:55 ` Borislav Petkov
  2026-06-16 20:47 ` David Laight
  0 siblings, 2 replies; 3+ messages in thread
From: Peter Zijlstra @ 2026-06-12  7:15 UTC (permalink / raw)
  To: x86; +Cc: linux-kernel, hpa, samitolvanen, kees, nathan, scott.d.constable


As noted in commit 85a2d4a890dc ("x86,ibt: Use UDB instead of 0xEA"), Jcc should
be assumed not-taken. However, the normal kCFI (ABI) emits the following sequence:

   movl	$(-hash), %r10d
   addl	-15(%r11), %r10d
   je 1f
   ud2
1: cs call __x86_indirect_thunk_r11

(when used in conjunction with -mretpoline-external-thunk).

Notably, the Jcc here is taken on every valid call (hash match), resulting in
lower throughput than would be ideal. Rewrite it at boot into the following
sequence:

   movl	$(-hash), %r10d
   addl	-15(%r11), %r10d
   jne . + 3
   test $0xd6, %al
   cs call __x86_indirect_thunk_r11

On a hash mismatch, the jne lands on the 0xd6 immediate byte of the test
instruction, which decodes as UDB and traps. On a match, execution falls
through the test instruction. The test clobbers eflags, but that is immaterial
because eflags is already changed by the preceding addl.
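
A minimal user-space sketch, not part of the patch, that checks the byte-level
claim above. The je/ud2 bytes (74 02, 0f 0b) are the standard x86-64 encodings,
the replacement bytes are the udne[] array from the diff, and the names
kcfi_old_tail, kcfi_new_tail, KCFI_JCC_OFFSET, KCFI_CALL_OFFSET,
jcc_rel8_target() and kcfi_new_mismatch_byte() are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Tail of the compiler-emitted caller sequence: je .+4; ud2 */
static const uint8_t kcfi_old_tail[] = { 0x74, 0x02, 0x0f, 0x0b };
/* Tail after boot-time patching: jne .+3; test $0xd6, %al (udne[] in the patch) */
static const uint8_t kcfi_new_tail[] = { 0x75, 0x01, 0xa8, 0xd6 };

/* Both tails must have the same length so the call does not move. */
_Static_assert(sizeof(kcfi_old_tail) == sizeof(kcfi_new_tail),
	       "old and new tails differ in size");

/*
 * Offset of the Jcc from the start of the sequence (illustrative names):
 * movl $imm32, %r10d is 6 bytes, addl -15(%r11), %r10d is 4 bytes.
 */
enum { KCFI_JCC_OFFSET = 6 + 4 };
enum { KCFI_CALL_OFFSET = KCFI_JCC_OFFSET + 4 };

/* Offset a short Jcc (opcode + rel8) located at jcc_off jumps to when taken. */
static long jcc_rel8_target(const uint8_t *jcc, long jcc_off)
{
	return jcc_off + 2 + (int8_t)jcc[1];
}

/* Byte at which the new sequence resumes decoding on a hash mismatch. */
static uint8_t kcfi_new_mismatch_byte(void)
{
	long target = jcc_rel8_target(kcfi_new_tail, KCFI_JCC_OFFSET);

	assert(target >= KCFI_JCC_OFFSET && target < KCFI_CALL_OFFSET);
	return kcfi_new_tail[target - KCFI_JCC_OFFSET];
}
```

Under these assumptions the old je targets offset 14 (the call), while the new
jne targets offset 13, the 0xd6 byte. That is one byte past offset 12, where
the ud2 used to start, which would be consistent with the "addr -= 1"
adjustment in cfi.c.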

Intel recommends the FineIBT sequence on platforms that support IBT; older
platforms are still widely used and would benefit from this.

An earlier PoC was benchmarked by Scott:

Indirect branch miss rate (br_misp_retired.indirect:k / br_inst_retired.indirect:k)

BHI_DIS_S=1

  Benchmark            Baseline             IBT            kCFI        kCFI-opt
  -----------------------------------------------------------------------------
  iperf3 UDP           0.103764        0.103180        0.104311        0.102945
  hackbench            0.000885        0.000876        0.001996        0.000826
  lmbench syscall      0.005089        0.004486        0.016990        0.005852
  lmbench fork+exit    0.018454        0.019176        0.031085        0.015153
  lmbench fork+exec    0.017147        0.021613        0.029129        0.016337
  redis                0.032220        0.032655        0.045540        0.027946
  nginx+wrk            0.109033        0.112765        0.132557        0.102417
  fio randread         0.009704        0.009620        0.008548        0.000962
  fio seqwrite         0.006927        0.006707        0.019372        0.004590
  kbuild               0.056748        0.057324        0.064640        0.048136

BHI_DIS_S=0

  Benchmark            Baseline             IBT            kCFI        kCFI-opt
  -----------------------------------------------------------------------------
  iperf3 UDP           0.000077        0.000106        0.000186        0.000073
  hackbench            0.000123        0.000132        0.000367        0.000097
  lmbench syscall      0.023259        0.018319        0.040903        0.012772
  lmbench fork+exit    0.011494        0.011887        0.029079        0.016415
  lmbench fork+exec    0.037782        0.038994        0.055378        0.026381
  redis                0.002481        0.003152        0.017073        0.000184
  nginx+wrk            0.015478        0.016266        0.033637        0.000268
  fio randread         0.009836        0.007949        0.007096        0.000143
  fio seqwrite         0.014587        0.014165        0.041792        0.002157
  kbuild               0.055774        0.055249        0.062590        0.046546
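
For reading the tables, a small sketch (not from the original posting) of the
relative change between the kCFI and kCFI-opt columns. The helper name
miss_rate_change_pct() and the struct are hypothetical; the two sample rows are
copied from the tables above:

```c
#include <assert.h>

/* One table row, reduced to the two columns being compared. */
struct miss_row {
	const char *name;
	double kcfi;		/* miss rate with the stock kCFI sequence */
	double kcfi_opt;	/* miss rate with the optimized sequence */
};

/* Rows copied from the tables above. */
static const struct miss_row sample_rows[] = {
	{ "lmbench syscall (BHI_DIS_S=1)", 0.016990, 0.005852 },
	{ "redis (BHI_DIS_S=0)",           0.017073, 0.000184 },
};

/* Relative change in percent; negative means fewer mispredictions. */
static double miss_rate_change_pct(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}
```

Calling miss_rate_change_pct(row.kcfi, row.kcfi_opt) on the two sample rows
gives a reduction of about 66% for lmbench syscall and about 99% for redis.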

Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: hpa@zytor.com
Suggested-by: Scott D Constable <scott.d.constable@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/alternative.c |   11 ++++++++++-
 arch/x86/kernel/cfi.c         |    6 ++++++
 2 files changed, 16 insertions(+), 1 deletion(-)

--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1356,6 +1356,10 @@ early_param("cfi", cfi_parse_cmdline);
  *  "Make conditional jumps most often not taken: The efficiency and throughput
  *   for not-taken branches is better than for taken branches on most
  *   processors. Therefore, it is good to place the most frequent branch first"
+ *
+ * NOTE: Update the kCFI caller sequence to make use of this observation.
+ * Replace the "je 1f; ud2" sequence with "jne +1; test $0xd6, %al". This
+ * clobbers flags, but those are clobbered by the hash test anyway.
  */
 
 /*
@@ -1518,9 +1522,10 @@ static int cfi_disable_callers(s32 *star
 static int cfi_enable_callers(s32 *start, s32 *end)
 {
 	/*
-	 * Re-enable kCFI, undo what cfi_disable_callers() did.
+	 * Re-enable (and update) kCFI, undo what cfi_disable_callers() did.
 	 */
 	const u8 mov[] = { 0x41, 0xba };
+	const u8 udne[] = { 0x75, 0x01, 0xa8, 0xd6 };
 	s32 *s;
 
 	for (s = start; s < end; s++) {
@@ -1532,6 +1537,10 @@ static int cfi_enable_callers(s32 *start
 		if (!hash) /* nocfi callers */
 			continue;
 
+		/*
+		 * See the kCFI/FineIBT comment above -- update note.
+		 */
+		text_poke_early(addr + 10, udne, 4);
 		text_poke_early(addr, mov, 2);
 	}
 
--- a/arch/x86/kernel/cfi.c
+++ b/arch/x86/kernel/cfi.c
@@ -72,6 +72,12 @@ enum bug_trap_type handle_cfi_failure(st
 
 	switch (cfi_mode) {
 	case CFI_KCFI:
+		/*
+		 * The updated kCFI sequence has "test $0xd6, %al" instead of
+		 * "ud2", adjust the offset.
+		 */
+		addr -= 1;
+
 		if (!is_cfi_trap(addr))
 			return BUG_TRAP_TYPE_NONE;
 


* Re: [PATCH] x86/kcfi: Optimize call sequence
  2026-06-12  7:15 [PATCH] x86/kcfi: Optimize call sequence Peter Zijlstra
@ 2026-06-16 18:55 ` Borislav Petkov
  2026-06-16 20:47 ` David Laight
  1 sibling, 0 replies; 3+ messages in thread
From: Borislav Petkov @ 2026-06-16 18:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, hpa, samitolvanen, kees, nathan,
	scott.d.constable

On Fri, Jun 12, 2026 at 09:15:06AM +0200, Peter Zijlstra wrote:
> ---
>  arch/x86/kernel/alternative.c |   11 ++++++++++-
>  arch/x86/kernel/cfi.c         |    6 ++++++
>  2 files changed, 16 insertions(+), 1 deletion(-)

Acked-by: Borislav Petkov (AMD) <bp@alien8.de>

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH] x86/kcfi: Optimize call sequence
  2026-06-12  7:15 [PATCH] x86/kcfi: Optimize call sequence Peter Zijlstra
  2026-06-16 18:55 ` Borislav Petkov
@ 2026-06-16 20:47 ` David Laight
  1 sibling, 0 replies; 3+ messages in thread
From: David Laight @ 2026-06-16 20:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, hpa, samitolvanen, kees, nathan,
	scott.d.constable

On Fri, 12 Jun 2026 09:15:06 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -1356,6 +1356,10 @@ early_param("cfi", cfi_parse_cmdline);
>   *  "Make conditional jumps most often not taken: The efficiency and throughput
>   *   for not-taken branches is better than for taken branches on most
>   *   processors. Therefore, it is good to place the most frequent branch first"
> + *
> + * NOTE: Update the kCFI caller sequence to make use of this observation.
> + * Replace the "je 1f; ud2" sequence with "jne +1; test $0xd6, %al". This
> + * clobbers flags, but those are clobbered by the hash test anyway.

I think it would be better to give the byte sequences for both pairs of
instructions - it takes a bit of sleuthing to check they are the same size.

I think it would also be better if the code doing the patching checked
what it was overwriting.

Also, what actually generates the list of cfi locations in the first place?
If it is objtool, then maybe it could do the rewrite instead.
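
A rough user-space sketch of the "check what is being overwritten" idea, with
both byte sequences spelled out. This is not code from the patch: memcpy()
stands in for text_poke_early(), and the names KCFI_TAIL_OFFSET, KCFI_TAIL_LEN,
tail_je_ud2, tail_jne_test and kcfi_update_tail_checked() are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KCFI_TAIL_OFFSET 10	/* movl (6 bytes) + addl (4 bytes) */
#define KCFI_TAIL_LEN     4

/* je .+4; ud2 */
static const uint8_t tail_je_ud2[KCFI_TAIL_LEN]   = { 0x74, 0x02, 0x0f, 0x0b };
/* jne .+3; test $0xd6, %al */
static const uint8_t tail_jne_test[KCFI_TAIL_LEN] = { 0x75, 0x01, 0xa8, 0xd6 };

/*
 * Update one caller site, but only if its tail holds the expected bytes.
 * Returns false, leaving the site untouched, on anything unexpected.
 */
static bool kcfi_update_tail_checked(uint8_t *site)
{
	uint8_t *tail;

	assert(site != NULL);
	tail = site + KCFI_TAIL_OFFSET;

	if (!memcmp(tail, tail_jne_test, KCFI_TAIL_LEN))
		return true;	/* already updated */
	if (memcmp(tail, tail_je_ud2, KCFI_TAIL_LEN))
		return false;	/* not the sequence we expect */

	/* Stand-in for text_poke_early() in this user-space sketch. */
	memcpy(tail, tail_jne_test, KCFI_TAIL_LEN);
	return true;
}
```

Whether a mismatch should warn, skip the site or abort is a policy choice this
sketch leaves open.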

	David

