All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] x86/kcfi: Optimize call sequence
@ 2026-06-12  7:15 Peter Zijlstra
  2026-06-16 18:55 ` Borislav Petkov
  2026-06-16 20:47 ` David Laight
  0 siblings, 2 replies; 8+ messages in thread
From: Peter Zijlstra @ 2026-06-12  7:15 UTC (permalink / raw)
  To: x86; +Cc: linux-kernel, hpa, samitolvanen, kees, nathan, scott.d.constable


As noted in commit 85a2d4a890dc ("x86,ibt: Use UDB instead of 0xEA") Jcc should
be assumed not-taken, however the normal kCFI (ABI) emits the following sequence:

   movl	$(-hash), %r10d
   addl	-15(%r11), %r10d
   je 1f
   ud2
1: cs call __x86_indirect_thunk_r11

(when used in conjunction with -mretpoline-external-thunk).

Notably, the Jcc here is always taken, resulting in lower throughput than would
be ideal. Replace it with the following sequence on boot:

   movl	$(-hash), %r10d
   addl	-15(%r11), %r10d
   jne . + 3
   test $0xd6, %al
   cs call __x86_indirect_thunk_r11

This jumps to the UDB instruction used as an immediate byte in the test
instruction. The test instruction will clobber eflags, but that is immaterial,
eflags is already changed by the preceding addl.

Intel recommends the FineIBT sequence on platforms that support IBT; older
platforms are still widely used and would benefit from this.

An earlier PoC was benchmarked by Scott:

Indirect branch miss rate (br_misp_retired.indirect:k / br_inst_retired.indirect:k)

BHI_DIS_S=1

  Benchmark            Baseline             IBT            kCFI        kCFI-opt
  -----------------------------------------------------------------------------
  iperf3 UDP           0.103764        0.103180        0.104311        0.102945
  hackbench            0.000885        0.000876        0.001996        0.000826
  lmbench syscall      0.005089        0.004486        0.016990        0.005852
  lmbench fork+exit    0.018454        0.019176        0.031085        0.015153
  lmbench fork+exec    0.017147        0.021613        0.029129        0.016337
  redis                0.032220        0.032655        0.045540        0.027946
  nginx+wrk            0.109033        0.112765        0.132557        0.102417
  fio randread         0.009704        0.009620        0.008548        0.000962
  fio seqwrite         0.006927        0.006707        0.019372        0.004590
  kbuild               0.056748        0.057324        0.064640        0.048136

BHI_DIS_S=0

  Benchmark            Baseline             IBT            kCFI        kCFI-opt
  -----------------------------------------------------------------------------
  iperf3 UDP           0.000077        0.000106        0.000186        0.000073
  hackbench            0.000123        0.000132        0.000367        0.000097
  lmbench syscall      0.023259        0.018319        0.040903        0.012772
  lmbench fork+exit    0.011494        0.011887        0.029079        0.016415
  lmbench fork+exec    0.037782        0.038994        0.055378        0.026381
  redis                0.002481        0.003152        0.017073        0.000184
  nginx+wrk            0.015478        0.016266        0.033637        0.000268
  fio randread         0.009836        0.007949        0.007096        0.000143
  fio seqwrite         0.014587        0.014165        0.041792        0.002157
  kbuild               0.055774        0.055249        0.062590        0.046546

Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: hpa@zystor.com
Suggested-by: Scott D Constable <scott.d.constable@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/alternative.c |   11 ++++++++++-
 arch/x86/kernel/cfi.c         |    6 ++++++
 2 files changed, 16 insertions(+), 1 deletion(-)

--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1356,6 +1356,10 @@ early_param("cfi", cfi_parse_cmdline);
  *  "Make conditional jumps most often not taken: The efficiency and throughput
  *   for not-taken branches is better than for taken branches on most
  *   processors. Therefore, it is good to place the most frequent branch first"
+ *
+ * NOTE: Update the kCFI caller sequence to make use of this observation.
+ * Replace the "je 1f; ud2" sequence with "jne +1; test $0xd6, %al". This
+ * clobbers flags, but those are clobbered by the hash test anyway.
  */
 
 /*
@@ -1518,9 +1522,10 @@ static int cfi_disable_callers(s32 *star
 static int cfi_enable_callers(s32 *start, s32 *end)
 {
 	/*
-	 * Re-enable kCFI, undo what cfi_disable_callers() did.
+	 * Re-enable (and update) kCFI, undo what cfi_disable_callers() did.
 	 */
 	const u8 mov[] = { 0x41, 0xba };
+	const u8 udne[] = { 0x75, 0x01, 0xa8, 0xd6 };
 	s32 *s;
 
 	for (s = start; s < end; s++) {
@@ -1532,6 +1537,10 @@ static int cfi_enable_callers(s32 *start
 		if (!hash) /* nocfi callers */
 			continue;
 
+		/*
+		 * See the kCFI/FineIBT comment above -- update note.
+		 */
+		text_poke_early(addr + 10, udne, 4);
 		text_poke_early(addr, mov, 2);
 	}
 
--- a/arch/x86/kernel/cfi.c
+++ b/arch/x86/kernel/cfi.c
@@ -72,6 +72,12 @@ enum bug_trap_type handle_cfi_failure(st
 
 	switch (cfi_mode) {
 	case CFI_KCFI:
+		/*
+		 * The updated kCFI sequence has "test $0xd6, %al" instead of
+		 * "ud2", adjust the offset.
+		 */
+		addr -= 1;
+
 		if (!is_cfi_trap(addr))
 			return BUG_TRAP_TYPE_NONE;
 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86/kcfi: Optimize call sequence
  2026-06-12  7:15 [PATCH] x86/kcfi: Optimize call sequence Peter Zijlstra
@ 2026-06-16 18:55 ` Borislav Petkov
  2026-06-16 20:47 ` David Laight
  1 sibling, 0 replies; 8+ messages in thread
From: Borislav Petkov @ 2026-06-16 18:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, hpa, samitolvanen, kees, nathan,
	scott.d.constable

On Fri, Jun 12, 2026 at 09:15:06AM +0200, Peter Zijlstra wrote:
> 
> As noted in commit 85a2d4a890dc ("x86,ibt: Use UDB instead of 0xEA") Jcc should
> be assumed not-taken, however the normal kCFI (ABI) emits the following sequence:
> 
>    movl	$(-hash), %r10d
>    addl	-15(%r11), %r10d
>    je 1f
>    ud2
> 1: cs call __x86_indirect_thunk_r11
> 
> (when used in conjunction with -mretpoline-external-thunk).
> 
> Notably, the Jcc here is always taken, resulting in lower throughput than would
> be ideal. Replace it with the following sequence on boot:
> 
>    movl	$(-hash), %r10d
>    addl	-15(%r11), %r10d
>    jne . + 3
>    test $0xd6, %al
>    cs call __x86_indirect_thunk_r11
> 
> This jumps to the UDB instruction used as an immediate byte in the test
> instruction. The test instruction will clobber eflags, but that is immaterial,
> eflags is already changed by the preceding addl.
> 
> Intel recommends the FineIBT sequence on platforms that support IBT; older
> platforms are still widely used and would benefit from this.
> 
> An earlier PoC was benchmarked by Scott:
> 
> Indirect branch miss rate (br_misp_retired.indirect:k / br_inst_retired.indirect:k)
> 
> BHI_DIS_S=1
> 
>   Benchmark            Baseline             IBT            kCFI        kCFI-opt
>   -----------------------------------------------------------------------------
>   iperf3 UDP           0.103764        0.103180        0.104311        0.102945
>   hackbench            0.000885        0.000876        0.001996        0.000826
>   lmbench syscall      0.005089        0.004486        0.016990        0.005852
>   lmbench fork+exit    0.018454        0.019176        0.031085        0.015153
>   lmbench fork+exec    0.017147        0.021613        0.029129        0.016337
>   redis                0.032220        0.032655        0.045540        0.027946
>   nginx+wrk            0.109033        0.112765        0.132557        0.102417
>   fio randread         0.009704        0.009620        0.008548        0.000962
>   fio seqwrite         0.006927        0.006707        0.019372        0.004590
>   kbuild               0.056748        0.057324        0.064640        0.048136
> 
> BHI_DIS_S=0
> 
>   Benchmark            Baseline             IBT            kCFI        kCFI-opt
>   -----------------------------------------------------------------------------
>   iperf3 UDP           0.000077        0.000106        0.000186        0.000073
>   hackbench            0.000123        0.000132        0.000367        0.000097
>   lmbench syscall      0.023259        0.018319        0.040903        0.012772
>   lmbench fork+exit    0.011494        0.011887        0.029079        0.016415
>   lmbench fork+exec    0.037782        0.038994        0.055378        0.026381
>   redis                0.002481        0.003152        0.017073        0.000184
>   nginx+wrk            0.015478        0.016266        0.033637        0.000268
>   fio randread         0.009836        0.007949        0.007096        0.000143
>   fio seqwrite         0.014587        0.014165        0.041792        0.002157
>   kbuild               0.055774        0.055249        0.062590        0.046546
> 
> Cc: Sami Tolvanen <samitolvanen@google.com>
> Cc: Kees Cook <kees@kernel.org>
> Cc: Nathan Chancellor <nathan@kernel.org>
> Cc: hpa@zystor.com
> Suggested-by: Scott D Constable <scott.d.constable@intel.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/x86/kernel/alternative.c |   11 ++++++++++-
>  arch/x86/kernel/cfi.c         |    6 ++++++
>  2 files changed, 16 insertions(+), 1 deletion(-)

Acked-by: Borislav Petkov (AMD) <bp@alien8.de>

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86/kcfi: Optimize call sequence
  2026-06-12  7:15 [PATCH] x86/kcfi: Optimize call sequence Peter Zijlstra
  2026-06-16 18:55 ` Borislav Petkov
@ 2026-06-16 20:47 ` David Laight
  2026-06-17  7:08   ` Peter Zijlstra
  1 sibling, 1 reply; 8+ messages in thread
From: David Laight @ 2026-06-16 20:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, hpa, samitolvanen, kees, nathan,
	scott.d.constable

On Fri, 12 Jun 2026 09:15:06 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> As noted in commit 85a2d4a890dc ("x86,ibt: Use UDB instead of 0xEA") Jcc should
> be assumed not-taken, however the normal kCFI (ABI) emits the following sequence:
> 
>    movl	$(-hash), %r10d
>    addl	-15(%r11), %r10d
>    je 1f
>    ud2
> 1: cs call __x86_indirect_thunk_r11
> 
> (when used in conjunction with -mretpoline-external-thunk).
> 
> Notably, the Jcc here is always taken, resulting in lower throughput than would
> be ideal. Replace it with the following sequence on boot:
> 
>    movl	$(-hash), %r10d
>    addl	-15(%r11), %r10d
>    jne . + 3
>    test $0xd6, %al
>    cs call __x86_indirect_thunk_r11
> 
> This jumps to the UDB instruction used as an immediate byte in the test
> instruction. The test instruction will clobber eflags, but that is immaterial,
> eflags is already changed by the preceding addl.
> 
> Intel recommends the FineIBT sequence on platforms that support IBT; older
> platforms are still widely used and would benefit from this.
> 
> An earlier PoC was benchmarked by Scott:
> 
> Indirect branch miss rate (br_misp_retired.indirect:k / br_inst_retired.indirect:k)
> 
> BHI_DIS_S=1
> 
>   Benchmark            Baseline             IBT            kCFI        kCFI-opt
>   -----------------------------------------------------------------------------
>   iperf3 UDP           0.103764        0.103180        0.104311        0.102945
>   hackbench            0.000885        0.000876        0.001996        0.000826
>   lmbench syscall      0.005089        0.004486        0.016990        0.005852
>   lmbench fork+exit    0.018454        0.019176        0.031085        0.015153
>   lmbench fork+exec    0.017147        0.021613        0.029129        0.016337
>   redis                0.032220        0.032655        0.045540        0.027946
>   nginx+wrk            0.109033        0.112765        0.132557        0.102417
>   fio randread         0.009704        0.009620        0.008548        0.000962
>   fio seqwrite         0.006927        0.006707        0.019372        0.004590
>   kbuild               0.056748        0.057324        0.064640        0.048136
> 
> BHI_DIS_S=0
> 
>   Benchmark            Baseline             IBT            kCFI        kCFI-opt
>   -----------------------------------------------------------------------------
>   iperf3 UDP           0.000077        0.000106        0.000186        0.000073
>   hackbench            0.000123        0.000132        0.000367        0.000097
>   lmbench syscall      0.023259        0.018319        0.040903        0.012772
>   lmbench fork+exit    0.011494        0.011887        0.029079        0.016415
>   lmbench fork+exec    0.037782        0.038994        0.055378        0.026381
>   redis                0.002481        0.003152        0.017073        0.000184
>   nginx+wrk            0.015478        0.016266        0.033637        0.000268
>   fio randread         0.009836        0.007949        0.007096        0.000143
>   fio seqwrite         0.014587        0.014165        0.041792        0.002157
>   kbuild               0.055774        0.055249        0.062590        0.046546
> 
> Cc: Sami Tolvanen <samitolvanen@google.com>
> Cc: Kees Cook <kees@kernel.org>
> Cc: Nathan Chancellor <nathan@kernel.org>
> Cc: hpa@zystor.com
> Suggested-by: Scott D Constable <scott.d.constable@intel.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/x86/kernel/alternative.c |   11 ++++++++++-
>  arch/x86/kernel/cfi.c         |    6 ++++++
>  2 files changed, 16 insertions(+), 1 deletion(-)
> 
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -1356,6 +1356,10 @@ early_param("cfi", cfi_parse_cmdline);
>   *  "Make conditional jumps most often not taken: The efficiency and throughput
>   *   for not-taken branches is better than for taken branches on most
>   *   processors. Therefore, it is good to place the most frequent branch first"
> + *
> + * NOTE: Update the kCFI caller sequence to make use of this observation.
> + * Replace the "je 1f; ud2" sequence with "jne +1; test $0xd6, %al". This
> + * clobbers flags, but those are clobbered by the hash test anyway.

I think it would be better to give the byte sequences for both pairs of
instructions - it takes a bit of sleuthing to check they are the same size.

I think it would also be better it the code doing the patching checked
what it was overwriting.

Also, what actually generates the list of cfi locations in the first place?
If it is objtool, then maybe it could do the rewrite instead.

	David


>   */
>  
>  /*
> @@ -1518,9 +1522,10 @@ static int cfi_disable_callers(s32 *star
>  static int cfi_enable_callers(s32 *start, s32 *end)
>  {
>  	/*
> -	 * Re-enable kCFI, undo what cfi_disable_callers() did.
> +	 * Re-enable (and update) kCFI, undo what cfi_disable_callers() did.
>  	 */
>  	const u8 mov[] = { 0x41, 0xba };
> +	const u8 udne[] = { 0x75, 0x01, 0xa8, 0xd6 };
>  	s32 *s;
>  
>  	for (s = start; s < end; s++) {
> @@ -1532,6 +1537,10 @@ static int cfi_enable_callers(s32 *start
>  		if (!hash) /* nocfi callers */
>  			continue;
>  
> +		/*
> +		 * See the kCFI/FineIBT comment above -- update note.
> +		 */
> +		text_poke_early(addr + 10, udne, 4);
>  		text_poke_early(addr, mov, 2);
>  	}
>  
> --- a/arch/x86/kernel/cfi.c
> +++ b/arch/x86/kernel/cfi.c
> @@ -72,6 +72,12 @@ enum bug_trap_type handle_cfi_failure(st
>  
>  	switch (cfi_mode) {
>  	case CFI_KCFI:
> +		/*
> +		 * The updated kCFI sequence has "test $0xd6, %al" instead of
> +		 * "ud2", adjust the offset.
> +		 */
> +		addr -= 1;
> +
>  		if (!is_cfi_trap(addr))
>  			return BUG_TRAP_TYPE_NONE;
>  
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86/kcfi: Optimize call sequence
  2026-06-16 20:47 ` David Laight
@ 2026-06-17  7:08   ` Peter Zijlstra
  2026-06-17  9:26     ` David Laight
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2026-06-17  7:08 UTC (permalink / raw)
  To: David Laight
  Cc: x86, linux-kernel, hpa, samitolvanen, kees, nathan,
	scott.d.constable

On Tue, Jun 16, 2026 at 09:47:22PM +0100, David Laight wrote:

> > --- a/arch/x86/kernel/alternative.c
> > +++ b/arch/x86/kernel/alternative.c
> > @@ -1356,6 +1356,10 @@ early_param("cfi", cfi_parse_cmdline);
> >   *  "Make conditional jumps most often not taken: The efficiency and throughput
> >   *   for not-taken branches is better than for taken branches on most
> >   *   processors. Therefore, it is good to place the most frequent branch first"
> > + *
> > + * NOTE: Update the kCFI caller sequence to make use of this observation.
> > + * Replace the "je 1f; ud2" sequence with "jne +1; test $0xd6, %al". This
> > + * clobbers flags, but those are clobbered by the hash test anyway.
> 
> I think it would be better to give the byte sequences for both pairs of
> instructions - it takes a bit of sleuthing to check they are the same size.

You mean, expand the comment like a few lines above, where we have the
kCFI/FineIBT contrast? Sure, I suppose I can make this comment longer
still.

> I think it would also be better it the code doing the patching checked
> what it was overwriting.

Ye of little faith :-)

> Also, what actually generates the list of cfi locations in the first place?
> If it is objtool, then maybe it could do the rewrite instead.

The list with UD2 locations is compiler generated. Also, objtool
typically avoids actually modifying code and generally prefers to just
ship additional sections such that the kernel can modify itself. There
is an exception to this, but there was definite grumbling about that.



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86/kcfi: Optimize call sequence
  2026-06-17  7:08   ` Peter Zijlstra
@ 2026-06-17  9:26     ` David Laight
  2026-06-17 11:12       ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: David Laight @ 2026-06-17  9:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, hpa, samitolvanen, kees, nathan,
	scott.d.constable

On Wed, 17 Jun 2026 09:08:13 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Jun 16, 2026 at 09:47:22PM +0100, David Laight wrote:
> 
> > > --- a/arch/x86/kernel/alternative.c
> > > +++ b/arch/x86/kernel/alternative.c
> > > @@ -1356,6 +1356,10 @@ early_param("cfi", cfi_parse_cmdline);
> > >   *  "Make conditional jumps most often not taken: The efficiency and throughput
> > >   *   for not-taken branches is better than for taken branches on most
> > >   *   processors. Therefore, it is good to place the most frequent branch first"
> > > + *
> > > + * NOTE: Update the kCFI caller sequence to make use of this observation.
> > > + * Replace the "je 1f; ud2" sequence with "jne +1; test $0xd6, %al". This
> > > + * clobbers flags, but those are clobbered by the hash test anyway.  
> > 
> > I think it would be better to give the byte sequences for both pairs of
> > instructions - it takes a bit of sleuthing to check they are the same size.  
> 
> You mean, expand the comment like a few lines above, where we have the
> kCFI/FineIBT contrast? Sure, I suppose I can make this comment longer
> still.

More detail and less waffle :-)
I had to read the earlier comment several times because it mentions using
udb and then gives a code snippet that contains ud2.
I then had to check the instruction encodings for both (and neither in is
the 286 and 386 books on my desk).

Just adding (0f,0b) after one of the ud2 and (d6) after a udb would help.

> > I think it would also be better it the code doing the patching checked
> > what it was overwriting.  
> 
> Ye of little faith :-)

I wouldn't want to have to debug the consequences of getting it wrong.
(The same goes for patching into function preamble.)

My 'little faith' comes from patching live kernel code with echo | dd :-)

> 
> > Also, what actually generates the list of cfi locations in the first place?
> > If it is objtool, then maybe it could do the rewrite instead.  
> 
> The list with UD2 locations is compiler generated.

I've never trusted compilers not to change their minds on how code
will be compiled.

> Also, objtool
> typically avoids actually modifying code and generally prefers to just
> ship additional sections such that the kernel can modify itself. There
> is an exception to this, but there was definite grumbling about that.

At least this one is an optimisation.
The advantage of getting objtool to do the change is that objdump will
then show the code that is being executed.

	David


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86/kcfi: Optimize call sequence
  2026-06-17  9:26     ` David Laight
@ 2026-06-17 11:12       ` Peter Zijlstra
  2026-06-17 12:36         ` David Laight
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2026-06-17 11:12 UTC (permalink / raw)
  To: David Laight
  Cc: x86, linux-kernel, hpa, samitolvanen, kees, nathan,
	scott.d.constable

On Wed, Jun 17, 2026 at 10:26:43AM +0100, David Laight wrote:

> > > I think it would also be better it the code doing the patching checked
> > > what it was overwriting.  
> > 
> > Ye of little faith :-)
> 
> I wouldn't want to have to debug the consequences of getting it wrong.
> (The same goes for patching into function preamble.)

Been there, done that etc. :-) I'm the weirdo that's written all this
code.

> My 'little faith' comes from patching live kernel code with echo | dd :-)

The thing is, objtool validates the retpolines are preceded by UD2 as
marker for kCFI and complains when this is not so (there must not be
unannotated indirect calls). And the code that is patching is already
checking there is that mov into %r10d at the expected offset.

The update poke happens when both those are true; (leading mov and
trailing UD2), verifying things again has very little added value.

> > > Also, what actually generates the list of cfi locations in the first place?
> > > If it is objtool, then maybe it could do the rewrite instead.  
> > 
> > The list with UD2 locations is compiler generated.
> 
> I've never trusted compilers not to change their minds on how code
> will be compiled.

The whole kCFI sequence and placement is ABI, there is no changing that.
It is a very specific sequence that is guaranteed to be attached to
any indirect call/jmp/retpoline.

> > Also, objtool
> > typically avoids actually modifying code and generally prefers to just
> > ship additional sections such that the kernel can modify itself. There
> > is an exception to this, but there was definite grumbling about that.
> 
> At least this one is an optimisation.
> The advantage of getting objtool to do the change is that objdump will
> then show the code that is being executed.

Given the amount of self modifying code, that's a dream. Also, on
anything half recent from Intel, it'll all be rewritten to FineIBT,
which is wildly different from what objdump will be showing you.

The only way to truly see what's running is to disassemble the live
image -- either through /proc/kcore or some virtual machine gdb server.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86/kcfi: Optimize call sequence
  2026-06-17 11:12       ` Peter Zijlstra
@ 2026-06-17 12:36         ` David Laight
  2026-06-17 12:47           ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: David Laight @ 2026-06-17 12:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, linux-kernel, hpa, samitolvanen, kees, nathan,
	scott.d.constable

On Wed, 17 Jun 2026 13:12:07 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Jun 17, 2026 at 10:26:43AM +0100, David Laight wrote:
> 
> > > > I think it would also be better it the code doing the patching checked
> > > > what it was overwriting.    
> > > 
> > > Ye of little faith :-)  
> > 
> > I wouldn't want to have to debug the consequences of getting it wrong.
> > (The same goes for patching into function preamble.)  
> 
> Been there, done that etc. :-) I'm the weirdo that's written all this
> code.

And I'm one of the weirdos who knows enough asm (various) to understand
what it is all doing :-)

...
> The thing is, objtool validates the retpolines are preceded by UD2 as
> marker for kCFI and complains when this is not so (there must not be
> unannotated indirect calls). And the code that is patching is already
> checking there is that mov into %r10d at the expected offset.
> 
> The update poke happens when both those are true; (leading mov and
> trailing UD2), verifying things again has very little added value.

Ok, there is a check a bit earlier but not in this bit.
I'm just wary that just believing an address in some table could easily
lead to problems.

...
> > > Also, objtool
> > > typically avoids actually modifying code and generally prefers to just
> > > ship additional sections such that the kernel can modify itself. There
> > > is an exception to this, but there was definite grumbling about that.  
> > 
> > At least this one is an optimisation.
> > The advantage of getting objtool to do the change is that objdump will
> > then show the code that is being executed.  
> 
> Given the amount of self modifying code, that's a dream. Also, on
> anything half recent from Intel, it'll all be rewritten to FineIBT,
> which is wildly different from what objdump will be showing you.
> 
> The only way to truly see what's running is to disassemble the live
> image -- either through /proc/kcore or some virtual machine gdb server.

I did have a local change that generated different nop*3 so I could tell
what was lfence, stac, clac (etc).
Trying to check the compiler output was hard when there were blocks of
6 nop.

	David


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86/kcfi: Optimize call sequence
  2026-06-17 12:36         ` David Laight
@ 2026-06-17 12:47           ` Peter Zijlstra
  0 siblings, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2026-06-17 12:47 UTC (permalink / raw)
  To: David Laight
  Cc: x86, linux-kernel, hpa, samitolvanen, kees, nathan,
	scott.d.constable

On Wed, Jun 17, 2026 at 01:36:37PM +0100, David Laight wrote:

> I did have a local change that generated different nop*3 so I could tell
> what was lfence, stac, clac (etc).
> Trying to check the compiler output was hard when there were blocks of
> 6 nop.

You should use objtool more ;-)

$ defconfig-build/tools/objtool/objtool --disas=rseq_update_user_cs --wide defconfig-build/kernel/rseq.o
rseq_update_user_cs:
   f00:  rseq_update_user_cs+0x0                      mov    %rdi,%rax
   f03:  rseq_update_user_cs+0x3                      mov    0x80(%rsi),%rdi
   f0a:  rseq_update_user_cs+0xa                      mov    %gs:0x0(%rip),%rcx        # 0xf12 <current_task>
   f12:  rseq_update_user_cs+0x12                     testq  $0x8000000,(%rcx)
   f19:  rseq_update_user_cs+0x19                     jne    0xf27 <rseq_update_user_cs+0x27>
   f1b:  rseq_update_user_cs+0x1b                   | <alternative.f1b>           | X86_FEATURE_LA57
   f1b:  rseq_update_user_cs+0x1b                   | movabs $0x7ffffffff000,%r10 | movabs $0xfffffffffff000,%r10
   f25:  rseq_update_user_cs+0x25                     jmp    0xf3f <rseq_update_user_cs+0x3f>
   f27:  rseq_update_user_cs+0x27                     mov    $0xc0000000,%r8d
   f2d:  rseq_update_user_cs+0x2d                     lea    0x3fffe000(%r8),%r10
   f34:  rseq_update_user_cs+0x34                     testb  $0x8,0x5ab(%rcx)
   f3b:  rseq_update_user_cs+0x3b                     cmovne %r8,%r10
   f3f:  rseq_update_user_cs+0x3f                     cmp    %r10,%rdx
   f42:  rseq_update_user_cs+0x42                     jae    0xfda <rseq_update_user_cs+0xda>
   f48:  rseq_update_user_cs+0x48                   | <jump_table.f48>                        | JUMP
   f48:  rseq_update_user_cs+0x48                   | jmp    0xfe9 <rseq_update_user_cs+0xe9> | nop5
   f4d:  rseq_update_user_cs+0x4d                     movabs $0x123456789abcdef,%rcx
   f57:  rseq_update_user_cs+0x57                     cmp    %rcx,%rdx
   f5a:  rseq_update_user_cs+0x5a                     cmova  %rcx,%rdx
   f5e:  rseq_update_user_cs+0x5e                   | <alternative.f5e> | X86_FEATURE_SMAP
   f5e:  rseq_update_user_cs+0x5e                   | nop*3             | stac
   f61:  rseq_update_user_cs+0x61                   | <ex_table.f61>       | EXCEPTION
   f61:  rseq_update_user_cs+0x61                   | mov    0x8(%rdx),%r9 | resume at 0xff1 <rseq_update_user_cs+0xf1>
   f65:  rseq_update_user_cs+0x65                   | <ex_table.f65>        | EXCEPTION
   f65:  rseq_update_user_cs+0x65                   | mov    0x10(%rdx),%r8 | resume at 0xff3 <rseq_update_user_cs+0xf3>
   f69:  rseq_update_user_cs+0x69                     add    $0x18,%rdx
   f6d:  rseq_update_user_cs+0x6d                   | <ex_table.f6d>     | EXCEPTION
   f6d:  rseq_update_user_cs+0x6d                   | mov    (%rdx),%rcx | resume at 0xff5 <rseq_update_user_cs+0xf5>
   f70:  rseq_update_user_cs+0x70                     mov    %rdi,%rdx
   f73:  rseq_update_user_cs+0x73                     sub    %r9,%rdx
   f76:  rseq_update_user_cs+0x76                     cmp    %r8,%rdx
   f79:  rseq_update_user_cs+0x79                     jae    0xfbd <rseq_update_user_cs+0xbd>
   f7b:  rseq_update_user_cs+0x7b                     cmp    %r10,%rcx
   f7e:  rseq_update_user_cs+0x7e                     jae    0xfd7 <rseq_update_user_cs+0xd7>
   f80:  rseq_update_user_cs+0x80                     cmp    $0x4,%rcx
   f84:  rseq_update_user_cs+0x84                     jb     0xfd7 <rseq_update_user_cs+0xd7>
   f86:  rseq_update_user_cs+0x86                   | <ex_table.f86>         | EXCEPTION
   f86:  rseq_update_user_cs+0x86                   | mov    -0x4(%rcx),%edx | resume at 0xff7 <rseq_update_user_cs+0xf7>
   f89:  rseq_update_user_cs+0x89                     cmp    %edx,0xa9c(%rax)
   f8f:  rseq_update_user_cs+0x8f                     jne    0xfd7 <rseq_update_user_cs+0xd7>
   f91:  rseq_update_user_cs+0x91                     mov    0xa90(%rax),%rax
   f98:  rseq_update_user_cs+0x98                   | <ex_table.f98>        | EXCEPTION
   f98:  rseq_update_user_cs+0x98                   | movq   $0x0,0x8(%rax) | resume at 0xff7 <rseq_update_user_cs+0xf7>
   fa0:  rseq_update_user_cs+0xa0                     mov    %rcx,0x80(%rsi)
   fa7:  rseq_update_user_cs+0xa7                   | <alternative.fa7> | X86_FEATURE_SMAP
   fa7:  rseq_update_user_cs+0xa7                   | nop*3             | clac
   faa:  rseq_update_user_cs+0xaa                     mov    %r9,%rsi
   fad:  rseq_update_user_cs+0xad                     mov    %r8,%rdx
   fb0:  rseq_update_user_cs+0xb0                     call   0x150 <rseq_trace_ip_fixup>
   fb5:  rseq_update_user_cs+0xb5                     mov    $0x1,%al
   fb7:  rseq_update_user_cs+0xb7                     cs jmp 0xfbd <__x86_return_thunk>
   fbd:  rseq_update_user_cs+0xbd                     mov    0xa90(%rax),%rax
   fc4:  rseq_update_user_cs+0xc4                   | <ex_table.fc4>        | EXCEPTION
   fc4:  rseq_update_user_cs+0xc4                   | movq   $0x0,0x8(%rax) | resume at 0xff7 <rseq_update_user_cs+0xf7>
   fcc:  rseq_update_user_cs+0xcc                   | <alternative.fcc> | X86_FEATURE_SMAP
   fcc:  rseq_update_user_cs+0xcc                   | nop*3             | clac
   fcf:  rseq_update_user_cs+0xcf                     mov    $0x1,%al
   fd1:  rseq_update_user_cs+0xd1                     cs jmp 0xfd7 <__x86_return_thunk>
   fd7:  rseq_update_user_cs+0xd7                   | <alternative.fd7> | X86_FEATURE_SMAP
   fd7:  rseq_update_user_cs+0xd7                   | nop*3             | clac
   fda:  rseq_update_user_cs+0xda                     movb   $0x1,0xaa6(%rax)
   fe1:  rseq_update_user_cs+0xe1                     xor    %eax,%eax
   fe3:  rseq_update_user_cs+0xe3                     cs jmp 0xfe9 <__x86_return_thunk>
   fe9:  rseq_update_user_cs+0xe9                     mov    %rax,%rdi
   fec:  rseq_update_user_cs+0xec                     jmp    0xff1 <rseq_debug_update_user_cs>
   ff1:  rseq_update_user_cs+0xf1                     jmp    0xff7 <rseq_update_user_cs+0xf7>
   ff3:  rseq_update_user_cs+0xf3                     jmp    0xff7 <rseq_update_user_cs+0xf7>
   ff5:  rseq_update_user_cs+0xf5                     jmp    0xff7 <rseq_update_user_cs+0xf7>
   ff7:  rseq_update_user_cs+0xf7                   | <alternative.ff7> | X86_FEATURE_SMAP
   ff7:  rseq_update_user_cs+0xf7                   | nop*3             | clac
   ffa:  rseq_update_user_cs+0xfa                     xor    %eax,%eax
   ffc:  rseq_update_user_cs+0xfc                     cs jmp 0x1002 <__x86_return_thunk>


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-06-17 12:47 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-12  7:15 [PATCH] x86/kcfi: Optimize call sequence Peter Zijlstra
2026-06-16 18:55 ` Borislav Petkov
2026-06-16 20:47 ` David Laight
2026-06-17  7:08   ` Peter Zijlstra
2026-06-17  9:26     ` David Laight
2026-06-17 11:12       ` Peter Zijlstra
2026-06-17 12:36         ` David Laight
2026-06-17 12:47           ` Peter Zijlstra

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.