next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic

* next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
@ 2025-06-05 11:42 Naresh Kamboju
  2025-06-09 13:09 ` Masami Hiramatsu
  2025-06-10  0:29 ` Steven Rostedt
  0 siblings, 2 replies; 34+ messages in thread
From: Naresh Kamboju @ 2025-06-05 11:42 UTC (permalink / raw)
  To: open list, Linux trace kernel, lkft-triage
  Cc: Stephen Rothwell, Masami Hiramatsu, Arnd Bergmann, Dan Carpenter,
	Anders Roxell

Regressions found on qemu-x86_64 in compat mode (64-bit kernel
running a 32-bit userspace) while running the LTP tracing test suite
on the Linux next-20250605 tag kernel.

Regressions found on
 - LTP tracing

Regression Analysis:
 - New regression? Yes
 - Reproducible? Intermittent

Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic

Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>

## Test log
ftrace-stress-test: <12>[   21.971153] /usr/local/bin/kirk[277]:
starting test ftrace-stress-test (ftrace_stress_test.sh 90)
<4>[   58.997439] Oops: int3: 0000 [#1] SMP PTI
<4>[   58.998089] CPU: 0 UID: 0 PID: 323 Comm: sh Not tainted
6.15.0-next-20250605 #1 PREEMPT(voluntary)
<4>[   58.998152] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
BIOS 1.16.3-debian-1.16.3-2 04/01/2014
<4>[   58.998260] RIP: 0010:_raw_spin_lock+0x5/0x50
<4>[   58.998563] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00
00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3
0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15
12 e4 fe
<4>[   58.998610] RSP: 0018:ffff9494007bbe98 EFLAGS: 00000246
<4>[   58.998715] RAX: ffff912a042edd00 RBX: 000000000000000b RCX:
0000000000000000
<4>[   58.998727] RDX: 0000000000000000 RSI: 0000000000000006 RDI:
ffff912a00f2c8c0
<4>[   58.998737] RBP: ffff9494007bbeb8 R08: 0000000000000000 R09:
0000000000000000
<4>[   58.998748] R10: 0000000000000000 R11: 0000000000000000 R12:
ffff912a00f2c8c0
<4>[   58.998759] R13: ffff912a00f2c840 R14: 0000000000000006 R15:
0000000000000000
<4>[   58.998804] FS:  0000000000000000(0000)
GS:ffff912ad7cbf000(0063) knlGS:00000000f7f05580
<4>[   58.998821] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
<4>[   58.998832] CR2: 00000000f7d8f890 CR3: 000000010124e000 CR4:
00000000000006f0
<4>[   58.998915] Call Trace:
<4>[   58.999010]  <TASK>
<4>[   58.999077]  ? file_close_fd+0x32/0x60
<4>[   58.999147]  __ia32_sys_close+0x18/0x90
<4>[   58.999172]  ia32_sys_call+0x1c3c/0x27e0
<4>[   58.999183]  __do_fast_syscall_32+0x79/0x1e0
<4>[   58.999194]  do_fast_syscall_32+0x37/0x80
<4>[   58.999203]  do_SYSENTER_32+0x23/0x30
<4>[   58.999211]  entry_SYSENTER_compat_after_hwframe+0x84/0x8e
<4>[   58.999254] RIP: 0023:0xf7f0c579
<4>[   58.999459] Code: b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10
08 03 74 d8 01 00 00 00 00 00 00 00 00 00 00 00 00 00 51 52 55 89 e5
0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 2e 8d b4 26 00 00 00 00 8d b4 26
00 00 00
<4>[   58.999466] RSP: 002b:00000000fff98500 EFLAGS: 00000206
ORIG_RAX: 0000000000000006
<4>[   58.999479] RAX: ffffffffffffffda RBX: 000000000000000b RCX:
0000000000000000
<4>[   58.999484] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
0000000000000000
<4>[   58.999488] RBP: 0000000000000000 R08: 0000000000000000 R09:
0000000000000000
<4>[   58.999492] R10: 0000000000000000 R11: 0000000000000206 R12:
0000000000000000
<4>[   58.999497] R13: 0000000000000000 R14: 0000000000000000 R15:
0000000000000000
<4>[   58.999534]  </TASK>
<4>[   58.999579] Modules linked in:
<4>[   58.999895] ---[ end trace 0000000000000000 ]---
<4>[   58.999892] Oops: int3: 0000 [#2] SMP PTI
<4>[   58.999997] RIP: 0010:_raw_spin_lock+0x5/0x50
<4>[   59.000008] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00
00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3
0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15
12 e4 fe
<4>[   59.000010] CPU: 1 UID: 0 PID: 339 Comm: sh Tainted: G      D
         6.15.0-next-20250605 #1 PREEMPT(voluntary)
<4>[   59.000014] RSP: 0018:ffff9494007bbe98 EFLAGS: 00000246
<4>[   59.000021] RAX: ffff912a042edd00 RBX: 000000000000000b RCX:
0000000000000000
<4>[   59.000026] RDX: 0000000000000000 RSI: 0000000000000006 RDI:
ffff912a00f2c8c0
<4>[   59.000030] RBP: ffff9494007bbeb8 R08: 0000000000000000 R09:
0000000000000000
<4>[   59.000040] R10: 0000000000000000 R11: 0000000000000000 R12:
ffff912a00f2c8c0
<4>[   59.000044] R13: ffff912a00f2c840 R14: 0000000000000006 R15:
0000000000000000
<4>[   59.000049] FS:  0000000000000000(0000)
GS:ffff912ad7cbf000(0063) knlGS:00000000f7f05580
<4>[   59.000054] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
<4>[   59.000059] CR2: 00000000f7d8f890 CR3: 000000010124e000 CR4:
00000000000006f0
<4>[   59.000070] Tainted: [D]=DIE
<4>[   59.000080] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
BIOS 1.16.3-debian-1.16.3-2 04/01/2014
<4>[   59.000085] RIP: 0010:_raw_spin_lock+0x5/0x50
<4>[   59.000101] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00
00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3
0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15
12 e4 fe
<4>[   59.000108] RSP: 0018:ffff9494000e0e88 EFLAGS: 00000097
<4>[   59.000117] RAX: 0000000000010002 RBX: ffff912a7bd29500 RCX:
ffff912a7bd2a400
<0>[   59.000179] Kernel panic - not syncing: Fatal exception in interrupt
<0>[   60.592321] Shutting down cpus with NMI
<0>[   60.593242] Kernel Offset: 0x20800000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
<0>[   60.618536] ---[ end Kernel panic - not syncing: Fatal exception
in interrupt ]---

## Source
* Kernel version: 6.15.0-next-20250605
* Git tree: https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git
* Git sha: 4f27f06ec12190c7c62c722e99ab6243dea81a94

## Build
* Test log: https://qa-reports.linaro.org/api/testruns/28675335/log_file/
* Build link: https://storage.tuxsuite.com/public/linaro/lkft/builds/2y4whKazVqJKOUFD08taHC8XHRq/
* Kernel config:
https://storage.tuxsuite.com/public/linaro/lkft/builds/2y4whKazVqJKOUFD08taHC8XHRq/config


--
Linaro LKFT
https://lkft.linaro.org

* Re: next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
  2025-06-05 11:42 next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic Naresh Kamboju
@ 2025-06-09 13:09 ` Masami Hiramatsu
  2025-06-10  8:41   ` Masami Hiramatsu
  2025-06-10 13:20   ` Naresh Kamboju
  2025-06-10  0:29 ` Steven Rostedt
  1 sibling, 2 replies; 34+ messages in thread
From: Masami Hiramatsu @ 2025-06-09 13:09 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: open list, Linux trace kernel, lkft-triage, Stephen Rothwell,
	Masami Hiramatsu, Arnd Bergmann, Dan Carpenter, Anders Roxell

On Thu, 5 Jun 2025 17:12:10 +0530
Naresh Kamboju <naresh.kamboju@linaro.org> wrote:

> Regressions found on qemu-x86_64 with compat mode (64-bit kernel
> running on 32-bit userspace) while running LTP tracing test suite
> on Linux next-20250605 tag kernel.
> 
> Regressions found on
>  - LTP tracing
> 
> Regression Analysis:
>  - New regression? Yes
>  - Reproducible? Intermittent
> 
> Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
> 
> Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> 
> ## Test log
> ftrace-stress-test: <12>[   21.971153] /usr/local/bin/kirk[277]:
> starting test ftrace-stress-test (ftrace_stress_test.sh 90)
> <4>[   58.997439] Oops: int3: 0000 [#1] SMP PTI
> <4>[   58.998089] CPU: 0 UID: 0 PID: 323 Comm: sh Not tainted
> 6.15.0-next-20250605 #1 PREEMPT(voluntary)
> <4>[   58.998152] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
> BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> <4>[   58.998260] RIP: 0010:_raw_spin_lock+0x5/0x50

Interesting. This hits a stray int3 for ftrace on _raw_spin_lock.

Here is the compiled code of _raw_spin_lock.

ffffffff825daa00 <_raw_spin_lock>:
ffffffff825daa00:       f3 0f 1e fa             endbr64
ffffffff825daa04:       e8 47 a6 d5 fe          call   ffffffff81335050 <__fentry__>

Since the int3 exception is raised after the 1-byte int3 has been
decoded, the saved RIP `_raw_spin_lock+0x05` is not an instruction
boundary; the trap itself was at `_raw_spin_lock+0x04`.
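
The offset arithmetic behind this observation can be written down as a
small C sketch. It is only an illustration under assumptions: the helper
name int3_trap_addr() is hypothetical, and the two offsets are copied
from the disassembly and the Oops above. INT3_INSN_SIZE is the one-byte
size of the 0xcc opcode.

```c
#include <assert.h>
#include <stdint.h>

/* Size of the one-byte INT3 opcode (0xcc) on x86. */
#define INT3_INSN_SIZE 1

/* Offsets taken from the disassembly and the Oops in this thread. */
#define FENTRY_CALL_OFFSET  0x4 /* call __fentry__ starts at _raw_spin_lock+0x4 */
#define REPORTED_RIP_OFFSET 0x5 /* Oops reports RIP: _raw_spin_lock+0x5 */

/*
 * Hypothetical helper: for a breakpoint trap the saved RIP points just
 * past the 0xcc byte, so the address that trapped is RIP minus one.
 */
static uint64_t int3_trap_addr(uint64_t saved_rip)
{
	assert(saved_rip >= INT3_INSN_SIZE);
	return saved_rip - INT3_INSN_SIZE;
}
```

Feeding it the reported RIP (function start plus REPORTED_RIP_OFFSET)
gives the start of the patched call site, function start plus
FENTRY_CALL_OFFSET.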

> <4>[   58.998563] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00
> 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3
> 0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15
> 12 e4 fe

And the call has already been changed back to a 5-byte nop by the time
the code is dumped. Thus the CPU may have hit the intermediate int3 that
is used while transforming the code.

e8 47 a6 d5 fe
 (first step)
cc 47 a6 d5 fe
 (second step)
cc 1f 44 00 00 <- hit?
 (third step)
0f 1f 44 00 00 <- handle int3
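
The three byte-level steps above can be replayed on a plain buffer. This
is a user-space sketch only: the poke_step*() names are made up for
illustration, and the cross-CPU synchronization that the kernel performs
between the steps is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define INSN_LEN    5    /* length of the call / 5-byte nop being swapped */
#define INT3_OPCODE 0xcc

/* First step: overwrite only the first byte with INT3. */
static void poke_step1(uint8_t *text)
{
	assert(text);
	text[0] = INT3_OPCODE;
}

/* Second step: write the tail of the new instruction, INT3 stays in place. */
static void poke_step2(uint8_t *text, const uint8_t *new_insn)
{
	assert(text && new_insn);
	memcpy(text + 1, new_insn + 1, INSN_LEN - 1);
}

/* Third step: replace the INT3 with the first byte of the new instruction. */
static void poke_step3(uint8_t *text, const uint8_t *new_insn)
{
	assert(text && new_insn);
	text[0] = new_insn[0];
}
```

Starting from `e8 47 a6 d5 fe` and using `0f 1f 44 00 00` as the new
instruction, the buffer passes through exactly the intermediate states
listed above.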

This is a very unlikely scenario (and I'm not sure qemu can emulate it
correctly). But suppose a CPU hits the int3 (cc) at _raw_spin_lock()+0x4
before another CPU (CPU') runs the third step in
smp_text_poke_batch_finish(). If CPU' then runs the third step and sets
text_poke_array_refs to 0 before the first CPU runs
smp_text_poke_int3_handler(), the handler returns 0 and causes the same
problem.

<CPU0>					<CPU1>
					Start smp_text_poke_batch_finish().
					Finish second step.
Hit int3 (*)
					Finish third step.
					Run smp_text_poke_sync_each_cpu().(**)
					Clear text_poke_array_refs[cpu0]
Start smp_text_poke_int3_handler()
Failed to get text_poke_array_refs[cpu0]
Oops: int3
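
The last three lines of this interleaving can be modelled with a toy
reference counter. This is a single-threaded sketch, not the kernel
implementation: struct poke_state, try_get_refs(), put_refs() and
model_int3_handler() are hypothetical names, and the only behaviour
taken from the description above is that the handler returns 0 when it
fails to get the reference.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the per-CPU reference; nonzero while a patch is in flight. */
struct poke_state {
	int refs;
};

/* Models the handler's "get": it succeeds only while refs is nonzero. */
static bool try_get_refs(struct poke_state *s)
{
	if (s->refs == 0)
		return false;
	s->refs++;
	return true;
}

static void put_refs(struct poke_state *s)
{
	assert(s->refs > 0);
	s->refs--;
}

/*
 * Returns 1 if the trap is claimed by the patching code, 0 otherwise.
 * A return of 0 is what leads to the "Oops: int3" path.
 */
static int model_int3_handler(struct poke_state *s)
{
	if (!try_get_refs(s))
		return 0;
	/* ... the trapped address would be looked up and emulated here ... */
	put_refs(s);
	return 1;
}
```

With refs still set the trap is claimed; once the patching side has
cleared refs, the same trap is reported as unhandled.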


But as I said, it is very unlikely because, as far as I know:

(*) smp_text_poke_int3_handler() is called directly from exc_int3(),
   which is a kind of NMI, so other interrupts should not run.
(**) In the third step, smp_text_poke_batch_finish() sends an IPI for
   sync core after removing the int3. Thus any int3 exception handling
   should be finished.

Has this bug become easier to reproduce recently?

Thanks,

-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

* Re: next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
  2025-06-05 11:42 next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic Naresh Kamboju
  2025-06-09 13:09 ` Masami Hiramatsu
@ 2025-06-10  0:29 ` Steven Rostedt
  1 sibling, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2025-06-10  0:29 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: open list, Linux trace kernel, lkft-triage, Stephen Rothwell,
	Masami Hiramatsu, Arnd Bergmann, Dan Carpenter, Anders Roxell,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, x86


[ Adding x86 and text_poke folks ]

On Thu, 5 Jun 2025 17:12:10 +0530
Naresh Kamboju <naresh.kamboju@linaro.org> wrote:

> Regressions found on qemu-x86_64 with compat mode (64-bit kernel
> running on 32-bit userspace) while running LTP tracing test suite
> on Linux next-20250605 tag kernel.
> 
> Regressions found on
>  - LTP tracing
> 
> Regression Analysis:
>  - New regression? Yes
>  - Reproducible? Intermittent
> 
> Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
> 
> Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> 
> ## Test log
> ftrace-stress-test: <12>[   21.971153] /usr/local/bin/kirk[277]:
> starting test ftrace-stress-test (ftrace_stress_test.sh 90)
> <4>[   58.997439] Oops: int3: 0000 [#1] SMP PTI

Did anything change with text_poke? Ftrace just happens to stress text_poke
more than anything else, as it updates tens of thousands of locations at a time.

The ftrace code hasn't changed in a while, but I think there have been
updates to text_poke.

The modifying of code and adding and removing the int3 handler needs to be
synchronized correctly or something like this bug can happen.

-- Steve


* Re: next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
  2025-06-09 13:09 ` Masami Hiramatsu
@ 2025-06-10  8:41   ` Masami Hiramatsu
  2025-06-10 13:25     ` Steven Rostedt
  2025-06-10 13:20   ` Naresh Kamboju
  1 sibling, 1 reply; 34+ messages in thread
From: Masami Hiramatsu @ 2025-06-10  8:41 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Naresh Kamboju, open list, Linux trace kernel, lkft-triage,
	Stephen Rothwell, Arnd Bergmann, Dan Carpenter, Anders Roxell,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, x86

On Mon, 9 Jun 2025 22:09:34 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

[...]
> Here is the compiled code of _raw_spin_lock.
> 
> ffffffff825daa00 <_raw_spin_lock>:
> ffffffff825daa00:       f3 0f 1e fa             endbr64
> ffffffff825daa04:       e8 47 a6 d5 fe          call   ffffffff81335050 <__fentry__>
> 
> Since int3 exception happens after decoded int3 (1 byte), the RIP
> `_raw_spin_lock+0x05` is not an instruction boundary.
> 
> > <4>[   58.998563] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00
> > 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3
> > 0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15
> > 12 e4 fe
> 
> And the call is already modified back to a 5-bytes nop when we
> dump the code. Thus it may hit the intermediate int3 for transforming
> code.
> 
> e8 47 a6 d5 fe
>  (first step)
> cc 47 a6 d5 fe
>  (second step)
> cc 1f 44 00 00 <- hit?
>  (third step)
> 0f 1f 44 00 00 <- handle int3
> 
> It is very unlikely scenario (and I'm not sure qemu can correctly
> emulate it.) But if a CPU hits the int3 (cc) on _raw_spin_lock()+0x4
> before anoter CPU' runs third step in smp_text_poke_batch_finish(),
> and before the CPU runs smp_text_poke_int3_handler(), the CPU' runs
> the thrid step and sets text_poke_array_refs 0, 
> the smp_text_poke_int3_handler() returns 0 and causes the same
> problem. 
> 
> <CPU0>					<CPU1>
> 					Start smp_text_poke_batch_finish().
> 					Finish second step.
> Hit int3 (*)
> 					Finish third step.
> 					Run smp_text_poke_sync_each_cpu().(**)
> 					Clear text_poke_array_refs[cpu0]
> Start smp_text_poke_int3_handler()
> Failed to get text_poke_array_refs[cpu0]
> Oops: int3
> 
> 
> But as I said it is very unlikely, because as far as I know;
> 
> (*) smp_text_poke_int3_handler() is called directly from exc_int3()
>    which is a kind of NMI, so other interrupt should not run.
> (**) In the third step, smp_text_poke_batch_finish() sends IPI for
>    sync core after removing int3. Thus any int3 exception handling
>    should be finished.

One possible scenario is that the CPU somehow hits the int3 after the
third step (from the I-cache?).

------
<CPU0>					<CPU1>
					Start smp_text_poke_batch_finish().
					Start the third step. (remove INT3)
					on_each_cpu(do_sync_core)
do_sync_core(do SERIALIZE)
					Finish the third step.
Hit INT3 (from I-cache?)
					Clear text_poke_array_refs[cpu0]
Start smp_text_poke_int3_handler()
Failed to get text_poke_array_refs[cpu0]
Oops: int3
------


The SERIALIZE instruction may flush the pipeline, so the processor needs
to reload the instruction. But it is not guaranteed to reload it from
memory, because SERIALIZE does not invalidate the cache.

If that hypothesis is correct, we need to invalidate the cache
(flush TLB) in the third step, before the do_sync_core().

Or, if we are unsure, we can simply rescue the kernel from die("int3")
by retrying with the new instruction when the INT3 has disappeared.
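
The retry idea can be sketched in user space as follows. This is only an
illustration under assumptions, not the patch itself: struct fake_regs
and retry_if_int3_gone() are hypothetical stand-ins, and the "text" is
an ordinary byte array rather than kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INT3_INSN_SIZE 1
#define INT3_OPCODE    0xcc

/* Hypothetical stand-in for the saved register state; only the IP matters. */
struct fake_regs {
	const uint8_t *ip; /* points just past the byte that trapped */
};

/*
 * Returns true and rewinds the IP when the INT3 at the trapped address
 * has already been replaced, so the new instruction runs from its first
 * byte. Returns false when the INT3 is still there, i.e. the trap is a
 * genuinely unhandled breakpoint.
 */
static bool retry_if_int3_gone(struct fake_regs *regs)
{
	const uint8_t *addr;

	assert(regs && regs->ip);
	addr = regs->ip - INT3_INSN_SIZE;
	if (*addr == INT3_OPCODE)
		return false;
	regs->ip = addr;
	return true;
}
```

The check reads the current contents of the trapped address, so it only
helps when the third step has already completed by the time the
unhandled trap is examined.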


Thank you,

-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

* Re: next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
  2025-06-09 13:09 ` Masami Hiramatsu
  2025-06-10  8:41   ` Masami Hiramatsu
@ 2025-06-10 13:20   ` Naresh Kamboju
  2025-06-10 14:43     ` Masami Hiramatsu
  2025-06-10 14:53     ` next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic Steven Rostedt
  1 sibling, 2 replies; 34+ messages in thread
From: Naresh Kamboju @ 2025-06-10 13:20 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: open list, Linux trace kernel, lkft-triage, Stephen Rothwell,
	Arnd Bergmann, Dan Carpenter, Anders Roxell

On Mon, 9 Jun 2025 at 18:39, Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Thu, 5 Jun 2025 17:12:10 +0530
> Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
>
> > Regressions found on qemu-x86_64 with compat mode (64-bit kernel
> > running on 32-bit userspace) while running LTP tracing test suite
> > on Linux next-20250605 tag kernel.
> >
> > Regressions found on
> >  - LTP tracing
> >
> > Regression Analysis:
> >  - New regression? Yes
> >  - Reproducible? Intermittent
> >
> > Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
> >
> > Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> >
> > ## Test log
> > ftrace-stress-test: <12>[   21.971153] /usr/local/bin/kirk[277]:
> > starting test ftrace-stress-test (ftrace_stress_test.sh 90)
> > <4>[   58.997439] Oops: int3: 0000 [#1] SMP PTI
> > <4>[   58.998089] CPU: 0 UID: 0 PID: 323 Comm: sh Not tainted
> > 6.15.0-next-20250605 #1 PREEMPT(voluntary)
> > <4>[   58.998152] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
> > BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> > <4>[   58.998260] RIP: 0010:_raw_spin_lock+0x5/0x50
>
> Interesting. This hits a stray int3 for ftrace on _raw_spin_lock.
>
> Here is the compiled code of _raw_spin_lock.
>
> ffffffff825daa00 <_raw_spin_lock>:
> ffffffff825daa00:       f3 0f 1e fa             endbr64
> ffffffff825daa04:       e8 47 a6 d5 fe          call   ffffffff81335050 <__fentry__>
>
> Since int3 exception happens after decoded int3 (1 byte), the RIP
> `_raw_spin_lock+0x05` is not an instruction boundary.
>
> > <4>[   58.998563] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00
> > 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3
> > 0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15
> > 12 e4 fe
>
> And the call is already modified back to a 5-bytes nop when we
> dump the code. Thus it may hit the intermediate int3 for transforming
> code.
>
> e8 47 a6 d5 fe
>  (first step)
> cc 47 a6 d5 fe
>  (second step)
> cc 1f 44 00 00 <- hit?
>  (third step)
> 0f 1f 44 00 00 <- handle int3
>
> It is very unlikely scenario (and I'm not sure qemu can correctly
> emulate it.) But if a CPU hits the int3 (cc) on _raw_spin_lock()+0x4
> before anoter CPU' runs third step in smp_text_poke_batch_finish(),
> and before the CPU runs smp_text_poke_int3_handler(), the CPU' runs
> the thrid step and sets text_poke_array_refs 0,
> the smp_text_poke_int3_handler() returns 0 and causes the same
> problem.
>
> <CPU0>                                  <CPU1>
>                                         Start smp_text_poke_batch_finish().
>                                         Finish second step.
> Hit int3 (*)
>                                         Finish third step.
>                                         Run smp_text_poke_sync_each_cpu().(**)
>                                         Clear text_poke_array_refs[cpu0]
> Start smp_text_poke_int3_handler()
> Failed to get text_poke_array_refs[cpu0]
> Oops: int3
>
>
> But as I said it is very unlikely, because as far as I know;
>
> (*) smp_text_poke_int3_handler() is called directly from exc_int3()
>    which is a kind of NMI, so other interrupt should not run.
> (**) In the third step, smp_text_poke_batch_finish() sends IPI for
>    sync core after removing int3. Thus any int3 exception handling
>    should be finished.
>
> Is this bug reproducible easier recently?

Yes. It is easy to reproduce.

>
> Thanks,
>
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>

- Naresh

* Re: next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
  2025-06-10  8:41   ` Masami Hiramatsu
@ 2025-06-10 13:25     ` Steven Rostedt
  0 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2025-06-10 13:25 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Naresh Kamboju, open list, Linux trace kernel, lkft-triage,
	Stephen Rothwell, Arnd Bergmann, Dan Carpenter, Anders Roxell,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, x86

On Tue, 10 Jun 2025 17:41:36 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> SERIALIZE instruction may flash pipeline, thus the processor needs
> to reload the instruction. But it is not ensured to reload it from
> memory because SERIALIZE does not invalidate the cache.

From my understanding, an IPI on a CPU is equivalent to a smp_mb() on that
CPU. There shouldn't be any need for flushing the cache.

> 
> If that hypotheses is correct, we need to invalidate the cache
> (flush TLB) in the third step, before the do_sync_core().

I'm not sure how the TLB would be affected.

-- Steve

> 
> Or, if it is unsure, we can just evacuate the kernel from die("int3")
> by retrying the new instruction, when the INT3 is disappeared.


* Re: next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
  2025-06-10 13:20   ` Naresh Kamboju
@ 2025-06-10 14:43     ` Masami Hiramatsu
  2025-06-10 14:47       ` [RFC PATCH 1/2] x86: Retry with new instruction if INT3 is disappaered Masami Hiramatsu (Google)
  2025-06-10 14:47       ` [RFC PATCH 2/2] x86: alternative: Invalidate the cache for updated instructions Masami Hiramatsu (Google)
  2025-06-10 14:53     ` next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic Steven Rostedt
  1 sibling, 2 replies; 34+ messages in thread
From: Masami Hiramatsu @ 2025-06-10 14:43 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: open list, Linux trace kernel, lkft-triage, Stephen Rothwell,
	Arnd Bergmann, Dan Carpenter, Anders Roxell, Steven Rostedt,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, x86

On Tue, 10 Jun 2025 18:50:05 +0530
Naresh Kamboju <naresh.kamboju@linaro.org> wrote:

> On Mon, 9 Jun 2025 at 18:39, Masami Hiramatsu <mhiramat@kernel.org> wrote:
> >
> > On Thu, 5 Jun 2025 17:12:10 +0530
> > Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
> >
> > > Regressions found on qemu-x86_64 with compat mode (64-bit kernel
> > > running on 32-bit userspace) while running LTP tracing test suite
> > > on Linux next-20250605 tag kernel.
> > >
> > > Regressions found on
> > >  - LTP tracing
> > >
> > > Regression Analysis:
> > >  - New regression? Yes
> > >  - Reproducible? Intermittent
> > >
> > > Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
> > >
> > > Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> > >
> > > ## Test log
> > > ftrace-stress-test: <12>[   21.971153] /usr/local/bin/kirk[277]:
> > > starting test ftrace-stress-test (ftrace_stress_test.sh 90)
> > > <4>[   58.997439] Oops: int3: 0000 [#1] SMP PTI
> > > <4>[   58.998089] CPU: 0 UID: 0 PID: 323 Comm: sh Not tainted
> > > 6.15.0-next-20250605 #1 PREEMPT(voluntary)
> > > <4>[   58.998152] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
> > > BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> > > <4>[   58.998260] RIP: 0010:_raw_spin_lock+0x5/0x50
> >
> > Interesting. This hits a stray int3 for ftrace on _raw_spin_lock.
> >
> > Here is the compiled code of _raw_spin_lock.
> >
> > ffffffff825daa00 <_raw_spin_lock>:
> > ffffffff825daa00:       f3 0f 1e fa             endbr64
> > ffffffff825daa04:       e8 47 a6 d5 fe          call   ffffffff81335050 <__fentry__>
> >
> > Since int3 exception happens after decoded int3 (1 byte), the RIP
> > `_raw_spin_lock+0x05` is not an instruction boundary.
> >
> > > <4>[   58.998563] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00
> > > 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3
> > > 0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15
> > > 12 e4 fe
> >
> > And the call is already modified back to a 5-bytes nop when we
> > dump the code. Thus it may hit the intermediate int3 for transforming
> > code.
> >
> > e8 47 a6 d5 fe
> >  (first step)
> > cc 47 a6 d5 fe
> >  (second step)
> > cc 1f 44 00 00 <- hit?
> >  (third step)
> > 0f 1f 44 00 00 <- handle int3
> >
> > It is very unlikely scenario (and I'm not sure qemu can correctly
> > emulate it.) But if a CPU hits the int3 (cc) on _raw_spin_lock()+0x4
> > before anoter CPU' runs third step in smp_text_poke_batch_finish(),
> > and before the CPU runs smp_text_poke_int3_handler(), the CPU' runs
> > the thrid step and sets text_poke_array_refs 0,
> > the smp_text_poke_int3_handler() returns 0 and causes the same
> > problem.
> >
> > <CPU0>                                  <CPU1>
> >                                         Start smp_text_poke_batch_finish().
> >                                         Finish second step.
> > Hit int3 (*)
> >                                         Finish third step.
> >                                         Run smp_text_poke_sync_each_cpu().(**)
> >                                         Clear text_poke_array_refs[cpu0]
> > Start smp_text_poke_int3_handler()
> > Failed to get text_poke_array_refs[cpu0]
> > Oops: int3
> >
> >
> > But as I said it is very unlikely, because as far as I know;
> >
> > (*) smp_text_poke_int3_handler() is called directly from exc_int3()
> >    which is a kind of NMI, so other interrupt should not run.
> > (**) In the third step, smp_text_poke_batch_finish() sends IPI for
> >    sync core after removing int3. Thus any int3 exception handling
> >    should be finished.
> >
> > Is this bug reproducible easier recently?
> 
> Yes. It is easy to reproduce.

Good, can you test the following 2 patches (I'll send a series)?
I think [1/2] may avoid the kernel crash, but still shows a
warning, and [2/2] may fix it if my guess is correct.

Thank you,

> 
> >
> > Thanks,
> >
> > --
> > Masami Hiramatsu (Google) <mhiramat@kernel.org>
> 
> - Naresh


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

* [RFC PATCH 1/2] x86: Retry with new instruction if INT3 is disappaered
  2025-06-10 14:43     ` Masami Hiramatsu
@ 2025-06-10 14:47       ` Masami Hiramatsu (Google)
  2025-06-10 14:47       ` [RFC PATCH 2/2] x86: alternative: Invalidate the cache for updated instructions Masami Hiramatsu (Google)
  1 sibling, 0 replies; 34+ messages in thread
From: Masami Hiramatsu (Google) @ 2025-06-10 14:47 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	Dave Hansen
  Cc: Steven Rostedt, x86, Naresh Kamboju, open list,
	Linux trace kernel, lkft-triage, Stephen Rothwell, Arnd Bergmann,
	Dan Carpenter, Anders Roxell

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

An Oops caused by a stray INT3 is reported by LKFT.

 ## Test log
 ftrace-stress-test: <12>[   21.971153] /usr/local/bin/kirk[277]:
 starting test ftrace-stress-test (ftrace_stress_test.sh 90)
 <4>[   58.997439] Oops: int3: 0000 [#1] SMP PTI
 <4>[   58.998089] CPU: 0 UID: 0 PID: 323 Comm: sh Not tainted
 6.15.0-next-20250605 #1 PREEMPT(voluntary)
 <4>[   58.998152] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
 BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 <4>[   58.998260] RIP: 0010:_raw_spin_lock+0x5/0x50
 <4>[   58.998563] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00
 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3
 0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15
 12 e4 fe

But no INT3 (0xcc) is shown in the dumped code. This means an INT3
exception can still be handled after the INT3 has been replaced with
the original instruction.

To rescue the kernel from this situation, when the kernel fails to
handle the INT3, check whether there is still an INT3 at the trapped
address. If there isn't, retry executing the new instruction.

Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Closes: https://lore.kernel.org/all/CA+G9fYsLu0roY3DV=tKyqP7FEKbOEETRvTDhnpPxJGbA=Cg+4w@mail.gmail.com/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 arch/x86/kernel/traps.c |   25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index c5c897a86418..f489e86c1b5e 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -880,6 +880,29 @@ static void do_int3_user(struct pt_regs *regs)
 	cond_local_irq_disable(regs);
 }
 
+static int handle_disappeared_int3(struct pt_regs *regs)
+{
+	unsigned long addr = instruction_pointer(regs) - INT3_INSN_SIZE;
+	unsigned char opcode;
+	int ret;
+
+	/*
+	 * Evacuate the kernel from disappeared int3, which was there when
+	 * the exception happens, but it is removed now by another CPU.
+	 */
+	ret = copy_from_kernel_nofault(&opcode, (void *)addr, INT3_INSN_SIZE);
+	if (ret < 0)
+		return ret;
+	if (opcode == INT3_INSN_OPCODE)
+		return -EFAULT;
+
+	/* There is no INT3 here. Retry with the new instruction. */
+	WARN_ONCE(1, "A disappeared INT3 was handled at %pS.", (void *)addr);
+	instruction_pointer_set(regs, addr);
+
+	return 0;
+}
+
 DEFINE_IDTENTRY_RAW(exc_int3)
 {
 	/*
@@ -907,7 +930,7 @@ DEFINE_IDTENTRY_RAW(exc_int3)
 		irqentry_state_t irq_state = irqentry_nmi_enter(regs);
 
 		instrumentation_begin();
-		if (!do_int3(regs))
+		if (!do_int3(regs) && handle_disappeared_int3(regs) < 0)
 			die("int3", regs, 0);
 		instrumentation_end();
 		irqentry_nmi_exit(regs, irq_state);
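
The check added by handle_disappeared_int3() above can be sketched in
user space as follows. This is only an illustrative model, not the
kernel code: struct fake_regs, retry_if_int3_gone() and the use of an
array offset instead of a real address are all assumptions made for the
example. The real patch reads the byte with copy_from_kernel_nofault()
and works on struct pt_regs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INT3_INSN_OPCODE 0xCC
#define INT3_INSN_SIZE   1

/* Hypothetical stand-in for struct pt_regs: only the instruction pointer. */
struct fake_regs {
	size_t ip;	/* offset into text[], not a real address */
};

/*
 * Returns 0 and rewinds ip if the trapping byte is no longer INT3.
 * Returns -1 if the INT3 is still there (a genuinely unhandled
 * breakpoint) or if ip does not point inside text[].
 */
static int retry_if_int3_gone(struct fake_regs *regs,
			      const uint8_t *text, size_t len)
{
	size_t addr;

	assert(regs && text);

	/* After INT3 traps, ip points just past the one-byte opcode. */
	if (regs->ip < INT3_INSN_SIZE || regs->ip > len)
		return -1;
	addr = regs->ip - INT3_INSN_SIZE;

	if (text[addr] == INT3_INSN_OPCODE)
		return -1;

	/* The INT3 was replaced in the meantime: re-execute from addr. */
	regs->ip = addr;
	return 0;
}
```

With the 5-byte NOP from the dump (0f 1f 44 00 00) as text[] and ip = 1,
the function returns 0 and sets ip back to 0. If the first byte is still
0xcc it returns -1 and leaves ip untouched, which corresponds to the
die("int3", ...) path.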



* [RFC PATCH 2/2] x86: alternative: Invalidate the cache for updated instructions
  2025-06-10 14:43     ` Masami Hiramatsu
  2025-06-10 14:47       ` [RFC PATCH 1/2] x86: Retry with new instruction if INT3 is disappeared Masami Hiramatsu (Google)
@ 2025-06-10 14:47       ` Masami Hiramatsu (Google)
  2025-06-10 15:50         ` Steven Rostedt
  2025-06-11 11:30         ` Peter Zijlstra
  1 sibling, 2 replies; 34+ messages in thread
From: Masami Hiramatsu (Google) @ 2025-06-10 14:47 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	Dave Hansen
  Cc: Steven Rostedt, x86, Naresh Kamboju, open list,
	Linux trace kernel, lkft-triage, Stephen Rothwell, Arnd Bergmann,
	Dan Carpenter, Anders Roxell

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Invalidate the cache after replacing INT3 with the new instruction.
This will prevent the other CPUs from seeing the removed INT3 in
their cache after serializing the pipeline.

LKFT reported an oops caused by INT3, but there is no INT3 shown in
the dumped code. This means the INT3 was removed after the CPU hit it.

 ## Test log
 ftrace-stress-test: <12>[   21.971153] /usr/local/bin/kirk[277]:
 starting test ftrace-stress-test (ftrace_stress_test.sh 90)
 <4>[   58.997439] Oops: int3: 0000 [#1] SMP PTI
 <4>[   58.998089] CPU: 0 UID: 0 PID: 323 Comm: sh Not tainted
 6.15.0-next-20250605 #1 PREEMPT(voluntary)
 <4>[   58.998152] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
 BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 <4>[   58.998260] RIP: 0010:_raw_spin_lock+0x5/0x50
 <4>[   58.998563] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00
 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3
 0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15
 12 e4 fe

One possible scenario is that the CPU somehow hits the INT3 after the
third step (from the I-cache).

------
<CPU0>					<CPU1>
					Start smp_text_poke_batch_finish().
					Start the third step. (remove INT3)
					on_each_cpu(do_sync_core)
do_sync_core(do SERIALIZE)
					Finish the third step.
Hit INT3 (from I-cache)
					Clear text_poke_array_refs[cpu0]
Start smp_text_poke_int3_handler()
Failed to get text_poke_array_refs[cpu0]
Oops: int3
------

The SERIALIZE instruction flushes the pipeline, thus the processor
needs to reload the instruction. But it is not guaranteed to reload it
from memory because SERIALIZE does not invalidate the cache.

To prevent reloading the replaced INT3, we need to invalidate the cache
(flush TLB) in the third step, before the do_sync_core().

Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Closes: https://lore.kernel.org/all/CA+G9fYsLu0roY3DV=tKyqP7FEKbOEETRvTDhnpPxJGbA=Cg+4w@mail.gmail.com/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 arch/x86/kernel/alternative.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index ecfe7b497cad..1b606db48017 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -2949,8 +2949,16 @@ void smp_text_poke_batch_finish(void)
 		do_sync++;
 	}
 
-	if (do_sync)
+	if (do_sync) {
+		/*
+		 * Flush the instructions on the cache, then serialize the
+		 * pipeline of each CPU.
+		 */
+		flush_tlb_kernel_range((unsigned long)text_poke_addr(&text_poke_array.vec[0]),
+				       (unsigned long)text_poke_addr(text_poke_array.vec +
+								text_poke_array.nr_entries - 1));
 		smp_text_poke_sync_each_cpu();
+	}
 
 	/*
 	 * Remove and wait for refs to be zero.



* Re: next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
  2025-06-10 13:20   ` Naresh Kamboju
  2025-06-10 14:43     ` Masami Hiramatsu
@ 2025-06-10 14:53     ` Steven Rostedt
  2025-06-12 13:09       ` Naresh Kamboju
  1 sibling, 1 reply; 34+ messages in thread
From: Steven Rostedt @ 2025-06-10 14:53 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: Masami Hiramatsu, open list, Linux trace kernel, lkft-triage,
	Stephen Rothwell, Arnd Bergmann, Dan Carpenter, Anders Roxell

On Tue, 10 Jun 2025 18:50:05 +0530
Naresh Kamboju <naresh.kamboju@linaro.org> wrote:

> > Is this bug reproducible easier recently?  
> 
> Yes. It is easy to reproduce.

Can you test before and after this commit:

  4334336e769b ("x86/alternatives: Improve code-patching scalability by
  removing false sharing in poke_int3_handler()")

I think that may be the culprit.

Even if Masami's patches work, I want to know what exactly caused it.

-- Steve


* Re: [RFC PATCH 2/2] x86: alternative: Invalidate the cache for updated instructions
  2025-06-10 14:47       ` [RFC PATCH 2/2] x86: alternative: Invalidate the cache for updated instructions Masami Hiramatsu (Google)
@ 2025-06-10 15:50         ` Steven Rostedt
  2025-06-11  0:21           ` Masami Hiramatsu
  2025-06-11 10:26           ` Masami Hiramatsu
  2025-06-11 11:30         ` Peter Zijlstra
  1 sibling, 2 replies; 34+ messages in thread
From: Steven Rostedt @ 2025-06-10 15:50 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, x86, Naresh Kamboju, open list, Linux trace kernel,
	lkft-triage, Stephen Rothwell, Arnd Bergmann, Dan Carpenter,
	Anders Roxell

On Tue, 10 Jun 2025 23:47:48 +0900
"Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:

> Maybe one possible scenario is to hit the int3 after the third step
> somehow (on I-cache).
> 
> ------
> <CPU0>					<CPU1>
> 					Start smp_text_poke_batch_finish().
> 					Start the third step. (remove INT3)
> 					on_each_cpu(do_sync_core)
> do_sync_core(do SERIALIZE)
> 					Finish the third step.
> Hit INT3 (from I-cache)
> 					Clear text_poke_array_refs[cpu0]
> Start smp_text_poke_int3_handler()

I believe your analysis describes the issue here. The commit that changed
the ref counter from a global to per-CPU didn't cause the issue, it just
made the race window bigger.

> Failed to get text_poke_array_refs[cpu0]
> Oops: int3
> ------
> 
> SERIALIZE instruction flashes pipeline, thus the processor needs
> to reload the instruction. But it is not ensured to reload it from
> memory because SERIALIZE does not invalidate the cache.
> 
> To prevent reloading replaced INT3, we need to invalidate the cache
> (flush TLB) in the third step, before the do_sync_core().
> 
> Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> Closes: https://lore.kernel.org/all/CA+G9fYsLu0roY3DV=tKyqP7FEKbOEETRvTDhnpPxJGbA=Cg+4w@mail.gmail.com/
> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> ---
>  arch/x86/kernel/alternative.c |   10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index ecfe7b497cad..1b606db48017 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -2949,8 +2949,16 @@ void smp_text_poke_batch_finish(void)
>  		do_sync++;
>  	}
>  
> -	if (do_sync)
> +	if (do_sync) {
> +		/*
> +		 * Flush the instructions on the cache, then serialize the
> +		 * pipeline of each CPU.

The IPI interrupt should flush the cache. And the TLB should not be an
issue here. If anything, this may work just because it will make the race
smaller. 

I'm thinking this may be a QEMU bug. If QEMU doesn't flush the icache on an
IPI then this would indeed be a problem.

-- Steve


> +		 */
> +		flush_tlb_kernel_range((unsigned long)text_poke_addr(&text_poke_array.vec[0]),
> +				       (unsigned long)text_poke_addr(text_poke_array.vec +
> +								text_poke_array.nr_entries - 1));
>  		smp_text_poke_sync_each_cpu();
> +	}
>  
>  	/*
>  	 * Remove and wait for refs to be zero.


* Re: [RFC PATCH 2/2] x86: alternative: Invalidate the cache for updated instructions
  2025-06-10 15:50         ` Steven Rostedt
@ 2025-06-11  0:21           ` Masami Hiramatsu
  2025-06-11 10:26           ` Masami Hiramatsu
  1 sibling, 0 replies; 34+ messages in thread
From: Masami Hiramatsu @ 2025-06-11  0:21 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, x86, Naresh Kamboju, open list, Linux trace kernel,
	lkft-triage, Stephen Rothwell, Arnd Bergmann, Dan Carpenter,
	Anders Roxell

On Tue, 10 Jun 2025 11:50:30 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 10 Jun 2025 23:47:48 +0900
> "Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
> 
> > Maybe one possible scenario is to hit the int3 after the third step
> > somehow (on I-cache).
> > 
> > ------
> > <CPU0>					<CPU1>
> > 					Start smp_text_poke_batch_finish().
> > 					Start the third step. (remove INT3)
> > 					on_each_cpu(do_sync_core)
> > do_sync_core(do SERIALIZE)
> > 					Finish the third step.
> > Hit INT3 (from I-cache)
> > 					Clear text_poke_array_refs[cpu0]
> > Start smp_text_poke_int3_handler()
> 
> I believe your analysis is the issue here. The commit that changed the ref
> counter from a global to per cpu didn't cause the issue, it just made the
> race window bigger.

Agreed. That commit is suspicious, but even so, as you said, it
might just make the bug easier to hit. I wrote the refcount as a
per-cpu array here because that is what the current code does.

> 
> > Failed to get text_poke_array_refs[cpu0]
> > Oops: int3
> > ------
> > 
> > SERIALIZE instruction flashes pipeline, thus the processor needs
> > to reload the instruction. But it is not ensured to reload it from
> > memory because SERIALIZE does not invalidate the cache.
> > 
> > To prevent reloading replaced INT3, we need to invalidate the cache
> > (flush TLB) in the third step, before the do_sync_core().
> > 
> > Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> > Closes: https://lore.kernel.org/all/CA+G9fYsLu0roY3DV=tKyqP7FEKbOEETRvTDhnpPxJGbA=Cg+4w@mail.gmail.com/
> > Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > ---
> >  arch/x86/kernel/alternative.c |   10 +++++++++-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> > index ecfe7b497cad..1b606db48017 100644
> > --- a/arch/x86/kernel/alternative.c
> > +++ b/arch/x86/kernel/alternative.c
> > @@ -2949,8 +2949,16 @@ void smp_text_poke_batch_finish(void)
> >  		do_sync++;
> >  	}
> >  
> > -	if (do_sync)
> > +	if (do_sync) {
> > +		/*
> > +		 * Flush the instructions on the cache, then serialize the
> > +		 * pipeline of each CPU.
> 
> The IPI interrupt should flush the cache. And the TLB should not be an
> issue here. If anything, this may work just because it will make the race
> smaller. 

I'm not sure; I'm looking it up in the Intel SDM.

> 
> I'm thinking this may be a QEMU bug. If QEMU doesn't flush the icache on an
> IPI then this would indeed be an problem.

Does QEMU manage its own icache? (Is it even possible to manage it?)
Also, I guess it is using KVM to run the VM, so the actual cache and
TLB operations are done by KVM.

Thanks,

> 
> -- Steve
> 
> 
> > +		 */
> > +		flush_tlb_kernel_range((unsigned long)text_poke_addr(&text_poke_array.vec[0]),
> > +				       (unsigned long)text_poke_addr(text_poke_array.vec +
> > +								text_poke_array.nr_entries - 1));
> >  		smp_text_poke_sync_each_cpu();
> > +	}
> >  
> >  	/*
> >  	 * Remove and wait for refs to be zero.
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: [RFC PATCH 2/2] x86: alternative: Invalidate the cache for updated instructions
  2025-06-10 15:50         ` Steven Rostedt
  2025-06-11  0:21           ` Masami Hiramatsu
@ 2025-06-11 10:26           ` Masami Hiramatsu
  2025-06-11 14:20             ` Steven Rostedt
  1 sibling, 1 reply; 34+ messages in thread
From: Masami Hiramatsu @ 2025-06-11 10:26 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, x86, Naresh Kamboju, open list, Linux trace kernel,
	lkft-triage, Stephen Rothwell, Arnd Bergmann, Dan Carpenter,
	Anders Roxell

On Tue, 10 Jun 2025 11:50:30 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 10 Jun 2025 23:47:48 +0900
> "Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
> 
> > Maybe one possible scenario is to hit the int3 after the third step
> > somehow (on I-cache).
> > 
> > ------
> > <CPU0>					<CPU1>
> > 					Start smp_text_poke_batch_finish().
> > 					Start the third step. (remove INT3)
> > 					on_each_cpu(do_sync_core)
> > do_sync_core(do SERIALIZE)
> > 					Finish the third step.
> > Hit INT3 (from I-cache)
> > 					Clear text_poke_array_refs[cpu0]
> > Start smp_text_poke_int3_handler()
> 
> I believe your analysis is the issue here. The commit that changed the ref
> counter from a global to per cpu didn't cause the issue, it just made the
> race window bigger.
> 

Ah, OK. That seems easier to explain. Since we use a trap gate
for #BP, it does not clear IF automatically. Thus there is a
time window between executing the INT3 from the icache (or one
already in the pipeline) and its handler disabling interrupts.
If the IPI is received in that window, this bug happens.

<CPU0>					<CPU1>
					Start smp_text_poke_batch_finish().
					Start the third step. (remove INT3)
Hit INT3 (from icache/pipeline)
					on_each_cpu(do_sync_core)
----
do_sync_core(do SERIALIZE)
----
					Finish the third step.
Handle #BP including CLI
					Clear text_poke_array_refs[cpu0]
preparing stack
Start smp_text_poke_int3_handler()
Failed to get text_poke_array_refs[cpu0]

In this case, the per-cpu text_poke_array_refs will make the time
window bigger because clearing text_poke_array_refs is faster.

If this is correct, flushing the cache does not matter (it
can make the window smaller).

One possible solution is to send the IPI again, which ensures that
the current #BP handler has exited. It can make the window small enough.

Another solution is removing WARN_ONCE() from [1/2], which
means we accept this scenario but avoid a catastrophic result.
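
The last two lines of the diagram above ("Clear text_poke_array_refs"
followed by "Failed to get text_poke_array_refs") can be replayed with
a minimal single-threaded user-space model. This is only a sketch under
stated assumptions: cpu0_refs, try_get_ref(), patcher_arm(),
patcher_disarm() and int3_handler() are hypothetical names, C11 atomics
stand in for the kernel's atomic_t helpers, and the patcher side is
simplified (it does not wait for the count to drain).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of one CPU's slot in text_poke_array_refs. */
static atomic_int cpu0_refs;

/* Take a reference only if the count is still non-zero. */
static bool try_get_ref(atomic_int *refs)
{
	int old = atomic_load(refs);

	while (old != 0) {
		/* On failure, old is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(refs, &old, old + 1))
			return true;
	}
	return false;
}

static void put_ref(atomic_int *refs)
{
	atomic_fetch_sub(refs, 1);
}

/* Patcher side: arm before writing INT3, disarm after the last sync. */
static void patcher_arm(atomic_int *refs)
{
	atomic_store(refs, 1);
}

static void patcher_disarm(atomic_int *refs)
{
	atomic_fetch_sub(refs, 1);
}

/* Handler side: 1 = handled, 0 = not ours (the caller would oops). */
static int int3_handler(atomic_int *refs)
{
	if (!try_get_ref(refs))
		return 0;
	/* ... look up the trapping address and emulate the insn ... */
	put_ref(refs);
	return 1;
}
```

Calling int3_handler() between patcher_arm() and patcher_disarm()
returns 1. Calling it after patcher_disarm() returns 0, which is the
"Oops: int3" case: the trap was taken on a patching INT3, but by the
time the handler looks at the refcount the patcher has already cleared
it.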

Thank you,

-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: [RFC PATCH 2/2] x86: alternative: Invalidate the cache for updated instructions
  2025-06-10 14:47       ` [RFC PATCH 2/2] x86: alternative: Invalidate the cache for updated instructions Masami Hiramatsu (Google)
  2025-06-10 15:50         ` Steven Rostedt
@ 2025-06-11 11:30         ` Peter Zijlstra
  2025-06-12  0:17           ` Masami Hiramatsu
  1 sibling, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2025-06-11 11:30 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Ingo Molnar, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	Steven Rostedt, x86, Naresh Kamboju, open list,
	Linux trace kernel, lkft-triage, Stephen Rothwell, Arnd Bergmann,
	Dan Carpenter, Anders Roxell

On Tue, Jun 10, 2025 at 11:47:48PM +0900, Masami Hiramatsu (Google) wrote:
> From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> 
> Invalidate the cache after replacing INT3 with the new instruction.
> This will prevent the other CPUs seeing the removed INT3 in their
> cache after serializing the pipeline.
> 
> LKFT reported an oops by INT3 but there is no INT3 shown in the
> dumped code. This means the INT3 is removed after the CPU hits
> INT3.
> 
>  ## Test log
>  ftrace-stress-test: <12>[   21.971153] /usr/local/bin/kirk[277]:
>  starting test ftrace-stress-test (ftrace_stress_test.sh 90)
>  <4>[   58.997439] Oops: int3: 0000 [#1] SMP PTI
>  <4>[   58.998089] CPU: 0 UID: 0 PID: 323 Comm: sh Not tainted
>  6.15.0-next-20250605 #1 PREEMPT(voluntary)
>  <4>[   58.998152] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
>  BIOS 1.16.3-debian-1.16.3-2 04/01/2014
>  <4>[   58.998260] RIP: 0010:_raw_spin_lock+0x5/0x50
>  <4>[   58.998563] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00
>  00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3
>  0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15
>  12 e4 fe
> 
> Maybe one possible scenario is to hit the int3 after the third step
> somehow (on I-cache).
> 
> ------
> <CPU0>					<CPU1>
> 					Start smp_text_poke_batch_finish().
> 					Start the third step. (remove INT3)
> 					on_each_cpu(do_sync_core)
> do_sync_core(do SERIALIZE)
> 					Finish the third step.
> Hit INT3 (from I-cache)
> 					Clear text_poke_array_refs[cpu0]
> Start smp_text_poke_int3_handler()
> Failed to get text_poke_array_refs[cpu0]
> Oops: int3
> ------
> 
> SERIALIZE instruction flashes pipeline, thus the processor needs
> to reload the instruction. But it is not ensured to reload it from
> memory because SERIALIZE does not invalidate the cache.
> 
> To prevent reloading replaced INT3, we need to invalidate the cache
> (flush TLB) in the third step, before the do_sync_core().

This sounds all sorts of wrong. x86 is supposed to be cache-coherent. A
store should cause the invalidation per MESI and all that. This means
the only place where the old instruction can stick around is in the
uarch micro-ops cache and all that, and SERIALIZE will very much flush
those.

Also, TLB flush != I$ flush. There is clflush_cache_range() for this.
But still, this really should not be needed.

Also, this is all qemu, and qemu is known to have gotten this terribly
wrong in the past.

If you all cannot reproduce on real hardware, I'm considering this a
qemu bug.




* Re: [RFC PATCH 2/2] x86: alternative: Invalidate the cache for updated instructions
  2025-06-11 10:26           ` Masami Hiramatsu
@ 2025-06-11 14:20             ` Steven Rostedt
  2025-06-11 15:42               ` Steven Rostedt
  0 siblings, 1 reply; 34+ messages in thread
From: Steven Rostedt @ 2025-06-11 14:20 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, x86, Naresh Kamboju, open list, Linux trace kernel,
	lkft-triage, Stephen Rothwell, Arnd Bergmann, Dan Carpenter,
	Anders Roxell, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, x86


[ I just noticed that you continued on the thread without the x86 folks Cc ]

On Wed, 11 Jun 2025 19:26:10 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> On Tue, 10 Jun 2025 11:50:30 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Tue, 10 Jun 2025 23:47:48 +0900
> > "Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
> >   
> > > Maybe one possible scenario is to hit the int3 after the third step
> > > somehow (on I-cache).
> > > 
> > > ------
> > > <CPU0>					<CPU1>
> > > 					Start smp_text_poke_batch_finish().
> > > 					Start the third step. (remove INT3)
> > > 					on_each_cpu(do_sync_core)
> > > do_sync_core(do SERIALIZE)
> > > 					Finish the third step.
> > > Hit INT3 (from I-cache)
> > > 					Clear text_poke_array_refs[cpu0]
> > > Start smp_text_poke_int3_handler()  
> > 
> > I believe your analysis is the issue here. The commit that changed the ref
> > counter from a global to per cpu didn't cause the issue, it just made the
> > race window bigger.
> >   
> 
> Ah, OK. It seems more easier to explain. Since we use the
> trap gate for #BP, it does not clear the IF automatically.
> Thus there is a time window between executing INT3 on icache
> (or already in the pipeline) and its handler disables
> interrupts. If the IPI is received in the time window,
> this bug happens.
> 
> <CPU0>					<CPU1>
> 					Start smp_text_poke_batch_finish().
> 					Start the third step. (remove INT3)
> Hit INT3 (from icache/pipeline)
> 					on_each_cpu(do_sync_core)
> ----
> do_sync_core(do SERIALIZE)
> ----
> 					Finish the third step.
> Handle #BP including CLI
> 					Clear text_poke_array_refs[cpu0]
> preparing stack
> Start smp_text_poke_int3_handler()
> Failed to get text_poke_array_refs[cpu0]
> 
> In this case, per-cpu text_poke_array_refs will make a time
> window bigger because clearing text_poke_array_refs is faster.
> 
> If this is correct, flushing cache does not matter (it
> can make the window smaller.)
> 
> One possible solution is to send IPI again which ensures the
> current #BP handler exits. It can make the window small enough.
> 
> Another solution is removing WARN_ONCE() from [1/2], which
> means we accept this scenario, but avoid catastrophic result.

If interrupts are enabled when the break point hits and just enters the
int3 handler, does that also mean it can schedule?

If that's the case, then we either have to remove the WARN_ONCE() or we
would have to do something like a synchronize_rcu_tasks().

-- Steve


* Re: [RFC PATCH 2/2] x86: alternative: Invalidate the cache for updated instructions
  2025-06-11 14:20             ` Steven Rostedt
@ 2025-06-11 15:42               ` Steven Rostedt
  2025-06-12  0:04                 ` Masami Hiramatsu
  0 siblings, 1 reply; 34+ messages in thread
From: Steven Rostedt @ 2025-06-11 15:42 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, x86, Naresh Kamboju, open list, Linux trace kernel,
	lkft-triage, Stephen Rothwell, Arnd Bergmann, Dan Carpenter,
	Anders Roxell

On Wed, 11 Jun 2025 10:20:10 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> If interrupts are enabled when the break point hits and just enters the
> int3 handler, does that also mean it can schedule?

I added this:

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index c5c897a86418..0f3153322ad2 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -854,6 +854,8 @@ static bool do_int3(struct pt_regs *regs)
 {
 	int res;
 
+	if (!irqs_disabled())
+		printk("IRQS NOT DISABLED\n");
 #ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
 	if (kgdb_ll_trap(DIE_INT3, "int3", regs, 0, X86_TRAP_BP,
 			 SIGTRAP) == NOTIFY_STOP)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index ecfe7b497cad..2856805d9ed1 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -2728,6 +2728,12 @@ noinstr int smp_text_poke_int3_handler(struct pt_regs *regs)
 	int ret = 0;
 	void *ip;
 
+	if (!irqs_disabled()) {
+		instrumentation_begin();
+		printk("IRQS NOT DISABLED\n");
+		instrumentation_end();
+	}
+
 	if (user_mode(regs))
 		return 0;
 


And it didn't trigger when enabling function tracing. Are you sure
interrupts are enabled here?

-- Steve


* Re: [RFC PATCH 2/2] x86: alternative: Invalidate the cache for updated instructions
  2025-06-11 15:42               ` Steven Rostedt
@ 2025-06-12  0:04                 ` Masami Hiramatsu
  0 siblings, 0 replies; 34+ messages in thread
From: Masami Hiramatsu @ 2025-06-12  0:04 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, x86, Naresh Kamboju, open list, Linux trace kernel,
	lkft-triage, Stephen Rothwell, Arnd Bergmann, Dan Carpenter,
	Anders Roxell

On Wed, 11 Jun 2025 11:42:43 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Wed, 11 Jun 2025 10:20:10 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > If interrupts are enabled when the break point hits and just enters the
> > int3 handler, does that also mean it can schedule?
> 
> I added this:
> 
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index c5c897a86418..0f3153322ad2 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -854,6 +854,8 @@ static bool do_int3(struct pt_regs *regs)
>  {
>  	int res;
>  
> +	if (!irqs_disabled())
> +		printk("IRQS NOT DISABLED\n");
>  #ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
>  	if (kgdb_ll_trap(DIE_INT3, "int3", regs, 0, X86_TRAP_BP,
>  			 SIGTRAP) == NOTIFY_STOP)
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index ecfe7b497cad..2856805d9ed1 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -2728,6 +2728,12 @@ noinstr int smp_text_poke_int3_handler(struct pt_regs *regs)
>  	int ret = 0;
>  	void *ip;
>  
> +	if (!irqs_disabled()) {
> +		instrumentation_begin();
> +		printk("IRQS NOT DISABLED\n");
> +		instrumentation_end();
> +	}
> +
>  	if (user_mode(regs))
>  		return 0;
>  
> 
> 
> And it didn't trigger when enabling function tracing. Are you sure
> interrupts are enabled here?

Oops, I was looking at Xen's code. I confirmed that asm_exc_int3
is registered as GATE_INTERRUPT. Hmm. So this might be a QEMU
bug as Peter said, because there is no chance for the IPI to
interrupt after hitting #BP.

Thank you,

> 
> -- Steve
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: [RFC PATCH 2/2] x86: alternative: Invalidate the cache for updated instructions
  2025-06-11 11:30         ` Peter Zijlstra
@ 2025-06-12  0:17           ` Masami Hiramatsu
  2025-06-12 16:24             ` Naresh Kamboju
  0 siblings, 1 reply; 34+ messages in thread
From: Masami Hiramatsu @ 2025-06-12  0:17 UTC (permalink / raw)
  To: Peter Zijlstra, Naresh Kamboju
  Cc: Ingo Molnar, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	Steven Rostedt, x86, Naresh Kamboju, open list,
	Linux trace kernel, lkft-triage, Stephen Rothwell, Arnd Bergmann,
	Dan Carpenter, Anders Roxell

On Wed, 11 Jun 2025 13:30:01 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Jun 10, 2025 at 11:47:48PM +0900, Masami Hiramatsu (Google) wrote:
> > From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > 
> > Invalidate the cache after replacing INT3 with the new instruction.
> > This will prevent the other CPUs seeing the removed INT3 in their
> > cache after serializing the pipeline.
> > 
> > LKFT reported an oops by INT3 but there is no INT3 shown in the
> > dumped code. This means the INT3 is removed after the CPU hits
> > INT3.
> > 
> >  ## Test log
> >  ftrace-stress-test: <12>[   21.971153] /usr/local/bin/kirk[277]:
> >  starting test ftrace-stress-test (ftrace_stress_test.sh 90)
> >  <4>[   58.997439] Oops: int3: 0000 [#1] SMP PTI
> >  <4>[   58.998089] CPU: 0 UID: 0 PID: 323 Comm: sh Not tainted
> >  6.15.0-next-20250605 #1 PREEMPT(voluntary)
> >  <4>[   58.998152] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
> >  BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> >  <4>[   58.998260] RIP: 0010:_raw_spin_lock+0x5/0x50
> >  <4>[   58.998563] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00
> >  00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3
> >  0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15
> >  12 e4 fe
> > 
> > Maybe one possible scenario is to hit the int3 after the third step
> > somehow (on I-cache).
> > 
> > ------
> > <CPU0>					<CPU1>
> > 					Start smp_text_poke_batch_finish().
> > 					Start the third step. (remove INT3)
> > 					on_each_cpu(do_sync_core)
> > do_sync_core(do SERIALIZE)
> > 					Finish the third step.
> > Hit INT3 (from I-cache)
> > 					Clear text_poke_array_refs[cpu0]
> > Start smp_text_poke_int3_handler()
> > Failed to get text_poke_array_refs[cpu0]
> > Oops: int3
> > ------
> > 
> > SERIALIZE instruction flashes pipeline, thus the processor needs
> > to reload the instruction. But it is not ensured to reload it from
> > memory because SERIALIZE does not invalidate the cache.
> > 
> > To prevent reloading replaced INT3, we need to invalidate the cache
> > (flush TLB) in the third step, before the do_sync_core().
> 
> This sounds all sorts of wrong. x86 is supposed to be cache-coherent. A
> store should cause the invalidation per MESI and all that. This means
> the only place where the old instruction can stick around is in the
> uarch micro-ops cache and all that, and SERIALIZE will very much flush
> those.

OK, thanks for pointing it out!

> 
> Also, TLB flush != I$ flush. There is clflush_cache_range() for this.
> But still, this really should not be needed.
> 
> Also, this is all qemu, and qemu is known to have gotten this terribly
> wrong in the past.

What about KVM? We need to ask Naresh how it is running on the machine.
Naresh, can you tell us how the VM is running? Does it use KVM?
And if so, how is KVM configured (it may depend on the real hardware)?

> 
> If you all cannot reproduce on real hardware, I'm considering this a
> qemu bug.

OK, if it is a QEMU bug, I will drop [2/2], but I think we still need
[1/2] to avoid the kernel crash (with a warning message but without a dump).

Thank you,

> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
  2025-06-10 14:53     ` next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic Steven Rostedt
@ 2025-06-12 13:09       ` Naresh Kamboju
  2025-06-13  8:27         ` Masami Hiramatsu
  0 siblings, 1 reply; 34+ messages in thread
From: Naresh Kamboju @ 2025-06-12 13:09 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, open list, Linux trace kernel, lkft-triage,
	Stephen Rothwell, Arnd Bergmann, Dan Carpenter, Anders Roxell

On Tue, 10 Jun 2025 at 20:22, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 10 Jun 2025 18:50:05 +0530
> Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
>
> > > Is this bug reproducible easier recently?
> >
> > Yes. It is easy to reproduce.
>
> Can you test before and after this commit:
>
>   4334336e769b ("x86/alternatives: Improve code-patching scalability by
>   removing false sharing in poke_int3_handler()")
>
> I think that may be the culprit.
>
> Even if Masami's patches work, I want to know what exactly caused it.

Steven,

Since the reported regressions are intermittent, it is not easy to bisect.
However, the commit was merged into the Linux next-20250414 tag, and from
next-20250415 onwards we started noticing this regression intermittently
on both x86_64 devices and qemu-x86_64, with and without compat mode.

 - https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250606/testrun/28685600/suite/log-parser-test/test/oops-oops-int3-smp-pti/history/
 - https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250606/testrun/28685600/suite/log-parser-test/test/oops-oops-int3-smp-pti/history/?page=2

The above commit landed in Linus' master branch on 2025-05-13, and we
then started noticing this regression intermittently on x86, with and
without compat mode.

  - https://qa-reports.linaro.org/lkft/linux-mainline-master/build/v6.16-rc1/testrun/28711641/suite/log-parser-test/test/oops-oops-oops-smp-pti/history/?page=1

Masami San,

case 1) compat mode x86_64 (64-bit kernel + 32-bit rootfs)
I tested your patch on top of the linux next-20250606 tag on real
x86_64 (64-bit kernel + 32-bit rootfs) hardware for 7 test runs.

ftrace_regression01 - pass
ftrace_regression02 - pass
ftrace-stress-test - pass
dynamic_debug01 - Hangs (No crash log on serial console)

Case 1.1)
The same results as case 1 were seen on qemu-x86_64 with compat mode
over 12 test runs.

- https://lkft.validation.linaro.org/scheduler/job/8312811#L1687

case 2) x86_64 (64-bit kernel + 64-bit rootfs)
I tested your patch on top of the linux next-20250606 tag on real
x86_64 (64-bit kernel + 64-bit rootfs) hardware for 4 runs. 3 of these
runs failed with kernel warnings, kernel BUG and invalid opcode while
running the LTP tracing test cases.

Here I am sharing the crash log snippets, the boot and test log links,
and the build link.

Test logs:
[  112.596591] Ring buffer clock went backwards: 113864910133 -> 112596588266
[  115.829620] cat (5762) used greatest stack depth: 10936 bytes left
[  120.922517] ------------[ cut here ]------------
[  120.927198] WARNING: CPU: 2 PID: 6639 at
kernel/trace/trace_functions_graph.c:985 print_graph_entry+0x579/0x590
[  120.937364] Modules linked in: x86_pkg_temp_thermal
[  120.942405] CPU: 2 UID: 0 PID: 6639 Comm: cat Tainted: G S
        6.15.0-next-20250606 #1 PREEMPT(voluntary)
[  120.953380] Tainted: [S]=CPU_OUT_OF_SPEC
[  120.957477] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
2.7 12/07/2021
[  120.965036] RIP: 0010:print_graph_entry+0x579/0x590

Run 1:
- https://lkft.validation.linaro.org/scheduler/job/8311136#L1700


ftrace-stress-test: [   58.963898] /usr/local/bin/kirk[340]: starting
test ftrace-stress-test (ftrace_stress_test.sh 90)
[   60.316588] ------------[ cut here ]------------
[   60.316588] ------------[ cut here ]------------
[   60.316590] ------------[ cut here ]------------
[   60.316593] ------------[ cut here ]------------
[   60.316593] ------------[ cut here ]------------
[   60.316594] ------------[ cut here ]------------
[   60.316594] kernel BUG at kernel/entry/common.c:328!
[   60.316594] kernel BUG at kernel/entry/common.c:328!
[   60.316595] kernel BUG at kernel/entry/common.c:328!
[   60.316600] Oops: invalid opcode: 0000 [#1] SMP PTI
[   60.316604] CPU: 2 UID: 0 PID: 1556 Comm: sh Tainted: G S
       6.15.0-next-20250606 #1 PREEMPT(voluntary)
[   60.316608] Tainted: [S]=CPU_OUT_OF_SPEC
[   60.316609] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
2.7 12/07/2021
[   60.316614] ------------[ cut here ]------------
[   60.316615] kernel BUG at kernel/entry/common.c:328!
[   60.316617] Oops: invalid opcode: 0000 [#2] SMP PTI
[   60.316620] CPU: 2 UID: 0 PID: 1556 Comm: sh Tainted: G S
       6.15.0-next-20250606 #1 PREEMPT(voluntary)
[   60.316622] Tainted: [S]=CPU_OUT_OF_SPEC
[   60.316623] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
2.7 12/07/2021
[   60.316625] RIP: 0010:irqentry_nmi_enter+0x6c/0x70

Run 2:
- https://lkft.validation.linaro.org/scheduler/job/8311138#L1703

ftrace-stress-test: [   78.877495] /usr/local/bin/kirk[343]: starting
test ftrace-stress-test (ftrace_stress_test.sh 90)
[   78.977303] Scheduler tracepoints stat_sleep, stat_iowait,
stat_blocked and stat_runtime require the kernel parameter
schedstats=enable or kernel.sched_schedstats=1
[   82.299799] cat (2322) used greatest stack depth: 11520 bytes left
[   82.327708] cat (2327) used greatest stack depth: 11256 bytes left
[   82.632183] cat (2375) used greatest stack depth: 10992 bytes left
[  137.335901] ------------[ cut here ]------------
[  137.335901] ------------[ cut here ]------------
[  137.335902] ------------[ cut here ]------------
[  137.335907] kernel BUG at kernel/entry/common.c:328!
[  137.335908] ------------[ cut here ]------------
[  137.335909] ------------[ cut here ]------------
[  137.335912] kernel BUG at kernel/entry/common.c:328!
[  137.335912] kernel BUG at kernel/entry/common.c:328!
[  137.335915] Oops: invalid opcode: 0000 [#1] SMP PTI
[  137.335921] CPU: 0 UID: 0 PID: 544 Comm: sh Tainted: G S
      6.15.0-next-20250606 #1 PREEMPT(voluntary)
[  137.335926] Tainted: [S]=CPU_OUT_OF_SPEC
[  137.335929] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
2.7 12/07/2021
[  137.335937] ------------[ cut here ]------------
[  137.335939] kernel BUG at kernel/entry/common.c:328!
[  137.335945] Oops: invalid opcode: 0000 [#2] SMP PTI
[  137.335949] CPU: 0 UID: 0 PID: 544 Comm: sh Tainted: G S
      6.15.0-next-20250606 #1 PREEMPT(voluntary)
[  137.335953] Tainted: [S]=CPU_OUT_OF_SPEC
[  137.335956] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
2.7 12/07/2021
[  137.335959] RIP: 0010:irqentry_nmi_enter+0x6c/0x70

Run 3:
- https://lkft.validation.linaro.org/scheduler/job/8311139#L1703

Build log:
 - https://storage.tuxsuite.com/public/linaro/naresh/builds/2yM9krm5KgE5a57QFvOqw9UrSgQ/

- Naresh

>
> -- Steve


* Re: [RFC PATCH 2/2] x86: alternative: Invalidate the cache for updated instructions
  2025-06-12  0:17           ` Masami Hiramatsu
@ 2025-06-12 16:24             ` Naresh Kamboju
  2025-06-13  3:09               ` Masami Hiramatsu
  0 siblings, 1 reply; 34+ messages in thread
From: Naresh Kamboju @ 2025-06-12 16:24 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, Steven Rostedt, x86, open list, Linux trace kernel,
	lkft-triage, Stephen Rothwell, Arnd Bergmann, Dan Carpenter,
	Anders Roxell

On Thu, 12 Jun 2025 at 05:47, Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Wed, 11 Jun 2025 13:30:01 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Tue, Jun 10, 2025 at 11:47:48PM +0900, Masami Hiramatsu (Google) wrote:
> > > From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > >
> > > Invalidate the cache after replacing INT3 with the new instruction.
> > > This will prevent the other CPUs seeing the removed INT3 in their
> > > cache after serializing the pipeline.
> > >
> > > LKFT reported an oops by INT3 but there is no INT3 shown in the
> > > dumped code. This means the INT3 is removed after the CPU hits
> > > INT3.
> > >
> > >  ## Test log
> > >  ftrace-stress-test: <12>[   21.971153] /usr/local/bin/kirk[277]:
> > >  starting test ftrace-stress-test (ftrace_stress_test.sh 90)
> > >  <4>[   58.997439] Oops: int3: 0000 [#1] SMP PTI
> > >  <4>[   58.998089] CPU: 0 UID: 0 PID: 323 Comm: sh Not tainted
> > >  6.15.0-next-20250605 #1 PREEMPT(voluntary)
> > >  <4>[   58.998152] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
> > >  BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> > >  <4>[   58.998260] RIP: 0010:_raw_spin_lock+0x5/0x50
> > >  <4>[   58.998563] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00
> > >  00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3
> > >  0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15
> > >  12 e4 fe
> > >
> > > Maybe one possible scenario is to hit the int3 after the third step
> > > somehow (on I-cache).
> > >
> > > ------
> > > <CPU0>                                      <CPU1>
> > >                                     Start smp_text_poke_batch_finish().
> > >                                     Start the third step. (remove INT3)
> > >                                     on_each_cpu(do_sync_core)
> > > do_sync_core(do SERIALIZE)
> > >                                     Finish the third step.
> > > Hit INT3 (from I-cache)
> > >                                     Clear text_poke_array_refs[cpu0]
> > > Start smp_text_poke_int3_handler()
> > > Failed to get text_poke_array_refs[cpu0]
> > > Oops: int3
> > > ------
> > >
> > > SERIALIZE instruction flushes the pipeline, thus the processor needs
> > > to reload the instruction. But it is not guaranteed to reload it from
> > > memory because SERIALIZE does not invalidate the cache.
> > >
> > > To prevent reloading replaced INT3, we need to invalidate the cache
> > > (flush TLB) in the third step, before the do_sync_core().
> >
> > This sounds all sorts of wrong. x86 is supposed to be cache-coherent. A
> > store should cause the invalidation per MESI and all that. This means
> > the only place where the old instruction can stick around is in the
> > uarch micro-ops cache and all that, and SERIALIZE will very much flush
> > those.
>
> OK, thanks for pointing it out!
>
> >
> > Also, TLB flush != I$ flush. There is clflush_cache_range() for this.
> > But still, this really should not be needed.
> >
> > Also, this is all qemu, and qemu is known to have gotten this terribly
> > wrong in the past.
>
> What about KVM? We need to ask Naresh how it is running on the machine.
> Naresh, can you tell us how the VM is running? Does that use KVM?
> And if so, how the kvm is configured(it may depend on the real hardware)?

We do not use KVM; we are running QEMU version 10.0.0.

>
> >
> > If you all cannot reproduce on real hardware, I'm considering this a
> > qemu bug.

It is reproducible intermittently on both an x86_64 device and a
qemu-x86 device, with and without compat mode.

This link shows how intermittent it is on the Linux next tree.

 - https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250606/testrun/28685600/suite/log-parser-test/test/oops-oops-int3-smp-pti/history/?page=2

- Naresh

>
> OK, if it is a qemu's bug, dropping [2/2], but I think we still need
> [1/2] to avoid kernel crash (with a warning message without dump).
>
> Thank you,
>
> >
> >
>
>
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: [RFC PATCH 2/2] x86: alternative: Invalidate the cache for updated instructions
  2025-06-12 16:24             ` Naresh Kamboju
@ 2025-06-13  3:09               ` Masami Hiramatsu
  0 siblings, 0 replies; 34+ messages in thread
From: Masami Hiramatsu @ 2025-06-13  3:09 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, Steven Rostedt, x86, open list, Linux trace kernel,
	lkft-triage, Stephen Rothwell, Arnd Bergmann, Dan Carpenter,
	Anders Roxell

On Thu, 12 Jun 2025 21:54:05 +0530
Naresh Kamboju <naresh.kamboju@linaro.org> wrote:

> On Thu, 12 Jun 2025 at 05:47, Masami Hiramatsu <mhiramat@kernel.org> wrote:
> >
> > On Wed, 11 Jun 2025 13:30:01 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > On Tue, Jun 10, 2025 at 11:47:48PM +0900, Masami Hiramatsu (Google) wrote:
> > > > From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > > >
> > > > Invalidate the cache after replacing INT3 with the new instruction.
> > > > This will prevent the other CPUs seeing the removed INT3 in their
> > > > cache after serializing the pipeline.
> > > >
> > > > LKFT reported an oops by INT3 but there is no INT3 shown in the
> > > > dumped code. This means the INT3 is removed after the CPU hits
> > > > INT3.
> > > >
> > > >  ## Test log
> > > >  ftrace-stress-test: <12>[   21.971153] /usr/local/bin/kirk[277]:
> > > >  starting test ftrace-stress-test (ftrace_stress_test.sh 90)
> > > >  <4>[   58.997439] Oops: int3: 0000 [#1] SMP PTI
> > > >  <4>[   58.998089] CPU: 0 UID: 0 PID: 323 Comm: sh Not tainted
> > > >  6.15.0-next-20250605 #1 PREEMPT(voluntary)
> > > >  <4>[   58.998152] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
> > > >  BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> > > >  <4>[   58.998260] RIP: 0010:_raw_spin_lock+0x5/0x50
> > > >  <4>[   58.998563] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00
> > > >  00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3
> > > >  0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15
> > > >  12 e4 fe
> > > >
> > > > Maybe one possible scenario is to hit the int3 after the third step
> > > > somehow (on I-cache).
> > > >
> > > > ------
> > > > <CPU0>                                      <CPU1>
> > > >                                     Start smp_text_poke_batch_finish().
> > > >                                     Start the third step. (remove INT3)
> > > >                                     on_each_cpu(do_sync_core)
> > > > do_sync_core(do SERIALIZE)
> > > >                                     Finish the third step.
> > > > Hit INT3 (from I-cache)
> > > >                                     Clear text_poke_array_refs[cpu0]
> > > > Start smp_text_poke_int3_handler()
> > > > Failed to get text_poke_array_refs[cpu0]
> > > > Oops: int3
> > > > ------
> > > >
> > > > SERIALIZE instruction flushes the pipeline, thus the processor needs
> > > > to reload the instruction. But it is not guaranteed to reload it from
> > > > memory because SERIALIZE does not invalidate the cache.
> > > >
> > > > To prevent reloading replaced INT3, we need to invalidate the cache
> > > > (flush TLB) in the third step, before the do_sync_core().
> > >
> > > This sounds all sorts of wrong. x86 is supposed to be cache-coherent. A
> > > store should cause the invalidation per MESI and all that. This means
> > > the only place where the old instruction can stick around is in the
> > > uarch micro-ops cache and all that, and SERIALIZE will very much flush
> > > those.
> >
> > OK, thanks for pointing it out!
> >
> > >
> > > Also, TLB flush != I$ flush. There is clflush_cache_range() for this.
> > > But still, this really should not be needed.
> > >
> > > Also, this is all qemu, and qemu is known to have gotten this terribly
> > > wrong in the past.
> >
> > What about KVM? We need to ask Naresh how it is running on the machine.
> > Naresh, can you tell us how the VM is running? Does that use KVM?
> > And if so, how the kvm is configured(it may depend on the real hardware)?
> 
> We do not use KVM and are running the Qemu version (10.0.0).
> 
> >
> > >
> > > If you all cannot reproduce on real hardware, I'm considering this a
> > > qemu bug.
> 
> It is reproducible intermittently on x86_64 device and qemu-x86 device
> with and without compat mode.

Interesting, so it seems this is not a KVM/qemu issue but a real bug in
the INT3 handling (maybe text_poke?).

> 
> This link is showing how intermittent it is on Linux next tree.
> 
>  - https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250606/testrun/28685600/suite/log-parser-test/test/oops-oops-int3-smp-pti/history/?page=2

I found this example, where the INT3 was not removed but the kernel still
failed to handle it.

https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250501/testrun/28300874/suite/log-parser-test/test/oops-oops-int3-smp-pti/details/

>>>>>>
[   77.103476] Oops: int3: 0000 [#1] SMP PTI
[   77.103481] CPU: 2 UID: 0 PID: 10062 Comm: cat Not tainted 6.15.0-rc4-next-20250501 #1 PREEMPT_{RT,(full)} 
[   77.103484] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.7 12/07/2021
[   77.103485] RIP: 0010:kmem_cache_alloc_noprof+0x10a/0x2c0
[   77.103490] Code: 4c 89 e7 e8 28 e4 cd 00 66 90 f7 c5 00 00 40 00 0f 85 89 01 00 00 f6 43 09 20 0f 85 7f 01 00 00 4c 8b 24 24 48 8b 74 24 38 cc <1f> 44 00 00 48 8b 44 24 08 65 48 2b 05 cd e5 23 02 0f 85 8e 01 00
[   77.103491] RSP: 0018:ffffa0954960bac0 EFLAGS: 00000202
[   77.103493] RAX: 0000000000000001 RBX: ffff9105c0229700 RCX: 0000000000000007
[   77.103494] RDX: ffff9105c7589180 RSI: ffffffffb6fc247e RDI: ffff9105c7589180
[   77.103495] RBP: 0000000000000cc0 R08: 0000000000000006 R09: 00000000000000c0
[   77.103496] R10: ffffa0954960bbb8 R11: ffff9105cd06310c R12: ffff9105c3583300
[   77.103497] R13: 00000000000000c0 R14: ffffffffb6fc247e R15: ffff9105c3b7c200
[   77.103499] FS:  0000000000000000(0000) GS:ffff9109668b7000(0000) knlGS:0000000000000000
[   77.103500] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   77.103501] CR2: 00007ffdbe756f50 CR3: 0000000103e24003 CR4: 00000000003726f0
[   77.103502] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   77.103503] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   77.103504] Call Trace:
[   77.103505]  <TASK>
[   77.103507]  vm_area_dup+0x1e/0xe0
[   77.103510]  __split_vma+0xa0/0x320
[   77.103513]  vms_gather_munmap_vmas+0xab/0x230
[   77.103514]  __mmap_region+0x211/0xb80
[   77.103521]  do_mmap+0x3fa/0x5a0
[   77.103524]  vm_mmap_pgoff+0xfc/0x1d0
[   77.103528]  ksys_mmap_pgoff+0x149/0x1f0
[   77.103531]  ? do_syscall_64+0x7e/0x1d0
[   77.103535]  do_syscall_64+0xb2/0x1d0
[   77.103537]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
>>>>>>

The code pattern looks like one produced by text_poke_batch():

cc <1f> 44 00 00 = BYTES_NOP5 with its first byte replaced by INT3.

But since it is not at the entry of the symbol, it may not be an ftrace
entry; maybe a tracepoint?

-------
void *kmem_cache_alloc_noprof(struct kmem_cache *s, gfp_t gfpflags)
{
	void *ret = slab_alloc_node(s, NULL, gfpflags, NUMA_NO_NODE, _RET_IP_,
				    s->object_size);

	trace_kmem_cache_alloc(_RET_IP_, ret, s, gfpflags, NUMA_NO_NODE);

	return ret;
}
-------

Hmm, so it seems smp_text_poke_batch_finish() was in the first step
(add INT3 on NOP) or the second step (right before removing INT3).

Thanks,

> 
> - Naresh
> 
> >
> > OK, if it is a qemu's bug, dropping [2/2], but I think we still need
> > [1/2] to avoid kernel crash (with a warning message without dump).
> >
> > Thank you,
> >
> > >
> > >
> >
> >
> > --
> > Masami Hiramatsu (Google) <mhiramat@kernel.org>


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
  2025-06-12 13:09       ` Naresh Kamboju
@ 2025-06-13  8:27         ` Masami Hiramatsu
  2025-06-13 12:01           ` Masami Hiramatsu
  2025-06-16  7:36           ` Masami Hiramatsu
  0 siblings, 2 replies; 34+ messages in thread
From: Masami Hiramatsu @ 2025-06-13  8:27 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: Steven Rostedt, Masami Hiramatsu, open list, Linux trace kernel,
	lkft-triage, Stephen Rothwell, Arnd Bergmann, Dan Carpenter,
	Anders Roxell

On Thu, 12 Jun 2025 18:39:41 +0530
Naresh Kamboju <naresh.kamboju@linaro.org> wrote:

> On Tue, 10 Jun 2025 at 20:22, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Tue, 10 Jun 2025 18:50:05 +0530
> > Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
> >
> > > > Is this bug reproducible easier recently?
> > >
> > > Yes. It is easy to reproduce.
> >
> > Can you test before and after this commit:
> >
> >   4334336e769b ("x86/alternatives: Improve code-patching scalability by
> >   removing false sharing in poke_int3_handler()")
> >
> > I think that may be the culprit.
> >
> > Even if Masami's patches work, I want to know what exactly caused it.
> 
> Steven,
> 
> Since the reported regressions are intermittent, It is not easy to bisect.
> However, The commit merged into Linux next-20250414 tag and then
> started noticing from next-20250415 onwards this regression on both
> x86_64 devices and qemu-x86_64 intermittently with and without
> compat mode.
> 
>  - https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250606/testrun/28685600/suite/log-parser-test/test/oops-oops-int3-smp-pti/history/
>  - https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250606/testrun/28685600/suite/log-parser-test/test/oops-oops-int3-smp-pti/history/?page=2
> 
> And above commit landed into Linus master branch on 2025-05-13 and
> then started noticing this regression intermittently on x86 with and without
> compat mode.
> 
>   - https://qa-reports.linaro.org/lkft/linux-mainline-master/build/v6.16-rc1/testrun/28711641/suite/log-parser-test/test/oops-oops-oops-smp-pti/history/?page=1
> 
> Masami San,
> 
> case 1) compat mode x86_64 (64-bit kernel + 32-bit rootfs)
> I have tested your patch on top of linux next-20250606 tag and tested
> on real x86_64 (64-bit kernel + 32-bit rootfs) hardware for 7 test runs.
> 
> ftrace_regression01 - pass
> ftrace_regression02 - pass
> ftrace-stress-test - pass
> dynamic_debug01 - Hangs (No crash log on serial console)

Hm, this last one seems to have a different cause.

> 
> Case 1.1)
> Above case noticed on qemu-x86_64 with compat mode ^ with
> 12 test runs.
> 
> - https://lkft.validation.linaro.org/scheduler/job/8312811#L1687
> 
> case 2) x86_64 (64-bit kernel + 64-bit rootfs)
> I have tested your patch on top of linux next-20250606 tag and tested
> on real x86_64 (64-bit kernel + 64-bit rootfs) hardware for 4 runs and out of
> these 3 runs failed and found these kernel warnings, kernel BUG and
> invalid opcode while running LTP tracing test cases.
> 
> Here I am sharing the crash log snippet and boot and test log links  and
> build link.
> 
> Test logs:
> [  112.596591] Ring buffer clock went backwards: 113864910133 -> 112596588266
> [  115.829620] cat (5762) used greatest stack depth: 10936 bytes left
> [  120.922517] ------------[ cut here ]------------
> [  120.927198] WARNING: CPU: 2 PID: 6639 at
> kernel/trace/trace_functions_graph.c:985 print_graph_entry+0x579/0x590
> [  120.937364] Modules linked in: x86_pkg_temp_thermal
> [  120.942405] CPU: 2 UID: 0 PID: 6639 Comm: cat Tainted: G S
>         6.15.0-next-20250606 #1 PREEMPT(voluntary)
> [  120.953380] Tainted: [S]=CPU_OUT_OF_SPEC
> [  120.957477] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> 2.7 12/07/2021
> [  120.965036] RIP: 0010:print_graph_entry+0x579/0x590
> 
> Run 1:
> - https://lkft.validation.linaro.org/scheduler/job/8311136#L1700

The warning came from:
----
		/* Save this function pointer to see if the exit matches */
		if (call->depth < FTRACE_RETFUNC_DEPTH &&
		    !WARN_ON_ONCE(call->depth < 0))
			cpu_data->enter_funcs[call->depth] = call->func;
	}
----

It hit the "call->depth < 0" check. Thus this is a function graph
tracer problem.


> ftrace-stress-test: [   58.963898] /usr/local/bin/kirk[340]: starting
> test ftrace-stress-test (ftrace_stress_test.sh 90)
> [   60.316588] ------------[ cut here ]------------
> [   60.316588] ------------[ cut here ]------------
> [   60.316590] ------------[ cut here ]------------
> [   60.316593] ------------[ cut here ]------------
> [   60.316593] ------------[ cut here ]------------
> [   60.316594] ------------[ cut here ]------------
> [   60.316594] kernel BUG at kernel/entry/common.c:328!
> [   60.316594] kernel BUG at kernel/entry/common.c:328!
> [   60.316595] kernel BUG at kernel/entry/common.c:328!
> [   60.316600] Oops: invalid opcode: 0000 [#1] SMP PTI
> [   60.316604] CPU: 2 UID: 0 PID: 1556 Comm: sh Tainted: G S
>        6.15.0-next-20250606 #1 PREEMPT(voluntary)
> [   60.316608] Tainted: [S]=CPU_OUT_OF_SPEC
> [   60.316609] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> 2.7 12/07/2021
> [   60.316614] ------------[ cut here ]------------
> [   60.316615] kernel BUG at kernel/entry/common.c:328!
> [   60.316617] Oops: invalid opcode: 0000 [#2] SMP PTI
> [   60.316620] CPU: 2 UID: 0 PID: 1556 Comm: sh Tainted: G S
>        6.15.0-next-20250606 #1 PREEMPT(voluntary)
> [   60.316622] Tainted: [S]=CPU_OUT_OF_SPEC
> [   60.316623] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> 2.7 12/07/2021
> [   60.316625] RIP: 0010:irqentry_nmi_enter+0x6c/0x70
> 
> Run 2:
> - https://lkft.validation.linaro.org/scheduler/job/8311138#L1703

Interesting. This hits the maximum NMI nesting depth.

/*
 * nmi_enter() can nest up to 15 times; see NMI_BITS.
 */
#define __nmi_enter()						\
	do {							\
		lockdep_off();					\
		arch_nmi_enter();				\
		BUG_ON(in_nmi() == NMI_MASK);			\  <=====
		__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);	\
	} while (0)


> 
> ftrace-stress-test: [   78.877495] /usr/local/bin/kirk[343]: starting
> test ftrace-stress-test (ftrace_stress_test.sh 90)
> [   78.977303] Scheduler tracepoints stat_sleep, stat_iowait,
> stat_blocked and stat_runtime require the kernel parameter
> schedstats=enable or kernel.sched_schedstats=1
> [   82.299799] cat (2322) used greatest stack depth: 11520 bytes left
> [   82.327708] cat (2327) used greatest stack depth: 11256 bytes left
> [   82.632183] cat (2375) used greatest stack depth: 10992 bytes left
> [  137.335901] ------------[ cut here ]------------
> [  137.335901] ------------[ cut here ]------------
> [  137.335902] ------------[ cut here ]------------
> [  137.335907] kernel BUG at kernel/entry/common.c:328!
> [  137.335908] ------------[ cut here ]------------
> [  137.335909] ------------[ cut here ]------------
> [  137.335912] kernel BUG at kernel/entry/common.c:328!
> [  137.335912] kernel BUG at kernel/entry/common.c:328!
> [  137.335915] Oops: invalid opcode: 0000 [#1] SMP PTI
> [  137.335921] CPU: 0 UID: 0 PID: 544 Comm: sh Tainted: G S
>       6.15.0-next-20250606 #1 PREEMPT(voluntary)
> [  137.335926] Tainted: [S]=CPU_OUT_OF_SPEC
> [  137.335929] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> 2.7 12/07/2021
> [  137.335937] ------------[ cut here ]------------
> [  137.335939] kernel BUG at kernel/entry/common.c:328!
> [  137.335945] Oops: invalid opcode: 0000 [#2] SMP PTI
> [  137.335949] CPU: 0 UID: 0 PID: 544 Comm: sh Tainted: G S
>       6.15.0-next-20250606 #1 PREEMPT(voluntary)
> [  137.335953] Tainted: [S]=CPU_OUT_OF_SPEC
> [  137.335956] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> 2.7 12/07/2021
> [  137.335959] RIP: 0010:irqentry_nmi_enter+0x6c/0x70
> 
> Run 3:
> - https://lkft.validation.linaro.org/scheduler/job/8311139#L1703

This is the same as Run 2, and clearer.

In do_int3(), if we hit an INT3 that has already disappeared, it falls
through to the remaining handlers. This means kprobe_int3_handler() is
reached, and it calls get_kprobe() to find the corresponding kprobe. But,

ffffffff8150a040 <get_kprobe>:
ffffffff8150a040:       f3 0f 1e fa             endbr64
ffffffff8150a044:       e8 07 b0 e2 ff          call   ffffffff81335050 <__fentry__>
ffffffff8150a049:       48 b8 eb 83 b5 80 46    movabs $0x61c8864680b583eb,%rax
ffffffff8150a050:       86 c8 61 

It hits ftrace and is hooked by fgraph, and eventually returns
via ftrace_return_to_handler():

[  137.338572] RIP: 0010:ftrace_return_to_handler+0xd5/0x1f0
[  137.338577] Code: 00 89 55 c8 48 85 ff 74 07 4c 89 b7 80 00 00 00 49 8b 94 24 38 0b 00 00 48 98 48 8b 04 c2 48 c1 e8 0c 0f b7 c0 48 89 45 b8 cc <90> 48 8b 05 e3 ac c2 01 48 63 80 f8 00 00 00 48 0f a3 45 b8 72 39

This address is:

$ eu-addr2line -fi -e vmlinux ftrace_return_to_handler+0xd5
arch_static_branch inlined at /builds/linux/kernel/trace/fgraph.c:839:6 in ftrace_return_to_handler
/builds/linux/arch/x86/include/asm/jump_label.h:36:2
__ftrace_return_to_handler
/builds/linux/kernel/trace/fgraph.c:839:6
ftrace_return_to_handler
/builds/linux/kernel/trace/fgraph.c:874:9

It belongs to a static_branch, which also uses text_poke.

-----
#ifdef CONFIG_HAVE_STATIC_CALL
	if (static_branch_likely(&fgraph_do_direct)) { <======
		if (test_bit(fgraph_direct_gops->idx, &bitmap))
			static_call(fgraph_retfunc)(&trace, fgraph_direct_gops, fregs);
-----

But actually, this static_branch modifies the kernel code with
smp_text_poke_single() (note, this is a wrapper of smp_text_poke_batch).

And this INT3 is MISSED by smp_text_poke_int3_handler() again, so it
goes through kprobes, hits ftrace (fgraph), and causes this loop.

So the fundamental issue is that smp_text_poke_batch fails to handle
its own INT3.

I guess some text_poke users do not take text_mutex?

Thank you,


> 
> Build log:
>  - https://storage.tuxsuite.com/public/linaro/naresh/builds/2yM9krm5KgE5a57QFvOqw9UrSgQ/
> 
> - Naresh
> 
> >
> > -- Steve


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
  2025-06-13  8:27         ` Masami Hiramatsu
@ 2025-06-13 12:01           ` Masami Hiramatsu
  2025-06-16  7:36           ` Masami Hiramatsu
  1 sibling, 0 replies; 34+ messages in thread
From: Masami Hiramatsu @ 2025-06-13 12:01 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Naresh Kamboju, Steven Rostedt, open list, Linux trace kernel,
	lkft-triage, Stephen Rothwell, Arnd Bergmann, Dan Carpenter,
	Anders Roxell

On Fri, 13 Jun 2025 17:27:53 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> In do_int3(), if we hit a disappeared int3, it is evacuated after
> all. This means kprobe_int3_handler() is hit, and call get_kprobe()
> to find the corresponding kprobes. But,
> 
> ffffffff8150a040 <get_kprobe>:
> ffffffff8150a040:       f3 0f 1e fa             endbr64
> ffffffff8150a044:       e8 07 b0 e2 ff          call   ffffffff81335050 <__fentry__>
> ffffffff8150a049:       48 b8 eb 83 b5 80 46    movabs $0x61c8864680b583eb,%rax
> ffffffff8150a050:       86 c8 61 

BTW, I think this get_kprobe() should be "notrace" because it
is called from the int3 handler.

Thanks,

-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
  2025-06-13  8:27         ` Masami Hiramatsu
  2025-06-13 12:01           ` Masami Hiramatsu
@ 2025-06-16  7:36           ` Masami Hiramatsu
  2025-06-17 10:41             ` Masami Hiramatsu
  1 sibling, 1 reply; 34+ messages in thread
From: Masami Hiramatsu @ 2025-06-16  7:36 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Naresh Kamboju, Steven Rostedt, open list, Linux trace kernel,
	lkft-triage, Stephen Rothwell, Arnd Bergmann, Dan Carpenter,
	Anders Roxell

On Fri, 13 Jun 2025 17:27:53 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> > Run 3:
> > - https://lkft.validation.linaro.org/scheduler/job/8311139#L1703
> 
> This is the same as Run 2, and clearer.
> 
> In do_int3(), if we hit a disappeared int3, it is evacuated after
> all. This means kprobe_int3_handler() is hit, and call get_kprobe()
> to find the corresponding kprobes. But,
> 
> ffffffff8150a040 <get_kprobe>:
> ffffffff8150a040:       f3 0f 1e fa             endbr64
> ffffffff8150a044:       e8 07 b0 e2 ff          call   ffffffff81335050 <__fentry__>
> ffffffff8150a049:       48 b8 eb 83 b5 80 46    movabs $0x61c8864680b583eb,%rax
> ffffffff8150a050:       86 c8 61 
> 
> It hits the ftrace and hooked by fgraph, and eventually returns
> via ftrace_return_to_handler()
> 
> [  137.338572] RIP: 0010:ftrace_return_to_handler+0xd5/0x1f0
> [  137.338577] Code: 00 89 55 c8 48 85 ff 74 07 4c 89 b7 80 00 00 00 49 8b 94 24 38 0b 00 00 48 98 48 8b 04 c2 48 c1 e8 0c 0f b7 c0 48 89 45 b8 cc <90> 48 8b 05 e3 ac c2 01 48 63 80 f8 00 00 00 48 0f a3 45 b8 72 39
> 
> This address is;
> 
> $ eu-addr2line -fi -e vmlinux ftrace_return_to_handler+0xd5
> arch_static_branch inlined at /builds/linux/kernel/trace/fgraph.c:839:6 in ftrace_return_to_handler
> /builds/linux/arch/x86/include/asm/jump_label.h:36:2
> __ftrace_return_to_handler
> /builds/linux/kernel/trace/fgraph.c:839:6
> ftrace_return_to_handler
> /builds/linux/kernel/trace/fgraph.c:874:9
> 
> It is for static_branch, which also uses a text_poke.
> 
> -----
> #ifdef CONFIG_HAVE_STATIC_CALL
> 	if (static_branch_likely(&fgraph_do_direct)) { <======
> 		if (test_bit(fgraph_direct_gops->idx, &bitmap))
> 			static_call(fgraph_retfunc)(&trace, fgraph_direct_gops, fregs);
> -----
> 
> But actually, this static_branch modifies the kernel code with
> smp_text_poke_single() (note, this is a wrapper of smp_text_poke_batch).
> 
> And this is MISSED by the smp_text_poke_int3_handler() again and
> go through the kprobes, and hit ftrace (fgraph) and caused this
> loop.
> 
> So the fundamental issue is that smp_text_poke_batch missed
> handling INT3. 
> 
> I guess some text_poke user do not get text_mutex?

Hmm, I've checked the smp_text_poke_* users, and I could not find a problem.
Basically, all smp_text_poke_* users take text_mutex, and the other
suspect, ftrace_start_up, is also set under ftrace_lock.
ftrace_arch_code_modify_post_process() is also paired with
ftrace_arch_code_modify_prepare(), and both run under ftrace_lock.

These are the call graphs I checked (callers are indented under each callee):


smp_text_poke_single()
  ftrace_mod_jmp()
    ftrace_enable_ftrace_graph_caller()
      ftrace_modify_all_code() -> see [*1]
    ftrace_disable_ftrace_graph_caller()
      ftrace_modify_all_code() -> see [*1]
  ftrace_update_ftrace_func()
    update_ftrace_func()
      ftrace_modify_all_code() -> see [*1]
      
smp_text_poke_batch_add()
  arch_jump_label_transform_queue() -> lock text_mutex
  ftrace_replace_code()
    ftrace_modify_all_code() <------[*1]
      arch_ftrace_update_code()
        ftrace_run_update_code() -> lock text_mutex
  ftrace_modify_code_direct() (only if ftrace_poke_late != 0)
    ftrace_make_nop()
      __ftrace_replace_code() <----[*3]
        ftrace_replace_code(weak) --> Not used on x86 (overridden)
          ftrace_modify_all_code()  <--- [*1]
            arch_ftrace_update_code() <---- [*4]
              ftrace_run_update_code()-> lock text_mutex
            __ftrace_modify_code()
              ftrace_run_stop_machine()
                arch_ftrace_update_code(weak) -> overridden on x86 see [*4]
        ftrace_module_enable() -> lock text_mutex (see below)
      ftrace_init_nop()
        ftrace_nop_initialize()
          ftrace_update_code()
            ftrace_module_enable() -> lock text_mutex
              prepare_coming_module()
                load_module()
            ftrace_process_locs() -> lock ftrace_lock.
              ftrace_init() -> OK (ftrace_poke_late == 0 because it's early)
              ftrace_module_init() -> OK (ftrace_poke_late == 0 because module is not live)
                load_module()
    ftrace_make_call()
      __ftrace_replace_code() -> see [*3]


smp_text_poke_batch_finish()
  arch_jump_label_transform_apply() -> lock text_mutex
  ftrace_arch_code_modify_post_process() -> must be OK because this unlocks text_mutex
    ftrace_run_update_code()-> paired with ftrace_arch_code_modify_prepare()
    ftrace_module_enable()-> paired with ftrace_arch_code_modify_prepare() (depends on ftrace_lock && ftrace_start_up)
  ftrace_replace_code()
    ftrace_modify_all_code() -> see [*1]


ftrace_start_up <is this variable set under ftrace_lock?>
  ftrace_startup()
    ftrace_startup_subops()
      register_ftrace_graph() -> lock ftrace_lock
    register_ftrace_function_probe() -> lock ftrace_lock
    register_ftrace_function_nolock() -> lock ftrace_lock
  ftrace_shutdown()
    unregister_ftrace_function() -> lock ftrace_lock


ftrace_arch_code_modify_prepare() <this sets ftrace_poke_late = 1>
  ftrace_module_enable() -> lock ftrace_lock.
  ftrace_run_update_code()
    ftrace_run_modify_code()
      ftrace_ops_update_code()
        __ftrace_hash_move_and_update_ops()
          ftrace_update_ops()
            ftrace_startup_subops()
              register_ftrace_graph()  -> lock ftrace_lock
            ftrace_shutdown_subops()
              unregister_ftrace_graph() -> lock ftrace_lock
            ftrace_hash_move_and_update_subops()
              ftrace_hash_move_and_update_ops() -> [*2]
          ftrace_hash_move_and_update_ops()  <-- [*2]
            process_mod_list() -> lock ftrace_lock
            register_ftrace_function_probe() -> lock ftrace_lock
            unregister_ftrace_function_probe_func() -> lock ftrace_lock
            ftrace_set_hash() -> lock ftrace_lock
            ftrace_regex_release() -> lock ftrace_lock
      unregister_ftrace_function_probe_func() -> lock ftrace_lock
    ftrace_startup_enable()
      ftrace_startup_all()
        ftrace_pid_reset() -> lock ftrace_lock
        pid_write() -> lock ftrace_lock
      ftrace_startup()
        ftrace_startup_subops()
          register_ftrace_graph() -> lock ftrace_lock
        register_ftrace_function_probe() -> lock ftrace_lock
        register_ftrace_function_nolock() -> lock ftrace_lock
      ftrace_startup_sysctl()
        ftrace_enable_sysctl() -> lock ftrace_lock
    ftrace_shutdown()
      ftrace_shutdown_subops()
        unregister_ftrace_graph() -> lock ftrace_lock
      unregister_ftrace_function_probe_func() -> lock ftrace_lock
      ftrace_destroy_filter_files() -> lock ftrace_lock
      unregister_ftrace_function() -> lock ftrace_lock
    ftrace_shutdown_sysctl()
      ftrace_enable_sysctl() -> lock ftrace_lock


Thanks,

-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

* Re: next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
  2025-06-16  7:36           ` Masami Hiramatsu
@ 2025-06-17 10:41             ` Masami Hiramatsu
  2025-06-17 12:10               ` Naresh Kamboju
                                 ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Masami Hiramatsu @ 2025-06-17 10:41 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Naresh Kamboju, Steven Rostedt, open list, Linux trace kernel,
	lkft-triage, Stephen Rothwell, Arnd Bergmann, Dan Carpenter,
	Anders Roxell

On Mon, 16 Jun 2025 16:36:59 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> > So the fundamental issue is that smp_text_poke_batch missed
> > handling INT3. 
> > 
> > I guess some text_poke user do not get text_mutex?
> 
> Hmm, I've checked the smp_text_poke_* users, but it seems no problem.
> Basically, those smp_text_poke* user locks text_mutex, and another
> suspicious ftrace_start_up is also set under ftrace_lock.
> ftrace_arch_code_modify_post_process() is also paired with
> ftrace_arch_code_modify_prepare() and under ftrace_lock.

Eventually, I found a bug in text_poke, and jump_label
(tracepoint) hit that bug.

jump_label uses two different APIs (single and batch),
each of which takes text_mutex independently.

smp_text_poke_single()
  __jump_label_transform()
    jump_label_transform() --> lock text_mutex

smp_text_poke_batch_add()
  arch_jump_label_transform_queue() -> lock text_mutex

smp_text_poke_batch_finish()
  arch_jump_label_transform_apply() -> lock text_mutex

This is allowed by commit 8a6a1b4e0ef1 ("x86/alternatives:
Remove the mixed-patching restriction on smp_text_poke_single()"),
but smp_text_poke_single() still expects that the batched
APIs run within the same text_mutex lock region.
Thus, if a user calls those APIs in the following order:

arch_jump_label_transform_queue(addr1)
jump_label_transform(addr2)
arch_jump_label_transform_apply()

and addr1 > addr2, the array is no longer sorted, so the bsearch
on it does not work and the int3 fails to be handled!

This can also explain the disappeared int3 case. If it happens
right before the int3 is overwritten, that int3 will already be
overwritten when the int3 handler dumps the code, but
text_poke_array_refs is still 1.

It seems that commit c8976ade0c1b ("x86/alternatives:
Simplify smp_text_poke_single() by using tp_vec and existing APIs")
introduced this problem, because it makes text_poke_batch and
text_poke_single share the global array. Before that commit,
text_poke_single (text_poke_bp) used its own local variable.

To fix this issue, use smp_text_poke_batch_add() in
smp_text_poke_single(); it checks that the array stays
sorted and that the array index does not overflow.

Please test below;

From e2a49c7cefb4148ea3142c752396d39f103c9f4d Mon Sep 17 00:00:00 2001
From: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Date: Tue, 17 Jun 2025 19:18:37 +0900
Subject: [PATCH] x86: alternative: Fix int3 handling failure from broken
 text_poke array

Since smp_text_poke_single() does not expect that another
text_poke request is already queued, it can leave text_poke_array
unsorted or cause a buffer overflow on text_poke_array.vec[].
This will cause an Oops in int3, or a kernel page fault if it causes
a buffer overflow.

Use smp_text_poke_batch_add() instead of __smp_text_poke_batch_add()
so that it correctly flushes the queue if needed.

Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Closes: https://lore.kernel.org/all/CA+G9fYsLu0roY3DV=tKyqP7FEKbOEETRvTDhnpPxJGbA=Cg+4w@mail.gmail.com/
Fixes: c8976ade0c1b ("x86/alternatives: Simplify smp_text_poke_single() by using tp_vec and existing APIs")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 arch/x86/kernel/alternative.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index ecfe7b497cad..8038951650c6 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -3107,6 +3107,6 @@ void __ref smp_text_poke_batch_add(void *addr, const void *opcode, size_t len, c
  */
 void __ref smp_text_poke_single(void *addr, const void *opcode, size_t len, const void *emulate)
 {
-	__smp_text_poke_batch_add(addr, opcode, len, emulate);
+	smp_text_poke_batch_add(addr, opcode, len, emulate);
 	smp_text_poke_batch_finish();
 }
-- 
2.50.0.rc2.692.g299adb8693-goog

-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

* Re: next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
  2025-06-17 10:41             ` Masami Hiramatsu
@ 2025-06-17 12:10               ` Naresh Kamboju
  2025-06-17 12:25                 ` Steven Rostedt
  2025-06-17 14:29               ` Steven Rostedt
  2025-06-17 16:45               ` Naresh Kamboju
  2 siblings, 1 reply; 34+ messages in thread
From: Naresh Kamboju @ 2025-06-17 12:10 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Steven Rostedt, open list, Linux trace kernel, lkft-triage,
	Stephen Rothwell, Arnd Bergmann, Dan Carpenter, Anders Roxell

Hi Masami,

On Tue, 17 Jun 2025 at 16:12, Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Mon, 16 Jun 2025 16:36:59 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
>
> > > So the fundamental issue is that smp_text_poke_batch missed
> > > handling INT3.
> > >
> > > I guess some text_poke user do not get text_mutex?
> >
> > Hmm, I've checked the smp_text_poke_* users, but it seems no problem.
> > Basically, those smp_text_poke* user locks text_mutex, and another
> > suspicious ftrace_start_up is also set under ftrace_lock.
> > ftrace_arch_code_modify_post_process() is also paired with
> > ftrace_arch_code_modify_prepare() and under ftrace_lock.
>
> Eventually, I found a bug in text_poke, and jump_label
> (tracepoint) hit the bug.
>
> The jump_label uses 2 different APIs (single and batch)
> which independently takes text_mutex lock.
>
> smp_text_poke_single()
>   __jump_label_transform()
>     jump_label_transform() --> lock text_mutex
>
> smp_text_poke_batch_add()
>   arch_jump_label_transform_queue() -> lock text_mutex
>
> smp_text_poke_batch_finish()
>   arch_jump_label_transform_apply() -> lock text_mutex
>
> This is allowed by commit 8a6a1b4e0ef1 ("x86/alternatives:
> Remove the mixed-patching restriction on smp_text_poke_single()"),
> but smp_text_poke_single() still expects that the batched
> APIs are run in the same text_mutex lock region.
> Thus if user calls those APIs in the below order;
>
> arch_jump_label_transform_queue(addr1)
> jump_label_transform(addr2)
> arch_jump_label_transform_apply()
>
> And if the addr1 > addr2, the bsearch on the array
> does not work, and failed to handle int3!
>
> This can explain the disappeared int3 case. If it happens
> right before int3 is overwritten, that int3 will be
> overwritten when the int3 handler dumps the code, but
> text_poke_array_refs is still 1.
>
> It seems that commit c8976ade0c1b ("x86/alternatives:
> Simplify smp_text_poke_single() by using tp_vec and existing APIs")
> introduced this problem, because it shares the global array in
> the text_poke_batch and text_poke_single. Before that commit,
> text_poke_single (text_poke_bp) uses its local variable.
>
> To fix this issue, Use smp_text_poke_batch_add() in
> smp_text_poke_single(), which checks whether the array
> sorted and the array index does not overflow.
>
> Please test below;
>

Do you mean only this single patch on top of the Linux next ?

>
> From e2a49c7cefb4148ea3142c752396d39f103c9f4d Mon Sep 17 00:00:00 2001
> From: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
> Date: Tue, 17 Jun 2025 19:18:37 +0900
> Subject: [PATCH] x86: alternative: Fix int3 handling failure from broken
>  text_poke array
>
> Since smp_text_poke_single() does not expect there is another
> text_poke request is queued, it can make text_poke_array not
> sorted or cause a buffer overflow on the text_poke_array.vec[].
> This will cause an Oops in int3, or kernel page fault if it causes
> a buffer overflow.
>
> Use smp_text_poke_batch_add() instead of __smp_text_poke_batch_add()
> so that it correctly flush the queue if needed.
>
> Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> Closes: https://lore.kernel.org/all/CA+G9fYsLu0roY3DV=tKyqP7FEKbOEETRvTDhnpPxJGbA=Cg+4w@mail.gmail.com/
> Fixes: 8976ade0c1b ("x86/alternatives: Simplify smp_text_poke_single() by using tp_vec and existing APIs")
> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> ---
>  arch/x86/kernel/alternative.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index ecfe7b497cad..8038951650c6 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -3107,6 +3107,6 @@ void __ref smp_text_poke_batch_add(void *addr, const void *opcode, size_t len, c
>   */
>  void __ref smp_text_poke_single(void *addr, const void *opcode, size_t len, const void *emulate)
>  {
> -       __smp_text_poke_batch_add(addr, opcode, len, emulate);
> +       smp_text_poke_batch_add(addr, opcode, len, emulate);
>         smp_text_poke_batch_finish();
>  }
> --
> 2.50.0.rc2.692.g299adb8693-goog
>
>
>
>
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>

* Re: next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
  2025-06-17 12:10               ` Naresh Kamboju
@ 2025-06-17 12:25                 ` Steven Rostedt
  2025-06-17 12:31                   ` Naresh Kamboju
  0 siblings, 1 reply; 34+ messages in thread
From: Steven Rostedt @ 2025-06-17 12:25 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: Masami Hiramatsu, open list, Linux trace kernel, lkft-triage,
	Stephen Rothwell, Arnd Bergmann, Dan Carpenter, Anders Roxell

On Tue, 17 Jun 2025 17:40:25 +0530
Naresh Kamboju <naresh.kamboju@linaro.org> wrote:

> > Please test below;
> >  
> 
> Do you mean only this single patch on top of the Linux next ?

Looking at Masami's analysis, yeah, I think you only need that one patch.

-- Steve

* Re: next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
  2025-06-17 12:25                 ` Steven Rostedt
@ 2025-06-17 12:31                   ` Naresh Kamboju
  0 siblings, 0 replies; 34+ messages in thread
From: Naresh Kamboju @ 2025-06-17 12:31 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, open list, Linux trace kernel, lkft-triage,
	Stephen Rothwell, Arnd Bergmann, Dan Carpenter, Anders Roxell

On Tue, 17 Jun 2025 at 17:55, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 17 Jun 2025 17:40:25 +0530
> Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
>
> > > Please test below;
> > >
> >
> > Do you mean only this single patch on top of the Linux next ?
>
> Looking at Masami's analysis, yeah, I think you only need that one patch.

Testing is in progress.

>
> -- Steve

* Re: next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
  2025-06-17 10:41             ` Masami Hiramatsu
  2025-06-17 12:10               ` Naresh Kamboju
@ 2025-06-17 14:29               ` Steven Rostedt
  2025-06-17 23:40                 ` Masami Hiramatsu
  2025-06-17 16:45               ` Naresh Kamboju
  2 siblings, 1 reply; 34+ messages in thread
From: Steven Rostedt @ 2025-06-17 14:29 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Naresh Kamboju, open list, Linux trace kernel, lkft-triage,
	Stephen Rothwell, Arnd Bergmann, Dan Carpenter, Anders Roxell

On Tue, 17 Jun 2025 19:41:59 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> Eventually, I found a bug in text_poke, and jump_label
> (tracepoint) hit the bug.
> 
> The jump_label uses 2 different APIs (single and batch)
> which independently takes text_mutex lock. 
> 
> smp_text_poke_single()
>   __jump_label_transform()
>     jump_label_transform() --> lock text_mutex
> 
> smp_text_poke_batch_add()
>   arch_jump_label_transform_queue() -> lock text_mutex
> 
> smp_text_poke_batch_finish()
>   arch_jump_label_transform_apply() -> lock text_mutex
> 
> This is allowed by commit 8a6a1b4e0ef1 ("x86/alternatives:
> Remove the mixed-patching restriction on smp_text_poke_single()"),
> but smp_text_poke_single() still expects that the batched
> APIs are run in the same text_mutex lock region.
> Thus if user calls those APIs in the below order;
> 
> arch_jump_label_transform_queue(addr1)
> jump_label_transform(addr2)
> arch_jump_label_transform_apply()
> 
> And if the addr1 > addr2, the bsearch on the array
> does not work, and failed to handle int3!
> 

Nice catch!

> This can explain the disappeared int3 case. If it happens
> right before int3 is overwritten, that int3 will be 
> overwritten when the int3 handler dumps the code, but
> text_poke_array_refs is still 1.
> 
> It seems that commit c8976ade0c1b ("x86/alternatives: 
> Simplify smp_text_poke_single() by using tp_vec and existing APIs")
> introduced this problem, because it shares the global array in
> the text_poke_batch and text_poke_single. Before that commit,
> text_poke_single (text_poke_bp) uses its local variable.
> 
> To fix this issue, Use smp_text_poke_batch_add() in
> smp_text_poke_single(), which checks whether the array
> sorted and the array index does not overflow.
> 
> Please test below;
> 
> 
> From e2a49c7cefb4148ea3142c752396d39f103c9f4d Mon Sep 17 00:00:00 2001
> From: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
> Date: Tue, 17 Jun 2025 19:18:37 +0900
> Subject: [PATCH] x86: alternative: Fix int3 handling failure from broken
>  text_poke array
> 
> Since smp_text_poke_single() does not expect there is another
> text_poke request is queued, it can make text_poke_array not
> sorted or cause a buffer overflow on the text_poke_array.vec[].
> This will cause an Oops in int3, or kernel page fault if it causes
> a buffer overflow.

I would add more of what you found above in the change log. And the issue
that was triggered I don't think was because of a buffer overflow. It was
because an entry was added to the text_poke_array out of order causing the
bsearch to fail.

Please add to the change log that the issue is that smp_text_poke_single()
can be called while smp_text_poke_batch*() is being used. The locking is
around the called functions but nothing prevents them from being intermingled.

This means that if we have:

   CPU 0                           CPU 1                      CPU 2
   -----                           -----                      -----

 smp_text_poke_batch_add()

                                smp_text_poke_single() <<-- Adds out of order

                                                            <int3>
                                                           [Fails to find address in
                                                            text_poke_array ]
                                                           OOPS!

No overflow. This could possibly happen with just two entries!

> 
> Use smp_text_poke_batch_add() instead of __smp_text_poke_batch_add()
> so that it correctly flush the queue if needed.
> 
> Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> Closes: https://lore.kernel.org/all/CA+G9fYsLu0roY3DV=tKyqP7FEKbOEETRvTDhnpPxJGbA=Cg+4w@mail.gmail.com/
> Fixes: 8976ade0c1b ("x86/alternatives: Simplify smp_text_poke_single() by using tp_vec and existing APIs")
> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>

-- Steve

> ---
>  arch/x86/kernel/alternative.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index ecfe7b497cad..8038951650c6 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -3107,6 +3107,6 @@ void __ref smp_text_poke_batch_add(void *addr, const void *opcode, size_t len, c
>   */
>  void __ref smp_text_poke_single(void *addr, const void *opcode, size_t len, const void *emulate)
>  {
> -	__smp_text_poke_batch_add(addr, opcode, len, emulate);
> +	smp_text_poke_batch_add(addr, opcode, len, emulate);
>  	smp_text_poke_batch_finish();
>  }


* Re: next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
  2025-06-17 10:41             ` Masami Hiramatsu
  2025-06-17 12:10               ` Naresh Kamboju
  2025-06-17 14:29               ` Steven Rostedt
@ 2025-06-17 16:45               ` Naresh Kamboju
  2025-06-17 23:05                 ` Masami Hiramatsu
  2 siblings, 1 reply; 34+ messages in thread
From: Naresh Kamboju @ 2025-06-17 16:45 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Steven Rostedt, open list, Linux trace kernel, lkft-triage,
	Stephen Rothwell, Arnd Bergmann, Dan Carpenter, Anders Roxell

On Tue, 17 Jun 2025 at 16:12, Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Mon, 16 Jun 2025 16:36:59 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
>
> > > So the fundamental issue is that smp_text_poke_batch missed
> > > handling INT3.
> > >
> > > I guess some text_poke user do not get text_mutex?
> >
> > Hmm, I've checked the smp_text_poke_* users, but it seems no problem.
> > Basically, those smp_text_poke* user locks text_mutex, and another
> > suspicious ftrace_start_up is also set under ftrace_lock.
> > ftrace_arch_code_modify_post_process() is also paired with
> > ftrace_arch_code_modify_prepare() and under ftrace_lock.
>
> Eventually, I found a bug in text_poke, and jump_label
> (tracepoint) hit the bug.
>
> The jump_label uses 2 different APIs (single and batch)
> which independently takes text_mutex lock.
>
> smp_text_poke_single()
>   __jump_label_transform()
>     jump_label_transform() --> lock text_mutex
>
> smp_text_poke_batch_add()
>   arch_jump_label_transform_queue() -> lock text_mutex
>
> smp_text_poke_batch_finish()
>   arch_jump_label_transform_apply() -> lock text_mutex
>
> This is allowed by commit 8a6a1b4e0ef1 ("x86/alternatives:
> Remove the mixed-patching restriction on smp_text_poke_single()"),
> but smp_text_poke_single() still expects that the batched
> APIs are run in the same text_mutex lock region.
> Thus if user calls those APIs in the below order;
>
> arch_jump_label_transform_queue(addr1)
> jump_label_transform(addr2)
> arch_jump_label_transform_apply()
>
> And if the addr1 > addr2, the bsearch on the array
> does not work, and failed to handle int3!
>
> This can explain the disappeared int3 case. If it happens
> right before int3 is overwritten, that int3 will be
> overwritten when the int3 handler dumps the code, but
> text_poke_array_refs is still 1.
>
> It seems that commit c8976ade0c1b ("x86/alternatives:
> Simplify smp_text_poke_single() by using tp_vec and existing APIs")
> introduced this problem, because it shares the global array in
> the text_poke_batch and text_poke_single. Before that commit,
> text_poke_single (text_poke_bp) uses its local variable.
>
> To fix this issue, Use smp_text_poke_batch_add() in
> smp_text_poke_single(), which checks whether the array
> sorted and the array index does not overflow.
>
> Please test below;
>
>
> From e2a49c7cefb4148ea3142c752396d39f103c9f4d Mon Sep 17 00:00:00 2001
> From: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
> Date: Tue, 17 Jun 2025 19:18:37 +0900
> Subject: [PATCH] x86: alternative: Fix int3 handling failure from broken
>  text_poke array
>
> Since smp_text_poke_single() does not expect there is another
> text_poke request is queued, it can make text_poke_array not
> sorted or cause a buffer overflow on the text_poke_array.vec[].
> This will cause an Oops in int3, or kernel page fault if it causes
> a buffer overflow.
>
> Use smp_text_poke_batch_add() instead of __smp_text_poke_batch_add()
> so that it correctly flush the queue if needed.
>

I’ve applied the patch on top of Linux next-20250617 and ran
the LTP tracing tests. I'm happy to report that the previously
observed kernel panic has been resolved.

Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>

> Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> Closes: https://lore.kernel.org/all/CA+G9fYsLu0roY3DV=tKyqP7FEKbOEETRvTDhnpPxJGbA=Cg+4w@mail.gmail.com/
> Fixes: 8976ade0c1b ("x86/alternatives: Simplify smp_text_poke_single() by using tp_vec and existing APIs")
> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> ---
>  arch/x86/kernel/alternative.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index ecfe7b497cad..8038951650c6 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -3107,6 +3107,6 @@ void __ref smp_text_poke_batch_add(void *addr, const void *opcode, size_t len, c
>   */
>  void __ref smp_text_poke_single(void *addr, const void *opcode, size_t len, const void *emulate)
>  {
> -       __smp_text_poke_batch_add(addr, opcode, len, emulate);
> +       smp_text_poke_batch_add(addr, opcode, len, emulate);
>         smp_text_poke_batch_finish();
>  }
> --
> 2.50.0.rc2.692.g299adb8693-goog
>
>
>
>
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>

* Re: next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
  2025-06-17 16:45               ` Naresh Kamboju
@ 2025-06-17 23:05                 ` Masami Hiramatsu
  2025-06-17 23:32                   ` Steven Rostedt
  0 siblings, 1 reply; 34+ messages in thread
From: Masami Hiramatsu @ 2025-06-17 23:05 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: Steven Rostedt, open list, Linux trace kernel, lkft-triage,
	Stephen Rothwell, Arnd Bergmann, Dan Carpenter, Anders Roxell

On Tue, 17 Jun 2025 22:15:20 +0530
Naresh Kamboju <naresh.kamboju@linaro.org> wrote:

> On Tue, 17 Jun 2025 at 16:12, Masami Hiramatsu <mhiramat@kernel.org> wrote:
> >
> > On Mon, 16 Jun 2025 16:36:59 +0900
> > Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> >
> > > > So the fundamental issue is that smp_text_poke_batch missed
> > > > handling INT3.
> > > >
> > > > I guess some text_poke user do not get text_mutex?
> > >
> > > Hmm, I've checked the smp_text_poke_* users, but it seems no problem.
> > > Basically, those smp_text_poke* user locks text_mutex, and another
> > > suspicious ftrace_start_up is also set under ftrace_lock.
> > > ftrace_arch_code_modify_post_process() is also paired with
> > > ftrace_arch_code_modify_prepare() and under ftrace_lock.
> >
> > Eventually, I found a bug in text_poke, and jump_label
> > (tracepoint) hit the bug.
> >
> > The jump_label uses 2 different APIs (single and batch)
> > which independently takes text_mutex lock.
> >
> > smp_text_poke_single()
> >   __jump_label_transform()
> >     jump_label_transform() --> lock text_mutex
> >
> > smp_text_poke_batch_add()
> >   arch_jump_label_transform_queue() -> lock text_mutex
> >
> > smp_text_poke_batch_finish()
> >   arch_jump_label_transform_apply() -> lock text_mutex
> >
> > This is allowed by commit 8a6a1b4e0ef1 ("x86/alternatives:
> > Remove the mixed-patching restriction on smp_text_poke_single()"),
> > but smp_text_poke_single() still expects that the batched
> > APIs are run in the same text_mutex lock region.
> > Thus if user calls those APIs in the below order;
> >
> > arch_jump_label_transform_queue(addr1)
> > jump_label_transform(addr2)
> > arch_jump_label_transform_apply()
> >
> > And if the addr1 > addr2, the bsearch on the array
> > does not work, and failed to handle int3!
> >
> > This can explain the disappeared int3 case. If it happens
> > right before int3 is overwritten, that int3 will be
> > overwritten when the int3 handler dumps the code, but
> > text_poke_array_refs is still 1.
> >
> > It seems that commit c8976ade0c1b ("x86/alternatives:
> > Simplify smp_text_poke_single() by using tp_vec and existing APIs")
> > introduced this problem, because it shares the global array in
> > the text_poke_batch and text_poke_single. Before that commit,
> > text_poke_single (text_poke_bp) uses its local variable.
> >
> > To fix this issue, Use smp_text_poke_batch_add() in
> > smp_text_poke_single(), which checks whether the array
> > sorted and the array index does not overflow.
> >
> > Please test below;
> >
> >
> > From e2a49c7cefb4148ea3142c752396d39f103c9f4d Mon Sep 17 00:00:00 2001
> > From: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
> > Date: Tue, 17 Jun 2025 19:18:37 +0900
> > Subject: [PATCH] x86: alternative: Fix int3 handling failure from broken
> >  text_poke array
> >
> > Since smp_text_poke_single() does not expect there is another
> > text_poke request is queued, it can make text_poke_array not
> > sorted or cause a buffer overflow on the text_poke_array.vec[].
> > This will cause an Oops in int3, or kernel page fault if it causes
> > a buffer overflow.
> >
> > Use smp_text_poke_batch_add() instead of __smp_text_poke_batch_add()
> > so that it correctly flush the queue if needed.
> >
> 
> I’ve applied the patch on top of Linux next-20250617 and ran
> the LTP tracing tests. I'm happy to report that the previously
> observed kernel panic has been resolved.
> 
> Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>

Thank you for testing!
This is a good chance for me to set up an LTP environment locally :)

Thanks!

> 
> > Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> > Closes: https://lore.kernel.org/all/CA+G9fYsLu0roY3DV=tKyqP7FEKbOEETRvTDhnpPxJGbA=Cg+4w@mail.gmail.com/
> > Fixes: 8976ade0c1b ("x86/alternatives: Simplify smp_text_poke_single() by using tp_vec and existing APIs")
> > Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > ---
> >  arch/x86/kernel/alternative.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> > index ecfe7b497cad..8038951650c6 100644
> > --- a/arch/x86/kernel/alternative.c
> > +++ b/arch/x86/kernel/alternative.c
> > @@ -3107,6 +3107,6 @@ void __ref smp_text_poke_batch_add(void *addr, const void *opcode, size_t len, c
> >   */
> >  void __ref smp_text_poke_single(void *addr, const void *opcode, size_t len, const void *emulate)
> >  {
> > -       __smp_text_poke_batch_add(addr, opcode, len, emulate);
> > +       smp_text_poke_batch_add(addr, opcode, len, emulate);
> >         smp_text_poke_batch_finish();
> >  }
> > --
> > 2.50.0.rc2.692.g299adb8693-goog
> >
> >
> >
> >
> > --
> > Masami Hiramatsu (Google) <mhiramat@kernel.org>


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

* Re: next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
  2025-06-17 23:05                 ` Masami Hiramatsu
@ 2025-06-17 23:32                   ` Steven Rostedt
  0 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2025-06-17 23:32 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Naresh Kamboju, open list, Linux trace kernel, lkft-triage,
	Stephen Rothwell, Arnd Bergmann, Dan Carpenter, Anders Roxell

On Wed, 18 Jun 2025 08:05:54 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> > Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>  
> 
> Thank you for testing!
> This is a good chance for me to set up an LTP environment locally :)

It's a beast and so far, it continues to fail to build for me :-p

-- Steve

* Re: next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
  2025-06-17 14:29               ` Steven Rostedt
@ 2025-06-17 23:40                 ` Masami Hiramatsu
  2025-06-19 14:00                   ` Steven Rostedt
  0 siblings, 1 reply; 34+ messages in thread
From: Masami Hiramatsu @ 2025-06-17 23:40 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Naresh Kamboju, open list, Linux trace kernel, lkft-triage,
	Stephen Rothwell, Arnd Bergmann, Dan Carpenter, Anders Roxell

On Tue, 17 Jun 2025 10:29:51 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> > From e2a49c7cefb4148ea3142c752396d39f103c9f4d Mon Sep 17 00:00:00 2001
> > From: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
> > Date: Tue, 17 Jun 2025 19:18:37 +0900
> > Subject: [PATCH] x86: alternative: Fix int3 handling failure from broken
> >  text_poke array
> > 
> > Since smp_text_poke_single() does not expect that another
> > text_poke request is already queued, it can leave text_poke_array
> > unsorted or cause a buffer overflow on text_poke_array.vec[].
> > This will cause an Oops in int3, or a kernel page fault if it causes
> > a buffer overflow.
> 
> I would add more of what you found above in the change log. And the issue
> that was triggered I don't think was because of a buffer overflow. It was
> because an entry was added to the text_poke_array out of order causing the
> bsearch to fail.

I saw two patterns of bugs: one is "Oops: int3" and the other is
"#PF in smp_text_poke_batch_finish (or smp_text_poke_int3_handler)".
The latter comes from the buffer overflow.

-----
[  164.164215] BUG: unable to handle page fault for address: ffffffff32c00000
[  164.166999] #PF: supervisor read access in kernel mode
[  164.169096] #PF: error_code(0x0000) - not-present page
[  164.171143] PGD 8364b067 P4D 8364b067 PUD 0 
[  164.172954] Oops: Oops: 0000 [#1] SMP PTI
[  164.174581] CPU: 4 UID: 0 PID: 2702 Comm: sh Tainted: G        W           6.15.0-next-20250606-00002-g75b4e49588c2 #239 PREEMPT(voluntary) 
[  164.179193] Tainted: [W]=WARN
[  164.180926] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[  164.184696] RIP: 0010:smp_text_poke_batch_finish+0xb9/0x400
[  164.186873] Code: e4 4c 8d 6d c2 85 c9 74 39 48 63 03 b9 01 00 00 00 4c 89 ea 41 83 c4 01 48 c7 c7 d0 f7 f7 b2 48 83 c3 10 48 8d b0 00 00 c0 b2 <0f> b6 80 00 00 c0 b2 88 43 ff e8 68 e3 ff ff 44 3b 25 d1 29 5f 02
-----

This is because smp_text_poke_single() writes past the end of vec[],
to text_poke_array.vec[TEXT_POKE_ARRAY_MAX], which overlaps nr_entries
(and the variables next to text_poke_array).

-----
static struct smp_text_poke_array {
	struct smp_text_poke_loc vec[TEXT_POKE_ARRAY_MAX];
	int nr_entries;
} text_poke_array;
-----
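
To make the overlap concrete, here is a minimal userspace sketch (not the
kernel code). The toy_* names, the 16-byte stand-in for struct
smp_text_poke_loc and the capacity of 8 are all assumptions chosen for
illustration. It only computes offsets, so nothing is actually written
out of bounds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_POKE_ARRAY_MAX 8	/* hypothetical capacity, not the kernel's value */

/* Hypothetical 16-byte stand-in for struct smp_text_poke_loc. */
struct toy_poke_loc {
	int32_t rel_addr;
	int32_t disp;
	uint8_t len;
	uint8_t opcode;
	uint8_t text[5];
	uint8_t old;
};

/* Same shape as the struct quoted above: the array, then the counter. */
struct toy_poke_array {
	struct toy_poke_loc vec[TOY_POKE_ARRAY_MAX];
	int nr_entries;
};

/* The sketch assumes no padding inside the stand-in entry. */
static_assert(sizeof(struct toy_poke_loc) == 16, "stand-in entry is 16 bytes");

/* Byte offset at which a write to vec[idx] would start. */
static size_t toy_slot_offset(size_t idx)
{
	return offsetof(struct toy_poke_array, vec) +
	       idx * sizeof(struct toy_poke_loc);
}

/* Returns 1 if a write to vec[idx] would start on top of nr_entries. */
static int toy_slot_hits_nr_entries(size_t idx)
{
	return toy_slot_offset(idx) ==
	       offsetof(struct toy_poke_array, nr_entries);
}
```

With this layout, toy_slot_hits_nr_entries(TOY_POKE_ARRAY_MAX) is true
while the last valid index is not, so one entry too many lands on the
counter first and then on whatever follows the struct.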

> 
> Please add to the change log that the issue is that smp_text_poke_single()
> can be called while smp_text_poke_batch*() is being used. The locking is
> around the called functions but nothing prevents them from being intermingled.

OK.

> 
> This means that if we have:
> 
>    CPU 0                           CPU 1                      CPU 2
>    -----                           -----                      -----
> 
>  smp_text_poke_batch_add()
> 
>                                 smp_text_poke_single() <<-- Adds out of order
> 
>                                                             <int3>
>                                                            [Fails to find address in
>                                                             text_poke_array ]
>                                                            OOPS!

Thanks for the chart!

> 
> No overflow. This could possibly happen with just two entries!

Yes, that is actually what I observed (with a debug patch).
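
For readers following along, the two-entry case can be sketched in
userspace as below. This is only an illustration: toy_find() is a
hypothetical hand-written binary search (not the kernel's bsearch()),
and the addresses are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical binary search over addresses; returns the index or -1. */
static int toy_find(const unsigned long *vec, size_t n, unsigned long key)
{
	size_t lo = 0, hi = n;	/* half-open range [lo, hi) */

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (vec[mid] == key)
			return (int)mid;
		if (vec[mid] < key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return -1;
}

/* A batch queued 0x3000, then a single poke appended 0x1000 behind it. */
static const unsigned long toy_sorted[]   = { 0x1000, 0x3000 };
static const unsigned long toy_unsorted[] = { 0x3000, 0x1000 };

static void toy_selfcheck(void)
{
	/* Sorted array: both addresses are found. */
	assert(toy_find(toy_sorted, 2, 0x1000) == 0);
	assert(toy_find(toy_sorted, 2, 0x3000) == 1);
	/* Out-of-order array: 0x3000 is present but the search misses it. */
	assert(toy_find(toy_unsorted, 2, 0x3000) == -1);
}
```

The lookup for 0x3000 probes the middle slot, sees the smaller 0x1000
there, and moves right past the end, so an int3 hit at that address
would not be matched to any queued entry.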

> 
> > 
> > Use smp_text_poke_batch_add() instead of __smp_text_poke_batch_add()
> > so that it correctly flushes the queue if needed.
> > 
> > Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> > Closes: https://lore.kernel.org/all/CA+G9fYsLu0roY3DV=tKyqP7FEKbOEETRvTDhnpPxJGbA=Cg+4w@mail.gmail.com/
> > Fixes: c8976ade0c1b ("x86/alternatives: Simplify smp_text_poke_single() by using tp_vec and existing APIs")
> > Signed-off-by: Masami Hiramatsu (Google)
> 
> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>

Thank you!

> 
> -- Steve
> 
> > <mhiramat@kernel.org> ---
> >  arch/x86/kernel/alternative.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> > index ecfe7b497cad..8038951650c6 100644
> > --- a/arch/x86/kernel/alternative.c
> > +++ b/arch/x86/kernel/alternative.c
> > @@ -3107,6 +3107,6 @@ void __ref smp_text_poke_batch_add(void *addr, const void *opcode, size_t len, c
> >   */
> >  void __ref smp_text_poke_single(void *addr, const void *opcode, size_t len, const void *emulate)
> >  {
> > -	__smp_text_poke_batch_add(addr, opcode, len, emulate);
> > +	smp_text_poke_batch_add(addr, opcode, len, emulate);
> >  	smp_text_poke_batch_finish();
> >  }
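
As a rough userspace model of why the checked variant helps (a sketch
under assumptions, not the kernel implementation): every toy_* name and
the capacity of 4 are hypothetical, and "flush" here just resets a
counter instead of patching text.

```c
#include <assert.h>

#define TOY_ARRAY_MAX 4	/* hypothetical tiny capacity */

struct toy_queue {
	unsigned long vec[TOY_ARRAY_MAX];
	int nr_entries;
	int flushes;	/* number of non-empty flushes, for the demo only */
};

/* Models the "finish" step: drain whatever is queued. */
static void toy_finish(struct toy_queue *q)
{
	if (!q->nr_entries)
		return;
	q->nr_entries = 0;
	q->flushes++;
}

/* Models the unchecked add: appends with no ordering or bounds check. */
static void toy_add_unchecked(struct toy_queue *q, unsigned long addr)
{
	q->vec[q->nr_entries++] = addr;
}

/* Models the checked add: flush first if full or if order would break. */
static void toy_add_checked(struct toy_queue *q, unsigned long addr)
{
	int full = q->nr_entries == TOY_ARRAY_MAX;
	int unordered = q->nr_entries && q->vec[q->nr_entries - 1] > addr;

	if (full || unordered)
		toy_finish(q);
	toy_add_unchecked(q, addr);
}

/* Models the fixed single poke: checked add, then finish. */
static void toy_single(struct toy_queue *q, unsigned long addr)
{
	toy_add_checked(q, addr);
	toy_finish(q);
}
```

In this model, a pending batch entry at 0x3000 followed by
toy_single(q, 0x1000) drains the batch before 0x1000 is queued, so the
array is never unsorted and never grows past TOY_ARRAY_MAX.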
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

* Re: next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
  2025-06-17 23:40                 ` Masami Hiramatsu
@ 2025-06-19 14:00                   ` Steven Rostedt
  0 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2025-06-19 14:00 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Naresh Kamboju, open list, Linux trace kernel, lkft-triage,
	Stephen Rothwell, Arnd Bergmann, Dan Carpenter, Anders Roxell

On Wed, 18 Jun 2025 08:40:22 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:


> > I would add more of what you found above in the change log. And the issue
> > that was triggered I don't think was because of a buffer overflow. It was
> > because an entry was added to the text_poke_array out of order causing the
> > bsearch to fail.  
> 
> I saw two patterns of bugs: one is "Oops: int3" and the other is
> "#PF in smp_text_poke_batch_finish (or smp_text_poke_int3_handler)".
> The latter comes from the buffer overflow.
> 
> -----
> [  164.164215] BUG: unable to handle page fault for address: ffffffff32c00000
> [  164.166999] #PF: supervisor read access in kernel mode
> [  164.169096] #PF: error_code(0x0000) - not-present page
> [  164.171143] PGD 8364b067 P4D 8364b067 PUD 0 
> [  164.172954] Oops: Oops: 0000 [#1] SMP PTI
> [  164.174581] CPU: 4 UID: 0 PID: 2702 Comm: sh Tainted: G        W           6.15.0-next-20250606-00002-g75b4e49588c2 #239 PREEMPT(voluntary) 
> [  164.179193] Tainted: [W]=WARN
> [  164.180926] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [  164.184696] RIP: 0010:smp_text_poke_batch_finish+0xb9/0x400
> [  164.186873] Code: e4 4c 8d 6d c2 85 c9 74 39 48 63 03 b9 01 00 00 00 4c 89 ea 41 83 c4 01 48 c7 c7 d0 f7 f7 b2 48 83 c3 10 48 8d b0 00 00 c0 b2 <0f> b6 80 00 00 c0 b2 88 43 ff e8 68 e3 ff ff 44 3b 25 d1 29 5f 02
> -----
> 
> This is because smp_text_poke_single() writes past the end of vec[],
> to text_poke_array.vec[TEXT_POKE_ARRAY_MAX], which overlaps nr_entries
> (and the variables next to text_poke_array).

Interesting. It must be that the stress test was able to get in and add
a bunch of individual entries while a batch was being performed.

Still, both are a bug and solved by the same solution ;-)

(Two for the price of one!)

-- Steve

end of thread, other threads:[~2025-06-19 14:00 UTC | newest]

Thread overview: 34+ messages
2025-06-05 11:42 next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic Naresh Kamboju
2025-06-09 13:09 ` Masami Hiramatsu
2025-06-10  8:41   ` Masami Hiramatsu
2025-06-10 13:25     ` Steven Rostedt
2025-06-10 13:20   ` Naresh Kamboju
2025-06-10 14:43     ` Masami Hiramatsu
2025-06-10 14:47       ` [RFC PATCH 1/2] x86: Retry with new instruction if INT3 is disappaered Masami Hiramatsu (Google)
2025-06-10 14:47       ` [RFC PATCH 2/2] x86: alternative: Invalidate the cache for updated instructions Masami Hiramatsu (Google)
2025-06-10 15:50         ` Steven Rostedt
2025-06-11  0:21           ` Masami Hiramatsu
2025-06-11 10:26           ` Masami Hiramatsu
2025-06-11 14:20             ` Steven Rostedt
2025-06-11 15:42               ` Steven Rostedt
2025-06-12  0:04                 ` Masami Hiramatsu
2025-06-11 11:30         ` Peter Zijlstra
2025-06-12  0:17           ` Masami Hiramatsu
2025-06-12 16:24             ` Naresh Kamboju
2025-06-13  3:09               ` Masami Hiramatsu
2025-06-10 14:53     ` next-20250605: Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic Steven Rostedt
2025-06-12 13:09       ` Naresh Kamboju
2025-06-13  8:27         ` Masami Hiramatsu
2025-06-13 12:01           ` Masami Hiramatsu
2025-06-16  7:36           ` Masami Hiramatsu
2025-06-17 10:41             ` Masami Hiramatsu
2025-06-17 12:10               ` Naresh Kamboju
2025-06-17 12:25                 ` Steven Rostedt
2025-06-17 12:31                   ` Naresh Kamboju
2025-06-17 14:29               ` Steven Rostedt
2025-06-17 23:40                 ` Masami Hiramatsu
2025-06-19 14:00                   ` Steven Rostedt
2025-06-17 16:45               ` Naresh Kamboju
2025-06-17 23:05                 ` Masami Hiramatsu
2025-06-17 23:32                   ` Steven Rostedt
2025-06-10  0:29 ` Steven Rostedt
