Re: qemu-x86_64 booting with 8.0.0 stil see int3: when running LTP tracing testing.

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Re: qemu-x86_64 booting with 8.0.0 stil see int3: when running LTP tracing testing.
       [not found] <CA+G9fYsETJQm0Ue7hGsb+nbsiMikwycOV3V0DPr6WC2r61KRBQ@mail.gmail.com>
@ 2023-06-21 15:31 ` Arnd Bergmann
  2023-06-21 16:06   ` Peter Zijlstra
  0 siblings, 1 reply; 15+ messages in thread
From: Arnd Bergmann @ 2023-06-21 15:31 UTC (permalink / raw)
  To: Naresh Kamboju, Anders Roxell, Daniel Díaz,
	Benjamin Copeland
  Cc: Peter Zijlstra, linux-kernel, x86

On Wed, Jun 21, 2023, at 16:16, Naresh Kamboju wrote:
> Hi Team,
>
> FYI,
> qemu-x86_64 booting with 8.0.0 still see int3: when running LTP tracing testing.
>
> docker.io/linaro/tuxrun-qemu:v8.0.0 qemu-system-x86_64

Thanks for the report. I've added the x86 and kernel lists as well
as Peter Zijlstra to Cc.

> Running tests.......
> ftrace_regression01 1 TPASS: Finished running the test
> <4>[   49.874424] int3: 0000 [#1] PREEMPT SMP PTI
> <4>[   49.874634] int3: 0000 [#2] PREEMPT SMP PTI
> <4>[   49.874656] int3: 0000 [#3] PREEMPT SMP PTI
> <4>[   49.874669] int3: 0000 [#4] PREEMPT SMP PTI
> <4>[   49.874680] int3: 0000 [#5] PREEMPT SMP PTI
> <4>[   49.874691] int3: 0000 [#6] PREEMPT SMP PTI
> <4>[   49.874703] int3: 0000 [#7] PREEMPT SMP PTI
> <4>[   49.874714] int3: 0000 [#8] PREEMPT SMP PTI
> <4>[   49.874725] int3: 0000 [#9] PREEMPT SMP PTI

At first looked a bit like it's hitting on multiple CPUs, but 
I also see that the instance only has two CPUs, so apparently
this instance is hitting the same bug recursively from
the exception handler.

> <4>[   49.874736] int3: 0000 [#10] PREEMPT SMP PTI
> <4>[   49.874750] int3: 0000 [#11] PREEMPT SMP PTI
> <4>[   49.874761] int3: 0000 [#12] PREEMPT SMP PTI
> <4>[   49.874772] int3: 0000 [#13] PREEMPT SMP PTI
> <4>[   49.874783] int3: 0000 [#14] PREEMPT SMP PTI
> <4>[   49.874828] int3: 0000 [#15] PREEMPT SMP PTI
> <4>[   49.874894] int3: 0000 [#16] PREEMPT SMP PTI
> <4>[   49.874907] int3: 0000 [#17] PREEMPT SMP PTI
> <4>[   49.874918] int3: 0000 [#18] PREEMPT SMP PTI
> <4>[   49.874930] int3: 0000 [#19] PREEMPT SMP PTI
> <4>[   49.874941] int3: 0000 [#20] PREEMPT SMP PTI
> <4>[   49.874953] int3: 0000 [#21] PREEMPT SMP PTI
> <4>[   49.874964] int3: 0000 [#22] PREEMPT SMP PTI
> <4>[   49.874976] int3: 0000 [#23] PREEMPT SMP PTI
> <4>[   49.874987] int3: 0000 [#24] PREEMPT SMP PTI
> <4>[   49.874999] int3: 0000 [#25] PREEMPT SMP PTI
> <4>[   49.875010] int3: 0000 [#26] PREEMPT SMP PTI
> <4>[   49.875021] int3: 0000 [#27] PREEMPT SMP PTI
> <4>[   49.875032] int3: 0000 [#28] PREEMPT SMP PTI
> <4>[   49.875043] int3: 0000 [#29] PREEMPT SMP PTI
> <4>[   49.875054] int3: 0000 [#30] PREEMPT SMP PTI
> <4>[   49.875282] ------------[ cut here ]------------
> <4>[   49.875687] ------------[ cut here ]------------
> <2>[   49.875904] kernel BUG at kernel/entry/common.c:454!
> <4>[   49.876652] invalid opcode: 0000 [#31] PREEMPT SMP PTI
> <4>[   49.876845] CPU: 1 PID: 317 Comm: ftrace_regressi Not tainted
> 6.4.0-rc7-next-20230621 #1
> <4>[   49.876994] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
> BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
> <4>[   49.877119] RIP: 0010:irqentry_nmi_enter+0x68/0x70
> <4>[   49.877756] Code: 48 8b 0d 7b 3a 6a 73 c7 81 88 08 00 00 00 00
> 00 00 85 c0 0f 95 c3 e8 27 02 00 00 e8 22 93 fe fe 89 d8 5b 5d c3 cc
> cc cc cc cc <0f> 0b 66 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90
> 90 90 90
> <4>[   49.877929] RSP: 0000:fffffe3e79ab6ee8 EFLAGS: 00000046
> <4>[   49.878425] RAX: 0000000000000000 RBX: fffffe3e79ab6f58 RCX:
> 00000000fe00ffdb
> <4>[   49.878444] RDX: 00000000ffffa445 RSI: 0000000000000000 RDI:
> fffffe3e79ab6f58
> <4>[   49.878460] RBP: fffffe3e79ab6ef0 R08: 0000000000000000 R09:
> 0000000000000000
> <4>[   49.878475] R10: 0000000000000000 R11: 0000000000000000 R12:
> ffffa44502ab5a00
> <4>[   49.878490] R13: 0000000000000000 R14: 0000000000000000 R15:
> ffffab17000ccfd8
> <4>[   49.878540] FS:  00007f91069b4740(0000)
> GS:ffffa4457bd00000(0000) knlGS:0000000000000000
> <4>[   49.878579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4>[   49.878596] CR2: ffffab17000ccfd8 CR3: 0000000102be8000 CR4:
> 00000000000006e0
> <4>[   49.878695] Call Trace:
> <4>[   49.878892]  <#DF>
> <4>[   49.879037]  ? __die_body+0x6c/0xc0
> <4>[   49.879123]  ? die+0xae/0xe0
> <4>[   49.879151]  ? do_trap+0x8d/0x160
> <4>[   49.879178]  ? irqentry_nmi_enter+0x68/0x70
> <4>[   49.879199]  ? handle_invalid_op+0x7f/0xd0
> <4>[   49.879213]  ? irqentry_nmi_enter+0x68/0x70
> <4>[   49.879237]  ? exc_invalid_op+0x36/0x50
> <4>[   49.879259]  ? asm_exc_invalid_op+0x1f/0x30
> <4>[   49.879316]  ? irqentry_nmi_enter+0x68/0x70
> <4>[   49.879344]  exc_double_fault+0x107/0x1b0
> <4>[   49.879416]  asm_exc_double_fault+0x23/0x30
> <4>[   49.879479] RIP: 0010:0xffffffffc00ac0ac

I don't know much about x86 exception handling, but my guess is
that this is where the stack overflows, so this backtrace
is not all that useful. Looking at the full log from your link,
I see that recursion through asm_exc_int3:

<4>[   49.886694]  ? exc_int3+0x62/0x80
<4>[   49.886714]  ? asm_exc_int3+0x3e/0x50
<4>[   49.886759]  ? preempt_count_sub+0x5/0x80
<4>[   49.886783]  ? preempt_count_sub+0x5/0x80
<4>[   49.886805]  ? irq_work_queue+0x40/0x80
<4>[   49.886826]  ? defer_console_output+0x49/0x80
<4>[   49.886840]  ? vprintk+0x42/0x60
<4>[   49.886857]  ? _printk+0x5d/0x80
<4>[   49.886891]  ? die+0x9c/0xe0
<4>[   49.886922]  ? exc_int3+0x62/0x80
<4>[   49.886942]  ? asm_exc_int3+0x3e/0x50
<4>[   49.886966]  ? __pfx_tick_sched_timer+0x10/0x10
<4>[   49.886980]  ? __pfx_read_tsc+0x10/0x10
<4>[   49.887010]  ? preempt_count_sub+0x5/0x80
<4>[   49.887035]  ? preempt_count_sub+0x5/0x80
<4>[   49.887057]  ? __hrtimer_run_queues+0xee/0x330
<4>[   49.887070]  ? _raw_spin_unlock_irqrestore+0x28/0x50
<4>[   49.887085]  ? __hrtimer_run_queues+0xee/0x330
<4>[   49.887136]  ? hrtimer_interrupt+0xf6/0x390
<4>[   49.887167]  ? __sysvec_apic_timer_interrupt+0x64/0x1a0
<4>[   49.887189]  ? sysvec_apic_timer_interrupt+0x7a/0x90
<4>[   49.887206]  </IRQ>
<4>[   49.887219]  <TASK>
<4>[   49.887231]  ? asm_sysvec_apic_timer_interrupt+0x1f/0x30
<4>[   49.887252]  ? sched_rt_period_timer+0x4/0x390
<4>[   49.887272]  ? __pfx_do_sync_core+0x10/0x10
<4>[   49.887305]  ? insn_get_displacement+0x9/0x160
<4>[   49.887329]  ? insn_get_displacement+0x9/0x160
<4>[   49.887344]  ? insn_get_immediate+0xd2/0x270
<4>[   49.887365]  ? insn_decode+0x113/0x150
<4>[   49.887384]  ? text_poke_loc_init+0xea/0x220
<4>[   49.887429]  ? sched_rt_period_timer+0x4/0x390
<4>[   49.887451]  ? text_poke_queue+0x89/0xa0
<4>[   49.887477]  ? ftrace_replace_code+0x149/0x1f0
<4>[   49.887508]  ? ftrace_modify_all_code+0x71/0x140
<4>[   49.887533]  ? arch_ftrace_update_code+0xd/0x20
<4>[   49.887550]  ? ftrace_shutdown+0xf5/0x220
<4>[   49.887577]  ? unregister_ftrace_function+0x2e/0x150
<4>[   49.887608]  ? stack_trace_sysctl+0x82/0xb0
<4>[   49.887636]  ? proc_sys_call_handler+0x18b/0x280
<4>[   49.887675]  ? proc_sys_write+0x17/0x20
<4>[   49.887692]  ? vfs_write+0x324/0x3f0
<4>[   49.887742]  ? ksys_write+0x75/0xe0
<4>[   49.887771]  ? __x64_sys_write+0x1f/0x30
<4>[   49.887787]  ? do_syscall_64+0x48/0xa0
<4>[   49.887802]  ? sysvec_apic_timer_interrupt+0x4d/0x90
<4>[   49.887822]  ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
<4>[   49.887878]  </TASK>

This looks like a timer interrupt happens inside of
text_poke_loc_init(), i.e. while the kernel is modifying
itself, and presumably adding (or removing) an int3
instruction that is later hit inside of the timer function.

While our previous theory was that this was most likely
a qemu bug in dealing with self-modifying code, this might
actually hint at a problem in the kernel after all.

I think Peter understands this function best, he probably
sees more here than I do.

> Link:
> https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20230621/testrun/17699662/suite/log-parser-test/tests/
>
> https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20230621/testrun/17699662/suite/log-parser-test/test/check-kernel-panic/log
>
> - Naresh

   Arnd

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: qemu-x86_64 booting with 8.0.0 stil see int3: when running LTP tracing testing.
  2023-06-21 15:31 ` qemu-x86_64 booting with 8.0.0 stil see int3: when running LTP tracing testing Arnd Bergmann
@ 2023-06-21 16:06   ` Peter Zijlstra
  2023-07-04  7:46     ` Richard W.M. Jones
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2023-06-21 16:06 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Naresh Kamboju, Anders Roxell, Daniel Díaz,
	Benjamin Copeland, linux-kernel, x86

On Wed, Jun 21, 2023 at 05:31:15PM +0200, Arnd Bergmann wrote:

> I don't know much about x86 exception handling, but my guess is
> that this is where the stack overflows, so this backtrace
> is not all that useful. Looking at the full log from your link,
> I see that recursion through asm_exc_int3:
> 
> <4>[   49.886694]  ? exc_int3+0x62/0x80
> <4>[   49.886714]  ? asm_exc_int3+0x3e/0x50
> <4>[   49.886759]  ? preempt_count_sub+0x5/0x80
> <4>[   49.886783]  ? preempt_count_sub+0x5/0x80
> <4>[   49.886805]  ? irq_work_queue+0x40/0x80
> <4>[   49.886826]  ? defer_console_output+0x49/0x80
> <4>[   49.886840]  ? vprintk+0x42/0x60
> <4>[   49.886857]  ? _printk+0x5d/0x80
> <4>[   49.886891]  ? die+0x9c/0xe0
> <4>[   49.886922]  ? exc_int3+0x62/0x80
> <4>[   49.886942]  ? asm_exc_int3+0x3e/0x50

So this is exc_int3(), but it is past poke_int3_handler() which is the
counter-part of text_poke_bp_batch() -- die() seems to suggest we're
even past do_int3().

The fact that it continued past poke_int3_handler() suggests that it is
a 'stray' int3, not covered by an active text modification site.

> <4>[   49.886966]  ? __pfx_tick_sched_timer+0x10/0x10
> <4>[   49.886980]  ? __pfx_read_tsc+0x10/0x10
> <4>[   49.887010]  ? preempt_count_sub+0x5/0x80
> <4>[   49.887035]  ? preempt_count_sub+0x5/0x80
> <4>[   49.887057]  ? __hrtimer_run_queues+0xee/0x330
> <4>[   49.887070]  ? _raw_spin_unlock_irqrestore+0x28/0x50
> <4>[   49.887085]  ? __hrtimer_run_queues+0xee/0x330
> <4>[   49.887136]  ? hrtimer_interrupt+0xf6/0x390
> <4>[   49.887167]  ? __sysvec_apic_timer_interrupt+0x64/0x1a0
> <4>[   49.887189]  ? sysvec_apic_timer_interrupt+0x7a/0x90
> <4>[   49.887206]  </IRQ>
> <4>[   49.887219]  <TASK>
> <4>[   49.887231]  ? asm_sysvec_apic_timer_interrupt+0x1f/0x30
> <4>[   49.887252]  ? sched_rt_period_timer+0x4/0x390
> <4>[   49.887272]  ? __pfx_do_sync_core+0x10/0x10
> <4>[   49.887305]  ? insn_get_displacement+0x9/0x160
> <4>[   49.887329]  ? insn_get_displacement+0x9/0x160
> <4>[   49.887344]  ? insn_get_immediate+0xd2/0x270
> <4>[   49.887365]  ? insn_decode+0x113/0x150
> <4>[   49.887384]  ? text_poke_loc_init+0xea/0x220
> <4>[   49.887429]  ? sched_rt_period_timer+0x4/0x390
> <4>[   49.887451]  ? text_poke_queue+0x89/0xa0

This is the text_poke_queue() stage, it is still collecting sites to
poke but hasn't actually started yet. The actual poking happens in
text_poke_bp_batch(), which isn't in the call-chain afaict.

> <4>[   49.887477]  ? ftrace_replace_code+0x149/0x1f0
> <4>[   49.887508]  ? ftrace_modify_all_code+0x71/0x140
> <4>[   49.887533]  ? arch_ftrace_update_code+0xd/0x20
> <4>[   49.887550]  ? ftrace_shutdown+0xf5/0x220
> <4>[   49.887577]  ? unregister_ftrace_function+0x2e/0x150
> <4>[   49.887608]  ? stack_trace_sysctl+0x82/0xb0
> <4>[   49.887636]  ? proc_sys_call_handler+0x18b/0x280
> <4>[   49.887675]  ? proc_sys_write+0x17/0x20
> <4>[   49.887692]  ? vfs_write+0x324/0x3f0
> <4>[   49.887742]  ? ksys_write+0x75/0xe0
> <4>[   49.887771]  ? __x64_sys_write+0x1f/0x30
> <4>[   49.887787]  ? do_syscall_64+0x48/0xa0
> <4>[   49.887802]  ? sysvec_apic_timer_interrupt+0x4d/0x90
> <4>[   49.887822]  ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> <4>[   49.887878]  </TASK>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: qemu-x86_64 booting with 8.0.0 stil see int3: when running LTP tracing testing.
  2023-06-21 16:06   ` Peter Zijlstra
@ 2023-07-04  7:46     ` Richard W.M. Jones
  2023-07-04 13:21       ` Richard W.M. Jones
  2023-07-05 16:28       ` Richard W.M. Jones
  0 siblings, 2 replies; 15+ messages in thread
From: Richard W.M. Jones @ 2023-07-04  7:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arnd Bergmann, Naresh Kamboju, Anders Roxell, Daniel Díaz,
	Benjamin Copeland, linux-kernel, x86, Richard Henderson

On Wed, Jun 21, 2023 at 06:06:55PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 21, 2023 at 05:31:15PM +0200, Arnd Bergmann wrote:
> 
> > I don't know much about x86 exception handling, but my guess is
> > that this is where the stack overflows, so this backtrace
> > is not all that useful. Looking at the full log from your link,
> > I see that recursion through asm_exc_int3:
> > 
> > <4>[   49.886694]  ? exc_int3+0x62/0x80
> > <4>[   49.886714]  ? asm_exc_int3+0x3e/0x50
> > <4>[   49.886759]  ? preempt_count_sub+0x5/0x80
> > <4>[   49.886783]  ? preempt_count_sub+0x5/0x80
> > <4>[   49.886805]  ? irq_work_queue+0x40/0x80
> > <4>[   49.886826]  ? defer_console_output+0x49/0x80
> > <4>[   49.886840]  ? vprintk+0x42/0x60
> > <4>[   49.886857]  ? _printk+0x5d/0x80
> > <4>[   49.886891]  ? die+0x9c/0xe0
> > <4>[   49.886922]  ? exc_int3+0x62/0x80
> > <4>[   49.886942]  ? asm_exc_int3+0x3e/0x50
> 
> So this is exc_int3(), but it is past poke_int3_handler() which is the
> counter-part of text_poke_bp_batch() -- die() seems to suggest we're
> even past do_int3().
> 
> The fact that it continued past poke_int3_handler() suggests that it is
> a 'stray' int3, not covered by an active text modification site.
> 
> > <4>[   49.886966]  ? __pfx_tick_sched_timer+0x10/0x10
> > <4>[   49.886980]  ? __pfx_read_tsc+0x10/0x10
> > <4>[   49.887010]  ? preempt_count_sub+0x5/0x80
> > <4>[   49.887035]  ? preempt_count_sub+0x5/0x80
> > <4>[   49.887057]  ? __hrtimer_run_queues+0xee/0x330
> > <4>[   49.887070]  ? _raw_spin_unlock_irqrestore+0x28/0x50
> > <4>[   49.887085]  ? __hrtimer_run_queues+0xee/0x330
> > <4>[   49.887136]  ? hrtimer_interrupt+0xf6/0x390
> > <4>[   49.887167]  ? __sysvec_apic_timer_interrupt+0x64/0x1a0
> > <4>[   49.887189]  ? sysvec_apic_timer_interrupt+0x7a/0x90
> > <4>[   49.887206]  </IRQ>
> > <4>[   49.887219]  <TASK>
> > <4>[   49.887231]  ? asm_sysvec_apic_timer_interrupt+0x1f/0x30
> > <4>[   49.887252]  ? sched_rt_period_timer+0x4/0x390
> > <4>[   49.887272]  ? __pfx_do_sync_core+0x10/0x10
> > <4>[   49.887305]  ? insn_get_displacement+0x9/0x160
> > <4>[   49.887329]  ? insn_get_displacement+0x9/0x160
> > <4>[   49.887344]  ? insn_get_immediate+0xd2/0x270
> > <4>[   49.887365]  ? insn_decode+0x113/0x150
> > <4>[   49.887384]  ? text_poke_loc_init+0xea/0x220
> > <4>[   49.887429]  ? sched_rt_period_timer+0x4/0x390
> > <4>[   49.887451]  ? text_poke_queue+0x89/0xa0
> 
> This is the text_poke_queue() stage, it is still collecting sites to
> poke but hasn't actually started yet. The actual poking happens in
> text_poke_bp_batch(), which isn't in the call-chain afaict.
> 
> > <4>[   49.887477]  ? ftrace_replace_code+0x149/0x1f0
> > <4>[   49.887508]  ? ftrace_modify_all_code+0x71/0x140
> > <4>[   49.887533]  ? arch_ftrace_update_code+0xd/0x20
> > <4>[   49.887550]  ? ftrace_shutdown+0xf5/0x220
> > <4>[   49.887577]  ? unregister_ftrace_function+0x2e/0x150
> > <4>[   49.887608]  ? stack_trace_sysctl+0x82/0xb0
> > <4>[   49.887636]  ? proc_sys_call_handler+0x18b/0x280
> > <4>[   49.887675]  ? proc_sys_write+0x17/0x20
> > <4>[   49.887692]  ? vfs_write+0x324/0x3f0
> > <4>[   49.887742]  ? ksys_write+0x75/0xe0
> > <4>[   49.887771]  ? __x64_sys_write+0x1f/0x30
> > <4>[   49.887787]  ? do_syscall_64+0x48/0xa0
> > <4>[   49.887802]  ? sysvec_apic_timer_interrupt+0x4d/0x90
> > <4>[   49.887822]  ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > <4>[   49.887878]  </TASK>

Did anything happen with this bug later?  I see nothing on any public
list, nor any public bug tracker.

We have been having the same sort of problem
(https://bugzilla.redhat.com/show_bug.cgi?id=2216496).  It's another
of those bugs that requires hundreds or thousands of boot iterations
before you can see it.  There is a test in comment 27 but it requires
guestfish and some hacking to work.  I'll try to come up with a
cleaner test later.

We see stack traces like:

[    3.081939] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    3.082266] clocksource: Switched to clocksource acpi_pm
[    3.090201] NET: Registered PF_INET protocol family
[    3.093098] int3: 0000 [#1] PREEMPT SMP NOPTI
[    3.093098] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 6.4.0-10173-ga901a3568fd2 #8
[    3.093098] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[    3.093098] RIP: 0010:__mod_timer+0x1c3/0x370
[    3.093098] Code: 00 00 41 bd ff ff ff ff 31 d2 4c 89 f6 4c 89 ff e8 f2 ef ff ff 41 89 c4 85 c0 75 09 83 e3 01 0f 85 54 ff ff ff 41 8b 4f 20 66 <90> f7 c1 00 00 10 00 0f 84 23 01 00 00 48 c7 c3 40 cc 01 00 65 48
[    3.093098] RSP: 0018:ffffaf1600013e00 EFLAGS: 00000046
[    3.093098] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000280003
[    3.093098] RDX: 0000000000000000 RSI: ffff9aa90fd9dec0 RDI: ffffffff8441e4b8
[    3.093098] RBP: 00000000fffc200d R08: ffffffff8441e4a0 R09: ffffffff8441e4b8
[    3.093098] R10: 0000000000000001 R11: 000000000002e990 R12: 0000000000000000
[    3.093098] R13: 00000000ffffffff R14: ffff9aa90fd9dec0 R15: ffffffff8441e4b8
[    3.093098] FS:  0000000000000000(0000) GS:ffff9aa90fd80000(0000) knlGS:0000000000000000
[    3.093098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    3.093098] CR2: 0000000000000000 CR3: 000000004e02e000 CR4: 0000000000750ee0
[    3.093098] PKRU: 55555554
[    3.093098] Call Trace:
[    3.093098]  <TASK>
[    3.093098]  ? die+0x31/0x80
[    3.093098]  ? exc_int3+0x10e/0x120
[    3.093098]  ? asm_exc_int3+0x39/0x40
[    3.093098]  ? __mod_timer+0x1c3/0x370
[    3.093098]  ? __mod_timer+0x1c3/0x370
[    3.093098]  queue_delayed_work_on+0x23/0x30
[    3.093098]  neigh_table_init+0x1bb/0x2e0
[    3.093098]  arp_init+0x12/0x50
[    3.093098]  inet_init+0x15b/0x2f0
[    3.093098]  ? __pfx_inet_init+0x10/0x10
[    3.093098]  do_one_initcall+0x58/0x230
[    3.093098]  kernel_init_freeable+0x199/0x2d0
[    3.093098]  ? __pfx_kernel_init+0x10/0x10
[    3.093098]  kernel_init+0x15/0x1b0
[    3.093098]  ret_from_fork+0x2c/0x50
[    3.093098]  </TASK>
[    3.093098] Modules linked in:
[    3.093098] ---[ end trace 0000000000000000 ]---
[    3.093098] RIP: 0010:__mod_timer+0x1c3/0x370
[    3.093098] Code: 00 00 41 bd ff ff ff ff 31 d2 4c 89 f6 4c 89 ff e8 f2 ef ff ff 41 89 c4 85 c0 75 09 83 e3 01 0f 85 54 ff ff ff 41 8b 4f 20 66 <90> f7 c1 00 00 10 00 0f 84 23 01 00 00 48 c7 c3 40 cc 01 00 65 48
[    3.093098] RSP: 0018:ffffaf1600013e00 EFLAGS: 00000046
[    3.093098] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000280003
[    3.093098] RDX: 0000000000000000 RSI: ffff9aa90fd9dec0 RDI: ffffffff8441e4b8
[    3.093098] RBP: 00000000fffc200d R08: ffffffff8441e4a0 R09: ffffffff8441e4b8
[    3.093098] R10: 0000000000000001 R11: 000000000002e990 R12: 0000000000000000
[    3.093098] R13: 00000000ffffffff R14: ffff9aa90fd9dec0 R15: ffffffff8441e4b8
[    3.093098] FS:  0000000000000000(0000) GS:ffff9aa90fd80000(0000) knlGS:0000000000000000
[    3.093098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    3.093098] CR2: 0000000000000000 CR3: 000000004e02e000 CR4: 0000000000750ee0
[    3.093098] PKRU: 55555554
[    3.093098] Kernel panic - not syncing: Fatal exception in interrupt

There are many variations, but the common pattern seems to be
<something in the clock or timer code> -> int3 exception

It only happens under qemu TCG (software emulation).

It goes away if we recompile qemu without MTTCG support.

It only happens with -smp enabled, we are using qemu -smp 4

We are using qemu-system-x86_64 full system emulation on x86_64 host
(ie. forcing KVM off).

It happens with the latest upstream kernel and qemu, compiled from
source.

I'd love to have a better way to debug this or collect more
diagnostics if you have any suggestions.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: qemu-x86_64 booting with 8.0.0 stil see int3: when running LTP tracing testing.
  2023-07-04  7:46     ` Richard W.M. Jones
@ 2023-07-04 13:21       ` Richard W.M. Jones
  2023-07-05 16:28       ` Richard W.M. Jones
  1 sibling, 0 replies; 15+ messages in thread
From: Richard W.M. Jones @ 2023-07-04 13:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arnd Bergmann, Naresh Kamboju, Anders Roxell, Daniel Díaz,
	Benjamin Copeland, linux-kernel, x86, Richard Henderson

On Tue, Jul 04, 2023 at 08:46:20AM +0100, Richard W.M. Jones wrote:
> We have been having the same sort of problem
> (https://bugzilla.redhat.com/show_bug.cgi?id=2216496).  It's another
> of those bugs that requires hundreds or thousands of boot iterations
> before you can see it.  There is a test in comment 27 but it requires
> guestfish and some hacking to work.  I'll try to come up with a
> cleaner test later.

FWIW here's a better test.  It only uses qemu-system-x86_64 & a
vmlinuz file of your choice, and is very fast.  It usually hits the
bug in seconds.

  https://github.com/rwmjones/bootbootboot/tree/bz2216496

NB: You will need to change the definition of VMLINUX in
config-bz2216496-qemu.h (and generally read the code before running,
but hopefully it should just work apart from choosing a kernel image).

> I'd love to have a better way to debug this or collect more
> diagnostics if you have any suggestions.

Still open to any suggestions about better ways to debug this, or
anything you'd like me to try out.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: qemu-x86_64 booting with 8.0.0 stil see int3: when running LTP tracing testing.
  2023-07-04  7:46     ` Richard W.M. Jones
  2023-07-04 13:21       ` Richard W.M. Jones
@ 2023-07-05 16:28       ` Richard W.M. Jones
       [not found]         ` <CAFXwXrmbpuFNf5=nQxiTteo8fpCdAbK4pEAN176Cq0yvwZcfFw@mail.gmail.com>
  2023-07-05 21:50         ` Richard W.M. Jones
  1 sibling, 2 replies; 15+ messages in thread
From: Richard W.M. Jones @ 2023-07-05 16:28 UTC (permalink / raw)
  To: Peter Zijlstra, Richard Henderson
  Cc: Arnd Bergmann, Naresh Kamboju, Anders Roxell, Daniel Díaz,
	Benjamin Copeland, linux-kernel, x86, Paolo Bonzini

On Tue, Jul 04, 2023 at 08:46:20AM +0100, Richard W.M. Jones wrote:
> We have been having the same sort of problem
> (https://bugzilla.redhat.com/show_bug.cgi?id=2216496).  It's another
> of those bugs that requires hundreds or thousands of boot iterations
> before you can see it.  There is a test in comment 27 but it requires
> guestfish and some hacking to work.  I'll try to come up with a
> cleaner test later.
> 
> We see stack traces like:
> 
> [    3.081939] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> [    3.082266] clocksource: Switched to clocksource acpi_pm
> [    3.090201] NET: Registered PF_INET protocol family
> [    3.093098] int3: 0000 [#1] PREEMPT SMP NOPTI
> [    3.093098] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 6.4.0-10173-ga901a3568fd2 #8
> [    3.093098] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
> [    3.093098] RIP: 0010:__mod_timer+0x1c3/0x370
> [    3.093098] Code: 00 00 41 bd ff ff ff ff 31 d2 4c 89 f6 4c 89 ff e8 f2 ef ff ff 41 89 c4 85 c0 75 09 83 e3 01 0f 85 54 ff ff ff 41 8b 4f 20 66 <90> f7 c1 00 00 10 00 0f 84 23 01 00 00 48 c7 c3 40 cc 01 00 65 48
> [    3.093098] RSP: 0018:ffffaf1600013e00 EFLAGS: 00000046
> [    3.093098] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000280003
> [    3.093098] RDX: 0000000000000000 RSI: ffff9aa90fd9dec0 RDI: ffffffff8441e4b8
> [    3.093098] RBP: 00000000fffc200d R08: ffffffff8441e4a0 R09: ffffffff8441e4b8
> [    3.093098] R10: 0000000000000001 R11: 000000000002e990 R12: 0000000000000000
> [    3.093098] R13: 00000000ffffffff R14: ffff9aa90fd9dec0 R15: ffffffff8441e4b8
> [    3.093098] FS:  0000000000000000(0000) GS:ffff9aa90fd80000(0000) knlGS:0000000000000000
> [    3.093098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    3.093098] CR2: 0000000000000000 CR3: 000000004e02e000 CR4: 0000000000750ee0
> [    3.093098] PKRU: 55555554
> [    3.093098] Call Trace:
> [    3.093098]  <TASK>
> [    3.093098]  ? die+0x31/0x80
> [    3.093098]  ? exc_int3+0x10e/0x120
> [    3.093098]  ? asm_exc_int3+0x39/0x40
> [    3.093098]  ? __mod_timer+0x1c3/0x370
> [    3.093098]  ? __mod_timer+0x1c3/0x370
> [    3.093098]  queue_delayed_work_on+0x23/0x30
> [    3.093098]  neigh_table_init+0x1bb/0x2e0
> [    3.093098]  arp_init+0x12/0x50
> [    3.093098]  inet_init+0x15b/0x2f0
> [    3.093098]  ? __pfx_inet_init+0x10/0x10
> [    3.093098]  do_one_initcall+0x58/0x230
> [    3.093098]  kernel_init_freeable+0x199/0x2d0
> [    3.093098]  ? __pfx_kernel_init+0x10/0x10
> [    3.093098]  kernel_init+0x15/0x1b0
> [    3.093098]  ret_from_fork+0x2c/0x50
> [    3.093098]  </TASK>
> [    3.093098] Modules linked in:
> [    3.093098] ---[ end trace 0000000000000000 ]---
> [    3.093098] RIP: 0010:__mod_timer+0x1c3/0x370
> [    3.093098] Code: 00 00 41 bd ff ff ff ff 31 d2 4c 89 f6 4c 89 ff e8 f2 ef ff ff 41 89 c4 85 c0 75 09 83 e3 01 0f 85 54 ff ff ff 41 8b 4f 20 66 <90> f7 c1 00 00 10 00 0f 84 23 01 00 00 48 c7 c3 40 cc 01 00 65 48
> [    3.093098] RSP: 0018:ffffaf1600013e00 EFLAGS: 00000046
> [    3.093098] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000280003
> [    3.093098] RDX: 0000000000000000 RSI: ffff9aa90fd9dec0 RDI: ffffffff8441e4b8
> [    3.093098] RBP: 00000000fffc200d R08: ffffffff8441e4a0 R09: ffffffff8441e4b8
> [    3.093098] R10: 0000000000000001 R11: 000000000002e990 R12: 0000000000000000
> [    3.093098] R13: 00000000ffffffff R14: ffff9aa90fd9dec0 R15: ffffffff8441e4b8
> [    3.093098] FS:  0000000000000000(0000) GS:ffff9aa90fd80000(0000) knlGS:0000000000000000
> [    3.093098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    3.093098] CR2: 0000000000000000 CR3: 000000004e02e000 CR4: 0000000000750ee0
> [    3.093098] PKRU: 55555554
> [    3.093098] Kernel panic - not syncing: Fatal exception in interrupt
> 
> There are many variations, but the common pattern seems to be
> <something in the clock or timer code> -> int3 exception
> 
> It only happens under qemu TCG (software emulation).
> 
> It goes away if we recompile qemu without MTTCG support.
> 
> It only happens with -smp enabled, we are using qemu -smp 4
> 
> We are using qemu-system-x86_64 full system emulation on x86_64 host
> (ie. forcing KVM off).
> 
> It happens with the latest upstream kernel and qemu, compiled from
> source.

I got a bit further on this one.

The bug happens in the code that updates the static branch used for at
least these two keys:

  static DEFINE_STATIC_KEY_FALSE(__sched_clock_stable);
  DEFINE_STATIC_KEY_FALSE(timers_migration_enabled);

There are probably others (it seems a generic problem with how static
branches are handled by TCG), but I only see the bug for those two.

When the static branch is updated we end up calling
arch/x86/kernel/alternative.c:text_poke_bp_batch().  It's best to read
the description of that function to see where int3 is used:

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/alternative.c#n2086

I modified the qemu TCG int3 helper so it dumps the code at %rip when
the interrupt fires, and I can actually see the changes in the above
function happen, first int3 being written, then the end of the nop,
then the int3 being overwritten with the first byte of the nop.

Unfortunately the int3 still fires after the code has been completely
rewritten to its final (ie nop) value.

This seems to indicate to me that neither the self-write to the kernel
text segment, nor sync_core (implemented by a "iret to self" trick)
invalidates the qemu TCG translation block containing the old int3
helper call.  Thus we (qemu) never "see" the new nop, we keep
emulating int3, and that causes the kernel to crash.

I added print statements inside tb_invalidate_phys_page() and this
function seems never to be called at all.  It's my understanding that
at least the kernel writing to its text segment ought to cause
tb_invalidate_phys_page() to be called, but I'm not super-familiar
with this qemu code.

Richard Henderson - do you have any suggestions?

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: qemu-x86_64 booting with 8.0.0 stil see int3: when running LTP tracing testing.
       [not found]         ` <CAFXwXrmbpuFNf5=nQxiTteo8fpCdAbK4pEAN176Cq0yvwZcfFw@mail.gmail.com>
@ 2023-07-05 16:35           ` Richard W.M. Jones
       [not found]             ` <CAFXwXrk1FEZPUO-zqNVJZ6YCHKUkgNehwmyDYuOr5fx8ff0OCA@mail.gmail.com>
  2023-07-05 16:37           ` Richard W.M. Jones
  1 sibling, 1 reply; 15+ messages in thread
From: Richard W.M. Jones @ 2023-07-05 16:35 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Peter Zijlstra, Arnd Bergmann, Naresh Kamboju, Anders Roxell,
	Daniel Díaz, Benjamin Copeland, lkml - Kernel Mailing List,
	the arch/x86 maintainers, Paolo Bonzini

On Wed, Jul 05, 2023 at 06:32:30PM +0200, Richard Henderson wrote:
> https://gitlab.com/qemu-project/qemu/-/commit/
> 3307e08c6f142bb3d2406cfbc0ee19359748b51a

I've got that patch in my qemu tree.  Do you suspect it's
wrong and want me to try reverting it?

Rich.

> r~
> 
> On Wed, 5 July 2023, 18:28 Richard W.M. Jones, <rjones@redhat.com> wrote:
> 
>     On Tue, Jul 04, 2023 at 08:46:20AM +0100, Richard W.M. Jones wrote:
>     > We have been having the same sort of problem
>     > (https://bugzilla.redhat.com/show_bug.cgi?id=2216496).  It's another
>     > of those bugs that requires hundreds or thousands of boot iterations
>     > before you can see it.  There is a test in comment 27 but it requires
>     > guestfish and some hacking to work.  I'll try to come up with a
>     > cleaner test later.
>     >
>     > We see stack traces like:
>     >
>     > [    3.081939] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff,
>     max_idle_ns: 2085701024 ns
>     > [    3.082266] clocksource: Switched to clocksource acpi_pm
>     > [    3.090201] NET: Registered PF_INET protocol family
>     > [    3.093098] int3: 0000 [#1] PREEMPT SMP NOPTI
>     > [    3.093098] CPU: 3 PID: 1 Comm: swapper/0 Not tainted
>     6.4.0-10173-ga901a3568fd2 #8
>     > [    3.093098] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
>     rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
>     > [    3.093098] RIP: 0010:__mod_timer+0x1c3/0x370
>     > [    3.093098] Code: 00 00 41 bd ff ff ff ff 31 d2 4c 89 f6 4c 89 ff e8
>     f2 ef ff ff 41 89 c4 85 c0 75 09 83 e3 01 0f 85 54 ff ff ff 41 8b 4f 20 66
>     <90> f7 c1 00 00 10 00 0f 84 23 01 00 00 48 c7 c3 40 cc 01 00 65 48
>     > [    3.093098] RSP: 0018:ffffaf1600013e00 EFLAGS: 00000046
>     > [    3.093098] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
>     0000000000280003
>     > [    3.093098] RDX: 0000000000000000 RSI: ffff9aa90fd9dec0 RDI:
>     ffffffff8441e4b8
>     > [    3.093098] RBP: 00000000fffc200d R08: ffffffff8441e4a0 R09:
>     ffffffff8441e4b8
>     > [    3.093098] R10: 0000000000000001 R11: 000000000002e990 R12:
>     0000000000000000
>     > [    3.093098] R13: 00000000ffffffff R14: ffff9aa90fd9dec0 R15:
>     ffffffff8441e4b8
>     > [    3.093098] FS:  0000000000000000(0000) GS:ffff9aa90fd80000(0000)
>     knlGS:0000000000000000
>     > [    3.093098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>     > [    3.093098] CR2: 0000000000000000 CR3: 000000004e02e000 CR4:
>     0000000000750ee0
>     > [    3.093098] PKRU: 55555554
>     > [    3.093098] Call Trace:
>     > [    3.093098]  <TASK>
>     > [    3.093098]  ? die+0x31/0x80
>     > [    3.093098]  ? exc_int3+0x10e/0x120
>     > [    3.093098]  ? asm_exc_int3+0x39/0x40
>     > [    3.093098]  ? __mod_timer+0x1c3/0x370
>     > [    3.093098]  ? __mod_timer+0x1c3/0x370
>     > [    3.093098]  queue_delayed_work_on+0x23/0x30
>     > [    3.093098]  neigh_table_init+0x1bb/0x2e0
>     > [    3.093098]  arp_init+0x12/0x50
>     > [    3.093098]  inet_init+0x15b/0x2f0
>     > [    3.093098]  ? __pfx_inet_init+0x10/0x10
>     > [    3.093098]  do_one_initcall+0x58/0x230
>     > [    3.093098]  kernel_init_freeable+0x199/0x2d0
>     > [    3.093098]  ? __pfx_kernel_init+0x10/0x10
>     > [    3.093098]  kernel_init+0x15/0x1b0
>     > [    3.093098]  ret_from_fork+0x2c/0x50
>     > [    3.093098]  </TASK>
>     > [    3.093098] Modules linked in:
>     > [    3.093098] ---[ end trace 0000000000000000 ]---
>     > [    3.093098] RIP: 0010:__mod_timer+0x1c3/0x370
>     > [    3.093098] Code: 00 00 41 bd ff ff ff ff 31 d2 4c 89 f6 4c 89 ff e8
>     f2 ef ff ff 41 89 c4 85 c0 75 09 83 e3 01 0f 85 54 ff ff ff 41 8b 4f 20 66
>     <90> f7 c1 00 00 10 00 0f 84 23 01 00 00 48 c7 c3 40 cc 01 00 65 48
>     > [    3.093098] RSP: 0018:ffffaf1600013e00 EFLAGS: 00000046
>     > [    3.093098] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
>     0000000000280003
>     > [    3.093098] RDX: 0000000000000000 RSI: ffff9aa90fd9dec0 RDI:
>     ffffffff8441e4b8
>     > [    3.093098] RBP: 00000000fffc200d R08: ffffffff8441e4a0 R09:
>     ffffffff8441e4b8
>     > [    3.093098] R10: 0000000000000001 R11: 000000000002e990 R12:
>     0000000000000000
>     > [    3.093098] R13: 00000000ffffffff R14: ffff9aa90fd9dec0 R15:
>     ffffffff8441e4b8
>     > [    3.093098] FS:  0000000000000000(0000) GS:ffff9aa90fd80000(0000)
>     knlGS:0000000000000000
>     > [    3.093098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>     > [    3.093098] CR2: 0000000000000000 CR3: 000000004e02e000 CR4:
>     0000000000750ee0
>     > [    3.093098] PKRU: 55555554
>     > [    3.093098] Kernel panic - not syncing: Fatal exception in interrupt
>     >
>     > There are many variations, but the common pattern seems to be
>     > <something in the clock or timer code> -> int3 exception
>     >
>     > It only happens under qemu TCG (software emulation).
>     >
>     > It goes away if we recompile qemu without MTTCG support.
>     >
>     > It only happens with -smp enabled, we are using qemu -smp 4
>     >
>     > We are using qemu-system-x86_64 full system emulation on x86_64 host
>     > (ie. forcing KVM off).
>     >
>     > It happens with the latest upstream kernel and qemu, compiled from
>     > source.
> 
>     I got a bit further on this one.
> 
>     The bug happens in the code that updates the static branch used for at
>     least these two keys:
> 
>       static DEFINE_STATIC_KEY_FALSE(__sched_clock_stable);
>       DEFINE_STATIC_KEY_FALSE(timers_migration_enabled);
> 
>     There are probably others (it seems a generic problem with how static
>     branches are handled by TCG), but I only see the bug for those two.
> 
>     When the static branch is updated we end up calling
>     arch/x86/kernel/alternative.c:text_poke_bp_batch().  It's best to read
>     the description of that function to see where int3 is used:
> 
>       https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/
>     arch/x86/kernel/alternative.c#n2086
> 
>     I modified the qemu TCG int3 helper so it dumps the code at %rip when
>     the interrupt fires, and I can actually see the changes in the above
>     function happen, first int3 being written, then the end of the nop,
>     then the int3 being overwritten with the first byte of the nop.
> 
>     Unfortunately the int3 still fires after the code has been completely
>     rewritten to its final (ie nop) value.
> 
>     This seems to indicate to me that neither the self-write to the kernel
>     text segment, nor sync_core (implemented by a "iret to self" trick)
>     invalidates the qemu TCG translation block containing the old int3
>     helper call.  Thus we (qemu) never "see" the new nop, we keep
>     emulating int3, and that causes the kernel to crash.
> 
>     I added print statements inside tb_invalidate_phys_page() and this
>     function seems never to be called at all.  It's my understanding that
>     at least the kernel writing to its text segment ought to cause
>     tb_invalidate_phys_page() to be called, but I'm not super-familiar
>     with this qemu code.
> 
>     Richard Henderson - do you have any suggestions?
> 
>     Rich.
> 
>     --
>     Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/
>     ~rjones
>     Read my programming and virtualization blog: http://rwmj.wordpress.com
>     libguestfs lets you edit virtual machines.  Supports shell scripting,
>     bindings from many languages.  http://libguestfs.org
> 
> 

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: qemu-x86_64 booting with 8.0.0 stil see int3: when running LTP tracing testing.
       [not found]         ` <CAFXwXrmbpuFNf5=nQxiTteo8fpCdAbK4pEAN176Cq0yvwZcfFw@mail.gmail.com>
  2023-07-05 16:35           ` Richard W.M. Jones
@ 2023-07-05 16:37           ` Richard W.M. Jones
  1 sibling, 0 replies; 15+ messages in thread
From: Richard W.M. Jones @ 2023-07-05 16:37 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Peter Zijlstra, Arnd Bergmann, Naresh Kamboju, Anders Roxell,
	Daniel Díaz, Benjamin Copeland, lkml - Kernel Mailing List,
	the arch/x86 maintainers, Paolo Bonzini

On Wed, Jul 05, 2023 at 06:32:30PM +0200, Richard Henderson wrote:
> https://gitlab.com/qemu-project/qemu/-/commit/
> 3307e08c6f142bb3d2406cfbc0ee19359748b51a

FWIW reverting this commit (alone) causes qemu to throw this assertion:

qemu-system-x86_64: ../accel/tcg/tb-maint.c:1096: tb_invalidate_phys_page_range__locked: Assertion `((start ^ last) & ((target_long)-1 << 12)) == 0' failed.

I've seen this bug happen on both very old and very new versions of
qemu (even back to 6.0), so I don't think the bug is caused by or
fixed by any recent change.

Rich.

> r~
> 
> On Wed, 5 July 2023, 18:28 Richard W.M. Jones, <rjones@redhat.com> wrote:
> 
>     On Tue, Jul 04, 2023 at 08:46:20AM +0100, Richard W.M. Jones wrote:
>     > We have been having the same sort of problem
>     > (https://bugzilla.redhat.com/show_bug.cgi?id=2216496).  It's another
>     > of those bugs that requires hundreds or thousands of boot iterations
>     > before you can see it.  There is a test in comment 27 but it requires
>     > guestfish and some hacking to work.  I'll try to come up with a
>     > cleaner test later.
>     >
>     > We see stack traces like:
>     >
>     > [    3.081939] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff,
>     max_idle_ns: 2085701024 ns
>     > [    3.082266] clocksource: Switched to clocksource acpi_pm
>     > [    3.090201] NET: Registered PF_INET protocol family
>     > [    3.093098] int3: 0000 [#1] PREEMPT SMP NOPTI
>     > [    3.093098] CPU: 3 PID: 1 Comm: swapper/0 Not tainted
>     6.4.0-10173-ga901a3568fd2 #8
>     > [    3.093098] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
>     rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
>     > [    3.093098] RIP: 0010:__mod_timer+0x1c3/0x370
>     > [    3.093098] Code: 00 00 41 bd ff ff ff ff 31 d2 4c 89 f6 4c 89 ff e8
>     f2 ef ff ff 41 89 c4 85 c0 75 09 83 e3 01 0f 85 54 ff ff ff 41 8b 4f 20 66
>     <90> f7 c1 00 00 10 00 0f 84 23 01 00 00 48 c7 c3 40 cc 01 00 65 48
>     > [    3.093098] RSP: 0018:ffffaf1600013e00 EFLAGS: 00000046
>     > [    3.093098] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
>     0000000000280003
>     > [    3.093098] RDX: 0000000000000000 RSI: ffff9aa90fd9dec0 RDI:
>     ffffffff8441e4b8
>     > [    3.093098] RBP: 00000000fffc200d R08: ffffffff8441e4a0 R09:
>     ffffffff8441e4b8
>     > [    3.093098] R10: 0000000000000001 R11: 000000000002e990 R12:
>     0000000000000000
>     > [    3.093098] R13: 00000000ffffffff R14: ffff9aa90fd9dec0 R15:
>     ffffffff8441e4b8
>     > [    3.093098] FS:  0000000000000000(0000) GS:ffff9aa90fd80000(0000)
>     knlGS:0000000000000000
>     > [    3.093098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>     > [    3.093098] CR2: 0000000000000000 CR3: 000000004e02e000 CR4:
>     0000000000750ee0
>     > [    3.093098] PKRU: 55555554
>     > [    3.093098] Call Trace:
>     > [    3.093098]  <TASK>
>     > [    3.093098]  ? die+0x31/0x80
>     > [    3.093098]  ? exc_int3+0x10e/0x120
>     > [    3.093098]  ? asm_exc_int3+0x39/0x40
>     > [    3.093098]  ? __mod_timer+0x1c3/0x370
>     > [    3.093098]  ? __mod_timer+0x1c3/0x370
>     > [    3.093098]  queue_delayed_work_on+0x23/0x30
>     > [    3.093098]  neigh_table_init+0x1bb/0x2e0
>     > [    3.093098]  arp_init+0x12/0x50
>     > [    3.093098]  inet_init+0x15b/0x2f0
>     > [    3.093098]  ? __pfx_inet_init+0x10/0x10
>     > [    3.093098]  do_one_initcall+0x58/0x230
>     > [    3.093098]  kernel_init_freeable+0x199/0x2d0
>     > [    3.093098]  ? __pfx_kernel_init+0x10/0x10
>     > [    3.093098]  kernel_init+0x15/0x1b0
>     > [    3.093098]  ret_from_fork+0x2c/0x50
>     > [    3.093098]  </TASK>
>     > [    3.093098] Modules linked in:
>     > [    3.093098] ---[ end trace 0000000000000000 ]---
>     > [    3.093098] RIP: 0010:__mod_timer+0x1c3/0x370
>     > [    3.093098] Code: 00 00 41 bd ff ff ff ff 31 d2 4c 89 f6 4c 89 ff e8
>     f2 ef ff ff 41 89 c4 85 c0 75 09 83 e3 01 0f 85 54 ff ff ff 41 8b 4f 20 66
>     <90> f7 c1 00 00 10 00 0f 84 23 01 00 00 48 c7 c3 40 cc 01 00 65 48
>     > [    3.093098] RSP: 0018:ffffaf1600013e00 EFLAGS: 00000046
>     > [    3.093098] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
>     0000000000280003
>     > [    3.093098] RDX: 0000000000000000 RSI: ffff9aa90fd9dec0 RDI:
>     ffffffff8441e4b8
>     > [    3.093098] RBP: 00000000fffc200d R08: ffffffff8441e4a0 R09:
>     ffffffff8441e4b8
>     > [    3.093098] R10: 0000000000000001 R11: 000000000002e990 R12:
>     0000000000000000
>     > [    3.093098] R13: 00000000ffffffff R14: ffff9aa90fd9dec0 R15:
>     ffffffff8441e4b8
>     > [    3.093098] FS:  0000000000000000(0000) GS:ffff9aa90fd80000(0000)
>     knlGS:0000000000000000
>     > [    3.093098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>     > [    3.093098] CR2: 0000000000000000 CR3: 000000004e02e000 CR4:
>     0000000000750ee0
>     > [    3.093098] PKRU: 55555554
>     > [    3.093098] Kernel panic - not syncing: Fatal exception in interrupt
>     >
>     > There are many variations, but the common pattern seems to be
>     > <something in the clock or timer code> -> int3 exception
>     >
>     > It only happens under qemu TCG (software emulation).
>     >
>     > It goes away if we recompile qemu without MTTCG support.
>     >
>     > It only happens with -smp enabled, we are using qemu -smp 4
>     >
>     > We are using qemu-system-x86_64 full system emulation on x86_64 host
>     > (ie. forcing KVM off).
>     >
>     > It happens with the latest upstream kernel and qemu, compiled from
>     > source.
> 
>     I got a bit further on this one.
> 
>     The bug happens in the code that updates the static branch used for at
>     least these two keys:
> 
>       static DEFINE_STATIC_KEY_FALSE(__sched_clock_stable);
>       DEFINE_STATIC_KEY_FALSE(timers_migration_enabled);
> 
>     There are probably others (it seems a generic problem with how static
>     branches are handled by TCG), but I only see the bug for those two.
> 
>     When the static branch is updated we end up calling
>     arch/x86/kernel/alternative.c:text_poke_bp_batch().  It's best to read
>     the description of that function to see where int3 is used:
> 
>       https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/
>     arch/x86/kernel/alternative.c#n2086
> 
>     I modified the qemu TCG int3 helper so it dumps the code at %rip when
>     the interrupt fires, and I can actually see the changes in the above
>     function happen, first int3 being written, then the end of the nop,
>     then the int3 being overwritten with the first byte of the nop.
> 
>     Unfortunately the int3 still fires after the code has been completely
>     rewritten to its final (ie nop) value.
> 
>     This seems to indicate to me that neither the self-write to the kernel
>     text segment, nor sync_core (implemented by a "iret to self" trick)
>     invalidates the qemu TCG translation block containing the old int3
>     helper call.  Thus we (qemu) never "see" the new nop, we keep
>     emulating int3, and that causes the kernel to crash.
> 
>     I added print statements inside tb_invalidate_phys_page() and this
>     function seems never to be called at all.  It's my understanding that
>     at least the kernel writing to its text segment ought to cause
>     tb_invalidate_phys_page() to be called, but I'm not super-familiar
>     with this qemu code.
> 
>     Richard Henderson - do you have any suggestions?
> 
>     Rich.
> 
>     --
>     Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/
>     ~rjones
>     Read my programming and virtualization blog: http://rwmj.wordpress.com
>     libguestfs lets you edit virtual machines.  Supports shell scripting,
>     bindings from many languages.  http://libguestfs.org
> 
> 

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: qemu-x86_64 booting with 8.0.0 stil see int3: when running LTP tracing testing.
       [not found]             ` <CAFXwXrk1FEZPUO-zqNVJZ6YCHKUkgNehwmyDYuOr5fx8ff0OCA@mail.gmail.com>
@ 2023-07-05 16:40               ` Richard W.M. Jones
  2023-07-06  6:13                 ` Richard Henderson
  0 siblings, 1 reply; 15+ messages in thread
From: Richard W.M. Jones @ 2023-07-05 16:40 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Peter Zijlstra, Arnd Bergmann, Naresh Kamboju, Anders Roxell,
	Daniel Díaz, Benjamin Copeland, lkml - Kernel Mailing List,
	the arch/x86 maintainers, Paolo Bonzini

On Wed, Jul 05, 2023 at 06:36:43PM +0200, Richard Henderson wrote:
> No, I thought it would be the fix for 8.0.0.

Thanks for the suggestion anyway.

Am I right in thinking that tb_invalidate_phys_page() ought to be
called when the kernel self-modifies its text segment?  If there's
some function that we expect to be called in this case then I can
instrument it.

Rich.

> r~
> 
> On Wed, 5 July 2023, 18:35 Richard W.M. Jones, <rjones@redhat.com> wrote:
> 
>     On Wed, Jul 05, 2023 at 06:32:30PM +0200, Richard Henderson wrote:
>     > https://gitlab.com/qemu-project/qemu/-/commit/
>     > 3307e08c6f142bb3d2406cfbc0ee19359748b51a
> 
>     I've got that patch in my qemu tree.  Do you suspect it's
>     wrong and want me to try reverting it?
> 
>     Rich.
> 
>     > r~
>     >
>     > On Wed, 5 July 2023, 18:28 Richard W.M. Jones, <rjones@redhat.com> wrote:
>     >
>     >     On Tue, Jul 04, 2023 at 08:46:20AM +0100, Richard W.M. Jones wrote:
>     >     > We have been having the same sort of problem
>     >     > (https://bugzilla.redhat.com/show_bug.cgi?id=2216496).  It's
>     another
>     >     > of those bugs that requires hundreds or thousands of boot
>     iterations
>     >     > before you can see it.  There is a test in comment 27 but it
>     requires
>     >     > guestfish and some hacking to work.  I'll try to come up with a
>     >     > cleaner test later.
>     >     >
>     >     > We see stack traces like:
>     >     >
>     >     > [    3.081939] clocksource: acpi_pm: mask: 0xffffff max_cycles:
>     0xffffff,
>     >     max_idle_ns: 2085701024 ns
>     >     > [    3.082266] clocksource: Switched to clocksource acpi_pm
>     >     > [    3.090201] NET: Registered PF_INET protocol family
>     >     > [    3.093098] int3: 0000 [#1] PREEMPT SMP NOPTI
>     >     > [    3.093098] CPU: 3 PID: 1 Comm: swapper/0 Not tainted
>     >     6.4.0-10173-ga901a3568fd2 #8
>     >     > [    3.093098] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
>     BIOS
>     >     rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
>     >     > [    3.093098] RIP: 0010:__mod_timer+0x1c3/0x370
>     >     > [    3.093098] Code: 00 00 41 bd ff ff ff ff 31 d2 4c 89 f6 4c 89
>     ff e8
>     >     f2 ef ff ff 41 89 c4 85 c0 75 09 83 e3 01 0f 85 54 ff ff ff 41 8b 4f
>     20 66
>     >     <90> f7 c1 00 00 10 00 0f 84 23 01 00 00 48 c7 c3 40 cc 01 00 65 48
>     >     > [    3.093098] RSP: 0018:ffffaf1600013e00 EFLAGS: 00000046
>     >     > [    3.093098] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
>     >     0000000000280003
>     >     > [    3.093098] RDX: 0000000000000000 RSI: ffff9aa90fd9dec0 RDI:
>     >     ffffffff8441e4b8
>     >     > [    3.093098] RBP: 00000000fffc200d R08: ffffffff8441e4a0 R09:
>     >     ffffffff8441e4b8
>     >     > [    3.093098] R10: 0000000000000001 R11: 000000000002e990 R12:
>     >     0000000000000000
>     >     > [    3.093098] R13: 00000000ffffffff R14: ffff9aa90fd9dec0 R15:
>     >     ffffffff8441e4b8
>     >     > [    3.093098] FS:  0000000000000000(0000) GS:ffff9aa90fd80000
>     (0000)
>     >     knlGS:0000000000000000
>     >     > [    3.093098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>     >     > [    3.093098] CR2: 0000000000000000 CR3: 000000004e02e000 CR4:
>     >     0000000000750ee0
>     >     > [    3.093098] PKRU: 55555554
>     >     > [    3.093098] Call Trace:
>     >     > [    3.093098]  <TASK>
>     >     > [    3.093098]  ? die+0x31/0x80
>     >     > [    3.093098]  ? exc_int3+0x10e/0x120
>     >     > [    3.093098]  ? asm_exc_int3+0x39/0x40
>     >     > [    3.093098]  ? __mod_timer+0x1c3/0x370
>     >     > [    3.093098]  ? __mod_timer+0x1c3/0x370
>     >     > [    3.093098]  queue_delayed_work_on+0x23/0x30
>     >     > [    3.093098]  neigh_table_init+0x1bb/0x2e0
>     >     > [    3.093098]  arp_init+0x12/0x50
>     >     > [    3.093098]  inet_init+0x15b/0x2f0
>     >     > [    3.093098]  ? __pfx_inet_init+0x10/0x10
>     >     > [    3.093098]  do_one_initcall+0x58/0x230
>     >     > [    3.093098]  kernel_init_freeable+0x199/0x2d0
>     >     > [    3.093098]  ? __pfx_kernel_init+0x10/0x10
>     >     > [    3.093098]  kernel_init+0x15/0x1b0
>     >     > [    3.093098]  ret_from_fork+0x2c/0x50
>     >     > [    3.093098]  </TASK>
>     >     > [    3.093098] Modules linked in:
>     >     > [    3.093098] ---[ end trace 0000000000000000 ]---
>     >     > [    3.093098] RIP: 0010:__mod_timer+0x1c3/0x370
>     >     > [    3.093098] Code: 00 00 41 bd ff ff ff ff 31 d2 4c 89 f6 4c 89
>     ff e8
>     >     f2 ef ff ff 41 89 c4 85 c0 75 09 83 e3 01 0f 85 54 ff ff ff 41 8b 4f
>     20 66
>     >     <90> f7 c1 00 00 10 00 0f 84 23 01 00 00 48 c7 c3 40 cc 01 00 65 48
>     >     > [    3.093098] RSP: 0018:ffffaf1600013e00 EFLAGS: 00000046
>     >     > [    3.093098] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
>     >     0000000000280003
>     >     > [    3.093098] RDX: 0000000000000000 RSI: ffff9aa90fd9dec0 RDI:
>     >     ffffffff8441e4b8
>     >     > [    3.093098] RBP: 00000000fffc200d R08: ffffffff8441e4a0 R09:
>     >     ffffffff8441e4b8
>     >     > [    3.093098] R10: 0000000000000001 R11: 000000000002e990 R12:
>     >     0000000000000000
>     >     > [    3.093098] R13: 00000000ffffffff R14: ffff9aa90fd9dec0 R15:
>     >     ffffffff8441e4b8
>     >     > [    3.093098] FS:  0000000000000000(0000) GS:ffff9aa90fd80000
>     (0000)
>     >     knlGS:0000000000000000
>     >     > [    3.093098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>     >     > [    3.093098] CR2: 0000000000000000 CR3: 000000004e02e000 CR4:
>     >     0000000000750ee0
>     >     > [    3.093098] PKRU: 55555554
>     >     > [    3.093098] Kernel panic - not syncing: Fatal exception in
>     interrupt
>     >     >
>     >     > There are many variations, but the common pattern seems to be
>     >     > <something in the clock or timer code> -> int3 exception
>     >     >
>     >     > It only happens under qemu TCG (software emulation).
>     >     >
>     >     > It goes away if we recompile qemu without MTTCG support.
>     >     >
>     >     > It only happens with -smp enabled, we are using qemu -smp 4
>     >     >
>     >     > We are using qemu-system-x86_64 full system emulation on x86_64
>     host
>     >     > (ie. forcing KVM off).
>     >     >
>     >     > It happens with the latest upstream kernel and qemu, compiled from
>     >     > source.
>     >
>     >     I got a bit further on this one.
>     >
>     >     The bug happens in the code that updates the static branch used for
>     at
>     >     least these two keys:
>     >
>     >       static DEFINE_STATIC_KEY_FALSE(__sched_clock_stable);
>     >       DEFINE_STATIC_KEY_FALSE(timers_migration_enabled);
>     >
>     >     There are probably others (it seems a generic problem with how static
>     >     branches are handled by TCG), but I only see the bug for those two.
>     >
>     >     When the static branch is updated we end up calling
>     >     arch/x86/kernel/alternative.c:text_poke_bp_batch().  It's best to
>     read
>     >     the description of that function to see where int3 is used:
>     >
>     >       https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
>     tree/
>     >     arch/x86/kernel/alternative.c#n2086
>     >
>     >     I modified the qemu TCG int3 helper so it dumps the code at %rip when
>     >     the interrupt fires, and I can actually see the changes in the above
>     >     function happen, first int3 being written, then the end of the nop,
>     >     then the int3 being overwritten with the first byte of the nop.
>     >
>     >     Unfortunately the int3 still fires after the code has been completely
>     >     rewritten to its final (ie nop) value.
>     >
>     >     This seems to indicate to me that neither the self-write to the
>     kernel
>     >     text segment, nor sync_core (implemented by a "iret to self" trick)
>     >     invalidates the qemu TCG translation block containing the old int3
>     >     helper call.  Thus we (qemu) never "see" the new nop, we keep
>     >     emulating int3, and that causes the kernel to crash.
>     >
>     >     I added print statements inside tb_invalidate_phys_page() and this
>     >     function seems never to be called at all.  It's my understanding that
>     >     at least the kernel writing to its text segment ought to cause
>     >     tb_invalidate_phys_page() to be called, but I'm not super-familiar
>     >     with this qemu code.
>     >
>     >     Richard Henderson - do you have any suggestions?
>     >
>     >     Rich.
>     >
>     >     --
>     >     Richard Jones, Virtualization Group, Red Hat http://people.redhat.com
>     /
>     >     ~rjones
>     >     Read my programming and virtualization blog: http://
>     rwmj.wordpress.com
>     >     libguestfs lets you edit virtual machines.  Supports shell scripting,
>     >     bindings from many languages.  http://libguestfs.org
>     >
>     >
> 
>     --
>     Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/
>     ~rjones
>     Read my programming and virtualization blog: http://rwmj.wordpress.com
>     virt-top is 'top' for virtual machines.  Tiny program with many
>     powerful monitoring features, net stats, disk stats, logging, etc.
>     http://people.redhat.com/~rjones/virt-top
> 
> 

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: qemu-x86_64 booting with 8.0.0 stil see int3: when running LTP tracing testing.
  2023-07-05 16:28       ` Richard W.M. Jones
       [not found]         ` <CAFXwXrmbpuFNf5=nQxiTteo8fpCdAbK4pEAN176Cq0yvwZcfFw@mail.gmail.com>
@ 2023-07-05 21:50         ` Richard W.M. Jones
  2023-07-06  6:30           ` Richard Henderson
  1 sibling, 1 reply; 15+ messages in thread
From: Richard W.M. Jones @ 2023-07-05 21:50 UTC (permalink / raw)
  To: Peter Zijlstra, Richard Henderson
  Cc: Arnd Bergmann, Naresh Kamboju, Anders Roxell, Daniel Díaz,
	Benjamin Copeland, linux-kernel, x86, Paolo Bonzini

On Wed, Jul 05, 2023 at 05:28:30PM +0100, Richard W.M. Jones wrote:
> On Tue, Jul 04, 2023 at 08:46:20AM +0100, Richard W.M. Jones wrote:
> > We have been having the same sort of problem
> > (https://bugzilla.redhat.com/show_bug.cgi?id=2216496).  It's another
> > of those bugs that requires hundreds or thousands of boot iterations
> > before you can see it.  There is a test in comment 27 but it requires
> > guestfish and some hacking to work.  I'll try to come up with a
> > cleaner test later.
> > 
> > We see stack traces like:
> > 
> > [    3.081939] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> > [    3.082266] clocksource: Switched to clocksource acpi_pm
> > [    3.090201] NET: Registered PF_INET protocol family
> > [    3.093098] int3: 0000 [#1] PREEMPT SMP NOPTI
> > [    3.093098] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 6.4.0-10173-ga901a3568fd2 #8
> > [    3.093098] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
> > [    3.093098] RIP: 0010:__mod_timer+0x1c3/0x370
> > [    3.093098] Code: 00 00 41 bd ff ff ff ff 31 d2 4c 89 f6 4c 89 ff e8 f2 ef ff ff 41 89 c4 85 c0 75 09 83 e3 01 0f 85 54 ff ff ff 41 8b 4f 20 66 <90> f7 c1 00 00 10 00 0f 84 23 01 00 00 48 c7 c3 40 cc 01 00 65 48
> > [    3.093098] RSP: 0018:ffffaf1600013e00 EFLAGS: 00000046
> > [    3.093098] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000280003
> > [    3.093098] RDX: 0000000000000000 RSI: ffff9aa90fd9dec0 RDI: ffffffff8441e4b8
> > [    3.093098] RBP: 00000000fffc200d R08: ffffffff8441e4a0 R09: ffffffff8441e4b8
> > [    3.093098] R10: 0000000000000001 R11: 000000000002e990 R12: 0000000000000000
> > [    3.093098] R13: 00000000ffffffff R14: ffff9aa90fd9dec0 R15: ffffffff8441e4b8
> > [    3.093098] FS:  0000000000000000(0000) GS:ffff9aa90fd80000(0000) knlGS:0000000000000000
> > [    3.093098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [    3.093098] CR2: 0000000000000000 CR3: 000000004e02e000 CR4: 0000000000750ee0
> > [    3.093098] PKRU: 55555554
> > [    3.093098] Call Trace:
> > [    3.093098]  <TASK>
> > [    3.093098]  ? die+0x31/0x80
> > [    3.093098]  ? exc_int3+0x10e/0x120
> > [    3.093098]  ? asm_exc_int3+0x39/0x40
> > [    3.093098]  ? __mod_timer+0x1c3/0x370
> > [    3.093098]  ? __mod_timer+0x1c3/0x370
> > [    3.093098]  queue_delayed_work_on+0x23/0x30
> > [    3.093098]  neigh_table_init+0x1bb/0x2e0
> > [    3.093098]  arp_init+0x12/0x50
> > [    3.093098]  inet_init+0x15b/0x2f0
> > [    3.093098]  ? __pfx_inet_init+0x10/0x10
> > [    3.093098]  do_one_initcall+0x58/0x230
> > [    3.093098]  kernel_init_freeable+0x199/0x2d0
> > [    3.093098]  ? __pfx_kernel_init+0x10/0x10
> > [    3.093098]  kernel_init+0x15/0x1b0
> > [    3.093098]  ret_from_fork+0x2c/0x50
> > [    3.093098]  </TASK>
> > [    3.093098] Modules linked in:
> > [    3.093098] ---[ end trace 0000000000000000 ]---
> > [    3.093098] RIP: 0010:__mod_timer+0x1c3/0x370
> > [    3.093098] Code: 00 00 41 bd ff ff ff ff 31 d2 4c 89 f6 4c 89 ff e8 f2 ef ff ff 41 89 c4 85 c0 75 09 83 e3 01 0f 85 54 ff ff ff 41 8b 4f 20 66 <90> f7 c1 00 00 10 00 0f 84 23 01 00 00 48 c7 c3 40 cc 01 00 65 48
> > [    3.093098] RSP: 0018:ffffaf1600013e00 EFLAGS: 00000046
> > [    3.093098] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000280003
> > [    3.093098] RDX: 0000000000000000 RSI: ffff9aa90fd9dec0 RDI: ffffffff8441e4b8
> > [    3.093098] RBP: 00000000fffc200d R08: ffffffff8441e4a0 R09: ffffffff8441e4b8
> > [    3.093098] R10: 0000000000000001 R11: 000000000002e990 R12: 0000000000000000
> > [    3.093098] R13: 00000000ffffffff R14: ffff9aa90fd9dec0 R15: ffffffff8441e4b8
> > [    3.093098] FS:  0000000000000000(0000) GS:ffff9aa90fd80000(0000) knlGS:0000000000000000
> > [    3.093098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [    3.093098] CR2: 0000000000000000 CR3: 000000004e02e000 CR4: 0000000000750ee0
> > [    3.093098] PKRU: 55555554
> > [    3.093098] Kernel panic - not syncing: Fatal exception in interrupt
> > 
> > There are many variations, but the common pattern seems to be
> > <something in the clock or timer code> -> int3 exception
> > 
> > It only happens under qemu TCG (software emulation).
> > 
> > It goes away if we recompile qemu without MTTCG support.
> > 
> > It only happens with -smp enabled, we are using qemu -smp 4
> > 
> > We are using qemu-system-x86_64 full system emulation on x86_64 host
> > (ie. forcing KVM off).
> > 
> > It happens with the latest upstream kernel and qemu, compiled from
> > source.
> 
> I got a bit further on this one.
> 
> The bug happens in the code that updates the static branch used for at
> least these two keys:
> 
>   static DEFINE_STATIC_KEY_FALSE(__sched_clock_stable);
>   DEFINE_STATIC_KEY_FALSE(timers_migration_enabled);
> 
> There are probably others (it seems a generic problem with how static
> branches are handled by TCG), but I only see the bug for those two.
> 
> When the static branch is updated we end up calling
> arch/x86/kernel/alternative.c:text_poke_bp_batch().  It's best to read
> the description of that function to see where int3 is used:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/alternative.c#n2086
> 
> I modified the qemu TCG int3 helper so it dumps the code at %rip when
> the interrupt fires, and I can actually see the changes in the above
> function happen, first int3 being written, then the end of the nop,
> then the int3 being overwritten with the first byte of the nop.
> 
> Unfortunately the int3 still fires after the code has been completely
> rewritten to its final (ie nop) value.
> 
> This seems to indicate to me that neither the self-write to the kernel
> text segment, nor sync_core (implemented by a "iret to self" trick)
> invalidates the qemu TCG translation block containing the old int3
> helper call.  Thus we (qemu) never "see" the new nop, we keep
> emulating int3, and that causes the kernel to crash.

The following paragraph is wrong:

> I added print statements inside tb_invalidate_phys_page() and this
> function seems never to be called at all.  It's my understanding that
> at least the kernel writing to its text segment ought to cause
> tb_invalidate_phys_page() to be called, but I'm not super-familiar
> with this qemu code.

tb_invalidate_phys_range_fast() *is* called, and we end up calling
  tb_invalidate_phys_page_range__locked ->
    tb_phys_invalidate__locked ->
      do_tb_phys_invalidate

Nevertheless the old TB (containing the call to the int3 helper) is
still called after the code has been replaced with a NOP.

Of course there are 4 MTTCG threads so maybe another thread is in the
middle of executing the same TB when it gets invalidated.
tb_invalidate_phys_page_range__locked goes to some effort to check if
the current TB is being invalidated and restart the TB, but as far as
I can see the test can only work for the current core, and won't
restart the TB on other cores.

Is there any way to get the current TranslationBlock* from a helper?
It would be useful for additional debugging, I couldn't see how to do it.

The next thing I'll look at is the kernel sync_core() function and
whether TCG does the right thing, whatever that would be.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
nbdkit - Flexible, fast NBD server with plugins
https://gitlab.com/nbdkit/nbdkit


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: qemu-x86_64 booting with 8.0.0 stil see int3: when running LTP tracing testing.
  2023-07-05 16:40               ` Richard W.M. Jones
@ 2023-07-06  6:13                 ` Richard Henderson
  0 siblings, 0 replies; 15+ messages in thread
From: Richard Henderson @ 2023-07-06  6:13 UTC (permalink / raw)
  To: Richard W.M. Jones
  Cc: Peter Zijlstra, Arnd Bergmann, Naresh Kamboju, Anders Roxell,
	Daniel Díaz, Benjamin Copeland, lkml - Kernel Mailing List,
	the arch/x86 maintainers, Paolo Bonzini

On 7/5/23 17:40, Richard W.M. Jones wrote:
> On Wed, Jul 05, 2023 at 06:36:43PM +0200, Richard Henderson wrote:
>> No, I thought it would be the fix for 8.0.0.
> 
> Thanks for the suggestion anyway.
> 
> Am I right in thinking that tb_invalidate_phys_page() ought to be
> called when the kernel self-modifies its text segment?  If there's
> some function that we expect to be called in this case then I can
> instrument it.

All such go through notdirty_write(), which has a convenient tracepoint.  Recall there are 
migration and vga dirty pages as well, which also go through here, so disabling vga if 
possible may help.

Code stuff should then go through tb_invalidate_phys_range_fast(), so not 
tb_invalidate_phys_page() at all.


r~


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: qemu-x86_64 booting with 8.0.0 stil see int3: when running LTP tracing testing.
  2023-07-05 21:50         ` Richard W.M. Jones
@ 2023-07-06  6:30           ` Richard Henderson
  2023-07-06 10:28             ` Richard W.M. Jones
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Henderson @ 2023-07-06  6:30 UTC (permalink / raw)
  To: Richard W.M. Jones, Peter Zijlstra
  Cc: Arnd Bergmann, Naresh Kamboju, Anders Roxell, Daniel Díaz,
	Benjamin Copeland, linux-kernel, x86, Paolo Bonzini

On 7/5/23 22:50, Richard W.M. Jones wrote:
> tb_invalidate_phys_range_fast() *is* called, and we end up calling
>    tb_invalidate_phys_page_range__locked ->
>      tb_phys_invalidate__locked ->
>        do_tb_phys_invalidate
> 
> Nevertheless the old TB (containing the call to the int3 helper) is
> still called after the code has been replaced with a NOP.
> 
> Of course there are 4 MTTCG threads so maybe another thread is in the
> middle of executing the same TB when it gets invalidated.

Yes.

> tb_invalidate_phys_page_range__locked goes to some effort to check if
> the current TB is being invalidated and restart the TB, but as far as
> I can see the test can only work for the current core, and won't
> restart the TB on other cores.

Yes.

The assumption with any of these sorts of races is that it is "as if" the other thread has 
already passed the location of the write within that block.  But by the time this thread 
has finished do_tb_phys_invalidate, any other thread cannot execute the same block *again*.

There's a race here, and now that I think about it, there's been mail about it in the past:

https://lore.kernel.org/qemu-devel/cebad06c-48f2-6dbd-6d7f-3a3cf5aebbe3@linaro.org/

We take care of the same race for user-only in translator_access, by ensuring that each 
translated page is read-only *before* performing the read for translation.  But for system 
mode we grab the page locks *after* the reads.  Which means there's a race.

The email above describes the race pretty clearly, with a new TB being generated before 
the write is even complete.

It'll be non-trivial fixing this, because not only do we need to grab the lock earlier, 
there are ordering issues for a TB that spans two pages, in that one must grab the two 
locks in the correct order lest we deadlock.

r~

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: qemu-x86_64 booting with 8.0.0 stil see int3: when running LTP tracing testing.
  2023-07-06  6:30           ` Richard Henderson
@ 2023-07-06 10:28             ` Richard W.M. Jones
  2023-08-08  5:27               ` John Stultz
  0 siblings, 1 reply; 15+ messages in thread
From: Richard W.M. Jones @ 2023-07-06 10:28 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Peter Zijlstra, Arnd Bergmann, Naresh Kamboju, Anders Roxell,
	Daniel Díaz, Benjamin Copeland, linux-kernel, x86,
	Paolo Bonzini

On Thu, Jul 06, 2023 at 07:30:50AM +0100, Richard Henderson wrote:
> On 7/5/23 22:50, Richard W.M. Jones wrote:
> >tb_invalidate_phys_range_fast() *is* called, and we end up calling
> >   tb_invalidate_phys_page_range__locked ->
> >     tb_phys_invalidate__locked ->
> >       do_tb_phys_invalidate
> >
> >Nevertheless the old TB (containing the call to the int3 helper) is
> >still called after the code has been replaced with a NOP.
> >
> >Of course there are 4 MTTCG threads so maybe another thread is in the
> >middle of executing the same TB when it gets invalidated.
> 
> Yes.
> 
> >tb_invalidate_phys_page_range__locked goes to some effort to check if
> >the current TB is being invalidated and restart the TB, but as far as
> >I can see the test can only work for the current core, and won't
> >restart the TB on other cores.
> 
> Yes.
> 
> The assumption with any of these sorts of races is that it is "as
> if" the other thread has already passed the location of the write
> within that block.  But by the time this thread has finished
> do_tb_phys_invalidate, any other thread cannot execute the same
> block *again*.
> 
> There's a race here, and now that I think about it, there's been mail about it in the past:
> 
> https://lore.kernel.org/qemu-devel/cebad06c-48f2-6dbd-6d7f-3a3cf5aebbe3@linaro.org/
> 
> We take care of the same race for user-only in translator_access, by
> ensuring that each translated page is read-only *before* performing
> the read for translation.  But for system mode we grab the page
> locks *after* the reads.  Which means there's a race.
> 
> The email above describes the race pretty clearly, with a new TB
> being generated before the write is even complete.
> 
> It'll be non-trivial fixing this, because not only do we need to
> grab the lock earlier, there are ordering issues for a TB that spans
> two pages, in that one must grab the two locks in the correct order
> lest we deadlock.

Yes I can see how this is hard to fix.  Even if we just lock the page
containing the first instruction (which we know) before doing
translation, we still have a problem when entering tb_link_page()
where we would need to only lock the second page, which might cause
ordering issues.

How about a new per-page lock, which would be grabbed by
do_tb_phys_invalidate() and tb_gen_code(), just on the first
instruction?  It would mean, I think, that no page can be having TBs
invalidated and generated at the same time.

Or something like scanning the bytes as they are being translated,
generate a secure-ish checksum, then recheck it after translation and
discard the TB if the code changed.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: qemu-x86_64 booting with 8.0.0 stil see int3: when running LTP tracing testing.
  2023-07-06 10:28             ` Richard W.M. Jones
@ 2023-08-08  5:27               ` John Stultz
  2023-08-08  7:28                 ` Richard W.M. Jones
  0 siblings, 1 reply; 15+ messages in thread
From: John Stultz @ 2023-08-08  5:27 UTC (permalink / raw)
  To: Richard W.M. Jones
  Cc: Richard Henderson, Peter Zijlstra, Arnd Bergmann, Naresh Kamboju,
	Anders Roxell, Daniel Díaz, Benjamin Copeland, linux-kernel,
	x86, Paolo Bonzini

On Thu, Jul 6, 2023 at 3:28 AM Richard W.M. Jones <rjones@redhat.com> wrote:
>
> On Thu, Jul 06, 2023 at 07:30:50AM +0100, Richard Henderson wrote:
> > On 7/5/23 22:50, Richard W.M. Jones wrote:
> > >tb_invalidate_phys_range_fast() *is* called, and we end up calling
> > >   tb_invalidate_phys_page_range__locked ->
> > >     tb_phys_invalidate__locked ->
> > >       do_tb_phys_invalidate
> > >
> > >Nevertheless the old TB (containing the call to the int3 helper) is
> > >still called after the code has been replaced with a NOP.
> > >
> > >Of course there are 4 MTTCG threads so maybe another thread is in the
> > >middle of executing the same TB when it gets invalidated.
> >
> > Yes.
> >
> > >tb_invalidate_phys_page_range__locked goes to some effort to check if
> > >the current TB is being invalidated and restart the TB, but as far as
> > >I can see the test can only work for the current core, and won't
> > >restart the TB on other cores.
> >
> > Yes.
> >
> > The assumption with any of these sorts of races is that it is "as
> > if" the other thread has already passed the location of the write
> > within that block.  But by the time this thread has finished
> > do_tb_phys_invalidate, any other thread cannot execute the same
> > block *again*.
> >
> > There's a race here, and now that I think about it, there's been mail about it in the past:
> >
> > https://lore.kernel.org/qemu-devel/cebad06c-48f2-6dbd-6d7f-3a3cf5aebbe3@linaro.org/
> >
> > We take care of the same race for user-only in translator_access, by
> > ensuring that each translated page is read-only *before* performing
> > the read for translation.  But for system mode we grab the page
> > locks *after* the reads.  Which means there's a race.
> >
> > The email above describes the race pretty clearly, with a new TB
> > being generated before the write is even complete.
> >
> > It'll be non-trivial fixing this, because not only do we need to
> > grab the lock earlier, there are ordering issues for a TB that spans
> > two pages, in that one must grab the two locks in the correct order
> > lest we deadlock.
>
> Yes I can see how this is hard to fix.  Even if we just lock the page
> containing the first instruction (which we know) before doing
> translation, we still have a problem when entering tb_link_page()
> where we would need to only lock the second page, which might cause
> ordering issues.
>
> How about a new per-page lock, which would be grabbed by
> do_tb_phys_invalidate() and tb_gen_code(), just on the first
> instruction?  It would mean, I think, that no page can be having TBs
> invalidated and generated at the same time.
>
> Or something like scanning the bytes as they are being translated,
> generate a secure-ish checksum, then recheck it after translation and
> discard the TB if the code changed.

Hey all,
  So I've occasionally tripped over something similar in my stress
testing of the proxy-execution patch series.

This is only triggered by booting w/ qemu (7.2.2) without kvm. It's
rare but seems to trip in bursts.

The errors usually look something like:

[   21.264283] clocksource: Switched to clocksource hpet
[   21.273213] hrtimer: interrupt took 2400410 ns
[   21.369331] VFS: Disk quotas dquot_6.6.0
[   21.375425] int3: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC NOPTI
[   21.375453] CPU: 12 PID: 0 Comm: swapper/12 Not tainted
6.4.0-rc5-00032-g8d3f70560882-dirty #1712
[   21.375453] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[   21.375453] RIP: 0010:hrtimer_start_range_ns+0x1ab/0x3d0
[   21.375453] Code: 0f 85 70 ff ff ff 44 8b 7c 24 18 4c 63 65 08 4c
89 f3 e8 c8 86 0c 01 89 c0 41 83 e7 02 48 03 1c c5 40 0c ac b1 48 89
1c 24 0f <1f> 44 00 00 45 85 ff 0f 84 e3 01 00 00 4d 8d 7c 24 01 49 c1
e4 07
[   21.375453] RSP: 0018:ffffbf4b000f7e60 EFLAGS: 00000086
[   21.375453] RAX: 000000000000000c RBX: ffff984ff85239c0 RCX: 00000004ef14ffc0
[   21.375453] RDX: 7fffffffffffffff RSI: ffffffffb1973381 RDI: ffffffffb1976d48
[   21.375453] RBP: ffff984ff8523a40 R08: 0000000000000001 R09: 0000000000000000
[   21.375453] R10: 00000000e8003900 R11: 0000000005673366 R12: 0000000000000000
[   21.375453] R13: ffff984ff85242f0 R14: 00000000000239c0 R15: 0000000000000002
[   21.375453] FS:  0000000000000000(0000) GS:ffff984ff8500000(0000)
knlGS:0000000000000000
[   21.375453] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   21.375453] CR2: 0000000000000000 CR3: 0000000045e62000 CR4: 00000000000006e0
[   21.375453] Call Trace:
[   21.375453]  <TASK>
[   21.375453]  ? die+0x2d/0x80
[   21.375453]  ? exc_int3+0xf3/0x100
[   21.375453]  ? asm_exc_int3+0x35/0x40
[   21.375453]  ? hrtimer_start_range_ns+0x1ab/0x3d0
[   21.375453]  ? hrtimer_start_range_ns+0x1ab/0x3d0
[   21.375453]  tick_nohz_restart+0x72/0x90
[   21.375453]  tick_nohz_idle_exit+0xc7/0xf0
[   21.375453]  do_idle+0x160/0x220
[   21.375453]  cpu_startup_entry+0x14/0x20
[   21.375453]  start_secondary+0xf5/0x100
[   21.375453]  secondary_startup_64_no_verify+0x10b/0x10b
[   21.375453]  </TASK>


Curious if anyone has found a workaround?

thanks
-john

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: qemu-x86_64 booting with 8.0.0 stil see int3: when running LTP tracing testing.
  2023-08-08  5:27               ` John Stultz
@ 2023-08-08  7:28                 ` Richard W.M. Jones
  2025-10-25 19:48                   ` Gregory Price
  0 siblings, 1 reply; 15+ messages in thread
From: Richard W.M. Jones @ 2023-08-08  7:28 UTC (permalink / raw)
  To: John Stultz
  Cc: Richard Henderson, Peter Zijlstra, Arnd Bergmann, Naresh Kamboju,
	Anders Roxell, Daniel Díaz, Benjamin Copeland, linux-kernel,
	x86, Paolo Bonzini

On Mon, Aug 07, 2023 at 10:27:13PM -0700, John Stultz wrote:
> On Thu, Jul 6, 2023 at 3:28 AM Richard W.M. Jones <rjones@redhat.com> wrote:
> >
> > On Thu, Jul 06, 2023 at 07:30:50AM +0100, Richard Henderson wrote:
> > > On 7/5/23 22:50, Richard W.M. Jones wrote:
> > > >tb_invalidate_phys_range_fast() *is* called, and we end up calling
> > > >   tb_invalidate_phys_page_range__locked ->
> > > >     tb_phys_invalidate__locked ->
> > > >       do_tb_phys_invalidate
> > > >
> > > >Nevertheless the old TB (containing the call to the int3 helper) is
> > > >still called after the code has been replaced with a NOP.
> > > >
> > > >Of course there are 4 MTTCG threads so maybe another thread is in the
> > > >middle of executing the same TB when it gets invalidated.
> > >
> > > Yes.
> > >
> > > >tb_invalidate_phys_page_range__locked goes to some effort to check if
> > > >the current TB is being invalidated and restart the TB, but as far as
> > > >I can see the test can only work for the current core, and won't
> > > >restart the TB on other cores.
> > >
> > > Yes.
> > >
> > > The assumption with any of these sorts of races is that it is "as
> > > if" the other thread has already passed the location of the write
> > > within that block.  But by the time this thread has finished
> > > do_tb_phys_invalidate, any other thread cannot execute the same
> > > block *again*.
> > >
> > > There's a race here, and now that I think about it, there's been mail about it in the past:
> > >
> > > https://lore.kernel.org/qemu-devel/cebad06c-48f2-6dbd-6d7f-3a3cf5aebbe3@linaro.org/
> > >
> > > We take care of the same race for user-only in translator_access, by
> > > ensuring that each translated page is read-only *before* performing
> > > the read for translation.  But for system mode we grab the page
> > > locks *after* the reads.  Which means there's a race.
> > >
> > > The email above describes the race pretty clearly, with a new TB
> > > being generated before the write is even complete.
> > >
> > > It'll be non-trivial fixing this, because not only do we need to
> > > grab the lock earlier, there are ordering issues for a TB that spans
> > > two pages, in that one must grab the two locks in the correct order
> > > lest we deadlock.
> >
> > Yes I can see how this is hard to fix.  Even if we just lock the page
> > containing the first instruction (which we know) before doing
> > translation, we still have a problem when entering tb_link_page()
> > where we would need to only lock the second page, which might cause
> > ordering issues.
> >
> > How about a new per-page lock, which would be grabbed by
> > do_tb_phys_invalidate() and tb_gen_code(), just on the first
> > instruction?  It would mean, I think, that no page can be having TBs
> > invalidated and generated at the same time.
> >
> > Or something like scanning the bytes as they are being translated,
> > generate a secure-ish checksum, then recheck it after translation and
> > discard the TB if the code changed.
> 
> Hey all,
>   So I've occasionally tripped over something similar in my stress
> testing of the proxy-execution patch series.
> 
> This is only triggered by booting w/ qemu (7.2.2) without kvm. It's
> rare but seems to trip in bursts.
> 
> The errors usually look something like:
> 
> [   21.264283] clocksource: Switched to clocksource hpet
> [   21.273213] hrtimer: interrupt took 2400410 ns
> [   21.369331] VFS: Disk quotas dquot_6.6.0
> [   21.375425] int3: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC NOPTI
> [   21.375453] CPU: 12 PID: 0 Comm: swapper/12 Not tainted
> 6.4.0-rc5-00032-g8d3f70560882-dirty #1712
> [   21.375453] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [   21.375453] RIP: 0010:hrtimer_start_range_ns+0x1ab/0x3d0
> [   21.375453] Code: 0f 85 70 ff ff ff 44 8b 7c 24 18 4c 63 65 08 4c
> 89 f3 e8 c8 86 0c 01 89 c0 41 83 e7 02 48 03 1c c5 40 0c ac b1 48 89
> 1c 24 0f <1f> 44 00 00 45 85 ff 0f 84 e3 01 00 00 4d 8d 7c 24 01 49 c1
> e4 07
> [   21.375453] RSP: 0018:ffffbf4b000f7e60 EFLAGS: 00000086
> [   21.375453] RAX: 000000000000000c RBX: ffff984ff85239c0 RCX: 00000004ef14ffc0
> [   21.375453] RDX: 7fffffffffffffff RSI: ffffffffb1973381 RDI: ffffffffb1976d48
> [   21.375453] RBP: ffff984ff8523a40 R08: 0000000000000001 R09: 0000000000000000
> [   21.375453] R10: 00000000e8003900 R11: 0000000005673366 R12: 0000000000000000
> [   21.375453] R13: ffff984ff85242f0 R14: 00000000000239c0 R15: 0000000000000002
> [   21.375453] FS:  0000000000000000(0000) GS:ffff984ff8500000(0000)
> knlGS:0000000000000000
> [   21.375453] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   21.375453] CR2: 0000000000000000 CR3: 0000000045e62000 CR4: 00000000000006e0
> [   21.375453] Call Trace:
> [   21.375453]  <TASK>
> [   21.375453]  ? die+0x2d/0x80
> [   21.375453]  ? exc_int3+0xf3/0x100
> [   21.375453]  ? asm_exc_int3+0x35/0x40
> [   21.375453]  ? hrtimer_start_range_ns+0x1ab/0x3d0
> [   21.375453]  ? hrtimer_start_range_ns+0x1ab/0x3d0
> [   21.375453]  tick_nohz_restart+0x72/0x90
> [   21.375453]  tick_nohz_idle_exit+0xc7/0xf0
> [   21.375453]  do_idle+0x160/0x220
> [   21.375453]  cpu_startup_entry+0x14/0x20
> [   21.375453]  start_secondary+0xf5/0x100
> [   21.375453]  secondary_startup_64_no_verify+0x10b/0x10b
> [   21.375453]  </TASK>
> 
> 
> Curious if anyone has found a workaround?

Yes, it should be fixed upstream.  You will need these two commits:

commit deba78709ae8ce103e2248413857747f804cd1ef
Author: Richard Henderson <richard.henderson@linaro.org>
Date:   Thu Jul 6 17:55:48 2023 +0100

    accel/tcg: Always lock pages before translation

commit cb62bd15e14e304617d250158b77d0deb032f032
Author: Richard Henderson <richard.henderson@linaro.org>
Date:   Thu Jul 6 08:45:13 2023 +0100

    accel/tcg: Split out cpu_exec_longjmp_cleanup

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: qemu-x86_64 booting with 8.0.0 stil see int3: when running LTP tracing testing.
  2023-08-08  7:28                 ` Richard W.M. Jones
@ 2025-10-25 19:48                   ` Gregory Price
  0 siblings, 0 replies; 15+ messages in thread
From: Gregory Price @ 2025-10-25 19:48 UTC (permalink / raw)
  To: Richard W.M. Jones
  Cc: John Stultz, Richard Henderson, Peter Zijlstra, Arnd Bergmann,
	Naresh Kamboju, Anders Roxell, Daniel Díaz,
	Benjamin Copeland, linux-kernel, x86, Paolo Bonzini

On Tue, Aug 08, 2023 at 08:28:35AM +0100, Richard W.M. Jones wrote:
> > [   21.375453] Call Trace:
> > [   21.375453]  <TASK>
> > [   21.375453]  ? die+0x2d/0x80
> > [   21.375453]  ? exc_int3+0xf3/0x100
> > [   21.375453]  ? asm_exc_int3+0x35/0x40
> > [   21.375453]  ? hrtimer_start_range_ns+0x1ab/0x3d0
> > [   21.375453]  ? hrtimer_start_range_ns+0x1ab/0x3d0
--- >8
> 
> Yes, it should be fixed upstream.  You will need these two commits:
> 
> commit deba78709ae8ce103e2248413857747f804cd1ef
> Author: Richard Henderson <richard.henderson@linaro.org>
> Date:   Thu Jul 6 17:55:48 2023 +0100
> 
>     accel/tcg: Always lock pages before translation
> 

Apologies for reviving an ancient thread, but I believe there is another
corner case for this bug - and it's an extremely narrow race condition.

Running in QEMU pc-q35-9.2 - so w/ all the fixes from this thread.

We first noticed crashes in poke_int3_handler() stemming from stacks
that look like so:

    __kmalloc_noprof+0x7e
    __kmalloc_cache_noprof+0x34
    __kmalloc_node_noprof+0x98
    kmem_cache_alloc_lru_noprof+0x37
    kmem_cache_alloc_noprof+0x3d
    ... etc ...

Which lead us to static_branch code and subsequent this thread.

What we're seeing is QEMU/KVM are injecting the int3 exception with
the int3 IP address of the instruction following the 0xCC... and the
int3 has been removed.

   > address_of_int3 + 5

poke() code then Oops because it can't find the int3.

We can't find *why* the exception RIP has been incremented past the int3
instruction - nothing in the KVM or QEMU emulation code immediately
suggest an obvious bug.  

We spent a good amount of time inspecting the fixes in this thread,
as well as this thread:
https://lore.kernel.org/all/20220423021411.784383-6-seanjc@google.com/
And this thread:
https://lore.kernel.org/all/20250611113001.GC2273038@noisy.programming.kicks-ass.net/

We tried building a reproducer (guest_repro.c below) that simply
hammers on static_branch in the guest.  This did not work by itself.

We went off to test/validate other things:

1) our guests are not configured to VMExit on int3
   svm.vmcb.control.intercepts[prog["INTERCEPT_EXCEPTION"]] & (1 << 3)
   (u32)0

2) svm_inject_exception DOES inject int3's, and we're fairly certain
   this injection is what is causing the wrong RIP.

3) We attempted to get KVM/QEMU to emulate the int3's by executing
   other instructions prior to the int3 that we know would cause VMExits
   (cpuid, inb/outb), but this never caused a reproduction.

4) We finally traced the int3 injection to a VM exit for a nested page
   fault.  After the fault handling there's also an int3 reported in
   EXITINTINFO - this causes an int3 injection.

That lead me to think this might be the result of swap or numa
balancing causing a guest page to become unmapped in the host, where
the next instruction is an int3, which would cause a nested fault.

I wrote a script on the host to migrate a guest's memory to/from a
local/remote node, and finally got a reproduction (stack below).

KVM doesn't support emulating INT3 in protected mode, which means
that the emulated INT3 and subsequent issue must happen in QEMU...
and that is where I am stuck.

So ultimately, this bug looks a lot like the one discussed in this
thread:  An int3 from a static_branch is modified on one thread as
another thread tries to execute that int3 - except now we have a
Nested-Page-Fault that produces the subsequent race.

Any thoughts on where we might look in QEMU or KVM to make further
progress would be helpful - but I figured where we've gotten to might be
of interest to the folks who originally fixed this static_branch race.

I appreciate any cycles you might spare to help,
~Gregory

--- reproduction stack
 <TASK>
 ? __die+0x77/0xc0
 ? die+0x2b/0x50
 ? exc_int3+0x41/0x70
 ? asm_exc_int3+0x35/0x40
 ? cleanup_module+0x80/0x80 [sbint3]
 ? looper+0x29/0x80 [sbint3]
 ? looper+0x29/0x80 [sbint3]
 ? looper+0x1e/0x80 [sbint3]
 kthread+0xb1/0xe0
 ? __kthread_parkme+0x70/0x70
 ret_from_fork+0x30/0x40
 ? __kthread_parkme+0x70/0x70
 ret_from_fork_asm+0x11/0x20
 </TASK>

-------- guest_repro.c   (heavily truncated for brevity)

DEFINE_STATIC_KEY_FALSE(int3_branch_key); // initially disabled                                                                                   

/* 180 threads hammering on this */
static int looper(void *data)                                                                                                                     
{                                                                                                                                                 
    while (!kthread_should_stop()) {                                                                                                          
        schedule();                                
        /* nops produce 5-byte jump instr */
        if (static_branch_likely(&int3_branch_key))                                                                                  
            __asm__ volatile (".rept 40\n nop\n .endr\n");
        }
}
/* 1 thread hammering on this */
static int toggler(void *data)
{
    while (!kthread_should_stop()) {
        schedule();
        static_branch_enable(&int3_branch_key);
        static_branch_disable(&int3_branch_key);
    }
}

--- host_mover.c

/* 
 * basically this back and forth between node0<->node1
 * 1GB at a time for each chunk in /proc/<pid>/maps
 */
int ret = move_pages(qemu_pid, chunk_size, pages, nodes, status, 0);

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2025-10-25 19:48 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <CA+G9fYsETJQm0Ue7hGsb+nbsiMikwycOV3V0DPr6WC2r61KRBQ@mail.gmail.com>
2023-06-21 15:31 ` qemu-x86_64 booting with 8.0.0 stil see int3: when running LTP tracing testing Arnd Bergmann
2023-06-21 16:06   ` Peter Zijlstra
2023-07-04  7:46     ` Richard W.M. Jones
2023-07-04 13:21       ` Richard W.M. Jones
2023-07-05 16:28       ` Richard W.M. Jones
     [not found]         ` <CAFXwXrmbpuFNf5=nQxiTteo8fpCdAbK4pEAN176Cq0yvwZcfFw@mail.gmail.com>
2023-07-05 16:35           ` Richard W.M. Jones
     [not found]             ` <CAFXwXrk1FEZPUO-zqNVJZ6YCHKUkgNehwmyDYuOr5fx8ff0OCA@mail.gmail.com>
2023-07-05 16:40               ` Richard W.M. Jones
2023-07-06  6:13                 ` Richard Henderson
2023-07-05 16:37           ` Richard W.M. Jones
2023-07-05 21:50         ` Richard W.M. Jones
2023-07-06  6:30           ` Richard Henderson
2023-07-06 10:28             ` Richard W.M. Jones
2023-08-08  5:27               ` John Stultz
2023-08-08  7:28                 ` Richard W.M. Jones
2025-10-25 19:48                   ` Gregory Price

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox