* CPU Lockups in KVM with deferred hrtimer rearming
@ 2026-04-16 20:50 Verma, Vishal L
2026-04-20 15:00 ` Thomas Gleixner
0 siblings, 1 reply; 28+ messages in thread
From: Verma, Vishal L @ 2026-04-16 20:50 UTC (permalink / raw)
To: peterz@infradead.org, tglx@kernel.org
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
x86@kernel.org
Hi Peter,
We noticed a KVM unit test failure ('x2apic' - APIC LVT timer one shot),
and also some TDX-specific tests running into hard lockups on multiple
CPUs on a 192-CPU Emerald Rapids system, and we traced both to the
hrtimers deferred rearming merge.
Making CONFIG_HRTIMER_REARM_DEFERRED default to n in Kconfig made both
pass.
This is the hard lockup splat:
watchdog: CPU98: Watchdog detected hard LOCKUP on cpu 98
Modules linked in: openvswitch nsh tls ipt_REJECT iptable_mangle iptable_nat iptable_filter ip_tables bridge stp llc kvm_intel kvm irqbypass sunrpc
irq event stamp: 34998
hardirqs last enabled at (34997): [<ffffffffc090ce6d>] tdx_vcpu_run+0x5d/0x350 [kvm_intel]
hardirqs last disabled at (34998): [<ffffffffb9add6df>] exc_nmi+0xaf/0x1a0
softirqs last enabled at (34404): [<ffffffffb83fdd93>] __irq_exit_rcu+0xe3/0x160
softirqs last disabled at (34395): [<ffffffffb83fdd93>] __irq_exit_rcu+0xe3/0x160
CPU: 98 UID: 0 PID: 54785 Comm: qemu-system-x86 Not tainted 7.0.0-g10324ed6a556 #1 PREEMPT(full)
Hardware name: HPE ProLiant DL380 Gen11/ProLiant DL380 Gen11, BIOS 2.48 03/11/2025
RIP: 0010:vmx_do_nmi_irqoff+0x13/0x20 [kvm_intel]
Code: ff ff 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 48 83 e4 f0 6a 18 55 9c 6a 10 e8 3d db 6e f7 <c9> c3 cc cc cc cc 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90
RSP: 0018:ff8d3a069bdf3af0 EFLAGS: 00000086
RAX: ff3cc96963d68000 RBX: ff3cc96963d68000 RCX: 4000000200000000
RDX: 0000000080000200 RSI: ff3cc96963d699d0 RDI: ff3cc96963d68000
RBP: ff8d3a069bdf3af0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ff3cc968d03d0000 R14: ff3cc968d03d0000 R15: 0000000000000000
FS: 00007f26ab7fe6c0(0000) GS:ff3cc98782d76000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000001544af004 CR4: 0000000000f73ef0
PKRU: 00000000
Call Trace:
<TASK>
vmx_handle_nmi+0xdf/0x140 [kvm_intel]
tdx_vcpu_enter_exit+0xd5/0x300 [kvm_intel]
tdx_vcpu_run+0x5d/0x350 [kvm_intel]
vcpu_run+0xd4a/0x1800 [kvm]
? __local_bh_enable_ip+0x7b/0xf0
? kvm_arch_vcpu_ioctl_run+0x38b/0x5f0 [kvm]
? kvm_arch_vcpu_ioctl_run+0xb9/0x5f0 [kvm]
kvm_arch_vcpu_ioctl_run+0x38b/0x5f0 [kvm]
kvm_vcpu_ioctl+0x2ef/0xb00 [kvm]
? __fget_files+0x2b/0x190
? find_held_lock+0x2b/0x80
__x64_sys_ioctl+0x97/0xe0
do_syscall_64+0xf4/0x1540
? __x64_sys_ioctl+0xb1/0xe0
? trace_hardirqs_on_prepare+0xd2/0xf0
? do_syscall_64+0x225/0x1540
? trace_hardirqs_on+0x18/0x100
? __local_bh_enable_ip+0x7b/0xf0
? arch_do_signal_or_restart+0x155/0x250
? trace_hardirqs_off+0x4e/0xf0
? exit_to_user_mode_loop+0x150/0x4e0
? trace_hardirqs_on_prepare+0xd2/0xf0
? do_syscall_64+0x225/0x1540
? do_user_addr_fault+0x36c/0x6b0
? lockdep_hardirqs_on_prepare+0xdb/0x190
? trace_hardirqs_on+0x18/0x100
? do_syscall_64+0xab/0x1540
? exc_page_fault+0x12c/0x2b0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f45f7ae00ed
Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
RSP: 002b:00007f26ab7f3e70 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f26ab7fe6c0 RCX: 00007f45f7ae00ed
RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000099
RBP: 00007f26ab7f3ec0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f26ab7fe6c0
R13: 00007ffdc7adecd0 R14: 00007f26ab7fecdc R15: 00007ffdc7adedd7
</TASK>
I tried out an AI-assisted patch (below) which does happen to solve
it, but I'm not familiar with this area, and not sure if this is the
right fix.
---
diff --git a/include/linux/entry-virt.h b/include/linux/entry-virt.h
index bfa767702d9a..c4856c252412 100644
--- a/include/linux/entry-virt.h
+++ b/include/linux/entry-virt.h
@@ -4,6 +4,7 @@
#include <linux/static_call_types.h>
#include <linux/resume_user_mode.h>
+#include <linux/hrtimer_rearm.h>
#include <linux/syscalls.h>
#include <linux/seccomp.h>
#include <linux/sched.h>
@@ -58,6 +59,7 @@ int xfer_to_guest_mode_handle_work(void);
static inline void xfer_to_guest_mode_prepare(void)
{
lockdep_assert_irqs_disabled();
+ hrtimer_rearm_deferred();
tick_nohz_user_enter_prepare();
}
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 5bd6efe598f0..f3bd084d9a72 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -2058,6 +2058,7 @@ void __hrtimer_rearm_deferred(void)
}
hrtimer_rearm(cpu_base, expires_next, true);
}
+EXPORT_SYMBOL_GPL(__hrtimer_rearm_deferred);
static __always_inline void
hrtimer_interrupt_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t expires_next)
^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-16 20:50 CPU Lockups in KVM with deferred hrtimer rearming Verma, Vishal L
@ 2026-04-20 15:00 ` Thomas Gleixner
  2026-04-20 15:22   ` Thomas Gleixner
                     ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread

From: Thomas Gleixner @ 2026-04-20 15:00 UTC (permalink / raw)
To: Verma, Vishal L, peterz@infradead.org
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin, x86@kernel.org

On Thu, Apr 16 2026 at 20:50, Vishal L. Verma wrote:
> I tried out an AI-assisted patch (below) which does happen to solve
> it, but I'm not familiar with this area, and not sure if this is the
> right fix.
>
> diff --git a/include/linux/entry-virt.h b/include/linux/entry-virt.h
> index bfa767702d9a..c4856c252412 100644
> --- a/include/linux/entry-virt.h
> +++ b/include/linux/entry-virt.h
> @@ -4,6 +4,7 @@
>
>  #include <linux/static_call_types.h>
>  #include <linux/resume_user_mode.h>
> +#include <linux/hrtimer_rearm.h>
>  #include <linux/syscalls.h>
>  #include <linux/seccomp.h>
>  #include <linux/sched.h>
> @@ -58,6 +59,7 @@ int xfer_to_guest_mode_handle_work(void);
>  static inline void xfer_to_guest_mode_prepare(void)
>  {
>  	lockdep_assert_irqs_disabled();
> +	hrtimer_rearm_deferred();
>  	tick_nohz_user_enter_prepare();

This code should never be reached with a rearm pending. Something else
went wrong earlier. So while the patch "works" it papers over the
underlying problem.

Can you please do the following:

 1) Apply the patch below

 2) Enable function tracing and the hrtimer* trace events

 3) Enable tracing if it has been disabled already

      echo 1 >/sys/kernel/tracing/tracing_on

 4) Run the tests and wait for /sys/kernel/tracing/tracing_on to
    become 0, which means the problem triggered.

 5) Retrieve the trace from /sys/kernel/tracing/trace and provide it
    somewhere to download from or send it to me compressed offlist.
Thanks,

        tglx
---
diff --git a/include/linux/entry-virt.h b/include/linux/entry-virt.h
index bfa767702d9a..ab73963a7496 100644
--- a/include/linux/entry-virt.h
+++ b/include/linux/entry-virt.h
@@ -58,6 +58,10 @@ int xfer_to_guest_mode_handle_work(void);
 static inline void xfer_to_guest_mode_prepare(void)
 {
 	lockdep_assert_irqs_disabled();
+	if (test_thread_flag(TIF_HRTIMER_REARM)) {
+		tracing_off();
+		hrtimer_rearm_deferred();
+	}
 	tick_nohz_user_enter_prepare();
 }

^ permalink raw reply related	[flat|nested] 28+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-20 15:00 ` Thomas Gleixner
@ 2026-04-20 15:22   ` Thomas Gleixner
  2026-04-20 20:57   ` Verma, Vishal L
  2026-04-21  4:51   ` Binbin Wu
  2 siblings, 0 replies; 28+ messages in thread

From: Thomas Gleixner @ 2026-04-20 15:22 UTC (permalink / raw)
To: Verma, Vishal L, peterz@infradead.org
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin, x86@kernel.org

On Mon, Apr 20 2026 at 17:00, Thomas Gleixner wrote:
> On Thu, Apr 16 2026 at 20:50, Vishal L. Verma wrote:
> This code should never be reached with a rearm pending. Something else
> went wrong earlier. So while the patch "works" it papers over the
> underlying problem.

Peter just noticed that this should be fixed with

  1f5ffc672165 ("Fix mismerge of the arm64 / timer-core interrupt handling changes")

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-20 15:00 ` Thomas Gleixner
  2026-04-20 15:22   ` Thomas Gleixner
@ 2026-04-20 20:57   ` Verma, Vishal L
  2026-04-20 22:19     ` Thomas Gleixner
  2026-04-21  4:51   ` Binbin Wu
  2 siblings, 1 reply; 28+ messages in thread

From: Verma, Vishal L @ 2026-04-20 20:57 UTC (permalink / raw)
To: peterz@infradead.org, tglx@kernel.org
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin, x86@kernel.org

On Mon, 2026-04-20 at 17:00 +0200, Thomas Gleixner wrote:
>
> This code should never be reached with a rearm pending. Something else
> went wrong earlier. So while the patch "works" it papers over the
> underlying problem.
>
> Can you please do the following:
>
>  1) Apply the patch below
>
>  2) Enable function tracing and the hrtimer* trace events
>
>  3) Enable tracing if it has been disabled already
>
>       echo 1 >/sys/kernel/tracing/tracing_on
>
>  4) Run the tests and wait for /sys/kernel/tracing/tracing_on to
>     become 0, which means the problem triggered.
>
>  5) Retrieve the trace from /sys/kernel/tracing/trace and provide it
>     somewhere to download from or send it to me compressed offlist.

Hi Thomas,

I've uploaded the trace here (~75MB compressed):
https://drive.proton.me/urls/B9PY61XQ0C#07XwTVhE46eB

As for:

  1f5ffc672165 ("Fix mismerge of the arm64 / timer-core interrupt handling changes")

I already had that commit in the branch that was tested and it didn't
fix it.

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-20 20:57 ` Verma, Vishal L
@ 2026-04-20 22:19   ` Thomas Gleixner
  2026-04-20 22:24     ` Verma, Vishal L
  0 siblings, 1 reply; 28+ messages in thread

From: Thomas Gleixner @ 2026-04-20 22:19 UTC (permalink / raw)
To: Verma, Vishal L, peterz@infradead.org
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin, x86@kernel.org

On Mon, Apr 20 2026 at 20:57, Verma, Vishal L wrote:
> On Mon, 2026-04-20 at 17:00 +0200, Thomas Gleixner wrote:
>>
>> This code should never be reached with a rearm pending. Something else
>> went wrong earlier. So while the patch "works" it papers over the
>> underlying problem.
>>
>> Can you please do the following:
>>
>>  1) Apply the patch below
>>
>>  2) Enable function tracing and the hrtimer* trace events
>>
>>  3) Enable tracing if it has been disabled already
>>
>>       echo 1 >/sys/kernel/tracing/tracing_on
>>
>>  4) Run the tests and wait for /sys/kernel/tracing/tracing_on to
>>     become 0, which means the problem triggered.
>>
>>  5) Retrieve the trace from /sys/kernel/tracing/trace and provide it
>>     somewhere to download from or send it to me compressed offlist.
>
> Hi Thomas,
>
> I've uploaded the trace here (~75MB compressed):
> https://drive.proton.me/urls/B9PY61XQ0C#07XwTVhE46eB
>
> As for:
>
>   1f5ffc672165 ("Fix mismerge of the arm64 / timer-core interrupt handling changes")
>
> I already had that commit in the branch that was tested and it didn't
> fix it.

Thanks for the update. Can you try to provide the information I asked
for above?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-20 22:19 ` Thomas Gleixner
@ 2026-04-20 22:24   ` Verma, Vishal L
  2026-04-21  6:29     ` Thomas Gleixner
  0 siblings, 1 reply; 28+ messages in thread

From: Verma, Vishal L @ 2026-04-20 22:24 UTC (permalink / raw)
To: peterz@infradead.org, tglx@kernel.org
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin, x86@kernel.org

On Tue, 2026-04-21 at 00:19 +0200, Thomas Gleixner wrote:
> On Mon, Apr 20 2026 at 20:57, Verma, Vishal L wrote:
> > On Mon, 2026-04-20 at 17:00 +0200, Thomas Gleixner wrote:
> > >
> > > This code should never be reached with a rearm pending. Something else
> > > went wrong earlier. So while the patch "works" it papers over the
> > > underlying problem.
> > >
> > > Can you please do the following:
> > >
> > >  1) Apply the patch below
> > >
> > >  2) Enable function tracing and the hrtimer* trace events
> > >
> > >  3) Enable tracing if it has been disabled already
> > >
> > >       echo 1 >/sys/kernel/tracing/tracing_on
> > >
> > >  4) Run the tests and wait for /sys/kernel/tracing/tracing_on to
> > >     become 0, which means the problem triggered.
> > >
> > >  5) Retrieve the trace from /sys/kernel/tracing/trace and provide it
> > >     somewhere to download from or send it to me compressed offlist.
> >
> > Hi Thomas,
> >
> > I've uploaded the trace here (~75MB compressed):
> > https://drive.proton.me/urls/B9PY61XQ0C#07XwTVhE46eB
> >
> > As for:
> >
> >   1f5ffc672165 ("Fix mismerge of the arm64 / timer-core interrupt handling changes")
> >
> > I already had that commit in the branch that was tested and it didn't
> > fix it.
>
> Thanks for the update. Can you try to provide the information I asked
> for above?
>
Ah sorry - I should've said that with your patch applied, tracing_on
did become 0, so the problem was triggered.

The trace from that is in the URL above.

This is how I collected it:

  tracefs=/sys/kernel/tracing
  echo 4096 > "$tracefs"/buffer_size_kb
  echo function > "$tracefs"/current_tracer
  echo 1 > "$tracefs"/events/hrtimer/enable
  echo 1 > "$tracefs"/tracing_on

  <run the test>

  tracing_on="$(cat "$tracefs"/tracing_on)"
  if [ "$tracing_on" -eq 0 ]; then
          echo "Debug patch triggered, collecting trace"
          cat "$tracefs"/trace | gzip > /tmp/hrtimer_rearm_trace.gz
  else
          echo "Debug patch did not trigger (tracing_on still 1)"
  fi

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-20 22:24 ` Verma, Vishal L
@ 2026-04-21  6:29   ` Thomas Gleixner
  0 siblings, 0 replies; 28+ messages in thread

From: Thomas Gleixner @ 2026-04-21 6:29 UTC (permalink / raw)
To: Verma, Vishal L, peterz@infradead.org
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin, x86@kernel.org

On Mon, Apr 20 2026 at 22:24, Verma, Vishal L wrote:
> On Tue, 2026-04-21 at 00:19 +0200, Thomas Gleixner wrote:
>> Thanks for the update. Can you try to provide the information I asked
>> for above?
>>
> Ah sorry - I should've said that with your patch applied, tracing_on
> did become 0, so the problem was triggered.
>
> The trace from that is in the URL above.

I clearly can't read :)

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-20 15:00 ` Thomas Gleixner
  2026-04-20 15:22   ` Thomas Gleixner
  2026-04-20 20:57   ` Verma, Vishal L
@ 2026-04-21  4:51   ` Binbin Wu
  2026-04-21  7:39     ` Thomas Gleixner
  2 siblings, 1 reply; 28+ messages in thread

From: Binbin Wu @ 2026-04-21 4:51 UTC (permalink / raw)
To: Thomas Gleixner, Verma, Vishal L, peterz@infradead.org
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin, x86@kernel.org

On 4/20/2026 11:00 PM, Thomas Gleixner wrote:
> On Thu, Apr 16 2026 at 20:50, Vishal L. Verma wrote:
>> I tried out an AI-assisted patch (below) which does happen to solve
>> it, but I'm not familiar with this area, and not sure if this is the
>> right fix.
>>
>> diff --git a/include/linux/entry-virt.h b/include/linux/entry-virt.h
>> index bfa767702d9a..c4856c252412 100644
>> --- a/include/linux/entry-virt.h
>> +++ b/include/linux/entry-virt.h
>> @@ -4,6 +4,7 @@
>>
>>  #include <linux/static_call_types.h>
>>  #include <linux/resume_user_mode.h>
>> +#include <linux/hrtimer_rearm.h>
>>  #include <linux/syscalls.h>
>>  #include <linux/seccomp.h>
>>  #include <linux/sched.h>
>> @@ -58,6 +59,7 @@ int xfer_to_guest_mode_handle_work(void);
>>  static inline void xfer_to_guest_mode_prepare(void)
>>  {
>>  	lockdep_assert_irqs_disabled();
>> +	hrtimer_rearm_deferred();
>>  	tick_nohz_user_enter_prepare();
>
> This code should never be reached with a rearm pending. Something else
> went wrong earlier. So while the patch "works" it papers over the
> underlying problem.

IIUC, the problem might be:

  HRTimer -> VMExit:                            [IRQ is disabled]
    kvm_x86_call(handle_exit_irqoff)(vcpu)
      vmx_handle_exit_irqoff
        handle_external_interrupt_irqoff
          sysvec_apic_timer_interrupt
            irqentry_enter
            ...
            irqentry_exit
              irqentry_exit_to_kernel_mode
                if (!regs_irqs_disabled(regs))  // <-- This is false, hrtimer
                    hrtimer_rearm_deferred()    //     rearm is skipped!

This issue is triggered on TDX since TDX can't use preemption timer while normal
VMX VM uses preemption timer by default.
> Can you please do the following:
>
>  1) Apply the patch below
>
>  2) Enable function tracing and the hrtimer* trace events
>
>  3) Enable tracing if it has been disabled already
>
>       echo 1 >/sys/kernel/tracing/tracing_on
>
>  4) Run the tests and wait for /sys/kernel/tracing/tracing_on to
>     become 0, which means the problem triggered.
>
>  5) Retrieve the trace from /sys/kernel/tracing/trace and provide it
>     somewhere to download from or send it to me compressed offlist.
>
> Thanks,
>
>         tglx
> ---
>
> diff --git a/include/linux/entry-virt.h b/include/linux/entry-virt.h
> index bfa767702d9a..ab73963a7496 100644
> --- a/include/linux/entry-virt.h
> +++ b/include/linux/entry-virt.h
> @@ -58,6 +58,10 @@ int xfer_to_guest_mode_handle_work(void);
>  static inline void xfer_to_guest_mode_prepare(void)
>  {
>  	lockdep_assert_irqs_disabled();
> +	if (test_thread_flag(TIF_HRTIMER_REARM)) {
> +		tracing_off();
> +		hrtimer_rearm_deferred();
> +	}
>  	tick_nohz_user_enter_prepare();
>  }

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-21  4:51 ` Binbin Wu
@ 2026-04-21  7:39   ` Thomas Gleixner
  2026-04-21 11:18     ` Peter Zijlstra
  2026-04-21 16:11     ` Verma, Vishal L
  0 siblings, 2 replies; 28+ messages in thread

From: Thomas Gleixner @ 2026-04-21 7:39 UTC (permalink / raw)
To: Binbin Wu, Verma, Vishal L, peterz@infradead.org
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin, x86@kernel.org

On Tue, Apr 21 2026 at 12:51, Binbin Wu wrote:
> On 4/20/2026 11:00 PM, Thomas Gleixner wrote:
>>> static inline void xfer_to_guest_mode_prepare(void)
>>> {
>>> 	lockdep_assert_irqs_disabled();
>>> +	hrtimer_rearm_deferred();
>>> 	tick_nohz_user_enter_prepare();
>>
>> This code should never be reached with a rearm pending. Something else
>> went wrong earlier. So while the patch "works" it papers over the
>> underlying problem.
>
> IIUC, the problem might be:
>
>   HRTimer -> VMExit:                            [IRQ is disabled]
>     kvm_x86_call(handle_exit_irqoff)(vcpu)
>       vmx_handle_exit_irqoff
>         handle_external_interrupt_irqoff
>           sysvec_apic_timer_interrupt
>             irqentry_enter
>             ...
>             irqentry_exit
>               irqentry_exit_to_kernel_mode
>                 if (!regs_irqs_disabled(regs))  // <-- This is false, hrtimer
>                     hrtimer_rearm_deferred()    //     rearm is skipped!
>
> This issue is triggered on TDX since TDX can't use preemption timer while normal
> VMX VM uses preemption timer by default.

Kinda. The issue is that vmx_handle_exit_irqoff() always hands in regs
with regs->flags.X86_EFLAGS_IF == 0. That has absolutely nothing to do
with TDX and the preemption timer.

The patch below solves the problem right there in the exit code, which
is unfortunate as there might be a NEED_RESCHED pending. But that can't
be taken into account as KVM enables interrupts _before_ reaching the
exit work point.

Yet another proof that virt creates more problems than it solves.

Thanks,

        tglx
---
Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
From: Thomas Gleixner <tglx@kernel.org>
Date: Tue, 21 Apr 2026 09:00:52 +0200

irqentry_exit_to_kernel_mode_after_preempt() invokes
hrtimer_rearm_deferred() only when the interrupted context had interrupts
enabled. That's a correct decision because the timer interrupt can only be
delivered in interrupt enabled contexts. The interrupt disabled path is
used by exceptions and traps which never touch the hrtimer mechanics.

So much for the theory, but then there is VIRT which ruins everything.

KVM invokes regular interrupts with pt_regs which have interrupts
disabled. That's correct from the KVM point of view, but completely
violates the obviously correct expectations of the interrupt entry/exit
code.

Cure this by adding a hrtimer_rearm_deferred() invocation into the
interrupted context has interrupt disabled path of
irqentry_exit_to_kernel_mode_after_preempt().

That's unfortunate when there is an actual reschedule pending, but it can't
be avoided because KVM invokes a lot of code and also reenables interrupts
_before_ reaching the point where the reschedule condition is handled. That
can delay the rearming significantly, which in turn can cause artificial
latencies.

Fixes: 0e98eb14814e ("entry: Prepare for deferred hrtimer rearming")
Reported-by: "Verma, Vishal L" <vishal.l.verma@intel.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Closes: https://lore.kernel.org/70cd3e97fbb796e2eb2ff8cd4b7614ada05a5f24.camel@intel.com
---
 include/linux/irq-entry-common.h | 8 ++++++++
 1 file changed, 8 insertions(+)

--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -516,6 +516,14 @@ irqentry_exit_to_kernel_mode_after_preem
 		instrumentation_end();
 	} else {
 		/*
+		 * This is sadly required due to KVM, which invokes regular
+		 * interrupt handlers with interrupt disabled state in @regs.
+		 */
+		instrumentation_begin();
+		hrtimer_rearm_deferred();
+		instrumentation_end();
+
+		/*
 		 * IRQ flags state is correct already. Just tell RCU if it
 		 * was not watching on entry.
 		 */

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-21  7:39 ` Thomas Gleixner
@ 2026-04-21 11:18   ` Peter Zijlstra
  2026-04-21 11:32     ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread

From: Peter Zijlstra @ 2026-04-21 11:18 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org, Edgecombe, Rick P,
    Wu, Binbin, x86@kernel.org

On Tue, Apr 21, 2026 at 09:39:14AM +0200, Thomas Gleixner wrote:

> ---
> Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
> From: Thomas Gleixner <tglx@kernel.org>
> Date: Tue, 21 Apr 2026 09:00:52 +0200
>
> irqentry_exit_to_kernel_mode_after_preempt() invokes
> hrtimer_rearm_deferred() only when the interrupted context had interrupts
> enabled. That's a correct decision because the timer interrupt can only be
> delivered in interrupt enabled contexts. The interrupt disabled path is
> used by exceptions and traps which never touch the hrtimer mechanics.
>
> So much for the theory, but then there is VIRT which ruins everything.
>
> KVM invokes regular interrupts with pt_regs which have interrupts
> disabled. That's correct from the KVM point of view, but completely
> violates the obviously correct expectations of the interrupt entry/exit
> code.

Mooo :-(

That also complicates the comment that goes with
hrtimer_rearm_deferred(). Not sure how to 'fix' that.

> Cure this by adding a hrtimer_rearm_deferred() invocation into the
> interrupted context has interrupt disabled path of
> irqentry_exit_to_kernel_mode_after_preempt().
>
> That's unfortunate when there is an actual reschedule pending, but it can't
> be avoided because KVM invokes a lot of code and also reenables interrupts
> _before_ reaching the point where the reschedule condition is handled. That
> can delay the rearming significantly, which in turn can cause artificial
> latencies.

Yeah, this is a trainwreck. If they want it better, KVM needs to get
'fixed' to not play silly games like this.

> Fixes: 0e98eb14814e ("entry: Prepare for deferred hrtimer rearming")
> Reported-by: "Verma, Vishal L" <vishal.l.verma@intel.com>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> Closes: https://lore.kernel.org/70cd3e97fbb796e2eb2ff8cd4b7614ada05a5f24.camel@intel.com

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

> ---
>  include/linux/irq-entry-common.h | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> --- a/include/linux/irq-entry-common.h
> +++ b/include/linux/irq-entry-common.h
> @@ -516,6 +516,14 @@ irqentry_exit_to_kernel_mode_after_preem
>  		instrumentation_end();
>  	} else {
>  		/*
> +		 * This is sadly required due to KVM, which invokes regular
> +		 * interrupt handlers with interrupt disabled state in @regs.
> +		 */
> +		instrumentation_begin();
> +		hrtimer_rearm_deferred();
> +		instrumentation_end();
> +
> +		/*
>  		 * IRQ flags state is correct already. Just tell RCU if it
>  		 * was not watching on entry.
>  		 */

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-21 11:18 ` Peter Zijlstra
@ 2026-04-21 11:32   ` Peter Zijlstra
  2026-04-21 11:34     ` Peter Zijlstra
  2026-04-21 16:30     ` Thomas Gleixner
  0 siblings, 2 replies; 28+ messages in thread

From: Peter Zijlstra @ 2026-04-21 11:32 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org, Edgecombe, Rick P,
    Wu, Binbin, x86@kernel.org

On Tue, Apr 21, 2026 at 01:18:58PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 09:39:14AM +0200, Thomas Gleixner wrote:
>
> > ---
> > Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
> > From: Thomas Gleixner <tglx@kernel.org>
> > Date: Tue, 21 Apr 2026 09:00:52 +0200
> >
> > irqentry_exit_to_kernel_mode_after_preempt() invokes
> > hrtimer_rearm_deferred() only when the interrupted context had interrupts
> > enabled. That's a correct decision because the timer interrupt can only be
> > delivered in interrupt enabled contexts. The interrupt disabled path is
> > used by exceptions and traps which never touch the hrtimer mechanics.
> >
> > So much for the theory, but then there is VIRT which ruins everything.
> >
> > KVM invokes regular interrupts with pt_regs which have interrupts
> > disabled. That's correct from the KVM point of view, but completely
> > violates the obviously correct expectations of the interrupt entry/exit
> > code.
>
> Mooo :-(
>
> That also complicates the comment that goes with
> hrtimer_rearm_deferred(). Not sure how to 'fix' that.
>
> > Cure this by adding a hrtimer_rearm_deferred() invocation into the
> > interrupted context has interrupt disabled path of
> > irqentry_exit_to_kernel_mode_after_preempt().
> >
> > That's unfortunate when there is an actual reschedule pending, but it can't
> > be avoided because KVM invokes a lot of code and also reenables interrupts
> > _before_ reaching the point where the reschedule condition is handled. That
> > can delay the rearming significantly, which in turn can cause artificial
> > latencies.
>
> Yeah, this is a trainwreck. If they want it better, KVM needs to get
> 'fixed' to not play silly games like this.
>
> > Fixes: 0e98eb14814e ("entry: Prepare for deferred hrtimer rearming")
> > Reported-by: "Verma, Vishal L" <vishal.l.verma@intel.com>
> > Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> > Closes: https://lore.kernel.org/70cd3e97fbb796e2eb2ff8cd4b7614ada05a5f24.camel@intel.com
>
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>
> > ---
> >  include/linux/irq-entry-common.h | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> >
> > --- a/include/linux/irq-entry-common.h
> > +++ b/include/linux/irq-entry-common.h
> > @@ -516,6 +516,14 @@ irqentry_exit_to_kernel_mode_after_preem
> >  		instrumentation_end();
> >  	} else {
> >  		/*
> > +		 * This is sadly required due to KVM, which invokes regular
> > +		 * interrupt handlers with interrupt disabled state in @regs.
> > +		 */
> > +		instrumentation_begin();
> > +		hrtimer_rearm_deferred();
> > +		instrumentation_end();
> > +
> > +		/*
> >  		 * IRQ flags state is correct already. Just tell RCU if it
> >  		 * was not watching on entry.
> >  		 */

Ohhh, wait. What happens if you take a page-fault from NMI context? Does
this then not result in trying to program the timer from NMI context?

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-21 11:32 ` Peter Zijlstra
@ 2026-04-21 11:34   ` Peter Zijlstra
  2026-04-21 11:49     ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread

From: Peter Zijlstra @ 2026-04-21 11:34 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org, Edgecombe, Rick P,
    Wu, Binbin, x86@kernel.org

On Tue, Apr 21, 2026 at 01:32:12PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 01:18:58PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 21, 2026 at 09:39:14AM +0200, Thomas Gleixner wrote:
> >
> > > ---
> > > Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
> > > From: Thomas Gleixner <tglx@kernel.org>
> > > Date: Tue, 21 Apr 2026 09:00:52 +0200
> > >
> > > irqentry_exit_to_kernel_mode_after_preempt() invokes
> > > hrtimer_rearm_deferred() only when the interrupted context had interrupts
> > > enabled. That's a correct decision because the timer interrupt can only be
> > > delivered in interrupt enabled contexts. The interrupt disabled path is
> > > used by exceptions and traps which never touch the hrtimer mechanics.
> > >
> > > So much for the theory, but then there is VIRT which ruins everything.
> > >
> > > KVM invokes regular interrupts with pt_regs which have interrupts
> > > disabled. That's correct from the KVM point of view, but completely
> > > violates the obviously correct expectations of the interrupt entry/exit
> > > code.
> >
> > Mooo :-(

Also, is this a x86/KVM 'special' or is this true for all arch/KVM that
use GENERIC_ENTRY?

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-21 11:34 ` Peter Zijlstra
@ 2026-04-21 11:49   ` Peter Zijlstra
  2026-04-21 12:05     ` Peter Zijlstra
  2026-04-21 17:11     ` Thomas Gleixner
  0 siblings, 2 replies; 28+ messages in thread

From: Peter Zijlstra @ 2026-04-21 11:49 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org, Edgecombe, Rick P,
    Wu, Binbin, x86@kernel.org

On Tue, Apr 21, 2026 at 01:34:07PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 01:32:12PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 21, 2026 at 01:18:58PM +0200, Peter Zijlstra wrote:
> > > On Tue, Apr 21, 2026 at 09:39:14AM +0200, Thomas Gleixner wrote:
> > >
> > > > ---
> > > > Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
> > > > From: Thomas Gleixner <tglx@kernel.org>
> > > > Date: Tue, 21 Apr 2026 09:00:52 +0200
> > > >
> > > > irqentry_exit_to_kernel_mode_after_preempt() invokes
> > > > hrtimer_rearm_deferred() only when the interrupted context had interrupts
> > > > enabled. That's a correct decision because the timer interrupt can only be
> > > > delivered in interrupt enabled contexts. The interrupt disabled path is
> > > > used by exceptions and traps which never touch the hrtimer mechanics.
> > > >
> > > > So much for the theory, but then there is VIRT which ruins everything.
> > > >
> > > > KVM invokes regular interrupts with pt_regs which have interrupts
> > > > disabled. That's correct from the KVM point of view, but completely
> > > > violates the obviously correct expectations of the interrupt entry/exit
> > > > code.
> > >
> > > Mooo :-(
>
> Also, is this a x86/KVM 'special' or is this true for all arch/KVM that
> use GENERIC_ENTRY?

Should we not make asm_fred_entry_from_kvm()/VMX_DO_EVENT_IRQOFF fix IF
on the fake frame instead? We know it will enable IRQs after doing
handle_exit_irqoff() in vcpu_enter_guest().

SVM does not seem affected with this particular insanity.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming 2026-04-21 11:49 ` Peter Zijlstra @ 2026-04-21 12:05 ` Peter Zijlstra 2026-04-21 13:19 ` Peter Zijlstra 2026-04-21 17:11 ` Thomas Gleixner 1 sibling, 1 reply; 28+ messages in thread From: Peter Zijlstra @ 2026-04-21 12:05 UTC (permalink / raw) To: Thomas Gleixner Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin, x86@kernel.org On Tue, Apr 21, 2026 at 01:49:40PM +0200, Peter Zijlstra wrote: > On Tue, Apr 21, 2026 at 01:34:07PM +0200, Peter Zijlstra wrote: > > On Tue, Apr 21, 2026 at 01:32:12PM +0200, Peter Zijlstra wrote: > > > On Tue, Apr 21, 2026 at 01:18:58PM +0200, Peter Zijlstra wrote: > > > > On Tue, Apr 21, 2026 at 09:39:14AM +0200, Thomas Gleixner wrote: > > > > > > > > > --- > > > > > Subject: entry: Enforce hrtimer rearming in the irqentry_exit path > > > > > From: Thomas Gleixner <tglx@kernel.org> > > > > > Date: Tue, 21 Apr 2026 09:00:52 +0200 > > > > > > > > > > irqentry_exit_to_kernel_mode_after_preempt() invokes > > > > > hrtimer_rearm_deferred() only when the interrupted context had interrupts > > > > > enabled. That's a correct decision because the timer interrupt can only be > > > > > delivered in interrupt enabled contexts. The interrupt disabled path is > > > > > used by exceptions and traps which never touch the hrtimer mechanics. > > > > > > > > > > So much for the theory, but then there is VIRT which ruins everything. > > > > > > > > > > KVM invokes regular interrupts with pt_regs which have interrupts > > > > > disabled. That's correct from the KVM point of view, but completely > > > > > violates the obviously correct expectations of the interrupt entry/exit > > > > > code. > > > > > > > > Mooo :-( > > > > Also, is this a x86/KVM 'special' or is this true for all arch/KVM that > > use GENERIC_ENTRY? > > Should we not make asm_fred_entry_from_kvm()/VMX_DO_EVENT_IRQOFF fix IF > on the fake frame instead? 
> We know it will enable IRQs after doing
> handle_exit_irqoff() in vcpu_enter_guest().

Moo, you can't do that either, because it will ERETS/IRET and fuck up
the state :/
* Re: CPU Lockups in KVM with deferred hrtimer rearming 2026-04-21 12:05 ` Peter Zijlstra @ 2026-04-21 13:19 ` Peter Zijlstra 2026-04-21 13:29 ` Peter Zijlstra 0 siblings, 1 reply; 28+ messages in thread From: Peter Zijlstra @ 2026-04-21 13:19 UTC (permalink / raw) To: Thomas Gleixner Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin, x86@kernel.org On Tue, Apr 21, 2026 at 02:05:31PM +0200, Peter Zijlstra wrote: > On Tue, Apr 21, 2026 at 01:49:40PM +0200, Peter Zijlstra wrote: > > On Tue, Apr 21, 2026 at 01:34:07PM +0200, Peter Zijlstra wrote: > > > On Tue, Apr 21, 2026 at 01:32:12PM +0200, Peter Zijlstra wrote: > > > > On Tue, Apr 21, 2026 at 01:18:58PM +0200, Peter Zijlstra wrote: > > > > > On Tue, Apr 21, 2026 at 09:39:14AM +0200, Thomas Gleixner wrote: > > > > > > > > > > > --- > > > > > > Subject: entry: Enforce hrtimer rearming in the irqentry_exit path > > > > > > From: Thomas Gleixner <tglx@kernel.org> > > > > > > Date: Tue, 21 Apr 2026 09:00:52 +0200 > > > > > > > > > > > > irqentry_exit_to_kernel_mode_after_preempt() invokes > > > > > > hrtimer_rearm_deferred() only when the interrupted context had interrupts > > > > > > enabled. That's a correct decision because the timer interrupt can only be > > > > > > delivered in interrupt enabled contexts. The interrupt disabled path is > > > > > > used by exceptions and traps which never touch the hrtimer mechanics. > > > > > > > > > > > > So much for the theory, but then there is VIRT which ruins everything. > > > > > > > > > > > > KVM invokes regular interrupts with pt_regs which have interrupts > > > > > > disabled. That's correct from the KVM point of view, but completely > > > > > > violates the obviously correct expectations of the interrupt entry/exit > > > > > > code. > > > > > > > > > > Mooo :-( > > > > > > Also, is this a x86/KVM 'special' or is this true for all arch/KVM that > > > use GENERIC_ENTRY? 
> > > > Should we not make asm_fred_entry_from_kvm()/VMX_DO_EVENT_IRQOFF fix IF > > on the fake frame instead? We know it will enable IRQs after doing > > handle_exit_irqoff() in vcpu_enter_guest(). > > Moo, you can't do that either, because it will ERETS/IRET and fuck up > the state :/ How insane is something like this? --- diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S index 894f7f16eb80..f3e2a8fde1ab 100644 --- a/arch/x86/entry/entry_64_fred.S +++ b/arch/x86/entry/entry_64_fred.S @@ -98,6 +98,7 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) push %rdi /* fred_ss handed in by the caller */ push %rbp pushf + or $X86_EFLAGS_KVM, (%rsp) push $__KERNEL_CS /* diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h index 7535131c711b..aab93f07e768 100644 --- a/arch/x86/include/asm/entry-common.h +++ b/arch/x86/include/asm/entry-common.h @@ -97,4 +97,16 @@ static __always_inline void arch_exit_to_user_mode(void) } #define arch_exit_to_user_mode arch_exit_to_user_mode +static __always_inline void arch_exit_to_kernel_mode(struct pt_regs *regs) +{ +#ifdef CONFIG_KVM_INTEL + /* + * KVM is a reserved bit and must always be 0. Hardware will #GP on + * IRET/ERETS with this bit set. 
+ */ + regs->flags &= ~X86_EFLAGS_KVM; +#endif +} +#define arch_exit_to_kernel_mode arch_exit_to_kernel_mode + #endif diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h index 7bb7bd90355d..c31f7bc2eba2 100644 --- a/arch/x86/include/asm/ptrace.h +++ b/arch/x86/include/asm/ptrace.h @@ -311,7 +311,15 @@ void user_stack_pointer_set(struct pt_regs *regs, unsigned long val) static __always_inline bool regs_irqs_disabled(struct pt_regs *regs) { - return !(regs->flags & X86_EFLAGS_IF); + /* + * return context | IF | KVM + * ---------------+----+---- + * IRQ-off | 0 | 0 + * IRQ-on | 0 | 1 + * IRQ-on | 1 | 0 + * invalid | 1 | 1 + */ + return (regs->flags & (X86_EFLAGS_IF | X86_EFLAGS_KVM)) == 0; } /* Query offset/name of register from its name/offset */ diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h index 81d0c8bf1137..d32edefde587 100644 --- a/arch/x86/include/uapi/asm/processor-flags.h +++ b/arch/x86/include/uapi/asm/processor-flags.h @@ -14,6 +14,8 @@ #define X86_EFLAGS_FIXED _BITUL(X86_EFLAGS_FIXED_BIT) #define X86_EFLAGS_PF_BIT 2 /* Parity Flag */ #define X86_EFLAGS_PF _BITUL(X86_EFLAGS_PF_BIT) +#define X86_EFLAGS_KVM_BIT 3 /* KVM Flag -- must be 0 */ +#define X86_EFLAGS_KVM _BITUL(X86_EFLAGS_PF_BIT) #define X86_EFLAGS_AF_BIT 4 /* Auxiliary carry Flag */ #define X86_EFLAGS_AF _BITUL(X86_EFLAGS_AF_BIT) #define X86_EFLAGS_ZF_BIT 6 /* Zero Flag */ diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S index 8a481dae9cae..3d0d0fb8de79 100644 --- a/arch/x86/kvm/vmx/vmenter.S +++ b/arch/x86/kvm/vmx/vmenter.S @@ -50,6 +50,7 @@ push %rbp #endif pushf + or $X86_EFLAGS_KVM, (%_ASM_SP) push $__KERNEL_CS \call_insn \call_target diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h index 167fba7dbf04..0acc20b63513 100644 --- a/include/linux/irq-entry-common.h +++ b/include/linux/irq-entry-common.h @@ -167,6 +167,10 @@ static __always_inline void 
arch_exit_to_user_mode(void); static __always_inline void arch_exit_to_user_mode(void) { } #endif +#ifndef arch_exit_to_kernel_mode +static __always_inline void arch_exit_to_kernel_mode(struct pt_regs *regs) { } +#endif + /** * arch_do_signal_or_restart - Architecture specific signal delivery function * @regs: Pointer to currents pt_regs @@ -548,6 +552,7 @@ static __always_inline void irqentry_exit_to_kernel_mode(struct pt_regs *regs, instrumentation_end(); irqentry_exit_to_kernel_mode_after_preempt(regs, state); + arch_exit_to_kernel_mode(regs); } /** ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming 2026-04-21 13:19 ` Peter Zijlstra @ 2026-04-21 13:29 ` Peter Zijlstra 2026-04-21 16:36 ` Thomas Gleixner 2026-04-21 18:11 ` Verma, Vishal L 0 siblings, 2 replies; 28+ messages in thread From: Peter Zijlstra @ 2026-04-21 13:29 UTC (permalink / raw) To: Thomas Gleixner Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin, x86@kernel.org On Tue, Apr 21, 2026 at 03:19:53PM +0200, Peter Zijlstra wrote: > On Tue, Apr 21, 2026 at 02:05:31PM +0200, Peter Zijlstra wrote: > > On Tue, Apr 21, 2026 at 01:49:40PM +0200, Peter Zijlstra wrote: > > > On Tue, Apr 21, 2026 at 01:34:07PM +0200, Peter Zijlstra wrote: > > > > On Tue, Apr 21, 2026 at 01:32:12PM +0200, Peter Zijlstra wrote: > > > > > On Tue, Apr 21, 2026 at 01:18:58PM +0200, Peter Zijlstra wrote: > > > > > > On Tue, Apr 21, 2026 at 09:39:14AM +0200, Thomas Gleixner wrote: > > > > > > > > > > > > > --- > > > > > > > Subject: entry: Enforce hrtimer rearming in the irqentry_exit path > > > > > > > From: Thomas Gleixner <tglx@kernel.org> > > > > > > > Date: Tue, 21 Apr 2026 09:00:52 +0200 > > > > > > > > > > > > > > irqentry_exit_to_kernel_mode_after_preempt() invokes > > > > > > > hrtimer_rearm_deferred() only when the interrupted context had interrupts > > > > > > > enabled. That's a correct decision because the timer interrupt can only be > > > > > > > delivered in interrupt enabled contexts. The interrupt disabled path is > > > > > > > used by exceptions and traps which never touch the hrtimer mechanics. > > > > > > > > > > > > > > So much for the theory, but then there is VIRT which ruins everything. > > > > > > > > > > > > > > KVM invokes regular interrupts with pt_regs which have interrupts > > > > > > > disabled. That's correct from the KVM point of view, but completely > > > > > > > violates the obviously correct expectations of the interrupt entry/exit > > > > > > > code. 
> > > > > > > > > > > > Mooo :-( > > > > > > > > Also, is this a x86/KVM 'special' or is this true for all arch/KVM that > > > > use GENERIC_ENTRY? > > > > > > Should we not make asm_fred_entry_from_kvm()/VMX_DO_EVENT_IRQOFF fix IF > > > on the fake frame instead? We know it will enable IRQs after doing > > > handle_exit_irqoff() in vcpu_enter_guest(). > > > > Moo, you can't do that either, because it will ERETS/IRET and fuck up > > the state :/ > > How insane is something like this? Small matter of actually building... --- diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S index 894f7f16eb80..cc2c961a5683 100644 --- a/arch/x86/entry/entry_64_fred.S +++ b/arch/x86/entry/entry_64_fred.S @@ -98,6 +98,7 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) push %rdi /* fred_ss handed in by the caller */ push %rbp pushf + orq $X86_EFLAGS_KVM, (%rsp) push $__KERNEL_CS /* diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h index 0e8c611bc9e2..75568a85b2d3 100644 --- a/arch/x86/include/asm/asm.h +++ b/arch/x86/include/asm/asm.h @@ -43,6 +43,7 @@ #define _ASM_SUB __ASM_SIZE(sub) #define _ASM_XADD __ASM_SIZE(xadd) #define _ASM_MUL __ASM_SIZE(mul) +#define _ASM_OR __ASM_SIZE(or) #define _ASM_AX __ASM_REG(ax) #define _ASM_BX __ASM_REG(bx) diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h index 7535131c711b..aab93f07e768 100644 --- a/arch/x86/include/asm/entry-common.h +++ b/arch/x86/include/asm/entry-common.h @@ -97,4 +97,16 @@ static __always_inline void arch_exit_to_user_mode(void) } #define arch_exit_to_user_mode arch_exit_to_user_mode +static __always_inline void arch_exit_to_kernel_mode(struct pt_regs *regs) +{ +#ifdef CONFIG_KVM_INTEL + /* + * KVM is a reserved bit and must always be 0. Hardware will #GP on + * IRET/ERETS with this bit set. 
+ */ + regs->flags &= ~X86_EFLAGS_KVM; +#endif +} +#define arch_exit_to_kernel_mode arch_exit_to_kernel_mode + #endif diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h index 7bb7bd90355d..c31f7bc2eba2 100644 --- a/arch/x86/include/asm/ptrace.h +++ b/arch/x86/include/asm/ptrace.h @@ -311,7 +311,15 @@ void user_stack_pointer_set(struct pt_regs *regs, unsigned long val) static __always_inline bool regs_irqs_disabled(struct pt_regs *regs) { - return !(regs->flags & X86_EFLAGS_IF); + /* + * return context | IF | KVM + * ---------------+----+---- + * IRQ-off | 0 | 0 + * IRQ-on | 0 | 1 + * IRQ-on | 1 | 0 + * invalid | 1 | 1 + */ + return (regs->flags & (X86_EFLAGS_IF | X86_EFLAGS_KVM)) == 0; } /* Query offset/name of register from its name/offset */ diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h index 81d0c8bf1137..d32edefde587 100644 --- a/arch/x86/include/uapi/asm/processor-flags.h +++ b/arch/x86/include/uapi/asm/processor-flags.h @@ -14,6 +14,8 @@ #define X86_EFLAGS_FIXED _BITUL(X86_EFLAGS_FIXED_BIT) #define X86_EFLAGS_PF_BIT 2 /* Parity Flag */ #define X86_EFLAGS_PF _BITUL(X86_EFLAGS_PF_BIT) +#define X86_EFLAGS_KVM_BIT 3 /* KVM Flag -- must be 0 */ +#define X86_EFLAGS_KVM _BITUL(X86_EFLAGS_PF_BIT) #define X86_EFLAGS_AF_BIT 4 /* Auxiliary carry Flag */ #define X86_EFLAGS_AF _BITUL(X86_EFLAGS_AF_BIT) #define X86_EFLAGS_ZF_BIT 6 /* Zero Flag */ diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S index 8a481dae9cae..cb9ab3ce030b 100644 --- a/arch/x86/kvm/vmx/vmenter.S +++ b/arch/x86/kvm/vmx/vmenter.S @@ -6,6 +6,7 @@ #include <asm/nospec-branch.h> #include <asm/percpu.h> #include <asm/segment.h> +#include <asm/processor-flags.h> #include "kvm-asm-offsets.h" #include "run_flags.h" @@ -50,6 +51,7 @@ push %rbp #endif pushf + _ASM_OR $X86_EFLAGS_KVM, (%_ASM_SP) push $__KERNEL_CS \call_insn \call_target diff --git a/include/linux/irq-entry-common.h 
b/include/linux/irq-entry-common.h index 167fba7dbf04..0acc20b63513 100644 --- a/include/linux/irq-entry-common.h +++ b/include/linux/irq-entry-common.h @@ -167,6 +167,10 @@ static __always_inline void arch_exit_to_user_mode(void); static __always_inline void arch_exit_to_user_mode(void) { } #endif +#ifndef arch_exit_to_kernel_mode +static __always_inline void arch_exit_to_kernel_mode(struct pt_regs *regs) { } +#endif + /** * arch_do_signal_or_restart - Architecture specific signal delivery function * @regs: Pointer to currents pt_regs @@ -548,6 +552,7 @@ static __always_inline void irqentry_exit_to_kernel_mode(struct pt_regs *regs, instrumentation_end(); irqentry_exit_to_kernel_mode_after_preempt(regs, state); + arch_exit_to_kernel_mode(regs); } /** ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming 2026-04-21 13:29 ` Peter Zijlstra @ 2026-04-21 16:36 ` Thomas Gleixner 2026-04-21 18:11 ` Verma, Vishal L 1 sibling, 0 replies; 28+ messages in thread From: Thomas Gleixner @ 2026-04-21 16:36 UTC (permalink / raw) To: Peter Zijlstra Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin, x86@kernel.org On Tue, Apr 21 2026 at 15:29, Peter Zijlstra wrote: > On Tue, Apr 21, 2026 at 03:19:53PM +0200, Peter Zijlstra wrote: >> > Moo, you can't do that either, because it will ERETS/IRET and fuck up >> > the state :/ >> >> How insane is something like this? Pretty insane :) ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming 2026-04-21 13:29 ` Peter Zijlstra 2026-04-21 16:36 ` Thomas Gleixner @ 2026-04-21 18:11 ` Verma, Vishal L 1 sibling, 0 replies; 28+ messages in thread From: Verma, Vishal L @ 2026-04-21 18:11 UTC (permalink / raw) To: peterz@infradead.org, tglx@kernel.org Cc: Wu, Binbin, kvm@vger.kernel.org, binbin.wu@linux.intel.com, Edgecombe, Rick P, x86@kernel.org On Tue, 2026-04-21 at 15:29 +0200, Peter Zijlstra wrote: > > diff --git a/arch/x86/include/uapi/asm/processor-flags.h > b/arch/x86/include/uapi/asm/processor-flags.h > index 81d0c8bf1137..d32edefde587 100644 > --- a/arch/x86/include/uapi/asm/processor-flags.h > +++ b/arch/x86/include/uapi/asm/processor-flags.h > @@ -14,6 +14,8 @@ > #define X86_EFLAGS_FIXED _BITUL(X86_EFLAGS_FIXED_BIT) > #define X86_EFLAGS_PF_BIT 2 /* Parity Flag */ > #define X86_EFLAGS_PF _BITUL(X86_EFLAGS_PF_BIT) > +#define X86_EFLAGS_KVM_BIT 3 /* KVM Flag -- must be 0 */ > +#define X86_EFLAGS_KVM _BITUL(X86_EFLAGS_PF_BIT) I fixed up the copy-paste typo here - _BITUL(X86_EFLAGS_KVM_BIT) .. and with that the tests pass. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming 2026-04-21 11:49 ` Peter Zijlstra 2026-04-21 12:05 ` Peter Zijlstra @ 2026-04-21 17:11 ` Thomas Gleixner 2026-04-21 17:20 ` Jim Mattson 2026-04-21 19:18 ` Verma, Vishal L 1 sibling, 2 replies; 28+ messages in thread From: Thomas Gleixner @ 2026-04-21 17:11 UTC (permalink / raw) To: Peter Zijlstra Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin, x86@kernel.org, Sean Christopherson, Paolo Bonzini On Tue, Apr 21 2026 at 13:49, Peter Zijlstra wrote: > On Tue, Apr 21, 2026 at 01:34:07PM +0200, Peter Zijlstra wrote: >> > > > KVM invokes regular interrupts with pt_regs which have interrupts >> > > > disabled. That's correct from the KVM point of view, but completely >> > > > violates the obviously correct expectations of the interrupt entry/exit >> > > > code. >> > > >> > > Mooo :-( >> >> Also, is this a x86/KVM 'special' or is this true for all arch/KVM that >> use GENERIC_ENTRY? > > Should we not make asm_fred_entry_from_kvm()/VMX_DO_EVENT_IRQOFF fix IF > on the fake frame instead? We know it will enable IRQs after doing > handle_exit_irqoff() in vcpu_enter_guest(). Doesn't work :) > SVM does not seem affected with this particular insanity. Looks like. It will take the interrupt after local_irq_enable(). Now for VMX, that hrtimer_rearm_deferred() call should really go into handle_external_interrupt_irqoff(), which in turn requires to export __hrtimer_rearm_deferred(). But we can avoid that alltogether. Something like the untested below. 
Thanks,

	tglx
---
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -42,9 +42,10 @@
 #include <linux/timer.h>
 #include <linux/freezer.h>
 #include <linux/compat.h>
-
 #include <linux/uaccess.h>
 
+#include <asm/irq_regs.h>
+
 #include <trace/events/timer.h>
 
 #include "tick-internal.h"
@@ -2062,11 +2063,16 @@ void __hrtimer_rearm_deferred(void)
 static __always_inline void
 hrtimer_interrupt_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t expires_next)
 {
-	/* hrtimer_interrupt() just re-evaluated the first expiring timer */
-	cpu_base->deferred_needs_update = false;
-	/* Cache the expiry time */
-	cpu_base->deferred_expires_next = expires_next;
-	set_thread_flag(TIF_HRTIMER_REARM);
+	/* Lies, damned lies and virt */
+	if (likely(!regs_irqs_disabled(get_irq_regs()))) {
+		/* hrtimer_interrupt() just re-evaluated the first expiring timer */
+		cpu_base->deferred_needs_update = false;
+		/* Cache the expiry time */
+		cpu_base->deferred_expires_next = expires_next;
+		set_thread_flag(TIF_HRTIMER_REARM);
+	} else {
+		hrtimer_rearm(cpu_base, expires_next, false);
+	}
 }
 #else /* CONFIG_HRTIMER_REARM_DEFERRED */
 static __always_inline void
* Re: CPU Lockups in KVM with deferred hrtimer rearming 2026-04-21 17:11 ` Thomas Gleixner @ 2026-04-21 17:20 ` Jim Mattson 2026-04-21 18:29 ` Thomas Gleixner 2026-04-21 19:18 ` Verma, Vishal L 1 sibling, 1 reply; 28+ messages in thread From: Jim Mattson @ 2026-04-21 17:20 UTC (permalink / raw) To: Thomas Gleixner Cc: Peter Zijlstra, Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin, x86@kernel.org, Sean Christopherson, Paolo Bonzini . On Tue, Apr 21, 2026 at 10:14 AM Thomas Gleixner <tglx@kernel.org> wrote: > > On Tue, Apr 21 2026 at 13:49, Peter Zijlstra wrote: > > On Tue, Apr 21, 2026 at 01:34:07PM +0200, Peter Zijlstra wrote: > >> > > > KVM invokes regular interrupts with pt_regs which have interrupts > >> > > > disabled. That's correct from the KVM point of view, but completely > >> > > > violates the obviously correct expectations of the interrupt entry/exit > >> > > > code. > >> > > > >> > > Mooo :-( > >> > >> Also, is this a x86/KVM 'special' or is this true for all arch/KVM that > >> use GENERIC_ENTRY? > > > > Should we not make asm_fred_entry_from_kvm()/VMX_DO_EVENT_IRQOFF fix IF > > on the fake frame instead? We know it will enable IRQs after doing > > handle_exit_irqoff() in vcpu_enter_guest(). > > Doesn't work :) > > > SVM does not seem affected with this particular insanity. > > Looks like. It will take the interrupt after local_irq_enable(). FWIW, VMX should work just like SVM if we clear VM_EXIT_ACK_INTR_ON_EXIT. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming 2026-04-21 17:20 ` Jim Mattson @ 2026-04-21 18:29 ` Thomas Gleixner 2026-04-21 18:55 ` Sean Christopherson 0 siblings, 1 reply; 28+ messages in thread From: Thomas Gleixner @ 2026-04-21 18:29 UTC (permalink / raw) To: Jim Mattson Cc: Peter Zijlstra, Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin, x86@kernel.org, Sean Christopherson, Paolo Bonzini On Tue, Apr 21 2026 at 10:20, Jim Mattson wrote: > On Tue, Apr 21, 2026 at 10:14 AM Thomas Gleixner <tglx@kernel.org> wrote: >> >> On Tue, Apr 21 2026 at 13:49, Peter Zijlstra wrote: >> > On Tue, Apr 21, 2026 at 01:34:07PM +0200, Peter Zijlstra wrote: >> >> > > > KVM invokes regular interrupts with pt_regs which have interrupts >> >> > > > disabled. That's correct from the KVM point of view, but completely >> >> > > > violates the obviously correct expectations of the interrupt entry/exit >> >> > > > code. >> >> > > >> >> > > Mooo :-( >> >> >> >> Also, is this a x86/KVM 'special' or is this true for all arch/KVM that >> >> use GENERIC_ENTRY? >> > >> > Should we not make asm_fred_entry_from_kvm()/VMX_DO_EVENT_IRQOFF fix IF >> > on the fake frame instead? We know it will enable IRQs after doing >> > handle_exit_irqoff() in vcpu_enter_guest(). >> >> Doesn't work :) >> >> > SVM does not seem affected with this particular insanity. >> >> Looks like. It will take the interrupt after local_irq_enable(). > > FWIW, VMX should work just like SVM if we clear VM_EXIT_ACK_INTR_ON_EXIT. I know. What's the point of that VM_EXIT_ACK_INTR_ON_EXIT exercise? Is there any performance benefit or is it just used because it's there? Thanks, tglx ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming 2026-04-21 18:29 ` Thomas Gleixner @ 2026-04-21 18:55 ` Sean Christopherson 2026-04-21 20:06 ` Peter Zijlstra 2026-04-21 20:39 ` Paolo Bonzini 0 siblings, 2 replies; 28+ messages in thread From: Sean Christopherson @ 2026-04-21 18:55 UTC (permalink / raw) To: Thomas Gleixner Cc: Jim Mattson, Peter Zijlstra, Binbin Wu, Vishal L Verma, kvm@vger.kernel.org, Rick P Edgecombe, Binbin Wu, x86@kernel.org, Paolo Bonzini On Tue, Apr 21, 2026, Thomas Gleixner wrote: > On Tue, Apr 21 2026 at 10:20, Jim Mattson wrote: > > On Tue, Apr 21, 2026 at 10:14 AM Thomas Gleixner <tglx@kernel.org> wrote: > >> > >> On Tue, Apr 21 2026 at 13:49, Peter Zijlstra wrote: > >> > On Tue, Apr 21, 2026 at 01:34:07PM +0200, Peter Zijlstra wrote: > >> >> > > > KVM invokes regular interrupts with pt_regs which have interrupts > >> >> > > > disabled. That's correct from the KVM point of view, but completely > >> >> > > > violates the obviously correct expectations of the interrupt entry/exit > >> >> > > > code. > >> >> > > > >> >> > > Mooo :-( > >> >> > >> >> Also, is this a x86/KVM 'special' or is this true for all arch/KVM that > >> >> use GENERIC_ENTRY? > >> > > >> > Should we not make asm_fred_entry_from_kvm()/VMX_DO_EVENT_IRQOFF fix IF > >> > on the fake frame instead? We know it will enable IRQs after doing > >> > handle_exit_irqoff() in vcpu_enter_guest(). > >> > >> Doesn't work :) > >> > >> > SVM does not seem affected with this particular insanity. > >> > >> Looks like. It will take the interrupt after local_irq_enable(). > > > > FWIW, VMX should work just like SVM if we clear VM_EXIT_ACK_INTR_ON_EXIT. Hell no. > I know. What's the point of that VM_EXIT_ACK_INTR_ON_EXIT exercise? Is > there any performance benefit or is it just used because it's there? There are performance benefits, and it preserves ordering: the first IRQ that's serviced by the host is guaranteed to be _the_ IRQ that triggered the VM-Exit. E.g. 
with AMD's approach, any IRQs that arrive between the VM-Exit and STI (which is
a pretty big swath of code) could be serviced before the IRQ that triggered the
exit, depending on priority.

VM_EXIT_ACK_INTR_ON_EXIT also provides symmetry with Intel's handling of NMIs,
as NMIs are unconditionally "acked" on VM-Exit.

Even if performance is "fine", changing decades of fundamental KVM behavior is
terrifying.

Pulling in an earlier idea:

 : Now for VMX, that hrtimer_rearm_deferred() call should really go into
 : handle_external_interrupt_irqoff(), which in turn requires to export
 : __hrtimer_rearm_deferred().

IMO, that's the way to go. But instead of exporting __hrtimer_rearm_deferred(),
move vmx_do_nmi_irqoff() and vmx_do_interrupt_irqoff() into core kernel entry
code (along with the assembly glue), and then EXPORT_SYMBOL_FOR_KVM those.

It'd mean some extra surgery, e.g. to provide an equivalent to KVM's IDT lookup:

	gate_offset((gate_desc *)host_idt_base + vector)

But I suspect it would be a big net positive in the end. E.g. the entry code
would *know* it's dealing with a direct call from KVM, and thus shouldn't need
to play pt_regs games.

Actually, even better would be to bury the FRED vs. not-FRED details in entry
code. E.g. on the KVM invocation side, we could get to something like the
below, and I'm pretty sure _reduce_ the number of for-KVM exports in the
process.

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a29896a9ef14..f6f5c124ed3b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7127,17 +7127,9 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
 		     "unexpected VM-Exit interrupt info: 0x%x", intr_info))
 		return;
 
-	/*
-	 * Invoke the kernel's IRQ handler for the vector. Use the FRED path
-	 * when it's available even if FRED isn't fully enabled, e.g. even if
-	 * FRED isn't supported in hardware, in order to avoid the indirect
-	 * CALL in the non-FRED path.
-	 */
+	/* For the IRQ to the core kernel for processing. */
 	kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
-	if (IS_ENABLED(CONFIG_X86_FRED))
-		fred_entry_from_kvm(EVENT_TYPE_EXTINT, vector);
-	else
-		vmx_do_interrupt_irqoff(gate_offset((gate_desc *)host_idt_base + vector));
+	x86_entry_from_kvm(EVENT_TYPE_EXTINT, vector);
 	kvm_after_interrupt(vcpu);
 
 	vcpu->arch.at_instruction_boundary = true;
@@ -7447,10 +7439,7 @@ noinstr void vmx_handle_nmi(struct kvm_vcpu *vcpu)
 		return;
 
 	kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
-	if (cpu_feature_enabled(X86_FEATURE_FRED))
-		fred_entry_from_kvm(EVENT_TYPE_NMI, NMI_VECTOR);
-	else
-		vmx_do_nmi_irqoff();
+	x86_entry_from_kvm(EVENT_TYPE_NMI, NMI_VECTOR);
 	kvm_after_interrupt(vcpu);
 }
* Re: CPU Lockups in KVM with deferred hrtimer rearming 2026-04-21 18:55 ` Sean Christopherson @ 2026-04-21 20:06 ` Peter Zijlstra 2026-04-21 20:46 ` Peter Zijlstra 2026-04-21 20:39 ` Paolo Bonzini 1 sibling, 1 reply; 28+ messages in thread From: Peter Zijlstra @ 2026-04-21 20:06 UTC (permalink / raw) To: Sean Christopherson Cc: Thomas Gleixner, Jim Mattson, Binbin Wu, Vishal L Verma, kvm@vger.kernel.org, Rick P Edgecombe, Binbin Wu, x86@kernel.org, Paolo Bonzini On Tue, Apr 21, 2026 at 11:55:33AM -0700, Sean Christopherson wrote: > Pulling in an earlier idea: > > : Now for VMX, that hrtimer_rearm_deferred() call should really go into > : handle_external_interrupt_irqoff(), which in turn requires to export > : __hrtimer_rearm_deferred(). > > Actually, even better would be to bury the FRED vs. not-FRED details in entry > code. E.g. on the KVM invocation side, we could get to something like the below, > and I'm pretty sure _reduce_ the number of for-KVM exports in the process. Something like so then? 
diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile index 72cae8e0ce85..83b4762d6ecb 100644 --- a/arch/x86/entry/Makefile +++ b/arch/x86/entry/Makefile @@ -13,7 +13,7 @@ CFLAGS_REMOVE_syscall_64.o = $(CC_FLAGS_FTRACE) CFLAGS_syscall_32.o += -fno-stack-protector CFLAGS_syscall_64.o += -fno-stack-protector -obj-y := entry.o entry_$(BITS).o syscall_$(BITS).o +obj-y := entry.o entry_$(BITS).o syscall_$(BITS).o common.o obj-y += vdso/ obj-y += vsyscall/ diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c new file mode 100644 index 000000000000..4b0171abb083 --- /dev/null +++ b/arch/x86/entry/common.c @@ -0,0 +1,22 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#include <linux/kvm_types.h> +#include <linux/hrtimer_rearm.h> +#include <asm/entry-common.h> +#include <asm/fred.h> +#include <asm/desc.h> + +noinstr void x86_entry_from_kvm(unsigned int event_type, unsigned int vector) +{ +#ifdef CONFIG_X86_64 + fred_entry_from_kvm(event_type, vector); +#else + idt_entry_from_kvm(vector); +#endif + if (event_type == EVENT_TYPE_EXTINT) { + instrumentation_begin(); + hrtimer_rearm_deferred(); + instrumentation_end(); + } +} +EXPORT_SYMBOL_FOR_KVM(x86_entry_from_kvm); diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index 92c0b4a94e0a..96c3e9322297 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -1224,3 +1224,36 @@ SYM_CODE_START(rewind_stack_and_make_dead) 1: jmp 1b SYM_CODE_END(rewind_stack_and_make_dead) .popsection + +.pushsection .noinstr.text, "ax" +.macro IDT_DO_EVENT_IRQOFF call_insn call_target + /* + * Unconditionally create a stack frame, getting the correct RSP on the + * stack (for x86-64) would take two instructions anyways, and RBP can + * be used to restore RSP to make objtool happy (see below). + */ + push %ebp + mov %esp, %ebp + + pushf + push $__KERNEL_CS + \call_insn \call_target + + /* + * "Restore" RSP from RBP, even though IRET has already unwound RSP to + * the correct value. 
objtool doesn't know the callee will IRET and, + * without the explicit restore, thinks the stack is getting walloped. + * Using an unwind hint is problematic due to x86-64's dynamic alignment. + */ + leave + RET +.endm + +SYM_FUNC_START(idt_do_interrupt_irqoff) + IDT_DO_EVENT_IRQOFF CALL_NOSPEC _ASM_ARG1 +SYM_FUNC_END(idt_do_interrupt_irqoff) + +SYM_FUNC_START(idt_do_nmi_irqoff) + IDT_DO_EVENT_IRQOFF call asm_exc_nmi_kvm_vmx +SYM_FUNC_END(idt_do_nmi_irqoff) +.popsection diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S index 894f7f16eb80..0d2768ab836c 100644 --- a/arch/x86/entry/entry_64_fred.S +++ b/arch/x86/entry/entry_64_fred.S @@ -147,5 +147,4 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) RET SYM_FUNC_END(asm_fred_entry_from_kvm) -EXPORT_SYMBOL_FOR_KVM(asm_fred_entry_from_kvm); #endif diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h index ec95fe44fa3a..cb24990f38fd 100644 --- a/arch/x86/include/asm/desc.h +++ b/arch/x86/include/asm/desc.h @@ -437,6 +437,7 @@ extern void idt_setup_early_traps(void); extern void idt_setup_traps(void); extern void idt_setup_apic_and_irq_gates(void); extern bool idt_is_f00f_address(unsigned long address); +extern void idt_entry_from_kvm(unsigned int vector); #ifdef CONFIG_X86_64 extern void idt_setup_early_pf(void); diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h index 7535131c711b..eca24b5e07f4 100644 --- a/arch/x86/include/asm/entry-common.h +++ b/arch/x86/include/asm/entry-common.h @@ -97,4 +97,6 @@ static __always_inline void arch_exit_to_user_mode(void) } #define arch_exit_to_user_mode arch_exit_to_user_mode +extern void x86_entry_from_kvm(unsigned int entry_type, unsigned int vector); + #endif diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c index 260456588756..d95d8d196cd4 100644 --- a/arch/x86/kernel/idt.c +++ b/arch/x86/kernel/idt.c @@ -266,6 +266,14 @@ void __init idt_setup_early_pf(void) idt_setup_from_table(idt_table, 
early_pf_idts, ARRAY_SIZE(early_pf_idts), true); } +#else +void idt_entry_from_kvm(unsigned int vector) +{ + if (vector == NMI_VECTOR) + idt_do_nmi_irqoff(); + else + idt_do_interrupt_irqoff(gate_offset(idt_table + vector)); +} #endif static void __init idt_map_in_cea(void) diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S index 8a481dae9cae..ff1f254a0ef4 100644 --- a/arch/x86/kvm/vmx/vmenter.S +++ b/arch/x86/kvm/vmx/vmenter.S @@ -31,38 +31,6 @@ #define VCPU_R15 __VCPU_REGS_R15 * WORD_SIZE #endif -.macro VMX_DO_EVENT_IRQOFF call_insn call_target - /* - * Unconditionally create a stack frame, getting the correct RSP on the - * stack (for x86-64) would take two instructions anyways, and RBP can - * be used to restore RSP to make objtool happy (see below). - */ - push %_ASM_BP - mov %_ASM_SP, %_ASM_BP - -#ifdef CONFIG_X86_64 - /* - * Align RSP to a 16-byte boundary (to emulate CPU behavior) before - * creating the synthetic interrupt stack frame for the IRQ/NMI. - */ - and $-16, %rsp - push $__KERNEL_DS - push %rbp -#endif - pushf - push $__KERNEL_CS - \call_insn \call_target - - /* - * "Restore" RSP from RBP, even though IRET has already unwound RSP to - * the correct value. objtool doesn't know the callee will IRET and, - * without the explicit restore, thinks the stack is getting walloped. - * Using an unwind hint is problematic due to x86-64's dynamic alignment. 
- */ - leave - RET -.endm - .section .noinstr.text, "ax" /** @@ -320,10 +288,6 @@ SYM_INNER_LABEL_ALIGN(vmx_vmexit, SYM_L_GLOBAL) SYM_FUNC_END(__vmx_vcpu_run) -SYM_FUNC_START(vmx_do_nmi_irqoff) - VMX_DO_EVENT_IRQOFF call asm_exc_nmi_kvm_vmx -SYM_FUNC_END(vmx_do_nmi_irqoff) - #ifndef CONFIG_CC_HAS_ASM_GOTO_OUTPUT /** @@ -375,13 +339,3 @@ SYM_FUNC_START(vmread_error_trampoline) RET SYM_FUNC_END(vmread_error_trampoline) #endif - -.section .text, "ax" - -#ifndef CONFIG_X86_FRED - -SYM_FUNC_START(vmx_do_interrupt_irqoff) - VMX_DO_EVENT_IRQOFF CALL_NOSPEC _ASM_ARG1 -SYM_FUNC_END(vmx_do_interrupt_irqoff) - -#endif diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index a29896a9ef14..f6f5c124ed3b 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7127,17 +7127,9 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu, "unexpected VM-Exit interrupt info: 0x%x", intr_info)) return; - /* - * Invoke the kernel's IRQ handler for the vector. Use the FRED path - * when it's available even if FRED isn't fully enabled, e.g. even if - * FRED isn't supported in hardware, in order to avoid the indirect - * CALL in the non-FRED path. - */ + /* For the IRQ to the core kernel for processing. */ kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ); - if (IS_ENABLED(CONFIG_X86_FRED)) - fred_entry_from_kvm(EVENT_TYPE_EXTINT, vector); - else - vmx_do_interrupt_irqoff(gate_offset((gate_desc *)host_idt_base + vector)); + x86_entry_from_kvm(EVENT_TYPE_EXTINT, vector); kvm_after_interrupt(vcpu); vcpu->arch.at_instruction_boundary = true; @@ -7447,10 +7439,7 @@ noinstr void vmx_handle_nmi(struct kvm_vcpu *vcpu) return; kvm_before_interrupt(vcpu, KVM_HANDLING_NMI); - if (cpu_feature_enabled(X86_FEATURE_FRED)) - fred_entry_from_kvm(EVENT_TYPE_NMI, NMI_VECTOR); - else - vmx_do_nmi_irqoff(); + x86_entry_from_kvm(EVENT_TYPE_NMI, NMI_VECTOR); kvm_after_interrupt(vcpu); } ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming 2026-04-21 20:06 ` Peter Zijlstra @ 2026-04-21 20:46 ` Peter Zijlstra 0 siblings, 0 replies; 28+ messages in thread From: Peter Zijlstra @ 2026-04-21 20:46 UTC (permalink / raw) To: Sean Christopherson Cc: Thomas Gleixner, Jim Mattson, Binbin Wu, Vishal L Verma, kvm@vger.kernel.org, Rick P Edgecombe, Binbin Wu, x86@kernel.org, Paolo Bonzini On Tue, Apr 21, 2026 at 10:06:20PM +0200, Peter Zijlstra wrote: > On Tue, Apr 21, 2026 at 11:55:33AM -0700, Sean Christopherson wrote: > > > Pulling in an earlier idea: > > > > : Now for VMX, that hrtimer_rearm_deferred() call should really go into > > : handle_external_interrupt_irqoff(), which in turn requires to export > > : __hrtimer_rearm_deferred(). > > > > > Actually, even better would be to bury the FRED vs. not-FRED details in entry > > code. E.g. on the KVM invocation side, we could get to something like the below, > > and I'm pretty sure _reduce_ the number of for-KVM exports in the process. > > Something like so then? And this one seems to build on ARCH=i386 too. 
--- diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile index 72cae8e0ce85..83b4762d6ecb 100644 --- a/arch/x86/entry/Makefile +++ b/arch/x86/entry/Makefile @@ -13,7 +13,7 @@ CFLAGS_REMOVE_syscall_64.o = $(CC_FLAGS_FTRACE) CFLAGS_syscall_32.o += -fno-stack-protector CFLAGS_syscall_64.o += -fno-stack-protector -obj-y := entry.o entry_$(BITS).o syscall_$(BITS).o +obj-y := entry.o entry_$(BITS).o syscall_$(BITS).o common.o obj-y += vdso/ obj-y += vsyscall/ diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c new file mode 100644 index 000000000000..8de94a590b26 --- /dev/null +++ b/arch/x86/entry/common.c @@ -0,0 +1,22 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#include <linux/entry-common.h> +#include <linux/kvm_types.h> +#include <linux/hrtimer_rearm.h> +#include <asm/fred.h> +#include <asm/desc.h> + +noinstr void x86_entry_from_kvm(unsigned int event_type, unsigned int vector) +{ +#ifdef CONFIG_X86_64 + fred_entry_from_kvm(event_type, vector); +#else + idt_entry_from_kvm(vector); +#endif + if (event_type == EVENT_TYPE_EXTINT) { + instrumentation_begin(); + hrtimer_rearm_deferred(); + instrumentation_end(); + } +} +EXPORT_SYMBOL_FOR_KVM(x86_entry_from_kvm); diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S index 92c0b4a94e0a..9324e97d14cf 100644 --- a/arch/x86/entry/entry_32.S +++ b/arch/x86/entry/entry_32.S @@ -1224,3 +1224,36 @@ SYM_CODE_START(rewind_stack_and_make_dead) 1: jmp 1b SYM_CODE_END(rewind_stack_and_make_dead) .popsection + +.pushsection .noinstr.text, "ax" +.macro IDT_DO_EVENT_IRQOFF call_insn call_target + /* + * Unconditionally create a stack frame, getting the correct RSP on the + * stack (for x86-64) would take two instructions anyways, and RBP can + * be used to restore RSP to make objtool happy (see below). + */ + push %ebp + mov %esp, %ebp + + pushf + push $__KERNEL_CS + \call_insn \call_target + + /* + * "Restore" RSP from RBP, even though IRET has already unwound RSP to + * the correct value. 
objtool doesn't know the callee will IRET and, + * without the explicit restore, thinks the stack is getting walloped. + * Using an unwind hint is problematic due to x86-64's dynamic alignment. + */ + leave + RET +.endm + +SYM_FUNC_START(idt_do_interrupt_irqoff) + IDT_DO_EVENT_IRQOFF CALL_NOSPEC _ASM_ARG1 +SYM_FUNC_END(idt_do_interrupt_irqoff) + +SYM_FUNC_START(idt_do_nmi_irqoff) + IDT_DO_EVENT_IRQOFF call asm_exc_nmi +SYM_FUNC_END(idt_do_nmi_irqoff) +.popsection diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S index 894f7f16eb80..0d2768ab836c 100644 --- a/arch/x86/entry/entry_64_fred.S +++ b/arch/x86/entry/entry_64_fred.S @@ -147,5 +147,4 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) RET SYM_FUNC_END(asm_fred_entry_from_kvm) -EXPORT_SYMBOL_FOR_KVM(asm_fred_entry_from_kvm); #endif diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h index ec95fe44fa3a..f44d6a606b4c 100644 --- a/arch/x86/include/asm/desc.h +++ b/arch/x86/include/asm/desc.h @@ -438,6 +438,10 @@ extern void idt_setup_traps(void); extern void idt_setup_apic_and_irq_gates(void); extern bool idt_is_f00f_address(unsigned long address); +extern void idt_do_interrupt_irqoff(unsigned int vector); +extern void idt_do_nmi_irqoff(void); +extern void idt_entry_from_kvm(unsigned int vector); + #ifdef CONFIG_X86_64 extern void idt_setup_early_pf(void); #else diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h index 7535131c711b..eca24b5e07f4 100644 --- a/arch/x86/include/asm/entry-common.h +++ b/arch/x86/include/asm/entry-common.h @@ -97,4 +97,6 @@ static __always_inline void arch_exit_to_user_mode(void) } #define arch_exit_to_user_mode arch_exit_to_user_mode +extern void x86_entry_from_kvm(unsigned int entry_type, unsigned int vector); + #endif diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index 42bf6a58ec36..db4072875f5f 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h 
@@ -633,17 +633,6 @@ DECLARE_IDTENTRY_RAW(X86_TRAP_MC, xenpv_exc_machine_check); #endif /* NMI */ - -#if IS_ENABLED(CONFIG_KVM_INTEL) -/* - * Special entry point for VMX which invokes this on the kernel stack, even for - * 64-bit, i.e. without using an IST. asm_exc_nmi() requires an IST to work - * correctly vs. the NMI 'executing' marker. Used for 32-bit kernels as well - * to avoid more ifdeffery. - */ -DECLARE_IDTENTRY(X86_TRAP_NMI, exc_nmi_kvm_vmx); -#endif - DECLARE_IDTENTRY_NMI(X86_TRAP_NMI, exc_nmi); #ifdef CONFIG_XEN_PV DECLARE_IDTENTRY_RAW(X86_TRAP_NMI, xenpv_exc_nmi); diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c index 260456588756..d95d8d196cd4 100644 --- a/arch/x86/kernel/idt.c +++ b/arch/x86/kernel/idt.c @@ -266,6 +266,14 @@ void __init idt_setup_early_pf(void) idt_setup_from_table(idt_table, early_pf_idts, ARRAY_SIZE(early_pf_idts), true); } +#else +void idt_entry_from_kvm(unsigned int vector) +{ + if (vector == NMI_VECTOR) + idt_do_nmi_irqoff(); + else + idt_do_interrupt_irqoff(gate_offset(idt_table + vector)); +} #endif static void __init idt_map_in_cea(void) diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index 3d239ed12744..06fe225fb0a2 100644 --- a/arch/x86/kernel/nmi.c +++ b/arch/x86/kernel/nmi.c @@ -609,14 +609,6 @@ DEFINE_IDTENTRY_RAW(exc_nmi) goto nmi_restart; } -#if IS_ENABLED(CONFIG_KVM_INTEL) -DEFINE_IDTENTRY_RAW(exc_nmi_kvm_vmx) -{ - exc_nmi(regs); -} -EXPORT_SYMBOL_FOR_KVM(asm_exc_nmi_kvm_vmx); -#endif - #ifdef CONFIG_NMI_CHECK_CPU static char *nmi_check_stall_msg[] = { diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S index 8a481dae9cae..ff1f254a0ef4 100644 --- a/arch/x86/kvm/vmx/vmenter.S +++ b/arch/x86/kvm/vmx/vmenter.S @@ -31,38 +31,6 @@ #define VCPU_R15 __VCPU_REGS_R15 * WORD_SIZE #endif -.macro VMX_DO_EVENT_IRQOFF call_insn call_target - /* - * Unconditionally create a stack frame, getting the correct RSP on the - * stack (for x86-64) would take two instructions anyways, and RBP can - * be 
used to restore RSP to make objtool happy (see below). - */ - push %_ASM_BP - mov %_ASM_SP, %_ASM_BP - -#ifdef CONFIG_X86_64 - /* - * Align RSP to a 16-byte boundary (to emulate CPU behavior) before - * creating the synthetic interrupt stack frame for the IRQ/NMI. - */ - and $-16, %rsp - push $__KERNEL_DS - push %rbp -#endif - pushf - push $__KERNEL_CS - \call_insn \call_target - - /* - * "Restore" RSP from RBP, even though IRET has already unwound RSP to - * the correct value. objtool doesn't know the callee will IRET and, - * without the explicit restore, thinks the stack is getting walloped. - * Using an unwind hint is problematic due to x86-64's dynamic alignment. - */ - leave - RET -.endm - .section .noinstr.text, "ax" /** @@ -320,10 +288,6 @@ SYM_INNER_LABEL_ALIGN(vmx_vmexit, SYM_L_GLOBAL) SYM_FUNC_END(__vmx_vcpu_run) -SYM_FUNC_START(vmx_do_nmi_irqoff) - VMX_DO_EVENT_IRQOFF call asm_exc_nmi_kvm_vmx -SYM_FUNC_END(vmx_do_nmi_irqoff) - #ifndef CONFIG_CC_HAS_ASM_GOTO_OUTPUT /** @@ -375,13 +339,3 @@ SYM_FUNC_START(vmread_error_trampoline) RET SYM_FUNC_END(vmread_error_trampoline) #endif - -.section .text, "ax" - -#ifndef CONFIG_X86_FRED - -SYM_FUNC_START(vmx_do_interrupt_irqoff) - VMX_DO_EVENT_IRQOFF CALL_NOSPEC _ASM_ARG1 -SYM_FUNC_END(vmx_do_interrupt_irqoff) - -#endif diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index a29896a9ef14..f6f5c124ed3b 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7127,17 +7127,9 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu, "unexpected VM-Exit interrupt info: 0x%x", intr_info)) return; - /* - * Invoke the kernel's IRQ handler for the vector. Use the FRED path - * when it's available even if FRED isn't fully enabled, e.g. even if - * FRED isn't supported in hardware, in order to avoid the indirect - * CALL in the non-FRED path. - */ + /* For the IRQ to the core kernel for processing. 
*/ kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ); - if (IS_ENABLED(CONFIG_X86_FRED)) - fred_entry_from_kvm(EVENT_TYPE_EXTINT, vector); - else - vmx_do_interrupt_irqoff(gate_offset((gate_desc *)host_idt_base + vector)); + x86_entry_from_kvm(EVENT_TYPE_EXTINT, vector); kvm_after_interrupt(vcpu); vcpu->arch.at_instruction_boundary = true; @@ -7447,10 +7439,7 @@ noinstr void vmx_handle_nmi(struct kvm_vcpu *vcpu) return; kvm_before_interrupt(vcpu, KVM_HANDLING_NMI); - if (cpu_feature_enabled(X86_FEATURE_FRED)) - fred_entry_from_kvm(EVENT_TYPE_NMI, NMI_VECTOR); - else - vmx_do_nmi_irqoff(); + x86_entry_from_kvm(EVENT_TYPE_NMI, NMI_VECTOR); kvm_after_interrupt(vcpu); } ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming 2026-04-21 18:55 ` Sean Christopherson 2026-04-21 20:06 ` Peter Zijlstra @ 2026-04-21 20:39 ` Paolo Bonzini 1 sibling, 0 replies; 28+ messages in thread From: Paolo Bonzini @ 2026-04-21 20:39 UTC (permalink / raw) To: Sean Christopherson Cc: Thomas Gleixner, Jim Mattson, Peter Zijlstra, Binbin Wu, Vishal L Verma, kvm, Rick P Edgecombe, Binbin Wu, the arch/x86 maintainers, Paolo Bonzini On Tue, Apr 21, 2026 at 19:55, Sean Christopherson <seanjc@google.com> wrote: > > > > FWIW, VMX should work just like SVM if we clear VM_EXIT_ACK_INTR_ON_EXIT. > > Hell no. > > > I know. What's the point of that VM_EXIT_ACK_INTR_ON_EXIT exercise? Is > > there any performance benefit or is it just used because it's there? > > There are performance benefits, and it preserves ordering [...] > NMIs are unconditionally "acked" on VM-Exit. Not that I disagree but... > Even if performance is "fine", changing decades of fundamental KVM behavior is > terrifying. ... it's not decades, ack on VM exit is actually relatively recent (10 years out of 20 :)). The reason why it was introduced is another killer for the idea, though. Posted interrupts require it, for some reason only known to Intel. Thanks, Paolo ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming 2026-04-21 17:11 ` Thomas Gleixner 2026-04-21 17:20 ` Jim Mattson @ 2026-04-21 19:18 ` Verma, Vishal L 1 sibling, 0 replies; 28+ messages in thread From: Verma, Vishal L @ 2026-04-21 19:18 UTC (permalink / raw) To: peterz@infradead.org, tglx@kernel.org Cc: Wu, Binbin, kvm@vger.kernel.org, bonzini@redhat.com, seanjc@google.com, binbin.wu@linux.intel.com, Edgecombe, Rick P, x86@kernel.org On Tue, 2026-04-21 at 19:11 +0200, Thomas Gleixner wrote: > > Now for VMX, that hrtimer_rearm_deferred() call should really go into > handle_external_interrupt_irqoff(), which in turn requires to export > __hrtimer_rearm_deferred(). > > But we can avoid that altogether. Something like the untested below. Tested with the below patch and the tests pass with this too. > > Thanks, > > tglx > --- > --- a/kernel/time/hrtimer.c > +++ b/kernel/time/hrtimer.c > @@ -42,9 +42,10 @@ > #include <linux/timer.h> > #include <linux/freezer.h> > #include <linux/compat.h> > - > #include <linux/uaccess.h> > > +#include <asm/irq_regs.h> > + > #include <trace/events/timer.h> > > #include "tick-internal.h" > @@ -2062,11 +2063,16 @@ void __hrtimer_rearm_deferred(void) > static __always_inline void > hrtimer_interrupt_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t expires_next) > { > - /* hrtimer_interrupt() just re-evaluated the first expiring timer */ > - cpu_base->deferred_needs_update = false; > - /* Cache the expiry time */ > - cpu_base->deferred_expires_next = expires_next; > - set_thread_flag(TIF_HRTIMER_REARM); > + /* Lies, damned lies and virt */ > + if (likely(!regs_irqs_disabled(get_irq_regs()))) { > + /* hrtimer_interrupt() just re-evaluated the first expiring timer */ > + cpu_base->deferred_needs_update = false; > + /* Cache the expiry time */ > + cpu_base->deferred_expires_next = expires_next; > + set_thread_flag(TIF_HRTIMER_REARM); > + } else { > + hrtimer_rearm(cpu_base, expires_next, false); > + } > } > #else /*
CONFIG_HRTIMER_REARM_DEFERRED */ > static __always_inline void > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming 2026-04-21 11:32 ` Peter Zijlstra 2026-04-21 11:34 ` Peter Zijlstra @ 2026-04-21 16:30 ` Thomas Gleixner 1 sibling, 0 replies; 28+ messages in thread From: Thomas Gleixner @ 2026-04-21 16:30 UTC (permalink / raw) To: Peter Zijlstra Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin, x86@kernel.org On Tue, Apr 21 2026 at 13:32, Peter Zijlstra wrote: > On Tue, Apr 21, 2026 at 01:18:58PM +0200, Peter Zijlstra wrote: >> > /* >> > + * This is sadly required due to KVM, which invokes regular >> > + * interrupt handlers with interrupt disabled state in @regs. >> > + */ >> > + instrumentation_begin(); >> > + hrtimer_rearm_deferred(); >> > + instrumentation_end(); >> > + >> > + /* >> > * IRQ flags state is correct already. Just tell RCU if it >> > * was not watching on entry. >> > */ > > Ohhh, wait. What happens if you take a page-fault from NMI context? Does > this then not result in trying to program the timer from NMI context? Uuuurgh, yes. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming 2026-04-21 7:39 ` Thomas Gleixner 2026-04-21 11:18 ` Peter Zijlstra @ 2026-04-21 16:11 ` Verma, Vishal L 1 sibling, 0 replies; 28+ messages in thread From: Verma, Vishal L @ 2026-04-21 16:11 UTC (permalink / raw) To: peterz@infradead.org, tglx@kernel.org, binbin.wu@linux.intel.com Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin, x86@kernel.org On Tue, 2026-04-21 at 09:39 +0200, Thomas Gleixner wrote: > > Subject: entry: Enforce hrtimer rearming in the irqentry_exit path > From: Thomas Gleixner <tglx@kernel.org> > Date: Tue, 21 Apr 2026 09:00:52 +0200 > > irqentry_exit_to_kernel_mode_after_preempt() invokes > hrtimer_rearm_deferred() only when the interrupted context had interrupts > enabled. That's a correct decision because the timer interrupt can only be > delivered in interrupt enabled contexts. The interrupt disabled path is > used by exceptions and traps which never touch the hrtimer mechanics. > > So much for the theory, but then there is VIRT which ruins everything. > > KVM invokes regular interrupts with pt_regs which have interrupts > disabled. That's correct from the KVM point of view, but completely > violates the obviously correct expectations of the interrupt entry/exit > code. > > Cure this by adding a hrtimer_rearm_deferred() invocation into the > interrupts-disabled path of irqentry_exit_to_kernel_mode_after_preempt(), > which is taken when the interrupted context had interrupts disabled. > > That's unfortunate when there is an actual reschedule pending, but it can't > be avoided because KVM invokes a lot of code and also reenables interrupts > _before_ reaching the point where the reschedule condition is handled. That > can delay the rearming significantly, which in turn can cause artificial > latencies.
> > Fixes: 0e98eb14814e ("entry: Prepare for deferred hrtimer rearming") > Reported-by: "Verma, Vishal L" <vishal.l.verma@intel.com> > Signed-off-by: Thomas Gleixner <tglx@kernel.org> > Closes: https://lore.kernel.org/70cd3e97fbb796e2eb2ff8cd4b7614ada05a5f24.camel@intel.com Hi Thomas, I tested this and verified it solves both the tests, no more lockups. If this is the final fix, you can add: Tested-by: Vishal Verma <vishal.l.verma@intel.com> (I'm queueing up Peter's patch on the CI now too) > --- > include/linux/irq-entry-common.h | 8 ++++++++ > 1 file changed, 8 insertions(+) > > --- a/include/linux/irq-entry-common.h > +++ b/include/linux/irq-entry-common.h > @@ -516,6 +516,14 @@ irqentry_exit_to_kernel_mode_after_preem > instrumentation_end(); > } else { > /* > + * This is sadly required due to KVM, which invokes regular > + * interrupt handlers with interrupt disabled state in @regs. > + */ > + instrumentation_begin(); > + hrtimer_rearm_deferred(); > + instrumentation_end(); > + > + /* > * IRQ flags state is correct already. Just tell RCU if it > * was not watching on entry. > */ ^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2026-04-21 20:47 UTC | newest]

Thread overview: 28+ messages
2026-04-16 20:50 CPU Lockups in KVM with deferred hrtimer rearming Verma, Vishal L
2026-04-20 15:00 ` Thomas Gleixner
2026-04-20 15:22 ` Thomas Gleixner
2026-04-20 20:57 ` Verma, Vishal L
2026-04-20 22:19 ` Thomas Gleixner
2026-04-20 22:24 ` Verma, Vishal L
2026-04-21 6:29 ` Thomas Gleixner
2026-04-21 4:51 ` Binbin Wu
2026-04-21 7:39 ` Thomas Gleixner
2026-04-21 11:18 ` Peter Zijlstra
2026-04-21 11:32 ` Peter Zijlstra
2026-04-21 11:34 ` Peter Zijlstra
2026-04-21 11:49 ` Peter Zijlstra
2026-04-21 12:05 ` Peter Zijlstra
2026-04-21 13:19 ` Peter Zijlstra
2026-04-21 13:29 ` Peter Zijlstra
2026-04-21 16:36 ` Thomas Gleixner
2026-04-21 18:11 ` Verma, Vishal L
2026-04-21 17:11 ` Thomas Gleixner
2026-04-21 17:20 ` Jim Mattson
2026-04-21 18:29 ` Thomas Gleixner
2026-04-21 18:55 ` Sean Christopherson
2026-04-21 20:06 ` Peter Zijlstra
2026-04-21 20:46 ` Peter Zijlstra
2026-04-21 20:39 ` Paolo Bonzini
2026-04-21 19:18 ` Verma, Vishal L
2026-04-21 16:30 ` Thomas Gleixner
2026-04-21 16:11 ` Verma, Vishal L