* CPU Lockups in KVM with deferred hrtimer rearming
@ 2026-04-16 20:50 Verma, Vishal L
2026-04-20 15:00 ` Thomas Gleixner
0 siblings, 1 reply; 17+ messages in thread
From: Verma, Vishal L @ 2026-04-16 20:50 UTC (permalink / raw)
To: peterz@infradead.org, tglx@kernel.org
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
x86@kernel.org
Hi Peter,
We noticed a KVM unit test 'x2apic' (APIC LVT timer one shot)
failing, and also some TDX-specific tests running into multiple CPUs in
hard lockups on a 192-CPU Emerald Rapids system, and we traced it to
the hrtimers deferred rearming merge.
Making CONFIG_HRTIMER_REARM_DEFERRED default to n in Kconfig made both
pass.
This is the hard lockup splat:
watchdog: CPU98: Watchdog detected hard LOCKUP on cpu 98
Modules linked in: openvswitch nsh tls ipt_REJECT iptable_mangle iptable_nat iptable_filter ip_tables bridge stp llc kvm_intel kvm irqbypass sunrpc
irq event stamp: 34998
hardirqs last enabled at (34997): [<ffffffffc090ce6d>] tdx_vcpu_run+0x5d/0x350 [kvm_intel]
hardirqs last disabled at (34998): [<ffffffffb9add6df>] exc_nmi+0xaf/0x1a0
softirqs last enabled at (34404): [<ffffffffb83fdd93>] __irq_exit_rcu+0xe3/0x160
softirqs last disabled at (34395): [<ffffffffb83fdd93>] __irq_exit_rcu+0xe3/0x160
CPU: 98 UID: 0 PID: 54785 Comm: qemu-system-x86 Not tainted 7.0.0-g10324ed6a556 #1 PREEMPT(full)
Hardware name: HPE ProLiant DL380 Gen11/ProLiant DL380 Gen11, BIOS 2.48 03/11/2025
RIP: 0010:vmx_do_nmi_irqoff+0x13/0x20 [kvm_intel]
Code: ff ff 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 48 83 e4 f0 6a 18 55 9c 6a 10 e8 3d db 6e f7 <c9> c3 cc cc cc cc 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90
RSP: 0018:ff8d3a069bdf3af0 EFLAGS: 00000086
RAX: ff3cc96963d68000 RBX: ff3cc96963d68000 RCX: 4000000200000000
RDX: 0000000080000200 RSI: ff3cc96963d699d0 RDI: ff3cc96963d68000
RBP: ff8d3a069bdf3af0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ff3cc968d03d0000 R14: ff3cc968d03d0000 R15: 0000000000000000
FS: 00007f26ab7fe6c0(0000) GS:ff3cc98782d76000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000001544af004 CR4: 0000000000f73ef0
PKRU: 00000000
Call Trace:
<TASK>
vmx_handle_nmi+0xdf/0x140 [kvm_intel]
tdx_vcpu_enter_exit+0xd5/0x300 [kvm_intel]
tdx_vcpu_run+0x5d/0x350 [kvm_intel]
vcpu_run+0xd4a/0x1800 [kvm]
? __local_bh_enable_ip+0x7b/0xf0
? kvm_arch_vcpu_ioctl_run+0x38b/0x5f0 [kvm]
? kvm_arch_vcpu_ioctl_run+0xb9/0x5f0 [kvm]
kvm_arch_vcpu_ioctl_run+0x38b/0x5f0 [kvm]
kvm_vcpu_ioctl+0x2ef/0xb00 [kvm]
? __fget_files+0x2b/0x190
? find_held_lock+0x2b/0x80
__x64_sys_ioctl+0x97/0xe0
do_syscall_64+0xf4/0x1540
? __x64_sys_ioctl+0xb1/0xe0
? trace_hardirqs_on_prepare+0xd2/0xf0
? do_syscall_64+0x225/0x1540
? trace_hardirqs_on+0x18/0x100
? __local_bh_enable_ip+0x7b/0xf0
? arch_do_signal_or_restart+0x155/0x250
? trace_hardirqs_off+0x4e/0xf0
? exit_to_user_mode_loop+0x150/0x4e0
? trace_hardirqs_on_prepare+0xd2/0xf0
? do_syscall_64+0x225/0x1540
? do_user_addr_fault+0x36c/0x6b0
? lockdep_hardirqs_on_prepare+0xdb/0x190
? trace_hardirqs_on+0x18/0x100
? do_syscall_64+0xab/0x1540
? exc_page_fault+0x12c/0x2b0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f45f7ae00ed
Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
RSP: 002b:00007f26ab7f3e70 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f26ab7fe6c0 RCX: 00007f45f7ae00ed
RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000099
RBP: 00007f26ab7f3ec0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f26ab7fe6c0
R13: 00007ffdc7adecd0 R14: 00007f26ab7fecdc R15: 00007ffdc7adedd7
</TASK>
I tried out an AI-assisted patch (below) which does happen to solve
it, but I'm not familiar with this area, and not sure if this is the
right fix.
---
diff --git a/include/linux/entry-virt.h b/include/linux/entry-virt.h
index bfa767702d9a..c4856c252412 100644
--- a/include/linux/entry-virt.h
+++ b/include/linux/entry-virt.h
@@ -4,6 +4,7 @@
#include <linux/static_call_types.h>
#include <linux/resume_user_mode.h>
+#include <linux/hrtimer_rearm.h>
#include <linux/syscalls.h>
#include <linux/seccomp.h>
#include <linux/sched.h>
@@ -58,6 +59,7 @@ int xfer_to_guest_mode_handle_work(void);
static inline void xfer_to_guest_mode_prepare(void)
{
lockdep_assert_irqs_disabled();
+ hrtimer_rearm_deferred();
tick_nohz_user_enter_prepare();
}
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 5bd6efe598f0..f3bd084d9a72 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -2058,6 +2058,7 @@ void __hrtimer_rearm_deferred(void)
}
hrtimer_rearm(cpu_base, expires_next, true);
}
+EXPORT_SYMBOL_GPL(__hrtimer_rearm_deferred);
static __always_inline void
hrtimer_interrupt_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t expires_next)
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-16 20:50 CPU Lockups in KVM with deferred hrtimer rearming Verma, Vishal L
@ 2026-04-20 15:00 ` Thomas Gleixner
2026-04-20 15:22 ` Thomas Gleixner
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Thomas Gleixner @ 2026-04-20 15:00 UTC (permalink / raw)
To: Verma, Vishal L, peterz@infradead.org
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
x86@kernel.org
On Thu, Apr 16 2026 at 20:50, Vishal L. Verma wrote:
> I tried out an AI-assisted patch (below) which does happen to solve
> it, but I'm not familiar with this area, and not sure if this is the
> right fix.
>
> diff --git a/include/linux/entry-virt.h b/include/linux/entry-virt.h
> index bfa767702d9a..c4856c252412 100644
> --- a/include/linux/entry-virt.h
> +++ b/include/linux/entry-virt.h
> @@ -4,6 +4,7 @@
>
> #include <linux/static_call_types.h>
> #include <linux/resume_user_mode.h>
> +#include <linux/hrtimer_rearm.h>
> #include <linux/syscalls.h>
> #include <linux/seccomp.h>
> #include <linux/sched.h>
> @@ -58,6 +59,7 @@ int xfer_to_guest_mode_handle_work(void);
> static inline void xfer_to_guest_mode_prepare(void)
> {
> lockdep_assert_irqs_disabled();
> + hrtimer_rearm_deferred();
> tick_nohz_user_enter_prepare();
This code should never be reached with a rearm pending. Something else
went wrong earlier. So while the patch "works" it papers over the
underlying problem.
Can you please do the following:
1) Apply the patch below
2) Enable function tracing and the hrtimer* trace events
3) Enable tracing if it has been disabled already
echo 1 >/sys/kernel/tracing/tracing_on
4) Run the tests and wait for /sys/kernel/tracing/tracing_on to
become 0, which means the problem triggered.
5) Retrieve the trace from /sys/kernel/tracing/trace and provide it
somewhere to download from or send it to me compressed offlist.
Thanks,
tglx
---
diff --git a/include/linux/entry-virt.h b/include/linux/entry-virt.h
index bfa767702d9a..ab73963a7496 100644
--- a/include/linux/entry-virt.h
+++ b/include/linux/entry-virt.h
@@ -58,6 +58,10 @@ int xfer_to_guest_mode_handle_work(void);
static inline void xfer_to_guest_mode_prepare(void)
{
lockdep_assert_irqs_disabled();
+ if (test_thread_flag(TIF_HRTIMER_REARM)) {
+ tracing_off();
+ hrtimer_rearm_deferred();
+ }
tick_nohz_user_enter_prepare();
}
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-20 15:00 ` Thomas Gleixner
@ 2026-04-20 15:22 ` Thomas Gleixner
2026-04-20 20:57 ` Verma, Vishal L
2026-04-21 4:51 ` Binbin Wu
2 siblings, 0 replies; 17+ messages in thread
From: Thomas Gleixner @ 2026-04-20 15:22 UTC (permalink / raw)
To: Verma, Vishal L, peterz@infradead.org
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
x86@kernel.org
On Mon, Apr 20 2026 at 17:00, Thomas Gleixner wrote:
> On Thu, Apr 16 2026 at 20:50, Vishal L. Verma wrote:
> This code should never be reached with a rearm pending. Something else
> went wrong earlier. So while the patch "works" it papers over the
> underlying problem.
Peter just noticed that this should be fixed with
1f5ffc672165 ("Fix mismerge of the arm64 / timer-core interrupt handling changes")
Thanks,
tglx
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-20 15:00 ` Thomas Gleixner
2026-04-20 15:22 ` Thomas Gleixner
@ 2026-04-20 20:57 ` Verma, Vishal L
2026-04-20 22:19 ` Thomas Gleixner
2026-04-21 4:51 ` Binbin Wu
2 siblings, 1 reply; 17+ messages in thread
From: Verma, Vishal L @ 2026-04-20 20:57 UTC (permalink / raw)
To: peterz@infradead.org, tglx@kernel.org
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
x86@kernel.org
On Mon, 2026-04-20 at 17:00 +0200, Thomas Gleixner wrote:
>
> This code should never be reached with a rearm pending. Something else
> went wrong earlier. So while the patch "works" it papers over the
> underlying problem.
>
> Can you please do the following:
>
> 1) Apply the patch below
>
> 2) Enable function tracing and the hrtimer* trace events
>
> 3) Enable tracing if it has been disabled already
>
> echo 1 >/sys/kernel/tracing/tracing_on
>
> 4) Run the tests and wait for /sys/kernel/tracing/tracing_on to
> become 0, which means the problem triggered.
>
> 5) Retrieve the trace from /sys/kernel/tracing/trace and provide it
> somewhere to download from or send it to me compressed offlist.
Hi Thomas,
I've uploaded the trace here (~75MB compressed):
https://drive.proton.me/urls/B9PY61XQ0C#07XwTVhE46eB
As for:
1f5ffc672165 ("Fix mismerge of the arm64 / timer-core interrupt handling changes")
I already had that commit in the branch that was tested and it didn't
fix it.
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-20 20:57 ` Verma, Vishal L
@ 2026-04-20 22:19 ` Thomas Gleixner
2026-04-20 22:24 ` Verma, Vishal L
0 siblings, 1 reply; 17+ messages in thread
From: Thomas Gleixner @ 2026-04-20 22:19 UTC (permalink / raw)
To: Verma, Vishal L, peterz@infradead.org
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
x86@kernel.org
On Mon, Apr 20 2026 at 20:57, Verma, Vishal L wrote:
> On Mon, 2026-04-20 at 17:00 +0200, Thomas Gleixner wrote:
>>
>> This code should never be reached with a rearm pending. Something else
>> went wrong earlier. So while the patch "works" it papers over the
>> underlying problem.
>>
>> Can you please do the following:
>>
>> 1) Apply the patch below
>>
>> 2) Enable function tracing and the hrtimer* trace events
>>
>> 3) Enable tracing if it has been disabled already
>>
>> echo 1 >/sys/kernel/tracing/tracing_on
>>
>> 4) Run the tests and wait for /sys/kernel/tracing/tracing_on to
>> become 0, which means the problem triggered.
>>
>> 5) Retrieve the trace from /sys/kernel/tracing/trace and provide it
>> somewhere to download from or send it to me compressed offlist.
>
> Hi Thomas,
>
> I've uploaded the trace here (~75MB compressed):
> https://drive.proton.me/urls/B9PY61XQ0C#07XwTVhE46eB
>
> As for:
>
> 1f5ffc672165 ("Fix mismerge of the arm64 / timer-core interrupt handling changes")
>
> I already had that commit in the branch that was tested and it didn't
> fix it.
Thanks for the update. Can you try to provide the information I asked
for above?
Thanks,
tglx
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-20 22:19 ` Thomas Gleixner
@ 2026-04-20 22:24 ` Verma, Vishal L
2026-04-21 6:29 ` Thomas Gleixner
0 siblings, 1 reply; 17+ messages in thread
From: Verma, Vishal L @ 2026-04-20 22:24 UTC (permalink / raw)
To: peterz@infradead.org, tglx@kernel.org
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
x86@kernel.org
On Tue, 2026-04-21 at 00:19 +0200, Thomas Gleixner wrote:
> On Mon, Apr 20 2026 at 20:57, Verma, Vishal L wrote:
> > On Mon, 2026-04-20 at 17:00 +0200, Thomas Gleixner wrote:
> > >
> > > This code should never be reached with a rearm pending. Something else
> > > went wrong earlier. So while the patch "works" it papers over the
> > > underlying problem.
> > >
> > > Can you please do the following:
> > >
> > > 1) Apply the patch below
> > >
> > > 2) Enable function tracing and the hrtimer* trace events
> > >
> > > 3) Enable tracing if it has been disabled already
> > >
> > > echo 1 >/sys/kernel/tracing/tracing_on
> > >
> > > 4) Run the tests and wait for /sys/kernel/tracing/tracing_on to
> > > become 0, which means the problem triggered.
> > >
> > > 5) Retrieve the trace from /sys/kernel/tracing/trace and provide it
> > > somewhere to download from or send it to me compressed offlist.
> >
> > Hi Thomas,
> >
> > I've uploaded the trace here (~75MB compressed):
> > https://drive.proton.me/urls/B9PY61XQ0C#07XwTVhE46eB
> >
> > As for:
> >
> > 1f5ffc672165 ("Fix mismerge of the arm64 / timer-core interrupt handling changes")
> >
> > I already had that commit in the branch that was tested and it didn't
> > fix it.
>
> Thanks for the update. Can you try to provide the information I asked
> for above?
>
Ah sorry - I should've said that with your patch applied, tracing_on
did become 0, so the problem was triggered.
The trace from that is in the URL above.
This is how I collected it:
tracefs=/sys/kernel/tracing
echo 4096 > "$tracefs"/buffer_size_kb
echo function > "$tracefs"/current_tracer
echo 1 > "$tracefs"/events/hrtimer/enable
echo 1 > "$tracefs"/tracing_on
<run the test>
tracing_on="$(cat "$tracefs"/tracing_on)"
if [ "$tracing_on" -eq 0 ]; then
echo "Debug patch triggered, collecting trace"
cat "$tracefs"/trace | gzip > /tmp/hrtimer_rearm_trace.gz
else
echo "Debug patch did not trigger (tracing_on still 1)"
fi
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-20 15:00 ` Thomas Gleixner
2026-04-20 15:22 ` Thomas Gleixner
2026-04-20 20:57 ` Verma, Vishal L
@ 2026-04-21 4:51 ` Binbin Wu
2026-04-21 7:39 ` Thomas Gleixner
2 siblings, 1 reply; 17+ messages in thread
From: Binbin Wu @ 2026-04-21 4:51 UTC (permalink / raw)
To: Thomas Gleixner, Verma, Vishal L, peterz@infradead.org
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
x86@kernel.org
On 4/20/2026 11:00 PM, Thomas Gleixner wrote:
> On Thu, Apr 16 2026 at 20:50, Vishal L. Verma wrote:
>> I tried out an AI-assisted patch (below) which does happen to solve
>> it, but I'm not familiar with this area, and not sure if this is the
>> right fix.
>>
>> diff --git a/include/linux/entry-virt.h b/include/linux/entry-virt.h
>> index bfa767702d9a..c4856c252412 100644
>> --- a/include/linux/entry-virt.h
>> +++ b/include/linux/entry-virt.h
>> @@ -4,6 +4,7 @@
>>
>> #include <linux/static_call_types.h>
>> #include <linux/resume_user_mode.h>
>> +#include <linux/hrtimer_rearm.h>
>> #include <linux/syscalls.h>
>> #include <linux/seccomp.h>
>> #include <linux/sched.h>
>> @@ -58,6 +59,7 @@ int xfer_to_guest_mode_handle_work(void);
>> static inline void xfer_to_guest_mode_prepare(void)
>> {
>> lockdep_assert_irqs_disabled();
>> + hrtimer_rearm_deferred();
>> tick_nohz_user_enter_prepare();
>
>
> This code should never be reached with a rearm pending. Something else
> went wrong earlier. So while the patch "works" it papers over the
> underlying problem.
IIUC, the problem might be:
HRTimer -> VMExit:
[IRQ is disabled]
kvm_x86_call(handle_exit_irqoff)(vcpu)
vmx_handle_exit_irqoff
handle_external_interrupt_irqoff
sysvec_apic_timer_interrupt
irqentry_enter
...
irqentry_exit
irqentry_exit_to_kernel_mode
if (!regs_irqs_disabled(regs))   // <-- this is false, so the
    hrtimer_rearm_deferred()     //     hrtimer rearm is skipped!
This issue is triggered on TDX since TDX can't use the preemption timer, while
a normal VMX VM uses the preemption timer by default.
>
> Can you please do the following:
>
> 1) Apply the patch below
>
> 2) Enable function tracing and the hrtimer* trace events
>
> 3) Enable tracing if it has been disabled already
>
> echo 1 >/sys/kernel/tracing/tracing_on
>
> 4) Run the tests and wait for /sys/kernel/tracing/tracing_on to
> become 0, which means the problem triggered.
>
> 5) Retrieve the trace from /sys/kernel/tracing/trace and provide it
> somewhere to download from or send it to me compressed offlist.
>
> Thanks,
>
> tglx
> ---
>
> diff --git a/include/linux/entry-virt.h b/include/linux/entry-virt.h
> index bfa767702d9a..ab73963a7496 100644
> --- a/include/linux/entry-virt.h
> +++ b/include/linux/entry-virt.h
> @@ -58,6 +58,10 @@ int xfer_to_guest_mode_handle_work(void);
> static inline void xfer_to_guest_mode_prepare(void)
> {
> lockdep_assert_irqs_disabled();
> + if (test_thread_flag(TIF_HRTIMER_REARM)) {
> + tracing_off();
> + hrtimer_rearm_deferred();
> + }
> tick_nohz_user_enter_prepare();
> }
>
>
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-20 22:24 ` Verma, Vishal L
@ 2026-04-21 6:29 ` Thomas Gleixner
0 siblings, 0 replies; 17+ messages in thread
From: Thomas Gleixner @ 2026-04-21 6:29 UTC (permalink / raw)
To: Verma, Vishal L, peterz@infradead.org
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
x86@kernel.org
On Mon, Apr 20 2026 at 22:24, Verma, Vishal L wrote:
> On Tue, 2026-04-21 at 00:19 +0200, Thomas Gleixner wrote:
>> Thanks for the update. Can you try to provide the information I asked
>> for above?
>>
> Ah sorry - I should've said that with your patch applied, tracing_on
> did become 0, so the problem was triggered.
>
> The trace from that is in the URL above.
I clearly can't read :)
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 4:51 ` Binbin Wu
@ 2026-04-21 7:39 ` Thomas Gleixner
2026-04-21 11:18 ` Peter Zijlstra
2026-04-21 16:11 ` Verma, Vishal L
0 siblings, 2 replies; 17+ messages in thread
From: Thomas Gleixner @ 2026-04-21 7:39 UTC (permalink / raw)
To: Binbin Wu, Verma, Vishal L, peterz@infradead.org
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
x86@kernel.org
On Tue, Apr 21 2026 at 12:51, Binbin Wu wrote:
> On 4/20/2026 11:00 PM, Thomas Gleixner wrote:
>>> static inline void xfer_to_guest_mode_prepare(void)
>>> {
>>> lockdep_assert_irqs_disabled();
>>> + hrtimer_rearm_deferred();
>>> tick_nohz_user_enter_prepare();
>>
>>
>> This code should never be reached with a rearm pending. Something else
>> went wrong earlier. So while the patch "works" it papers over the
>> underlying problem.
>
> IIUC, the problem might be:
>
> HRTimer -> VMExit:
> [IRQ is disabled]
> kvm_x86_call(handle_exit_irqoff)(vcpu)
> vmx_handle_exit_irqoff
> handle_external_interrupt_irqoff
> sysvec_apic_timer_interrupt
> irqentry_enter
> ...
> irqentry_exit
> irqentry_exit_to_kernel_mode
> if (!regs_irqs_disabled(regs)) //<-- This is false, hrtimer
> hrtimer_rearm_deferred() rearm is skipped!
>
>
> This issue is triggered on TDX since TDX can't use preemption timer while normal
> VMX VM uses preemption timer by default.
Kinda.
The issue is that vmx_handle_exit_irqoff() always hands in regs with
regs->flags.X86_EFLAGS_IF == 0. That has absolutely nothing to do with
TDX and the preemption timer.
The patch below solves the problem right there in the exit code, which
is unfortunate as there might be a NEED_RESCHED pending. But that can't
be taken into account as KVM enables interrupts _before_ reaching the
exit work point.
Yet another proof that virt creates more problems than it solves.
Thanks,
tglx
---
Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
From: Thomas Gleixner <tglx@kernel.org>
Date: Tue, 21 Apr 2026 09:00:52 +0200
irqentry_exit_to_kernel_mode_after_preempt() invokes
hrtimer_rearm_deferred() only when the interrupted context had interrupts
enabled. That's a correct decision because the timer interrupt can only be
delivered in interrupt enabled contexts. The interrupt disabled path is
used by exceptions and traps which never touch the hrtimer mechanics.
So much for the theory, but then there is VIRT which ruins everything.
KVM invokes regular interrupts with pt_regs which have interrupts
disabled. That's correct from the KVM point of view, but completely
violates the obviously correct expectations of the interrupt entry/exit
code.
Cure this by adding a hrtimer_rearm_deferred() invocation to the
'interrupted context has interrupts disabled' path of
irqentry_exit_to_kernel_mode_after_preempt().
That's unfortunate when there is an actual reschedule pending, but it can't
be avoided because KVM invokes a lot of code and also reenables interrupts
_before_ reaching the point where the reschedule condition is handled. That
can delay the rearming significantly, which in turn can cause artificial
latencies.
Fixes: 0e98eb14814e ("entry: Prepare for deferred hrtimer rearming")
Reported-by: "Verma, Vishal L" <vishal.l.verma@intel.com>
Closes: https://lore.kernel.org/70cd3e97fbb796e2eb2ff8cd4b7614ada05a5f24.camel@intel.com
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
---
include/linux/irq-entry-common.h | 8 ++++++++
1 file changed, 8 insertions(+)
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -516,6 +516,14 @@ irqentry_exit_to_kernel_mode_after_preem
instrumentation_end();
} else {
/*
+ * This is sadly required due to KVM, which invokes regular
+ * interrupt handlers with interrupt disabled state in @regs.
+ */
+ instrumentation_begin();
+ hrtimer_rearm_deferred();
+ instrumentation_end();
+
+ /*
* IRQ flags state is correct already. Just tell RCU if it
* was not watching on entry.
*/
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 7:39 ` Thomas Gleixner
@ 2026-04-21 11:18 ` Peter Zijlstra
2026-04-21 11:32 ` Peter Zijlstra
2026-04-21 16:11 ` Verma, Vishal L
1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2026-04-21 11:18 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
Edgecombe, Rick P, Wu, Binbin, x86@kernel.org
On Tue, Apr 21, 2026 at 09:39:14AM +0200, Thomas Gleixner wrote:
> ---
> Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
> From: Thomas Gleixner <tglx@kernel.org>
> Date: Tue, 21 Apr 2026 09:00:52 +0200
>
> irqentry_exit_to_kernel_mode_after_preempt() invokes
> hrtimer_rearm_deferred() only when the interrupted context had interrupts
> enabled. That's a correct decision because the timer interrupt can only be
> delivered in interrupt enabled contexts. The interrupt disabled path is
> used by exceptions and traps which never touch the hrtimer mechanics.
>
> So much for the theory, but then there is VIRT which ruins everything.
>
> KVM invokes regular interrupts with pt_regs which have interrupts
> disabled. That's correct from the KVM point of view, but completely
> violates the obviously correct expectations of the interrupt entry/exit
> code.
Mooo :-(
That also complicates the comment that goes with
hrtimer_rearm_deferred(). Not sure how to 'fix' that.
> Cure this by adding a hrtimer_rearm_deferred() invocation to the
> 'interrupted context has interrupts disabled' path of
> irqentry_exit_to_kernel_mode_after_preempt().
>
> That's unfortunate when there is an actual reschedule pending, but it can't
> be avoided because KVM invokes a lot of code and also reenables interrupts
> _before_ reaching the point where the reschedule condition is handled. That
> can delay the rearming significantly, which in turn can cause artificial
> latencies.
Yeah, this is a trainwreck. If they want it better, KVM needs to get
'fixed' to not play silly games like this.
> Fixes: 0e98eb14814e ("entry: Prepare for deferred hrtimer rearming")
> Reported-by: "Verma, Vishal L" <vishal.l.verma@intel.com>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> Closes: https://lore.kernel.org/70cd3e97fbb796e2eb2ff8cd4b7614ada05a5f24.camel@intel.com
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> include/linux/irq-entry-common.h | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> --- a/include/linux/irq-entry-common.h
> +++ b/include/linux/irq-entry-common.h
> @@ -516,6 +516,14 @@ irqentry_exit_to_kernel_mode_after_preem
> instrumentation_end();
> } else {
> /*
> + * This is sadly required due to KVM, which invokes regular
> + * interrupt handlers with interrupt disabled state in @regs.
> + */
> + instrumentation_begin();
> + hrtimer_rearm_deferred();
> + instrumentation_end();
> +
> + /*
> * IRQ flags state is correct already. Just tell RCU if it
> * was not watching on entry.
> */
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 11:18 ` Peter Zijlstra
@ 2026-04-21 11:32 ` Peter Zijlstra
2026-04-21 11:34 ` Peter Zijlstra
0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2026-04-21 11:32 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
Edgecombe, Rick P, Wu, Binbin, x86@kernel.org
On Tue, Apr 21, 2026 at 01:18:58PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 09:39:14AM +0200, Thomas Gleixner wrote:
>
> > ---
> > Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
> > From: Thomas Gleixner <tglx@kernel.org>
> > Date: Tue, 21 Apr 2026 09:00:52 +0200
> >
> > irqentry_exit_to_kernel_mode_after_preempt() invokes
> > hrtimer_rearm_deferred() only when the interrupted context had interrupts
> > enabled. That's a correct decision because the timer interrupt can only be
> > delivered in interrupt enabled contexts. The interrupt disabled path is
> > used by exceptions and traps which never touch the hrtimer mechanics.
> >
> > So much for the theory, but then there is VIRT which ruins everything.
> >
> > KVM invokes regular interrupts with pt_regs which have interrupts
> > disabled. That's correct from the KVM point of view, but completely
> > violates the obviously correct expectations of the interrupt entry/exit
> > code.
>
> Mooo :-(
>
> That also complicates the comment that goes with
> hrtimer_rearm_deferred(). Not sure how to 'fix' that.
>
> > Cure this by adding a hrtimer_rearm_deferred() invocation to the
> > 'interrupted context has interrupts disabled' path of
> > irqentry_exit_to_kernel_mode_after_preempt().
> >
> > That's unfortunate when there is an actual reschedule pending, but it can't
> > be avoided because KVM invokes a lot of code and also reenables interrupts
> > _before_ reaching the point where the reschedule condition is handled. That
> > can delay the rearming significantly, which in turn can cause artificial
> > latencies.
>
> Yeah, this is a trainwreck. If they want it better, KVM needs to get
> 'fixed' to not play silly games like this.
>
> > Fixes: 0e98eb14814e ("entry: Prepare for deferred hrtimer rearming")
> > Reported-by: "Verma, Vishal L" <vishal.l.verma@intel.com>
> > Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> > Closes: https://lore.kernel.org/70cd3e97fbb796e2eb2ff8cd4b7614ada05a5f24.camel@intel.com
>
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>
> > ---
> > include/linux/irq-entry-common.h | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > --- a/include/linux/irq-entry-common.h
> > +++ b/include/linux/irq-entry-common.h
> > @@ -516,6 +516,14 @@ irqentry_exit_to_kernel_mode_after_preem
> > instrumentation_end();
> > } else {
> > /*
> > + * This is sadly required due to KVM, which invokes regular
> > + * interrupt handlers with interrupt disabled state in @regs.
> > + */
> > + instrumentation_begin();
> > + hrtimer_rearm_deferred();
> > + instrumentation_end();
> > +
> > + /*
> > * IRQ flags state is correct already. Just tell RCU if it
> > * was not watching on entry.
> > */
Ohhh, wait. What happens if you take a page-fault from NMI context? Does
this then not result in trying to program the timer from NMI context?
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 11:32 ` Peter Zijlstra
@ 2026-04-21 11:34 ` Peter Zijlstra
2026-04-21 11:49 ` Peter Zijlstra
0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2026-04-21 11:34 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
Edgecombe, Rick P, Wu, Binbin, x86@kernel.org
On Tue, Apr 21, 2026 at 01:32:12PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 01:18:58PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 21, 2026 at 09:39:14AM +0200, Thomas Gleixner wrote:
> >
> > > ---
> > > Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
> > > From: Thomas Gleixner <tglx@kernel.org>
> > > Date: Tue, 21 Apr 2026 09:00:52 +0200
> > >
> > > irqentry_exit_to_kernel_mode_after_preempt() invokes
> > > hrtimer_rearm_deferred() only when the interrupted context had interrupts
> > > enabled. That's a correct decision because the timer interrupt can only be
> > > delivered in interrupt enabled contexts. The interrupt disabled path is
> > > used by exceptions and traps which never touch the hrtimer mechanics.
> > >
> > > So much for the theory, but then there is VIRT which ruins everything.
> > >
> > > KVM invokes regular interrupts with pt_regs which have interrupts
> > > disabled. That's correct from the KVM point of view, but completely
> > > violates the obviously correct expectations of the interrupt entry/exit
> > > code.
> >
> > Mooo :-(
Also, is this a x86/KVM 'special' or is this true for all arch/KVM that
use GENERIC_ENTRY?
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 11:34 ` Peter Zijlstra
@ 2026-04-21 11:49 ` Peter Zijlstra
2026-04-21 12:05 ` Peter Zijlstra
0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2026-04-21 11:49 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
Edgecombe, Rick P, Wu, Binbin, x86@kernel.org
On Tue, Apr 21, 2026 at 01:34:07PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 01:32:12PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 21, 2026 at 01:18:58PM +0200, Peter Zijlstra wrote:
> > > On Tue, Apr 21, 2026 at 09:39:14AM +0200, Thomas Gleixner wrote:
> > >
> > > > ---
> > > > Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
> > > > From: Thomas Gleixner <tglx@kernel.org>
> > > > Date: Tue, 21 Apr 2026 09:00:52 +0200
> > > >
> > > > irqentry_exit_to_kernel_mode_after_preempt() invokes
> > > > hrtimer_rearm_deferred() only when the interrupted context had interrupts
> > > > enabled. That's a correct decision because the timer interrupt can only be
> > > > delivered in interrupt enabled contexts. The interrupt disabled path is
> > > > used by exceptions and traps which never touch the hrtimer mechanics.
> > > >
> > > > So much for the theory, but then there is VIRT which ruins everything.
> > > >
> > > > KVM invokes regular interrupts with pt_regs which have interrupts
> > > > disabled. That's correct from the KVM point of view, but completely
> > > > violates the obviously correct expectations of the interrupt entry/exit
> > > > code.
> > >
> > > Mooo :-(
>
> Also, is this a x86/KVM 'special' or is this true for all arch/KVM that
> use GENERIC_ENTRY?
Should we not make asm_fred_entry_from_kvm()/VMX_DO_EVENT_IRQOFF fix IF
on the fake frame instead? We know it will enable IRQs after doing
handle_exit_irqoff() in vcpu_enter_guest().
SVM does not seem affected by this particular insanity.
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 11:49 ` Peter Zijlstra
@ 2026-04-21 12:05 ` Peter Zijlstra
2026-04-21 13:19 ` Peter Zijlstra
0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2026-04-21 12:05 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
Edgecombe, Rick P, Wu, Binbin, x86@kernel.org
On Tue, Apr 21, 2026 at 01:49:40PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 01:34:07PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 21, 2026 at 01:32:12PM +0200, Peter Zijlstra wrote:
> > > On Tue, Apr 21, 2026 at 01:18:58PM +0200, Peter Zijlstra wrote:
> > > > On Tue, Apr 21, 2026 at 09:39:14AM +0200, Thomas Gleixner wrote:
> > > >
> > > > > ---
> > > > > Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
> > > > > From: Thomas Gleixner <tglx@kernel.org>
> > > > > Date: Tue, 21 Apr 2026 09:00:52 +0200
> > > > >
> > > > > irqentry_exit_to_kernel_mode_after_preempt() invokes
> > > > > hrtimer_rearm_deferred() only when the interrupted context had interrupts
> > > > > enabled. That's a correct decision because the timer interrupt can only be
> > > > > delivered in interrupt enabled contexts. The interrupt disabled path is
> > > > > used by exceptions and traps which never touch the hrtimer mechanics.
> > > > >
> > > > > So much for the theory, but then there is VIRT which ruins everything.
> > > > >
> > > > > KVM invokes regular interrupts with pt_regs which have interrupts
> > > > > disabled. That's correct from the KVM point of view, but completely
> > > > > violates the obviously correct expectations of the interrupt entry/exit
> > > > > code.
> > > >
> > > > Mooo :-(
> >
> > Also, is this a x86/KVM 'special' or is this true for all arch/KVM that
> > use GENERIC_ENTRY?
>
> Should we not make asm_fred_entry_from_kvm()/VMX_DO_EVENT_IRQOFF fix IF
> on the fake frame instead? We know it will enable IRQs after doing
> handle_exit_irqoff() in vcpu_enter_guest().
Moo, you can't do that either, because it will ERETS/IRET and fuck up
the state :/
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 12:05 ` Peter Zijlstra
@ 2026-04-21 13:19 ` Peter Zijlstra
2026-04-21 13:29 ` Peter Zijlstra
0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2026-04-21 13:19 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
Edgecombe, Rick P, Wu, Binbin, x86@kernel.org
On Tue, Apr 21, 2026 at 02:05:31PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 01:49:40PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 21, 2026 at 01:34:07PM +0200, Peter Zijlstra wrote:
> > > On Tue, Apr 21, 2026 at 01:32:12PM +0200, Peter Zijlstra wrote:
> > > > On Tue, Apr 21, 2026 at 01:18:58PM +0200, Peter Zijlstra wrote:
> > > > > On Tue, Apr 21, 2026 at 09:39:14AM +0200, Thomas Gleixner wrote:
> > > > >
> > > > > > ---
> > > > > > Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
> > > > > > From: Thomas Gleixner <tglx@kernel.org>
> > > > > > Date: Tue, 21 Apr 2026 09:00:52 +0200
> > > > > >
> > > > > > irqentry_exit_to_kernel_mode_after_preempt() invokes
> > > > > > hrtimer_rearm_deferred() only when the interrupted context had interrupts
> > > > > > enabled. That's a correct decision because the timer interrupt can only be
> > > > > > delivered in interrupt enabled contexts. The interrupt disabled path is
> > > > > > used by exceptions and traps which never touch the hrtimer mechanics.
> > > > > >
> > > > > > So much for the theory, but then there is VIRT which ruins everything.
> > > > > >
> > > > > > KVM invokes regular interrupts with pt_regs which have interrupts
> > > > > > disabled. That's correct from the KVM point of view, but completely
> > > > > > violates the obviously correct expectations of the interrupt entry/exit
> > > > > > code.
> > > > >
> > > > > Mooo :-(
> > >
> > > Also, is this a x86/KVM 'special' or is this true for all arch/KVM that
> > > use GENERIC_ENTRY?
> >
> > Should we not make asm_fred_entry_from_kvm()/VMX_DO_EVENT_IRQOFF fix IF
> > on the fake frame instead? We know it will enable IRQs after doing
> > handle_exit_irqoff() in vcpu_enter_guest().
>
> Moo, you can't do that either, because it will ERETS/IRET and fuck up
> the state :/
How insane is something like this?
---
diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
index 894f7f16eb80..f3e2a8fde1ab 100644
--- a/arch/x86/entry/entry_64_fred.S
+++ b/arch/x86/entry/entry_64_fred.S
@@ -98,6 +98,7 @@ SYM_FUNC_START(asm_fred_entry_from_kvm)
push %rdi /* fred_ss handed in by the caller */
push %rbp
pushf
+ or $X86_EFLAGS_KVM, (%rsp)
push $__KERNEL_CS
/*
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index 7535131c711b..aab93f07e768 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -97,4 +97,16 @@ static __always_inline void arch_exit_to_user_mode(void)
}
#define arch_exit_to_user_mode arch_exit_to_user_mode
+static __always_inline void arch_exit_to_kernel_mode(struct pt_regs *regs)
+{
+#ifdef CONFIG_KVM_INTEL
+ /*
+ * KVM is a reserved bit and must always be 0. Hardware will #GP on
+ * IRET/ERETS with this bit set.
+ */
+ regs->flags &= ~X86_EFLAGS_KVM;
+#endif
+}
+#define arch_exit_to_kernel_mode arch_exit_to_kernel_mode
+
#endif
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 7bb7bd90355d..c31f7bc2eba2 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -311,7 +311,15 @@ void user_stack_pointer_set(struct pt_regs *regs, unsigned long val)
static __always_inline bool regs_irqs_disabled(struct pt_regs *regs)
{
- return !(regs->flags & X86_EFLAGS_IF);
+ /*
+ * return context | IF | KVM
+ * ---------------+----+----
+ * IRQ-off | 0 | 0
+ * IRQ-on | 0 | 1
+ * IRQ-on | 1 | 0
+ * invalid | 1 | 1
+ */
+ return (regs->flags & (X86_EFLAGS_IF | X86_EFLAGS_KVM)) == 0;
}
/* Query offset/name of register from its name/offset */
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index 81d0c8bf1137..d32edefde587 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -14,6 +14,8 @@
#define X86_EFLAGS_FIXED _BITUL(X86_EFLAGS_FIXED_BIT)
#define X86_EFLAGS_PF_BIT 2 /* Parity Flag */
#define X86_EFLAGS_PF _BITUL(X86_EFLAGS_PF_BIT)
+#define X86_EFLAGS_KVM_BIT 3 /* KVM Flag -- must be 0 */
+#define X86_EFLAGS_KVM _BITUL(X86_EFLAGS_KVM_BIT)
#define X86_EFLAGS_AF_BIT 4 /* Auxiliary carry Flag */
#define X86_EFLAGS_AF _BITUL(X86_EFLAGS_AF_BIT)
#define X86_EFLAGS_ZF_BIT 6 /* Zero Flag */
diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index 8a481dae9cae..3d0d0fb8de79 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -50,6 +50,7 @@
push %rbp
#endif
pushf
+ or $X86_EFLAGS_KVM, (%_ASM_SP)
push $__KERNEL_CS
\call_insn \call_target
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index 167fba7dbf04..0acc20b63513 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -167,6 +167,10 @@ static __always_inline void arch_exit_to_user_mode(void);
static __always_inline void arch_exit_to_user_mode(void) { }
#endif
+#ifndef arch_exit_to_kernel_mode
+static __always_inline void arch_exit_to_kernel_mode(struct pt_regs *regs) { }
+#endif
+
/**
* arch_do_signal_or_restart - Architecture specific signal delivery function
* @regs: Pointer to currents pt_regs
@@ -548,6 +552,7 @@ static __always_inline void irqentry_exit_to_kernel_mode(struct pt_regs *regs,
instrumentation_end();
irqentry_exit_to_kernel_mode_after_preempt(regs, state);
+ arch_exit_to_kernel_mode(regs);
}
/**
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 13:19 ` Peter Zijlstra
@ 2026-04-21 13:29 ` Peter Zijlstra
0 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2026-04-21 13:29 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
Edgecombe, Rick P, Wu, Binbin, x86@kernel.org
On Tue, Apr 21, 2026 at 03:19:53PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 02:05:31PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 21, 2026 at 01:49:40PM +0200, Peter Zijlstra wrote:
> > > On Tue, Apr 21, 2026 at 01:34:07PM +0200, Peter Zijlstra wrote:
> > > > On Tue, Apr 21, 2026 at 01:32:12PM +0200, Peter Zijlstra wrote:
> > > > > On Tue, Apr 21, 2026 at 01:18:58PM +0200, Peter Zijlstra wrote:
> > > > > > On Tue, Apr 21, 2026 at 09:39:14AM +0200, Thomas Gleixner wrote:
> > > > > >
> > > > > > > ---
> > > > > > > Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
> > > > > > > From: Thomas Gleixner <tglx@kernel.org>
> > > > > > > Date: Tue, 21 Apr 2026 09:00:52 +0200
> > > > > > >
> > > > > > > irqentry_exit_to_kernel_mode_after_preempt() invokes
> > > > > > > hrtimer_rearm_deferred() only when the interrupted context had interrupts
> > > > > > > enabled. That's a correct decision because the timer interrupt can only be
> > > > > > > delivered in interrupt enabled contexts. The interrupt disabled path is
> > > > > > > used by exceptions and traps which never touch the hrtimer mechanics.
> > > > > > >
> > > > > > > So much for the theory, but then there is VIRT which ruins everything.
> > > > > > >
> > > > > > > KVM invokes regular interrupts with pt_regs which have interrupts
> > > > > > > disabled. That's correct from the KVM point of view, but completely
> > > > > > > violates the obviously correct expectations of the interrupt entry/exit
> > > > > > > code.
> > > > > >
> > > > > > Mooo :-(
> > > >
> > > > Also, is this a x86/KVM 'special' or is this true for all arch/KVM that
> > > > use GENERIC_ENTRY?
> > >
> > > Should we not make asm_fred_entry_from_kvm()/VMX_DO_EVENT_IRQOFF fix IF
> > > on the fake frame instead? We know it will enable IRQs after doing
> > > handle_exit_irqoff() in vcpu_enter_guest().
> >
> > Moo, you can't do that either, because it will ERETS/IRET and fuck up
> > the state :/
>
> How insane is something like this?
Small matter of actually building...
---
diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
index 894f7f16eb80..cc2c961a5683 100644
--- a/arch/x86/entry/entry_64_fred.S
+++ b/arch/x86/entry/entry_64_fred.S
@@ -98,6 +98,7 @@ SYM_FUNC_START(asm_fred_entry_from_kvm)
push %rdi /* fred_ss handed in by the caller */
push %rbp
pushf
+ orq $X86_EFLAGS_KVM, (%rsp)
push $__KERNEL_CS
/*
diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index 0e8c611bc9e2..75568a85b2d3 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -43,6 +43,7 @@
#define _ASM_SUB __ASM_SIZE(sub)
#define _ASM_XADD __ASM_SIZE(xadd)
#define _ASM_MUL __ASM_SIZE(mul)
+#define _ASM_OR __ASM_SIZE(or)
#define _ASM_AX __ASM_REG(ax)
#define _ASM_BX __ASM_REG(bx)
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index 7535131c711b..aab93f07e768 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -97,4 +97,16 @@ static __always_inline void arch_exit_to_user_mode(void)
}
#define arch_exit_to_user_mode arch_exit_to_user_mode
+static __always_inline void arch_exit_to_kernel_mode(struct pt_regs *regs)
+{
+#ifdef CONFIG_KVM_INTEL
+ /*
+ * KVM is a reserved bit and must always be 0. Hardware will #GP on
+ * IRET/ERETS with this bit set.
+ */
+ regs->flags &= ~X86_EFLAGS_KVM;
+#endif
+}
+#define arch_exit_to_kernel_mode arch_exit_to_kernel_mode
+
#endif
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 7bb7bd90355d..c31f7bc2eba2 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -311,7 +311,15 @@ void user_stack_pointer_set(struct pt_regs *regs, unsigned long val)
static __always_inline bool regs_irqs_disabled(struct pt_regs *regs)
{
- return !(regs->flags & X86_EFLAGS_IF);
+ /*
+ * return context | IF | KVM
+ * ---------------+----+----
+ * IRQ-off | 0 | 0
+ * IRQ-on | 0 | 1
+ * IRQ-on | 1 | 0
+ * invalid | 1 | 1
+ */
+ return (regs->flags & (X86_EFLAGS_IF | X86_EFLAGS_KVM)) == 0;
}
/* Query offset/name of register from its name/offset */
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index 81d0c8bf1137..d32edefde587 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -14,6 +14,8 @@
#define X86_EFLAGS_FIXED _BITUL(X86_EFLAGS_FIXED_BIT)
#define X86_EFLAGS_PF_BIT 2 /* Parity Flag */
#define X86_EFLAGS_PF _BITUL(X86_EFLAGS_PF_BIT)
+#define X86_EFLAGS_KVM_BIT 3 /* KVM Flag -- must be 0 */
+#define X86_EFLAGS_KVM _BITUL(X86_EFLAGS_KVM_BIT)
#define X86_EFLAGS_AF_BIT 4 /* Auxiliary carry Flag */
#define X86_EFLAGS_AF _BITUL(X86_EFLAGS_AF_BIT)
#define X86_EFLAGS_ZF_BIT 6 /* Zero Flag */
diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index 8a481dae9cae..cb9ab3ce030b 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -6,6 +6,7 @@
#include <asm/nospec-branch.h>
#include <asm/percpu.h>
#include <asm/segment.h>
+#include <asm/processor-flags.h>
#include "kvm-asm-offsets.h"
#include "run_flags.h"
@@ -50,6 +51,7 @@
push %rbp
#endif
pushf
+ _ASM_OR $X86_EFLAGS_KVM, (%_ASM_SP)
push $__KERNEL_CS
\call_insn \call_target
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index 167fba7dbf04..0acc20b63513 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -167,6 +167,10 @@ static __always_inline void arch_exit_to_user_mode(void);
static __always_inline void arch_exit_to_user_mode(void) { }
#endif
+#ifndef arch_exit_to_kernel_mode
+static __always_inline void arch_exit_to_kernel_mode(struct pt_regs *regs) { }
+#endif
+
/**
* arch_do_signal_or_restart - Architecture specific signal delivery function
* @regs: Pointer to currents pt_regs
@@ -548,6 +552,7 @@ static __always_inline void irqentry_exit_to_kernel_mode(struct pt_regs *regs,
instrumentation_end();
irqentry_exit_to_kernel_mode_after_preempt(regs, state);
+ arch_exit_to_kernel_mode(regs);
}
/**
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 7:39 ` Thomas Gleixner
2026-04-21 11:18 ` Peter Zijlstra
@ 2026-04-21 16:11 ` Verma, Vishal L
1 sibling, 0 replies; 17+ messages in thread
From: Verma, Vishal L @ 2026-04-21 16:11 UTC (permalink / raw)
To: peterz@infradead.org, tglx@kernel.org, binbin.wu@linux.intel.com
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
x86@kernel.org
On Tue, 2026-04-21 at 09:39 +0200, Thomas Gleixner wrote:
>
> Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
> From: Thomas Gleixner <tglx@kernel.org>
> Date: Tue, 21 Apr 2026 09:00:52 +0200
>
> irqentry_exit_to_kernel_mode_after_preempt() invokes
> hrtimer_rearm_deferred() only when the interrupted context had interrupts
> enabled. That's a correct decision because the timer interrupt can only be
> delivered in interrupt enabled contexts. The interrupt disabled path is
> used by exceptions and traps which never touch the hrtimer mechanics.
>
> So much for the theory, but then there is VIRT which ruins everything.
>
> KVM invokes regular interrupts with pt_regs which have interrupts
> disabled. That's correct from the KVM point of view, but completely
> violates the obviously correct expectations of the interrupt entry/exit
> code.
>
> Cure this by adding a hrtimer_rearm_deferred() invocation into the
> interrupts-disabled path of
> irqentry_exit_to_kernel_mode_after_preempt().
>
> That's unfortunate when there is an actual reschedule pending, but it can't
> be avoided because KVM invokes a lot of code and also reenables interrupts
> _before_ reaching the point where the reschedule condition is handled. That
> can delay the rearming significantly, which in turn can cause artificial
> latencies.
>
> Fixes: 0e98eb14814e ("entry: Prepare for deferred hrtimer rearming")
> Reported-by: "Verma, Vishal L" <vishal.l.verma@intel.com>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> Closes: https://lore.kernel.org/70cd3e97fbb796e2eb2ff8cd4b7614ada05a5f24.camel@intel.com
Hi Thomas, I tested this and verified it fixes both tests; no more
lockups. If this is the final fix, you can add:
Tested-by: Vishal Verma <vishal.l.verma@intel.com>
(I'm queueing up Peter's patch on the CI now too)
> ---
> include/linux/irq-entry-common.h | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> --- a/include/linux/irq-entry-common.h
> +++ b/include/linux/irq-entry-common.h
> @@ -516,6 +516,14 @@ irqentry_exit_to_kernel_mode_after_preem
> instrumentation_end();
> } else {
> /*
> + * This is sadly required due to KVM, which invokes regular
> + * interrupt handlers with interrupt disabled state in @regs.
> + */
> + instrumentation_begin();
> + hrtimer_rearm_deferred();
> + instrumentation_end();
> +
> + /*
> * IRQ flags state is correct already. Just tell RCU if it
> * was not watching on entry.
> */
Thread overview: 17+ messages
2026-04-16 20:50 CPU Lockups in KVM with deferred hrtimer rearming Verma, Vishal L
2026-04-20 15:00 ` Thomas Gleixner
2026-04-20 15:22 ` Thomas Gleixner
2026-04-20 20:57 ` Verma, Vishal L
2026-04-20 22:19 ` Thomas Gleixner
2026-04-20 22:24 ` Verma, Vishal L
2026-04-21 6:29 ` Thomas Gleixner
2026-04-21 4:51 ` Binbin Wu
2026-04-21 7:39 ` Thomas Gleixner
2026-04-21 11:18 ` Peter Zijlstra
2026-04-21 11:32 ` Peter Zijlstra
2026-04-21 11:34 ` Peter Zijlstra
2026-04-21 11:49 ` Peter Zijlstra
2026-04-21 12:05 ` Peter Zijlstra
2026-04-21 13:19 ` Peter Zijlstra
2026-04-21 13:29 ` Peter Zijlstra
2026-04-21 16:11 ` Verma, Vishal L