public inbox for kvm@vger.kernel.org
* CPU Lockups in KVM with deferred hrtimer rearming
@ 2026-04-16 20:50 Verma, Vishal L
  2026-04-20 15:00 ` Thomas Gleixner
  0 siblings, 1 reply; 29+ messages in thread
From: Verma, Vishal L @ 2026-04-16 20:50 UTC (permalink / raw)
  To: peterz@infradead.org, tglx@kernel.org
  Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
	x86@kernel.org

Hi Peter,

We noticed a KVM unit test, 'x2apic' (APIC LVT timer one shot),
failing, and also some TDX-specific tests running into multiple CPUs in
hard lockups on a 192-CPU Emerald Rapids system. We traced both to the
hrtimers deferred rearming merge.

Making CONFIG_HRTIMER_REARM_DEFERRED default to n in Kconfig made both
pass.
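For reference, the workaround amounts to flipping the Kconfig default,
roughly like the fragment below. The option name is from the tree; the
prompt and help text here are my paraphrase, not the actual Kconfig entry:

```kconfig
config HRTIMER_REARM_DEFERRED
	bool "Defer hrtimer rearming to the interrupt exit path"
	default n
	help
	  Defer reprogramming of the clockevent device from the hrtimer
	  interrupt itself to the interrupt exit path. Setting this to n
	  restores the old in-interrupt rearming behaviour.
```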

This is the hard lockup splat:

   watchdog: CPU98: Watchdog detected hard LOCKUP on cpu 98
   Modules linked in: openvswitch nsh tls ipt_REJECT iptable_mangle iptable_nat iptable_filter ip_tables bridge stp llc kvm_intel kvm irqbypass sunrpc
   irq event stamp: 34998
   hardirqs last  enabled at (34997): [<ffffffffc090ce6d>] tdx_vcpu_run+0x5d/0x350 [kvm_intel]
   hardirqs last disabled at (34998): [<ffffffffb9add6df>] exc_nmi+0xaf/0x1a0
   softirqs last  enabled at (34404): [<ffffffffb83fdd93>] __irq_exit_rcu+0xe3/0x160
   softirqs last disabled at (34395): [<ffffffffb83fdd93>] __irq_exit_rcu+0xe3/0x160
   CPU: 98 UID: 0 PID: 54785 Comm: qemu-system-x86 Not tainted 7.0.0-g10324ed6a556 #1 PREEMPT(full) 
   Hardware name: HPE ProLiant DL380 Gen11/ProLiant DL380 Gen11, BIOS 2.48 03/11/2025
   RIP: 0010:vmx_do_nmi_irqoff+0x13/0x20 [kvm_intel]
   Code: ff ff 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 48 83 e4 f0 6a 18 55 9c 6a 10 e8 3d db 6e f7 <c9> c3 cc cc cc cc 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90
   RSP: 0018:ff8d3a069bdf3af0 EFLAGS: 00000086
   RAX: ff3cc96963d68000 RBX: ff3cc96963d68000 RCX: 4000000200000000
   RDX: 0000000080000200 RSI: ff3cc96963d699d0 RDI: ff3cc96963d68000
   RBP: ff8d3a069bdf3af0 R08: 0000000000000000 R09: 0000000000000000
   R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
   R13: ff3cc968d03d0000 R14: ff3cc968d03d0000 R15: 0000000000000000
   FS:  00007f26ab7fe6c0(0000) GS:ff3cc98782d76000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 0000000000000000 CR3: 00000001544af004 CR4: 0000000000f73ef0
   PKRU: 00000000
   Call Trace:
    <TASK>
    vmx_handle_nmi+0xdf/0x140 [kvm_intel]
    tdx_vcpu_enter_exit+0xd5/0x300 [kvm_intel]
    tdx_vcpu_run+0x5d/0x350 [kvm_intel]
    vcpu_run+0xd4a/0x1800 [kvm]
    ? __local_bh_enable_ip+0x7b/0xf0
    ? kvm_arch_vcpu_ioctl_run+0x38b/0x5f0 [kvm]
    ? kvm_arch_vcpu_ioctl_run+0xb9/0x5f0 [kvm]
    kvm_arch_vcpu_ioctl_run+0x38b/0x5f0 [kvm]
    kvm_vcpu_ioctl+0x2ef/0xb00 [kvm]
    ? __fget_files+0x2b/0x190
    ? find_held_lock+0x2b/0x80
    __x64_sys_ioctl+0x97/0xe0
    do_syscall_64+0xf4/0x1540
    ? __x64_sys_ioctl+0xb1/0xe0
    ? trace_hardirqs_on_prepare+0xd2/0xf0
    ? do_syscall_64+0x225/0x1540
    ? trace_hardirqs_on+0x18/0x100
    ? __local_bh_enable_ip+0x7b/0xf0
    ? arch_do_signal_or_restart+0x155/0x250
    ? trace_hardirqs_off+0x4e/0xf0
    ? exit_to_user_mode_loop+0x150/0x4e0
    ? trace_hardirqs_on_prepare+0xd2/0xf0
    ? do_syscall_64+0x225/0x1540
    ? do_user_addr_fault+0x36c/0x6b0
    ? lockdep_hardirqs_on_prepare+0xdb/0x190
    ? trace_hardirqs_on+0x18/0x100
    ? do_syscall_64+0xab/0x1540
    ? exc_page_fault+0x12c/0x2b0
    entry_SYSCALL_64_after_hwframe+0x76/0x7e
   RIP: 0033:0x7f45f7ae00ed
   Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
   RSP: 002b:00007f26ab7f3e70 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
   RAX: ffffffffffffffda RBX: 00007f26ab7fe6c0 RCX: 00007f45f7ae00ed
   RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000099
   RBP: 00007f26ab7f3ec0 R08: 0000000000000000 R09: 0000000000000000
   R10: 0000000000000000 R11: 0000000000000246 R12: 00007f26ab7fe6c0
   R13: 00007ffdc7adecd0 R14: 00007f26ab7fecdc R15: 00007ffdc7adedd7
    </TASK>

I tried an AI-assisted patch (below) which does happen to solve
it, but I'm not familiar with this area, and not sure if this is the
right fix.

---

diff --git a/include/linux/entry-virt.h b/include/linux/entry-virt.h
index bfa767702d9a..c4856c252412 100644
--- a/include/linux/entry-virt.h
+++ b/include/linux/entry-virt.h
@@ -4,6 +4,7 @@
 
 #include <linux/static_call_types.h>
 #include <linux/resume_user_mode.h>
+#include <linux/hrtimer_rearm.h>
 #include <linux/syscalls.h>
 #include <linux/seccomp.h>
 #include <linux/sched.h>
@@ -58,6 +59,7 @@ int xfer_to_guest_mode_handle_work(void);
 static inline void xfer_to_guest_mode_prepare(void)
 {
        lockdep_assert_irqs_disabled();
+       hrtimer_rearm_deferred();
        tick_nohz_user_enter_prepare();
 }
 
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 5bd6efe598f0..f3bd084d9a72 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -2058,6 +2058,7 @@ void __hrtimer_rearm_deferred(void)
        }
        hrtimer_rearm(cpu_base, expires_next, true);
 }
+EXPORT_SYMBOL_GPL(__hrtimer_rearm_deferred);
 
 static __always_inline void
 hrtimer_interrupt_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t expires_next)

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-16 20:50 CPU Lockups in KVM with deferred hrtimer rearming Verma, Vishal L
@ 2026-04-20 15:00 ` Thomas Gleixner
  2026-04-20 15:22   ` Thomas Gleixner
                     ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Thomas Gleixner @ 2026-04-20 15:00 UTC (permalink / raw)
  To: Verma, Vishal L, peterz@infradead.org
  Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
	x86@kernel.org

On Thu, Apr 16 2026 at 20:50, Vishal L. Verma wrote:
> I tried an AI-assisted patch (below) which does happen to solve
> it, but I'm not familiar with this area, and not sure if this is the
> right fix.
>
> diff --git a/include/linux/entry-virt.h b/include/linux/entry-virt.h
> index bfa767702d9a..c4856c252412 100644
> --- a/include/linux/entry-virt.h
> +++ b/include/linux/entry-virt.h
> @@ -4,6 +4,7 @@
>  
>  #include <linux/static_call_types.h>
>  #include <linux/resume_user_mode.h>
> +#include <linux/hrtimer_rearm.h>
>  #include <linux/syscalls.h>
>  #include <linux/seccomp.h>
>  #include <linux/sched.h>
> @@ -58,6 +59,7 @@ int xfer_to_guest_mode_handle_work(void);
>  static inline void xfer_to_guest_mode_prepare(void)
>  {
>         lockdep_assert_irqs_disabled();
> +       hrtimer_rearm_deferred();
>         tick_nohz_user_enter_prepare();


This code should never be reached with a rearm pending. Something else
went wrong earlier. So while the patch "works" it papers over the
underlying problem.

Can you please do the following:

    1) Apply the patch below

    2) Enable function tracing and the hrtimer* trace events

    3) Enable tracing if it has been disabled already

       echo 1 >/sys/kernel/tracing/tracing_on

    4) Run the tests and wait for /sys/kernel/tracing/tracing_on to
       become 0, which means the problem triggered.

    5) Retrieve the trace from /sys/kernel/tracing/trace and provide it
       somewhere to download from or send it to me compressed offlist.

Thanks,

        tglx
---

diff --git a/include/linux/entry-virt.h b/include/linux/entry-virt.h
index bfa767702d9a..ab73963a7496 100644
--- a/include/linux/entry-virt.h
+++ b/include/linux/entry-virt.h
@@ -58,6 +58,10 @@ int xfer_to_guest_mode_handle_work(void);
 static inline void xfer_to_guest_mode_prepare(void)
 {
 	lockdep_assert_irqs_disabled();
+	if (test_thread_flag(TIF_HRTIMER_REARM)) {
+		tracing_off();
+		hrtimer_rearm_deferred();
+	}
 	tick_nohz_user_enter_prepare();
 }
 


* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-20 15:00 ` Thomas Gleixner
@ 2026-04-20 15:22   ` Thomas Gleixner
  2026-04-20 20:57   ` Verma, Vishal L
  2026-04-21  4:51   ` Binbin Wu
  2 siblings, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2026-04-20 15:22 UTC (permalink / raw)
  To: Verma, Vishal L, peterz@infradead.org
  Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
	x86@kernel.org

On Mon, Apr 20 2026 at 17:00, Thomas Gleixner wrote:
> On Thu, Apr 16 2026 at 20:50, Vishal L. Verma wrote:
> This code should never be reached with a rearm pending. Something else
> went wrong earlier. So while the patch "works" it papers over the
> underlying problem.

Peter just noticed that this should be fixed with

   1f5ffc672165 ("Fix mismerge of the arm64 / timer-core interrupt handling changes")

Thanks,

        tglx


* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-20 15:00 ` Thomas Gleixner
  2026-04-20 15:22   ` Thomas Gleixner
@ 2026-04-20 20:57   ` Verma, Vishal L
  2026-04-20 22:19     ` Thomas Gleixner
  2026-04-21  4:51   ` Binbin Wu
  2 siblings, 1 reply; 29+ messages in thread
From: Verma, Vishal L @ 2026-04-20 20:57 UTC (permalink / raw)
  To: peterz@infradead.org, tglx@kernel.org
  Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
	x86@kernel.org

On Mon, 2026-04-20 at 17:00 +0200, Thomas Gleixner wrote:
> 
> This code should never be reached with a rearm pending. Something else
> went wrong earlier. So while the patch "works" it papers over the
> underlying problem.
> 
> Can you please do the following:
> 
>     1) Apply the patch below
> 
>     2) Enable function tracing and the hrtimer* trace events
> 
>     3) Enable tracing if it has been disabled already
> 
>        echo 1 >/sys/kernel/tracing/tracing_on
> 
>     4) Run the tests and wait for /sys/kernel/tracing/tracing_on to
>        become 0, which means the problem triggered.
> 
>     5) Retrieve the trace from /sys/kernel/tracing/trace and provide it
>        somewhere to download from or send it to me compressed offlist.

Hi Thomas,

I've uploaded the trace here (~75MB compressed):
https://drive.proton.me/urls/B9PY61XQ0C#07XwTVhE46eB

As for:

1f5ffc672165 ("Fix mismerge of the arm64 / timer-core interrupt handling changes")

I already had that commit in the branch that was tested and it didn't
fix it.



* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-20 20:57   ` Verma, Vishal L
@ 2026-04-20 22:19     ` Thomas Gleixner
  2026-04-20 22:24       ` Verma, Vishal L
  0 siblings, 1 reply; 29+ messages in thread
From: Thomas Gleixner @ 2026-04-20 22:19 UTC (permalink / raw)
  To: Verma, Vishal L, peterz@infradead.org
  Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
	x86@kernel.org

On Mon, Apr 20 2026 at 20:57, Verma, Vishal L wrote:
> On Mon, 2026-04-20 at 17:00 +0200, Thomas Gleixner wrote:
>> 
>> This code should never be reached with a rearm pending. Something else
>> went wrong earlier. So while the patch "works" it papers over the
>> underlying problem.
>> 
>> Can you please do the following:
>> 
>>     1) Apply the patch below
>> 
>>     2) Enable function tracing and the hrtimer* trace events
>> 
>>     3) Enable tracing if it has been disabled already
>> 
>>        echo 1 >/sys/kernel/tracing/tracing_on
>> 
>>     4) Run the tests and wait for /sys/kernel/tracing/tracing_on to
>>        become 0, which means the problem triggered.
>> 
>>     5) Retrieve the trace from /sys/kernel/tracing/trace and provide it
>>        somewhere to download from or send it to me compressed offlist.
>
> Hi Thomas,
>
> I've uploaded the trace here (~75MB compressed):
> https://drive.proton.me/urls/B9PY61XQ0C#07XwTVhE46eB
>
> As for:
>
> 1f5ffc672165 ("Fix mismerge of the arm64 / timer-core interrupt handling changes")
>
> I already had that commit in the branch that was tested and it didn't
> fix it.

Thanks for the update. Can you try to provide the information I asked
for above?

Thanks,

        tglx


* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-20 22:19     ` Thomas Gleixner
@ 2026-04-20 22:24       ` Verma, Vishal L
  2026-04-21  6:29         ` Thomas Gleixner
  0 siblings, 1 reply; 29+ messages in thread
From: Verma, Vishal L @ 2026-04-20 22:24 UTC (permalink / raw)
  To: peterz@infradead.org, tglx@kernel.org
  Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
	x86@kernel.org

On Tue, 2026-04-21 at 00:19 +0200, Thomas Gleixner wrote:
> On Mon, Apr 20 2026 at 20:57, Verma, Vishal L wrote:
> > On Mon, 2026-04-20 at 17:00 +0200, Thomas Gleixner wrote:
> > > 
> > > This code should never be reached with a rearm pending. Something else
> > > went wrong earlier. So while the patch "works" it papers over the
> > > underlying problem.
> > > 
> > > Can you please do the following:
> > > 
> > >     1) Apply the patch below
> > > 
> > >     2) Enable function tracing and the hrtimer* trace events
> > > 
> > >     3) Enable tracing if it has been disabled already
> > > 
> > >        echo 1 >/sys/kernel/tracing/tracing_on
> > > 
> > >     4) Run the tests and wait for /sys/kernel/tracing/tracing_on to
> > >        become 0, which means the problem triggered.
> > > 
> > >     5) Retrieve the trace from /sys/kernel/tracing/trace and provide it
> > >        somewhere to download from or send it to me compressed offlist.
> > 
> > Hi Thomas,
> > 
> > I've uploaded the trace here (~75MB compressed):
> > https://drive.proton.me/urls/B9PY61XQ0C#07XwTVhE46eB
> > 
> > As for:
> > 
> > 1f5ffc672165 ("Fix mismerge of the arm64 / timer-core interrupt handling changes")
> > 
> > I already had that commit in the branch that was tested and it didn't
> > fix it.
> 
> Thanks for the update. Can you try to provide the information I asked
> for above?
> 
Ah sorry - I should've said that with your patch applied, tracing_on
did become 0, so the problem was triggered.

The trace from that is in the URL above.

This is how I collected it:

   tracefs=/sys/kernel/tracing
   echo 4096 > "$tracefs"/buffer_size_kb
   echo function > "$tracefs"/current_tracer
   echo 1 > "$tracefs"/events/hrtimer/enable
   echo 1 > "$tracefs"/tracing_on
   
   <run the test>
   
   tracing_on="$(cat "$tracefs"/tracing_on)"
   if [ "$tracing_on" -eq 0 ]; then
   	echo "Debug patch triggered, collecting trace"
   	cat "$tracefs"/trace | gzip > /tmp/hrtimer_rearm_trace.gz
   else
   	echo "Debug patch did not trigger (tracing_on still 1)"
   fi


* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-20 15:00 ` Thomas Gleixner
  2026-04-20 15:22   ` Thomas Gleixner
  2026-04-20 20:57   ` Verma, Vishal L
@ 2026-04-21  4:51   ` Binbin Wu
  2026-04-21  7:39     ` Thomas Gleixner
  2 siblings, 1 reply; 29+ messages in thread
From: Binbin Wu @ 2026-04-21  4:51 UTC (permalink / raw)
  To: Thomas Gleixner, Verma, Vishal L, peterz@infradead.org
  Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
	x86@kernel.org



On 4/20/2026 11:00 PM, Thomas Gleixner wrote:
> On Thu, Apr 16 2026 at 20:50, Vishal L. Verma wrote:
>> I tried an AI-assisted patch (below) which does happen to solve
>> it, but I'm not familiar with this area, and not sure if this is the
>> right fix.
>>
>> diff --git a/include/linux/entry-virt.h b/include/linux/entry-virt.h
>> index bfa767702d9a..c4856c252412 100644
>> --- a/include/linux/entry-virt.h
>> +++ b/include/linux/entry-virt.h
>> @@ -4,6 +4,7 @@
>>  
>>  #include <linux/static_call_types.h>
>>  #include <linux/resume_user_mode.h>
>> +#include <linux/hrtimer_rearm.h>
>>  #include <linux/syscalls.h>
>>  #include <linux/seccomp.h>
>>  #include <linux/sched.h>
>> @@ -58,6 +59,7 @@ int xfer_to_guest_mode_handle_work(void);
>>  static inline void xfer_to_guest_mode_prepare(void)
>>  {
>>         lockdep_assert_irqs_disabled();
>> +       hrtimer_rearm_deferred();
>>         tick_nohz_user_enter_prepare();
> 
> 
> This code should never be reached with a rearm pending. Something else
> went wrong earlier. So while the patch "works" it papers over the
> underlying problem.

IIUC, the problem might be:

HRTimer -> VMExit:
[IRQ is disabled]
    kvm_x86_call(handle_exit_irqoff)(vcpu)
        vmx_handle_exit_irqoff
            handle_external_interrupt_irqoff
                sysvec_apic_timer_interrupt
                    irqentry_enter
                    ...
                    irqentry_exit
                        irqentry_exit_to_kernel_mode
                            if (!regs_irqs_disabled(regs)) // <-- false, so the deferred
                                hrtimer_rearm_deferred()   //     hrtimer rearm is skipped!


This issue is triggered on TDX since TDX can't use the preemption timer,
while a normal VMX VM uses the preemption timer by default.


> 
> Can you please do the following:
> 
>     1) Apply the patch below
> 
>     2) Enable function tracing and the hrtimer* trace events
> 
>     3) Enable tracing if it has been disabled already
> 
>        echo 1 >/sys/kernel/tracing/tracing_on
> 
>     4) Run the tests and wait for /sys/kernel/tracing/tracing_on to
>        become 0, which means the problem triggered.
> 
>     5) Retrieve the trace from /sys/kernel/tracing/trace and provide it
>        somewhere to download from or send it to me compressed offlist.
> 
> Thanks,
> 
>         tglx
> ---
> 
> diff --git a/include/linux/entry-virt.h b/include/linux/entry-virt.h
> index bfa767702d9a..ab73963a7496 100644
> --- a/include/linux/entry-virt.h
> +++ b/include/linux/entry-virt.h
> @@ -58,6 +58,10 @@ int xfer_to_guest_mode_handle_work(void);
>  static inline void xfer_to_guest_mode_prepare(void)
>  {
>  	lockdep_assert_irqs_disabled();
> +	if (test_thread_flag(TIF_HRTIMER_REARM)) {
> +		tracing_off();
> +		hrtimer_rearm_deferred();
> +	}
>  	tick_nohz_user_enter_prepare();
>  }
>  
> 



* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-20 22:24       ` Verma, Vishal L
@ 2026-04-21  6:29         ` Thomas Gleixner
  0 siblings, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2026-04-21  6:29 UTC (permalink / raw)
  To: Verma, Vishal L, peterz@infradead.org
  Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
	x86@kernel.org

On Mon, Apr 20 2026 at 22:24, Verma, Vishal L wrote:
> On Tue, 2026-04-21 at 00:19 +0200, Thomas Gleixner wrote:
>> Thanks for the update. Can you try to provide the information I asked
>> for above?
>> 
> Ah sorry - I should've said that with your patch applied, tracing_on
> did become 0, so the problem was triggered.
>
> The trace from that is in the URL above.

I clearly can't read :)


* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-21  4:51   ` Binbin Wu
@ 2026-04-21  7:39     ` Thomas Gleixner
  2026-04-21 11:18       ` Peter Zijlstra
  2026-04-21 16:11       ` Verma, Vishal L
  0 siblings, 2 replies; 29+ messages in thread
From: Thomas Gleixner @ 2026-04-21  7:39 UTC (permalink / raw)
  To: Binbin Wu, Verma, Vishal L, peterz@infradead.org
  Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
	x86@kernel.org

On Tue, Apr 21 2026 at 12:51, Binbin Wu wrote:
> On 4/20/2026 11:00 PM, Thomas Gleixner wrote:
>>>  static inline void xfer_to_guest_mode_prepare(void)
>>>  {
>>>         lockdep_assert_irqs_disabled();
>>> +       hrtimer_rearm_deferred();
>>>         tick_nohz_user_enter_prepare();
>> 
>> 
>> This code should never be reached with a rearm pending. Something else
>> went wrong earlier. So while the patch "works" it papers over the
>> underlying problem.
>
> IIUC, the problem might be:
>
> HRTimer -> VMExit:
> [IRQ is disabled]
>     kvm_x86_call(handle_exit_irqoff)(vcpu)
>         vmx_handle_exit_irqoff
>             handle_external_interrupt_irqoff
>                 sysvec_apic_timer_interrupt
>                     irqentry_enter
>                     ...
>                     irqentry_exit
>                         irqentry_exit_to_kernel_mode
>                             if (!regs_irqs_disabled(regs)) // <-- false, so the deferred
>                                 hrtimer_rearm_deferred()   //     hrtimer rearm is skipped!
>
>
> This issue is triggered on TDX since TDX can't use the preemption timer,
> while a normal VMX VM uses the preemption timer by default.

Kinda.

The issue is that vmx_handle_exit_irqoff() always hands in regs with
regs->flags.X86_EFLAGS_IF == 0. That has absolutely nothing to do with
TDX and the preemption timer.

The patch below solves the problem right there in the exit code, which
is unfortunate as there might be a NEED_RESCHED pending. But that can't
be taken into account as KVM enables interrupts _before_ reaching the
exit work point.

Yet another proof that virt creates more problems than it solves.

Thanks,

        tglx
---
Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
From: Thomas Gleixner <tglx@kernel.org>
Date: Tue, 21 Apr 2026 09:00:52 +0200

irqentry_exit_to_kernel_mode_after_preempt() invokes
hrtimer_rearm_deferred() only when the interrupted context had interrupts
enabled. That's a correct decision because the timer interrupt can only be
delivered in interrupt enabled contexts. The interrupt disabled path is
used by exceptions and traps which never touch the hrtimer mechanics.

So much for the theory, but then there is VIRT which ruins everything.

KVM invokes regular interrupts with pt_regs which have interrupts
disabled. That's correct from the KVM point of view, but completely
violates the obviously correct expectations of the interrupt entry/exit
code.

Cure this by adding a hrtimer_rearm_deferred() invocation to the branch of
irqentry_exit_to_kernel_mode_after_preempt() which is taken when the
interrupted context had interrupts disabled.

That's unfortunate when there is an actual reschedule pending, but it can't
be avoided because KVM invokes a lot of code and also reenables interrupts
_before_ reaching the point where the reschedule condition is handled. That
can delay the rearming significantly, which in turn can cause artificial
latencies.

Fixes: 0e98eb14814e ("entry: Prepare for deferred hrtimer rearming")
Reported-by: "Verma, Vishal L" <vishal.l.verma@intel.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Closes: https://lore.kernel.org/70cd3e97fbb796e2eb2ff8cd4b7614ada05a5f24.camel@intel.com
---
 include/linux/irq-entry-common.h |    8 ++++++++
 1 file changed, 8 insertions(+)

--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -516,6 +516,14 @@ irqentry_exit_to_kernel_mode_after_preem
 		instrumentation_end();
 	} else {
 		/*
+		 * This is sadly required due to KVM, which invokes regular
+		 * interrupt handlers with interrupt disabled state in @regs.
+		 */
+		instrumentation_begin();
+		hrtimer_rearm_deferred();
+		instrumentation_end();
+
+		/*
 		 * IRQ flags state is correct already. Just tell RCU if it
 		 * was not watching on entry.
 		 */


* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-21  7:39     ` Thomas Gleixner
@ 2026-04-21 11:18       ` Peter Zijlstra
  2026-04-21 11:32         ` Peter Zijlstra
  2026-04-21 16:11       ` Verma, Vishal L
  1 sibling, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2026-04-21 11:18 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
	Edgecombe, Rick P, Wu, Binbin, x86@kernel.org

On Tue, Apr 21, 2026 at 09:39:14AM +0200, Thomas Gleixner wrote:

> ---
> Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
> From: Thomas Gleixner <tglx@kernel.org>
> Date: Tue, 21 Apr 2026 09:00:52 +0200
> 
> irqentry_exit_to_kernel_mode_after_preempt() invokes
> hrtimer_rearm_deferred() only when the interrupted context had interrupts
> enabled. That's a correct decision because the timer interrupt can only be
> delivered in interrupt enabled contexts. The interrupt disabled path is
> used by exceptions and traps which never touch the hrtimer mechanics.
> 
> So much for the theory, but then there is VIRT which ruins everything.
> 
> KVM invokes regular interrupts with pt_regs which have interrupts
> disabled. That's correct from the KVM point of view, but completely
> violates the obviously correct expectations of the interrupt entry/exit
> code.

Mooo :-(

That also complicates the comment that goes with
hrtimer_rearm_deferred(). Not sure how to 'fix' that.

> Cure this by adding a hrtimer_rearm_deferred() invocation to the branch of
> irqentry_exit_to_kernel_mode_after_preempt() which is taken when the
> interrupted context had interrupts disabled.
> 
> That's unfortunate when there is an actual reschedule pending, but it can't
> be avoided because KVM invokes a lot of code and also reenables interrupts
> _before_ reaching the point where the reschedule condition is handled. That
> can delay the rearming significantly, which in turn can cause artificial
> latencies.

Yeah, this is a trainwreck. If they want it better, KVM needs to get
'fixed' to not play silly games like this.

> Fixes: 0e98eb14814e ("entry: Prepare for deferred hrtimer rearming")
> Reported-by: "Verma, Vishal L" <vishal.l.verma@intel.com>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> Closes: https://lore.kernel.org/70cd3e97fbb796e2eb2ff8cd4b7614ada05a5f24.camel@intel.com

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

> ---
>  include/linux/irq-entry-common.h |    8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> --- a/include/linux/irq-entry-common.h
> +++ b/include/linux/irq-entry-common.h
> @@ -516,6 +516,14 @@ irqentry_exit_to_kernel_mode_after_preem
>  		instrumentation_end();
>  	} else {
>  		/*
> +		 * This is sadly required due to KVM, which invokes regular
> +		 * interrupt handlers with interrupt disabled state in @regs.
> +		 */
> +		instrumentation_begin();
> +		hrtimer_rearm_deferred();
> +		instrumentation_end();
> +
> +		/*
>  		 * IRQ flags state is correct already. Just tell RCU if it
>  		 * was not watching on entry.
>  		 */


* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-21 11:18       ` Peter Zijlstra
@ 2026-04-21 11:32         ` Peter Zijlstra
  2026-04-21 11:34           ` Peter Zijlstra
  2026-04-21 16:30           ` Thomas Gleixner
  0 siblings, 2 replies; 29+ messages in thread
From: Peter Zijlstra @ 2026-04-21 11:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
	Edgecombe, Rick P, Wu, Binbin, x86@kernel.org

On Tue, Apr 21, 2026 at 01:18:58PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 09:39:14AM +0200, Thomas Gleixner wrote:
> 
> > ---
> > Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
> > From: Thomas Gleixner <tglx@kernel.org>
> > Date: Tue, 21 Apr 2026 09:00:52 +0200
> > 
> > irqentry_exit_to_kernel_mode_after_preempt() invokes
> > hrtimer_rearm_deferred() only when the interrupted context had interrupts
> > enabled. That's a correct decision because the timer interrupt can only be
> > delivered in interrupt enabled contexts. The interrupt disabled path is
> > used by exceptions and traps which never touch the hrtimer mechanics.
> > 
> > So much for the theory, but then there is VIRT which ruins everything.
> > 
> > KVM invokes regular interrupts with pt_regs which have interrupts
> > disabled. That's correct from the KVM point of view, but completely
> > violates the obviously correct expectations of the interrupt entry/exit
> > code.
> 
> Mooo :-(
> 
> That also complicates the comment that goes with
> hrtimer_rearm_deferred(). Not sure how to 'fix' that.
> 
> > Cure this by adding a hrtimer_rearm_deferred() invocation to the branch of
> > irqentry_exit_to_kernel_mode_after_preempt() which is taken when the
> > interrupted context had interrupts disabled.
> > 
> > That's unfortunate when there is an actual reschedule pending, but it can't
> > be avoided because KVM invokes a lot of code and also reenables interrupts
> > _before_ reaching the point where the reschedule condition is handled. That
> > can delay the rearming significantly, which in turn can cause artificial
> > latencies.
> 
> Yeah, this is a trainwreck. If they want it better, KVM needs to get
> 'fixed' to not play silly games like this.
> 
> > Fixes: 0e98eb14814e ("entry: Prepare for deferred hrtimer rearming")
> > Reported-by: "Verma, Vishal L" <vishal.l.verma@intel.com>
> > Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> > Closes: https://lore.kernel.org/70cd3e97fbb796e2eb2ff8cd4b7614ada05a5f24.camel@intel.com
> 
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> > ---
> >  include/linux/irq-entry-common.h |    8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > --- a/include/linux/irq-entry-common.h
> > +++ b/include/linux/irq-entry-common.h
> > @@ -516,6 +516,14 @@ irqentry_exit_to_kernel_mode_after_preem
> >  		instrumentation_end();
> >  	} else {
> >  		/*
> > +		 * This is sadly required due to KVM, which invokes regular
> > +		 * interrupt handlers with interrupt disabled state in @regs.
> > +		 */
> > +		instrumentation_begin();
> > +		hrtimer_rearm_deferred();
> > +		instrumentation_end();
> > +
> > +		/*
> >  		 * IRQ flags state is correct already. Just tell RCU if it
> >  		 * was not watching on entry.
> >  		 */

Ohhh, wait. What happens if you take a page-fault from NMI context? Does
this then not result in trying to program the timer from NMI context?



* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-21 11:32         ` Peter Zijlstra
@ 2026-04-21 11:34           ` Peter Zijlstra
  2026-04-21 11:49             ` Peter Zijlstra
  2026-04-21 16:30           ` Thomas Gleixner
  1 sibling, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2026-04-21 11:34 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
	Edgecombe, Rick P, Wu, Binbin, x86@kernel.org

On Tue, Apr 21, 2026 at 01:32:12PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 01:18:58PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 21, 2026 at 09:39:14AM +0200, Thomas Gleixner wrote:
> > 
> > > ---
> > > Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
> > > From: Thomas Gleixner <tglx@kernel.org>
> > > Date: Tue, 21 Apr 2026 09:00:52 +0200
> > > 
> > > irqentry_exit_to_kernel_mode_after_preempt() invokes
> > > hrtimer_rearm_deferred() only when the interrupted context had interrupts
> > > enabled. That's a correct decision because the timer interrupt can only be
> > > delivered in interrupt enabled contexts. The interrupt disabled path is
> > > used by exceptions and traps which never touch the hrtimer mechanics.
> > > 
> > > So much for the theory, but then there is VIRT which ruins everything.
> > > 
> > > KVM invokes regular interrupts with pt_regs which have interrupts
> > > disabled. That's correct from the KVM point of view, but completely
> > > violates the obviously correct expectations of the interrupt entry/exit
> > > code.
> > 
> > Mooo :-(

Also, is this an x86/KVM 'special' or is this true for all arch/KVM that
use GENERIC_ENTRY?


* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-21 11:34           ` Peter Zijlstra
@ 2026-04-21 11:49             ` Peter Zijlstra
  2026-04-21 12:05               ` Peter Zijlstra
  2026-04-21 17:11               ` Thomas Gleixner
  0 siblings, 2 replies; 29+ messages in thread
From: Peter Zijlstra @ 2026-04-21 11:49 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
	Edgecombe, Rick P, Wu, Binbin, x86@kernel.org

On Tue, Apr 21, 2026 at 01:34:07PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 01:32:12PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 21, 2026 at 01:18:58PM +0200, Peter Zijlstra wrote:
> > > On Tue, Apr 21, 2026 at 09:39:14AM +0200, Thomas Gleixner wrote:
> > > 
> > > > ---
> > > > Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
> > > > From: Thomas Gleixner <tglx@kernel.org>
> > > > Date: Tue, 21 Apr 2026 09:00:52 +0200
> > > > 
> > > > irqentry_exit_to_kernel_mode_after_preempt() invokes
> > > > hrtimer_rearm_deferred() only when the interrupted context had interrupts
> > > > enabled. That's a correct decision because the timer interrupt can only be
> > > > delivered in interrupt enabled contexts. The interrupt disabled path is
> > > > used by exceptions and traps which never touch the hrtimer mechanics.
> > > > 
> > > > So much for the theory, but then there is VIRT which ruins everything.
> > > > 
> > > > KVM invokes regular interrupts with pt_regs which have interrupts
> > > > disabled. That's correct from the KVM point of view, but completely
> > > > violates the obviously correct expectations of the interrupt entry/exit
> > > > code.
> > > 
> > > Mooo :-(
> 
> Also, is this a x86/KVM 'special' or is this true for all arch/KVM that
> use GENERIC_ENTRY?

Should we not make asm_fred_entry_from_kvm()/VMX_DO_EVENT_IRQOFF fix IF
on the fake frame instead? We know it will enable IRQs after doing
handle_exit_irqoff() in vcpu_enter_guest().

SVM does not seem affected with this particular insanity.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-21 11:49             ` Peter Zijlstra
@ 2026-04-21 12:05               ` Peter Zijlstra
  2026-04-21 13:19                 ` Peter Zijlstra
  2026-04-21 17:11               ` Thomas Gleixner
  1 sibling, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2026-04-21 12:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
	Edgecombe, Rick P, Wu, Binbin, x86@kernel.org

On Tue, Apr 21, 2026 at 01:49:40PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 01:34:07PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 21, 2026 at 01:32:12PM +0200, Peter Zijlstra wrote:
> > > On Tue, Apr 21, 2026 at 01:18:58PM +0200, Peter Zijlstra wrote:
> > > > On Tue, Apr 21, 2026 at 09:39:14AM +0200, Thomas Gleixner wrote:
> > > > 
> > > > > ---
> > > > > Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
> > > > > From: Thomas Gleixner <tglx@kernel.org>
> > > > > Date: Tue, 21 Apr 2026 09:00:52 +0200
> > > > > 
> > > > > irqentry_exit_to_kernel_mode_after_preempt() invokes
> > > > > hrtimer_rearm_deferred() only when the interrupted context had interrupts
> > > > > enabled. That's a correct decision because the timer interrupt can only be
> > > > > delivered in interrupt enabled contexts. The interrupt disabled path is
> > > > > used by exceptions and traps which never touch the hrtimer mechanics.
> > > > > 
> > > > > So much for the theory, but then there is VIRT which ruins everything.
> > > > > 
> > > > > KVM invokes regular interrupts with pt_regs which have interrupts
> > > > > disabled. That's correct from the KVM point of view, but completely
> > > > > violates the obviously correct expectations of the interrupt entry/exit
> > > > > code.
> > > > 
> > > > Mooo :-(
> > 
> > Also, is this a x86/KVM 'special' or is this true for all arch/KVM that
> > use GENERIC_ENTRY?
> 
> Should we not make asm_fred_entry_from_kvm()/VMX_DO_EVENT_IRQOFF fix IF
> on the fake frame instead? We know it will enable IRQs after doing
> handle_exit_irqoff() in vcpu_enter_guest().

Moo, you can't do that either, because it will ERETS/IRET and fuck up
the state :/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-21 12:05               ` Peter Zijlstra
@ 2026-04-21 13:19                 ` Peter Zijlstra
  2026-04-21 13:29                   ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2026-04-21 13:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
	Edgecombe, Rick P, Wu, Binbin, x86@kernel.org

On Tue, Apr 21, 2026 at 02:05:31PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 01:49:40PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 21, 2026 at 01:34:07PM +0200, Peter Zijlstra wrote:
> > > On Tue, Apr 21, 2026 at 01:32:12PM +0200, Peter Zijlstra wrote:
> > > > On Tue, Apr 21, 2026 at 01:18:58PM +0200, Peter Zijlstra wrote:
> > > > > On Tue, Apr 21, 2026 at 09:39:14AM +0200, Thomas Gleixner wrote:
> > > > > 
> > > > > > ---
> > > > > > Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
> > > > > > From: Thomas Gleixner <tglx@kernel.org>
> > > > > > Date: Tue, 21 Apr 2026 09:00:52 +0200
> > > > > > 
> > > > > > irqentry_exit_to_kernel_mode_after_preempt() invokes
> > > > > > hrtimer_rearm_deferred() only when the interrupted context had interrupts
> > > > > > enabled. That's a correct decision because the timer interrupt can only be
> > > > > > delivered in interrupt enabled contexts. The interrupt disabled path is
> > > > > > used by exceptions and traps which never touch the hrtimer mechanics.
> > > > > > 
> > > > > > So much for the theory, but then there is VIRT which ruins everything.
> > > > > > 
> > > > > > KVM invokes regular interrupts with pt_regs which have interrupts
> > > > > > disabled. That's correct from the KVM point of view, but completely
> > > > > > violates the obviously correct expectations of the interrupt entry/exit
> > > > > > code.
> > > > > 
> > > > > Mooo :-(
> > > 
> > > Also, is this a x86/KVM 'special' or is this true for all arch/KVM that
> > > use GENERIC_ENTRY?
> > 
> > Should we not make asm_fred_entry_from_kvm()/VMX_DO_EVENT_IRQOFF fix IF
> > on the fake frame instead? We know it will enable IRQs after doing
> > handle_exit_irqoff() in vcpu_enter_guest().
> 
> Moo, you can't do that either, because it will ERETS/IRET and fuck up
> the state :/

How insane is something like this?

---
diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
index 894f7f16eb80..f3e2a8fde1ab 100644
--- a/arch/x86/entry/entry_64_fred.S
+++ b/arch/x86/entry/entry_64_fred.S
@@ -98,6 +98,7 @@ SYM_FUNC_START(asm_fred_entry_from_kvm)
 	push %rdi			/* fred_ss handed in by the caller */
 	push %rbp
 	pushf
+	or $X86_EFLAGS_KVM, (%rsp)
 	push $__KERNEL_CS
 
 	/*
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index 7535131c711b..aab93f07e768 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -97,4 +97,16 @@ static __always_inline void arch_exit_to_user_mode(void)
 }
 #define arch_exit_to_user_mode arch_exit_to_user_mode
 
+static __always_inline void arch_exit_to_kernel_mode(struct pt_regs *regs)
+{
+#ifdef CONFIG_KVM_INTEL
+	/*
+	 * KVM is a reserved bit and must always be 0. Hardware will #GP on
+	 * IRET/ERETS with this bit set.
+	 */
+	regs->flags &= ~X86_EFLAGS_KVM;
+#endif
+}
+#define arch_exit_to_kernel_mode arch_exit_to_kernel_mode
+
 #endif
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 7bb7bd90355d..c31f7bc2eba2 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -311,7 +311,15 @@ void user_stack_pointer_set(struct pt_regs *regs, unsigned long val)
 
 static __always_inline bool regs_irqs_disabled(struct pt_regs *regs)
 {
-	return !(regs->flags & X86_EFLAGS_IF);
+	/*
+	 * return context | IF | KVM
+	 * ---------------+----+----
+	 * IRQ-off        |  0 |  0
+	 * IRQ-on         |  0 |  1
+	 * IRQ-on         |  1 |  0
+	 * invalid        |  1 |  1
+	 */
+	return (regs->flags & (X86_EFLAGS_IF | X86_EFLAGS_KVM)) == 0;
 }
 
 /* Query offset/name of register from its name/offset */
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index 81d0c8bf1137..d32edefde587 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -14,6 +14,8 @@
 #define X86_EFLAGS_FIXED	_BITUL(X86_EFLAGS_FIXED_BIT)
 #define X86_EFLAGS_PF_BIT	2 /* Parity Flag */
 #define X86_EFLAGS_PF		_BITUL(X86_EFLAGS_PF_BIT)
+#define X86_EFLAGS_KVM_BIT	3 /* KVM Flag -- must be 0 */
+#define X86_EFLAGS_KVM		_BITUL(X86_EFLAGS_PF_BIT)
 #define X86_EFLAGS_AF_BIT	4 /* Auxiliary carry Flag */
 #define X86_EFLAGS_AF		_BITUL(X86_EFLAGS_AF_BIT)
 #define X86_EFLAGS_ZF_BIT	6 /* Zero Flag */
diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index 8a481dae9cae..3d0d0fb8de79 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -50,6 +50,7 @@
 	push %rbp
 #endif
 	pushf
+	or $X86_EFLAGS_KVM, (%_ASM_SP)
 	push $__KERNEL_CS
 	\call_insn \call_target
 
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index 167fba7dbf04..0acc20b63513 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -167,6 +167,10 @@ static __always_inline void arch_exit_to_user_mode(void);
 static __always_inline void arch_exit_to_user_mode(void) { }
 #endif
 
+#ifndef arch_exit_to_kernel_mode
+static __always_inline void arch_exit_to_kernel_mode(struct pt_regs *regs) { }
+#endif
+
 /**
  * arch_do_signal_or_restart -  Architecture specific signal delivery function
  * @regs:	Pointer to currents pt_regs
@@ -548,6 +552,7 @@ static __always_inline void irqentry_exit_to_kernel_mode(struct pt_regs *regs,
 	instrumentation_end();
 
 	irqentry_exit_to_kernel_mode_after_preempt(regs, state);
+	arch_exit_to_kernel_mode(regs);
 }
 
 /**

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-21 13:19                 ` Peter Zijlstra
@ 2026-04-21 13:29                   ` Peter Zijlstra
  2026-04-21 16:36                     ` Thomas Gleixner
  2026-04-21 18:11                     ` Verma, Vishal L
  0 siblings, 2 replies; 29+ messages in thread
From: Peter Zijlstra @ 2026-04-21 13:29 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
	Edgecombe, Rick P, Wu, Binbin, x86@kernel.org

On Tue, Apr 21, 2026 at 03:19:53PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 02:05:31PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 21, 2026 at 01:49:40PM +0200, Peter Zijlstra wrote:
> > > On Tue, Apr 21, 2026 at 01:34:07PM +0200, Peter Zijlstra wrote:
> > > > On Tue, Apr 21, 2026 at 01:32:12PM +0200, Peter Zijlstra wrote:
> > > > > On Tue, Apr 21, 2026 at 01:18:58PM +0200, Peter Zijlstra wrote:
> > > > > > On Tue, Apr 21, 2026 at 09:39:14AM +0200, Thomas Gleixner wrote:
> > > > > > 
> > > > > > > ---
> > > > > > > Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
> > > > > > > From: Thomas Gleixner <tglx@kernel.org>
> > > > > > > Date: Tue, 21 Apr 2026 09:00:52 +0200
> > > > > > > 
> > > > > > > irqentry_exit_to_kernel_mode_after_preempt() invokes
> > > > > > > hrtimer_rearm_deferred() only when the interrupted context had interrupts
> > > > > > > enabled. That's a correct decision because the timer interrupt can only be
> > > > > > > delivered in interrupt enabled contexts. The interrupt disabled path is
> > > > > > > used by exceptions and traps which never touch the hrtimer mechanics.
> > > > > > > 
> > > > > > > So much for the theory, but then there is VIRT which ruins everything.
> > > > > > > 
> > > > > > > KVM invokes regular interrupts with pt_regs which have interrupts
> > > > > > > disabled. That's correct from the KVM point of view, but completely
> > > > > > > violates the obviously correct expectations of the interrupt entry/exit
> > > > > > > code.
> > > > > > 
> > > > > > Mooo :-(
> > > > 
> > > > Also, is this a x86/KVM 'special' or is this true for all arch/KVM that
> > > > use GENERIC_ENTRY?
> > > 
> > > Should we not make asm_fred_entry_from_kvm()/VMX_DO_EVENT_IRQOFF fix IF
> > > on the fake frame instead? We know it will enable IRQs after doing
> > > handle_exit_irqoff() in vcpu_enter_guest().
> > 
> > Moo, you can't do that either, because it will ERETS/IRET and fuck up
> > the state :/
> 
> How insane is something like this?

Small matter of actually building...

---
diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
index 894f7f16eb80..cc2c961a5683 100644
--- a/arch/x86/entry/entry_64_fred.S
+++ b/arch/x86/entry/entry_64_fred.S
@@ -98,6 +98,7 @@ SYM_FUNC_START(asm_fred_entry_from_kvm)
 	push %rdi			/* fred_ss handed in by the caller */
 	push %rbp
 	pushf
+	orq $X86_EFLAGS_KVM, (%rsp)
 	push $__KERNEL_CS
 
 	/*
diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index 0e8c611bc9e2..75568a85b2d3 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -43,6 +43,7 @@
 #define _ASM_SUB	__ASM_SIZE(sub)
 #define _ASM_XADD	__ASM_SIZE(xadd)
 #define _ASM_MUL	__ASM_SIZE(mul)
+#define _ASM_OR		__ASM_SIZE(or)
 
 #define _ASM_AX		__ASM_REG(ax)
 #define _ASM_BX		__ASM_REG(bx)
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index 7535131c711b..aab93f07e768 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -97,4 +97,16 @@ static __always_inline void arch_exit_to_user_mode(void)
 }
 #define arch_exit_to_user_mode arch_exit_to_user_mode
 
+static __always_inline void arch_exit_to_kernel_mode(struct pt_regs *regs)
+{
+#ifdef CONFIG_KVM_INTEL
+	/*
+	 * KVM is a reserved bit and must always be 0. Hardware will #GP on
+	 * IRET/ERETS with this bit set.
+	 */
+	regs->flags &= ~X86_EFLAGS_KVM;
+#endif
+}
+#define arch_exit_to_kernel_mode arch_exit_to_kernel_mode
+
 #endif
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 7bb7bd90355d..c31f7bc2eba2 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -311,7 +311,15 @@ void user_stack_pointer_set(struct pt_regs *regs, unsigned long val)
 
 static __always_inline bool regs_irqs_disabled(struct pt_regs *regs)
 {
-	return !(regs->flags & X86_EFLAGS_IF);
+	/*
+	 * return context | IF | KVM
+	 * ---------------+----+----
+	 * IRQ-off        |  0 |  0
+	 * IRQ-on         |  0 |  1
+	 * IRQ-on         |  1 |  0
+	 * invalid        |  1 |  1
+	 */
+	return (regs->flags & (X86_EFLAGS_IF | X86_EFLAGS_KVM)) == 0;
 }
 
 /* Query offset/name of register from its name/offset */
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index 81d0c8bf1137..d32edefde587 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -14,6 +14,8 @@
 #define X86_EFLAGS_FIXED	_BITUL(X86_EFLAGS_FIXED_BIT)
 #define X86_EFLAGS_PF_BIT	2 /* Parity Flag */
 #define X86_EFLAGS_PF		_BITUL(X86_EFLAGS_PF_BIT)
+#define X86_EFLAGS_KVM_BIT	3 /* KVM Flag -- must be 0 */
+#define X86_EFLAGS_KVM		_BITUL(X86_EFLAGS_PF_BIT)
 #define X86_EFLAGS_AF_BIT	4 /* Auxiliary carry Flag */
 #define X86_EFLAGS_AF		_BITUL(X86_EFLAGS_AF_BIT)
 #define X86_EFLAGS_ZF_BIT	6 /* Zero Flag */
diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index 8a481dae9cae..cb9ab3ce030b 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -6,6 +6,7 @@
 #include <asm/nospec-branch.h>
 #include <asm/percpu.h>
 #include <asm/segment.h>
+#include <asm/processor-flags.h>
 #include "kvm-asm-offsets.h"
 #include "run_flags.h"
 
@@ -50,6 +51,7 @@
 	push %rbp
 #endif
 	pushf
+	_ASM_OR $X86_EFLAGS_KVM, (%_ASM_SP)
 	push $__KERNEL_CS
 	\call_insn \call_target
 
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index 167fba7dbf04..0acc20b63513 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -167,6 +167,10 @@ static __always_inline void arch_exit_to_user_mode(void);
 static __always_inline void arch_exit_to_user_mode(void) { }
 #endif
 
+#ifndef arch_exit_to_kernel_mode
+static __always_inline void arch_exit_to_kernel_mode(struct pt_regs *regs) { }
+#endif
+
 /**
  * arch_do_signal_or_restart -  Architecture specific signal delivery function
  * @regs:	Pointer to currents pt_regs
@@ -548,6 +552,7 @@ static __always_inline void irqentry_exit_to_kernel_mode(struct pt_regs *regs,
 	instrumentation_end();
 
 	irqentry_exit_to_kernel_mode_after_preempt(regs, state);
+	arch_exit_to_kernel_mode(regs);
 }
 
 /**

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-21  7:39     ` Thomas Gleixner
  2026-04-21 11:18       ` Peter Zijlstra
@ 2026-04-21 16:11       ` Verma, Vishal L
  1 sibling, 0 replies; 29+ messages in thread
From: Verma, Vishal L @ 2026-04-21 16:11 UTC (permalink / raw)
  To: peterz@infradead.org, tglx@kernel.org, binbin.wu@linux.intel.com
  Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
	x86@kernel.org

On Tue, 2026-04-21 at 09:39 +0200, Thomas Gleixner wrote:
> 
> Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
> From: Thomas Gleixner <tglx@kernel.org>
> Date: Tue, 21 Apr 2026 09:00:52 +0200
> 
> irqentry_exit_to_kernel_mode_after_preempt() invokes
> hrtimer_rearm_deferred() only when the interrupted context had interrupts
> enabled. That's a correct decision because the timer interrupt can only be
> delivered in interrupt enabled contexts. The interrupt disabled path is
> used by exceptions and traps which never touch the hrtimer mechanics.
> 
> So much for the theory, but then there is VIRT which ruins everything.
> 
> KVM invokes regular interrupts with pt_regs which have interrupts
> disabled. That's correct from the KVM point of view, but completely
> violates the obviously correct expectations of the interrupt entry/exit
> code.
> 
> Cure this by adding a hrtimer_rearm_deferred() invocation to the
> interrupts-disabled path of
> irqentry_exit_to_kernel_mode_after_preempt().
> 
> That's unfortunate when there is an actual reschedule pending, but it can't
> be avoided because KVM invokes a lot of code and also reenables interrupts
> _before_ reaching the point where the reschedule condition is handled. That
> can delay the rearming significantly, which in turn can cause artificial
> latencies.
> 
> Fixes: 0e98eb14814e ("entry: Prepare for deferred hrtimer rearming")
> Reported-by: "Verma, Vishal L" <vishal.l.verma@intel.com>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> Closes: https://lore.kernel.org/70cd3e97fbb796e2eb2ff8cd4b7614ada05a5f24.camel@intel.com

Hi Thomas, I tested this and verified it fixes both tests, no more
lockups. If this is the final fix, you can add:

Tested-by: Vishal Verma <vishal.l.verma@intel.com>

(I'm queueing up Peter's patch on the CI now too)

> ---
>  include/linux/irq-entry-common.h |    8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> --- a/include/linux/irq-entry-common.h
> +++ b/include/linux/irq-entry-common.h
> @@ -516,6 +516,14 @@ irqentry_exit_to_kernel_mode_after_preem
>  		instrumentation_end();
>  	} else {
>  		/*
> +		 * This is sadly required due to KVM, which invokes regular
> +		 * interrupt handlers with interrupt disabled state in @regs.
> +		 */
> +		instrumentation_begin();
> +		hrtimer_rearm_deferred();
> +		instrumentation_end();
> +
> +		/*
>  		 * IRQ flags state is correct already. Just tell RCU if it
>  		 * was not watching on entry.
>  		 */

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-21 11:32         ` Peter Zijlstra
  2026-04-21 11:34           ` Peter Zijlstra
@ 2026-04-21 16:30           ` Thomas Gleixner
  1 sibling, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2026-04-21 16:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
	Edgecombe, Rick P, Wu, Binbin, x86@kernel.org

On Tue, Apr 21 2026 at 13:32, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 01:18:58PM +0200, Peter Zijlstra wrote:
>> >  		/*
>> > +		 * This is sadly required due to KVM, which invokes regular
>> > +		 * interrupt handlers with interrupt disabled state in @regs.
>> > +		 */
>> > +		instrumentation_begin();
>> > +		hrtimer_rearm_deferred();
>> > +		instrumentation_end();
>> > +
>> > +		/*
>> >  		 * IRQ flags state is correct already. Just tell RCU if it
>> >  		 * was not watching on entry.
>> >  		 */
>
> Ohhh, wait. What happens if you take a page-fault from NMI context? Does
> this then not result in trying to program the timer from NMI context?

Uuuurgh, yes.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-21 13:29                   ` Peter Zijlstra
@ 2026-04-21 16:36                     ` Thomas Gleixner
  2026-04-21 18:11                     ` Verma, Vishal L
  1 sibling, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2026-04-21 16:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
	Edgecombe, Rick P, Wu, Binbin, x86@kernel.org

On Tue, Apr 21 2026 at 15:29, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 03:19:53PM +0200, Peter Zijlstra wrote:
>> > Moo, you can't do that either, because it will ERETS/IRET and fuck up
>> > the state :/
>> 
>> How insane is something like this?

Pretty insane :)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-21 11:49             ` Peter Zijlstra
  2026-04-21 12:05               ` Peter Zijlstra
@ 2026-04-21 17:11               ` Thomas Gleixner
  2026-04-21 17:20                 ` Jim Mattson
  2026-04-21 19:18                 ` Verma, Vishal L
  1 sibling, 2 replies; 29+ messages in thread
From: Thomas Gleixner @ 2026-04-21 17:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
	Edgecombe, Rick P, Wu, Binbin, x86@kernel.org,
	Sean Christopherson, Paolo Bonzini

On Tue, Apr 21 2026 at 13:49, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 01:34:07PM +0200, Peter Zijlstra wrote:
>> > > > KVM invokes regular interrupts with pt_regs which have interrupts
>> > > > disabled. That's correct from the KVM point of view, but completely
>> > > > violates the obviously correct expectations of the interrupt entry/exit
>> > > > code.
>> > > 
>> > > Mooo :-(
>> 
>> Also, is this a x86/KVM 'special' or is this true for all arch/KVM that
>> use GENERIC_ENTRY?
>
> Should we not make asm_fred_entry_from_kvm()/VMX_DO_EVENT_IRQOFF fix IF
> on the fake frame instead? We know it will enable IRQs after doing
> handle_exit_irqoff() in vcpu_enter_guest().

Doesn't work :)

> SVM does not seem affected with this particular insanity.

Looks like. It will take the interrupt after local_irq_enable().

Now for VMX, that hrtimer_rearm_deferred() call should really go into
handle_external_interrupt_irqoff(), which in turn requires to export
__hrtimer_rearm_deferred().

But we can avoid that altogether. Something like the untested below.

Thanks,

        tglx
---
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -42,9 +42,10 @@
 #include <linux/timer.h>
 #include <linux/freezer.h>
 #include <linux/compat.h>
-
 #include <linux/uaccess.h>
 
+#include <asm/irq_regs.h>
+
 #include <trace/events/timer.h>
 
 #include "tick-internal.h"
@@ -2062,11 +2063,16 @@ void __hrtimer_rearm_deferred(void)
 static __always_inline void
 hrtimer_interrupt_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t expires_next)
 {
-	/* hrtimer_interrupt() just re-evaluated the first expiring timer */
-	cpu_base->deferred_needs_update = false;
-	/* Cache the expiry time */
-	cpu_base->deferred_expires_next = expires_next;
-	set_thread_flag(TIF_HRTIMER_REARM);
+	/* Lies, damned lies and virt */
+	if (likely(!regs_irqs_disabled(get_irq_regs()))) {
+		/* hrtimer_interrupt() just re-evaluated the first expiring timer */
+		cpu_base->deferred_needs_update = false;
+		/* Cache the expiry time */
+		cpu_base->deferred_expires_next = expires_next;
+		set_thread_flag(TIF_HRTIMER_REARM);
+	} else {
+		hrtimer_rearm(cpu_base, expires_next, false);
+	}
 }
 #else  /* CONFIG_HRTIMER_REARM_DEFERRED */
 static __always_inline void



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-21 17:11               ` Thomas Gleixner
@ 2026-04-21 17:20                 ` Jim Mattson
  2026-04-21 18:29                   ` Thomas Gleixner
  2026-04-21 19:18                 ` Verma, Vishal L
  1 sibling, 1 reply; 29+ messages in thread
From: Jim Mattson @ 2026-04-21 17:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
	Edgecombe, Rick P, Wu, Binbin, x86@kernel.org,
	Sean Christopherson, Paolo Bonzini

On Tue, Apr 21, 2026 at 10:14 AM Thomas Gleixner <tglx@kernel.org> wrote:
>
> On Tue, Apr 21 2026 at 13:49, Peter Zijlstra wrote:
> > On Tue, Apr 21, 2026 at 01:34:07PM +0200, Peter Zijlstra wrote:
> >> > > > KVM invokes regular interrupts with pt_regs which have interrupts
> >> > > > disabled. That's correct from the KVM point of view, but completely
> >> > > > violates the obviously correct expectations of the interrupt entry/exit
> >> > > > code.
> >> > >
> >> > > Mooo :-(
> >>
> >> Also, is this a x86/KVM 'special' or is this true for all arch/KVM that
> >> use GENERIC_ENTRY?
> >
> > Should we not make asm_fred_entry_from_kvm()/VMX_DO_EVENT_IRQOFF fix IF
> > on the fake frame instead? We know it will enable IRQs after doing
> > handle_exit_irqoff() in vcpu_enter_guest().
>
> Doesn't work :)
>
> > SVM does not seem affected with this particular insanity.
>
> Looks like. It will take the interrupt after local_irq_enable().

FWIW, VMX should work just like SVM if we clear VM_EXIT_ACK_INTR_ON_EXIT.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-21 13:29                   ` Peter Zijlstra
  2026-04-21 16:36                     ` Thomas Gleixner
@ 2026-04-21 18:11                     ` Verma, Vishal L
  1 sibling, 0 replies; 29+ messages in thread
From: Verma, Vishal L @ 2026-04-21 18:11 UTC (permalink / raw)
  To: peterz@infradead.org, tglx@kernel.org
  Cc: Wu, Binbin, kvm@vger.kernel.org, binbin.wu@linux.intel.com,
	Edgecombe, Rick P, x86@kernel.org

On Tue, 2026-04-21 at 15:29 +0200, Peter Zijlstra wrote:
> 
> diff --git a/arch/x86/include/uapi/asm/processor-flags.h
> b/arch/x86/include/uapi/asm/processor-flags.h
> index 81d0c8bf1137..d32edefde587 100644
> --- a/arch/x86/include/uapi/asm/processor-flags.h
> +++ b/arch/x86/include/uapi/asm/processor-flags.h
> @@ -14,6 +14,8 @@
>  #define X86_EFLAGS_FIXED	_BITUL(X86_EFLAGS_FIXED_BIT)
>  #define X86_EFLAGS_PF_BIT	2 /* Parity Flag */
>  #define X86_EFLAGS_PF		_BITUL(X86_EFLAGS_PF_BIT)
> +#define X86_EFLAGS_KVM_BIT	3 /* KVM Flag -- must be 0 */
> +#define X86_EFLAGS_KVM		_BITUL(X86_EFLAGS_PF_BIT)

I fixed up the copy-paste typo here -   _BITUL(X86_EFLAGS_KVM_BIT)

.. and with that the tests pass.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-21 17:20                 ` Jim Mattson
@ 2026-04-21 18:29                   ` Thomas Gleixner
  2026-04-21 18:55                     ` Sean Christopherson
  0 siblings, 1 reply; 29+ messages in thread
From: Thomas Gleixner @ 2026-04-21 18:29 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Peter Zijlstra, Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
	Edgecombe, Rick P, Wu, Binbin, x86@kernel.org,
	Sean Christopherson, Paolo Bonzini

On Tue, Apr 21 2026 at 10:20, Jim Mattson wrote:
> On Tue, Apr 21, 2026 at 10:14 AM Thomas Gleixner <tglx@kernel.org> wrote:
>>
>> On Tue, Apr 21 2026 at 13:49, Peter Zijlstra wrote:
>> > On Tue, Apr 21, 2026 at 01:34:07PM +0200, Peter Zijlstra wrote:
>> >> > > > KVM invokes regular interrupts with pt_regs which have interrupts
>> >> > > > disabled. That's correct from the KVM point of view, but completely
>> >> > > > violates the obviously correct expectations of the interrupt entry/exit
>> >> > > > code.
>> >> > >
>> >> > > Mooo :-(
>> >>
>> >> Also, is this a x86/KVM 'special' or is this true for all arch/KVM that
>> >> use GENERIC_ENTRY?
>> >
>> > Should we not make asm_fred_entry_from_kvm()/VMX_DO_EVENT_IRQOFF fix IF
>> > on the fake frame instead? We know it will enable IRQs after doing
>> > handle_exit_irqoff() in vcpu_enter_guest().
>>
>> Doesn't work :)
>>
>> > SVM does not seem affected with this particular insanity.
>>
>> Looks like. It will take the interrupt after local_irq_enable().
>
> FWIW, VMX should work just like SVM if we clear VM_EXIT_ACK_INTR_ON_EXIT.

I know. What's the point of that VM_EXIT_ACK_INTR_ON_EXIT exercise? Is
there any performance benefit or is it just used because it's there?

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-21 18:29                   ` Thomas Gleixner
@ 2026-04-21 18:55                     ` Sean Christopherson
  2026-04-21 20:06                       ` Peter Zijlstra
  2026-04-21 20:39                       ` Paolo Bonzini
  0 siblings, 2 replies; 29+ messages in thread
From: Sean Christopherson @ 2026-04-21 18:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jim Mattson, Peter Zijlstra, Binbin Wu, Vishal L Verma,
	kvm@vger.kernel.org, Rick P Edgecombe, Binbin Wu, x86@kernel.org,
	Paolo Bonzini

On Tue, Apr 21, 2026, Thomas Gleixner wrote:
> On Tue, Apr 21 2026 at 10:20, Jim Mattson wrote:
> > On Tue, Apr 21, 2026 at 10:14 AM Thomas Gleixner <tglx@kernel.org> wrote:
> >>
> >> On Tue, Apr 21 2026 at 13:49, Peter Zijlstra wrote:
> >> > On Tue, Apr 21, 2026 at 01:34:07PM +0200, Peter Zijlstra wrote:
> >> >> > > > KVM invokes regular interrupts with pt_regs which have interrupts
> >> >> > > > disabled. That's correct from the KVM point of view, but completely
> >> >> > > > violates the obviously correct expectations of the interrupt entry/exit
> >> >> > > > code.
> >> >> > >
> >> >> > > Mooo :-(
> >> >>
> >> >> Also, is this a x86/KVM 'special' or is this true for all arch/KVM that
> >> >> use GENERIC_ENTRY?
> >> >
> >> > Should we not make asm_fred_entry_from_kvm()/VMX_DO_EVENT_IRQOFF fix IF
> >> > on the fake frame instead? We know it will enable IRQs after doing
> >> > handle_exit_irqoff() in vcpu_enter_guest().
> >>
> >> Doesn't work :)
> >>
> >> > SVM does not seem affected with this particular insanity.
> >>
> >> Looks like. It will take the interrupt after local_irq_enable().
> >
> > FWIW, VMX should work just like SVM if we clear VM_EXIT_ACK_INTR_ON_EXIT.

Hell no.

> I know. What's the point of that VM_EXIT_ACK_INTR_ON_EXIT exercise? Is
> there any performance benefit or is it just used because it's there?

There are performance benefits, and it preserves ordering: the first IRQ that's
serviced by the host is guaranteed to be _the_ IRQ that triggered the VM-Exit.
E.g. with AMD's approach, any IRQs that arrive between the VM-Exit and STI (which
is a pretty big swath of code) could be serviced before the IRQ that triggered
the exit, depending on priority.

VM_EXIT_ACK_INTR_ON_EXIT also provides symmetry with Intel's handling of NMIs, as
NMIs are unconditionally "acked" on VM-Exit.

Even if performance is "fine", changing decades of fundamental KVM behavior is
terrifying.

Pulling in an earlier idea:

 : Now for VMX, that hrtimer_rearm_deferred() call should really go into
 : handle_external_interrupt_irqoff(), which in turn requires to export
 : __hrtimer_rearm_deferred().

IMO, that's the way to go.  But instead of exporting __hrtimer_rearm_deferred(),
move vmx_do_nmi_irqoff() and vmx_do_interrupt_irqoff() into core kernel entry code
(along with the assembly glue), and then EXPORT_SYMBOL_FOR_KVM those.  It'd mean
some extra surgery, e.g. to provide an equivalent to KVM's IDT lookup:

	gate_offset((gate_desc *)host_idt_base + vector)

But I suspect it would be a big net positive in the end.  E.g. the entry code
would *know* it's dealing with a direct call from KVM, and thus shouldn't need
to play pt_regs games.

Actually, even better would be to bury the FRED vs. not-FRED details in entry
code.  E.g. on the KVM invocation side, we could get to something like the below,
and I'm pretty sure _reduce_ the number of for-KVM exports in the process.

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a29896a9ef14..f6f5c124ed3b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7127,17 +7127,9 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
            "unexpected VM-Exit interrupt info: 0x%x", intr_info))
                return;
 
-       /*
-        * Invoke the kernel's IRQ handler for the vector.  Use the FRED path
-        * when it's available even if FRED isn't fully enabled, e.g. even if
-        * FRED isn't supported in hardware, in order to avoid the indirect
-        * CALL in the non-FRED path.
-        */
+       /* Forward the IRQ to the core kernel for processing. */
        kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
-       if (IS_ENABLED(CONFIG_X86_FRED))
-               fred_entry_from_kvm(EVENT_TYPE_EXTINT, vector);
-       else
-               vmx_do_interrupt_irqoff(gate_offset((gate_desc *)host_idt_base + vector));
+       x86_entry_from_kvm(EVENT_TYPE_EXTINT, vector);
        kvm_after_interrupt(vcpu);
 
        vcpu->arch.at_instruction_boundary = true;
@@ -7447,10 +7439,7 @@ noinstr void vmx_handle_nmi(struct kvm_vcpu *vcpu)
                return;
 
        kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
-       if (cpu_feature_enabled(X86_FEATURE_FRED))
-               fred_entry_from_kvm(EVENT_TYPE_NMI, NMI_VECTOR);
-       else
-               vmx_do_nmi_irqoff();
+       x86_entry_from_kvm(EVENT_TYPE_NMI, NMI_VECTOR);
        kvm_after_interrupt(vcpu);
 }

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-21 17:11               ` Thomas Gleixner
  2026-04-21 17:20                 ` Jim Mattson
@ 2026-04-21 19:18                 ` Verma, Vishal L
  1 sibling, 0 replies; 29+ messages in thread
From: Verma, Vishal L @ 2026-04-21 19:18 UTC (permalink / raw)
  To: peterz@infradead.org, tglx@kernel.org
  Cc: Wu, Binbin, kvm@vger.kernel.org, bonzini@redhat.com,
	seanjc@google.com, binbin.wu@linux.intel.com, Edgecombe, Rick P,
	x86@kernel.org

On Tue, 2026-04-21 at 19:11 +0200, Thomas Gleixner wrote:
> 
> Now for VMX, that hrtimer_rearm_deferred() call should really go into
> handle_external_interrupt_irqoff(), which in turn requires to export
> __hrtimer_rearm_deferred().
> 
> But we can avoid that altogether. Something like the untested below.

Tested with the below patch and the tests pass with this too.

> 
> Thanks,
> 
>         tglx
> ---
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -42,9 +42,10 @@
>  #include <linux/timer.h>
>  #include <linux/freezer.h>
>  #include <linux/compat.h>
> -
>  #include <linux/uaccess.h>
>  
> +#include <asm/irq_regs.h>
> +
>  #include <trace/events/timer.h>
>  
>  #include "tick-internal.h"
> @@ -2062,11 +2063,16 @@ void __hrtimer_rearm_deferred(void)
>  static __always_inline void
>  hrtimer_interrupt_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t expires_next)
>  {
> -	/* hrtimer_interrupt() just re-evaluated the first expiring timer */
> -	cpu_base->deferred_needs_update = false;
> -	/* Cache the expiry time */
> -	cpu_base->deferred_expires_next = expires_next;
> -	set_thread_flag(TIF_HRTIMER_REARM);
> +	/* Lies, damned lies and virt */
> +	if (likely(!regs_irqs_disabled(get_irq_regs()))) {
> +		/* hrtimer_interrupt() just re-evaluated the first expiring timer */
> +		cpu_base->deferred_needs_update = false;
> +		/* Cache the expiry time */
> +		cpu_base->deferred_expires_next = expires_next;
> +		set_thread_flag(TIF_HRTIMER_REARM);
> +	} else {
> +		hrtimer_rearm(cpu_base, expires_next, false);
> +	}
>  }
>  #else  /* CONFIG_HRTIMER_REARM_DEFERRED */
>  static __always_inline void
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-21 18:55                     ` Sean Christopherson
@ 2026-04-21 20:06                       ` Peter Zijlstra
  2026-04-21 20:46                         ` Peter Zijlstra
  2026-04-21 20:57                         ` Sean Christopherson
  2026-04-21 20:39                       ` Paolo Bonzini
  1 sibling, 2 replies; 29+ messages in thread
From: Peter Zijlstra @ 2026-04-21 20:06 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Thomas Gleixner, Jim Mattson, Binbin Wu, Vishal L Verma,
	kvm@vger.kernel.org, Rick P Edgecombe, Binbin Wu, x86@kernel.org,
	Paolo Bonzini

On Tue, Apr 21, 2026 at 11:55:33AM -0700, Sean Christopherson wrote:

> Pulling in an earlier idea:
> 
>  : Now for VMX, that hrtimer_rearm_deferred() call should really go into
>  : handle_external_interrupt_irqoff(), which in turn requires to export
>  : __hrtimer_rearm_deferred().
> 

> Actually, even better would be to bury the FRED vs. not-FRED details in entry
> code.  E.g. on the KVM invocation side, we could get to something like the below,
> and I'm pretty sure _reduce_ the number of for-KVM exports in the process.

Something like so then?

diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
index 72cae8e0ce85..83b4762d6ecb 100644
--- a/arch/x86/entry/Makefile
+++ b/arch/x86/entry/Makefile
@@ -13,7 +13,7 @@ CFLAGS_REMOVE_syscall_64.o	= $(CC_FLAGS_FTRACE)
 CFLAGS_syscall_32.o		+= -fno-stack-protector
 CFLAGS_syscall_64.o		+= -fno-stack-protector
 
-obj-y				:= entry.o entry_$(BITS).o syscall_$(BITS).o
+obj-y				:= entry.o entry_$(BITS).o syscall_$(BITS).o common.o
 
 obj-y				+= vdso/
 obj-y				+= vsyscall/
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
new file mode 100644
index 000000000000..4b0171abb083
--- /dev/null
+++ b/arch/x86/entry/common.c
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/kvm_types.h>
+#include <linux/hrtimer_rearm.h>
+#include <asm/entry-common.h>
+#include <asm/fred.h>
+#include <asm/desc.h>
+
+noinstr void x86_entry_from_kvm(unsigned int event_type, unsigned int vector)
+{
+#ifdef CONFIG_X86_64
+	fred_entry_from_kvm(event_type, vector);
+#else
+	idt_entry_from_kvm(vector);
+#endif
+	if (event_type == EVENT_TYPE_EXTINT) {
+		instrumentation_begin();
+		hrtimer_rearm_deferred();
+		instrumentation_end();
+	}
+}
+EXPORT_SYMBOL_FOR_KVM(x86_entry_from_kvm);
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 92c0b4a94e0a..96c3e9322297 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1224,3 +1224,36 @@ SYM_CODE_START(rewind_stack_and_make_dead)
 1:	jmp 1b
 SYM_CODE_END(rewind_stack_and_make_dead)
 .popsection
+
+.pushsection .noinstr.text, "ax"
+.macro IDT_DO_EVENT_IRQOFF call_insn call_target
+	/*
+	 * Unconditionally create a stack frame, getting the correct RSP on the
+	 * stack (for x86-64) would take two instructions anyways, and RBP can
+	 * be used to restore RSP to make objtool happy (see below).
+	 */
+	push %ebp
+	mov %esp, %ebp
+
+	pushf
+	push $__KERNEL_CS
+	\call_insn \call_target
+
+	/*
+	 * "Restore" RSP from RBP, even though IRET has already unwound RSP to
+	 * the correct value.  objtool doesn't know the callee will IRET and,
+	 * without the explicit restore, thinks the stack is getting walloped.
+	 * Using an unwind hint is problematic due to x86-64's dynamic alignment.
+	 */
+	leave
+	RET
+.endm
+
+SYM_FUNC_START(idt_do_interrupt_irqoff)
+	IDT_DO_EVENT_IRQOFF CALL_NOSPEC _ASM_ARG1
+SYM_FUNC_END(idt_do_interrupt_irqoff)
+
+SYM_FUNC_START(idt_do_nmi_irqoff)
+	IDT_DO_EVENT_IRQOFF call asm_exc_nmi_kvm_vmx
+SYM_FUNC_END(idt_do_nmi_irqoff)
+.popsection
diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
index 894f7f16eb80..0d2768ab836c 100644
--- a/arch/x86/entry/entry_64_fred.S
+++ b/arch/x86/entry/entry_64_fred.S
@@ -147,5 +147,4 @@ SYM_FUNC_START(asm_fred_entry_from_kvm)
 	RET
 
 SYM_FUNC_END(asm_fred_entry_from_kvm)
-EXPORT_SYMBOL_FOR_KVM(asm_fred_entry_from_kvm);
 #endif
diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index ec95fe44fa3a..cb24990f38fd 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -437,6 +437,7 @@ extern void idt_setup_early_traps(void);
 extern void idt_setup_traps(void);
 extern void idt_setup_apic_and_irq_gates(void);
 extern bool idt_is_f00f_address(unsigned long address);
+extern void idt_entry_from_kvm(unsigned int vector);
 
 #ifdef CONFIG_X86_64
 extern void idt_setup_early_pf(void);
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index 7535131c711b..eca24b5e07f4 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -97,4 +97,6 @@ static __always_inline void arch_exit_to_user_mode(void)
 }
 #define arch_exit_to_user_mode arch_exit_to_user_mode
 
+extern void x86_entry_from_kvm(unsigned int entry_type, unsigned int vector);
+
 #endif
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index 260456588756..d95d8d196cd4 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -266,6 +266,14 @@ void __init idt_setup_early_pf(void)
 	idt_setup_from_table(idt_table, early_pf_idts,
 			     ARRAY_SIZE(early_pf_idts), true);
 }
+#else
+void idt_entry_from_kvm(unsigned int vector)
+{
+	if (vector == NMI_VECTOR)
+		idt_do_nmi_irqoff();
+	else
+		idt_do_interrupt_irqoff(gate_offset(idt_table + vector));
+}
 #endif
 
 static void __init idt_map_in_cea(void)
diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index 8a481dae9cae..ff1f254a0ef4 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -31,38 +31,6 @@
 #define VCPU_R15	__VCPU_REGS_R15 * WORD_SIZE
 #endif
 
-.macro VMX_DO_EVENT_IRQOFF call_insn call_target
-	/*
-	 * Unconditionally create a stack frame, getting the correct RSP on the
-	 * stack (for x86-64) would take two instructions anyways, and RBP can
-	 * be used to restore RSP to make objtool happy (see below).
-	 */
-	push %_ASM_BP
-	mov %_ASM_SP, %_ASM_BP
-
-#ifdef CONFIG_X86_64
-	/*
-	 * Align RSP to a 16-byte boundary (to emulate CPU behavior) before
-	 * creating the synthetic interrupt stack frame for the IRQ/NMI.
-	 */
-	and  $-16, %rsp
-	push $__KERNEL_DS
-	push %rbp
-#endif
-	pushf
-	push $__KERNEL_CS
-	\call_insn \call_target
-
-	/*
-	 * "Restore" RSP from RBP, even though IRET has already unwound RSP to
-	 * the correct value.  objtool doesn't know the callee will IRET and,
-	 * without the explicit restore, thinks the stack is getting walloped.
-	 * Using an unwind hint is problematic due to x86-64's dynamic alignment.
-	 */
-	leave
-	RET
-.endm
-
 .section .noinstr.text, "ax"
 
 /**
@@ -320,10 +288,6 @@ SYM_INNER_LABEL_ALIGN(vmx_vmexit, SYM_L_GLOBAL)
 
 SYM_FUNC_END(__vmx_vcpu_run)
 
-SYM_FUNC_START(vmx_do_nmi_irqoff)
-	VMX_DO_EVENT_IRQOFF call asm_exc_nmi_kvm_vmx
-SYM_FUNC_END(vmx_do_nmi_irqoff)
-
 #ifndef CONFIG_CC_HAS_ASM_GOTO_OUTPUT
 
 /**
@@ -375,13 +339,3 @@ SYM_FUNC_START(vmread_error_trampoline)
 	RET
 SYM_FUNC_END(vmread_error_trampoline)
 #endif
-
-.section .text, "ax"
-
-#ifndef CONFIG_X86_FRED
-
-SYM_FUNC_START(vmx_do_interrupt_irqoff)
-	VMX_DO_EVENT_IRQOFF CALL_NOSPEC _ASM_ARG1
-SYM_FUNC_END(vmx_do_interrupt_irqoff)
-
-#endif
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a29896a9ef14..f6f5c124ed3b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7127,17 +7127,9 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
 	    "unexpected VM-Exit interrupt info: 0x%x", intr_info))
 		return;
 
-	/*
-	 * Invoke the kernel's IRQ handler for the vector.  Use the FRED path
-	 * when it's available even if FRED isn't fully enabled, e.g. even if
-	 * FRED isn't supported in hardware, in order to avoid the indirect
-	 * CALL in the non-FRED path.
-	 */
+	/* Forward the IRQ to the core kernel for processing. */
 	kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
-	if (IS_ENABLED(CONFIG_X86_FRED))
-		fred_entry_from_kvm(EVENT_TYPE_EXTINT, vector);
-	else
-		vmx_do_interrupt_irqoff(gate_offset((gate_desc *)host_idt_base + vector));
+	x86_entry_from_kvm(EVENT_TYPE_EXTINT, vector);
 	kvm_after_interrupt(vcpu);
 
 	vcpu->arch.at_instruction_boundary = true;
@@ -7447,10 +7439,7 @@ noinstr void vmx_handle_nmi(struct kvm_vcpu *vcpu)
 		return;
 
 	kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
-	if (cpu_feature_enabled(X86_FEATURE_FRED))
-		fred_entry_from_kvm(EVENT_TYPE_NMI, NMI_VECTOR);
-	else
-		vmx_do_nmi_irqoff();
+	x86_entry_from_kvm(EVENT_TYPE_NMI, NMI_VECTOR);
 	kvm_after_interrupt(vcpu);
 }
 

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-21 18:55                     ` Sean Christopherson
  2026-04-21 20:06                       ` Peter Zijlstra
@ 2026-04-21 20:39                       ` Paolo Bonzini
  1 sibling, 0 replies; 29+ messages in thread
From: Paolo Bonzini @ 2026-04-21 20:39 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Thomas Gleixner, Jim Mattson, Peter Zijlstra, Binbin Wu,
	Vishal L Verma, kvm, Rick P Edgecombe, Binbin Wu,
	the arch/x86 maintainers, Paolo Bonzini

On Tue, Apr 21, 2026 at 19:55, Sean Christopherson <seanjc@google.com> wrote:
>
> > > FWIW, VMX should work just like SVM if we clear VM_EXIT_ACK_INTR_ON_EXIT.
>
> Hell no.
>
> > I know. What's the point of that VM_EXIT_ACK_INTR_ON_EXIT exercise? Is
> > there any performance benefit or is it just used because it's there?
>
> There are performance benefits, and it preserves ordering [...]
> NMIs are unconditionally "acked" on VM-Exit.

Not that I disagree but...

> Even if performance is "fine", changing decades of fundamental KVM behavior is
> terrifying.

... it's not decades, ack on VM exit is actually relatively recent (10
years out of 20 :)). The reason why it was introduced is another killer
for the idea, though. Posted interrupts require it, for some reason
only known to Intel.

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-21 20:06                       ` Peter Zijlstra
@ 2026-04-21 20:46                         ` Peter Zijlstra
  2026-04-21 20:57                         ` Sean Christopherson
  1 sibling, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2026-04-21 20:46 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Thomas Gleixner, Jim Mattson, Binbin Wu, Vishal L Verma,
	kvm@vger.kernel.org, Rick P Edgecombe, Binbin Wu, x86@kernel.org,
	Paolo Bonzini

On Tue, Apr 21, 2026 at 10:06:20PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 11:55:33AM -0700, Sean Christopherson wrote:
> 
> > Pulling in an earlier idea:
> > 
> >  : Now for VMX, that hrtimer_rearm_deferred() call should really go into
> >  : handle_external_interrupt_irqoff(), which in turn requires to export
> >  : __hrtimer_rearm_deferred().
> > 
> 
> > Actually, even better would be to bury the FRED vs. not-FRED details in entry
> > code.  E.g. on the KVM invocation side, we could get to something like the below,
> > and I'm pretty sure _reduce_ the number of for-KVM exports in the process.
> 
> Something like so then?

And this one seems to build on ARCH=i386 too.

---
diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
index 72cae8e0ce85..83b4762d6ecb 100644
--- a/arch/x86/entry/Makefile
+++ b/arch/x86/entry/Makefile
@@ -13,7 +13,7 @@ CFLAGS_REMOVE_syscall_64.o	= $(CC_FLAGS_FTRACE)
 CFLAGS_syscall_32.o		+= -fno-stack-protector
 CFLAGS_syscall_64.o		+= -fno-stack-protector
 
-obj-y				:= entry.o entry_$(BITS).o syscall_$(BITS).o
+obj-y				:= entry.o entry_$(BITS).o syscall_$(BITS).o common.o
 
 obj-y				+= vdso/
 obj-y				+= vsyscall/
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
new file mode 100644
index 000000000000..8de94a590b26
--- /dev/null
+++ b/arch/x86/entry/common.c
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/entry-common.h>
+#include <linux/kvm_types.h>
+#include <linux/hrtimer_rearm.h>
+#include <asm/fred.h>
+#include <asm/desc.h>
+
+noinstr void x86_entry_from_kvm(unsigned int event_type, unsigned int vector)
+{
+#ifdef CONFIG_X86_64
+	fred_entry_from_kvm(event_type, vector);
+#else
+	idt_entry_from_kvm(vector);
+#endif
+	if (event_type == EVENT_TYPE_EXTINT) {
+		instrumentation_begin();
+		hrtimer_rearm_deferred();
+		instrumentation_end();
+	}
+}
+EXPORT_SYMBOL_FOR_KVM(x86_entry_from_kvm);
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 92c0b4a94e0a..9324e97d14cf 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1224,3 +1224,36 @@ SYM_CODE_START(rewind_stack_and_make_dead)
 1:	jmp 1b
 SYM_CODE_END(rewind_stack_and_make_dead)
 .popsection
+
+.pushsection .noinstr.text, "ax"
+.macro IDT_DO_EVENT_IRQOFF call_insn call_target
+	/*
+	 * Unconditionally create a stack frame, getting the correct RSP on the
+	 * stack (for x86-64) would take two instructions anyways, and RBP can
+	 * be used to restore RSP to make objtool happy (see below).
+	 */
+	push %ebp
+	mov %esp, %ebp
+
+	pushf
+	push $__KERNEL_CS
+	\call_insn \call_target
+
+	/*
+	 * "Restore" RSP from RBP, even though IRET has already unwound RSP to
+	 * the correct value.  objtool doesn't know the callee will IRET and,
+	 * without the explicit restore, thinks the stack is getting walloped.
+	 * Using an unwind hint is problematic due to x86-64's dynamic alignment.
+	 */
+	leave
+	RET
+.endm
+
+SYM_FUNC_START(idt_do_interrupt_irqoff)
+	IDT_DO_EVENT_IRQOFF CALL_NOSPEC _ASM_ARG1
+SYM_FUNC_END(idt_do_interrupt_irqoff)
+
+SYM_FUNC_START(idt_do_nmi_irqoff)
+	IDT_DO_EVENT_IRQOFF call asm_exc_nmi
+SYM_FUNC_END(idt_do_nmi_irqoff)
+.popsection
diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
index 894f7f16eb80..0d2768ab836c 100644
--- a/arch/x86/entry/entry_64_fred.S
+++ b/arch/x86/entry/entry_64_fred.S
@@ -147,5 +147,4 @@ SYM_FUNC_START(asm_fred_entry_from_kvm)
 	RET
 
 SYM_FUNC_END(asm_fred_entry_from_kvm)
-EXPORT_SYMBOL_FOR_KVM(asm_fred_entry_from_kvm);
 #endif
diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index ec95fe44fa3a..f44d6a606b4c 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -438,6 +438,10 @@ extern void idt_setup_traps(void);
 extern void idt_setup_apic_and_irq_gates(void);
 extern bool idt_is_f00f_address(unsigned long address);
 
+extern void idt_do_interrupt_irqoff(unsigned int vector);
+extern void idt_do_nmi_irqoff(void);
+extern void idt_entry_from_kvm(unsigned int vector);
+
 #ifdef CONFIG_X86_64
 extern void idt_setup_early_pf(void);
 #else
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index 7535131c711b..eca24b5e07f4 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -97,4 +97,6 @@ static __always_inline void arch_exit_to_user_mode(void)
 }
 #define arch_exit_to_user_mode arch_exit_to_user_mode
 
+extern void x86_entry_from_kvm(unsigned int entry_type, unsigned int vector);
+
 #endif
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 42bf6a58ec36..db4072875f5f 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -633,17 +633,6 @@ DECLARE_IDTENTRY_RAW(X86_TRAP_MC,	xenpv_exc_machine_check);
 #endif
 
 /* NMI */
-
-#if IS_ENABLED(CONFIG_KVM_INTEL)
-/*
- * Special entry point for VMX which invokes this on the kernel stack, even for
- * 64-bit, i.e. without using an IST.  asm_exc_nmi() requires an IST to work
- * correctly vs. the NMI 'executing' marker.  Used for 32-bit kernels as well
- * to avoid more ifdeffery.
- */
-DECLARE_IDTENTRY(X86_TRAP_NMI,		exc_nmi_kvm_vmx);
-#endif
-
 DECLARE_IDTENTRY_NMI(X86_TRAP_NMI,	exc_nmi);
 #ifdef CONFIG_XEN_PV
 DECLARE_IDTENTRY_RAW(X86_TRAP_NMI,	xenpv_exc_nmi);
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index 260456588756..d95d8d196cd4 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -266,6 +266,14 @@ void __init idt_setup_early_pf(void)
 	idt_setup_from_table(idt_table, early_pf_idts,
 			     ARRAY_SIZE(early_pf_idts), true);
 }
+#else
+void idt_entry_from_kvm(unsigned int vector)
+{
+	if (vector == NMI_VECTOR)
+		idt_do_nmi_irqoff();
+	else
+		idt_do_interrupt_irqoff(gate_offset(idt_table + vector));
+}
 #endif
 
 static void __init idt_map_in_cea(void)
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 3d239ed12744..06fe225fb0a2 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -609,14 +609,6 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
 		goto nmi_restart;
 }
 
-#if IS_ENABLED(CONFIG_KVM_INTEL)
-DEFINE_IDTENTRY_RAW(exc_nmi_kvm_vmx)
-{
-	exc_nmi(regs);
-}
-EXPORT_SYMBOL_FOR_KVM(asm_exc_nmi_kvm_vmx);
-#endif
-
 #ifdef CONFIG_NMI_CHECK_CPU
 
 static char *nmi_check_stall_msg[] = {
diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index 8a481dae9cae..ff1f254a0ef4 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -31,38 +31,6 @@
 #define VCPU_R15	__VCPU_REGS_R15 * WORD_SIZE
 #endif
 
-.macro VMX_DO_EVENT_IRQOFF call_insn call_target
-	/*
-	 * Unconditionally create a stack frame, getting the correct RSP on the
-	 * stack (for x86-64) would take two instructions anyways, and RBP can
-	 * be used to restore RSP to make objtool happy (see below).
-	 */
-	push %_ASM_BP
-	mov %_ASM_SP, %_ASM_BP
-
-#ifdef CONFIG_X86_64
-	/*
-	 * Align RSP to a 16-byte boundary (to emulate CPU behavior) before
-	 * creating the synthetic interrupt stack frame for the IRQ/NMI.
-	 */
-	and  $-16, %rsp
-	push $__KERNEL_DS
-	push %rbp
-#endif
-	pushf
-	push $__KERNEL_CS
-	\call_insn \call_target
-
-	/*
-	 * "Restore" RSP from RBP, even though IRET has already unwound RSP to
-	 * the correct value.  objtool doesn't know the callee will IRET and,
-	 * without the explicit restore, thinks the stack is getting walloped.
-	 * Using an unwind hint is problematic due to x86-64's dynamic alignment.
-	 */
-	leave
-	RET
-.endm
-
 .section .noinstr.text, "ax"
 
 /**
@@ -320,10 +288,6 @@ SYM_INNER_LABEL_ALIGN(vmx_vmexit, SYM_L_GLOBAL)
 
 SYM_FUNC_END(__vmx_vcpu_run)
 
-SYM_FUNC_START(vmx_do_nmi_irqoff)
-	VMX_DO_EVENT_IRQOFF call asm_exc_nmi_kvm_vmx
-SYM_FUNC_END(vmx_do_nmi_irqoff)
-
 #ifndef CONFIG_CC_HAS_ASM_GOTO_OUTPUT
 
 /**
@@ -375,13 +339,3 @@ SYM_FUNC_START(vmread_error_trampoline)
 	RET
 SYM_FUNC_END(vmread_error_trampoline)
 #endif
-
-.section .text, "ax"
-
-#ifndef CONFIG_X86_FRED
-
-SYM_FUNC_START(vmx_do_interrupt_irqoff)
-	VMX_DO_EVENT_IRQOFF CALL_NOSPEC _ASM_ARG1
-SYM_FUNC_END(vmx_do_interrupt_irqoff)
-
-#endif
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a29896a9ef14..f6f5c124ed3b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7127,17 +7127,9 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
 	    "unexpected VM-Exit interrupt info: 0x%x", intr_info))
 		return;
 
-	/*
-	 * Invoke the kernel's IRQ handler for the vector.  Use the FRED path
-	 * when it's available even if FRED isn't fully enabled, e.g. even if
-	 * FRED isn't supported in hardware, in order to avoid the indirect
-	 * CALL in the non-FRED path.
-	 */
+	/* Forward the IRQ to the core kernel for processing. */
 	kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
-	if (IS_ENABLED(CONFIG_X86_FRED))
-		fred_entry_from_kvm(EVENT_TYPE_EXTINT, vector);
-	else
-		vmx_do_interrupt_irqoff(gate_offset((gate_desc *)host_idt_base + vector));
+	x86_entry_from_kvm(EVENT_TYPE_EXTINT, vector);
 	kvm_after_interrupt(vcpu);
 
 	vcpu->arch.at_instruction_boundary = true;
@@ -7447,10 +7439,7 @@ noinstr void vmx_handle_nmi(struct kvm_vcpu *vcpu)
 		return;
 
 	kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
-	if (cpu_feature_enabled(X86_FEATURE_FRED))
-		fred_entry_from_kvm(EVENT_TYPE_NMI, NMI_VECTOR);
-	else
-		vmx_do_nmi_irqoff();
+	x86_entry_from_kvm(EVENT_TYPE_NMI, NMI_VECTOR);
 	kvm_after_interrupt(vcpu);
 }
 

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: CPU Lockups in KVM with deferred hrtimer rearming
  2026-04-21 20:06                       ` Peter Zijlstra
  2026-04-21 20:46                         ` Peter Zijlstra
@ 2026-04-21 20:57                         ` Sean Christopherson
  1 sibling, 0 replies; 29+ messages in thread
From: Sean Christopherson @ 2026-04-21 20:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Jim Mattson, Binbin Wu, Vishal L Verma,
	kvm@vger.kernel.org, Rick P Edgecombe, Binbin Wu, x86@kernel.org,
	Paolo Bonzini

On Tue, Apr 21, 2026, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 11:55:33AM -0700, Sean Christopherson wrote:
> 
> > Pulling in an earlier idea:
> > 
> >  : Now for VMX, that hrtimer_rearm_deferred() call should really go into
> >  : handle_external_interrupt_irqoff(), which in turn requires to export
> >  : __hrtimer_rearm_deferred().
> > 
> 
> > Actually, even better would be to bury the FRED vs. not-FRED details in entry
> > code.  E.g. on the KVM invocation side, we could get to something like the below,
> > and I'm pretty sure _reduce_ the number of for-KVM exports in the process.
> 
> Something like so then?

Yep!

> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> new file mode 100644
> index 000000000000..4b0171abb083
> --- /dev/null
> +++ b/arch/x86/entry/common.c
> @@ -0,0 +1,22 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#include <linux/kvm_types.h>
> +#include <linux/hrtimer_rearm.h>

For CONFIG_X86_FRED=n, which is possible on x86-64 if CONFIG_KVM_INTEL=n, this

#include <linux/sched/task_stack.h>

is needed so that task_pt_regs() can find task_stack_page() (and including
task_stack.h in processor.h would create cyclical includes).

> +#include <asm/entry-common.h>
> +#include <asm/fred.h>
> +#include <asm/desc.h>
> +

Related to CONFIG_X86_FRED=n, I vote to wrap this API with #if IS_ENABLED(CONFIG_KVM_INTEL)
and then delete the fred_entry_from_kvm() stub so that a goof results in a build
failure.  That'd also be a good place for a comment to explain some of the usage.

> +noinstr void x86_entry_from_kvm(unsigned int event_type, unsigned int vector)
> +{
> +#ifdef CONFIG_X86_64
> +	fred_entry_from_kvm(event_type, vector);
> +#else
> +	idt_entry_from_kvm(vector);
> +#endif

...

> +SYM_FUNC_START(idt_do_interrupt_irqoff)
> +	IDT_DO_EVENT_IRQOFF CALL_NOSPEC _ASM_ARG1
> +SYM_FUNC_END(idt_do_interrupt_irqoff)
> +
> +SYM_FUNC_START(idt_do_nmi_irqoff)
> +	IDT_DO_EVENT_IRQOFF call asm_exc_nmi_kvm_vmx
> +SYM_FUNC_END(idt_do_nmi_irqoff)

These need to be declared, and the KVM declarations can be deleted.

>  static void __init idt_map_in_cea(void)
> diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
> index 8a481dae9cae..ff1f254a0ef4 100644
> --- a/arch/x86/kvm/vmx/vmenter.S
> +++ b/arch/x86/kvm/vmx/vmenter.S
> @@ -31,38 +31,6 @@
>  #define VCPU_R15	__VCPU_REGS_R15 * WORD_SIZE
>  #endif
>  
> -.macro VMX_DO_EVENT_IRQOFF call_insn call_target
> -	/*
> -	 * Unconditionally create a stack frame, getting the correct RSP on the
> -	 * stack (for x86-64) would take two instructions anyways, and RBP can
> -	 * be used to restore RSP to make objtool happy (see below).
> -	 */
> -	push %_ASM_BP
> -	mov %_ASM_SP, %_ASM_BP
> -
> -#ifdef CONFIG_X86_64
> -	/*
> -	 * Align RSP to a 16-byte boundary (to emulate CPU behavior) before
> -	 * creating the synthetic interrupt stack frame for the IRQ/NMI.
> -	 */
> -	and  $-16, %rsp
> -	push $__KERNEL_DS
> -	push %rbp
> -#endif

For anyone else having an -ENOCOFFEE moment, this has been dead code since commit
28d11e4548b7 ("x86/fred: KVM: VMX: Always use FRED for IRQs when CONFIG_X86_FRED=y").

This as delta? (I had typed this all up before Peter posted a new verison, so
dammit I'm sending it!)


diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 4b0171abb083..b039276bede9 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -2,10 +2,20 @@
 
 #include <linux/kvm_types.h>
 #include <linux/hrtimer_rearm.h>
+#include <linux/sched/task_stack.h>
 #include <asm/entry-common.h>
 #include <asm/fred.h>
 #include <asm/desc.h>
 
+#if IS_ENABLED(CONFIG_KVM_INTEL)
+/*
+ * On VMX, NMIs and IRQs (as configured by KVM) are acknowledged by hardware as
+ * part of the VM-Exit, i.e. the event itself is consumed as part of the VM-Exit.
+ * x86_entry_from_kvm() is invoked by KVM to effectively forward NMIs and IRQs
+ * to the kernel for servicing.  On SVM, a.k.a. AMD, the NMI/IRQ VM-Exit is
+ * purely a signal that an NMI/IRQ is pending, i.e. the event that triggered
+ * the VM-Exit is held pending until it's unblocked in the host.
+ */
 noinstr void x86_entry_from_kvm(unsigned int event_type, unsigned int vector)
 {
 #ifdef CONFIG_X86_64
@@ -20,3 +30,4 @@ noinstr void x86_entry_from_kvm(unsigned int event_type, unsigned int vector)
        }
 }
 EXPORT_SYMBOL_FOR_KVM(x86_entry_from_kvm);
+#endif
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index eca24b5e07f4..2421b1edf77e 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -98,5 +98,7 @@ static __always_inline void arch_exit_to_user_mode(void)
 #define arch_exit_to_user_mode arch_exit_to_user_mode
 
 extern void x86_entry_from_kvm(unsigned int entry_type, unsigned int vector);
+extern void idt_do_interrupt_irqoff(unsigned long entry);
+extern void idt_do_nmi_irqoff(void);
 
 #endif
diff --git a/arch/x86/include/asm/fred.h b/arch/x86/include/asm/fred.h
index 2bb65677c079..18a2f811c358 100644
--- a/arch/x86/include/asm/fred.h
+++ b/arch/x86/include/asm/fred.h
@@ -110,7 +110,6 @@ static __always_inline unsigned long fred_event_data(struct pt_regs *regs) { ret
 static inline void cpu_init_fred_exceptions(void) { }
 static inline void cpu_init_fred_rsps(void) { }
 static inline void fred_complete_exception_setup(void) { }
-static inline void fred_entry_from_kvm(unsigned int type, unsigned int vector) { }
 static inline void fred_sync_rsp0(unsigned long rsp0) { }
 static inline void fred_update_rsp0(void) { }
 #endif /* CONFIG_X86_FRED */
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 3d239ed12744..52a3afb1b79e 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -614,7 +614,6 @@ DEFINE_IDTENTRY_RAW(exc_nmi_kvm_vmx)
 {
        exc_nmi(regs);
 }
-EXPORT_SYMBOL_FOR_KVM(asm_exc_nmi_kvm_vmx);
 #endif
 
 #ifdef CONFIG_NMI_CHECK_CPU
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f6f5c124ed3b..753f0dbb9cf8 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7083,9 +7083,6 @@ void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
        vmcs_write64(EOI_EXIT_BITMAP3, eoi_exit_bitmap[3]);
 }
 
-void vmx_do_interrupt_irqoff(unsigned long entry);
-void vmx_do_nmi_irqoff(void);
-
 static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
 {
        /*



^ permalink raw reply related	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2026-04-21 20:57 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-16 20:50 CPU Lockups in KVM with deferred hrtimer rearming Verma, Vishal L
2026-04-20 15:00 ` Thomas Gleixner
2026-04-20 15:22   ` Thomas Gleixner
2026-04-20 20:57   ` Verma, Vishal L
2026-04-20 22:19     ` Thomas Gleixner
2026-04-20 22:24       ` Verma, Vishal L
2026-04-21  6:29         ` Thomas Gleixner
2026-04-21  4:51   ` Binbin Wu
2026-04-21  7:39     ` Thomas Gleixner
2026-04-21 11:18       ` Peter Zijlstra
2026-04-21 11:32         ` Peter Zijlstra
2026-04-21 11:34           ` Peter Zijlstra
2026-04-21 11:49             ` Peter Zijlstra
2026-04-21 12:05               ` Peter Zijlstra
2026-04-21 13:19                 ` Peter Zijlstra
2026-04-21 13:29                   ` Peter Zijlstra
2026-04-21 16:36                     ` Thomas Gleixner
2026-04-21 18:11                     ` Verma, Vishal L
2026-04-21 17:11               ` Thomas Gleixner
2026-04-21 17:20                 ` Jim Mattson
2026-04-21 18:29                   ` Thomas Gleixner
2026-04-21 18:55                     ` Sean Christopherson
2026-04-21 20:06                       ` Peter Zijlstra
2026-04-21 20:46                         ` Peter Zijlstra
2026-04-21 20:57                         ` Sean Christopherson
2026-04-21 20:39                       ` Paolo Bonzini
2026-04-21 19:18                 ` Verma, Vishal L
2026-04-21 16:30           ` Thomas Gleixner
2026-04-21 16:11       ` Verma, Vishal L

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox