* CPU Lockups in KVM with deferred hrtimer rearming
@ 2026-04-16 20:50 Verma, Vishal L
2026-04-20 15:00 ` Thomas Gleixner
0 siblings, 1 reply; 39+ messages in thread
From: Verma, Vishal L @ 2026-04-16 20:50 UTC (permalink / raw)
To: peterz@infradead.org, tglx@kernel.org
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
x86@kernel.org
Hi Peter,
We noticed the KVM unit test 'x2apic' (APIC LVT timer one shot)
failing, and also some TDX-specific tests running into hard lockups on
multiple CPUs on a 192-CPU Emerald Rapids system. We traced it to the
hrtimers deferred rearming merge.
Making CONFIG_HRTIMER_REARM_DEFERRED default to n in Kconfig made both
pass.
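For reference, this is how we flipped it off for testing (symbol name
as in our tree, using the kernel's in-tree scripts/config helper):

```shell
# Disable deferred hrtimer rearming in an existing .config,
# then let Kconfig resolve any dependent options.
scripts/config --disable HRTIMER_REARM_DEFERRED
make olddefconfig
```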
This is the hard lockup splat:
watchdog: CPU98: Watchdog detected hard LOCKUP on cpu 98
Modules linked in: openvswitch nsh tls ipt_REJECT iptable_mangle iptable_nat iptable_filter ip_tables bridge stp llc kvm_intel kvm irqbypass sunrpc
irq event stamp: 34998
hardirqs last enabled at (34997): [<ffffffffc090ce6d>] tdx_vcpu_run+0x5d/0x350 [kvm_intel]
hardirqs last disabled at (34998): [<ffffffffb9add6df>] exc_nmi+0xaf/0x1a0
softirqs last enabled at (34404): [<ffffffffb83fdd93>] __irq_exit_rcu+0xe3/0x160
softirqs last disabled at (34395): [<ffffffffb83fdd93>] __irq_exit_rcu+0xe3/0x160
CPU: 98 UID: 0 PID: 54785 Comm: qemu-system-x86 Not tainted 7.0.0-g10324ed6a556 #1 PREEMPT(full)
Hardware name: HPE ProLiant DL380 Gen11/ProLiant DL380 Gen11, BIOS 2.48 03/11/2025
RIP: 0010:vmx_do_nmi_irqoff+0x13/0x20 [kvm_intel]
Code: ff ff 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 48 83 e4 f0 6a 18 55 9c 6a 10 e8 3d db 6e f7 <c9> c3 cc cc cc cc 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90
RSP: 0018:ff8d3a069bdf3af0 EFLAGS: 00000086
RAX: ff3cc96963d68000 RBX: ff3cc96963d68000 RCX: 4000000200000000
RDX: 0000000080000200 RSI: ff3cc96963d699d0 RDI: ff3cc96963d68000
RBP: ff8d3a069bdf3af0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ff3cc968d03d0000 R14: ff3cc968d03d0000 R15: 0000000000000000
FS: 00007f26ab7fe6c0(0000) GS:ff3cc98782d76000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000001544af004 CR4: 0000000000f73ef0
PKRU: 00000000
Call Trace:
<TASK>
vmx_handle_nmi+0xdf/0x140 [kvm_intel]
tdx_vcpu_enter_exit+0xd5/0x300 [kvm_intel]
tdx_vcpu_run+0x5d/0x350 [kvm_intel]
vcpu_run+0xd4a/0x1800 [kvm]
? __local_bh_enable_ip+0x7b/0xf0
? kvm_arch_vcpu_ioctl_run+0x38b/0x5f0 [kvm]
? kvm_arch_vcpu_ioctl_run+0xb9/0x5f0 [kvm]
kvm_arch_vcpu_ioctl_run+0x38b/0x5f0 [kvm]
kvm_vcpu_ioctl+0x2ef/0xb00 [kvm]
? __fget_files+0x2b/0x190
? find_held_lock+0x2b/0x80
__x64_sys_ioctl+0x97/0xe0
do_syscall_64+0xf4/0x1540
? __x64_sys_ioctl+0xb1/0xe0
? trace_hardirqs_on_prepare+0xd2/0xf0
? do_syscall_64+0x225/0x1540
? trace_hardirqs_on+0x18/0x100
? __local_bh_enable_ip+0x7b/0xf0
? arch_do_signal_or_restart+0x155/0x250
? trace_hardirqs_off+0x4e/0xf0
? exit_to_user_mode_loop+0x150/0x4e0
? trace_hardirqs_on_prepare+0xd2/0xf0
? do_syscall_64+0x225/0x1540
? do_user_addr_fault+0x36c/0x6b0
? lockdep_hardirqs_on_prepare+0xdb/0x190
? trace_hardirqs_on+0x18/0x100
? do_syscall_64+0xab/0x1540
? exc_page_fault+0x12c/0x2b0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f45f7ae00ed
Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
RSP: 002b:00007f26ab7f3e70 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f26ab7fe6c0 RCX: 00007f45f7ae00ed
RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000099
RBP: 00007f26ab7f3ec0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f26ab7fe6c0
R13: 00007ffdc7adecd0 R14: 00007f26ab7fecdc R15: 00007ffdc7adedd7
</TASK>
I tried out an AI-assisted patch (below) which does happen to solve
it, but I'm not familiar with this area, and not sure if this is the
right fix.
---
diff --git a/include/linux/entry-virt.h b/include/linux/entry-virt.h
index bfa767702d9a..c4856c252412 100644
--- a/include/linux/entry-virt.h
+++ b/include/linux/entry-virt.h
@@ -4,6 +4,7 @@
#include <linux/static_call_types.h>
#include <linux/resume_user_mode.h>
+#include <linux/hrtimer_rearm.h>
#include <linux/syscalls.h>
#include <linux/seccomp.h>
#include <linux/sched.h>
@@ -58,6 +59,7 @@ int xfer_to_guest_mode_handle_work(void);
static inline void xfer_to_guest_mode_prepare(void)
{
lockdep_assert_irqs_disabled();
+ hrtimer_rearm_deferred();
tick_nohz_user_enter_prepare();
}
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 5bd6efe598f0..f3bd084d9a72 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -2058,6 +2058,7 @@ void __hrtimer_rearm_deferred(void)
}
hrtimer_rearm(cpu_base, expires_next, true);
}
+EXPORT_SYMBOL_GPL(__hrtimer_rearm_deferred);
static __always_inline void
hrtimer_interrupt_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t expires_next)
^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-16 20:50 CPU Lockups in KVM with deferred hrtimer rearming Verma, Vishal L
@ 2026-04-20 15:00 ` Thomas Gleixner
2026-04-20 15:22 ` Thomas Gleixner
` (2 more replies)
0 siblings, 3 replies; 39+ messages in thread
From: Thomas Gleixner @ 2026-04-20 15:00 UTC (permalink / raw)
To: Verma, Vishal L, peterz@infradead.org
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
x86@kernel.org
On Thu, Apr 16 2026 at 20:50, Vishal L. Verma wrote:
> I tried out an AI-assisted patch (below) which does happen to solve
> it, but I'm not familiar with this area, and not sure if this is the
> right fix.
>
> diff --git a/include/linux/entry-virt.h b/include/linux/entry-virt.h
> index bfa767702d9a..c4856c252412 100644
> --- a/include/linux/entry-virt.h
> +++ b/include/linux/entry-virt.h
> @@ -4,6 +4,7 @@
>
> #include <linux/static_call_types.h>
> #include <linux/resume_user_mode.h>
> +#include <linux/hrtimer_rearm.h>
> #include <linux/syscalls.h>
> #include <linux/seccomp.h>
> #include <linux/sched.h>
> @@ -58,6 +59,7 @@ int xfer_to_guest_mode_handle_work(void);
> static inline void xfer_to_guest_mode_prepare(void)
> {
> lockdep_assert_irqs_disabled();
> + hrtimer_rearm_deferred();
> tick_nohz_user_enter_prepare();
This code should never be reached with a rearm pending. Something else
went wrong earlier. So while the patch "works" it papers over the
underlying problem.
Can you please do the following:
1) Apply the patch below
2) Enable function tracing and the hrtimer* trace events
3) Enable tracing if it has been disabled already
echo 1 >/sys/kernel/tracing/tracing_on
4) Run the tests and wait for /sys/kernel/tracing/tracing_on to
become 0, which means the problem triggered.
5) Retrieve the trace from /sys/kernel/tracing/trace and provide it
somewhere to download from or send it to me compressed offlist.
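For steps 2 and 3, something along these lines should do (assuming
tracefs is mounted at /sys/kernel/tracing and you are root):

```shell
# Step 2: select the function tracer and enable all hrtimer trace events
echo function > /sys/kernel/tracing/current_tracer
echo 1 > /sys/kernel/tracing/events/hrtimer/enable
# Step 3: make sure tracing is on
echo 1 > /sys/kernel/tracing/tracing_on
```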
Thanks,
tglx
---
diff --git a/include/linux/entry-virt.h b/include/linux/entry-virt.h
index bfa767702d9a..ab73963a7496 100644
--- a/include/linux/entry-virt.h
+++ b/include/linux/entry-virt.h
@@ -58,6 +58,10 @@ int xfer_to_guest_mode_handle_work(void);
static inline void xfer_to_guest_mode_prepare(void)
{
lockdep_assert_irqs_disabled();
+ if (test_thread_flag(TIF_HRTIMER_REARM)) {
+ tracing_off();
+ hrtimer_rearm_deferred();
+ }
tick_nohz_user_enter_prepare();
}
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-20 15:00 ` Thomas Gleixner
@ 2026-04-20 15:22 ` Thomas Gleixner
2026-04-20 20:57 ` Verma, Vishal L
2026-04-21 4:51 ` Binbin Wu
2 siblings, 0 replies; 39+ messages in thread
From: Thomas Gleixner @ 2026-04-20 15:22 UTC (permalink / raw)
To: Verma, Vishal L, peterz@infradead.org
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
x86@kernel.org
On Mon, Apr 20 2026 at 17:00, Thomas Gleixner wrote:
> On Thu, Apr 16 2026 at 20:50, Vishal L. Verma wrote:
> This code should never be reached with a rearm pending. Something else
> went wrong earlier. So while the patch "works" it papers over the
> underlying problem.
Peter just noticed that this should be fixed with
1f5ffc672165 ("Fix mismerge of the arm64 / timer-core interrupt handling changes")
Thanks,
tglx
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-20 15:00 ` Thomas Gleixner
2026-04-20 15:22 ` Thomas Gleixner
@ 2026-04-20 20:57 ` Verma, Vishal L
2026-04-20 22:19 ` Thomas Gleixner
2026-04-21 4:51 ` Binbin Wu
2 siblings, 1 reply; 39+ messages in thread
From: Verma, Vishal L @ 2026-04-20 20:57 UTC (permalink / raw)
To: peterz@infradead.org, tglx@kernel.org
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
x86@kernel.org
On Mon, 2026-04-20 at 17:00 +0200, Thomas Gleixner wrote:
>
> This code should never be reached with a rearm pending. Something else
> went wrong earlier. So while the patch "works" it papers over the
> underlying problem.
>
> Can you please do the following:
>
> 1) Apply the patch below
>
> 2) Enable function tracing and the hrtimer* trace events
>
> 3) Enable tracing if it has been disabled already
>
> echo 1 >/sys/kernel/tracing/tracing_on
>
> 4) Run the tests and wait for /sys/kernel/tracing/tracing_on to
> become 0, which means the problem triggered.
>
> 5) Retrieve the trace from /sys/kernel/tracing/trace and provide it
> somewhere to download from or send it to me compressed offlist.
Hi Thomas,
I've uploaded the trace here (~75MB compressed):
https://drive.proton.me/urls/B9PY61XQ0C#07XwTVhE46eB
As for:
1f5ffc672165 ("Fix mismerge of the arm64 / timer-core interrupt handling changes")
I already had that commit in the branch that was tested and it didn't
fix it.
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-20 20:57 ` Verma, Vishal L
@ 2026-04-20 22:19 ` Thomas Gleixner
2026-04-20 22:24 ` Verma, Vishal L
0 siblings, 1 reply; 39+ messages in thread
From: Thomas Gleixner @ 2026-04-20 22:19 UTC (permalink / raw)
To: Verma, Vishal L, peterz@infradead.org
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
x86@kernel.org
On Mon, Apr 20 2026 at 20:57, Verma, Vishal L wrote:
> On Mon, 2026-04-20 at 17:00 +0200, Thomas Gleixner wrote:
>>
>> This code should never be reached with a rearm pending. Something else
>> went wrong earlier. So while the patch "works" it papers over the
>> underlying problem.
>>
>> Can you please do the following:
>>
>> 1) Apply the patch below
>>
>> 2) Enable function tracing and the hrtimer* trace events
>>
>> 3) Enable tracing if it has been disabled already
>>
>> echo 1 >/sys/kernel/tracing/tracing_on
>>
>> 4) Run the tests and wait for /sys/kernel/tracing/tracing_on to
>> become 0, which means the problem triggered.
>>
>> 5) Retrieve the trace from /sys/kernel/tracing/trace and provide it
>> somewhere to download from or send it to me compressed offlist.
>
> Hi Thomas,
>
> I've uploaded the trace here (~75MB compressed):
> https://drive.proton.me/urls/B9PY61XQ0C#07XwTVhE46eB
>
> As for:
>
> 1f5ffc672165 ("Fix mismerge of the arm64 / timer-core interrupt handling changes")
>
> I already had that commit in the branch that was tested and it didn't
> fix it.
Thanks for the update. Can you try to provide the information I asked
for above?
Thanks,
tglx
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-20 22:19 ` Thomas Gleixner
@ 2026-04-20 22:24 ` Verma, Vishal L
2026-04-21 6:29 ` Thomas Gleixner
0 siblings, 1 reply; 39+ messages in thread
From: Verma, Vishal L @ 2026-04-20 22:24 UTC (permalink / raw)
To: peterz@infradead.org, tglx@kernel.org
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
x86@kernel.org
On Tue, 2026-04-21 at 00:19 +0200, Thomas Gleixner wrote:
> On Mon, Apr 20 2026 at 20:57, Verma, Vishal L wrote:
> > On Mon, 2026-04-20 at 17:00 +0200, Thomas Gleixner wrote:
> > >
> > > This code should never be reached with a rearm pending. Something else
> > > went wrong earlier. So while the patch "works" it papers over the
> > > underlying problem.
> > >
> > > Can you please do the following:
> > >
> > > 1) Apply the patch below
> > >
> > > 2) Enable function tracing and the hrtimer* trace events
> > >
> > > 3) Enable tracing if it has been disabled already
> > >
> > > echo 1 >/sys/kernel/tracing/tracing_on
> > >
> > > 4) Run the tests and wait for /sys/kernel/tracing/tracing_on to
> > > become 0, which means the problem triggered.
> > >
> > > 5) Retrieve the trace from /sys/kernel/tracing/trace and provide it
> > > somewhere to download from or send it to me compressed offlist.
> >
> > Hi Thomas,
> >
> > I've uploaded the trace here (~75MB compressed):
> > https://drive.proton.me/urls/B9PY61XQ0C#07XwTVhE46eB
> >
> > As for:
> >
> > 1f5ffc672165 ("Fix mismerge of the arm64 / timer-core interrupt handling changes")
> >
> > I already had that commit in the branch that was tested and it didn't
> > fix it.
>
> Thanks for the update. Can you try to provide the information I asked
> for above?
>
Ah sorry - I should've said that with your patch applied, tracing_on
did become 0, so the problem was triggered.
The trace from that is in the URL above.
This is how I collected it:
tracefs=/sys/kernel/tracing
echo 4096 > "$tracefs"/buffer_size_kb
echo function > "$tracefs"/current_tracer
echo 1 > "$tracefs"/events/hrtimer/enable
echo 1 > "$tracefs"/tracing_on
<run the test>
tracing_on="$(cat "$tracefs"/tracing_on)"
if [ "$tracing_on" -eq 0 ]; then
    echo "Debug patch triggered, collecting trace"
    cat "$tracefs"/trace | gzip > /tmp/hrtimer_rearm_trace.gz
else
    echo "Debug patch did not trigger (tracing_on still 1)"
fi
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-20 15:00 ` Thomas Gleixner
2026-04-20 15:22 ` Thomas Gleixner
2026-04-20 20:57 ` Verma, Vishal L
@ 2026-04-21 4:51 ` Binbin Wu
2026-04-21 7:39 ` Thomas Gleixner
2 siblings, 1 reply; 39+ messages in thread
From: Binbin Wu @ 2026-04-21 4:51 UTC (permalink / raw)
To: Thomas Gleixner, Verma, Vishal L, peterz@infradead.org
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
x86@kernel.org
On 4/20/2026 11:00 PM, Thomas Gleixner wrote:
> On Thu, Apr 16 2026 at 20:50, Vishal L. Verma wrote:
>> I tried out an AI-assisted patch (below) which does happen to solve
>> it, but I'm not familiar with this area, and not sure if this is the
>> right fix.
>>
>> diff --git a/include/linux/entry-virt.h b/include/linux/entry-virt.h
>> index bfa767702d9a..c4856c252412 100644
>> --- a/include/linux/entry-virt.h
>> +++ b/include/linux/entry-virt.h
>> @@ -4,6 +4,7 @@
>>
>> #include <linux/static_call_types.h>
>> #include <linux/resume_user_mode.h>
>> +#include <linux/hrtimer_rearm.h>
>> #include <linux/syscalls.h>
>> #include <linux/seccomp.h>
>> #include <linux/sched.h>
>> @@ -58,6 +59,7 @@ int xfer_to_guest_mode_handle_work(void);
>> static inline void xfer_to_guest_mode_prepare(void)
>> {
>> lockdep_assert_irqs_disabled();
>> + hrtimer_rearm_deferred();
>> tick_nohz_user_enter_prepare();
>
>
> This code should never be reached with a rearm pending. Something else
> went wrong earlier. So while the patch "works" it papers over the
> underlying problem.
IIUC, the problem might be:
HRTimer -> VMExit:
    [IRQs disabled]
    kvm_x86_call(handle_exit_irqoff)(vcpu)
      vmx_handle_exit_irqoff
        handle_external_interrupt_irqoff
          sysvec_apic_timer_interrupt
            irqentry_enter
            ...
            irqentry_exit
              irqentry_exit_to_kernel_mode
                if (!regs_irqs_disabled(regs))  // <-- This is false, so the
                    hrtimer_rearm_deferred()    //     hrtimer rearm is skipped!
This issue is triggered on TDX since TDX can't use the VMX preemption
timer, while a normal VMX VM uses the preemption timer by default.
>
> Can you please do the following:
>
> 1) Apply the patch below
>
> 2) Enable function tracing and the hrtimer* trace events
>
> 3) Enable tracing if it has been disabled already
>
> echo 1 >/sys/kernel/tracing/tracing_on
>
> 4) Run the tests and wait for /sys/kernel/tracing/tracing_on to
> become 0, which means the problem triggered.
>
> 5) Retrieve the trace from /sys/kernel/tracing/trace and provide it
> somewhere to download from or send it to me compressed offlist.
>
> Thanks,
>
> tglx
> ---
>
> diff --git a/include/linux/entry-virt.h b/include/linux/entry-virt.h
> index bfa767702d9a..ab73963a7496 100644
> --- a/include/linux/entry-virt.h
> +++ b/include/linux/entry-virt.h
> @@ -58,6 +58,10 @@ int xfer_to_guest_mode_handle_work(void);
> static inline void xfer_to_guest_mode_prepare(void)
> {
> lockdep_assert_irqs_disabled();
> + if (test_thread_flag(TIF_HRTIMER_REARM)) {
> + tracing_off();
> + hrtimer_rearm_deferred();
> + }
> tick_nohz_user_enter_prepare();
> }
>
>
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-20 22:24 ` Verma, Vishal L
@ 2026-04-21 6:29 ` Thomas Gleixner
0 siblings, 0 replies; 39+ messages in thread
From: Thomas Gleixner @ 2026-04-21 6:29 UTC (permalink / raw)
To: Verma, Vishal L, peterz@infradead.org
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
x86@kernel.org
On Mon, Apr 20 2026 at 22:24, Verma, Vishal L wrote:
> On Tue, 2026-04-21 at 00:19 +0200, Thomas Gleixner wrote:
>> Thanks for the update. Can you try to provide the information I asked
>> for above?
>>
> Ah sorry - I should've said that with your patch applied, tracing_on
> did become 0, so the problem was triggered.
>
> The trace from that is in the URL above.
I clearly can't read :)
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 4:51 ` Binbin Wu
@ 2026-04-21 7:39 ` Thomas Gleixner
2026-04-21 11:18 ` Peter Zijlstra
2026-04-21 16:11 ` Verma, Vishal L
0 siblings, 2 replies; 39+ messages in thread
From: Thomas Gleixner @ 2026-04-21 7:39 UTC (permalink / raw)
To: Binbin Wu, Verma, Vishal L, peterz@infradead.org
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
x86@kernel.org
On Tue, Apr 21 2026 at 12:51, Binbin Wu wrote:
> On 4/20/2026 11:00 PM, Thomas Gleixner wrote:
>>> static inline void xfer_to_guest_mode_prepare(void)
>>> {
>>> lockdep_assert_irqs_disabled();
>>> + hrtimer_rearm_deferred();
>>> tick_nohz_user_enter_prepare();
>>
>>
>> This code should never be reached with a rearm pending. Something else
>> went wrong earlier. So while the patch "works" it papers over the
>> underlying problem.
>
> IIUC, the problem might be:
>
> HRTimer -> VMExit:
>     [IRQs disabled]
>     kvm_x86_call(handle_exit_irqoff)(vcpu)
>       vmx_handle_exit_irqoff
>         handle_external_interrupt_irqoff
>           sysvec_apic_timer_interrupt
>             irqentry_enter
>             ...
>             irqentry_exit
>               irqentry_exit_to_kernel_mode
>                 if (!regs_irqs_disabled(regs))  // <-- This is false, so the
>                     hrtimer_rearm_deferred()    //     hrtimer rearm is skipped!
>
>
> This issue is triggered on TDX since TDX can't use the VMX preemption
> timer, while a normal VMX VM uses the preemption timer by default.
Kinda.
The issue is that vmx_handle_exit_irqoff() always hands in regs with
regs->flags.X86_EFLAGS_IF == 0. That has absolutely nothing to do with
TDX and the preemption timer.
The patch below solves the problem right there in the exit code, which
is unfortunate as there might be a NEED_RESCHED pending. But that can't
be taken into account as KVM enables interrupts _before_ reaching the
exit work point.
Yet another proof that virt creates more problems than it solves.
Thanks,
tglx
---
Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
From: Thomas Gleixner <tglx@kernel.org>
Date: Tue, 21 Apr 2026 09:00:52 +0200
irqentry_exit_to_kernel_mode_after_preempt() invokes
hrtimer_rearm_deferred() only when the interrupted context had interrupts
enabled. That's a correct decision because the timer interrupt can only be
delivered in interrupt enabled contexts. The interrupt disabled path is
used by exceptions and traps which never touch the hrtimer mechanics.
So much for the theory, but then there is VIRT which ruins everything.
KVM invokes regular interrupts with pt_regs which have interrupts
disabled. That's correct from the KVM point of view, but completely
violates the obviously correct expectations of the interrupt entry/exit
code.
Cure this by adding a hrtimer_rearm_deferred() invocation into the path
of irqentry_exit_to_kernel_mode_after_preempt() taken when the
interrupted context had interrupts disabled.
That's unfortunate when there is an actual reschedule pending, but it can't
be avoided because KVM invokes a lot of code and also reenables interrupts
_before_ reaching the point where the reschedule condition is handled. That
can delay the rearming significantly, which in turn can cause artificial
latencies.
Fixes: 0e98eb14814e ("entry: Prepare for deferred hrtimer rearming")
Reported-by: "Verma, Vishal L" <vishal.l.verma@intel.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Closes: https://lore.kernel.org/70cd3e97fbb796e2eb2ff8cd4b7614ada05a5f24.camel@intel.com
---
include/linux/irq-entry-common.h | 8 ++++++++
1 file changed, 8 insertions(+)
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -516,6 +516,14 @@ irqentry_exit_to_kernel_mode_after_preem
instrumentation_end();
} else {
/*
+ * This is sadly required due to KVM, which invokes regular
+ * interrupt handlers with interrupt disabled state in @regs.
+ */
+ instrumentation_begin();
+ hrtimer_rearm_deferred();
+ instrumentation_end();
+
+ /*
* IRQ flags state is correct already. Just tell RCU if it
* was not watching on entry.
*/
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 7:39 ` Thomas Gleixner
@ 2026-04-21 11:18 ` Peter Zijlstra
2026-04-21 11:32 ` Peter Zijlstra
2026-04-21 16:11 ` Verma, Vishal L
1 sibling, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2026-04-21 11:18 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
Edgecombe, Rick P, Wu, Binbin, x86@kernel.org
On Tue, Apr 21, 2026 at 09:39:14AM +0200, Thomas Gleixner wrote:
> ---
> Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
> From: Thomas Gleixner <tglx@kernel.org>
> Date: Tue, 21 Apr 2026 09:00:52 +0200
>
> irqentry_exit_to_kernel_mode_after_preempt() invokes
> hrtimer_rearm_deferred() only when the interrupted context had interrupts
> enabled. That's a correct decision because the timer interrupt can only be
> delivered in interrupt enabled contexts. The interrupt disabled path is
> used by exceptions and traps which never touch the hrtimer mechanics.
>
> So much for the theory, but then there is VIRT which ruins everything.
>
> KVM invokes regular interrupts with pt_regs which have interrupts
> disabled. That's correct from the KVM point of view, but completely
> violates the obviously correct expectations of the interrupt entry/exit
> code.
Mooo :-(
That also complicates the comment that goes with
hrtimer_rearm_deferred(). Not sure how to 'fix' that.
> Cure this by adding a hrtimer_rearm_deferred() invocation into the path
> of irqentry_exit_to_kernel_mode_after_preempt() taken when the
> interrupted context had interrupts disabled.
>
> That's unfortunate when there is an actual reschedule pending, but it can't
> be avoided because KVM invokes a lot of code and also reenables interrupts
> _before_ reaching the point where the reschedule condition is handled. That
> can delay the rearming significantly, which in turn can cause artificial
> latencies.
Yeah, this is a trainwreck. If they want it better, KVM needs to get
'fixed' to not play silly games like this.
> Fixes: 0e98eb14814e ("entry: Prepare for deferred hrtimer rearming")
> Reported-by: "Verma, Vishal L" <vishal.l.verma@intel.com>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> Closes: https://lore.kernel.org/70cd3e97fbb796e2eb2ff8cd4b7614ada05a5f24.camel@intel.com
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> include/linux/irq-entry-common.h | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> --- a/include/linux/irq-entry-common.h
> +++ b/include/linux/irq-entry-common.h
> @@ -516,6 +516,14 @@ irqentry_exit_to_kernel_mode_after_preem
> instrumentation_end();
> } else {
> /*
> + * This is sadly required due to KVM, which invokes regular
> + * interrupt handlers with interrupt disabled state in @regs.
> + */
> + instrumentation_begin();
> + hrtimer_rearm_deferred();
> + instrumentation_end();
> +
> + /*
> * IRQ flags state is correct already. Just tell RCU if it
> * was not watching on entry.
> */
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 11:18 ` Peter Zijlstra
@ 2026-04-21 11:32 ` Peter Zijlstra
2026-04-21 11:34 ` Peter Zijlstra
2026-04-21 16:30 ` Thomas Gleixner
0 siblings, 2 replies; 39+ messages in thread
From: Peter Zijlstra @ 2026-04-21 11:32 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
Edgecombe, Rick P, Wu, Binbin, x86@kernel.org
On Tue, Apr 21, 2026 at 01:18:58PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 09:39:14AM +0200, Thomas Gleixner wrote:
>
> > ---
> > Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
> > From: Thomas Gleixner <tglx@kernel.org>
> > Date: Tue, 21 Apr 2026 09:00:52 +0200
> >
> > irqentry_exit_to_kernel_mode_after_preempt() invokes
> > hrtimer_rearm_deferred() only when the interrupted context had interrupts
> > enabled. That's a correct decision because the timer interrupt can only be
> > delivered in interrupt enabled contexts. The interrupt disabled path is
> > used by exceptions and traps which never touch the hrtimer mechanics.
> >
> > So much for the theory, but then there is VIRT which ruins everything.
> >
> > KVM invokes regular interrupts with pt_regs which have interrupts
> > disabled. That's correct from the KVM point of view, but completely
> > violates the obviously correct expectations of the interrupt entry/exit
> > code.
>
> Mooo :-(
>
> That also complicates the comment that goes with
> hrtimer_rearm_deferred(). Not sure how to 'fix' that.
>
> > Cure this by adding a hrtimer_rearm_deferred() invocation into the path
> > of irqentry_exit_to_kernel_mode_after_preempt() taken when the
> > interrupted context had interrupts disabled.
> >
> > That's unfortunate when there is an actual reschedule pending, but it can't
> > be avoided because KVM invokes a lot of code and also reenables interrupts
> > _before_ reaching the point where the reschedule condition is handled. That
> > can delay the rearming significantly, which in turn can cause artificial
> > latencies.
>
> Yeah, this is a trainwreck. If they want it better, KVM needs to get
> 'fixed' to not play silly games like this.
>
> > Fixes: 0e98eb14814e ("entry: Prepare for deferred hrtimer rearming")
> > Reported-by: "Verma, Vishal L" <vishal.l.verma@intel.com>
> > Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> > Closes: https://lore.kernel.org/70cd3e97fbb796e2eb2ff8cd4b7614ada05a5f24.camel@intel.com
>
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>
> > ---
> > include/linux/irq-entry-common.h | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > --- a/include/linux/irq-entry-common.h
> > +++ b/include/linux/irq-entry-common.h
> > @@ -516,6 +516,14 @@ irqentry_exit_to_kernel_mode_after_preem
> > instrumentation_end();
> > } else {
> > /*
> > + * This is sadly required due to KVM, which invokes regular
> > + * interrupt handlers with interrupt disabled state in @regs.
> > + */
> > + instrumentation_begin();
> > + hrtimer_rearm_deferred();
> > + instrumentation_end();
> > +
> > + /*
> > * IRQ flags state is correct already. Just tell RCU if it
> > * was not watching on entry.
> > */
Ohhh, wait. What happens if you take a page-fault from NMI context? Does
this then not result in trying to program the timer from NMI context?
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 11:32 ` Peter Zijlstra
@ 2026-04-21 11:34 ` Peter Zijlstra
2026-04-21 11:49 ` Peter Zijlstra
2026-04-21 16:30 ` Thomas Gleixner
1 sibling, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2026-04-21 11:34 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
Edgecombe, Rick P, Wu, Binbin, x86@kernel.org
On Tue, Apr 21, 2026 at 01:32:12PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 01:18:58PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 21, 2026 at 09:39:14AM +0200, Thomas Gleixner wrote:
> >
> > > ---
> > > Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
> > > From: Thomas Gleixner <tglx@kernel.org>
> > > Date: Tue, 21 Apr 2026 09:00:52 +0200
> > >
> > > irqentry_exit_to_kernel_mode_after_preempt() invokes
> > > hrtimer_rearm_deferred() only when the interrupted context had interrupts
> > > enabled. That's a correct decision because the timer interrupt can only be
> > > delivered in interrupt enabled contexts. The interrupt disabled path is
> > > used by exceptions and traps which never touch the hrtimer mechanics.
> > >
> > > So much for the theory, but then there is VIRT which ruins everything.
> > >
> > > KVM invokes regular interrupts with pt_regs which have interrupts
> > > disabled. That's correct from the KVM point of view, but completely
> > > violates the obviously correct expectations of the interrupt entry/exit
> > > code.
> >
> > Mooo :-(
Also, is this an x86/KVM 'special' or is this true for all arch/KVM that
use GENERIC_ENTRY?
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 11:34 ` Peter Zijlstra
@ 2026-04-21 11:49 ` Peter Zijlstra
2026-04-21 12:05 ` Peter Zijlstra
2026-04-21 17:11 ` Thomas Gleixner
0 siblings, 2 replies; 39+ messages in thread
From: Peter Zijlstra @ 2026-04-21 11:49 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
Edgecombe, Rick P, Wu, Binbin, x86@kernel.org
On Tue, Apr 21, 2026 at 01:34:07PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 01:32:12PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 21, 2026 at 01:18:58PM +0200, Peter Zijlstra wrote:
> > > On Tue, Apr 21, 2026 at 09:39:14AM +0200, Thomas Gleixner wrote:
> > >
> > > > ---
> > > > Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
> > > > From: Thomas Gleixner <tglx@kernel.org>
> > > > Date: Tue, 21 Apr 2026 09:00:52 +0200
> > > >
> > > > irqentry_exit_to_kernel_mode_after_preempt() invokes
> > > > hrtimer_rearm_deferred() only when the interrupted context had interrupts
> > > > enabled. That's a correct decision because the timer interrupt can only be
> > > > delivered in interrupt enabled contexts. The interrupt disabled path is
> > > > used by exceptions and traps which never touch the hrtimer mechanics.
> > > >
> > > > So much for the theory, but then there is VIRT which ruins everything.
> > > >
> > > > KVM invokes regular interrupts with pt_regs which have interrupts
> > > > disabled. That's correct from the KVM point of view, but completely
> > > > violates the obviously correct expectations of the interrupt entry/exit
> > > > code.
> > >
> > > Mooo :-(
>
> Also, is this an x86/KVM 'special' or is this true for all arch/KVM that
> use GENERIC_ENTRY?
Should we not make asm_fred_entry_from_kvm()/VMX_DO_EVENT_IRQOFF fix IF
on the fake frame instead? We know it will enable IRQs after doing
handle_exit_irqoff() in vcpu_enter_guest().
SVM does not seem affected by this particular insanity.
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 11:49 ` Peter Zijlstra
@ 2026-04-21 12:05 ` Peter Zijlstra
2026-04-21 13:19 ` Peter Zijlstra
2026-04-21 17:11 ` Thomas Gleixner
1 sibling, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2026-04-21 12:05 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
Edgecombe, Rick P, Wu, Binbin, x86@kernel.org
On Tue, Apr 21, 2026 at 01:49:40PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 01:34:07PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 21, 2026 at 01:32:12PM +0200, Peter Zijlstra wrote:
> > > On Tue, Apr 21, 2026 at 01:18:58PM +0200, Peter Zijlstra wrote:
> > > > On Tue, Apr 21, 2026 at 09:39:14AM +0200, Thomas Gleixner wrote:
> > > >
> > > > > ---
> > > > > Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
> > > > > From: Thomas Gleixner <tglx@kernel.org>
> > > > > Date: Tue, 21 Apr 2026 09:00:52 +0200
> > > > >
> > > > > irqentry_exit_to_kernel_mode_after_preempt() invokes
> > > > > hrtimer_rearm_deferred() only when the interrupted context had interrupts
> > > > > enabled. That's a correct decision because the timer interrupt can only be
> > > > > delivered in interrupt enabled contexts. The interrupt disabled path is
> > > > > used by exceptions and traps which never touch the hrtimer mechanics.
> > > > >
> > > > > So much for the theory, but then there is VIRT which ruins everything.
> > > > >
> > > > > KVM invokes regular interrupts with pt_regs which have interrupts
> > > > > disabled. That's correct from the KVM point of view, but completely
> > > > > violates the obviously correct expectations of the interrupt entry/exit
> > > > > code.
> > > >
> > > > Mooo :-(
> >
> > Also, is this a x86/KVM 'special' or is this true for all arch/KVM that
> > use GENERIC_ENTRY?
>
> Should we not make asm_fred_entry_from_kvm()/VMX_DO_EVENT_IRQOFF fix IF
> on the fake frame instead? We know it will enable IRQs after doing
> handle_exit_irqoff() in vcpu_enter_guest().
Moo, you can't do that either, because it will ERETS/IRET and fuck up
the state :/
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 12:05 ` Peter Zijlstra
@ 2026-04-21 13:19 ` Peter Zijlstra
2026-04-21 13:29 ` Peter Zijlstra
0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2026-04-21 13:19 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
Edgecombe, Rick P, Wu, Binbin, x86@kernel.org
On Tue, Apr 21, 2026 at 02:05:31PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 01:49:40PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 21, 2026 at 01:34:07PM +0200, Peter Zijlstra wrote:
> > > On Tue, Apr 21, 2026 at 01:32:12PM +0200, Peter Zijlstra wrote:
> > > > On Tue, Apr 21, 2026 at 01:18:58PM +0200, Peter Zijlstra wrote:
> > > > > On Tue, Apr 21, 2026 at 09:39:14AM +0200, Thomas Gleixner wrote:
> > > > >
> > > > > > ---
> > > > > > Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
> > > > > > From: Thomas Gleixner <tglx@kernel.org>
> > > > > > Date: Tue, 21 Apr 2026 09:00:52 +0200
> > > > > >
> > > > > > irqentry_exit_to_kernel_mode_after_preempt() invokes
> > > > > > hrtimer_rearm_deferred() only when the interrupted context had interrupts
> > > > > > enabled. That's a correct decision because the timer interrupt can only be
> > > > > > delivered in interrupt enabled contexts. The interrupt disabled path is
> > > > > > used by exceptions and traps which never touch the hrtimer mechanics.
> > > > > >
> > > > > > So much for the theory, but then there is VIRT which ruins everything.
> > > > > >
> > > > > > KVM invokes regular interrupts with pt_regs which have interrupts
> > > > > > disabled. That's correct from the KVM point of view, but completely
> > > > > > violates the obviously correct expectations of the interrupt entry/exit
> > > > > > code.
> > > > >
> > > > > Mooo :-(
> > >
> > > Also, is this a x86/KVM 'special' or is this true for all arch/KVM that
> > > use GENERIC_ENTRY?
> >
> > Should we not make asm_fred_entry_from_kvm()/VMX_DO_EVENT_IRQOFF fix IF
> > on the fake frame instead? We know it will enable IRQs after doing
> > handle_exit_irqoff() in vcpu_enter_guest().
>
> Moo, you can't do that either, because it will ERETS/IRET and fuck up
> the state :/
How insane is something like this?
---
diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
index 894f7f16eb80..f3e2a8fde1ab 100644
--- a/arch/x86/entry/entry_64_fred.S
+++ b/arch/x86/entry/entry_64_fred.S
@@ -98,6 +98,7 @@ SYM_FUNC_START(asm_fred_entry_from_kvm)
push %rdi /* fred_ss handed in by the caller */
push %rbp
pushf
+ or $X86_EFLAGS_KVM, (%rsp)
push $__KERNEL_CS
/*
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index 7535131c711b..aab93f07e768 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -97,4 +97,16 @@ static __always_inline void arch_exit_to_user_mode(void)
}
#define arch_exit_to_user_mode arch_exit_to_user_mode
+static __always_inline void arch_exit_to_kernel_mode(struct pt_regs *regs)
+{
+#ifdef CONFIG_KVM_INTEL
+ /*
+ * KVM is a reserved bit and must always be 0. Hardware will #GP on
+ * IRET/ERETS with this bit set.
+ */
+ regs->flags &= ~X86_EFLAGS_KVM;
+#endif
+}
+#define arch_exit_to_kernel_mode arch_exit_to_kernel_mode
+
#endif
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 7bb7bd90355d..c31f7bc2eba2 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -311,7 +311,15 @@ void user_stack_pointer_set(struct pt_regs *regs, unsigned long val)
static __always_inline bool regs_irqs_disabled(struct pt_regs *regs)
{
- return !(regs->flags & X86_EFLAGS_IF);
+ /*
+ * return context | IF | KVM
+ * ---------------+----+----
+ * IRQ-off | 0 | 0
+ * IRQ-on | 0 | 1
+ * IRQ-on | 1 | 0
+ * invalid | 1 | 1
+ */
+ return (regs->flags & (X86_EFLAGS_IF | X86_EFLAGS_KVM)) == 0;
}
/* Query offset/name of register from its name/offset */
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index 81d0c8bf1137..d32edefde587 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -14,6 +14,8 @@
#define X86_EFLAGS_FIXED _BITUL(X86_EFLAGS_FIXED_BIT)
#define X86_EFLAGS_PF_BIT 2 /* Parity Flag */
#define X86_EFLAGS_PF _BITUL(X86_EFLAGS_PF_BIT)
+#define X86_EFLAGS_KVM_BIT 3 /* KVM Flag -- must be 0 */
+#define X86_EFLAGS_KVM _BITUL(X86_EFLAGS_PF_BIT)
#define X86_EFLAGS_AF_BIT 4 /* Auxiliary carry Flag */
#define X86_EFLAGS_AF _BITUL(X86_EFLAGS_AF_BIT)
#define X86_EFLAGS_ZF_BIT 6 /* Zero Flag */
diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index 8a481dae9cae..3d0d0fb8de79 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -50,6 +50,7 @@
push %rbp
#endif
pushf
+ or $X86_EFLAGS_KVM, (%_ASM_SP)
push $__KERNEL_CS
\call_insn \call_target
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index 167fba7dbf04..0acc20b63513 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -167,6 +167,10 @@ static __always_inline void arch_exit_to_user_mode(void);
static __always_inline void arch_exit_to_user_mode(void) { }
#endif
+#ifndef arch_exit_to_kernel_mode
+static __always_inline void arch_exit_to_kernel_mode(struct pt_regs *regs) { }
+#endif
+
/**
* arch_do_signal_or_restart - Architecture specific signal delivery function
* @regs: Pointer to currents pt_regs
@@ -548,6 +552,7 @@ static __always_inline void irqentry_exit_to_kernel_mode(struct pt_regs *regs,
instrumentation_end();
irqentry_exit_to_kernel_mode_after_preempt(regs, state);
+ arch_exit_to_kernel_mode(regs);
}
/**
^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 13:19 ` Peter Zijlstra
@ 2026-04-21 13:29 ` Peter Zijlstra
2026-04-21 16:36 ` Thomas Gleixner
2026-04-21 18:11 ` Verma, Vishal L
0 siblings, 2 replies; 39+ messages in thread
From: Peter Zijlstra @ 2026-04-21 13:29 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
Edgecombe, Rick P, Wu, Binbin, x86@kernel.org
On Tue, Apr 21, 2026 at 03:19:53PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 02:05:31PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 21, 2026 at 01:49:40PM +0200, Peter Zijlstra wrote:
> > > On Tue, Apr 21, 2026 at 01:34:07PM +0200, Peter Zijlstra wrote:
> > > > On Tue, Apr 21, 2026 at 01:32:12PM +0200, Peter Zijlstra wrote:
> > > > > On Tue, Apr 21, 2026 at 01:18:58PM +0200, Peter Zijlstra wrote:
> > > > > > On Tue, Apr 21, 2026 at 09:39:14AM +0200, Thomas Gleixner wrote:
> > > > > >
> > > > > > > ---
> > > > > > > Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
> > > > > > > From: Thomas Gleixner <tglx@kernel.org>
> > > > > > > Date: Tue, 21 Apr 2026 09:00:52 +0200
> > > > > > >
> > > > > > > irqentry_exit_to_kernel_mode_after_preempt() invokes
> > > > > > > hrtimer_rearm_deferred() only when the interrupted context had interrupts
> > > > > > > enabled. That's a correct decision because the timer interrupt can only be
> > > > > > > delivered in interrupt enabled contexts. The interrupt disabled path is
> > > > > > > used by exceptions and traps which never touch the hrtimer mechanics.
> > > > > > >
> > > > > > > So much for the theory, but then there is VIRT which ruins everything.
> > > > > > >
> > > > > > > KVM invokes regular interrupts with pt_regs which have interrupts
> > > > > > > disabled. That's correct from the KVM point of view, but completely
> > > > > > > violates the obviously correct expectations of the interrupt entry/exit
> > > > > > > code.
> > > > > >
> > > > > > Mooo :-(
> > > >
> > > > Also, is this a x86/KVM 'special' or is this true for all arch/KVM that
> > > > use GENERIC_ENTRY?
> > >
> > > Should we not make asm_fred_entry_from_kvm()/VMX_DO_EVENT_IRQOFF fix IF
> > > on the fake frame instead? We know it will enable IRQs after doing
> > > handle_exit_irqoff() in vcpu_enter_guest().
> >
> > Moo, you can't do that either, because it will ERETS/IRET and fuck up
> > the state :/
>
> How insane is something like this?
Small matter of actually building...
---
diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
index 894f7f16eb80..cc2c961a5683 100644
--- a/arch/x86/entry/entry_64_fred.S
+++ b/arch/x86/entry/entry_64_fred.S
@@ -98,6 +98,7 @@ SYM_FUNC_START(asm_fred_entry_from_kvm)
push %rdi /* fred_ss handed in by the caller */
push %rbp
pushf
+ orq $X86_EFLAGS_KVM, (%rsp)
push $__KERNEL_CS
/*
diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index 0e8c611bc9e2..75568a85b2d3 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -43,6 +43,7 @@
#define _ASM_SUB __ASM_SIZE(sub)
#define _ASM_XADD __ASM_SIZE(xadd)
#define _ASM_MUL __ASM_SIZE(mul)
+#define _ASM_OR __ASM_SIZE(or)
#define _ASM_AX __ASM_REG(ax)
#define _ASM_BX __ASM_REG(bx)
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index 7535131c711b..aab93f07e768 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -97,4 +97,16 @@ static __always_inline void arch_exit_to_user_mode(void)
}
#define arch_exit_to_user_mode arch_exit_to_user_mode
+static __always_inline void arch_exit_to_kernel_mode(struct pt_regs *regs)
+{
+#ifdef CONFIG_KVM_INTEL
+ /*
+ * KVM is a reserved bit and must always be 0. Hardware will #GP on
+ * IRET/ERETS with this bit set.
+ */
+ regs->flags &= ~X86_EFLAGS_KVM;
+#endif
+}
+#define arch_exit_to_kernel_mode arch_exit_to_kernel_mode
+
#endif
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 7bb7bd90355d..c31f7bc2eba2 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -311,7 +311,15 @@ void user_stack_pointer_set(struct pt_regs *regs, unsigned long val)
static __always_inline bool regs_irqs_disabled(struct pt_regs *regs)
{
- return !(regs->flags & X86_EFLAGS_IF);
+ /*
+ * return context | IF | KVM
+ * ---------------+----+----
+ * IRQ-off | 0 | 0
+ * IRQ-on | 0 | 1
+ * IRQ-on | 1 | 0
+ * invalid | 1 | 1
+ */
+ return (regs->flags & (X86_EFLAGS_IF | X86_EFLAGS_KVM)) == 0;
}
/* Query offset/name of register from its name/offset */
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index 81d0c8bf1137..d32edefde587 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -14,6 +14,8 @@
#define X86_EFLAGS_FIXED _BITUL(X86_EFLAGS_FIXED_BIT)
#define X86_EFLAGS_PF_BIT 2 /* Parity Flag */
#define X86_EFLAGS_PF _BITUL(X86_EFLAGS_PF_BIT)
+#define X86_EFLAGS_KVM_BIT 3 /* KVM Flag -- must be 0 */
+#define X86_EFLAGS_KVM _BITUL(X86_EFLAGS_PF_BIT)
#define X86_EFLAGS_AF_BIT 4 /* Auxiliary carry Flag */
#define X86_EFLAGS_AF _BITUL(X86_EFLAGS_AF_BIT)
#define X86_EFLAGS_ZF_BIT 6 /* Zero Flag */
diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index 8a481dae9cae..cb9ab3ce030b 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -6,6 +6,7 @@
#include <asm/nospec-branch.h>
#include <asm/percpu.h>
#include <asm/segment.h>
+#include <asm/processor-flags.h>
#include "kvm-asm-offsets.h"
#include "run_flags.h"
@@ -50,6 +51,7 @@
push %rbp
#endif
pushf
+ _ASM_OR $X86_EFLAGS_KVM, (%_ASM_SP)
push $__KERNEL_CS
\call_insn \call_target
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index 167fba7dbf04..0acc20b63513 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -167,6 +167,10 @@ static __always_inline void arch_exit_to_user_mode(void);
static __always_inline void arch_exit_to_user_mode(void) { }
#endif
+#ifndef arch_exit_to_kernel_mode
+static __always_inline void arch_exit_to_kernel_mode(struct pt_regs *regs) { }
+#endif
+
/**
* arch_do_signal_or_restart - Architecture specific signal delivery function
* @regs: Pointer to currents pt_regs
@@ -548,6 +552,7 @@ static __always_inline void irqentry_exit_to_kernel_mode(struct pt_regs *regs,
instrumentation_end();
irqentry_exit_to_kernel_mode_after_preempt(regs, state);
+ arch_exit_to_kernel_mode(regs);
}
/**
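[Editorial note: the core of the trick above — hijacking a reserved EFLAGS bit to mark KVM's fake frames as "conceptually IRQ-on" — can be modelled in plain user-space C. This is only a sketch of the patched regs_irqs_disabled() logic: the IF bit position is the real architectural one, while X86_EFLAGS_KVM here is the patch's invented marker in reserved bit 3, not an architectural flag.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define X86_EFLAGS_KVM (1UL << 3)  /* reserved-0 bit, hijacked as "IRQs really on" marker */
#define X86_EFLAGS_IF  (1UL << 9)  /* architectural Interrupt Flag */

/*
 * Mirrors the patched regs_irqs_disabled(): the interrupted context
 * only counts as IRQ-off when neither the real IF nor the KVM marker
 * is set in the saved flags.
 */
static bool regs_irqs_disabled(uint64_t flags)
{
	return (flags & (X86_EFLAGS_IF | X86_EFLAGS_KVM)) == 0;
}
```

With that predicate, a fake frame built via pushf plus "or $X86_EFLAGS_KVM, (%rsp)" reads as IRQ-on to the generic exit path, while arch_exit_to_kernel_mode() strips the bit again before any IRET/ERETS could #GP on the reserved bit being set.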
^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 7:39 ` Thomas Gleixner
2026-04-21 11:18 ` Peter Zijlstra
@ 2026-04-21 16:11 ` Verma, Vishal L
1 sibling, 0 replies; 39+ messages in thread
From: Verma, Vishal L @ 2026-04-21 16:11 UTC (permalink / raw)
To: peterz@infradead.org, tglx@kernel.org, binbin.wu@linux.intel.com
Cc: kvm@vger.kernel.org, Edgecombe, Rick P, Wu, Binbin,
x86@kernel.org
On Tue, 2026-04-21 at 09:39 +0200, Thomas Gleixner wrote:
>
> Subject: entry: Enforce hrtimer rearming in the irqentry_exit path
> From: Thomas Gleixner <tglx@kernel.org>
> Date: Tue, 21 Apr 2026 09:00:52 +0200
>
> irqentry_exit_to_kernel_mode_after_preempt() invokes
> hrtimer_rearm_deferred() only when the interrupted context had interrupts
> enabled. That's a correct decision because the timer interrupt can only be
> delivered in interrupt enabled contexts. The interrupt disabled path is
> used by exceptions and traps which never touch the hrtimer mechanics.
>
> So much for the theory, but then there is VIRT which ruins everything.
>
> KVM invokes regular interrupts with pt_regs which have interrupts
> disabled. That's correct from the KVM point of view, but completely
> violates the obviously correct expectations of the interrupt entry/exit
> code.
>
> Cure this by adding a hrtimer_rearm_deferred() invocation into the
> 'interrupted context had interrupts disabled' path of
> irqentry_exit_to_kernel_mode_after_preempt().
>
> That's unfortunate when there is an actual reschedule pending, but it can't
> be avoided because KVM invokes a lot of code and also reenables interrupts
> _before_ reaching the point where the reschedule condition is handled. That
> can delay the rearming significantly, which in turn can cause artificial
> latencies.
>
> Fixes: 0e98eb14814e ("entry: Prepare for deferred hrtimer rearming")
> Reported-by: "Verma, Vishal L" <vishal.l.verma@intel.com>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> Closes: https://lore.kernel.org/70cd3e97fbb796e2eb2ff8cd4b7614ada05a5f24.camel@intel.com
Hi Thomas, I tested this and verified it solves both the tests, no more
lockups. If this is the final fix, you can add:
Tested-by: Vishal Verma <vishal.l.verma@intel.com>
(I'm queueing up Peter's patch on the CI now too)
> ---
> include/linux/irq-entry-common.h | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> --- a/include/linux/irq-entry-common.h
> +++ b/include/linux/irq-entry-common.h
> @@ -516,6 +516,14 @@ irqentry_exit_to_kernel_mode_after_preem
> instrumentation_end();
> } else {
> /*
> + * This is sadly required due to KVM, which invokes regular
> + * interrupt handlers with interrupt disabled state in @regs.
> + */
> + instrumentation_begin();
> + hrtimer_rearm_deferred();
> + instrumentation_end();
> +
> + /*
> * IRQ flags state is correct already. Just tell RCU if it
> * was not watching on entry.
> */
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 11:32 ` Peter Zijlstra
2026-04-21 11:34 ` Peter Zijlstra
@ 2026-04-21 16:30 ` Thomas Gleixner
1 sibling, 0 replies; 39+ messages in thread
From: Thomas Gleixner @ 2026-04-21 16:30 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
Edgecombe, Rick P, Wu, Binbin, x86@kernel.org
On Tue, Apr 21 2026 at 13:32, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 01:18:58PM +0200, Peter Zijlstra wrote:
>> > /*
>> > + * This is sadly required due to KVM, which invokes regular
>> > + * interrupt handlers with interrupt disabled state in @regs.
>> > + */
>> > + instrumentation_begin();
>> > + hrtimer_rearm_deferred();
>> > + instrumentation_end();
>> > +
>> > + /*
>> > * IRQ flags state is correct already. Just tell RCU if it
>> > * was not watching on entry.
>> > */
>
> Ohhh, wait. What happens if you take a page-fault from NMI context? Does
> this then not result in trying to program the timer from NMI context?
Uuuurgh, yes.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 13:29 ` Peter Zijlstra
@ 2026-04-21 16:36 ` Thomas Gleixner
2026-04-21 18:11 ` Verma, Vishal L
1 sibling, 0 replies; 39+ messages in thread
From: Thomas Gleixner @ 2026-04-21 16:36 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
Edgecombe, Rick P, Wu, Binbin, x86@kernel.org
On Tue, Apr 21 2026 at 15:29, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 03:19:53PM +0200, Peter Zijlstra wrote:
>> > Moo, you can't do that either, because it will ERETS/IRET and fuck up
>> > the state :/
>>
>> How insane is something like this?
Pretty insane :)
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 11:49 ` Peter Zijlstra
2026-04-21 12:05 ` Peter Zijlstra
@ 2026-04-21 17:11 ` Thomas Gleixner
2026-04-21 17:20 ` Jim Mattson
2026-04-21 19:18 ` Verma, Vishal L
1 sibling, 2 replies; 39+ messages in thread
From: Thomas Gleixner @ 2026-04-21 17:11 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
Edgecombe, Rick P, Wu, Binbin, x86@kernel.org,
Sean Christopherson, Paolo Bonzini
On Tue, Apr 21 2026 at 13:49, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 01:34:07PM +0200, Peter Zijlstra wrote:
>> > > > KVM invokes regular interrupts with pt_regs which have interrupts
>> > > > disabled. That's correct from the KVM point of view, but completely
>> > > > violates the obviously correct expectations of the interrupt entry/exit
>> > > > code.
>> > >
>> > > Mooo :-(
>>
>> Also, is this a x86/KVM 'special' or is this true for all arch/KVM that
>> use GENERIC_ENTRY?
>
> Should we not make asm_fred_entry_from_kvm()/VMX_DO_EVENT_IRQOFF fix IF
> on the fake frame instead? We know it will enable IRQs after doing
> handle_exit_irqoff() in vcpu_enter_guest().
Doesn't work :)
> SVM does not seem affected with this particular insanity.
Looks like. It will take the interrupt after local_irq_enable().
Now for VMX, that hrtimer_rearm_deferred() call should really go into
handle_external_interrupt_irqoff(), which in turn requires exporting
__hrtimer_rearm_deferred().
But we can avoid that altogether. Something like the untested below.
Thanks,
tglx
---
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -42,9 +42,10 @@
#include <linux/timer.h>
#include <linux/freezer.h>
#include <linux/compat.h>
-
#include <linux/uaccess.h>
+#include <asm/irq_regs.h>
+
#include <trace/events/timer.h>
#include "tick-internal.h"
@@ -2062,11 +2063,16 @@ void __hrtimer_rearm_deferred(void)
static __always_inline void
hrtimer_interrupt_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t expires_next)
{
- /* hrtimer_interrupt() just re-evaluated the first expiring timer */
- cpu_base->deferred_needs_update = false;
- /* Cache the expiry time */
- cpu_base->deferred_expires_next = expires_next;
- set_thread_flag(TIF_HRTIMER_REARM);
+ /* Lies, damned lies and virt */
+ if (likely(!regs_irqs_disabled(get_irq_regs()))) {
+ /* hrtimer_interrupt() just re-evaluated the first expiring timer */
+ cpu_base->deferred_needs_update = false;
+ /* Cache the expiry time */
+ cpu_base->deferred_expires_next = expires_next;
+ set_thread_flag(TIF_HRTIMER_REARM);
+ } else {
+ hrtimer_rearm(cpu_base, expires_next, false);
+ }
}
#else /* CONFIG_HRTIMER_REARM_DEFERRED */
static __always_inline void
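[Editorial note: the decision the patched hrtimer_interrupt_rearm() makes reduces to a single predicate. A minimal sketch with a hypothetical name — the real code operates on pt_regs and the per-CPU base, not a bare bool:]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the patched decision: defer the rearm to the irqentry exit
 * path only when the interrupted pt_regs claim interrupts were enabled.
 * For a KVM fake frame with IF clear, the exit path never runs the
 * deferred hook, so the timer must be rearmed immediately instead.
 */
static bool defer_hrtimer_rearm(bool regs_irqs_disabled)
{
	return !regs_irqs_disabled;
}
```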
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 17:11 ` Thomas Gleixner
@ 2026-04-21 17:20 ` Jim Mattson
2026-04-21 18:29 ` Thomas Gleixner
2026-04-21 19:18 ` Verma, Vishal L
1 sibling, 1 reply; 39+ messages in thread
From: Jim Mattson @ 2026-04-21 17:20 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Peter Zijlstra, Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
Edgecombe, Rick P, Wu, Binbin, x86@kernel.org,
Sean Christopherson, Paolo Bonzini
On Tue, Apr 21, 2026 at 10:14 AM Thomas Gleixner <tglx@kernel.org> wrote:
>
> On Tue, Apr 21 2026 at 13:49, Peter Zijlstra wrote:
> > On Tue, Apr 21, 2026 at 01:34:07PM +0200, Peter Zijlstra wrote:
> >> > > > KVM invokes regular interrupts with pt_regs which have interrupts
> >> > > > disabled. That's correct from the KVM point of view, but completely
> >> > > > violates the obviously correct expectations of the interrupt entry/exit
> >> > > > code.
> >> > >
> >> > > Mooo :-(
> >>
> >> Also, is this a x86/KVM 'special' or is this true for all arch/KVM that
> >> use GENERIC_ENTRY?
> >
> > Should we not make asm_fred_entry_from_kvm()/VMX_DO_EVENT_IRQOFF fix IF
> > on the fake frame instead? We know it will enable IRQs after doing
> > handle_exit_irqoff() in vcpu_enter_guest().
>
> Doesn't work :)
>
> > SVM does not seem affected with this particular insanity.
>
> Looks like. It will take the interrupt after local_irq_enable().
FWIW, VMX should work just like SVM if we clear VM_EXIT_ACK_INTR_ON_EXIT.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 13:29 ` Peter Zijlstra
2026-04-21 16:36 ` Thomas Gleixner
@ 2026-04-21 18:11 ` Verma, Vishal L
1 sibling, 0 replies; 39+ messages in thread
From: Verma, Vishal L @ 2026-04-21 18:11 UTC (permalink / raw)
To: peterz@infradead.org, tglx@kernel.org
Cc: Wu, Binbin, kvm@vger.kernel.org, binbin.wu@linux.intel.com,
Edgecombe, Rick P, x86@kernel.org
On Tue, 2026-04-21 at 15:29 +0200, Peter Zijlstra wrote:
>
> diff --git a/arch/x86/include/uapi/asm/processor-flags.h
> b/arch/x86/include/uapi/asm/processor-flags.h
> index 81d0c8bf1137..d32edefde587 100644
> --- a/arch/x86/include/uapi/asm/processor-flags.h
> +++ b/arch/x86/include/uapi/asm/processor-flags.h
> @@ -14,6 +14,8 @@
> #define X86_EFLAGS_FIXED _BITUL(X86_EFLAGS_FIXED_BIT)
> #define X86_EFLAGS_PF_BIT 2 /* Parity Flag */
> #define X86_EFLAGS_PF _BITUL(X86_EFLAGS_PF_BIT)
> +#define X86_EFLAGS_KVM_BIT 3 /* KVM Flag -- must be 0 */
> +#define X86_EFLAGS_KVM _BITUL(X86_EFLAGS_PF_BIT)
I fixed up the copy-paste typo here - _BITUL(X86_EFLAGS_KVM_BIT)
.. and with that the tests pass.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 17:20 ` Jim Mattson
@ 2026-04-21 18:29 ` Thomas Gleixner
2026-04-21 18:55 ` Sean Christopherson
0 siblings, 1 reply; 39+ messages in thread
From: Thomas Gleixner @ 2026-04-21 18:29 UTC (permalink / raw)
To: Jim Mattson
Cc: Peter Zijlstra, Binbin Wu, Verma, Vishal L, kvm@vger.kernel.org,
Edgecombe, Rick P, Wu, Binbin, x86@kernel.org,
Sean Christopherson, Paolo Bonzini
On Tue, Apr 21 2026 at 10:20, Jim Mattson wrote:
> On Tue, Apr 21, 2026 at 10:14 AM Thomas Gleixner <tglx@kernel.org> wrote:
>>
>> On Tue, Apr 21 2026 at 13:49, Peter Zijlstra wrote:
>> > On Tue, Apr 21, 2026 at 01:34:07PM +0200, Peter Zijlstra wrote:
>> >> > > > KVM invokes regular interrupts with pt_regs which have interrupts
>> >> > > > disabled. That's correct from the KVM point of view, but completely
>> >> > > > violates the obviously correct expectations of the interrupt entry/exit
>> >> > > > code.
>> >> > >
>> >> > > Mooo :-(
>> >>
>> >> Also, is this a x86/KVM 'special' or is this true for all arch/KVM that
>> >> use GENERIC_ENTRY?
>> >
>> > Should we not make asm_fred_entry_from_kvm()/VMX_DO_EVENT_IRQOFF fix IF
>> > on the fake frame instead? We know it will enable IRQs after doing
>> > handle_exit_irqoff() in vcpu_enter_guest().
>>
>> Doesn't work :)
>>
>> > SVM does not seem affected with this particular insanity.
>>
>> Looks like. It will take the interrupt after local_irq_enable().
>
> FWIW, VMX should work just like SVM if we clear VM_EXIT_ACK_INTR_ON_EXIT.
I know. What's the point of that VM_EXIT_ACK_INTR_ON_EXIT exercise? Is
there any performance benefit or is it just used because it's there?
Thanks,
tglx
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 18:29 ` Thomas Gleixner
@ 2026-04-21 18:55 ` Sean Christopherson
2026-04-21 20:06 ` Peter Zijlstra
` (2 more replies)
0 siblings, 3 replies; 39+ messages in thread
From: Sean Christopherson @ 2026-04-21 18:55 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Jim Mattson, Peter Zijlstra, Binbin Wu, Vishal L Verma,
kvm@vger.kernel.org, Rick P Edgecombe, Binbin Wu, x86@kernel.org,
Paolo Bonzini
On Tue, Apr 21, 2026, Thomas Gleixner wrote:
> On Tue, Apr 21 2026 at 10:20, Jim Mattson wrote:
> > On Tue, Apr 21, 2026 at 10:14 AM Thomas Gleixner <tglx@kernel.org> wrote:
> >>
> >> On Tue, Apr 21 2026 at 13:49, Peter Zijlstra wrote:
> >> > On Tue, Apr 21, 2026 at 01:34:07PM +0200, Peter Zijlstra wrote:
> >> >> > > > KVM invokes regular interrupts with pt_regs which have interrupts
> >> >> > > > disabled. That's correct from the KVM point of view, but completely
> >> >> > > > violates the obviously correct expectations of the interrupt entry/exit
> >> >> > > > code.
> >> >> > >
> >> >> > > Mooo :-(
> >> >>
> >> >> Also, is this a x86/KVM 'special' or is this true for all arch/KVM that
> >> >> use GENERIC_ENTRY?
> >> >
> >> > Should we not make asm_fred_entry_from_kvm()/VMX_DO_EVENT_IRQOFF fix IF
> >> > on the fake frame instead? We know it will enable IRQs after doing
> >> > handle_exit_irqoff() in vcpu_enter_guest().
> >>
> >> Doesn't work :)
> >>
> >> > SVM does not seem affected with this particular insanity.
> >>
> >> Looks like. It will take the interrupt after local_irq_enable().
> >
> > FWIW, VMX should work just like SVM if we clear VM_EXIT_ACK_INTR_ON_EXIT.
Hell no.
> I know. What's the point of that VM_EXIT_ACK_INTR_ON_EXIT exercise? Is
> there any performance benefit or is it just used because it's there?
There are performance benefits, and it preserves ordering: the first IRQ that's
serviced by the host is guaranteed to be _the_ IRQ that triggered the VM-Exit.
E.g. with AMD's approach, any IRQs that arrive between the VM-Exit and STI (which
is a pretty big swath of code) could be serviced before the IRQ that triggered
the exit, depending on priority.
VM_EXIT_ACK_INTR_ON_EXIT also provides symmetry with Intel's handling of NMIs, as
NMIs are unconditionally "acked" on VM-Exit.
Even if performance is "fine", changing decades of fundamental KVM behavior is
terrifying.
Pulling in an earlier idea:
: Now for VMX, that hrtimer_rearm_deferred() call should really go into
> : handle_external_interrupt_irqoff(), which in turn requires exporting
: __hrtimer_rearm_deferred().
IMO, that's the way to go. But instead of exporting __hrtimer_rearm_deferred(),
move vmx_do_nmi_irqoff() and vmx_do_interrupt_irqoff() into core kernel entry code
(along with the assembly glue), and then EXPORT_SYMBOL_FOR_KVM those. It'd mean
some extra surgery, e.g. to provide an equivalent to KVM's IDT lookup:
gate_offset((gate_desc *)host_idt_base + vector)
But I suspect it would be a big net positive in the end. E.g. the entry code
would *know* it's dealing with a direct call from KVM, and thus shouldn't need
to play pt_regs games.
Actually, even better would be to bury the FRED vs. not-FRED details in entry
code. E.g. on the KVM invocation side, we could get to something like the below,
and I'm pretty sure _reduce_ the number of for-KVM exports in the process.
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a29896a9ef14..f6f5c124ed3b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7127,17 +7127,9 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
"unexpected VM-Exit interrupt info: 0x%x", intr_info))
return;
- /*
- * Invoke the kernel's IRQ handler for the vector. Use the FRED path
- * when it's available even if FRED isn't fully enabled, e.g. even if
- * FRED isn't supported in hardware, in order to avoid the indirect
- * CALL in the non-FRED path.
- */
+ /* Forward the IRQ to the core kernel for processing. */
kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
- if (IS_ENABLED(CONFIG_X86_FRED))
- fred_entry_from_kvm(EVENT_TYPE_EXTINT, vector);
- else
- vmx_do_interrupt_irqoff(gate_offset((gate_desc *)host_idt_base + vector));
+ x86_entry_from_kvm(EVENT_TYPE_EXTINT, vector);
kvm_after_interrupt(vcpu);
vcpu->arch.at_instruction_boundary = true;
@@ -7447,10 +7439,7 @@ noinstr void vmx_handle_nmi(struct kvm_vcpu *vcpu)
return;
kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
- if (cpu_feature_enabled(X86_FEATURE_FRED))
- fred_entry_from_kvm(EVENT_TYPE_NMI, NMI_VECTOR);
- else
- vmx_do_nmi_irqoff();
+ x86_entry_from_kvm(EVENT_TYPE_NMI, NMI_VECTOR);
kvm_after_interrupt(vcpu);
}
^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 17:11 ` Thomas Gleixner
2026-04-21 17:20 ` Jim Mattson
@ 2026-04-21 19:18 ` Verma, Vishal L
1 sibling, 0 replies; 39+ messages in thread
From: Verma, Vishal L @ 2026-04-21 19:18 UTC (permalink / raw)
To: peterz@infradead.org, tglx@kernel.org
Cc: Wu, Binbin, kvm@vger.kernel.org, bonzini@redhat.com,
seanjc@google.com, binbin.wu@linux.intel.com, Edgecombe, Rick P,
x86@kernel.org
On Tue, 2026-04-21 at 19:11 +0200, Thomas Gleixner wrote:
>
> Now for VMX, that hrtimer_rearm_deferred() call should really go into
> handle_external_interrupt_irqoff(), which in turn requires to export
> __hrtimer_rearm_deferred().
>
> But we can avoid that altogether. Something like the untested below.
Tested with the below patch and the tests pass with this too.
>
> Thanks,
>
> tglx
> ---
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -42,9 +42,10 @@
> #include <linux/timer.h>
> #include <linux/freezer.h>
> #include <linux/compat.h>
> -
> #include <linux/uaccess.h>
>
> +#include <asm/irq_regs.h>
> +
> #include <trace/events/timer.h>
>
> #include "tick-internal.h"
> @@ -2062,11 +2063,16 @@ void __hrtimer_rearm_deferred(void)
> static __always_inline void
> hrtimer_interrupt_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t expires_next)
> {
> - /* hrtimer_interrupt() just re-evaluated the first expiring timer */
> - cpu_base->deferred_needs_update = false;
> - /* Cache the expiry time */
> - cpu_base->deferred_expires_next = expires_next;
> - set_thread_flag(TIF_HRTIMER_REARM);
> + /* Lies, damned lies and virt */
> + if (likely(!regs_irqs_disabled(get_irq_regs()))) {
> + /* hrtimer_interrupt() just re-evaluated the first expiring timer */
> + cpu_base->deferred_needs_update = false;
> + /* Cache the expiry time */
> + cpu_base->deferred_expires_next = expires_next;
> + set_thread_flag(TIF_HRTIMER_REARM);
> + } else {
> + hrtimer_rearm(cpu_base, expires_next, false);
> + }
> }
> #else /* CONFIG_HRTIMER_REARM_DEFERRED */
> static __always_inline void
>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 18:55 ` Sean Christopherson
@ 2026-04-21 20:06 ` Peter Zijlstra
2026-04-21 20:46 ` Peter Zijlstra
2026-04-21 20:57 ` Sean Christopherson
2026-04-21 20:39 ` Paolo Bonzini
2026-04-21 21:49 ` Thomas Gleixner
2 siblings, 2 replies; 39+ messages in thread
From: Peter Zijlstra @ 2026-04-21 20:06 UTC (permalink / raw)
To: Sean Christopherson
Cc: Thomas Gleixner, Jim Mattson, Binbin Wu, Vishal L Verma,
kvm@vger.kernel.org, Rick P Edgecombe, Binbin Wu, x86@kernel.org,
Paolo Bonzini
On Tue, Apr 21, 2026 at 11:55:33AM -0700, Sean Christopherson wrote:
> Pulling in an earlier idea:
>
> : Now for VMX, that hrtimer_rearm_deferred() call should really go into
> : handle_external_interrupt_irqoff(), which in turn requires to export
> : __hrtimer_rearm_deferred().
>
> Actually, even better would be to bury the FRED vs. not-FRED details in entry
> code. E.g. on the KVM invocation side, we could get to something like the below,
> and I'm pretty sure _reduce_ the number of for-KVM exports in the process.
Something like so then?
diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
index 72cae8e0ce85..83b4762d6ecb 100644
--- a/arch/x86/entry/Makefile
+++ b/arch/x86/entry/Makefile
@@ -13,7 +13,7 @@ CFLAGS_REMOVE_syscall_64.o = $(CC_FLAGS_FTRACE)
CFLAGS_syscall_32.o += -fno-stack-protector
CFLAGS_syscall_64.o += -fno-stack-protector
-obj-y := entry.o entry_$(BITS).o syscall_$(BITS).o
+obj-y := entry.o entry_$(BITS).o syscall_$(BITS).o common.o
obj-y += vdso/
obj-y += vsyscall/
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
new file mode 100644
index 000000000000..4b0171abb083
--- /dev/null
+++ b/arch/x86/entry/common.c
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/kvm_types.h>
+#include <linux/hrtimer_rearm.h>
+#include <asm/entry-common.h>
+#include <asm/fred.h>
+#include <asm/desc.h>
+
+noinstr void x86_entry_from_kvm(unsigned int event_type, unsigned int vector)
+{
+#ifdef CONFIG_X86_64
+ fred_entry_from_kvm(event_type, vector);
+#else
+ idt_entry_from_kvm(vector);
+#endif
+ if (event_type == EVENT_TYPE_EXTINT) {
+ instrumentation_begin();
+ hrtimer_rearm_deferred();
+ instrumentation_end();
+ }
+}
+EXPORT_SYMBOL_FOR_KVM(x86_entry_from_kvm);
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 92c0b4a94e0a..96c3e9322297 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1224,3 +1224,36 @@ SYM_CODE_START(rewind_stack_and_make_dead)
1: jmp 1b
SYM_CODE_END(rewind_stack_and_make_dead)
.popsection
+
+.pushsection .noinstr.text, "ax"
+.macro IDT_DO_EVENT_IRQOFF call_insn call_target
+ /*
+ * Unconditionally create a stack frame, getting the correct RSP on the
+ * stack (for x86-64) would take two instructions anyways, and RBP can
+ * be used to restore RSP to make objtool happy (see below).
+ */
+ push %ebp
+ mov %esp, %ebp
+
+ pushf
+ push $__KERNEL_CS
+ \call_insn \call_target
+
+ /*
+ * "Restore" RSP from RBP, even though IRET has already unwound RSP to
+ * the correct value. objtool doesn't know the callee will IRET and,
+ * without the explicit restore, thinks the stack is getting walloped.
+ * Using an unwind hint is problematic due to x86-64's dynamic alignment.
+ */
+ leave
+ RET
+.endm
+
+SYM_FUNC_START(idt_do_interrupt_irqoff)
+ IDT_DO_EVENT_IRQOFF CALL_NOSPEC _ASM_ARG1
+SYM_FUNC_END(idt_do_interrupt_irqoff)
+
+SYM_FUNC_START(idt_do_nmi_irqoff)
+ IDT_DO_EVENT_IRQOFF call asm_exc_nmi_kvm_vmx
+SYM_FUNC_END(idt_do_nmi_irqoff)
+.popsection
diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
index 894f7f16eb80..0d2768ab836c 100644
--- a/arch/x86/entry/entry_64_fred.S
+++ b/arch/x86/entry/entry_64_fred.S
@@ -147,5 +147,4 @@ SYM_FUNC_START(asm_fred_entry_from_kvm)
RET
SYM_FUNC_END(asm_fred_entry_from_kvm)
-EXPORT_SYMBOL_FOR_KVM(asm_fred_entry_from_kvm);
#endif
diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index ec95fe44fa3a..cb24990f38fd 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -437,6 +437,7 @@ extern void idt_setup_early_traps(void);
extern void idt_setup_traps(void);
extern void idt_setup_apic_and_irq_gates(void);
extern bool idt_is_f00f_address(unsigned long address);
+extern void idt_entry_from_kvm(unsigned int vector);
#ifdef CONFIG_X86_64
extern void idt_setup_early_pf(void);
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index 7535131c711b..eca24b5e07f4 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -97,4 +97,6 @@ static __always_inline void arch_exit_to_user_mode(void)
}
#define arch_exit_to_user_mode arch_exit_to_user_mode
+extern void x86_entry_from_kvm(unsigned int entry_type, unsigned int vector);
+
#endif
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index 260456588756..d95d8d196cd4 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -266,6 +266,14 @@ void __init idt_setup_early_pf(void)
idt_setup_from_table(idt_table, early_pf_idts,
ARRAY_SIZE(early_pf_idts), true);
}
+#else
+void idt_entry_from_kvm(unsigned int vector)
+{
+ if (vector == NMI_VECTOR)
+ idt_do_nmi_irqoff();
+ else
+ idt_do_interrupt_irqoff(gate_offset(idt_table + vector));
+}
#endif
static void __init idt_map_in_cea(void)
diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index 8a481dae9cae..ff1f254a0ef4 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -31,38 +31,6 @@
#define VCPU_R15 __VCPU_REGS_R15 * WORD_SIZE
#endif
-.macro VMX_DO_EVENT_IRQOFF call_insn call_target
- /*
- * Unconditionally create a stack frame, getting the correct RSP on the
- * stack (for x86-64) would take two instructions anyways, and RBP can
- * be used to restore RSP to make objtool happy (see below).
- */
- push %_ASM_BP
- mov %_ASM_SP, %_ASM_BP
-
-#ifdef CONFIG_X86_64
- /*
- * Align RSP to a 16-byte boundary (to emulate CPU behavior) before
- * creating the synthetic interrupt stack frame for the IRQ/NMI.
- */
- and $-16, %rsp
- push $__KERNEL_DS
- push %rbp
-#endif
- pushf
- push $__KERNEL_CS
- \call_insn \call_target
-
- /*
- * "Restore" RSP from RBP, even though IRET has already unwound RSP to
- * the correct value. objtool doesn't know the callee will IRET and,
- * without the explicit restore, thinks the stack is getting walloped.
- * Using an unwind hint is problematic due to x86-64's dynamic alignment.
- */
- leave
- RET
-.endm
-
.section .noinstr.text, "ax"
/**
@@ -320,10 +288,6 @@ SYM_INNER_LABEL_ALIGN(vmx_vmexit, SYM_L_GLOBAL)
SYM_FUNC_END(__vmx_vcpu_run)
-SYM_FUNC_START(vmx_do_nmi_irqoff)
- VMX_DO_EVENT_IRQOFF call asm_exc_nmi_kvm_vmx
-SYM_FUNC_END(vmx_do_nmi_irqoff)
-
#ifndef CONFIG_CC_HAS_ASM_GOTO_OUTPUT
/**
@@ -375,13 +339,3 @@ SYM_FUNC_START(vmread_error_trampoline)
RET
SYM_FUNC_END(vmread_error_trampoline)
#endif
-
-.section .text, "ax"
-
-#ifndef CONFIG_X86_FRED
-
-SYM_FUNC_START(vmx_do_interrupt_irqoff)
- VMX_DO_EVENT_IRQOFF CALL_NOSPEC _ASM_ARG1
-SYM_FUNC_END(vmx_do_interrupt_irqoff)
-
-#endif
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a29896a9ef14..f6f5c124ed3b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7127,17 +7127,9 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
"unexpected VM-Exit interrupt info: 0x%x", intr_info))
return;
- /*
- * Invoke the kernel's IRQ handler for the vector. Use the FRED path
- * when it's available even if FRED isn't fully enabled, e.g. even if
- * FRED isn't supported in hardware, in order to avoid the indirect
- * CALL in the non-FRED path.
- */
+ /* Forward the IRQ to the core kernel for processing. */
kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
- if (IS_ENABLED(CONFIG_X86_FRED))
- fred_entry_from_kvm(EVENT_TYPE_EXTINT, vector);
- else
- vmx_do_interrupt_irqoff(gate_offset((gate_desc *)host_idt_base + vector));
+ x86_entry_from_kvm(EVENT_TYPE_EXTINT, vector);
kvm_after_interrupt(vcpu);
vcpu->arch.at_instruction_boundary = true;
@@ -7447,10 +7439,7 @@ noinstr void vmx_handle_nmi(struct kvm_vcpu *vcpu)
return;
kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
- if (cpu_feature_enabled(X86_FEATURE_FRED))
- fred_entry_from_kvm(EVENT_TYPE_NMI, NMI_VECTOR);
- else
- vmx_do_nmi_irqoff();
+ x86_entry_from_kvm(EVENT_TYPE_NMI, NMI_VECTOR);
kvm_after_interrupt(vcpu);
}
^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 18:55 ` Sean Christopherson
2026-04-21 20:06 ` Peter Zijlstra
@ 2026-04-21 20:39 ` Paolo Bonzini
2026-04-21 21:02 ` Sean Christopherson
2026-04-21 22:48 ` Thomas Gleixner
2026-04-21 21:49 ` Thomas Gleixner
2 siblings, 2 replies; 39+ messages in thread
From: Paolo Bonzini @ 2026-04-21 20:39 UTC (permalink / raw)
To: Sean Christopherson
Cc: Thomas Gleixner, Jim Mattson, Peter Zijlstra, Binbin Wu,
Vishal L Verma, kvm, Rick P Edgecombe, Binbin Wu,
the arch/x86 maintainers, Paolo Bonzini
Il mar 21 apr 2026, 19:55 Sean Christopherson <seanjc@google.com> ha scritto:
>
> > > FWIW, VMX should work just like SVM if we clear VM_EXIT_ACK_INTR_ON_EXIT.
>
> Hell no.
>
> > I know. What's the point of that VM_EXIT_ACK_INTR_ON_EXIT exercise? Is
> > there any performance benefit or is it just used because it's there?
>
> There are performance benefits, and it preserves ordering [...]
> NMIs are unconditionally "acked" on VM-Exit.
Not that I disagree but...
> Even if performance is "fine", changing decades of fundamental KVM behavior is
> terrifying.
... it's not decades, ack on VM exit is actually relatively recent (10
years out of 20 :)). The reason why it was introduced is another killer
for the idea, though. Posted interrupts require it, for some reason
only known to Intel.
Thanks,
Paolo
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 20:06 ` Peter Zijlstra
@ 2026-04-21 20:46 ` Peter Zijlstra
2026-04-21 20:57 ` Sean Christopherson
1 sibling, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2026-04-21 20:46 UTC (permalink / raw)
To: Sean Christopherson
Cc: Thomas Gleixner, Jim Mattson, Binbin Wu, Vishal L Verma,
kvm@vger.kernel.org, Rick P Edgecombe, Binbin Wu, x86@kernel.org,
Paolo Bonzini
On Tue, Apr 21, 2026 at 10:06:20PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 11:55:33AM -0700, Sean Christopherson wrote:
>
> > Pulling in an earlier idea:
> >
> > : Now for VMX, that hrtimer_rearm_deferred() call should really go into
> > : handle_external_interrupt_irqoff(), which in turn requires to export
> > : __hrtimer_rearm_deferred().
> >
>
> > Actually, even better would be to bury the FRED vs. not-FRED details in entry
> > code. E.g. on the KVM invocation side, we could get to something like the below,
> > and I'm pretty sure _reduce_ the number of for-KVM exports in the process.
>
> Something like so then?
And this one seems to build on ARCH=i386 too.
---
diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
index 72cae8e0ce85..83b4762d6ecb 100644
--- a/arch/x86/entry/Makefile
+++ b/arch/x86/entry/Makefile
@@ -13,7 +13,7 @@ CFLAGS_REMOVE_syscall_64.o = $(CC_FLAGS_FTRACE)
CFLAGS_syscall_32.o += -fno-stack-protector
CFLAGS_syscall_64.o += -fno-stack-protector
-obj-y := entry.o entry_$(BITS).o syscall_$(BITS).o
+obj-y := entry.o entry_$(BITS).o syscall_$(BITS).o common.o
obj-y += vdso/
obj-y += vsyscall/
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
new file mode 100644
index 000000000000..8de94a590b26
--- /dev/null
+++ b/arch/x86/entry/common.c
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/entry-common.h>
+#include <linux/kvm_types.h>
+#include <linux/hrtimer_rearm.h>
+#include <asm/fred.h>
+#include <asm/desc.h>
+
+noinstr void x86_entry_from_kvm(unsigned int event_type, unsigned int vector)
+{
+#ifdef CONFIG_X86_64
+ fred_entry_from_kvm(event_type, vector);
+#else
+ idt_entry_from_kvm(vector);
+#endif
+ if (event_type == EVENT_TYPE_EXTINT) {
+ instrumentation_begin();
+ hrtimer_rearm_deferred();
+ instrumentation_end();
+ }
+}
+EXPORT_SYMBOL_FOR_KVM(x86_entry_from_kvm);
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 92c0b4a94e0a..9324e97d14cf 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1224,3 +1224,36 @@ SYM_CODE_START(rewind_stack_and_make_dead)
1: jmp 1b
SYM_CODE_END(rewind_stack_and_make_dead)
.popsection
+
+.pushsection .noinstr.text, "ax"
+.macro IDT_DO_EVENT_IRQOFF call_insn call_target
+ /*
+ * Unconditionally create a stack frame, getting the correct RSP on the
+ * stack (for x86-64) would take two instructions anyways, and RBP can
+ * be used to restore RSP to make objtool happy (see below).
+ */
+ push %ebp
+ mov %esp, %ebp
+
+ pushf
+ push $__KERNEL_CS
+ \call_insn \call_target
+
+ /*
+ * "Restore" RSP from RBP, even though IRET has already unwound RSP to
+ * the correct value. objtool doesn't know the callee will IRET and,
+ * without the explicit restore, thinks the stack is getting walloped.
+ * Using an unwind hint is problematic due to x86-64's dynamic alignment.
+ */
+ leave
+ RET
+.endm
+
+SYM_FUNC_START(idt_do_interrupt_irqoff)
+ IDT_DO_EVENT_IRQOFF CALL_NOSPEC _ASM_ARG1
+SYM_FUNC_END(idt_do_interrupt_irqoff)
+
+SYM_FUNC_START(idt_do_nmi_irqoff)
+ IDT_DO_EVENT_IRQOFF call asm_exc_nmi
+SYM_FUNC_END(idt_do_nmi_irqoff)
+.popsection
diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
index 894f7f16eb80..0d2768ab836c 100644
--- a/arch/x86/entry/entry_64_fred.S
+++ b/arch/x86/entry/entry_64_fred.S
@@ -147,5 +147,4 @@ SYM_FUNC_START(asm_fred_entry_from_kvm)
RET
SYM_FUNC_END(asm_fred_entry_from_kvm)
-EXPORT_SYMBOL_FOR_KVM(asm_fred_entry_from_kvm);
#endif
diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index ec95fe44fa3a..f44d6a606b4c 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -438,6 +438,10 @@ extern void idt_setup_traps(void);
extern void idt_setup_apic_and_irq_gates(void);
extern bool idt_is_f00f_address(unsigned long address);
+extern void idt_do_interrupt_irqoff(unsigned int vector);
+extern void idt_do_nmi_irqoff(void);
+extern void idt_entry_from_kvm(unsigned int vector);
+
#ifdef CONFIG_X86_64
extern void idt_setup_early_pf(void);
#else
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index 7535131c711b..eca24b5e07f4 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -97,4 +97,6 @@ static __always_inline void arch_exit_to_user_mode(void)
}
#define arch_exit_to_user_mode arch_exit_to_user_mode
+extern void x86_entry_from_kvm(unsigned int entry_type, unsigned int vector);
+
#endif
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 42bf6a58ec36..db4072875f5f 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -633,17 +633,6 @@ DECLARE_IDTENTRY_RAW(X86_TRAP_MC, xenpv_exc_machine_check);
#endif
/* NMI */
-
-#if IS_ENABLED(CONFIG_KVM_INTEL)
-/*
- * Special entry point for VMX which invokes this on the kernel stack, even for
- * 64-bit, i.e. without using an IST. asm_exc_nmi() requires an IST to work
- * correctly vs. the NMI 'executing' marker. Used for 32-bit kernels as well
- * to avoid more ifdeffery.
- */
-DECLARE_IDTENTRY(X86_TRAP_NMI, exc_nmi_kvm_vmx);
-#endif
-
DECLARE_IDTENTRY_NMI(X86_TRAP_NMI, exc_nmi);
#ifdef CONFIG_XEN_PV
DECLARE_IDTENTRY_RAW(X86_TRAP_NMI, xenpv_exc_nmi);
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index 260456588756..d95d8d196cd4 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -266,6 +266,14 @@ void __init idt_setup_early_pf(void)
idt_setup_from_table(idt_table, early_pf_idts,
ARRAY_SIZE(early_pf_idts), true);
}
+#else
+void idt_entry_from_kvm(unsigned int vector)
+{
+ if (vector == NMI_VECTOR)
+ idt_do_nmi_irqoff();
+ else
+ idt_do_interrupt_irqoff(gate_offset(idt_table + vector));
+}
#endif
static void __init idt_map_in_cea(void)
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 3d239ed12744..06fe225fb0a2 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -609,14 +609,6 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
goto nmi_restart;
}
-#if IS_ENABLED(CONFIG_KVM_INTEL)
-DEFINE_IDTENTRY_RAW(exc_nmi_kvm_vmx)
-{
- exc_nmi(regs);
-}
-EXPORT_SYMBOL_FOR_KVM(asm_exc_nmi_kvm_vmx);
-#endif
-
#ifdef CONFIG_NMI_CHECK_CPU
static char *nmi_check_stall_msg[] = {
diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index 8a481dae9cae..ff1f254a0ef4 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -31,38 +31,6 @@
#define VCPU_R15 __VCPU_REGS_R15 * WORD_SIZE
#endif
-.macro VMX_DO_EVENT_IRQOFF call_insn call_target
- /*
- * Unconditionally create a stack frame, getting the correct RSP on the
- * stack (for x86-64) would take two instructions anyways, and RBP can
- * be used to restore RSP to make objtool happy (see below).
- */
- push %_ASM_BP
- mov %_ASM_SP, %_ASM_BP
-
-#ifdef CONFIG_X86_64
- /*
- * Align RSP to a 16-byte boundary (to emulate CPU behavior) before
- * creating the synthetic interrupt stack frame for the IRQ/NMI.
- */
- and $-16, %rsp
- push $__KERNEL_DS
- push %rbp
-#endif
- pushf
- push $__KERNEL_CS
- \call_insn \call_target
-
- /*
- * "Restore" RSP from RBP, even though IRET has already unwound RSP to
- * the correct value. objtool doesn't know the callee will IRET and,
- * without the explicit restore, thinks the stack is getting walloped.
- * Using an unwind hint is problematic due to x86-64's dynamic alignment.
- */
- leave
- RET
-.endm
-
.section .noinstr.text, "ax"
/**
@@ -320,10 +288,6 @@ SYM_INNER_LABEL_ALIGN(vmx_vmexit, SYM_L_GLOBAL)
SYM_FUNC_END(__vmx_vcpu_run)
-SYM_FUNC_START(vmx_do_nmi_irqoff)
- VMX_DO_EVENT_IRQOFF call asm_exc_nmi_kvm_vmx
-SYM_FUNC_END(vmx_do_nmi_irqoff)
-
#ifndef CONFIG_CC_HAS_ASM_GOTO_OUTPUT
/**
@@ -375,13 +339,3 @@ SYM_FUNC_START(vmread_error_trampoline)
RET
SYM_FUNC_END(vmread_error_trampoline)
#endif
-
-.section .text, "ax"
-
-#ifndef CONFIG_X86_FRED
-
-SYM_FUNC_START(vmx_do_interrupt_irqoff)
- VMX_DO_EVENT_IRQOFF CALL_NOSPEC _ASM_ARG1
-SYM_FUNC_END(vmx_do_interrupt_irqoff)
-
-#endif
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a29896a9ef14..f6f5c124ed3b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7127,17 +7127,9 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
"unexpected VM-Exit interrupt info: 0x%x", intr_info))
return;
- /*
- * Invoke the kernel's IRQ handler for the vector. Use the FRED path
- * when it's available even if FRED isn't fully enabled, e.g. even if
- * FRED isn't supported in hardware, in order to avoid the indirect
- * CALL in the non-FRED path.
- */
+ /* Forward the IRQ to the core kernel for processing. */
kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
- if (IS_ENABLED(CONFIG_X86_FRED))
- fred_entry_from_kvm(EVENT_TYPE_EXTINT, vector);
- else
- vmx_do_interrupt_irqoff(gate_offset((gate_desc *)host_idt_base + vector));
+ x86_entry_from_kvm(EVENT_TYPE_EXTINT, vector);
kvm_after_interrupt(vcpu);
vcpu->arch.at_instruction_boundary = true;
@@ -7447,10 +7439,7 @@ noinstr void vmx_handle_nmi(struct kvm_vcpu *vcpu)
return;
kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
- if (cpu_feature_enabled(X86_FEATURE_FRED))
- fred_entry_from_kvm(EVENT_TYPE_NMI, NMI_VECTOR);
- else
- vmx_do_nmi_irqoff();
+ x86_entry_from_kvm(EVENT_TYPE_NMI, NMI_VECTOR);
kvm_after_interrupt(vcpu);
}
^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 20:06 ` Peter Zijlstra
2026-04-21 20:46 ` Peter Zijlstra
@ 2026-04-21 20:57 ` Sean Christopherson
2026-04-21 21:02 ` Peter Zijlstra
1 sibling, 1 reply; 39+ messages in thread
From: Sean Christopherson @ 2026-04-21 20:57 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Thomas Gleixner, Jim Mattson, Binbin Wu, Vishal L Verma,
kvm@vger.kernel.org, Rick P Edgecombe, Binbin Wu, x86@kernel.org,
Paolo Bonzini
On Tue, Apr 21, 2026, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 11:55:33AM -0700, Sean Christopherson wrote:
>
> > Pulling in an earlier idea:
> >
> > : Now for VMX, that hrtimer_rearm_deferred() call should really go into
> > : handle_external_interrupt_irqoff(), which in turn requires to export
> > : __hrtimer_rearm_deferred().
> >
>
> > Actually, even better would be to bury the FRED vs. not-FRED details in entry
> > code. E.g. on the KVM invocation side, we could get to something like the below,
> > and I'm pretty sure _reduce_ the number of for-KVM exports in the process.
>
> Something like so then?
Yep!
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> new file mode 100644
> index 000000000000..4b0171abb083
> --- /dev/null
> +++ b/arch/x86/entry/common.c
> @@ -0,0 +1,22 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#include <linux/kvm_types.h>
> +#include <linux/hrtimer_rearm.h>
For CONFIG_X86_FRED=n, which is possible on x86-64 if CONFIG_KVM_INTEL=n, this
#include <linux/sched/task_stack.h>
is needed so that task_pt_regs() can find task_stack_page() (and including
task_stack.h in processor.h would create cyclical includes).
> +#include <asm/entry-common.h>
> +#include <asm/fred.h>
> +#include <asm/desc.h>
> +
Related to CONFIG_X86_FRED=n, I vote to wrap this API with #if IS_ENABLED(CONFIG_KVM_INTEL)
and then delete the fred_entry_from_kvm() stub so that a goof results in a build
failure. That'd also be a good place for a comment to explain some of the usage.
> +noinstr void x86_entry_from_kvm(unsigned int event_type, unsigned int vector)
> +{
> +#ifdef CONFIG_X86_64
> + fred_entry_from_kvm(event_type, vector);
> +#else
> + idt_entry_from_kvm(vector);
> +#endif
...
> +SYM_FUNC_START(idt_do_interrupt_irqoff)
> + IDT_DO_EVENT_IRQOFF CALL_NOSPEC _ASM_ARG1
> +SYM_FUNC_END(idt_do_interrupt_irqoff)
> +
> +SYM_FUNC_START(idt_do_nmi_irqoff)
> + IDT_DO_EVENT_IRQOFF call asm_exc_nmi_kvm_vmx
> +SYM_FUNC_END(idt_do_nmi_irqoff)
These need to be declared, and the KVM declarations can be deleted.
> static void __init idt_map_in_cea(void)
> diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
> index 8a481dae9cae..ff1f254a0ef4 100644
> --- a/arch/x86/kvm/vmx/vmenter.S
> +++ b/arch/x86/kvm/vmx/vmenter.S
> @@ -31,38 +31,6 @@
> #define VCPU_R15 __VCPU_REGS_R15 * WORD_SIZE
> #endif
>
> -.macro VMX_DO_EVENT_IRQOFF call_insn call_target
> - /*
> - * Unconditionally create a stack frame, getting the correct RSP on the
> - * stack (for x86-64) would take two instructions anyways, and RBP can
> - * be used to restore RSP to make objtool happy (see below).
> - */
> - push %_ASM_BP
> - mov %_ASM_SP, %_ASM_BP
> -
> -#ifdef CONFIG_X86_64
> - /*
> - * Align RSP to a 16-byte boundary (to emulate CPU behavior) before
> - * creating the synthetic interrupt stack frame for the IRQ/NMI.
> - */
> - and $-16, %rsp
> - push $__KERNEL_DS
> - push %rbp
> -#endif
For anyone else having an -ENOCOFFEE moment, this has been dead code since commit
28d11e4548b7 ("x86/fred: KVM: VMX: Always use FRED for IRQs when CONFIG_X86_FRED=y").
This as delta? (I had typed this all up before Peter posted a new version, so
dammit I'm sending it!)
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 4b0171abb083..b039276bede9 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -2,10 +2,20 @@
#include <linux/kvm_types.h>
#include <linux/hrtimer_rearm.h>
+#include <linux/sched/task_stack.h>
#include <asm/entry-common.h>
#include <asm/fred.h>
#include <asm/desc.h>
+#if IS_ENABLED(CONFIG_KVM_INTEL)
+/*
+ * On VMX, NMIs and IRQs (as configured by KVM) are acknowledged by hardware as
+ * part of the VM-Exit, i.e. the event itself is consumed as part of the VM-Exit.
+ * x86_entry_from_kvm() is invoked by KVM to effectively forward NMIs and IRQs
+ * to the kernel for servicing. On SVM, a.k.a. AMD, the NMI/IRQ VM-Exit is
+ * purely a signal that an NMI/IRQ is pending, i.e. the event that triggered
+ * the VM-Exit is held pending until it's unblocked in the host.
+ */
noinstr void x86_entry_from_kvm(unsigned int event_type, unsigned int vector)
{
#ifdef CONFIG_X86_64
@@ -20,3 +30,4 @@ noinstr void x86_entry_from_kvm(unsigned int event_type, unsigned int vector)
}
}
EXPORT_SYMBOL_FOR_KVM(x86_entry_from_kvm);
+#endif
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index eca24b5e07f4..2421b1edf77e 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -98,5 +98,7 @@ static __always_inline void arch_exit_to_user_mode(void)
#define arch_exit_to_user_mode arch_exit_to_user_mode
extern void x86_entry_from_kvm(unsigned int entry_type, unsigned int vector);
+extern void idt_do_interrupt_irqoff(unsigned long entry);
+extern void idt_do_nmi_irqoff(void);
#endif
diff --git a/arch/x86/include/asm/fred.h b/arch/x86/include/asm/fred.h
index 2bb65677c079..18a2f811c358 100644
--- a/arch/x86/include/asm/fred.h
+++ b/arch/x86/include/asm/fred.h
@@ -110,7 +110,6 @@ static __always_inline unsigned long fred_event_data(struct pt_regs *regs) { ret
static inline void cpu_init_fred_exceptions(void) { }
static inline void cpu_init_fred_rsps(void) { }
static inline void fred_complete_exception_setup(void) { }
-static inline void fred_entry_from_kvm(unsigned int type, unsigned int vector) { }
static inline void fred_sync_rsp0(unsigned long rsp0) { }
static inline void fred_update_rsp0(void) { }
#endif /* CONFIG_X86_FRED */
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 3d239ed12744..52a3afb1b79e 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -614,7 +614,6 @@ DEFINE_IDTENTRY_RAW(exc_nmi_kvm_vmx)
{
exc_nmi(regs);
}
-EXPORT_SYMBOL_FOR_KVM(asm_exc_nmi_kvm_vmx);
#endif
#ifdef CONFIG_NMI_CHECK_CPU
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f6f5c124ed3b..753f0dbb9cf8 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7083,9 +7083,6 @@ void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
vmcs_write64(EOI_EXIT_BITMAP3, eoi_exit_bitmap[3]);
}
-void vmx_do_interrupt_irqoff(unsigned long entry);
-void vmx_do_nmi_irqoff(void);
-
static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
{
/*
^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 20:57 ` Sean Christopherson
@ 2026-04-21 21:02 ` Peter Zijlstra
2026-04-21 21:42 ` Sean Christopherson
0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2026-04-21 21:02 UTC (permalink / raw)
To: Sean Christopherson
Cc: Thomas Gleixner, Jim Mattson, Binbin Wu, Vishal L Verma,
kvm@vger.kernel.org, Rick P Edgecombe, Binbin Wu, x86@kernel.org,
Paolo Bonzini
On Tue, Apr 21, 2026 at 08:57:24PM +0000, Sean Christopherson wrote:
> > diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> > new file mode 100644
> > index 000000000000..4b0171abb083
> > --- /dev/null
> > +++ b/arch/x86/entry/common.c
> > @@ -0,0 +1,22 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +
> > +#include <linux/kvm_types.h>
> > +#include <linux/hrtimer_rearm.h>
>
> For CONFIG_X86_FRED=n, which is possible on x86-64 if CONFIG_KVM_INTEL=n, this
>
> #include <linux/sched/task_stack.h>
>
> is needed so that task_pt_regs() can find task_stack_page() (and including
> task_stack.h in processor.h would create cyclical includes).
>
> > +#include <asm/entry-common.h>
This wanted to be linux/entry-common.h, otherwise I could not get 32bit
to build. And that sorts that same include issue you pointed out. My
x86_64 build didn't seem to care...
> > +#include <asm/fred.h>
> > +#include <asm/desc.h>
> > +
>
> Related to CONFIG_X86_FRED=n, I vote to wrap this API with #if IS_ENABLED(CONFIG_KVM_INTEL)
> and then delete the fred_entry_from_kvm() stub so that a goof results in a build
> failure. That'd also be a good place for a comment to explain some of the usage.
Yeah, it definitely wants a few comments. I'll do the KVM_INTEL thing,
I'd forgotten we'd made the FRED=y thing conditional on that, I thought
we'd had it unconditionally =y for 64bit.
> This as delta? (I had typed this all up before Peter posted a new verison, so
> dammit I'm sending it!)
:-)
I'll go stare at it in the morning, I'm about to go crash out.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 20:39 ` Paolo Bonzini
@ 2026-04-21 21:02 ` Sean Christopherson
2026-04-21 22:48 ` Thomas Gleixner
1 sibling, 0 replies; 39+ messages in thread
From: Sean Christopherson @ 2026-04-21 21:02 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Thomas Gleixner, Jim Mattson, Peter Zijlstra, Binbin Wu,
Vishal L Verma, kvm, Rick P Edgecombe, Binbin Wu,
the arch/x86 maintainers, Paolo Bonzini
On Tue, Apr 21, 2026, Paolo Bonzini wrote:
> Il mar 21 apr 2026, 19:55 Sean Christopherson <seanjc@google.com> ha scritto:
> >
> > > > FWIW, VMX should work just like SVM if we clear VM_EXIT_ACK_INTR_ON_EXIT.
> >
> > Hell no.
> >
> > > I know. What's the point of that VM_EXIT_ACK_INTR_ON_EXIT exercise? Is
> > > there any performance benefit or is it just used because it's there?
> >
> > There are performance benefits, and it preserves ordering [...]
> > NMIs are unconditionally "acked" on VM-Exit.
>
> Not that I disagree but...
>
> > Even if performance is "fine", changing decades of fundamental KVM behavior is
> > terrifying.
>
> ... it's not decades, ack on VM exit is actually relatively recent (10
> years out 20 :)). The reason why it was introduced is another killer
> for the idea, though. Posted interrupts require it,
Oh, I forgot about posted interrupts. So yeah, what Paolo said :-)
> for some reason only known to Intel.
My guess is because the ucode that morphs the notification vector into posted
interrupt processing is a sub-clause of the ACK flow. And from an architectural
perspective, having the CPU ACK/dismiss some IRQs (notification vectors) but not
others (everything else) would be kludgy (even more so than special casing the
notification vector already is).
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 21:02 ` Peter Zijlstra
@ 2026-04-21 21:42 ` Sean Christopherson
0 siblings, 0 replies; 39+ messages in thread
From: Sean Christopherson @ 2026-04-21 21:42 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Thomas Gleixner, Jim Mattson, Binbin Wu, Vishal L Verma,
kvm@vger.kernel.org, Rick P Edgecombe, Binbin Wu, x86@kernel.org,
Paolo Bonzini
On Tue, Apr 21, 2026, Peter Zijlstra wrote:
> On Tue, Apr 21, 2026 at 08:57:24PM +0000, Sean Christopherson wrote:
> > This as delta? (I had typed this all up before Peter posted a new version, so
> > dammit I'm sending it!)
>
> :-)
>
> I'll go stare at it in the morning, I'm about to go crash out.
New delta against your effective v2, builds all of my configs (which isn't _that_
many, but I think they cover most of the weirder ways to include KVM (or not)).
I'll start testing the full thing to try and get early signal on the health.
diff --git arch/x86/entry/common.c arch/x86/entry/common.c
index 8de94a590b26..0532c1b65dd9 100644
--- arch/x86/entry/common.c
+++ arch/x86/entry/common.c
@@ -6,6 +6,15 @@
#include <asm/fred.h>
#include <asm/desc.h>
+#if IS_ENABLED(CONFIG_KVM_INTEL)
+/*
+ * On VMX, NMIs and IRQs (as configured by KVM) are acknowledged by hardware as
+ * part of the VM-Exit, i.e. the event itself is consumed as part of the VM-Exit.
+ * x86_entry_from_kvm() is invoked by KVM to effectively forward NMIs and IRQs
+ * to the kernel for servicing. On SVM, a.k.a. AMD, the NMI/IRQ VM-Exit is
+ * purely a signal that an NMI/IRQ is pending, i.e. the event that triggered
+ * the VM-Exit is held pending until it's unblocked in the host.
+ */
noinstr void x86_entry_from_kvm(unsigned int event_type, unsigned int vector)
{
#ifdef CONFIG_X86_64
@@ -20,3 +29,4 @@ noinstr void x86_entry_from_kvm(unsigned int event_type, unsigned int vector)
}
}
EXPORT_SYMBOL_FOR_KVM(x86_entry_from_kvm);
+#endif
diff --git arch/x86/include/asm/fred.h arch/x86/include/asm/fred.h
index 2bb65677c079..18a2f811c358 100644
--- arch/x86/include/asm/fred.h
+++ arch/x86/include/asm/fred.h
@@ -110,7 +110,6 @@ static __always_inline unsigned long fred_event_data(struct pt_regs *regs) { ret
static inline void cpu_init_fred_exceptions(void) { }
static inline void cpu_init_fred_rsps(void) { }
static inline void fred_complete_exception_setup(void) { }
-static inline void fred_entry_from_kvm(unsigned int type, unsigned int vector) { }
static inline void fred_sync_rsp0(unsigned long rsp0) { }
static inline void fred_update_rsp0(void) { }
#endif /* CONFIG_X86_FRED */
diff --git arch/x86/kernel/idt.c arch/x86/kernel/idt.c
index d95d8d196cd4..d69d27846424 100644
--- arch/x86/kernel/idt.c
+++ arch/x86/kernel/idt.c
@@ -267,7 +267,7 @@ void __init idt_setup_early_pf(void)
ARRAY_SIZE(early_pf_idts), true);
}
#else
-void idt_entry_from_kvm(unsigned int vector)
+noinstr void idt_entry_from_kvm(unsigned int vector)
{
if (vector == NMI_VECTOR)
idt_do_nmi_irqoff();
diff --git arch/x86/kvm/vmx/vmx.c arch/x86/kvm/vmx/vmx.c
index f6f5c124ed3b..753f0dbb9cf8 100644
--- arch/x86/kvm/vmx/vmx.c
+++ arch/x86/kvm/vmx/vmx.c
@@ -7083,9 +7083,6 @@ void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
vmcs_write64(EOI_EXIT_BITMAP3, eoi_exit_bitmap[3]);
}
-void vmx_do_interrupt_irqoff(unsigned long entry);
-void vmx_do_nmi_irqoff(void);
-
static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
{
/*
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 18:55 ` Sean Christopherson
2026-04-21 20:06 ` Peter Zijlstra
2026-04-21 20:39 ` Paolo Bonzini
@ 2026-04-21 21:49 ` Thomas Gleixner
2026-04-21 22:07 ` Sean Christopherson
2026-04-21 22:24 ` Paolo Bonzini
2 siblings, 2 replies; 39+ messages in thread
From: Thomas Gleixner @ 2026-04-21 21:49 UTC (permalink / raw)
To: Sean Christopherson
Cc: Jim Mattson, Peter Zijlstra, Binbin Wu, Vishal L Verma,
kvm@vger.kernel.org, Rick P Edgecombe, Binbin Wu, x86@kernel.org,
Paolo Bonzini
On Tue, Apr 21 2026 at 11:55, Sean Christopherson wrote:
> On Tue, Apr 21, 2026, Thomas Gleixner wrote:
>> >> Looks like. It will take the interrupt after local_irq_enable().
>> >
>> > FWIW, VMX should work just like SVM if we clear VM_EXIT_ACK_INTR_ON_EXIT.
>
> Hell no.
I knew for sure that someone from the KVM camp would cry murder :)
>> I know. What's the point of that VM_EXIT_ACK_INTR_ON_EXIT exercise? Is
>> there any performance benefit or is it just used because it's there?
>
> There are performance benefits, and it preserves ordering: the first IRQ that's
> serviced by the host is guaranteed to be _the_ IRQ that triggered the VM-Exit.
> E.g. with AMD's approach, any IRQs that arrive between the VM-Exit and STI (which
> is a pretty big swath of code) could be serviced before the IRQ that triggered
> the exit, depending on priority.
I might eventually buy the performance benefit, but the ordering is not
interesting at all. That's a pure virt-cult fallacy to believe that it
matters. Why?
Look at this bare metal scenario with two interrupts A and B where B has
a higher priority than A:
cli
interrupt A is raised in the APIC
tons of code
interrupt B is raised in the APIC
sti
handle(B)
handle(A)
or
cli
interrupt A is raised in the APIC
tons of code
sti
handle(A)
interrupt B is raised in the APIC
handle(B)
It's completely uninteresting which one is handled first. Otherwise this
'handle it directly' approach in VMX would not be correct at all.
The only valid argument here is performance and I'm not really convinced
that it actually matters given the amount of other nonsense which has to
be done on a VMEXIT nowadays.
The point is that the early handling only affects the actual response
time to the interrupt itself, but it does not affect the response time
to anything the interrupt might trigger which requires interrupt and/or
preemption enabled context:
VMENTER
-> Host interrupt
VMEXIT
handle_early()
irqentry_enter()
irq_enter();
handle();
irq_exit(); // Cannot handle soft interrupts because IF = 0
irqentry_exit(); // Cannot handle preemption because IF = 0
I understand that this is optimizing for the case where neither soft
interrupts nor preemption has to be handled, but all I have seen so far
is handwaving about the actual performance benefits. See below.
> VM_EXIT_ACK_INTR_ON_EXIT also provides symmetry with Intel's handling of NMIs, as
> NMIs are unconditionally "acked" on VM-Exit.
What's the exact point you are trying to make?
The symmetry is a cosmetic nice to have bullet point, but neither a
functional nor a correctness requirement. The fact that hardware people
provided something which looks "useful" at the first glance does not
make it so.
> Even if performance is "fine", changing decades of fundamental KVM behavior is
> terrifying.
It worked perfectly fine before this was introduced in commit
a547c6db4d2f ("KVM: VMX: Enable acknowledge interupt on vmexit") in 2013.
If you decrypt that commit message and read the patch then you'll notice
that back then this issue would not have happened at all because the
register frame had IF set.
This got changed by f2485b3e0c6c ("KVM: x86: use guest_exit_irqoff") in
June 2016 to save a completely unspecified amount of 'few cycles'.
So much for decades and for useful changelogs which actually prove that
something has a substantial benefit.
Given the amount of changes since then it would be really interesting to
see actual numbers for the benefit of VM_EXIT_ACK_INTR_ON_EXIT before we
end up with more KVM/VIRT specific oddities all over the place.
I'm more than mildly amused that you are terrified by the thought of
reverting back to something which is known _and_ guaranteed to work
while at the same time you are willing to accept any shortcut in the so
fundamental KVM behavior to gain a cycle for the price that everything
else has to adjust to the semantically broken view of KVM.
There is plenty of proof in the git history that KVM follows the
performance first, correctness later principle and I personally have
wasted a lot of _my_ precious time due to that since the day KVM was
shoved into the kernel, which was actually almost _two_ decades ago.
> Pulling in an earlier idea:
>
> : Now for VMX, that hrtimer_rearm_deferred() call should really go into
> : handle_external_interrupt_irqoff(), which in turn requires to export
> : __hrtimer_rearm_deferred().
>
> IMO, that's the way to go. But instead of exporting __hrtimer_rearm_deferred(),
> move vmx_do_nmi_irqoff() and vmx_do_interrupt_irqoff() into core kernel entry code
Surely not into core kernel entry code as this is x86 specific hackery.
> (along with the assembly glue), and then EXPORT_SYMBOL_FOR_KVM those. It'd mean
> some extra surgery, e.g. to provide an equivalent to KVM's IDT lookup:
>
> gate_offset((gate_desc *)host_idt_base + vector)
>
> But I suspect it would be a big net positive in the end. E.g. the entry code
> would *know* it's dealing with a direct call from KVM, and thus shouldn't need
> to play pt_regs games.
As this is x86 specific the generic entry code knows absolutely nothing
unless there is a magic indicator like PeterZ's hack or yet another
duplicated version of the irqentry_exit() code just to accommodate KVM
for handwaving reasons.
As Peter and I pointed out before, this will also not solve the
problem that due to that KVM won't be able to benefit from the recent
hrtimer/hrtick improvements on VMX(TDX) hosts.
To be entirely clear: We are not going to disable HRTICK for the benefit
of this dubious "decades old performance" hack.
> Actually, even better would be to bury the FRED vs. not-FRED details in entry
> code. E.g. on the KVM invocation side, we could get to something like the below,
> and I'm pretty sure _reduce_ the number of for-KVM exports in the
> process.
That's an orthogonal issue. The problem at hand is independent of FRED
or not-FRED as both end up providing a pt_regs frame with eflags.IF = 0.
For the short term fix, which is required no matter what, checking
irq_regs in hrtimer_interrupt_rearm() is not the worst solution as it
covers _all_ not yet unearthed issues which are nicely hidden in some
dusty corners of architecture specific KVM optimizations and will only
come out around 7.1-rc7 or later when people actually can be bothered to
test stuff...
I just booted a big machine with that patch applied. get_irq_regs() and
the regs_irqs_disabled() check are barely visible in perf because the
cache line is 99% of the time hot and as it is strictly per CPU there is
no contention at all. The only case where it shows up is when there is a
massive amount of hrtimers to expire at the same time with D-cache
consuming callbacks. But in that case the extra cache miss of
get_irq_regs() is just in the noise and not really relevant.
So far that deferred reprogram mechanism seems to be the only known
mechanism which relies on the irqentry_exit() pt_regs::flags::IF state
being correct, but in the long run that's not a sustainable solution.
You really want to come up with real numbers which prove the performance
benefit to justify the extra complexity of this.
Thanks,
tglx
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 21:49 ` Thomas Gleixner
@ 2026-04-21 22:07 ` Sean Christopherson
2026-04-21 22:24 ` Paolo Bonzini
1 sibling, 0 replies; 39+ messages in thread
From: Sean Christopherson @ 2026-04-21 22:07 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Jim Mattson, Peter Zijlstra, Binbin Wu, Vishal L Verma,
kvm@vger.kernel.org, Rick P Edgecombe, Binbin Wu, x86@kernel.org,
Paolo Bonzini
On Tue, Apr 21, 2026, Thomas Gleixner wrote:
> On Tue, Apr 21 2026 at 11:55, Sean Christopherson wrote:
> > On Tue, Apr 21, 2026, Thomas Gleixner wrote:
> >> >> Looks like. It will take the interrupt after local_irq_enable().
> > VM_EXIT_ACK_INTR_ON_EXIT also provides symmetry with Intel's handling of NMIs, as
> > NMIs are unconditionally "acked" on VM-Exit.
>
> What's the exact point you are trying to make?
That no matter what we do for IRQs, KVM needs a direct call into the kernel to
handle an asynchronous event that arrived in the past.
> The symmetry is a cosmetic nice to have bullet point, but neither a
> functional nor a correctness requirement. The fact that hardware people
> provided something which looks "useful" at the first glance does not
> make it so.
>
> > Even if performance is "fine", changing decades of fundamental KVM behavior is
> > terrifying.
>
> It worked perfectly fine before this was introduced in commit
> a547c6db4d2f ("KVM: VMX: Enable acknowledge interupt on vmexit") in 2013.
Yes, but that configuration hasn't been tested (by KVM) on any CPU released in
the last decade+. That's what scares me. Do I think it's at all likely that
there's a lurking ucode bug? No. But the risk vs. reward isn't there for me.
But as Paolo pointed out, the "killer" feature gated by ACK-on-exit is posted
interrupts, and _that_ provides a massive performance win.
> > IMO, that's the way to go. But instead of exporting __hrtimer_rearm_deferred(),
> > move vmx_do_nmi_irqoff() and vmx_do_interrupt_irqoff() into core kernel entry code
>
> Surely not into core kernel entry code as this is x86 specific hackery.
Oh come on. I have a hard time believing that you really truly thought that's
what I was suggesting.
> > (along with the assembly glue), and then EXPORT_SYMBOL_FOR_KVM those. It'd mean
> > some extra surgery, e.g. to provide an equivalent to KVM's IDT lookup:
> >
> > gate_offset((gate_desc *)host_idt_base + vector)
> >
> > But I suspect it would be a big net positive in the end.i E.g. the entry code
> > would *know* it's dealing with a direct call from KVM, and thus shouldn't need
> > to play pt_regs games.
>
> As this is x86 specific the generic entry code knows absolutely nothing
> unless there is a magic indicator like PeterZ's hack or yet another
> duplicated version of the irqentry_exit() code just to accomodate KVM
> for handwaving reasons.
>
> As Peter and myself pointed out before this will also not solve the
> problem that due to that KVM won't be able to benefit from the recent
> hrtimer/hrtick improvements on VMX(TDX) hosts.
Sorry, you lost me here. What's the TDX angle? Or are you just saying that VMX
is currently hosed with the deferred rearm?
> To be entirely clear: We are not going to disable HRTICK for the benefit
> of this dubious "decades old performance" hack.
No one suggested that.
> > Actually, even better would be to bury the FRED vs. not-FRED details in entry
> > code. E.g. on the KVM invocation side, we could get to something like the below,
> > and I'm pretty sure _reduce_ the number of for-KVM exports in the
> > process.
>
> That's an orthogonal issue. The problem at hand is independent of FRED
> or not-FRED as both end up providing a pt_regs frame with eflags.IF = 0.
Eh, not if it gives us a clean, maintainable solution for the problem.
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 21:49 ` Thomas Gleixner
2026-04-21 22:07 ` Sean Christopherson
@ 2026-04-21 22:24 ` Paolo Bonzini
1 sibling, 0 replies; 39+ messages in thread
From: Paolo Bonzini @ 2026-04-21 22:24 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Sean Christopherson, Jim Mattson, Peter Zijlstra, Binbin Wu,
Vishal L Verma, kvm@vger.kernel.org, Rick P Edgecombe, Binbin Wu,
x86@kernel.org, Paolo Bonzini
On Tue, Apr 21, 2026 at 10:49 PM Thomas Gleixner <tglx@kernel.org> wrote:
> This got changed by f2485b3e0c6c ("KVM: x86: use guest_exit_irqoff") in
> June 2016 to save an completely unspecified amount of 'few cycles'.
Sean answered in way too much detail :) but since I was about to do it
too: that's a local_irq_save/restore pair, so 20-50 cycles per vmexit.
Nowadays, due to the overhead from mitigating side channel attacks,
the cost of vmexits went up from ~1000 clock cycles to (depending on
your settings) even 2500-3000.
Back when posted interrupt support was added, "ack intr on exit" was
always used when available, but was optional. In 2016 I made it
mandatory. The micro-optimization at the time had a consistent 2-3%
gain on some benchmarks like UDP_RR, but also on realistic workloads
with lots of thread wakeups; the simplification was also a nice bonus,
but this part is almost negligible these days.
However the optimization that Sean remembered is only a small part of
the story. The real issue is that "ack intr on exit" is a requirement
for posted intr support. For each interrupt from assigned devices, it
saves a full vmexit/vmentry for the vCPU. Add the extra latency
from processing the interrupt, writing to the eventfd and triggering
the vmexit, and that's at least a couple of microseconds in total. That's two
orders of magnitude more than the "few cycles" that commit
f2485b3e0c6c added on top.
The reference to TDX is not clear to me either, but I am probably
missing something obvious because it's a bit late here.
Paolo
> You really want to come up with real numbers which prove the performance
> benefit to justify the extra complexity of this.
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 20:39 ` Paolo Bonzini
2026-04-21 21:02 ` Sean Christopherson
@ 2026-04-21 22:48 ` Thomas Gleixner
2026-04-21 23:15 ` Paolo Bonzini
1 sibling, 1 reply; 39+ messages in thread
From: Thomas Gleixner @ 2026-04-21 22:48 UTC (permalink / raw)
To: Paolo Bonzini, Sean Christopherson
Cc: Jim Mattson, Peter Zijlstra, Binbin Wu, Vishal L Verma, kvm,
Rick P Edgecombe, Binbin Wu, the arch/x86 maintainers,
Paolo Bonzini
On Tue, Apr 21 2026 at 21:39, Paolo Bonzini wrote:
> Il mar 21 apr 2026, 19:55 Sean Christopherson <seanjc@google.com> ha scritto:
>> Even if performance is "fine", changing decades of fundamental KVM behavior is
>> terrifying.
>
> ... it's not decades, ack on VM exit is actually relatively recent (10
> years out of 20 :)). The reason why it was introduced is another killer
> for the idea, though. Posted interrupts require it, for some reason
> only known to Intel.
That's yet another attempt to provide a halfway decent argument, but
like anything else related to KVM/VIRT it is neither documented nor proven
to be true.
Your claim is just another proof of the KVM handwaving approach, as the
change log which introduced this (commit a547c6db4d2f) does not mention
it at all. In fact posted interrupt support came _three_ years after
that and there is exactly _zero_ information about this dependency if I
read the relevant change logs correctly.
Especially commit f2485b3e0c6c ("KVM: x86: use guest_exit_irqoff") which
fundamentally changed the behavior from regs::flags::IF = 1 to
regs::flags::IF = 0 does not mention this at all:
KVM: x86: use guest_exit_irqoff
This gains a few clock cycles per vmexit. On Intel there is no need
anymore to enable the interrupts in vmx_handle_external_intr, since
we are using the "acknowledge interrupt on exit" feature. AMD
needs to do that, and must be careful to avoid the interrupt shadow.
Written and committed by a Paolo Bonzini dude. You might know him
perhaps.
Where exactly is the reference to posted interrupts and how did you or
someone else establish that this hack is a requirement for posted
interrupts in the first place?
Nice try. Try again.
Thanks,
tglx
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 22:48 ` Thomas Gleixner
@ 2026-04-21 23:15 ` Paolo Bonzini
2026-04-21 23:34 ` Jim Mattson
0 siblings, 1 reply; 39+ messages in thread
From: Paolo Bonzini @ 2026-04-21 23:15 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Sean Christopherson, Jim Mattson, Peter Zijlstra, Binbin Wu,
Vishal L Verma, kvm, Rick P Edgecombe, Binbin Wu,
the arch/x86 maintainers, Paolo Bonzini
On Tue, Apr 21, 2026 at 11:48 PM Thomas Gleixner <tglx@kernel.org> wrote:
> On Tue, Apr 21 2026 at 21:39, Paolo Bonzini wrote:
> > ... it's not decades, ack on VM exit is actually relatively recent (10
> > years out of 20 :)). The reason why it was introduced is another killer
> > for the idea, though. Posted interrupts require it, for some reason
> > only known to Intel.
>
> That's yet another attempt to provide an halfways decent argument, but
> as anything else related to KVM/VIRT it is neither documented nor proven
> to be true.
SDM, "27.2.1 Checks on VMX Controls", under "VM-Execution Control Fields":
If the “process posted interrupts” VM-execution control is 1, the
following must be true: [...]
— The “acknowledge interrupt on exit” VM-exit control is 1.
I grant you the "not documented" (not cross referenced to the manual
in commit 01e439be775) part, but "not proven to be true"? Neither I
nor my predecessors are *that* clueless.
And under "posted-interrupt processing" the following two steps are listed:
1. The local APIC is acknowledged; this provides the processor core
with an interrupt vector, called here the physical vector.
2. If the physical vector equals the posted-interrupt notification
vector, the logical processor continues to the next step. Otherwise, a
VM exit occurs as it would normally due to an external interrupt; the
vector is saved in the VM-exit interruption-information field.
So the reason is actually documented by the SDM: "ack intr on exit" is
needed in order to retrieve the interrupt vector, which is then
compared against the posted-interrupt notification vector. As far as
the APIC is concerned, the interrupt is gone; there's no way to "send
it back" and have it delivered after IF=1.
> Your claim is just another prove of the KVM handwaving approach as the
> change log which introduced this (commit a547c6db4d2f) does not mention
> it at all. In fact posted interrupt support came _three_ years after
> that and there is exactly _zero_ information about this dependency if I
> read the relevant change logs correctly.
No, posted interrupt support on the CPU (not in interrupt remapping)
arrived in the same series with commit 5a71785dde30 ("KVM: VMX: Use
posted interrupt to deliver virtual interrupt", 2013-04-16).
There were two different steps. Posted interrupt support on the CPU
was introduced in Ivy Bridge. Posted interrupt support on the IOMMU in
Broadwell (and added to KVM with commit efc644048ecd, "KVM: x86:
Update IRTE for posted-interrupts", 2015-10-01). New virt features
used to come on the "tock" instead of the "tick".
> Where exactly is the reference to posted interrupts and how did you or
> someone else establish that this hack is a requirement for posted
> interrupts in the first place?
Is reading the manual enough?
Paolo
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 23:15 ` Paolo Bonzini
@ 2026-04-21 23:34 ` Jim Mattson
2026-04-21 23:37 ` Paolo Bonzini
0 siblings, 1 reply; 39+ messages in thread
From: Jim Mattson @ 2026-04-21 23:34 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Thomas Gleixner, Sean Christopherson, Peter Zijlstra, Binbin Wu,
Vishal L Verma, kvm, Rick P Edgecombe, Binbin Wu,
the arch/x86 maintainers, Paolo Bonzini
On Tue, Apr 21, 2026 at 4:16 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> ...
> There were two different steps. Posted interrupt support on the CPU
> was introduced in Ivy Bridge. Posted interrupt support on the IOMMU in
> Broadwell (and added to KVM with commit efc644048ecd, "KVM: x86:
> Update IRTE for posted-interrupts", 2015-10-01). New virt features
> > used to come on the "tock" instead of the "tick".
Nit: I believe posted interrupts were only introduced in the higher
end Ivy Town parts, but I could be wrong. :)
* Re: CPU Lockups in KVM with deferred hrtimer rearming
2026-04-21 23:34 ` Jim Mattson
@ 2026-04-21 23:37 ` Paolo Bonzini
0 siblings, 0 replies; 39+ messages in thread
From: Paolo Bonzini @ 2026-04-21 23:37 UTC (permalink / raw)
To: Jim Mattson
Cc: Thomas Gleixner, Sean Christopherson, Peter Zijlstra, Binbin Wu,
Vishal L Verma, kvm, Rick P Edgecombe, Binbin Wu,
the arch/x86 maintainers, Paolo Bonzini
On Wed, Apr 22, 2026 at 12:34 AM Jim Mattson <jmattson@google.com> wrote:
>
> On Tue, Apr 21, 2026 at 4:16 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> > ...
> > There were two different steps. Posted interrupt support on the CPU
> > was introduced in Ivy Bridge. Posted interrupt support on the IOMMU in
> > Broadwell (and added to KVM with commit efc644048ecd, "KVM: x86:
> > Update IRTE for posted-interrupts", 2015-10-01). New virt features
> > used to come on the "tock" instead of the "tick").
>
> Nit: I believe posted interrupts were only introduced in the higher
> end Ivy Town parts, but I could be wrong. :)
Yes, they were fused out in Xeon E3 and below. I was only referring to
the processor generation ("v2" in the CPUID model id) - never heard
about "Town" until now.
Paolo
end of thread, other threads:[~2026-04-21 23:37 UTC | newest]
Thread overview: 39+ messages
2026-04-16 20:50 CPU Lockups in KVM with deferred hrtimer rearming Verma, Vishal L
2026-04-20 15:00 ` Thomas Gleixner
2026-04-20 15:22 ` Thomas Gleixner
2026-04-20 20:57 ` Verma, Vishal L
2026-04-20 22:19 ` Thomas Gleixner
2026-04-20 22:24 ` Verma, Vishal L
2026-04-21 6:29 ` Thomas Gleixner
2026-04-21 4:51 ` Binbin Wu
2026-04-21 7:39 ` Thomas Gleixner
2026-04-21 11:18 ` Peter Zijlstra
2026-04-21 11:32 ` Peter Zijlstra
2026-04-21 11:34 ` Peter Zijlstra
2026-04-21 11:49 ` Peter Zijlstra
2026-04-21 12:05 ` Peter Zijlstra
2026-04-21 13:19 ` Peter Zijlstra
2026-04-21 13:29 ` Peter Zijlstra
2026-04-21 16:36 ` Thomas Gleixner
2026-04-21 18:11 ` Verma, Vishal L
2026-04-21 17:11 ` Thomas Gleixner
2026-04-21 17:20 ` Jim Mattson
2026-04-21 18:29 ` Thomas Gleixner
2026-04-21 18:55 ` Sean Christopherson
2026-04-21 20:06 ` Peter Zijlstra
2026-04-21 20:46 ` Peter Zijlstra
2026-04-21 20:57 ` Sean Christopherson
2026-04-21 21:02 ` Peter Zijlstra
2026-04-21 21:42 ` Sean Christopherson
2026-04-21 20:39 ` Paolo Bonzini
2026-04-21 21:02 ` Sean Christopherson
2026-04-21 22:48 ` Thomas Gleixner
2026-04-21 23:15 ` Paolo Bonzini
2026-04-21 23:34 ` Jim Mattson
2026-04-21 23:37 ` Paolo Bonzini
2026-04-21 21:49 ` Thomas Gleixner
2026-04-21 22:07 ` Sean Christopherson
2026-04-21 22:24 ` Paolo Bonzini
2026-04-21 19:18 ` Verma, Vishal L
2026-04-21 16:30 ` Thomas Gleixner
2026-04-21 16:11 ` Verma, Vishal L