public inbox for kvm@vger.kernel.org
From: Thomas Gleixner <tglx@kernel.org>
To: Sean Christopherson <seanjc@google.com>
Cc: Jim Mattson <jmattson@google.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Binbin Wu <binbin.wu@linux.intel.com>,
	Vishal L Verma <vishal.l.verma@intel.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	Rick P Edgecombe <rick.p.edgecombe@intel.com>,
	Binbin Wu <binbin.wu@intel.com>,
	"x86@kernel.org" <x86@kernel.org>,
	Paolo Bonzini <bonzini@redhat.com>
Subject: Re: CPU Lockups in KVM with deferred hrtimer rearming
Date: Tue, 21 Apr 2026 23:49:15 +0200	[thread overview]
Message-ID: <87zf2w9e78.ffs@tglx> (raw)
In-Reply-To: <aefIJR_FcEeP-fcS@google.com>

On Tue, Apr 21 2026 at 11:55, Sean Christopherson wrote:
> On Tue, Apr 21, 2026, Thomas Gleixner wrote:
>> >> Looks like. It will take the interrupt after local_irq_enable().
>> >
>> > FWIW, VMX should work just like SVM if we clear VM_EXIT_ACK_INTR_ON_EXIT.
>
> Hell no.

I knew for sure that someone from the KVM camp would cry murder :)

>> I know. What's the point of that VM_EXIT_ACK_INTR_ON_EXIT exercise? Is
>> there any performance benefit or is it just used because it's there?
>
> There are performance benefits, and it preserves ordering: the first IRQ that's
> serviced by the host is guaranteed to be _the_ IRQ that triggered the VM-Exit.
> E.g. with AMD's approach, any IRQs that arrive between the VM-Exit and STI (which
> is a pretty big swath of code) could be serviced before the IRQ that triggered
> the exit, depending on priority.

I might eventually buy the performance benefit, but the ordering is not
interesting at all. That's a pure virt-cult fallacy to believe that it
matters. Why?

Look at this bare metal scenario with two interrupts A and B where B has
a higher priority than A:

     cli
     interrupt A is raised in the APIC
     tons of code
     interrupt B is raised in the APIC
     sti
     handle(B)
     handle(A)

or
     cli
     interrupt A is raised in the APIC
     tons of code
     sti
     handle(A)
        interrupt B is raised in the APIC
     handle(B)

It's completely uninteresting which one is handled first. Otherwise this
'handle it directly' approach in VMX would not be correct at all.

The only valid argument here is performance and I'm not really convinced
that it actually matters given the amount of other nonsense which has to
be done on a VMEXIT nowadays.

The point is that the early handling only affects the actual response
time to the interrupt itself, but it does not affect the response time
to anything the interrupt might trigger which requires interrupt and/or
preemption enabled context:

      VMENTER
 -> Host interrupt
      VMEXIT
      handle_early()
        irqentry_enter()
          irq_enter();
          handle();

          irq_exit();        // Cannot handle soft interrupts because IF = 0

        irqentry_exit();     // Cannot handle preemption because IF = 0
 
I understand that this is optimizing for the case where neither soft
interrupts nor preemption has to be handled, but all I have seen so far
is handwaving about the actual performance benefits. See below.

> VM_EXIT_ACK_INTR_ON_EXIT also provides symmetry with Intel's handling of NMIs, as
> NMIs are unconditionally "acked" on VM-Exit.

What's the exact point you are trying to make?

The symmetry is a cosmetic nice-to-have bullet point, but neither a
functional nor a correctness requirement. The fact that hardware people
provided something which looks "useful" at the first glance does not
make it so.

> Even if performance is "fine", changing decades of fundamental KVM behavior is
> terrifying.

It worked perfectly fine before this was introduced in commit
a547c6db4d2f ("KVM: VMX: Enable acknowledge interupt on vmexit") in 2013.

If you decrypt that commit message and read the patch then you'll notice
that back then this issue would not have happened at all because the
register frame had IF set.

This got changed by f2485b3e0c6c ("KVM: x86: use guest_exit_irqoff") in
June 2016 to save a completely unspecified 'few cycles'.

So much for decades and for useful changelogs which actually prove that
something has a substantial benefit.

Given the amount of changes since then it would be really interesting to
see actual numbers for the benefit of VM_EXIT_ACK_INTR_ON_EXIT before we
end up with more KVM/VIRT specific oddities all over the place.

I'm more than mildly amused that you are terrified by the thought of
reverting back to something which is known _and_ guaranteed to work,
while at the same time you are willing to accept any shortcut in that
oh so fundamental KVM behavior to gain a cycle, at the price that
everything else has to adjust to KVM's semantically broken view.

There is plenty of proof in the git history that KVM follows the
performance first, correctness later principle and I personally have
wasted a lot of _my_ precious time due to that since the day KVM was
shoved into the kernel, which was actually almost _two_ decades ago.

> Pulling in an earlier idea:
>
>  : Now for VMX, that hrtimer_rearm_deferred() call should really go into
>  : handle_external_interrupt_irqoff(), which in turn requires to export
>  : __hrtimer_rearm_deferred().
>
> IMO, that's the way to go.  But instead of exporting __hrtimer_rearm_deferred(),
> move vmx_do_nmi_irqoff() and vmx_do_interrupt_irqoff() into core kernel entry code

Surely not into core kernel entry code as this is x86 specific hackery.

> (along with the assembly glue), and then EXPORT_SYMBOL_FOR_KVM those.  It'd mean
> some extra surgery, e.g. to provide an equivalent to KVM's IDT lookup:
>
> 	gate_offset((gate_desc *)host_idt_base + vector)
>
> But I suspect it would be a big net positive in the end.  E.g. the entry code
> would *know* it's dealing with a direct call from KVM, and thus shouldn't need
> to play pt_regs games.

As this is x86 specific the generic entry code knows absolutely nothing
unless there is a magic indicator like PeterZ's hack or yet another
duplicated version of the irqentry_exit() code just to accommodate KVM
for handwaving reasons.

As Peter and I pointed out before, this will also not solve the problem
that KVM won't be able to benefit from the recent hrtimer/hrtick
improvements on VMX(TDX) hosts.

To be entirely clear: We are not going to disable HRTICK for the benefit
of this dubious "decades old performance" hack.

> Actually, even better would be to bury the FRED vs. not-FRED details in entry
> code.  E.g. on the KVM invocation side, we could get to something like the below,
> and I'm pretty sure _reduce_ the number of for-KVM exports in the
> process.

That's an orthogonal issue. The problem at hand is independent of FRED
or not-FRED as both end up providing a pt_regs frame with eflags.IF = 0.

For the short term fix, which is required no matter what, checking
irq_regs in hrtimer_interrupt_rearm() is not the worst solution as it
covers _all_ not yet unearthed issues which are nicely hidden in some
dusty corners of architecture specific KVM optimizations and will only
come out around 7.1-rc7 or later when people actually can be bothered to
test stuff...

I just booted a big machine with that patch applied. get_irq_regs() and
the regs_irqs_disabled() check are barely visible in perf because the
cache line is 99% of the time hot and as it is strictly per CPU there is
no contention at all. The only case where it shows up is when there is a
massive amount of hrtimers to expire at the same time with D-cache
consuming callbacks. But in that case the extra cache miss of
get_irq_regs() is just in the noise and not really relevant.

So far that deferred reprogram machinery seems to be the only known
mechanism which relies on the irqentry_exit() pt_regs::flags::IF state
being correct, but in the long run that's not a sustainable solution.

You really want to come up with real numbers which prove the performance
benefit to justify the extra complexity of this.

Thanks,

        tglx
