From: Eric Mackay <eric.mackay@oracle.com>
To: boris.ostrovsky@oracle.com
Cc: eric.mackay@oracle.com, imammedo@redhat.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, pbonzini@redhat.com,
seanjc@google.com
Subject: Re: [PATCH] KVM/x86: Do not clear SIPI while in SMM
Date: Thu, 26 Sep 2024 18:22:39 -0700 [thread overview]
Message-ID: <20240927012239.34406-1-eric.mackay@oracle.com> (raw)
In-Reply-To: <4274f9be-1c3d-4246-abe9-69c4d8ca8964@oracle.com>
> On 9/24/24 5:40 AM, Igor Mammedov wrote:
>> On Fri, 19 Apr 2024 12:17:01 -0400
>> boris.ostrovsky@oracle.com wrote:
>>
>>> On 4/17/24 9:58 AM, boris.ostrovsky@oracle.com wrote:
>>>>
>>>> I noticed that I was using a few months old qemu bits and now I am
>>>> having trouble reproducing this on latest bits. Let me see if I can get
>>>> this to fail with latest first and then try to trace why the processor
>>>> is in this unexpected state.
>>>
>>> Looks like 012b170173bc "system/qdev-monitor: move drain_call_rcu call
>>> under if (!dev) in qmp_device_add()" is what makes the test stop failing.
>>>
>>> I need to understand whether lack of failures is a side effect of timing
>>> changes that simply make hotplug fail less likely or if this is an
>>> actual (but seemingly unintentional) fix.
>>
>> Agreed, we should find out the culprit of the problem.
>
>
> I haven't been able to spend much time on this unfortunately, Eric is
> now starting to look at this again.
>
> One of my theories was that ich9_apm_ctrl_changed() is sending SMIs to
> vcpus serially, while on HW my understanding is that this is done as a
> broadcast, so I thought this could cause a race. I had a quick test
> pausing and resuming all vcpus around the loop, but that didn't help.
>
>
>>
>> PS:
>> also if you are using an AMD host, there was a regression in OVMF
>> where a vCPU that OSPM was already onlining was yanked out from
>> under OSPM's feet by OVMF (which, depending on timing, could
>> manifest as a lost SIPI).
>>
>> edk2 commit that should fix it is:
>> https://github.com/tianocore/edk2/commit/1c19ccd5103b
>>
>> Switching to Intel host should rule that out at least.
>> (or use fixed edk2-ovmf-20240524-5.el10.noarch package from centos,
>> if you are forced to use AMD host)
I haven't been able to reproduce the issue on an Intel host thus far,
but it may not be an apples-to-apples comparison because my AMD hosts
have a much higher core count.
>
> I just tried with latest bits that include this commit and still was
> able to reproduce the problem.
>
>
>-boris
The initial hotplug of each CPU appears to complete from the
perspective of OVMF and OSPM. SMBASE relocation succeeds, and the new
CPU reports back from the pen. It seems to be the later INIT-SIPI-SIPI
sequence sent from the guest that doesn't complete.
My working theory has been that some AP is lagging behind the others
while the BSP is waiting for all the APs to enter SMM, and the BSP just
gives up and moves on. Presumably the INIT-SIPI-SIPI is sent during the
window where that lagging CPU has finally entered SMM while the other
CPUs are back in normal mode.
I've been able to observe that the SMI handler for the problematic CPU
will sometimes start running when no BSP is elected. This means we have
a window of time where that CPU will ignore SIPI, and at least 1 CPU
(the BSP) is in normal mode and capable of sending INIT-SIPI-SIPI from
the guest.
Thread overview: 18+ messages
2024-04-16 20:47 [PATCH] KVM/x86: Do not clear SIPI while in SMM Boris Ostrovsky
2024-04-16 20:53 ` Paolo Bonzini
2024-04-16 20:57 ` boris.ostrovsky
2024-04-16 22:03 ` Paolo Bonzini
2024-04-16 22:14 ` Sean Christopherson
2024-04-16 23:02 ` boris.ostrovsky
2024-04-16 22:56 ` boris.ostrovsky
2024-04-16 23:17 ` Sean Christopherson
2024-04-16 23:37 ` boris.ostrovsky
2024-04-17 12:40 ` Igor Mammedov
2024-04-17 13:58 ` boris.ostrovsky
2024-04-19 16:17 ` boris.ostrovsky
2024-09-24 9:40 ` Igor Mammedov
2024-09-24 21:59 ` boris.ostrovsky
2024-09-27 1:22 ` Eric Mackay [this message]
2024-09-27 9:28 ` Igor Mammedov
2024-09-30 23:34 ` Eric Mackay
2024-10-01 8:18 ` Igor Mammedov