public inbox for kvm@vger.kernel.org
* Re: About patch bdedff263132 - KVM: x86: Route pending NMIs
       [not found] <CAE8KmOw1DzOr-GvQ9E+Y5RCX1GQ1h1Bumk5pB++9=SjMUPHxBg@mail.gmail.com>
@ 2023-10-30 15:10 ` Sean Christopherson
  2023-10-31 12:15   ` Prasad Pandit
  0 siblings, 1 reply; 5+ messages in thread
From: Sean Christopherson @ 2023-10-30 15:10 UTC (permalink / raw)
  To: Prasad Pandit; +Cc: kvm, linux-kernel

+KVM and LKML

https://people.kernel.org/tglx/notes-about-netiquette

On Mon, Oct 30, 2023, Prasad Pandit wrote:
> Hello Sean,
> 
> Please see:
>     -> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bdedff263132c862924f5cad96f0e82eeeb4e2e6
> 
> * While testing a real-time host/guest setup, the above patch is
> causing a strange regression wherein guest boot is delayed indefinitely.
> Sometimes it boots within a minute, sometimes it takes much
> longer. Maybe the guest VM is waiting for an NMI event.
> 
> * Reverting the above patch fixes this issue. I'm wondering whether
> a fix patch like the one below would be acceptable, or whether reverting
> the above patch is more reasonable?

No, a revert would break AMD's vNMI.

> ===
> # cat ~test/rpmbuild/SOURCES/linux-kernel-test.patch
> +++ linux-5.14.0-372.el9/arch/x86/kvm/x86.c     2023-10-30 09:05:05.172815973 -0400
> @@ -5277,7 +5277,8 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_e
>         if (events->flags & KVM_VCPUEVENT_VALID_NMI_PENDING) {
>                 vcpu->arch.nmi_pending = 0;
>                 atomic_set(&vcpu->arch.nmi_queued, events->nmi.pending);
> -               kvm_make_request(KVM_REQ_NMI, vcpu);
> +               if (events->nmi.pending)
> +                       kvm_make_request(KVM_REQ_NMI, vcpu);

This looks sane, but it should be unnecessary, as KVM_REQ_NMI with nmi_queued=0
should be a (costly) nop.  Hrm, unless the vCPU is in HLT, in which case KVM will treat
a spurious KVM_REQ_NMI as a wake event.  When I made this change, my assumption
was that userspace would set KVM_VCPUEVENT_VALID_NMI_PENDING iff there was
relevant information to process.  But if I'm reading the code correctly, QEMU
invokes KVM_SET_VCPU_EVENTS with KVM_VCPUEVENT_VALID_NMI_PENDING at the end of
machine creation.

Hmm, but even that should be benign unless userspace is stuffing other guest
state.  E.g. KVM will spuriously exit to userspace with -EAGAIN while the vCPU
is in KVM_MP_STATE_UNINITIALIZED, and I don't see a way for the vCPU to be put
into a blocking state after transitioning out of UNINITIALIZED via INIT+SIPI without
processing KVM_REQ_NMI.

>         }
>         static_call(kvm_x86_set_nmi_mask)(vcpu, events->nmi.masked);
> ===
> 
> * Could you please have a look and suggest what could be a better fix?

Please provide more information on what is breaking and/or how to reproduce the
issue.  E.g. at the very least, a trace of KVM_{G,S}ET_VCPU_EVENTS.   There's not
even enough info here to write a changelog.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: About patch bdedff263132 - KVM: x86: Route pending NMIs
  2023-10-30 15:10 ` About patch bdedff263132 - KVM: x86: Route pending NMIs Sean Christopherson
@ 2023-10-31 12:15   ` Prasad Pandit
       [not found]     ` <CAE8KmOxd-Xib+qfiiBepP-ydjSAn32gjOTdLLUqm-i5vgzTv8w@mail.gmail.com>
  0 siblings, 1 reply; 5+ messages in thread
From: Prasad Pandit @ 2023-10-31 12:15 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm, linux-kernel

Hello Sean,

On Mon, 30 Oct 2023 at 20:41, Sean Christopherson <seanjc@google.com> wrote:
>> if a fix patch like below would be acceptable OR reverting above patch is
>> more reasonable?
>
> No, a revert would break AMD's vNMI.

* Okay, that confirmation helps.

>> -               kvm_make_request(KVM_REQ_NMI, vcpu);
>> +               if (events->nmi.pending)
>> +                       kvm_make_request(KVM_REQ_NMI, vcpu);
>
> This looks sane, but it should be unnecessary as KVM_REQ_NMI nmi_queued=0 should
> be a (costly) nop.  Hrm, unless the vCPU is in HLT, in which case KVM will treat
> a spurious KVM_REQ_NMI as a wake event.  When I made this change, my assumption
> was that userspace would set KVM_VCPUEVENT_VALID_NMI_PENDING iff there was
> relevant information to process.  But if I'm reading the code correctly, QEMU
> invokes KVM_SET_VCPU_EVENTS with KVM_VCPUEVENT_VALID_NMI_PENDING at the end of
> machine creation.
>
> Hmm, but even that should be benign unless userspace is stuffing other guest
> state.  E.g. KVM will spuriously exit to userspace with -EAGAIN while the vCPU
> is in KVM_MP_STATE_UNINITIALIZED, and I don't see a way for the vCPU to be put
> into a blocking state after transitioning out of UNINITIALIZED via INIT+SIPI without
> processing KVM_REQ_NMI.
>
> Please provide more information on what is breaking and/or how to reproduce the
> issue.  E.g. at the very least, a trace of KVM_{G,S}ET_VCPU_EVENTS.   There's not
> even enough info here to write a changelog.
>

* I see, I'll try to understand in more detail what's really
happening and will get back asap.

Thank you.
---
  - Prasad


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Fwd: About patch bdedff263132 - KVM: x86: Route pending NMIs
       [not found]     ` <CAE8KmOxd-Xib+qfiiBepP-ydjSAn32gjOTdLLUqm-i5vgzTv8w@mail.gmail.com>
@ 2023-12-29 17:58       ` Prasad Pandit
  2024-01-02 23:00         ` Sean Christopherson
  0 siblings, 1 reply; 5+ messages in thread
From: Prasad Pandit @ 2023-12-29 17:58 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, Sean Christopherson

Hello Sean,

On Tue, 31 Oct 2023 at 17:45, Prasad Pandit <ppandit@redhat.com> wrote:
> On Mon, 30 Oct 2023 at 20:41, Sean Christopherson <seanjc@google.com> wrote:
>>> -               kvm_make_request(KVM_REQ_NMI, vcpu);
>>> +               if (events->nmi.pending)
>>> +                       kvm_make_request(KVM_REQ_NMI, vcpu);
> >
> > This looks sane, but it should be unnecessary as KVM_REQ_NMI nmi_queued=0 should
> > be a (costly) nop.  Hrm, unless the vCPU is in HLT, in which case KVM will treat
> > a spurious KVM_REQ_NMI as a wake event.  When I made this change, my assumption
> > was that userspace would set KVM_VCPUEVENT_VALID_NMI_PENDING iff there was
> > relevant information to process.  But if I'm reading the code correctly, QEMU
> > invokes KVM_SET_VCPU_EVENTS with KVM_VCPUEVENT_VALID_NMI_PENDING at the end of
> > machine creation.
> >

QEMU:
qemu_thread_start
 kvm_start_vcpu_thread
  kvm_vcpu_thread_fn
   kvm_cpu_exec
    kvm_arch_put_registers
     kvm_put_vcpu_events (cpu=..., level=1)

qemu_thread_start (args=0x559fdc852110) at ../util/qemu-thread-posix.c:534
 kvm_vcpu_thread_fn (arg=0x559fdc84cdc0) at ../accel/kvm/kvm-accel-ops.c:56
  qemu_wait_io_event (cpu=0x559fdc84cdc0) at ../softmmu/cpus.c:435
   qemu_wait_io_event_common (cpu=0x559fdc84cdc0) at ../softmmu/cpus.c:411
    process_queued_cpu_work (cpu=0x559fdc84cdc0) at ../cpus-common.c:351
     do_kvm_cpu_synchronize_post_reset (cpu=0x559fdc84cdc0, arg=...) at ../accel/kvm/kvm-all.c:2808
      kvm_arch_put_registers (cpu=0x559fdc84cdc0, level=2) at ../target/i386/kvm/kvm.c:4664
       kvm_put_vcpu_events (cpu=0x559fdc84cdc0, level=2) at ../target/i386/kvm/kvm.c:4308

qemu_thread_start (args=0x559fdc852110) at ../util/qemu-thread-posix.c:534
 kvm_vcpu_thread_fn (arg=0x559fdc84cdc0) at ../accel/kvm/kvm-accel-ops.c:56
  qemu_wait_io_event (cpu=0x559fdc84cdc0) at ../softmmu/cpus.c:435
   qemu_wait_io_event_common (cpu=0x559fdc84cdc0) at ../softmmu/cpus.c:411
    process_queued_cpu_work (cpu=0x559fdc84cdc0) at ../cpus-common.c:351
     do_kvm_cpu_synchronize_post_init (cpu=0x559fdc84cdc0, arg=...) at ../accel/kvm/kvm-all.c:2819
      kvm_arch_put_registers (cpu=0x559fdc84cdc0, level=3) at ../target/i386/kvm/kvm.c:4664
       kvm_put_vcpu_events (cpu=0x559fdc84cdc0, level=3) at ../target/i386/kvm/kvm.c:4308

Kernel:
  kvm_vcpu_ioctl
   mutex_lock_killable(&vcpu->mutex)
    kvm_arch_vcpu_ioctl(, KVM_SET_VCPU_EVENTS, ...)
     -> kvm_vcpu_ioctl_x86_set_vcpu_events()
   mutex_unlock(&vcpu->mutex)

* Above are 3 different ways in which KVM_SET_VCPU_EVENTS ioctl(2) gets called.
        QEMU/target/i386/kvm/kvm.c: kvm_put_vcpu_events()
         if (level >= KVM_PUT_RESET_STATE) {
             events.flags |= KVM_VCPUEVENT_VALID_NMI_PENDING;
  But KVM_VCPUEVENT_VALID_NMI_PENDING is set only when level >= 2
(KVM_PUT_RESET_STATE), i.e. in the first (level=1) case _NMI_PENDING
is not set.

* In the real-time host set-up I have, KVM_SET_VCPU_EVENTS with
KVM_VCPUEVENT_VALID_NMI_PENDING is issued twice for each vCPU, and after
that the kernel goes into what looks like a lock contention loop. Each
time it is issued with 'cpu->env->nmi_injected = 0' and
'cpu->env->nmi_pending = 0', i.e. for each vCPU two spurious NMI requests
are raised via kvm_make_request(KVM_REQ_NMI, vcpu) when the vCPU has no
NMIs pending.

# perf lock report -t
                Name   acquired  contended     avg wait   total wait   max wait     min wait

           CPU 3/KVM     154017     154017     62.19 us      9.58 s   101.01 us      1.49 us
           CPU 9/KVM     152796     152796     62.67 us      9.58 s    95.92 us      1.49 us
           CPU 7/KVM     151554     151554     63.16 us      9.57 s   102.70 us      1.48 us
           CPU 1/KVM     151273     151273     65.30 us      9.88 s    98.88 us      1.52 us
           CPU 6/KVM     151107     151107     63.34 us      9.57 s   107.64 us      1.50 us
           CPU 8/KVM     151038     151038     63.37 us      9.57 s   102.93 us      1.51 us
           CPU 2/KVM     150701     150701     63.52 us      9.57 s    99.24 us      1.50 us
           CPU 5/KVM     150695     150695     63.56 us      9.58 s   142.15 us      1.50 us
           CPU 4/KVM     150527     150527     63.60 us      9.57 s   102.04 us      1.44 us
     qemu-system-x86        665        665     65.92 us     43.84 ms  100.67 us      1.55 us
           CPU 0/KVM          2          2    210.46 us    420.92 us  411.89 us      9.03 us
     qemu-system-x86          1          1    404.91 us    404.91 us  404.91 us    404.91 us
        TC tc-pc.ram          1          1    414.22 us    414.22 us  414.22 us    414.22 us
  === output for debug===
bad: 10, total: 13
bad rate: 76.92 %
histogram of events caused bad sequence
    acquire: 0
   acquired: 10
  contended: 0
    release: 0


* The vCPU#0 thread seems to wait indefinitely to acquire
qemu_mutex_iothread_lock() and make any progress. The proposed patch
above, checking 'events->nmi.pending' for a non-zero value, helps to fix
this issue.

...wdyt?

Thank you.
---
  - Prasad
PS: The kvm_make_request() routine has the following comment; I wonder if
this is what is happening with empty NMI events.
         Request that don't require vCPU action should never be logged in
         vcpu->requests.  The vCPU won't clear the request, so it will stay
         logged indefinitely and prevent the vCPU from entering the guest.


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Fwd: About patch bdedff263132 - KVM: x86: Route pending NMIs
  2023-12-29 17:58       ` Fwd: " Prasad Pandit
@ 2024-01-02 23:00         ` Sean Christopherson
  2024-01-03  8:04           ` Prasad Pandit
  0 siblings, 1 reply; 5+ messages in thread
From: Sean Christopherson @ 2024-01-02 23:00 UTC (permalink / raw)
  To: Prasad Pandit; +Cc: kvm, linux-kernel

On Fri, Dec 29, 2023, Prasad Pandit wrote:
> Hello Sean,
> 
> On Tue, 31 Oct 2023 at 17:45, Prasad Pandit <ppandit@redhat.com> wrote:
> > On Mon, 30 Oct 2023 at 20:41, Sean Christopherson <seanjc@google.com> wrote:
> >>> -               kvm_make_request(KVM_REQ_NMI, vcpu);
> >>> +               if (events->nmi.pending)
> >>> +                       kvm_make_request(KVM_REQ_NMI, vcpu);
> > >
> > > This looks sane, but it should be unnecessary as KVM_REQ_NMI nmi_queued=0 should
> > > be a (costly) nop.  Hrm, unless the vCPU is in HLT, in which case KVM will treat
> > > a spurious KVM_REQ_NMI as a wake event.  When I made this change, my assumption
> > > was that userspace would set KVM_VCPUEVENT_VALID_NMI_PENDING iff there was
> > > relevant information to process.  But if I'm reading the code correctly, QEMU
> > > invokes KVM_SET_VCPU_EVENTS with KVM_VCPUEVENT_VALID_NMI_PENDING at the end of
> > > machine creation.
> > >

...

> * Above are 3 different ways in which KVM_SET_VCPU_EVENTS ioctl(2) gets called.
>         QEMU/target/i386/kvm/kvm.c: kvm_put_vcpu_events()
>          if (level >= KVM_PUT_RESET_STATE) {
>              events.flags |= KVM_VCPUEVENT_VALID_NMI_PENDING;
>   But KVM_VCPUEVENT_VALID_NMI_PENDING is set only when level >=
> 2(KVM_PUT_RESET_STATE). ie. in the first (level=1) case _NMI_PENDING
> is not set.
> 
> * In the real-time host set-up I have, KVM_VCPUEVENT_VALID_NMI_PENDING
> is called twice for each VCPU and after that kernel goes into what
> looks like a lock contention loop. Each time
> KVM_VCPUEVENT_VALID_NMI_PENDING is called with 'cpu->env->nmi_injected
> = 0' and  'cpu->env->nmi_pending = 0'.  ie. for each VCPU two NMI
> events are injected via - kvm_make_request(KVM_REQ_NMI, vcpu), when
> vcpu has no NMIs pending.
> 
> # perf lock report -t
>                 Name   acquired  contended     avg wait   total wait   max wait     min wait
> 
>            CPU 3/KVM     154017     154017     62.19 us      9.58 s   101.01 us      1.49 us
>            CPU 9/KVM     152796     152796     62.67 us      9.58 s    95.92 us      1.49 us
>            CPU 7/KVM     151554     151554     63.16 us      9.57 s   102.70 us      1.48 us
>            CPU 1/KVM     151273     151273     65.30 us      9.88 s    98.88 us      1.52 us
>            CPU 6/KVM     151107     151107     63.34 us      9.57 s   107.64 us      1.50 us
>            CPU 8/KVM     151038     151038     63.37 us      9.57 s   102.93 us      1.51 us
>            CPU 2/KVM     150701     150701     63.52 us      9.57 s    99.24 us      1.50 us
>            CPU 5/KVM     150695     150695     63.56 us      9.58 s   142.15 us      1.50 us
>            CPU 4/KVM     150527     150527     63.60 us      9.57 s   102.04 us      1.44 us
>      qemu-system-x86        665        665     65.92 us     43.84 ms  100.67 us      1.55 us
>            CPU 0/KVM          2          2    210.46 us    420.92 us  411.89 us      9.03 us
>      qemu-system-x86          1          1    404.91 us    404.91 us  404.91 us    404.91 us
>         TC tc-pc.ram          1          1    414.22 us    414.22 us  414.22 us    414.22 us
>   === output for debug===
> bad: 10, total: 13
> bad rate: 76.92 %
> histogram of events caused bad sequence
>     acquire: 0
>    acquired: 10
>   contended: 0
>     release: 0
> 
> 
> * VCPU#0 thread seems to wait indefinitely to get qemu_mutex_iothread_lock() to
> make any progress.

Heh, I don't know that I would describe "412 microseconds" as "indefinitely", but
it's certainly a long time, especially during boot.

> The proposed patch above to check 'events->nmi.pending' for a non-zero value
> helps to fix this issue.
> 
> ...wdyt?

Piecing things together, the issue is I was wrong about the -EAGAIN exit being
benign.

 : Hmm, but even that should be benign unless userspace is stuffing other guest
 : state.  E.g. KVM will spuriously exit to userspace with -EAGAIN while the vCPU
 : is in KVM_MP_STATE_UNINITIALIZED, and I don't see a way for the vCPU to be put
 : into a blocking state after transitioning out of UNINITIALIZED via INIT+SIPI without
 : processing KVM_REQ_NMI.

QEMU responds to the spurious exit by bailing from the vCPU's inner runloop, and
when that happens, the associated task (briefly) acquires a global mutex, the
so-called BQL (Big QEMU Lock).  I assumed that QEMU would eat the -EAGAIN and do
nothing interesting, but QEMU interprets the -EAGAIN as "there might be a global
state change the vCPU needs to handle".

As you discovered, having 9 vCPUs constantly acquiring and releasing a single
mutex makes for slow going when vCPU0 needs to acquire said mutex, e.g. to do
emulated MMIO.

Ah, and the other wrinkle is that KVM won't actually yield during KVM_RUN for
UNINITIALIZED vCPUs, i.e. all those vCPU tasks will stay at 100% utilization even
though there's nothing for them to do.  That may or may not matter in your case,
but it would be awful behavior in a setup with oversubscribed vCPUs.

I'm not 100% confident there isn't something else going on, e.g. a 400+ microsecond
wait time is a little odd, but this is inarguably a KVM regression and I doubt
it's worth anyone's time to dig deeper.

Can you give me a Signed-off-by for this?  I'll write a changelog and post a
proper patch.

# cat ~test/rpmbuild/SOURCES/linux-kernel-test.patch
+++ linux-5.14.0-372.el9/arch/x86/kvm/x86.c     2023-10-30 09:05:05.172815973 -0400
@@ -5277,7 +5277,8 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_e
        if (events->flags & KVM_VCPUEVENT_VALID_NMI_PENDING) {
                vcpu->arch.nmi_pending = 0;
                atomic_set(&vcpu->arch.nmi_queued, events->nmi.pending);
-               kvm_make_request(KVM_REQ_NMI, vcpu);
+               if (events->nmi.pending)
+                       kvm_make_request(KVM_REQ_NMI, vcpu);
        }
        static_call(kvm_x86_set_nmi_mask)(vcpu, events->nmi.masked);

> PS: The kvm_make_request() routine has following comment, I wonder if
> this is what is happening with empty NMI events.
>          Request that don't require vCPU action should never be logged in
>          vcpu->requests.  The vCPU won't clear the request, so it will stay
>          logged indefinitely and prevent the vCPU from entering the guest.

Yeah, that's kinda sorta what's happening, although that comment is about requests
that are never cleared in *any* path, e.g. violation of that rule causes a vCPU
to be 100% stuck.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Fwd: About patch bdedff263132 - KVM: x86: Route pending NMIs
  2024-01-02 23:00         ` Sean Christopherson
@ 2024-01-03  8:04           ` Prasad Pandit
  0 siblings, 0 replies; 5+ messages in thread
From: Prasad Pandit @ 2024-01-03  8:04 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm, linux-kernel

Hello Sean,

On Wed, 3 Jan 2024 at 04:30, Sean Christopherson <seanjc@google.com> wrote:
> Heh, I don't know that I would describe "412 microseconds" as "indefinitely", but
> it's certainly a long time, especially during boot.

* Indefinitely, because it never comes out of it. I left the guest
overnight and it still did not boot.

> Piecing things together, the issue is I was wrong about the -EAGAIN exit being
> benign.
>
> QEMU responds to the spurious exit by bailing from the vCPU's inner runloop, and
> when that happens, the associated task (briefly) acquires a global mutex, the
> so called BQL (Big QEMU Lock).  I assumed that QEMU would eat the -EAGAIN and do
> nothing interesting, but QEMU interprets the -EAGAIN as "there might be a global
> state change the vCPU needs to handle".
>
> As you discovered, having 9 vCPUs constantly acquiring and releasing a single
> mutex makes for slow going when vCPU0 needs to acquire said mutex, e.g. to do
> emulated MMIO.
>
> Ah, and the other wrinkle is that KVM won't actually yield during KVM_RUN for
> UNINITIALIZED vCPUs, i.e. all those vCPU tasks will stay at 100% utilization even
> though there's nothing for them to do.  That may or may not matter in your case,
> but it would be awful behavior in a setup with oversubscribed vCPUs.
...
> Yeah, that's kinda sorta what's happening, although that comment is about requests
> that are never cleared in *any* path, e.g. violation of that rule causes a vCPU
> to be 100% stuck.

* I see, interesting.

> I'm not 100% confident there isn't something else going on, e.g. a 400+ microsecond
> wait time is a little odd,

* It could be the vCPU thread's scheduling priority/policy.

> but this is inarguably a KVM regression and I doubt it's worth anyone's time to dig deeper.
> Can you give me a Signed-off-by for this?  I'll write a changelog and post a proper patch.

* I have sent a formal patch to you. Please feel free to edit the
commit/change log as you see fit. Thanks so much.

Thank you.
---
  - Prasad


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2024-01-03  8:04 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <CAE8KmOw1DzOr-GvQ9E+Y5RCX1GQ1h1Bumk5pB++9=SjMUPHxBg@mail.gmail.com>
2023-10-30 15:10 ` About patch bdedff263132 - KVM: x86: Route pending NMIs Sean Christopherson
2023-10-31 12:15   ` Prasad Pandit
     [not found]     ` <CAE8KmOxd-Xib+qfiiBepP-ydjSAn32gjOTdLLUqm-i5vgzTv8w@mail.gmail.com>
2023-12-29 17:58       ` Fwd: " Prasad Pandit
2024-01-02 23:00         ` Sean Christopherson
2024-01-03  8:04           ` Prasad Pandit
