* KVM: x86: __wait_lapic_expire silently using TPAUSE C0.2
@ 2024-09-06 17:57 Jon Kohler
  2024-09-09 19:11 ` Sean Christopherson
  0 siblings, 1 reply; 4+ messages in thread
From: Jon Kohler @ 2024-09-06 17:57 UTC (permalink / raw)
  To: kvm@vger.kernel.org
  Cc: linux-pm@vger.kernel.org, Fenghua Yu, kyung.min.park@intel.com,
	Tony Luck

Reaching out to report an observation and get some advice.

Comments in __wait_lapic_expire introduced in [1] are no longer
completely accurate, as __delay() will not call delay_tsc on systems
that support WAITPKG, such as Intel Sapphire Rapids and newer.
Instead, such systems will have their delay_fn configured to
delay_halt, which calls delay_halt_fn in a loop until the requested
number of cycles has elapsed. This was introduced in [2].

delay_halt_fn uses __tpause() with TPAUSE_C02_STATE, the power-optimized
version of TPAUSE, which according to documentation [3] has slower wakeup
latency and higher power savings, with the added benefit of being more
SMT-yield friendly.

For datacenter, latency sensitive workloads, this is problematic as
the call to kvm_wait_lapic_expire happens directly prior to reentry
through vmx_vcpu_enter_exit, which is the exact wrong place for slow
wakeup latency.

Intel has a nice paper [4] that talks about TPAUSE in the context of
getting better power utilization using DPDK polling, which has a bunch
of neat measurements, facts, and figures. 

One stands out: according to figure 5 of Intel's paper, TPAUSE has
3.7 times the exit latency coming out of C0.2 compared to C0.1, yet
saves only ~15% power between the two states.

Using TPAUSE_C02_STATE seems like the wrong behavior given that the
spirit of kvm_wait_lapic_expire is to delay ever so slightly and then
jump back into the guest as soon as that delay is over. If we're going
to have TPAUSE in the critical path, I *think* it should be using
TPAUSE_C01_STATE; however, there is currently no way to signal that.

Side note:
It's also worth noting that the delay_halt path does not do the same
things that delay_tsc does, which calls preempt_{enable|disable}()
around its wait loop until the delay period is over. I'm not sure one
way or the other if this is the behavior we wanted in
kvm_wait_lapic_expire in the first place, so I'll reserve judgement.

So, with all of that said, there are a few things that could be done,
and I'm definitely open to ideas:
1. Update delay_halt_tpause to use TPAUSE_C01_STATE unconditionally,
which anecdotally seems in line with the spirit of how AMD implemented
MWAITX, which uses the same delay_halt loop and calls mwaitx with
MWAITX_DISABLE_CSTATES.
2. Provide system-level configurability in delay.c to optionally use
C0.1, maybe as a compile-time setting? That way distros aiming at
low-energy deployments could opt in, while the default remains low
latency.
3. Provide some different delay API that KVM could call, indicating it
wants low wakeup latency delays, if hardware supports it?
4. Pull this code into kvm code directly (boooooo?) and manage it
directly instead of using delay.c (boooooo?)
5. Something else?

[1] b6aa57c69cb ("KVM: lapic: Convert guest TSC to host time domain if necessary") 
[2] cec5f268cd0 ("x86/delay: Introduce TPAUSE delay") 
[3] https://www.felixcloutier.com/x86/tpause
[4] https://www.intel.com/content/www/us/en/content-details/751859/power-management-user-wait-instructions-power-saving-for-dpdk-pmd-polling-workloads-technology-guide.html


* Re: KVM: x86: __wait_lapic_expire silently using TPAUSE C0.2
  2024-09-06 17:57 KVM: x86: __wait_lapic_expire silently using TPAUSE C0.2 Jon Kohler
@ 2024-09-09 19:11 ` Sean Christopherson
  2024-09-10 18:47   ` Jon Kohler
  0 siblings, 1 reply; 4+ messages in thread
From: Sean Christopherson @ 2024-09-09 19:11 UTC (permalink / raw)
  To: Jon Kohler
  Cc: kvm@vger.kernel.org, linux-pm@vger.kernel.org, Fenghua Yu,
	kyung.min.park@intel.com, Tony Luck

On Fri, Sep 06, 2024, Jon Kohler wrote:
> delay_halt_fn uses __tpause() with TPAUSE_C02_STATE, the power-optimized
> version of TPAUSE, which according to documentation [3] has slower wakeup
> latency and higher power savings, with the added benefit of being more
> SMT-yield friendly.
> 
> For datacenter, latency sensitive workloads, this is problematic as
> the call to kvm_wait_lapic_expire happens directly prior to reentry
> through vmx_vcpu_enter_exit, which is the exact wrong place for slow
> wakeup latency.

...

> So, with all of that said, there are a few things that could be done,
> and I'm definitely open to ideas:
> 1. Update delay_halt_tpause to use TPAUSE_C01_STATE unconditionally,
> which anecdotally seems in line with the spirit of how AMD implemented
> MWAITX, which uses the same delay_halt loop and calls mwaitx with
> MWAITX_DISABLE_CSTATES.
> 2. Provide system-level configurability in delay.c to optionally use
> C0.1, maybe as a compile-time setting? That way distros aiming at
> low-energy deployments could opt in, while the default remains low
> latency.
> 3. Provide some different delay API that KVM could call, indicating it
> wants low wakeup latency delays, if hardware supports it?
> 4. Pull this code into kvm code directly (boooooo?) and manage it
> directly instead of using delay.c (boooooo?)
> 5. Something else?

The option that would likely give the best of both worlds would be to prioritize
lower wakeup latency for "small" delays.  That could be done in __delay() and/or
in KVM.  E.g. delay_halt_tpause() quite clearly assumes a relatively long delay,
which is a flawed assumption in this case.

	/*
	 * Hard code the deeper (C0.2) sleep state because exit latency is
	 * small compared to the "microseconds" that usleep() will delay.
	 */
	__tpause(TPAUSE_C02_STATE, edx, eax);

The reason I say "and/or KVM" is that even without TPAUSE in the picture, it might
make sense for KVM to avoid __delay() for anything but long delays.  Both because
the overhead of e.g. delay_tsc() could be higher than the delay itself, but also
because the intent of KVM's delay is somewhat unique.

By definition, KVM _knows_ there is an IRQ that is being delivered to the vCPU, i.e.
entering the guest and running the vCPU asap is a priority.  The _only_ reason KVM
is waiting is to not violate the architecture.  Reducing power consumption and
even letting an SMT sibling run are arguably non-goals, i.e. it might be best for
KVM to avoid even regular ol' PAUSE in this specific scenario, unless the wait
time is so high that delaying VM-Enter more than the absolute bare minimum
becomes a worthwhile tradeoff.


* Re: KVM: x86: __wait_lapic_expire silently using TPAUSE C0.2
  2024-09-09 19:11 ` Sean Christopherson
@ 2024-09-10 18:47   ` Jon Kohler
  2024-09-11 21:32     ` Sean Christopherson
  0 siblings, 1 reply; 4+ messages in thread
From: Jon Kohler @ 2024-09-10 18:47 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm@vger.kernel.org, linux-pm@vger.kernel.org, Fenghua Yu,
	kyung.min.park@intel.com, Tony Luck



> On Sep 9, 2024, at 3:11 PM, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Fri, Sep 06, 2024, Jon Kohler wrote:
>> delay_halt_fn uses __tpause() with TPAUSE_C02_STATE, the power-optimized
>> version of TPAUSE, which according to documentation [3] has slower wakeup
>> latency and higher power savings, with the added benefit of being more
>> SMT-yield friendly.
>> 
>> For datacenter, latency sensitive workloads, this is problematic as
>> the call to kvm_wait_lapic_expire happens directly prior to reentry
>> through vmx_vcpu_enter_exit, which is the exact wrong place for slow
>> wakeup latency.
> 
> ...
> 
>> So, with all of that said, there are a few things that could be done,
>> and I'm definitely open to ideas:
>> 1. Update delay_halt_tpause to use TPAUSE_C01_STATE unconditionally,
>> which anecdotally seems in line with the spirit of how AMD implemented
>> MWAITX, which uses the same delay_halt loop and calls mwaitx with
>> MWAITX_DISABLE_CSTATES.
>> 2. Provide system-level configurability in delay.c to optionally use
>> C0.1, maybe as a compile-time setting? That way distros aiming at
>> low-energy deployments could opt in, while the default remains low
>> latency.
>> 3. Provide some different delay API that KVM could call, indicating it
>> wants low wakeup latency delays, if hardware supports it?
>> 4. Pull this code into kvm code directly (boooooo?) and manage it
>> directly instead of using delay.c (boooooo?)
>> 5. Something else?
> 
> The option that would likely give the best of both worlds would be to prioritize
> lower wakeup latency for "small" delays.  That could be done in __delay() and/or
> in KVM.  E.g. delay_halt_tpause() quite clearly assumes a relatively long delay,
> which is a flawed assumption in this case.
> 
> 	/*
> 	 * Hard code the deeper (C0.2) sleep state because exit latency is
> 	 * small compared to the "microseconds" that usleep() will delay.
> 	 */
> 	__tpause(TPAUSE_C02_STATE, edx, eax);
> 
> The reason I say "and/or KVM" is that even without TPAUSE in the picture, it might
> make sense for KVM to avoid __delay() for anything but long delays.  Both because
> the overhead of e.g. delay_tsc() could be higher than the delay itself, but also
> because the intent of KVM's delay is somewhat unique.
> 
> By definition, KVM _knows_ there is an IRQ that is being delivered to the vCPU, i.e.
> entering the guest and running the vCPU asap is a priority.  The _only_ reason KVM
> is waiting is to not violate the architecture.  Reducing power consumption and
> even letting an SMT sibling run are arguably non-goals, i.e. it might be best for
> KVM to avoid even regular ol' PAUSE in this specific scenario, unless the wait
> time is so high that delaying VM-Enter more than the absolute bare minimum
> becomes a worthwhile tradeoff.

Hey Sean - thanks for the sage advice as always.

How about something like this for the regular ol’ PAUSE route?

Note: the ndelay side would likely be a bit more annoying to internalize
into KVM, but perhaps we could just have the delay library return the
number of cycles, and then do the loop I’ve got as a separate, KVM-only
func?

--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1627,16 +1627,39 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
 static inline void __wait_lapic_expire(struct kvm_vcpu *vcpu, u64 guest_cycles)
 {
        u64 timer_advance_ns = vcpu->arch.apic->lapic_timer.timer_advance_ns;
+       u64 start, end, delay_by = 0;
 
        /*
         * If the guest TSC is running at a different ratio than the host, then
-        * convert the delay to nanoseconds to achieve an accurate delay.  Note
-        * that __delay() uses delay_tsc whenever the hardware has TSC, thus
-        * always for VMX enabled hardware.
+        * convert the delay to nanoseconds to achieve an accurate delay.
+        *
+        * Note: open code delay function as KVM's use case is a bit special, as
+        * we know we need to reenter the guest at a specific time; however, the
+        * delay library may introduce architectural delays that we do not want,
+        * such as using TPAUSE. Our mission is to simply get into the guest as
+        * soon as possible without violating architectural constraints.
+        * RFC: Keep ndelay for help converting to nsec? or pull that in too?
         */
        if (vcpu->arch.tsc_scaling_ratio == kvm_caps.default_tsc_scaling_ratio) {
-               __delay(min(guest_cycles,
-                       nsec_to_cycles(vcpu, timer_advance_ns)));
+               delay_by = min(guest_cycles, 
+                                          nsec_to_cycles(vcpu, timer_advance_ns));
+
+               if (delay_by == 0) {
+                       return;
+               } else {
+                       start = rdtsc();
+
+                       for (;;) {
+                               cpu_relax();
+                               end = rdtsc();
+
+                               if (delay_by <= end - start)
+                                       break;
+
+                               delay_by -= end - start;
+                               start = end;
+                       }
+               }
        } else {
                u64 delay_ns = guest_cycles * 1000000ULL;
                do_div(delay_ns, vcpu->arch.virtual_tsc_khz);




* Re: KVM: x86: __wait_lapic_expire silently using TPAUSE C0.2
  2024-09-10 18:47   ` Jon Kohler
@ 2024-09-11 21:32     ` Sean Christopherson
  0 siblings, 0 replies; 4+ messages in thread
From: Sean Christopherson @ 2024-09-11 21:32 UTC (permalink / raw)
  To: Jon Kohler
  Cc: kvm@vger.kernel.org, linux-pm@vger.kernel.org, Fenghua Yu,
	kyung.min.park@intel.com, Tony Luck

On Tue, Sep 10, 2024, Jon Kohler wrote:
> How about something like this for the regular ol’ PAUSE route?
> 
> Note: the ndelay side would likely be a bit more annoying to internalize
> into KVM, but perhaps we could just have the delay library return the
> number of cycles, and then do the loop I’ve got as a separate, KVM-only
> func?
> 
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1627,16 +1627,39 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
>  static inline void __wait_lapic_expire(struct kvm_vcpu *vcpu, u64 guest_cycles)
>  {
>         u64 timer_advance_ns = vcpu->arch.apic->lapic_timer.timer_advance_ns;
> +       u64 start, end, delay_by = 0;
>  
>         /*
>          * If the guest TSC is running at a different ratio than the host, then
> -        * convert the delay to nanoseconds to achieve an accurate delay.  Note
> -        * that __delay() uses delay_tsc whenever the hardware has TSC, thus
> -        * always for VMX enabled hardware.
> +        * convert the delay to nanoseconds to achieve an accurate delay.
> +        *
> +        * Note: open code delay function as KVM's use case is a bit special, as
> +        * we know we need to reenter the guest at a specific time; however, the
> +        * delay library may introduce architectural delays that we do not want,
> +        * such as using TPAUSE. Our mission is to simply get into the guest as
> +        * soon as possible without violating architectural constraints.
> +        * RFC: Keep ndelay for help converting to nsec? or pull that in too?
>          */
>         if (vcpu->arch.tsc_scaling_ratio == kvm_caps.default_tsc_scaling_ratio) {
> -               __delay(min(guest_cycles,
> -                       nsec_to_cycles(vcpu, timer_advance_ns)));
> +               delay_by = min(guest_cycles, 
> +                                          nsec_to_cycles(vcpu, timer_advance_ns));
> +
> +               if (delay_by == 0) {
> +                       return;
> +               } else {
> +                       start = rdtsc();
> +
> +                       for (;;) {
> +                               cpu_relax();
> +                               end = rdtsc();
> +
> +                               if (delay_by <= end - start)
> +                                       break;
> +
> +                               delay_by -= end - start;
> +                               start = end;
> +                       }

I don't want to replicate __delay() in KVM.  If the delay is short, KVM should
busy wait and intentionally NOT do PAUSE, i.e. not yield to the SMT sibling(s).
Because unlike most uses of PAUSE, this CPU isn't waiting on any resource except
time itself.  E.g. there's no need to give a different CPU cycles so that the
other CPU can finish a critical section.

And again, entering the guest so that the vCPU can service the IRQ is a priority,
e.g. see all of the ChromeOS work around paravirt scheduling to boost the priority
of vCPUs that have pending IRQs.  Not to mention the physical CPU is running with
IRQs disabled, i.e. if a _host_ IRQ becomes pending, then entering the guest is a
priority in order to recognize that IRQ as well.

> +               }
>         } else {
>                 u64 delay_ns = guest_cycles * 1000000ULL;
>                 do_div(delay_ns, vcpu->arch.virtual_tsc_khz);

Whatever we do, we should do the same thing irrespective of whether or not TSC
scaling is in use.  I don't think we have an existing helper to unscale a guest
TSC, but it shouldn't be hard to add (do math).

E.g. something like:

	delay_cycles = kvm_unscale_tsc(guest_cycles,
				       vcpu->arch.tsc_scaling_ratio);
	delay_cycles = min(delay_cycles, nsec_to_cycles(vcpu, timer_advance_ns));

	if (delay_cycles > N) {
		__delay(delay_cycles);
		return;
	}

	for (start = rdtsc(); rdtsc() - start < delay_cycles; )
		;

And then I also think we (handwavy "we") should do some profiling to ensure KVM
isn't regularly waiting for more than N cycles, where N is probably something
like 200.  Because if KVM is regularly waiting for 200+ cycles, then we likely
need to re-tune the timer_advance_ns logic to be more conservative, i.e. err more
on the side of the timer arriving slightly late so as not to waste CPU cycles.

