From: Sean Christopherson <seanjc@google.com>
To: Jon Kohler <jon@nutanix.com>
Cc: kvm@vger.kernel.org,
	linux-pm@vger.kernel.org,
	Fenghua Yu <fenghua.yu@intel.com>,
	kyung.min.park@intel.com,
	Tony Luck <tony.luck@intel.com>
Subject: Re: KVM: x86: __wait_lapic_expire silently using TPAUSE C0.2
Date: Wed, 11 Sep 2024 14:32:12 -0700
Message-ID: <ZuIMXFVcjVUbf9j9@google.com>
In-Reply-To: <2E1344D0-E955-4E15-9766-73E3DAA73634@nutanix.com>
On Tue, Sep 10, 2024, Jon Kohler wrote:
> How about something like this for the regular ol’ PAUSE route?
>
> Note: the ndelay side would likely be a bit more annoying to internalize into
> KVM, but perhaps we could have the delay library return the number of cycles,
> and then do the loop I've got as a separate, KVM-only func?
>
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1627,16 +1627,39 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
> static inline void __wait_lapic_expire(struct kvm_vcpu *vcpu, u64 guest_cycles)
> {
> u64 timer_advance_ns = vcpu->arch.apic->lapic_timer.timer_advance_ns;
> + u64 start, end, delay_by = 0;
>
> /*
> * If the guest TSC is running at a different ratio than the host, then
> - * convert the delay to nanoseconds to achieve an accurate delay. Note
> - * that __delay() uses delay_tsc whenever the hardware has TSC, thus
> - * always for VMX enabled hardware.
> + * convert the delay to nanoseconds to achieve an accurate delay.
> + *
> + * Note: open code delay function as KVM's use case is a bit special, as
> + * we know we need to reenter the guest at a specific time; however, the
> + * delay library may introduce architectural delays that we do not want,
> + * such as using TPAUSE. Our mission is to simply get into the guest as
> + * soon as possible without violating architectural constraints.
> + * RFC: Keep ndelay for help converting to nsec? or pull that in too?
> */
> if (vcpu->arch.tsc_scaling_ratio == kvm_caps.default_tsc_scaling_ratio) {
> - __delay(min(guest_cycles,
> - nsec_to_cycles(vcpu, timer_advance_ns)));
> + delay_by = min(guest_cycles,
> + nsec_to_cycles(vcpu, timer_advance_ns));
> +
> + if (delay_by == 0) {
> + return;
> + } else {
> + start = rdtsc();
> +
> + for (;;) {
> + cpu_relax();
> + end = rdtsc();
> +
> + if (delay_by <= end - start)
> + break;
> +
> + delay_by -= end - start;
> + start = end;
> + }
I don't want to replicate __delay() in KVM. If the delay is short, KVM should
busy wait and intentionally NOT do PAUSE, i.e. not yield to the SMT sibling(s).
Because unlike most uses of PAUSE, this CPU isn't waiting on any resource except
time itself. E.g. there's no need to give a different CPU cycles so that the
other CPU can finish a critical section.
And again, entering the guest so that the vCPU can service the IRQ is a priority,
e.g. see all of the ChromeOS work around paravirt scheduling to boost the priority
of vCPUs that have pending IRQs. Not to mention the physical CPU is running with
IRQs disabled, i.e. if a _host_ IRQ becomes pending, then entering the guest is a
priority in order to recognize that IRQ as well.
> + }
> } else {
> u64 delay_ns = guest_cycles * 1000000ULL;
> do_div(delay_ns, vcpu->arch.virtual_tsc_khz);
Whatever we do, we should do the same thing irrespective of whether or not TSC
scaling is in use. I don't think we have an existing helper to unscale a guest
TSC, but it shouldn't be hard to add (do math).
E.g. something like:
	delay_cycles = kvm_unscale_tsc(guest_cycles,
				       vcpu->arch.tsc_scaling_ratio);
	delay_cycles = min(delay_cycles, nsec_to_cycles(vcpu, timer_advance_ns));

	if (delay_cycles > N) {
		__delay(delay_cycles);
		return;
	}

	for (start = rdtsc(); rdtsc() - start < delay_cycles; )
		;
And then I also think we (handwavy "we") should do some profiling to ensure KVM
isn't regularly waiting for more than N cycles, where N is probably something
like 200. Because if KVM is regularly waiting for 200+ cycles, then we likely
need to re-tune the timer_advance_ns logic to be more conservative, i.e. err more
on the side of the timer arriving slightly late so as not to waste CPU cycles.
Thread overview: 4+ messages
2024-09-06 17:57 KVM: x86: __wait_lapic_expire silently using TPAUSE C0.2 Jon Kohler
2024-09-09 19:11 ` Sean Christopherson
2024-09-10 18:47 ` Jon Kohler
2024-09-11 21:32 ` Sean Christopherson [this message]