From: Sean Christopherson <seanjc@google.com>
To: Jim Mattson <jmattson@google.com>
Cc: Mingwei Zhang <mizhang@google.com>,
Paolo Bonzini <pbonzini@redhat.com>,
Huang Rui <ray.huang@amd.com>,
"Gautham R. Shenoy" <gautham.shenoy@amd.com>,
Mario Limonciello <mario.limonciello@amd.com>,
"Rafael J. Wysocki" <rafael@kernel.org>,
Viresh Kumar <viresh.kumar@linaro.org>,
Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>,
Len Brown <lenb@kernel.org>, "H. Peter Anvin" <hpa@zytor.com>,
Perry Yuan <perry.yuan@amd.com>,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-pm@vger.kernel.org
Subject: Re: [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs
Date: Tue, 3 Dec 2024 17:59:09 -0800
Message-ID: <Z0-3bc1reu1slCtL@google.com>
In-Reply-To: <CALMp9eTCe1-ZA47kcktTQ4WZ=GUbg8x3HpBd0Rf9Yx_pDFkkNg@mail.gmail.com>
On Tue, Dec 03, 2024, Jim Mattson wrote:
> On Tue, Dec 3, 2024 at 3:19 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Nov 21, 2024, Mingwei Zhang wrote:
> > > Linux guests read IA32_APERF and IA32_MPERF on every scheduler tick
> > > (250 Hz by default) to measure their effective CPU frequency. To avoid
> > > the overhead of intercepting these frequent MSR reads, allow the guest
> > > to read them directly by loading guest values into the hardware MSRs.
> > >
> > > These MSRs are continuously running counters whose values must be
> > > carefully tracked during all vCPU state transitions:
> > > - Guest IA32_APERF advances only during guest execution
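For context, the guest-side computation the cover letter is describing is
roughly the following; an illustrative sketch, not the actual guest kernel
code (variable names are made up):

  /*
   * APERF counts at the actual (effective) frequency and MPERF counts at
   * the base/TSC frequency, both only while the CPU is in C0, so the ratio
   * of the deltas over a sampling interval gives the effective frequency.
   */
  u64 d_aperf = aperf_now - aperf_prev;
  u64 d_mperf = mperf_now - mperf_prev;
  u64 eff_khz = div64_u64(base_khz * d_aperf, d_mperf); /* overflow handling omitted */

At 250 Hz, that's two RDMSRs per tick per vCPU, hence the desire to avoid
intercepting the reads.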
> >
> > That's not what this series does though. Guest APERF advances while the vCPU is
> > loaded by KVM_RUN, which is *very* different than letting APERF run freely only
> > while the vCPU is actively executing in the guest.
> >
> > E.g. a vCPU that is memory oversubscribed via zswap will account a significant
> > amount of CPU time in APERF when faulting in swapped memory, whereas traditional
> > file-backed swap will not due to the task being scheduled out while waiting on I/O.
>
> Are you saying that APERF should stop completely outside of VMX
> non-root operation / guest mode?
> While that is possible, the overhead would be significantly
> higher...probably high enough to make it impractical.
No, I'm simply pointing out that the cover letter is misleading/inaccurate.
> > In general, the "why" of this series is missing. What are the use cases you are
> > targeting? What are the exact semantics you want to define? *Why* did you
> > propose those exact semantics?
>
> I get the impression that the questions above are largely rhetorical, and
Nope, not rhetorical, I genuinely want to know. I can't tell if y'all thought
about the side effects of things like swap and emulated I/O, and if you did, what
made you come to the conclusion that the "best" boundary is on sched_out() and
return to userspace.
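To make the distinction concrete, my reading of the series (the vcpu_load()/
vcpu_put() and user-return notifier patches) is that guest APERF/MPERF accrue
over roughly this window; illustrative pseudocode, not the actual patch code:

  /* vcpu_load(), i.e. sched_in or KVM_RUN entry: switch to guest values. */
  wrmsrl(MSR_IA32_APERF, vcpu->guest_aperf);
  wrmsrl(MSR_IA32_MPERF, vcpu->guest_mperf);

  /*
   * From here until vcpu_put(), *everything* that runs on this CPU advances
   * the guest counters: VM-Enter/VM-Exit, page faults (zswap included),
   * in-kernel emulation, etc., not just time spent in guest mode.
   */

  /* vcpu_put(), i.e. sched_out or KVM_RUN exit: checkpoint guest values. */
  rdmsrl(MSR_IA32_APERF, vcpu->guest_aperf);
  rdmsrl(MSR_IA32_MPERF, vcpu->guest_mperf);

  /* Host values are restored lazily, on return to userspace. */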
> that you would not be happy with the answers anyway, but if you really are
> inviting a version 2, I will gladly expound upon the why.
No need for a new version at this time, just give me the details.
> > E.g. emulated I/O that is handled in KVM will be accounted to APERF, but I/O that
> > requires userspace exits will not. It's not necessarily wrong for heavy userspace
> > I/O to cause observed frequency to drop, but it's not obviously correct either.
> >
> > The use cases matter a lot for APERF/MPERF, because trying to reason about what's
> > desirable for an oversubscribed setup requires a lot more work than defining
> > semantics for setups where all vCPUs are hard pinned 1:1 and memory is more or
> > less just partitioned. Not to mention the complexity for trying to support all
> > potential use cases is likely quite a bit higher.
> >
> > And if the use case is specifically for slice-of-hardware, hard pinned/partitioned
> > VMs, does it matter if the host's view of APERF/MPERF is not accurately captured
> > at all times? Outside of maybe a few CPUs running bookkeeping tasks, the only
> > workloads running on CPUs should be vCPUs. It's not clear to me that observing
> > the guest utilization is outright wrong in that case.
>
> My understanding is that Google Cloud customers have been asking for this
> feature for all manner of VM families for years, and most of those VM
> families are not slice-of-hardware, since we just launched our first such
> offering a few months ago.
But do you actually want to expose APERF/MPERF to those VMs? With my upstream
hat on, what someone's customers are asking for isn't relevant. What's relevant
is what that someone wants to deliver/enable.
> > One idea for supporting APERF/MPERF in KVM would be to add a kernel param to
> > disable/hide APERF/MPERF from the host, and then let KVM virtualize/passthrough
> > APERF/MPERF if and only if the feature is supported in hardware, but hidden from
> > the kernel. I.e. let the system admin gift APERF/MPERF to KVM.
>
> Part of our goal has been to enable guest APERF/MPERF without impacting the
> use of host APERF/MPERF, since one of the first things our support teams look
> at in response to a performance complaint is the effective frequencies of the
> CPUs as reported on the host.
But is looking at the host's view even useful if (a) the only thing running on
those CPUs is a single vCPU, and (b) host userspace only sees the effective
frequencies when _host_ code is running? Getting the effective frequency for
when the userspace VMM is processing emulated I/O probably isn't going to be all
that helpful.
And gifting APERF/MPERF to VMs doesn't have to mean the host can't read the MSRs,
e.g. via turbostat. It just means the kernel won't use APERF/MPERF for scheduling
decisions or any other behaviors that rely on an accurate host view.
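As a point of reference, this is roughly what turbostat does today via the msr
driver, and it keeps working regardless of what the kernel itself does with
APERF/MPERF; a minimal sketch, error handling omitted:

  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <unistd.h>

  #define MSR_IA32_MPERF 0xe7
  #define MSR_IA32_APERF 0xe8

  static uint64_t rdmsr_on_cpu(int cpu, uint32_t msr)
  {
          char path[64];
          uint64_t val = 0;
          int fd;

          snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
          fd = open(path, O_RDONLY);
          pread(fd, &val, sizeof(val), msr);  /* file offset == MSR index */
          close(fd);
          return val;
  }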
> I can explain all of this in excruciating detail, but I'm not really
> motivated by your initial response, which honestly seems a bit hostile.
Probably because this series made me a bit grumpy :-) As presented, this feels
way, way too much like KVM's existing PMU "virtualization". Mostly works if you
stare at it just so, but devoid of details on why X was done instead of Y, and
seemingly ignores multiple edge cases.
I'm not saying you and Mingwei haven't thought about edge cases and design
tradeoffs, but nothing in the cover letter, changelogs, comments (none), or
testcases (also none) communicates those thoughts to others.
> At least you looked at the code, which is a far warmer reception than I
> usually get.
Thread overview: 35+ messages
2024-11-21 18:52 [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs Mingwei Zhang
2024-11-21 18:52 ` [RFC PATCH 01/22] x86/aperfmperf: Introduce get_host_[am]perf() Mingwei Zhang
2024-11-21 18:52 ` [RFC PATCH 02/22] x86/aperfmperf: Introduce set_guest_[am]perf() Mingwei Zhang
2024-11-21 18:52 ` [RFC PATCH 03/22] x86/aperfmperf: Introduce restore_host_[am]perf() Mingwei Zhang
2024-11-21 18:52 ` [RFC PATCH 04/22] x86/msr: Adjust remote reads of IA32_[AM]PERF by the per-cpu host offset Mingwei Zhang
2024-11-21 18:52 ` [RFC PATCH 05/22] KVM: x86: Introduce kvm_vcpu_make_runnable() Mingwei Zhang
2024-11-21 18:52 ` [RFC PATCH 06/22] KVM: x86: INIT may transition from HALTED to RUNNABLE Mingwei Zhang
2024-12-03 19:07 ` Sean Christopherson
2024-11-21 18:52 ` [RFC PATCH 07/22] KVM: nSVM: Nested #VMEXIT " Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 08/22] KVM: nVMX: Nested VM-exit " Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 09/22] KVM: x86: Introduce KVM_X86_FEATURE_APERFMPERF Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 10/22] KVM: x86: Make APERFMPERF a governed feature Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 11/22] KVM: x86: Initialize guest [am]perf at vcpu power-on Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 12/22] KVM: x86: Load guest [am]perf into hardware MSRs at vcpu_load() Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 13/22] KVM: x86: Load guest [am]perf when leaving halt state Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 14/22] KVM: x86: Introduce kvm_user_return_notifier_register() Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 15/22] KVM: x86: Restore host IA32_[AM]PERF on userspace return Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 16/22] KVM: x86: Save guest [am]perf checkpoint on HLT Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 17/22] KVM: x86: Save guest [am]perf checkpoint on vcpu_put() Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 18/22] KVM: x86: Update aperfmperf on host-initiated MP_STATE transitions Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 19/22] KVM: x86: Allow host and guest access to IA32_[AM]PERF Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 20/22] KVM: VMX: Pass through guest reads of IA32_[AM]PERF Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 21/22] KVM: SVM: " Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 22/22] KVM: x86: Enable guest usage of X86_FEATURE_APERFMPERF Mingwei Zhang
2024-12-03 23:19 ` [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs Sean Christopherson
2024-12-04 1:13 ` Jim Mattson
2024-12-04 1:59 ` Sean Christopherson [this message]
2024-12-04 4:00 ` Jim Mattson
2024-12-04 5:11 ` Mingwei Zhang
2024-12-04 12:30 ` Jim Mattson
2024-12-06 16:34 ` Sean Christopherson
2024-12-18 22:23 ` Jim Mattson
2025-01-13 19:15 ` Sean Christopherson
2024-12-05 8:59 ` Nikunj A Dadhania
2024-12-05 13:48 ` Jim Mattson