Re: [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs

linux-pm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Sean Christopherson <seanjc@google.com>
To: Mingwei Zhang <mizhang@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Huang Rui <ray.huang@amd.com>,
	 "Gautham R. Shenoy" <gautham.shenoy@amd.com>,
	Mario Limonciello <mario.limonciello@amd.com>,
	 "Rafael J. Wysocki" <rafael@kernel.org>,
	Viresh Kumar <viresh.kumar@linaro.org>,
	 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>,
	Len Brown <lenb@kernel.org>,  "H. Peter Anvin" <hpa@zytor.com>,
	Perry Yuan <perry.yuan@amd.com>,
	kvm@vger.kernel.org,  linux-kernel@vger.kernel.org,
	linux-pm@vger.kernel.org,  Jim Mattson <jmattson@google.com>
Subject: Re: [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs
Date: Tue, 3 Dec 2024 15:19:23 -0800	[thread overview]
Message-ID: <Z0-R-_GPWu-iVAYM@google.com> (raw)
In-Reply-To: <20241121185315.3416855-1-mizhang@google.com>

On Thu, Nov 21, 2024, Mingwei Zhang wrote:
> Linux guests read IA32_APERF and IA32_MPERF on every scheduler tick
> (250 Hz by default) to measure their effective CPU frequency. To avoid
> the overhead of intercepting these frequent MSR reads, allow the guest
> to read them directly by loading guest values into the hardware MSRs.
> 
> These MSRs are continuously running counters whose values must be
> carefully tracked during all vCPU state transitions:
> - Guest IA32_APERF advances only during guest execution

That's not what this series does though.  Guest APERF advances while the vCPU is
loaded by KVM_RUN, which is *very* different than letting APERF run freely only
while the vCPU is actively executing in the guest.

E.g. a vCPU that is memory oversubscribed via zswap will account a significant
amount of CPU time in APERF when faulting in swapped memory, whereas traditional
file-backed swap will not due to the task being scheduled out while waiting on I/O.

In general, the "why" of this series is missing.  What are the use cases you are
targeting?  What are the exact semantics you want to define?  *Why* did are you
proposed those exact semantics?

E.g. emulated I/O that is handled in KVM will be accounted to APERF, but I/O that
requires userspace exits will not.  It's not necessarily wrong for heavy userspace
I/O to cause observed frequency to drop, but it's not obviously correct either.

The use cases matter a lot for APERF/MPERF, because trying to reason about what's
desirable for an oversubscribed setup requires a lot more work than defining
semantics for setups where all vCPUs are hard pinned 1:1 and memory is more or
less just partitioned.  Not to mention the complexity for trying to support all
potential use cases is likely quite a bit higher.

And if the use case is specifically for slice-of-hardware, hard pinned/partitioned
VMs, does it matter if the host's view of APERF/MPERF is not accurately captured
at all times?  Outside of maybe a few CPUs running bookkeeping tasks, the only
workloads running on CPUs should be vCPUs.  It's not clear to me that observing
the guest utilization is outright wrong in that case.

One idea for supporting APERF/MPERF in KVM would be to add a kernel param to
disable/hide APERF/MPERF from the host, and then let KVM virtualize/passthrough
APERF/MPERF if and only if the feature is supported in hardware, but hidden from
the kernel.  I.e. let the system admin gift APERF/MPERF to KVM.

> - Guest IA32_MPERF advances at the TSC frequency whenever the vCPU is
>   in C0 state, even when not actively running
> - Host kernel access is redirected through get_host_[am]perf() which
>   adds per-CPU offsets to the hardware MSR values
> - Remote MSR reads through /dev/cpu/*/msr also account for these
>   offsets
> 
> Guest values persist in hardware while the vCPU is loaded and
> running. Host MSR values are restored on vcpu_put (either at KVM_RUN
> completion or when preempted) and when transitioning to halt state.
> 
> Note that guest TSC scaling via KVM_SET_TSC_KHZ is not supported, as
> it would require either intercepting MPERF reads on Intel (where MPERF
> ticks at host rate regardless of guest TSC scaling) or significantly
> complicating the cycle accounting on AMD.
> 
> The host must have both CONSTANT_TSC and NONSTOP_TSC capabilities
> since these ensure stable TSC frequency across C-states and P-states,
> which is required for accurate background MPERF accounting.

...

>  arch/x86/include/asm/kvm_host.h  |  11 ++
>  arch/x86/include/asm/topology.h  |  10 ++
>  arch/x86/kernel/cpu/aperfmperf.c |  65 +++++++++++-
>  arch/x86/kvm/cpuid.c             |  12 ++-
>  arch/x86/kvm/governed_features.h |   1 +
>  arch/x86/kvm/lapic.c             |   5 +-
>  arch/x86/kvm/reverse_cpuid.h     |   6 ++
>  arch/x86/kvm/svm/nested.c        |   2 +-
>  arch/x86/kvm/svm/svm.c           |   7 ++
>  arch/x86/kvm/svm/svm.h           |   2 +-
>  arch/x86/kvm/vmx/nested.c        |   2 +-
>  arch/x86/kvm/vmx/vmx.c           |   7 ++
>  arch/x86/kvm/vmx/vmx.h           |   2 +-
>  arch/x86/kvm/x86.c               | 171 ++++++++++++++++++++++++++++---
>  arch/x86/lib/msr-smp.c           |  11 ++
>  drivers/cpufreq/amd-pstate.c     |   4 +-
>  drivers/cpufreq/intel_pstate.c   |   5 +-
>  17 files changed, 295 insertions(+), 28 deletions(-)
> 
> 
> base-commit: 0a9b9d17f3a781dea03baca01c835deaa07f7cc3
> -- 
> 2.47.0.371.ga323438b13-goog
>

next prev parent reply	other threads:[~2024-12-03 23:19 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-11-21 18:52 [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs Mingwei Zhang
2024-11-21 18:52 ` [RFC PATCH 01/22] x86/aperfmperf: Introduce get_host_[am]perf() Mingwei Zhang
2024-11-21 18:52 ` [RFC PATCH 02/22] x86/aperfmperf: Introduce set_guest_[am]perf() Mingwei Zhang
2024-11-21 18:52 ` [RFC PATCH 03/22] x86/aperfmperf: Introduce restore_host_[am]perf() Mingwei Zhang
2024-11-21 18:52 ` [RFC PATCH 04/22] x86/msr: Adjust remote reads of IA32_[AM]PERF by the per-cpu host offset Mingwei Zhang
2024-11-21 18:52 ` [RFC PATCH 05/22] KVM: x86: Introduce kvm_vcpu_make_runnable() Mingwei Zhang
2024-11-21 18:52 ` [RFC PATCH 06/22] KVM: x86: INIT may transition from HALTED to RUNNABLE Mingwei Zhang
2024-12-03 19:07   ` Sean Christopherson
2024-11-21 18:52 ` [RFC PATCH 07/22] KVM: nSVM: Nested #VMEXIT " Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 08/22] KVM: nVMX: Nested VM-exit " Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 09/22] KVM: x86: Introduce KVM_X86_FEATURE_APERFMPERF Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 10/22] KVM: x86: Make APERFMPERF a governed feature Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 11/22] KVM: x86: Initialize guest [am]perf at vcpu power-on Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 12/22] KVM: x86: Load guest [am]perf into hardware MSRs at vcpu_load() Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 13/22] KVM: x86: Load guest [am]perf when leaving halt state Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 14/22] KVM: x86: Introduce kvm_user_return_notifier_register() Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 15/22] KVM: x86: Restore host IA32_[AM]PERF on userspace return Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 16/22] KVM: x86: Save guest [am]perf checkpoint on HLT Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 17/22] KVM: x86: Save guest [am]perf checkpoint on vcpu_put() Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 18/22] KVM: x86: Update aperfmperf on host-initiated MP_STATE transitions Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 19/22] KVM: x86: Allow host and guest access to IA32_[AM]PERF Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 20/22] KVM: VMX: Pass through guest reads of IA32_[AM]PERF Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 21/22] KVM: SVM: " Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 22/22] KVM: x86: Enable guest usage of X86_FEATURE_APERFMPERF Mingwei Zhang
2024-12-03 23:19 ` Sean Christopherson [this message]
2024-12-04  1:13   ` [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs Jim Mattson
2024-12-04  1:59     ` Sean Christopherson
2024-12-04  4:00       ` Jim Mattson
2024-12-04  5:11       ` Mingwei Zhang
2024-12-04 12:30       ` Jim Mattson
2024-12-06 16:34         ` Sean Christopherson
2024-12-18 22:23           ` Jim Mattson
2025-01-13 19:15             ` Sean Christopherson
2024-12-05  8:59 ` Nikunj A Dadhania
2024-12-05 13:48   ` Jim Mattson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z0-R-_GPWu-iVAYM@google.com \
    --to=seanjc@google.com \
    --cc=gautham.shenoy@amd.com \
    --cc=hpa@zytor.com \
    --cc=jmattson@google.com \
    --cc=kvm@vger.kernel.org \
    --cc=lenb@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pm@vger.kernel.org \
    --cc=mario.limonciello@amd.com \
    --cc=mizhang@google.com \
    --cc=pbonzini@redhat.com \
    --cc=perry.yuan@amd.com \
    --cc=rafael@kernel.org \
    --cc=ray.huang@amd.com \
    --cc=srinivas.pandruvada@linux.intel.com \
    --cc=viresh.kumar@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).