* [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs
From: Mingwei Zhang @ 2024-11-21 18:52 UTC
To: Sean Christopherson, Paolo Bonzini, Huang Rui, Gautham R. Shenoy,
Mario Limonciello, Rafael J. Wysocki, Viresh Kumar,
Srinivas Pandruvada, Len Brown
Cc: H. Peter Anvin, Perry Yuan, kvm, linux-kernel, linux-pm,
Jim Mattson, Mingwei Zhang
Linux guests read IA32_APERF and IA32_MPERF on every scheduler tick
(250 Hz by default) to measure their effective CPU frequency. To avoid
the overhead of intercepting these frequent MSR reads, allow the guest
to read them directly by loading guest values into the hardware MSRs.
These MSRs are continuously running counters whose values must be
carefully tracked during all vCPU state transitions:
- Guest IA32_APERF advances only during guest execution
- Guest IA32_MPERF advances at the TSC frequency whenever the vCPU is
in C0 state, even when not actively running
- Host kernel access is redirected through get_host_[am]perf() which
adds per-CPU offsets to the hardware MSR values
- Remote MSR reads through /dev/cpu/*/msr also account for these
offsets
Guest values persist in hardware while the vCPU is loaded and
running. Host MSR values are restored on vcpu_put (either at KVM_RUN
completion or when preempted) and when transitioning to halt state.
Note that guest TSC scaling via KVM_SET_TSC_KHZ is not supported, as
it would require either intercepting MPERF reads on Intel (where MPERF
ticks at host rate regardless of guest TSC scaling) or significantly
complicating the cycle accounting on AMD.
The host must have both CONSTANT_TSC and NONSTOP_TSC capabilities
since these ensure stable TSC frequency across C-states and P-states,
which is required for accurate background MPERF accounting.
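For reference, the guest-side pattern that this series makes
interception-free is the sampling done in arch_scale_freq_tick()
(touched in patch 1). A minimal sketch, where prev_aperf/prev_mperf
stand in for the per-CPU sample state:

        u64 aperf, mperf, acnt, mcnt;

        rdmsrl(MSR_IA32_APERF, aperf);  /* direct read, no VM-exit */
        rdmsrl(MSR_IA32_MPERF, mperf);
        acnt = aperf - prev_aperf;      /* cycles actually executed */
        mcnt = mperf - prev_mperf;      /* TSC-rate ticks spent in C0 */
        /* effective frequency ~= base frequency * acnt / mcnt */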
Jim Mattson (14):
x86/aperfmperf: Introduce get_host_[am]perf()
x86/aperfmperf: Introduce set_guest_[am]perf()
x86/aperfmperf: Introduce restore_host_[am]perf()
x86/msr: Adjust remote reads of IA32_[AM]PERF by the per-cpu host
offset
KVM: x86: Introduce kvm_vcpu_make_runnable()
KVM: x86: INIT may transition from HALTED to RUNNABLE
KVM: nSVM: Nested #VMEXIT may transition from HALTED to RUNNABLE
KVM: nVMX: Nested VM-exit may transition from HALTED to RUNNABLE
KVM: x86: Make APERFMPERF a governed feature
KVM: x86: Initialize guest [am]perf at vcpu power-on
KVM: x86: Load guest [am]perf when leaving halt state
KVM: x86: Introduce kvm_user_return_notifier_register()
KVM: x86: Restore host IA32_[AM]PERF on userspace return
KVM: x86: Update aperfmperf on host-initiated MP_STATE transitions
Mingwei Zhang (8):
KVM: x86: Introduce KVM_X86_FEATURE_APERFMPERF
KVM: x86: Load guest [am]perf into hardware MSRs at vcpu_load()
KVM: x86: Save guest [am]perf checkpoint on HLT
KVM: x86: Save guest [am]perf checkpoint on vcpu_put()
KVM: x86: Allow host and guest access to IA32_[AM]PERF
KVM: VMX: Pass through guest reads of IA32_[AM]PERF
KVM: SVM: Pass through guest reads of IA32_[AM]PERF
KVM: x86: Enable guest usage of X86_FEATURE_APERFMPERF
arch/x86/include/asm/kvm_host.h | 11 ++
arch/x86/include/asm/topology.h | 10 ++
arch/x86/kernel/cpu/aperfmperf.c | 65 +++++++++++-
arch/x86/kvm/cpuid.c | 12 ++-
arch/x86/kvm/governed_features.h | 1 +
arch/x86/kvm/lapic.c | 5 +-
arch/x86/kvm/reverse_cpuid.h | 6 ++
arch/x86/kvm/svm/nested.c | 2 +-
arch/x86/kvm/svm/svm.c | 7 ++
arch/x86/kvm/svm/svm.h | 2 +-
arch/x86/kvm/vmx/nested.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 7 ++
arch/x86/kvm/vmx/vmx.h | 2 +-
arch/x86/kvm/x86.c | 171 ++++++++++++++++++++++++++++---
arch/x86/lib/msr-smp.c | 11 ++
drivers/cpufreq/amd-pstate.c | 4 +-
drivers/cpufreq/intel_pstate.c | 5 +-
17 files changed, 295 insertions(+), 28 deletions(-)
base-commit: 0a9b9d17f3a781dea03baca01c835deaa07f7cc3
--
2.47.0.371.ga323438b13-goog
* [RFC PATCH 01/22] x86/aperfmperf: Introduce get_host_[am]perf()
From: Mingwei Zhang @ 2024-11-21 18:52 UTC
From: Jim Mattson <jmattson@google.com>
In preparation for KVM pass-through of IA32_APERF and IA32_MPERF,
introduce wrappers that read these MSRs. Going forward, all kernel
code that needs host APERF/MPERF values should use these wrappers
instead of rdmsrl().
While these functions currently just read the MSRs directly, future
patches will enhance them to handle cases where the MSRs contain guest
values. Moving all host APERF/MPERF reads to use these functions now
will make it easier to add this functionality later.
No functional change intended.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
arch/x86/include/asm/topology.h | 3 +++
arch/x86/kernel/cpu/aperfmperf.c | 22 ++++++++++++++++++----
drivers/cpufreq/amd-pstate.c | 4 ++--
drivers/cpufreq/intel_pstate.c | 5 +++--
4 files changed, 26 insertions(+), 8 deletions(-)
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 92f3664dd933b..2ef9903cf85d7 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -302,6 +302,9 @@ static inline void arch_set_max_freq_ratio(bool turbo_disabled) { }
static inline void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled) { }
#endif
+extern u64 get_host_aperf(void);
+extern u64 get_host_mperf(void);
+
extern void arch_scale_freq_tick(void);
#define arch_scale_freq_tick arch_scale_freq_tick
diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index f642de2ebdac8..3be5070ba3361 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -40,8 +40,8 @@ static void init_counter_refs(void)
{
u64 aperf, mperf;
- rdmsrl(MSR_IA32_APERF, aperf);
- rdmsrl(MSR_IA32_MPERF, mperf);
+ aperf = get_host_aperf();
+ mperf = get_host_mperf();
this_cpu_write(cpu_samples.aperf, aperf);
this_cpu_write(cpu_samples.mperf, mperf);
@@ -94,6 +94,20 @@ void arch_set_max_freq_ratio(bool turbo_disabled)
}
EXPORT_SYMBOL_GPL(arch_set_max_freq_ratio);
+u64 get_host_aperf(void)
+{
+ WARN_ON_ONCE(!irqs_disabled());
+ return native_read_msr(MSR_IA32_APERF);
+}
+EXPORT_SYMBOL_GPL(get_host_aperf);
+
+u64 get_host_mperf(void)
+{
+ WARN_ON_ONCE(!irqs_disabled());
+ return native_read_msr(MSR_IA32_MPERF);
+}
+EXPORT_SYMBOL_GPL(get_host_mperf);
+
static bool __init turbo_disabled(void)
{
u64 misc_en;
@@ -474,8 +488,8 @@ void arch_scale_freq_tick(void)
if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
return;
- rdmsrl(MSR_IA32_APERF, aperf);
- rdmsrl(MSR_IA32_MPERF, mperf);
+ aperf = get_host_aperf();
+ mperf = get_host_mperf();
acnt = aperf - s->aperf;
mcnt = mperf - s->mperf;
diff --git a/drivers/cpufreq/amd-pstate.c b/drivers/cpufreq/amd-pstate.c
index b63863f77c677..c26092284be56 100644
--- a/drivers/cpufreq/amd-pstate.c
+++ b/drivers/cpufreq/amd-pstate.c
@@ -446,8 +446,8 @@ static inline bool amd_pstate_sample(struct amd_cpudata *cpudata)
unsigned long flags;
local_irq_save(flags);
- rdmsrl(MSR_IA32_APERF, aperf);
- rdmsrl(MSR_IA32_MPERF, mperf);
+ aperf = get_host_aperf();
+ mperf = get_host_mperf();
tsc = rdtsc();
if (cpudata->prev.mperf == mperf || cpudata->prev.tsc == tsc) {
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 400337f3b572d..993a66095547f 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -27,6 +27,7 @@
#include <linux/vmalloc.h>
#include <linux/pm_qos.h>
#include <linux/bitfield.h>
+#include <linux/topology.h>
#include <trace/events/power.h>
#include <asm/cpu.h>
@@ -2423,8 +2424,8 @@ static inline bool intel_pstate_sample(struct cpudata *cpu, u64 time)
u64 tsc;
local_irq_save(flags);
- rdmsrl(MSR_IA32_APERF, aperf);
- rdmsrl(MSR_IA32_MPERF, mperf);
+ aperf = get_host_aperf();
+ mperf = get_host_mperf();
tsc = rdtsc();
if (cpu->prev_mperf == mperf || cpu->prev_tsc == tsc) {
local_irq_restore(flags);
--
2.47.0.371.ga323438b13-goog
* [RFC PATCH 02/22] x86/aperfmperf: Introduce set_guest_[am]perf()
From: Mingwei Zhang @ 2024-11-21 18:52 UTC
From: Jim Mattson <jmattson@google.com>
KVM guests need access to IA32_APERF and IA32_MPERF to observe their
effective CPU frequency, but intercepting reads of these MSRs is too
expensive since Linux guests read them every scheduler tick (250 Hz by
default). Allow the guest to read these MSRs without interception by
loading guest values into the hardware MSRs.
When loading a guest value into IA32_APERF or IA32_MPERF:
1. Query the current host value
2. Record the offset between guest and host values in a per-CPU variable
3. Load the guest value into the MSR
Modify get_host_[am]perf() to add the per-CPU offset to the raw MSR
value, so that host kernel code can still obtain correct host values
even when the MSRs contain guest values.
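A worked example with made-up numbers; because the offset arithmetic
is modular in u64, wraparound is harmless:

        guest value loaded:  250,000 (hardware MPERF read 1,000,000
                             at that moment)
        host_mperf_offset  = 1,000,000 - 250,000 = 750,000
        10,000 ticks later:  hardware MPERF = 260,000
        get_host_mperf()   = 260,000 + 750,000 = 1,010,000

Both the guest view and the reconstructed host view advance by the
same 10,000 ticks.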
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
arch/x86/include/asm/topology.h | 5 +++++
arch/x86/kernel/cpu/aperfmperf.c | 31 +++++++++++++++++++++++++++++--
2 files changed, 34 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 2ef9903cf85d7..fef5846c01976 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -302,8 +302,13 @@ static inline void arch_set_max_freq_ratio(bool turbo_disabled) { }
static inline void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled) { }
#endif
+DECLARE_PER_CPU(u64, host_aperf_offset);
+DECLARE_PER_CPU(u64, host_mperf_offset);
+
extern u64 get_host_aperf(void);
extern u64 get_host_mperf(void);
+extern void set_guest_aperf(u64 aperf);
+extern void set_guest_mperf(u64 mperf);
extern void arch_scale_freq_tick(void);
#define arch_scale_freq_tick arch_scale_freq_tick
diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index 3be5070ba3361..8b66872aa98c1 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -94,20 +94,47 @@ void arch_set_max_freq_ratio(bool turbo_disabled)
}
EXPORT_SYMBOL_GPL(arch_set_max_freq_ratio);
+DEFINE_PER_CPU(u64, host_aperf_offset);
+DEFINE_PER_CPU(u64, host_mperf_offset);
+
u64 get_host_aperf(void)
{
WARN_ON_ONCE(!irqs_disabled());
- return native_read_msr(MSR_IA32_APERF);
+ return native_read_msr(MSR_IA32_APERF) +
+ this_cpu_read(host_aperf_offset);
}
EXPORT_SYMBOL_GPL(get_host_aperf);
u64 get_host_mperf(void)
{
WARN_ON_ONCE(!irqs_disabled());
- return native_read_msr(MSR_IA32_MPERF);
+ return native_read_msr(MSR_IA32_MPERF) +
+ this_cpu_read(host_mperf_offset);
}
EXPORT_SYMBOL_GPL(get_host_mperf);
+void set_guest_aperf(u64 guest_aperf)
+{
+ u64 host_aperf;
+
+ WARN_ON_ONCE(!irqs_disabled());
+ host_aperf = get_host_aperf();
+ wrmsrl(MSR_IA32_APERF, guest_aperf);
+ this_cpu_write(host_aperf_offset, host_aperf - guest_aperf);
+}
+EXPORT_SYMBOL_GPL(set_guest_aperf);
+
+void set_guest_mperf(u64 guest_mperf)
+{
+ u64 host_mperf;
+
+ WARN_ON_ONCE(!irqs_disabled());
+ host_mperf = get_host_mperf();
+ wrmsrl(MSR_IA32_MPERF, guest_mperf);
+ this_cpu_write(host_mperf_offset, host_mperf - guest_mperf);
+}
+EXPORT_SYMBOL_GPL(set_guest_mperf);
+
static bool __init turbo_disabled(void)
{
u64 misc_en;
--
2.47.0.371.ga323438b13-goog
* [RFC PATCH 03/22] x86/aperfmperf: Introduce restore_host_[am]perf()
From: Mingwei Zhang @ 2024-11-21 18:52 UTC
From: Jim Mattson <jmattson@google.com>
Round out the {host,guest}[am]perf APIs by adding functions to restore
host values to the hardware MSRs. These functions:
1. Write the current host value (obtained via get_host_[am]perf()) to
the corresponding MSR
2. Clear the per-CPU offset used to track the difference between guest
and host values
Signed-off-by: Jim Mattson <jmattson@google.com>
---
arch/x86/include/asm/topology.h | 2 ++
arch/x86/kernel/cpu/aperfmperf.c | 16 ++++++++++++++++
2 files changed, 18 insertions(+)
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index fef5846c01976..8d4d4cd41bd84 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -309,6 +309,8 @@ extern u64 get_host_aperf(void);
extern u64 get_host_mperf(void);
extern void set_guest_aperf(u64 aperf);
extern void set_guest_mperf(u64 mperf);
+extern void restore_host_aperf(void);
+extern void restore_host_mperf(void);
extern void arch_scale_freq_tick(void);
#define arch_scale_freq_tick arch_scale_freq_tick
diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index 8b66872aa98c1..4d6c0b8b39452 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -135,6 +135,22 @@ void set_guest_mperf(u64 guest_mperf)
}
EXPORT_SYMBOL_GPL(set_guest_mperf);
+void restore_host_aperf(void)
+{
+ WARN_ON_ONCE(!irqs_disabled());
+ wrmsrl(MSR_IA32_APERF, get_host_aperf());
+ this_cpu_write(host_aperf_offset, 0);
+}
+EXPORT_SYMBOL_GPL(restore_host_aperf);
+
+void restore_host_mperf(void)
+{
+ WARN_ON_ONCE(!irqs_disabled());
+ wrmsrl(MSR_IA32_MPERF, get_host_mperf());
+ this_cpu_write(host_mperf_offset, 0);
+}
+EXPORT_SYMBOL_GPL(restore_host_mperf);
+
static bool __init turbo_disabled(void)
{
u64 misc_en;
--
2.47.0.371.ga323438b13-goog
* [RFC PATCH 04/22] x86/msr: Adjust remote reads of IA32_[AM]PERF by the per-cpu host offset
From: Mingwei Zhang @ 2024-11-21 18:52 UTC
From: Jim Mattson <jmattson@google.com>
When reading IA32_APERF or IA32_MPERF remotely via /dev/cpu/*/msr,
account for any offset between the hardware MSR value and the true
host value. This ensures tools like turbostat get correct host values
even when the hardware MSRs contain guest values.
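As a hedged usage sketch ('cpu' is a hypothetical target CPU), a
turbostat-style consumer goes through the SMP helpers that the hunks
below adjust:

        u64 mperf;
        int err;

        err = rdmsrl_safe_on_cpu(cpu, MSR_IA32_MPERF, &mperf);
        /*
         * On success, mperf holds the host view even if a guest
         * value is currently live in the hardware MSR on that CPU.
         */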
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
arch/x86/lib/msr-smp.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/arch/x86/lib/msr-smp.c b/arch/x86/lib/msr-smp.c
index acd463d887e1c..43c5d21e840fb 100644
--- a/arch/x86/lib/msr-smp.c
+++ b/arch/x86/lib/msr-smp.c
@@ -4,6 +4,15 @@
#include <linux/smp.h>
#include <linux/completion.h>
#include <asm/msr.h>
+#include <asm/topology.h>
+
+static void adjust_host_aperfmperf(u32 msr_no, struct msr *reg)
+{
+ if (msr_no == MSR_IA32_APERF)
+ reg->q += this_cpu_read(host_aperf_offset);
+ else if (msr_no == MSR_IA32_MPERF)
+ reg->q += this_cpu_read(host_mperf_offset);
+}
static void __rdmsr_on_cpu(void *info)
{
@@ -16,6 +25,7 @@ static void __rdmsr_on_cpu(void *info)
reg = &rv->reg;
rdmsr(rv->msr_no, reg->l, reg->h);
+ adjust_host_aperfmperf(rv->msr_no, reg);
}
static void __wrmsr_on_cpu(void *info)
@@ -154,6 +164,7 @@ static void __rdmsr_safe_on_cpu(void *info)
struct msr_info_completion *rv = info;
rv->msr.err = rdmsr_safe(rv->msr.msr_no, &rv->msr.reg.l, &rv->msr.reg.h);
+ adjust_host_aperfmperf(rv->msr.msr_no, &rv->msr.reg);
complete(&rv->done);
}
--
2.47.0.371.ga323438b13-goog
* [RFC PATCH 05/22] KVM: x86: Introduce kvm_vcpu_make_runnable()
From: Mingwei Zhang @ 2024-11-21 18:52 UTC
From: Jim Mattson <jmattson@google.com>
Factor out common code that handles the transition from
HALTED/AP_RESET_HOLD to RUNNABLE state. In addition to changing
mp_state, this transition has side effects (clearing pv_unhalted,
apf.halted) which must be handled consistently across all code paths.
As future patches add more side effects to this state transition, this
helper ensures they will be applied uniformly at all transition
points.
No functional change intended.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/x86.c | 16 +++++++++++-----
2 files changed, 13 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6d9f763a7bb9d..04ef56d10cbb1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2283,6 +2283,8 @@ static inline bool kvm_is_supported_user_return_msr(u32 msr)
return kvm_find_user_return_msr(msr) >= 0;
}
+void kvm_vcpu_make_runnable(struct kvm_vcpu *vcpu);
+
u64 kvm_scale_tsc(u64 tsc, u64 ratio);
u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc);
u64 kvm_calc_nested_tsc_offset(u64 l1_offset, u64 l2_offset, u64 l2_multiplier);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 83fe0a78146fc..3c6b0ca91e5f5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11167,6 +11167,16 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
return kvm_vcpu_running(vcpu) || kvm_vcpu_has_events(vcpu);
}
+void kvm_vcpu_make_runnable(struct kvm_vcpu *vcpu)
+{
+ if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED ||
+ vcpu->arch.mp_state == KVM_MP_STATE_AP_RESET_HOLD)
+ vcpu->arch.pv.pv_unhalted = false;
+ vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+ vcpu->arch.apf.halted = false;
+}
+EXPORT_SYMBOL_GPL(kvm_vcpu_make_runnable);
+
/* Called within kvm->srcu read side. */
static inline int vcpu_block(struct kvm_vcpu *vcpu)
{
@@ -11222,12 +11232,8 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
switch(vcpu->arch.mp_state) {
case KVM_MP_STATE_HALTED:
case KVM_MP_STATE_AP_RESET_HOLD:
- vcpu->arch.pv.pv_unhalted = false;
- vcpu->arch.mp_state =
- KVM_MP_STATE_RUNNABLE;
- fallthrough;
case KVM_MP_STATE_RUNNABLE:
- vcpu->arch.apf.halted = false;
+ kvm_vcpu_make_runnable(vcpu);
break;
case KVM_MP_STATE_INIT_RECEIVED:
break;
--
2.47.0.371.ga323438b13-goog
* [RFC PATCH 06/22] KVM: x86: INIT may transition from HALTED to RUNNABLE
From: Mingwei Zhang @ 2024-11-21 18:52 UTC
From: Jim Mattson <jmattson@google.com>
When a halted vCPU is awakened by an INIT signal, it might have been
the target of a previous KVM_HC_KICK_CPU hypercall, in which case
pv_unhalted would be set. This flag should be cleared before the next
HLT instruction, as kvm_vcpu_has_events() would otherwise return true
and prevent the vCPU from entering the halt state.
Use kvm_vcpu_make_runnable() to ensure consistent handling of the
HALTED to RUNNABLE state transition.
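The race, sketched with hypothetical vCPU numbering:

        vCPU1: HLT                     -> mp_state = HALTED
        vCPU0: KVM_HC_KICK_CPU(vCPU1)  -> vCPU1 pv.pv_unhalted = true
        vCPU1: INIT                    -> old code: mp_state = RUNNABLE,
                                          but pv_unhalted stays set
        vCPU1: next HLT                -> kvm_vcpu_has_events() returns
                                          true; the vCPU cannot halt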
Fixes: 6aef266c6e17 ("kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks")
Signed-off-by: Jim Mattson <jmattson@google.com>
---
arch/x86/kvm/lapic.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 95c6beb8ce279..97aa634505306 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -3372,9 +3372,8 @@ int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
if (test_and_clear_bit(KVM_APIC_INIT, &apic->pending_events)) {
kvm_vcpu_reset(vcpu, true);
- if (kvm_vcpu_is_bsp(apic->vcpu))
- vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
- else
+ kvm_vcpu_make_runnable(vcpu);
+ if (!kvm_vcpu_is_bsp(apic->vcpu))
vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
}
if (test_and_clear_bit(KVM_APIC_SIPI, &apic->pending_events)) {
--
2.47.0.371.ga323438b13-goog
* [RFC PATCH 07/22] KVM: nSVM: Nested #VMEXIT may transition from HALTED to RUNNABLE
From: Mingwei Zhang @ 2024-11-21 18:52 UTC
From: Jim Mattson <jmattson@google.com>
When a halted vCPU is awakened by a nested event, it might have been
the target of a previous KVM_HC_KICK_CPU hypercall, in which case
pv_unhalted would be set. This flag should be cleared before the next
HLT instruction, as kvm_vcpu_has_events() would otherwise return true
and prevent the vCPU from entering the halt state.
Use kvm_vcpu_make_runnable() to ensure consistent handling of the
HALTED to RUNNABLE state transition.
Fixes: 38c0b192bd6d ("KVM: SVM: leave halted state on vmexit")
Signed-off-by: Jim Mattson <jmattson@google.com>
---
arch/x86/kvm/svm/nested.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index cf84103ce38b9..49e6cdfeac4da 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -994,7 +994,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
/* in case we halted in L2 */
- svm->vcpu.arch.mp_state = KVM_MP_STATE_RUNNABLE;
+ kvm_vcpu_make_runnable(vcpu);
/* Give the current vmcb to the guest */
--
2.47.0.371.ga323438b13-goog
* [RFC PATCH 08/22] KVM: nVMX: Nested VM-exit may transition from HALTED to RUNNABLE
From: Mingwei Zhang @ 2024-11-21 18:53 UTC
From: Jim Mattson <jmattson@google.com>
When a halted vCPU is awakened by a nested event, it might have been
the target of a previous KVM_HC_KICK_CPU hypercall, in which case
pv_unhalted would be set. This flag should be cleared before the next
HLT instruction, as kvm_vcpu_has_events() would otherwise return true
and prevent the vCPU from entering the halt state.
Use kvm_vcpu_make_runnable() to ensure consistent handling of the
HALTED to RUNNABLE state transition.
Fixes: b6b8a1451fc4 ("KVM: nVMX: Rework interception of IRQs and NMIs")
Signed-off-by: Jim Mattson <jmattson@google.com>
---
arch/x86/kvm/vmx/nested.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 931a7361c30f2..202eacfd87036 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -5048,7 +5048,7 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
vmx->nested.need_vmcs12_to_shadow_sync = true;
/* in case we halted in L2 */
- vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+ kvm_vcpu_make_runnable(vcpu);
if (likely(!vmx->fail)) {
if (vm_exit_reason != -1)
--
2.47.0.371.ga323438b13-goog
* [RFC PATCH 09/22] KVM: x86: Introduce KVM_X86_FEATURE_APERFMPERF
From: Mingwei Zhang @ 2024-11-21 18:53 UTC
The APERFMPERF feature bit appears in CPUID.6H:ECX[0], but is exposed to
Linux via X86_FEATURE_APERFMPERF in a Linux-defined feature word. To
enable KVM's reverse CPUID functionality for this feature, define a KVM
feature word matching the hardware CPUID leaf that contains the APERFMPERF
bit.
This patch only adds the feature definition; enabling and advertising
the feature to guests will be handled in later patches.
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Co-developed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
---
arch/x86/kvm/reverse_cpuid.h | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/arch/x86/kvm/reverse_cpuid.h b/arch/x86/kvm/reverse_cpuid.h
index 0d17d6b706396..d12e5d9ab2251 100644
--- a/arch/x86/kvm/reverse_cpuid.h
+++ b/arch/x86/kvm/reverse_cpuid.h
@@ -18,6 +18,7 @@ enum kvm_only_cpuid_leafs {
CPUID_8000_0022_EAX,
CPUID_7_2_EDX,
CPUID_24_0_EBX,
+ CPUID_6_ECX,
NR_KVM_CPU_CAPS,
NKVMCAPINTS = NR_KVM_CPU_CAPS - NCAPINTS,
@@ -42,6 +43,9 @@ enum kvm_only_cpuid_leafs {
#define KVM_X86_FEATURE_SGX2 KVM_X86_FEATURE(CPUID_12_EAX, 1)
#define KVM_X86_FEATURE_SGX_EDECCSSA KVM_X86_FEATURE(CPUID_12_EAX, 11)
+/* Intel-defined sub-features, CPUID level 0x00000006 (ECX) */
+#define KVM_X86_FEATURE_APERFMPERF KVM_X86_FEATURE(CPUID_6_ECX, 0)
+
/* Intel-defined sub-features, CPUID level 0x00000007:1 (EDX) */
#define X86_FEATURE_AVX_VNNI_INT8 KVM_X86_FEATURE(CPUID_7_1_EDX, 4)
#define X86_FEATURE_AVX_NE_CONVERT KVM_X86_FEATURE(CPUID_7_1_EDX, 5)
@@ -98,6 +102,7 @@ static const struct cpuid_reg reverse_cpuid[] = {
[CPUID_8000_0022_EAX] = {0x80000022, 0, CPUID_EAX},
[CPUID_7_2_EDX] = { 7, 2, CPUID_EDX},
[CPUID_24_0_EBX] = { 0x24, 0, CPUID_EBX},
+ [CPUID_6_ECX] = { 6, 0, CPUID_ECX},
};
/*
@@ -135,6 +140,7 @@ static __always_inline u32 __feature_translate(int x86_feature)
KVM_X86_TRANSLATE_FEATURE(SGX_EDECCSSA);
KVM_X86_TRANSLATE_FEATURE(CONSTANT_TSC);
KVM_X86_TRANSLATE_FEATURE(PERFMON_V2);
+ KVM_X86_TRANSLATE_FEATURE(APERFMPERF);
KVM_X86_TRANSLATE_FEATURE(RRSBA_CTRL);
KVM_X86_TRANSLATE_FEATURE(BHI_CTRL);
default:
--
2.47.0.371.ga323438b13-goog
* [RFC PATCH 10/22] KVM: x86: Make APERFMPERF a governed feature
From: Mingwei Zhang @ 2024-11-21 18:53 UTC
From: Jim Mattson <jmattson@google.com>
KVM must verify both host support and guest CPUID enumeration before
enabling guest access to APERFMPERF.
This verification is deferred until the implementation of guest
APERFMPERF is complete.
This declaration enables "guest_can_use(vcpu,
X86_FEATURE_APERFMPERF)"; without it, that expression results in a
compile-time error.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
arch/x86/kvm/governed_features.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/kvm/governed_features.h b/arch/x86/kvm/governed_features.h
index ad463b1ed4e4a..fa77d655d2355 100644
--- a/arch/x86/kvm/governed_features.h
+++ b/arch/x86/kvm/governed_features.h
@@ -17,6 +17,7 @@ KVM_GOVERNED_X86_FEATURE(PFTHRESHOLD)
KVM_GOVERNED_X86_FEATURE(VGIF)
KVM_GOVERNED_X86_FEATURE(VNMI)
KVM_GOVERNED_X86_FEATURE(LAM)
+KVM_GOVERNED_X86_FEATURE(APERFMPERF)
#undef KVM_GOVERNED_X86_FEATURE
#undef KVM_GOVERNED_FEATURE
--
2.47.0.371.ga323438b13-goog
* [RFC PATCH 11/22] KVM: x86: Initialize guest [am]perf at vcpu power-on
From: Mingwei Zhang @ 2024-11-21 18:53 UTC
From: Jim Mattson <jmattson@google.com>
The guest's IA32_APERF and IA32_MPERF MSRs start at zero. However,
IA32_MPERF should be incremented whenever the vCPU is in C0, just as
the host's IA32_MPERF MSR is incremented by hardware.
Record the host TSC at vcpu_reset() to start tracking time spent in C0.
Later patches will add the host TSC delta to the guest's stored IA32_MPERF
value at appropriate points.
Signed-off-by: Jim Mattson <jmattson@google.com>
---
arch/x86/include/asm/kvm_host.h | 9 +++++++++
arch/x86/kvm/x86.c | 7 +++++++
2 files changed, 16 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 04ef56d10cbb1..067e6ec7f7e9c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -738,6 +738,13 @@ struct kvm_queued_exception {
bool has_payload;
};
+struct kvm_vcpu_aperfmperf {
+ u64 guest_aperf;
+ u64 guest_mperf;
+ u64 host_tsc;
+ bool loaded_while_running;
+};
+
struct kvm_vcpu_arch {
/*
* rip and regs accesses must go through
@@ -1040,6 +1047,8 @@ struct kvm_vcpu_arch {
#if IS_ENABLED(CONFIG_HYPERV)
hpa_t hv_root_tdp;
#endif
+
+ struct kvm_vcpu_aperfmperf aperfmperf;
};
struct kvm_lpage_info {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3c6b0ca91e5f5..d66cccff13347 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12476,6 +12476,13 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
__kvm_set_xcr(vcpu, 0, XFEATURE_MASK_FP);
__kvm_set_msr(vcpu, MSR_IA32_XSS, 0, true);
+
+ /*
+ * IA32_MPERF should start running now. Record the host TSC
+ * so that we can add the host TSC delta the next time that
+ * we load the guest [am]perf values into the hardware MSRs.
+ */
+ vcpu->arch.aperfmperf.host_tsc = rdtsc();
}
/* All GPRs except RDX (handled below) are zeroed on RESET/INIT. */
--
2.47.0.371.ga323438b13-goog
* [RFC PATCH 12/22] KVM: x86: Load guest [am]perf into hardware MSRs at vcpu_load()
From: Mingwei Zhang @ 2024-11-21 18:53 UTC
For vCPUs with APERFMPERF and in KVM_RUN, load the guest IA32_APERF
and IA32_MPERF values into the hardware MSRs when loading the vCPU,
but only if the vCPU is not halted. For running vCPUs, first add in
any "background" C0 cycles accumulated since the last checkpoint.
Note that for host TSC measurements of background C0 cycles, we assume
IA32_MPERF increments at the same frequency as TSC. While this is true
for all known processors with these MSRs, it is not architecturally
guaranteed.
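Restated as a decision table (all rows also require
vcpu->wants_to_run and the APERFMPERF governed feature), the
condition added to kvm_arch_vcpu_load() below behaves as follows:

        scheduled_out  loaded_while_running  mp_state  action
        yes            yes                   any       reload guest values
        yes            no                    any       keep host values
                                                       (halted before preemption)
        no             n/a                   !HALTED   load guest values
        no             n/a                   HALTED    keep host values until
                                                       the vCPU leaves halt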
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Co-developed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
---
arch/x86/kvm/x86.c | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d66cccff13347..b914578718d9c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1918,6 +1918,22 @@ static int kvm_set_msr_ignored_check(struct kvm_vcpu *vcpu,
_kvm_set_msr);
}
+/*
+ * Add elapsed TSC ticks to guest IA32_MPERF while vCPU is in C0 but
+ * not running. Uses TSC instead of host MPERF to include time when
+ * physical CPU is in lower C-states, as guest MPERF should count
+ * whenever vCPU is in C0. Assumes TSC and MPERF frequencies match.
+ */
+static void kvm_accumulate_background_guest_mperf(struct kvm_vcpu *vcpu)
+{
+ u64 now = rdtsc();
+ s64 tsc_delta = now - vcpu->arch.aperfmperf.host_tsc;
+
+ if (tsc_delta > 0)
+ vcpu->arch.aperfmperf.guest_mperf += tsc_delta;
+ vcpu->arch.aperfmperf.host_tsc = now;
+}
+
/*
* Read the MSR specified by @index into @data. Select MSR specific fault
* checks are bypassed if @host_initiated is %true.
@@ -4980,6 +4996,19 @@ static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu)
return kvm_arch_has_noncoherent_dma(vcpu->kvm);
}
+static void kvm_load_guest_aperfmperf(struct kvm_vcpu *vcpu, bool update_mperf)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ if (update_mperf)
+ kvm_accumulate_background_guest_mperf(vcpu);
+ set_guest_aperf(vcpu->arch.aperfmperf.guest_aperf);
+ set_guest_mperf(vcpu->arch.aperfmperf.guest_mperf);
+ vcpu->arch.aperfmperf.loaded_while_running = true;
+ local_irq_restore(flags);
+}
+
void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -5039,6 +5068,12 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
vcpu->cpu = cpu;
}
+ if (vcpu->wants_to_run &&
+ guest_can_use(vcpu, X86_FEATURE_APERFMPERF) &&
+ (vcpu->scheduled_out ? vcpu->arch.aperfmperf.loaded_while_running :
+ (vcpu->arch.mp_state != KVM_MP_STATE_HALTED)))
+ kvm_load_guest_aperfmperf(vcpu, true);
+
kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
}
--
2.47.0.371.ga323438b13-goog
* [RFC PATCH 13/22] KVM: x86: Load guest [am]perf when leaving halt state
From: Mingwei Zhang @ 2024-11-21 18:53 UTC
From: Jim Mattson <jmattson@google.com>
When a vCPU transitions from HALTED to RUNNABLE, it resumes
timekeeping: its virtual IA32_MPERF MSR should start accumulating C0
cycles again. Load the guest values into the hardware MSRs for direct
guest access. Background cycle accumulation is unnecessary at this
point: the vCPU has been in C1, so its IA32_MPERF was stopped.
Signed-off-by: Jim Mattson <jmattson@google.com>
---
arch/x86/kvm/x86.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b914578718d9c..acfa9ecc5bc36 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11204,9 +11204,16 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
void kvm_vcpu_make_runnable(struct kvm_vcpu *vcpu)
{
- if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED ||
- vcpu->arch.mp_state == KVM_MP_STATE_AP_RESET_HOLD)
+ switch (vcpu->arch.mp_state) {
+ case KVM_MP_STATE_HALTED:
+ if (guest_can_use(vcpu, X86_FEATURE_APERFMPERF) &&
+ vcpu->wants_to_run)
+ kvm_load_guest_aperfmperf(vcpu, false);
+ fallthrough;
+ case KVM_MP_STATE_AP_RESET_HOLD:
vcpu->arch.pv.pv_unhalted = false;
+ break;
+ }
vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
vcpu->arch.apf.halted = false;
}
--
2.47.0.371.ga323438b13-goog
* [RFC PATCH 14/22] KVM: x86: Introduce kvm_user_return_notifier_register()
From: Mingwei Zhang @ 2024-11-21 18:53 UTC
From: Jim Mattson <jmattson@google.com>
Factor out user return notifier registration logic from
kvm_set_user_return_msr() into a separate helper function to prepare
for the registration of KVM's user return notifier from
kvm_load_guest_aperfmperf().
No functional change intended.
Signed-off-by: Jim Mattson <jmattson@google.com>
---
arch/x86/kvm/x86.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index acfa9ecc5bc36..6df8f21b83eb1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -637,6 +637,15 @@ static void kvm_user_return_msr_cpu_online(void)
}
}
+static void kvm_user_return_notifier_register(struct kvm_user_return_msrs *msrs)
+{
+ if (!msrs->registered) {
+ msrs->urn.on_user_return = kvm_on_user_return;
+ user_return_notifier_register(&msrs->urn);
+ msrs->registered = true;
+ }
+}
+
int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
{
struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);
@@ -650,11 +659,7 @@ int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
return 1;
msrs->values[slot].curr = value;
- if (!msrs->registered) {
- msrs->urn.on_user_return = kvm_on_user_return;
- user_return_notifier_register(&msrs->urn);
- msrs->registered = true;
- }
+ kvm_user_return_notifier_register(msrs);
return 0;
}
EXPORT_SYMBOL_GPL(kvm_set_user_return_msr);
--
2.47.0.371.ga323438b13-goog
* [RFC PATCH 15/22] KVM: x86: Restore host IA32_[AM]PERF on userspace return
From: Mingwei Zhang @ 2024-11-21 18:53 UTC
From: Jim Mattson <jmattson@google.com>
Add support for restoring host IA32_APERF and IA32_MPERF values when
returning to userspace. While not strictly necessary since reads of
/dev/cpu/*/msr now reconstruct host values, restoring the host values
maintains cleaner system state.
Leverage KVM's existing user return notifier infrastructure, but add
a separate flag, since these MSRs require dynamic rather than static
value restoration. Restoration is only performed when guest values
have been loaded into the hardware MSRs.
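A hedged sketch of the resulting lifecycle:

        kvm_load_guest_aperfmperf()  -- set_guest_[am]perf(), register
                                        the notifier, set
                                        restore_aperfmperf = true
        ... guest runs, KVM_RUN returns to userspace ...
        kvm_on_user_return()         -- restore_host_aperf() and
                                        restore_host_mperf(),
                                        restore_aperfmperf = false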
Signed-off-by: Jim Mattson <jmattson@google.com>
---
arch/x86/kvm/x86.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6df8f21b83eb1..ad5351673362c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -207,6 +207,7 @@ module_param(mitigate_smt_rsb, bool, 0444);
struct kvm_user_return_msrs {
struct user_return_notifier urn;
bool registered;
+ bool restore_aperfmperf;
struct kvm_user_return_msr_values {
u64 host;
u64 curr;
@@ -571,6 +572,11 @@ static void kvm_on_user_return(struct user_return_notifier *urn)
* interrupted and executed through kvm_arch_disable_virtualization_cpu()
*/
local_irq_save(flags);
+ if (msrs->restore_aperfmperf) {
+ restore_host_aperf();
+ restore_host_mperf();
+ msrs->restore_aperfmperf = false;
+ }
if (msrs->registered) {
msrs->registered = false;
user_return_notifier_unregister(urn);
@@ -5003,6 +5009,7 @@ static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu)
static void kvm_load_guest_aperfmperf(struct kvm_vcpu *vcpu, bool update_mperf)
{
+ struct kvm_user_return_msrs *msrs;
unsigned long flags;
local_irq_save(flags);
@@ -5011,6 +5018,9 @@ static void kvm_load_guest_aperfmperf(struct kvm_vcpu *vcpu, bool update_mperf)
set_guest_aperf(vcpu->arch.aperfmperf.guest_aperf);
set_guest_mperf(vcpu->arch.aperfmperf.guest_mperf);
vcpu->arch.aperfmperf.loaded_while_running = true;
+ msrs = this_cpu_ptr(user_return_msrs);
+ kvm_user_return_notifier_register(msrs);
+ msrs->restore_aperfmperf = true;
local_irq_restore(flags);
}
--
2.47.0.371.ga323438b13-goog
* [RFC PATCH 16/22] KVM: x86: Save guest [am]perf checkpoint on HLT
From: Mingwei Zhang @ 2024-11-21 18:53 UTC
When the guest executes HLT, the vCPU transitions from virtual C0 to
C1 state. Its virtual IA32_APERF and IA32_MPERF MSRs should stop
counting at this point, just as the host's MSRs stop when it enters
C1.
Save a checkpoint of the current hardware MSR values and host
TSC. Later, if/when the vCPU becomes runnable again, we will start
accumulating C0 cycles from this checkpoint.
To avoid complications, also restore the host MSR values at this time.
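The sequence implemented by kvm_put_guest_aperfmperf() below, in
outline:

        1. rdmsr the hardware APERF/MPERF into the guest_[am]perf
           snapshot
        2. host_tsc = rdtsc(), checkpointing the moment the counters
           froze
        3. if mp_state == HALTED, clear loaded_while_running so that
           vcpu_load() will not reload guest values until the vCPU
           leaves the halt state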
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Co-developed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
---
arch/x86/kvm/x86.c | 22 ++++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ad5351673362c..793f5d2afeb2b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5139,6 +5139,21 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
mark_page_dirty_in_slot(vcpu->kvm, ghc->memslot, gpa_to_gfn(ghc->gpa));
}
+static void kvm_put_guest_aperfmperf(struct kvm_vcpu *vcpu)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ if (vcpu->arch.aperfmperf.loaded_while_running) {
+ rdmsrl(MSR_IA32_APERF, vcpu->arch.aperfmperf.guest_aperf);
+ rdmsrl(MSR_IA32_MPERF, vcpu->arch.aperfmperf.guest_mperf);
+ vcpu->arch.aperfmperf.host_tsc = rdtsc();
+ if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED)
+ vcpu->arch.aperfmperf.loaded_while_running = false;
+ }
+ local_irq_restore(flags);
+}
+
void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
{
int idx;
@@ -11363,10 +11378,13 @@ static int __kvm_emulate_halt(struct kvm_vcpu *vcpu, int state, int reason)
*/
++vcpu->stat.halt_exits;
if (lapic_in_kernel(vcpu)) {
- if (kvm_vcpu_has_events(vcpu))
+ if (kvm_vcpu_has_events(vcpu)) {
vcpu->arch.pv.pv_unhalted = false;
- else
+ } else {
vcpu->arch.mp_state = state;
+ if (guest_can_use(vcpu, X86_FEATURE_APERFMPERF))
+ kvm_put_guest_aperfmperf(vcpu);
+ }
return 1;
} else {
vcpu->run->exit_reason = reason;
--
2.47.0.371.ga323438b13-goog
* [RFC PATCH 17/22] KVM: x86: Save guest [am]perf checkpoint on vcpu_put()
From: Mingwei Zhang @ 2024-11-21 18:53 UTC
For vCPUs with APERFMPERF that are in KVM_RUN and not halted,
checkpoint the current hardware MSR values along with the host TSC
when unloading the vCPU. The vCPU remains in virtual C0, but it will
no longer run on this physical CPU, so the two counters require
different handling:
- IA32_APERF should stop accumulating since no actual CPU cycles are
being spent on behalf of the guest
- IA32_MPERF should continue accumulating cycles since the guest is
still in C0 state
Later when the vCPU is reloaded, we'll use this checkpoint and the
host TSC delta to properly account for any "background" cycles that
should be reflected in the guest's IA32_MPERF value.
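A hypothetical timeline in TSC ticks (assuming, as earlier in the
series, that MPERF ticks at the TSC rate):

        t0: vcpu_put while runnable -> snapshot APERF=A0, MPERF=M0,
                                       host_tsc=t0
        t0..t1: vCPU not running    -> hardware MSRs hold other values
        t1: vcpu_load               -> guest_mperf = M0 + (t1 - t0)
                                       guest_aperf = A0 (unchanged)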
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Co-developed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
---
arch/x86/kvm/x86.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 793f5d2afeb2b..7c22bda3b1f7b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5173,6 +5173,11 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
srcu_read_unlock(&vcpu->kvm->srcu, idx);
}
+ if (vcpu->wants_to_run &&
+ guest_can_use(vcpu, X86_FEATURE_APERFMPERF) &&
+ vcpu->arch.aperfmperf.loaded_while_running)
+ kvm_put_guest_aperfmperf(vcpu);
+
kvm_x86_call(vcpu_put)(vcpu);
vcpu->arch.last_host_tsc = rdtsc();
}
--
2.47.0.371.ga323438b13-goog
* [RFC PATCH 18/22] KVM: x86: Update aperfmperf on host-initiated MP_STATE transitions
From: Mingwei Zhang @ 2024-11-21 18:53 UTC
From: Jim Mattson <jmattson@google.com>
When the host modifies a vCPU's MP_STATE after the vCPU has started
running, maintain the accuracy of guest aperfmperf tracking:
1. For transitions from !HALTED to HALTED, add any accumulated
"background" TSC ticks to the guest_mperf checkpoint before
stopping the counter.
2. For transitions from HALTED to !HALTED, record the current TSC in
host_tsc to begin accumulating background cycles in guest_mperf.
This ensures the guest MPERF counter properly reflects time spent in
C0 vs C1 states, even when state transitions are initiated by the host
rather than the guest.
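A hedged userspace sketch of the host-initiated transition handled
below (vcpu_fd and error handling omitted; needs <linux/kvm.h> and
<sys/ioctl.h>):

        struct kvm_mp_state st = { .mp_state = KVM_MP_STATE_HALTED };

        /*
         * !HALTED -> HALTED: KVM first folds the accumulated
         * background TSC ticks into guest_mperf, then stops the
         * counter.
         */
        ioctl(vcpu_fd, KVM_SET_MP_STATE, &st);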
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
arch/x86/kvm/x86.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7c22bda3b1f7b..cd1f1ae86f83f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11904,6 +11904,18 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
mp_state->mp_state == KVM_MP_STATE_INIT_RECEIVED))
goto out;
+ if (kvm_vcpu_has_run(vcpu) &&
+ guest_can_use(vcpu, X86_FEATURE_APERFMPERF)) {
+ if (mp_state->mp_state == KVM_MP_STATE_HALTED &&
+ vcpu->arch.mp_state != KVM_MP_STATE_HALTED) {
+ kvm_accumulate_background_guest_mperf(vcpu);
+ vcpu->arch.aperfmperf.loaded_while_running = false;
+ } else if (mp_state->mp_state != KVM_MP_STATE_HALTED &&
+ vcpu->arch.mp_state == KVM_MP_STATE_HALTED) {
+ vcpu->arch.aperfmperf.host_tsc = rdtsc();
+ }
+ }
+
if (mp_state->mp_state == KVM_MP_STATE_SIPI_RECEIVED) {
vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
set_bit(KVM_APIC_SIPI, &vcpu->arch.apic->pending_events);
--
2.47.0.371.ga323438b13-goog
* [RFC PATCH 19/22] KVM: x86: Allow host and guest access to IA32_[AM]PERF
From: Mingwei Zhang @ 2024-11-21 18:53 UTC
Implement MSR read/write handlers for IA32_APERF and IA32_MPERF to
support both host and guest access:
- Host userspace access via KVM_[GS]ET_MSRS only reads/writes the
snapshot values in vcpu->arch.aperfmperf
- Guest writes update both the hardware MSRs (via set_guest_[am]perf)
and the snapshots
- For host-initiated writes of IA32_MPERF, record the current TSC to
establish a new baseline for background cycle accumulation
- Guest reads don't reach these handlers as they access the MSRs directly
Add both MSRs to msrs_to_save_base[] to ensure they are properly
serialized during vCPU state save/restore operations.
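A hedged userspace sketch of a host-initiated read, which reaches the
kvm_get_msr_common() handler below rather than the hardware MSR
(vcpu_fd and error handling omitted; needs <linux/kvm.h> and
<sys/ioctl.h>):

        struct {
                struct kvm_msrs hdr;
                struct kvm_msr_entry entries[2];
        } msrs = {
                .hdr = { .nmsrs = 2 },
                .entries = {
                        { .index = 0xe7 },  /* MSR_IA32_MPERF */
                        { .index = 0xe8 },  /* MSR_IA32_APERF */
                },
        };

        ioctl(vcpu_fd, KVM_GET_MSRS, &msrs);
        /* msrs.entries[i].data now holds the guest snapshot values */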
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Co-developed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
---
arch/x86/kvm/x86.c | 42 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 42 insertions(+)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cd1f1ae86f83f..4394ecb291401 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -334,6 +334,7 @@ static const u32 msrs_to_save_base[] = {
MSR_IA32_UMWAIT_CONTROL,
MSR_IA32_XFD, MSR_IA32_XFD_ERR,
+ MSR_IA32_APERF, MSR_IA32_MPERF,
};
static const u32 msrs_to_save_pmu[] = {
@@ -4151,6 +4152,26 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 1;
vcpu->arch.msr_misc_features_enables = data;
break;
+ case MSR_IA32_APERF:
+ if ((data || !msr_info->host_initiated) &&
+ !guest_can_use(vcpu, X86_FEATURE_APERFMPERF))
+ return 1;
+
+ vcpu->arch.aperfmperf.guest_aperf = data;
+ if (unlikely(!msr_info->host_initiated))
+ set_guest_aperf(data);
+ break;
+ case MSR_IA32_MPERF:
+ if ((data || !msr_info->host_initiated) &&
+ !guest_can_use(vcpu, X86_FEATURE_APERFMPERF))
+ return 1;
+
+ vcpu->arch.aperfmperf.guest_mperf = data;
+ if (likely(msr_info->host_initiated))
+ vcpu->arch.aperfmperf.host_tsc = rdtsc();
+ else
+ set_guest_mperf(data);
+ break;
#ifdef CONFIG_X86_64
case MSR_IA32_XFD:
if (!msr_info->host_initiated &&
@@ -4524,6 +4545,22 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
msr_info->data = vcpu->arch.guest_fpu.xfd_err;
break;
#endif
+ case MSR_IA32_APERF:
+ /* Guest read access should never reach here. */
+ if (!msr_info->host_initiated)
+ return 1;
+
+ msr_info->data = vcpu->arch.aperfmperf.guest_aperf;
+ break;
+ case MSR_IA32_MPERF:
+ /* Guest read access should never reach here. */
+ if (!msr_info->host_initiated)
+ return 1;
+
+ if (vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
+ kvm_accumulate_background_guest_mperf(vcpu);
+ msr_info->data = vcpu->arch.aperfmperf.guest_mperf;
+ break;
default:
if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
return kvm_pmu_get_msr(vcpu, msr_info);
@@ -7535,6 +7572,11 @@ static void kvm_probe_msr_to_save(u32 msr_index)
if (!(kvm_get_arch_capabilities() & ARCH_CAP_TSX_CTRL_MSR))
return;
break;
+ case MSR_IA32_APERF:
+ case MSR_IA32_MPERF:
+ if (!kvm_cpu_cap_has(KVM_X86_FEATURE_APERFMPERF))
+ return;
+ break;
default:
break;
}
--
2.47.0.371.ga323438b13-goog
* [RFC PATCH 20/22] KVM: VMX: Pass through guest reads of IA32_[AM]PERF
2024-11-21 18:52 [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs Mingwei Zhang
` (18 preceding siblings ...)
2024-11-21 18:53 ` [RFC PATCH 19/22] KVM: x86: Allow host and guest access to IA32_[AM]PERF Mingwei Zhang
@ 2024-11-21 18:53 ` Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 21/22] KVM: SVM: " Mingwei Zhang
` (3 subsequent siblings)
23 siblings, 0 replies; 35+ messages in thread
From: Mingwei Zhang @ 2024-11-21 18:53 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Huang Rui, Gautham R. Shenoy,
Mario Limonciello, Rafael J. Wysocki, Viresh Kumar,
Srinivas Pandruvada, Len Brown
Cc: H. Peter Anvin, Perry Yuan, kvm, linux-kernel, linux-pm,
Jim Mattson, Mingwei Zhang
Allow direct guest rdmsr access to IA32_APERF and IA32_MPERF while
continuing to intercept guest wrmsr. Direct read access reduces overhead
for guests that poll these MSRs frequently (e.g. every scheduler tick),
while write interception remains necessary to maintain proper host
offset tracking.
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Co-developed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
---
arch/x86/kvm/vmx/vmx.c | 7 +++++++
arch/x86/kvm/vmx/vmx.h | 2 +-
2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d28618e9277ed..07f013912370e 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -186,6 +186,8 @@ static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
MSR_CORE_C3_RESIDENCY,
MSR_CORE_C6_RESIDENCY,
MSR_CORE_C7_RESIDENCY,
+ MSR_IA32_APERF,
+ MSR_IA32_MPERF,
};
/*
@@ -7871,6 +7873,11 @@ void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
vmx_set_intercept_for_msr(vcpu, MSR_IA32_FLUSH_CMD, MSR_TYPE_W,
!guest_cpuid_has(vcpu, X86_FEATURE_FLUSH_L1D));
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_APERF, MSR_TYPE_R,
+ !guest_can_use(vcpu, X86_FEATURE_APERFMPERF));
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_MPERF, MSR_TYPE_R,
+ !guest_can_use(vcpu, X86_FEATURE_APERFMPERF));
+
set_cr4_guest_host_mask(vmx);
vmx_write_encls_bitmap(vcpu, NULL);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 2325f773a20be..929f153cdcbae 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -356,7 +356,7 @@ struct vcpu_vmx {
struct lbr_desc lbr_desc;
/* Save desired MSR intercept (read: pass-through) state */
-#define MAX_POSSIBLE_PASSTHROUGH_MSRS 16
+#define MAX_POSSIBLE_PASSTHROUGH_MSRS 18
struct {
DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
--
2.47.0.371.ga323438b13-goog
* [RFC PATCH 21/22] KVM: SVM: Pass through guest reads of IA32_[AM]PERF
2024-11-21 18:52 [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs Mingwei Zhang
` (19 preceding siblings ...)
2024-11-21 18:53 ` [RFC PATCH 20/22] KVM: VMX: Pass through guest reads of IA32_[AM]PERF Mingwei Zhang
@ 2024-11-21 18:53 ` Mingwei Zhang
2024-11-21 18:53 ` [RFC PATCH 22/22] KVM: x86: Enable guest usage of X86_FEATURE_APERFMPERF Mingwei Zhang
` (2 subsequent siblings)
23 siblings, 0 replies; 35+ messages in thread
From: Mingwei Zhang @ 2024-11-21 18:53 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Huang Rui, Gautham R. Shenoy,
Mario Limonciello, Rafael J. Wysocki, Viresh Kumar,
Srinivas Pandruvada, Len Brown
Cc: H. Peter Anvin, Perry Yuan, kvm, linux-kernel, linux-pm,
Jim Mattson, Mingwei Zhang
Allow direct guest rdmsr access to IA32_APERF and IA32_MPERF while
continuing to intercept guest wrmsr. Direct read access reduces overhead
for guests that poll these MSRs frequently (e.g. every scheduler tick),
while write interception remains necessary to maintain proper host
offset tracking.
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Co-developed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
---
arch/x86/kvm/svm/svm.c | 7 +++++++
arch/x86/kvm/svm/svm.h | 2 +-
2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 9df3e1e5ae81a..332947e0e9670 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -142,6 +142,8 @@ static const struct svm_direct_access_msrs {
{ .index = X2APIC_MSR(APIC_TMICT), .always = false },
{ .index = X2APIC_MSR(APIC_TMCCT), .always = false },
{ .index = X2APIC_MSR(APIC_TDCR), .always = false },
+ { .index = MSR_IA32_APERF, .always = false },
+ { .index = MSR_IA32_MPERF, .always = false },
{ .index = MSR_INVALID, .always = false },
};
@@ -1231,6 +1233,11 @@ static inline void init_vmcb_after_set_cpuid(struct kvm_vcpu *vcpu)
set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SYSENTER_EIP, 1, 1);
set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SYSENTER_ESP, 1, 1);
}
+
+ set_msr_interception(vcpu, svm->msrpm, MSR_IA32_APERF,
+ guest_can_use(vcpu, X86_FEATURE_APERFMPERF), 0);
+ set_msr_interception(vcpu, svm->msrpm, MSR_IA32_MPERF,
+ guest_can_use(vcpu, X86_FEATURE_APERFMPERF), 0);
}
static void init_vmcb(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 43fa6a16eb191..5ae5a13b9771a 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -44,7 +44,7 @@ static inline struct page *__sme_pa_to_page(unsigned long pa)
#define IOPM_SIZE PAGE_SIZE * 3
#define MSRPM_SIZE PAGE_SIZE * 2
-#define MAX_DIRECT_ACCESS_MSRS 48
+#define MAX_DIRECT_ACCESS_MSRS 50
#define MSRPM_OFFSETS 32
extern u32 msrpm_offsets[MSRPM_OFFSETS] __read_mostly;
extern bool npt_enabled;
--
2.47.0.371.ga323438b13-goog
* [RFC PATCH 22/22] KVM: x86: Enable guest usage of X86_FEATURE_APERFMPERF
2024-11-21 18:52 [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs Mingwei Zhang
` (20 preceding siblings ...)
2024-11-21 18:53 ` [RFC PATCH 21/22] KVM: SVM: " Mingwei Zhang
@ 2024-11-21 18:53 ` Mingwei Zhang
2024-12-03 23:19 ` [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs Sean Christopherson
2024-12-05 8:59 ` Nikunj A Dadhania
23 siblings, 0 replies; 35+ messages in thread
From: Mingwei Zhang @ 2024-11-21 18:53 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Huang Rui, Gautham R. Shenoy,
Mario Limonciello, Rafael J. Wysocki, Viresh Kumar,
Srinivas Pandruvada, Len Brown
Cc: H. Peter Anvin, Perry Yuan, kvm, linux-kernel, linux-pm,
Jim Mattson, Mingwei Zhang
Enable support for IA32_APERF/IA32_MPERF performance monitoring in KVM
guests. These MSRs allow guests to measure their effective CPU
frequency by comparing actual CPU cycles (APERF) against reference
cycles (MPERF).
Only expose X86_FEATURE_APERFMPERF to guests when the host has both
CONSTANT_TSC and NONSTOP_TSC. These features ensure the TSC frequency
remains stable across C-states and P-states, which is necessary for
"background" MPERF accounting.
Guest TSC scaling via KVM_SET_TSC_KHZ is not supported:
- On Intel, IA32_MPERF ticks at host rate regardless of guest TSC
scaling, making passthrough impossible without intercepting reads
- On AMD, guest TSC scaling does affect IA32_MPERF reads, but handling
it would significantly complicate cycle accounting
Record host support in kvm_cpu_caps[], advertise the feature to
userspace via CPUID.06H:ECX, and enable the governed feature when
supported by both host and guest CPUID.
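For context, a guest consuming this feature derives its effective
frequency from deltas of the two counters; a minimal sketch of the
guest-side math (base_khz is assumed known, e.g. from TSC calibration):

  #include <linux/math64.h>

  static u64 effective_khz(u64 aperf_delta, u64 mperf_delta, u64 base_khz)
  {
          /* effective = base * (actual cycles / reference cycles) */
          if (!mperf_delta)
                  return 0;
          return div64_u64(base_khz * aperf_delta, mperf_delta);
  }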
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Co-developed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
---
arch/x86/kvm/cpuid.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 41786b834b163..309fa7fef6b7b 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -399,6 +399,10 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
kvm_hv_set_cpuid(vcpu, kvm_cpuid_has_hyperv(vcpu->arch.cpuid_entries,
vcpu->arch.cpuid_nent));
+ if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
+ boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
+ kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_APERFMPERF);
+
/* Invoke the vendor callback only after the above state is updated. */
kvm_x86_call(vcpu_after_set_cpuid)(vcpu);
@@ -697,6 +701,12 @@ void kvm_set_cpu_caps(void)
if (boot_cpu_has(X86_FEATURE_AMD_SSBD))
kvm_cpu_cap_set(X86_FEATURE_SPEC_CTRL_SSBD);
+ if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
+ boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
+ kvm_cpu_cap_init_kvm_defined(CPUID_6_ECX, F(APERFMPERF));
+ else
+ kvm_cpu_cap_init_kvm_defined(CPUID_6_ECX, 0);
+
kvm_cpu_cap_mask(CPUID_7_1_EAX,
F(AVX_VNNI) | F(AVX512_BF16) | F(CMPCCXADD) |
F(FZRM) | F(FSRS) | F(FSRC) |
@@ -993,7 +1003,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
case 6: /* Thermal management */
entry->eax = 0x4; /* allow ARAT */
entry->ebx = 0;
- entry->ecx = 0;
+ cpuid_entry_override(entry, CPUID_6_ECX);
entry->edx = 0;
break;
/* function 7 has additional index. */
--
2.47.0.371.ga323438b13-goog
* Re: [RFC PATCH 06/22] KVM: x86: INIT may transition from HALTED to RUNNABLE
2024-11-21 18:52 ` [RFC PATCH 06/22] KVM: x86: INIT may transition from HALTED to RUNNABLE Mingwei Zhang
@ 2024-12-03 19:07 ` Sean Christopherson
0 siblings, 0 replies; 35+ messages in thread
From: Sean Christopherson @ 2024-12-03 19:07 UTC (permalink / raw)
To: Mingwei Zhang
Cc: Paolo Bonzini, Huang Rui, Gautham R. Shenoy, Mario Limonciello,
Rafael J. Wysocki, Viresh Kumar, Srinivas Pandruvada, Len Brown,
H. Peter Anvin, Perry Yuan, kvm, linux-kernel, linux-pm,
Jim Mattson
The shortlog is an observation, not a proper summary of the change.
KVM: x86: Handle side effects of receiving INIT while vCPU is HALTED
On Thu, Nov 21, 2024, Mingwei Zhang wrote:
> From: Jim Mattson <jmattson@google.com>
>
> When a halted vCPU is awakened by an INIT signal, it might have been
> the target of a previous KVM_HC_KICK_CPU hypercall, in which case
> pv_unhalted would be set. This flag should be cleared before the next
> HLT instruction, as kvm_vcpu_has_events() would otherwise return true
> and prevent the vCPU from entering the halt state.
>
> Use kvm_vcpu_make_runnable() to ensure consistent handling of the
> HALTED to RUNNABLE state transition.
>
> Fixes: 6aef266c6e17 ("kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks")
> Signed-off-by: Jim Mattson <jmattson@google.com>
Mingwei's SoB is missing.
> ---
> arch/x86/kvm/lapic.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 95c6beb8ce279..97aa634505306 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -3372,9 +3372,8 @@ int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
>
> if (test_and_clear_bit(KVM_APIC_INIT, &apic->pending_events)) {
> kvm_vcpu_reset(vcpu, true);
> - if (kvm_vcpu_is_bsp(apic->vcpu))
> - vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> - else
> + kvm_vcpu_make_runnable(vcpu);
This is arguably wrong. APs are never runnable after receiving INIT. Nothing should
ever be able to observe the "bad" state, but that doesn't make it any less
confusing.
This series also fails to address the majority of cases where KVM transitions to RUNNABLE:
__set_sregs_common()
__sev_snp_update_protected_guest_state()
kvm_arch_vcpu_ioctl_set_mpstate()
kvm_xen_schedop_poll()
kvm_arch_async_page_present()
kvm_arch_vcpu_ioctl_get_mpstate()
kvm_apic_accept_events() (SIPI path)
Yeah, some of those don't _need_ to be converted, and the existing behavior of
pv_unhalted is all kinds of sketchy, but fixing a few select paths just so that
APERF/MPERF virtualization does what y'all want it to do does not leave KVM in a
better place.
I also think we should add a generic setter, e.g. kvm_set_mp_state(), and take
this opportunity to sanitize pv_unhalted. Specifically, I think pv_unhalted
should be cleared on _any_ state transition, and unconditionally cleared when KVM
enters the guest. The PV kick should only wake a vCPU that is currently halted.
Unfortunately, the cross-vCPU nature means KVM can't easily handle that when
delivering APIC_DM_REMRD.
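E.g. a minimal sketch of such a setter (illustrative only; name and
semantics per the suggestion above):

  static inline void kvm_set_mp_state(struct kvm_vcpu *vcpu, int mp_state)
  {
          vcpu->arch.mp_state = mp_state;

          /* Sanitize pv_unhalted on every transition, per the above. */
          vcpu->arch.pv.pv_unhalted = false;
  }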
Please also send these fixes as a separate series. My crystal ball says APERF/MPERF
virtualization isn't going to land in the near future, and I would like to get
the mp_state handling cleaned up soonish.
> + if (!kvm_vcpu_is_bsp(apic->vcpu))
> vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
> }
> if (test_and_clear_bit(KVM_APIC_SIPI, &apic->pending_events)) {
> --
> 2.47.0.371.ga323438b13-goog
>
* Re: [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs
2024-11-21 18:52 [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs Mingwei Zhang
` (21 preceding siblings ...)
2024-11-21 18:53 ` [RFC PATCH 22/22] KVM: x86: Enable guest usage of X86_FEATURE_APERFMPERF Mingwei Zhang
@ 2024-12-03 23:19 ` Sean Christopherson
2024-12-04 1:13 ` Jim Mattson
2024-12-05 8:59 ` Nikunj A Dadhania
23 siblings, 1 reply; 35+ messages in thread
From: Sean Christopherson @ 2024-12-03 23:19 UTC (permalink / raw)
To: Mingwei Zhang
Cc: Paolo Bonzini, Huang Rui, Gautham R. Shenoy, Mario Limonciello,
Rafael J. Wysocki, Viresh Kumar, Srinivas Pandruvada, Len Brown,
H. Peter Anvin, Perry Yuan, kvm, linux-kernel, linux-pm,
Jim Mattson
On Thu, Nov 21, 2024, Mingwei Zhang wrote:
> Linux guests read IA32_APERF and IA32_MPERF on every scheduler tick
> (250 Hz by default) to measure their effective CPU frequency. To avoid
> the overhead of intercepting these frequent MSR reads, allow the guest
> to read them directly by loading guest values into the hardware MSRs.
>
> These MSRs are continuously running counters whose values must be
> carefully tracked during all vCPU state transitions:
> - Guest IA32_APERF advances only during guest execution
That's not what this series does though. Guest APERF advances while the vCPU is
loaded by KVM_RUN, which is *very* different than letting APERF run freely only
while the vCPU is actively executing in the guest.
E.g. a vCPU that is memory oversubscribed via zswap will account a significant
amount of CPU time in APERF when faulting in swapped memory, whereas traditional
file-backed swap will not due to the task being scheduled out while waiting on I/O.
In general, the "why" of this series is missing. What are the use cases you are
targeting? What are the exact semantics you want to define? *Why* did you
propose those exact semantics?
E.g. emulated I/O that is handled in KVM will be accounted to APERF, but I/O that
requires userspace exits will not. It's not necessarily wrong for heavy userspace
I/O to cause observed frequency to drop, but it's not obviously correct either.
The use cases matter a lot for APERF/MPERF, because trying to reason about what's
desirable for an oversubscribed setup requires a lot more work than defining
semantics for setups where all vCPUs are hard pinned 1:1 and memory is more or
less just partitioned. Not to mention the complexity for trying to support all
potential use cases is likely quite a bit higher.
And if the use case is specifically for slice-of-hardware, hard pinned/partitioned
VMs, does it matter if the host's view of APERF/MPERF is not accurately captured
at all times? Outside of maybe a few CPUs running bookkeeping tasks, the only
workloads running on CPUs should be vCPUs. It's not clear to me that observing
the guest utilization is outright wrong in that case.
One idea for supporting APERF/MPERF in KVM would be to add a kernel param to
disable/hide APERF/MPERF from the host, and then let KVM virtualize/passthrough
APERF/MPERF if and only if the feature is supported in hardware, but hidden from
the kernel. I.e. let the system admin gift APERF/MPERF to KVM.
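A minimal sketch of that idea (hypothetical parameter name; KVM would
then need to key off raw CPUID rather than the kernel's feature bit):

  /* Hide APERF/MPERF from the host so KVM can gift it to guests. */
  static int __init setup_gift_aperfmperf(char *str)
  {
          setup_clear_cpu_cap(X86_FEATURE_APERFMPERF);
          return 1;
  }
  __setup("gift_aperfmperf", setup_gift_aperfmperf);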
> - Guest IA32_MPERF advances at the TSC frequency whenever the vCPU is
> in C0 state, even when not actively running
> - Host kernel access is redirected through get_host_[am]perf() which
> adds per-CPU offsets to the hardware MSR values
> - Remote MSR reads through /dev/cpu/*/msr also account for these
> offsets
>
> Guest values persist in hardware while the vCPU is loaded and
> running. Host MSR values are restored on vcpu_put (either at KVM_RUN
> completion or when preempted) and when transitioning to halt state.
>
> Note that guest TSC scaling via KVM_SET_TSC_KHZ is not supported, as
> it would require either intercepting MPERF reads on Intel (where MPERF
> ticks at host rate regardless of guest TSC scaling) or significantly
> complicating the cycle accounting on AMD.
>
> The host must have both CONSTANT_TSC and NONSTOP_TSC capabilities
> since these ensure stable TSC frequency across C-states and P-states,
> which is required for accurate background MPERF accounting.
...
> arch/x86/include/asm/kvm_host.h | 11 ++
> arch/x86/include/asm/topology.h | 10 ++
> arch/x86/kernel/cpu/aperfmperf.c | 65 +++++++++++-
> arch/x86/kvm/cpuid.c | 12 ++-
> arch/x86/kvm/governed_features.h | 1 +
> arch/x86/kvm/lapic.c | 5 +-
> arch/x86/kvm/reverse_cpuid.h | 6 ++
> arch/x86/kvm/svm/nested.c | 2 +-
> arch/x86/kvm/svm/svm.c | 7 ++
> arch/x86/kvm/svm/svm.h | 2 +-
> arch/x86/kvm/vmx/nested.c | 2 +-
> arch/x86/kvm/vmx/vmx.c | 7 ++
> arch/x86/kvm/vmx/vmx.h | 2 +-
> arch/x86/kvm/x86.c | 171 ++++++++++++++++++++++++++++---
> arch/x86/lib/msr-smp.c | 11 ++
> drivers/cpufreq/amd-pstate.c | 4 +-
> drivers/cpufreq/intel_pstate.c | 5 +-
> 17 files changed, 295 insertions(+), 28 deletions(-)
>
>
> base-commit: 0a9b9d17f3a781dea03baca01c835deaa07f7cc3
> --
> 2.47.0.371.ga323438b13-goog
>
* Re: [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs
2024-12-03 23:19 ` [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs Sean Christopherson
@ 2024-12-04 1:13 ` Jim Mattson
2024-12-04 1:59 ` Sean Christopherson
0 siblings, 1 reply; 35+ messages in thread
From: Jim Mattson @ 2024-12-04 1:13 UTC (permalink / raw)
To: Sean Christopherson
Cc: Mingwei Zhang, Paolo Bonzini, Huang Rui, Gautham R. Shenoy,
Mario Limonciello, Rafael J. Wysocki, Viresh Kumar,
Srinivas Pandruvada, Len Brown, H. Peter Anvin, Perry Yuan, kvm,
linux-kernel, linux-pm
On Tue, Dec 3, 2024 at 3:19 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Nov 21, 2024, Mingwei Zhang wrote:
> > Linux guests read IA32_APERF and IA32_MPERF on every scheduler tick
> > (250 Hz by default) to measure their effective CPU frequency. To avoid
> > the overhead of intercepting these frequent MSR reads, allow the guest
> > to read them directly by loading guest values into the hardware MSRs.
> >
> > These MSRs are continuously running counters whose values must be
> > carefully tracked during all vCPU state transitions:
> > - Guest IA32_APERF advances only during guest execution
>
> That's not what this series does though. Guest APERF advances while the vCPU is
> loaded by KVM_RUN, which is *very* different than letting APERF run freely only
> while the vCPU is actively executing in the guest.
>
> E.g. a vCPU that is memory oversubscribed via zswap will account a significant
> amount of CPU time in APERF when faulting in swapped memory, whereas traditional
> file-backed swap will not due to the task being scheduled out while waiting on I/O.
Are you saying that APERF should stop completely outside of VMX
non-root operation / guest mode?
While that is possible, the overhead would be significantly
higher...probably high enough to make it impractical.
> In general, the "why" of this series is missing. What are the use cases you are
> targeting? What are the exact semantics you want to define? *Why* did are you
> proposed those exact semantics?
I get the impression that the questions above are largely rhetorical,
and that you would not be happy with the answers anyway, but if you
really are inviting a version 2, I will gladly expound upon the why.
> E.g. emulated I/O that is handled in KVM will be accounted to APERF, but I/O that
> requires userspace exits will not. It's not necessarily wrong for heavy userspace
> I/O to cause observed frequency to drop, but it's not obviously correct either.
>
> The use cases matter a lot for APERF/MPERF, because trying to reason about what's
> desirable for an oversubscribed setup requires a lot more work than defining
> semantics for setups where all vCPUs are hard pinned 1:1 and memory is more or
> less just partitioned. Not to mention the complexity for trying to support all
> potential use cases is likely quite a bit higher.
>
> And if the use case is specifically for slice-of-hardware, hard pinned/partitioned
> VMs, does it matter if the host's view of APERF/MPERF is not accurately captured
> at all times? Outside of maybe a few CPUs running bookkeeping tasks, the only
> workloads running on CPUs should be vCPUs. It's not clear to me that observing
> the guest utilization is outright wrong in that case.
My understanding is that Google Cloud customers have been asking for
this feature for all manner of VM families for years, and most of
those VM families are not slice-of-hardware, since we just launched
our first such offering a few months ago.
> One idea for supporting APERF/MPERF in KVM would be to add a kernel param to
> disable/hide APERF/MPERF from the host, and then let KVM virtualize/passthrough
> APERF/MPERF if and only if the feature is supported in hardware, but hidden from
> the kernel. I.e. let the system admin gift APERF/MPERF to KVM.
Part of our goal has been to enable guest APERF/MPERF without
impacting the use of host APERF/MPERF, since one of the first things
our support teams look at in response to a performance complaint is
the effective frequencies of the CPUs as reported on the host.
I can explain all of this in excruciating detail, but I'm not really
motivated by your initial response, which honestly seems a bit
hostile. At least you looked at the code, which is a far warmer
reception than I usually get.
* Re: [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs
2024-12-04 1:13 ` Jim Mattson
@ 2024-12-04 1:59 ` Sean Christopherson
2024-12-04 4:00 ` Jim Mattson
` (2 more replies)
0 siblings, 3 replies; 35+ messages in thread
From: Sean Christopherson @ 2024-12-04 1:59 UTC (permalink / raw)
To: Jim Mattson
Cc: Mingwei Zhang, Paolo Bonzini, Huang Rui, Gautham R. Shenoy,
Mario Limonciello, Rafael J. Wysocki, Viresh Kumar,
Srinivas Pandruvada, Len Brown, H. Peter Anvin, Perry Yuan, kvm,
linux-kernel, linux-pm
On Tue, Dec 03, 2024, Jim Mattson wrote:
> On Tue, Dec 3, 2024 at 3:19 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Nov 21, 2024, Mingwei Zhang wrote:
> > > Linux guests read IA32_APERF and IA32_MPERF on every scheduler tick
> > > (250 Hz by default) to measure their effective CPU frequency. To avoid
> > > the overhead of intercepting these frequent MSR reads, allow the guest
> > > to read them directly by loading guest values into the hardware MSRs.
> > >
> > > These MSRs are continuously running counters whose values must be
> > > carefully tracked during all vCPU state transitions:
> > > - Guest IA32_APERF advances only during guest execution
> >
> > That's not what this series does though. Guest APERF advances while the vCPU is
> > loaded by KVM_RUN, which is *very* different than letting APERF run freely only
> > while the vCPU is actively executing in the guest.
> >
> > E.g. a vCPU that is memory oversubscribed via zswap will account a significant
> > amount of CPU time in APERF when faulting in swapped memory, whereas traditional
> > file-backed swap will not due to the task being scheduled out while waiting on I/O.
>
> Are you saying that APERF should stop completely outside of VMX
> non-root operation / guest mode?
> While that is possible, the overhead would be significantly
> higher...probably high enough to make it impractical.
No, I'm simply pointing out that the cover letter is misleading/inaccurate.
> > In general, the "why" of this series is missing. What are the use cases you are
> > targeting? What are the exact semantics you want to define? *Why* did you
> > propose those exact semantics?
>
> I get the impression that the questions above are largely rhetorical, and
Nope, not rhetorical, I genuinely want to know. I can't tell if y'all thought
about the side effects of things like swap and emulated I/O, and if you did, what
made you come to the conclusion that the "best" boundary is on sched_out() and
return to userspace.
> that you would not be happy with the answers anyway, but if you really are
> inviting a version 2, I will gladly expound upon the why.
No need for a new version at this time, just give me the details.
> > E.g. emulated I/O that is handled in KVM will be accounted to APERF, but I/O that
> > requires userspace exits will not. It's not necessarily wrong for heavy userspace
> > I/O to cause observed frequency to drop, but it's not obviously correct either.
> >
> > The use cases matter a lot for APERF/MPERF, because trying to reason about what's
> > desirable for an oversubscribed setup requires a lot more work than defining
> > semantics for setups where all vCPUs are hard pinned 1:1 and memory is more or
> > less just partitioned. Not to mention the complexity for trying to support all
> > potential use cases is likely quite a bit higher.
> >
> > And if the use case is specifically for slice-of-hardware, hard pinned/partitioned
> > VMs, does it matter if the host's view of APERF/MPERF is not accurately captured
> > at all times? Outside of maybe a few CPUs running bookkeeping tasks, the only
> > workloads running on CPUs should be vCPUs. It's not clear to me that observing
> > the guest utilization is outright wrong in that case.
>
> My understanding is that Google Cloud customers have been asking for this
> feature for all manner of VM families for years, and most of those VM
> families are not slice-of-hardware, since we just launched our first such
> offering a few months ago.
But do you actually want to expose APERF/MPERF to those VMs? With my upstream
hat on, what someone's customers are asking for isn't relevant. What's relevant
is what that someone wants to deliver/enable.
> > One idea for supporting APERF/MPERF in KVM would be to add a kernel param to
> > disable/hide APERF/MPERF from the host, and then let KVM virtualize/passthrough
> > APERF/MPERF if and only if the feature is supported in hardware, but hidden from
> > the kernel. I.e. let the system admin gift APERF/MPERF to KVM.
>
> Part of our goal has been to enable guest APERF/MPERF without impacting the
> use of host APERF/MPERF, since one of the first things our support teams look
> at in response to a performance complaint is the effective frequencies of the
> CPUs as reported on the host.
But is looking at the host's view even useful if (a) the only thing running on
those CPUs is a single vCPU, and (b) host userspace only sees the effective
frequencies when _host_ code is running? Getting the effective frequency for
when the userspace VMM is processing emulated I/O probably isn't going to be all
that helpful.
And gifting APERF/MPERF to VMs doesn't have to mean the host can't read the MSRs,
e.g. via turbostat. It just means the kernel won't use APERF/MPERF for scheduling
decisions or any other behaviors that rely on an accurate host view.
> I can explain all of this in excruciating detail, but I'm not really
> motivated by your initial response, which honestly seems a bit hostile.
Probably because this series made me a bit grumpy :-) As presented, this feels
way, way too much like KVM's existing PMU "virtualization". Mostly works if you
stare at it just so, but devoid of details on why X was done instead of Y, and
seemingly ignores multiple edge cases.
I'm not saying you and Mingwei haven't thought about edge cases and design
tradeoffs, but nothing in the cover letter, changelogs, comments (none), or
testcases (also none) communicates those thoughts to others.
> At least you looked at the code, which is a far warmer reception than I
> usually get.
* Re: [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs
2024-12-04 1:59 ` Sean Christopherson
@ 2024-12-04 4:00 ` Jim Mattson
2024-12-04 5:11 ` Mingwei Zhang
2024-12-04 12:30 ` Jim Mattson
2 siblings, 0 replies; 35+ messages in thread
From: Jim Mattson @ 2024-12-04 4:00 UTC (permalink / raw)
To: Sean Christopherson
Cc: Mingwei Zhang, Paolo Bonzini, Huang Rui, Gautham R. Shenoy,
Mario Limonciello, Rafael J. Wysocki, Viresh Kumar,
Srinivas Pandruvada, Len Brown, H. Peter Anvin, Perry Yuan, kvm,
linux-kernel, linux-pm
On Tue, Dec 3, 2024 at 5:59 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Dec 03, 2024, Jim Mattson wrote:
> > On Tue, Dec 3, 2024 at 3:19 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Thu, Nov 21, 2024, Mingwei Zhang wrote:
> > > > Linux guests read IA32_APERF and IA32_MPERF on every scheduler tick
> > > > (250 Hz by default) to measure their effective CPU frequency. To avoid
> > > > the overhead of intercepting these frequent MSR reads, allow the guest
> > > > to read them directly by loading guest values into the hardware MSRs.
> > > >
> > > > These MSRs are continuously running counters whose values must be
> > > > carefully tracked during all vCPU state transitions:
> > > > - Guest IA32_APERF advances only during guest execution
> > >
> > > That's not what this series does though. Guest APERF advances while the vCPU is
> > > loaded by KVM_RUN, which is *very* different than letting APERF run freely only
> > > while the vCPU is actively executing in the guest.
> > >
> > > E.g. a vCPU that is memory oversubscribed via zswap will account a significant
> > > amount of CPU time in APERF when faulting in swapped memory, whereas traditional
> > > file-backed swap will not due to the task being scheduled out while waiting on I/O.
> >
> > Are you saying that APERF should stop completely outside of VMX
> > non-root operation / guest mode?
> > While that is possible, the overhead would be significantly
> > higher...probably high enough to make it impractical.
>
> No, I'm simply pointing out that the cover letter is misleading/inaccurate.
>
> > > In general, the "why" of this series is missing. What are the use cases you are
> > > targeting? What are the exact semantics you want to define? *Why* did you
> > > propose those exact semantics?
> >
> > I get the impression that the questions above are largely rhetorical, and
>
> Nope, not rhetorical, I genuinely want to know. I can't tell if y'all thought
> about the side effects of things like swap and emulated I/O, and if you did, what
> made you come to the conclusion that the "best" boundary is on sched_out() and
> return to userspace.
>
> > that you would not be happy with the answers anyway, but if you really are
> > inviting a version 2, I will gladly expound upon the why.
>
> No need for a new version at this time, just give me the details.
>
> > > E.g. emulated I/O that is handled in KVM will be accounted to APERF, but I/O that
> > > requires userspace exits will not. It's not necessarily wrong for heavy userspace
> > > I/O to cause observed frequency to drop, but it's not obviously correct either.
> > >
> > > The use cases matter a lot for APERF/MPERF, because trying to reason about what's
> > > desirable for an oversubscribed setup requires a lot more work than defining
> > > semantics for setups where all vCPUs are hard pinned 1:1 and memory is more or
> > > less just partitioned. Not to mention the complexity for trying to support all
> > > potential use cases is likely quite a bit higher.
> > >
> > > And if the use case is specifically for slice-of-hardware, hard pinned/partitioned
> > > VMs, does it matter if the host's view of APERF/MPERF is not accurately captured
> > > at all times? Outside of maybe a few CPUs running bookkeeping tasks, the only
> > > workloads running on CPUs should be vCPUs. It's not clear to me that observing
> > > the guest utilization is outright wrong in that case.
> >
> > My understanding is that Google Cloud customers have been asking for this
> > feature for all manner of VM families for years, and most of those VM
> > families are not slice-of-hardware, since we just launched our first such
> > offering a few months ago.
>
> But do you actually want to expose APERF/MPERF to those VMs? With my upstream
> hat on, what someone's customers are asking for isn't relevant. What's relevant
> is what that someone wants to deliver/enable.
>
> > > One idea for supporting APERF/MPERF in KVM would be to add a kernel param to
> > > disable/hide APERF/MPERF from the host, and then let KVM virtualize/passthrough
> > > APERF/MPERF if and only if the feature is supported in hardware, but hidden from
> > > the kernel. I.e. let the system admin gift APERF/MPERF to KVM.
> >
> > Part of our goal has been to enable guest APERF/MPERF without impacting the
> > use of host APERF/MPERF, since one of the first things our support teams look
> > at in response to a performance complaint is the effective frequencies of the
> > CPUs as reported on the host.
>
> But is looking at the host's view even useful if (a) the only thing running on
> those CPUs is a single vCPU, and (b) host userspace only sees the effective
> frequencies when _host_ code is running? Getting the effective frequency for
> when the userspace VMM is processing emulated I/O probably isn't going to be all
> that helpful.
(a) is your constraint, not mine, and (b) certainly sounds pointless,
but that isn't the behavior of this patch set, so I'm not sure why
you're even going there.
With this patch set, host APERF/MPERF still reports all cycles
accumulated on the logical processor, regardless of whether in the
host or the guest. There will be a small loss every time the MSRs are
written, but that loss is minimized by writing the MSRs as
infrequently as possible.
I get ahead of myself, but I just couldn't let this
mischaracterization stand uncorrected while I get the rest of my
response together.
> And gifting APERF/MPERF to VMs doesn't have to mean the host can't read the MSRs,
> e.g. via turbostat. It just means the kernel won't use APERF/MPERF for scheduling
> decisions or any other behaviors that rely on an accurate host view.
>
> > I can explain all of this in excruciating detail, but I'm not really
> > motivated by your initial response, which honestly seems a bit hostile.
>
> Probably because this series made me a bit grumpy :-) As presented, this feels
> way, way too much like KVM's existing PMU "virtualization". Mostly works if you
> stare at it just so, but devoid of details on why X was done instead of Y, and
> seemingly ignores multiple edge cases.
>
> I'm not saying you and Mingwei haven't thought about edge cases and design
> tradeoffs, but nothing in the cover letter, changelogs, comments (none), or
> testcases (also none) communicates those thoughts to others.
>
> > At least you looked at the code, which is a far warmer reception than I
> > usually get.
* Re: [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs
2024-12-04 1:59 ` Sean Christopherson
2024-12-04 4:00 ` Jim Mattson
@ 2024-12-04 5:11 ` Mingwei Zhang
2024-12-04 12:30 ` Jim Mattson
2 siblings, 0 replies; 35+ messages in thread
From: Mingwei Zhang @ 2024-12-04 5:11 UTC (permalink / raw)
To: Sean Christopherson
Cc: Jim Mattson, Paolo Bonzini, Huang Rui, Gautham R. Shenoy,
Mario Limonciello, Rafael J. Wysocki, Viresh Kumar,
Srinivas Pandruvada, Len Brown, H. Peter Anvin, Perry Yuan, kvm,
linux-kernel, linux-pm
Sorry for the duplicate message...
On Tue, Dec 3, 2024 at 5:59 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Dec 03, 2024, Jim Mattson wrote:
> > On Tue, Dec 3, 2024 at 3:19 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Thu, Nov 21, 2024, Mingwei Zhang wrote:
> > > > Linux guests read IA32_APERF and IA32_MPERF on every scheduler tick
> > > > (250 Hz by default) to measure their effective CPU frequency. To avoid
> > > > the overhead of intercepting these frequent MSR reads, allow the guest
> > > > to read them directly by loading guest values into the hardware MSRs.
> > > >
> > > > These MSRs are continuously running counters whose values must be
> > > > carefully tracked during all vCPU state transitions:
> > > > - Guest IA32_APERF advances only during guest execution
> > >
> > > That's not what this series does though. Guest APERF advances while the vCPU is
> > > loaded by KVM_RUN, which is *very* different than letting APERF run freely only
> > > while the vCPU is actively executing in the guest.
> > >
> > > E.g. a vCPU that is memory oversubscribed via zswap will account a significant
> > > amount of CPU time in APERF when faulting in swapped memory, whereas traditional
> > > file-backed swap will not due to the task being scheduled out while waiting on I/O.
> >
> > Are you saying that APERF should stop completely outside of VMX
> > non-root operation / guest mode?
> > While that is possible, the overhead would be significantly
> > higher...probably high enough to make it impractical.
>
> No, I'm simply pointing out that the cover letter is misleading/inaccurate.
>
> > > In general, the "why" of this series is missing. What are the use cases you are
> > > targeting? What are the exact semantics you want to define? *Why* did you
> > > propose those exact semantics?
> >
> > I get the impression that the questions above are largely rhetorical, and
>
> Nope, not rhetorical, I genuinely want to know. I can't tell if y'all thought
> about the side effects of things like swap and emulated I/O, and if you did, what
> made you come to the conclusion that the "best" boundary is on sched_out() and
> return to userspace.
Even for the slice-of-hardware case, KVM still needs to maintain the
guest aperfmperf context and do the context switch. Even if a vCPU is
pinned, the host system design always has corner cases. For instance,
the host may want to move a bunch of vCPUs from one group of cores to
another, say from one CCX to another on AMD. Or in some cases the host
may balance memory usage by moving VMs from one (v)NUMA node to
another. Those should be corner cases and thus rare, but they can
happen in reality.
>
> > that you would not be happy with the answers anyway, but if you really are
> > inviting a version 2, I will gladly expound upon the why.
>
> No need for a new version at this time, just give me the details.
>
> > > E.g. emulated I/O that is handled in KVM will be accounted to APERF, but I/O that
> > > requires userspace exits will not. It's not necessarily wrong for heavy userspace
> > > I/O to cause observed frequency to drop, but it's not obviously correct either.
> > >
> > > The use cases matter a lot for APERF/MPERF, because trying to reason about what's
> > > desirable for an oversubscribed setup requires a lot more work than defining
> > > semantics for setups where all vCPUs are hard pinned 1:1 and memory is more or
> > > less just partitioned. Not to mention the complexity for trying to support all
> > > potential use cases is likely quite a bit higher.
> > >
> > > And if the use case is specifically for slice-of-hardware, hard pinned/partitioned
> > > VMs, does it matter if the host's view of APERF/MPERF is not accurately captured
> > > at all times? Outside of maybe a few CPUs running bookkeeping tasks, the only
> > > workloads running on CPUs should be vCPUs. It's not clear to me that observing
> > > the guest utilization is outright wrong in that case.
> >
> > My understanding is that Google Cloud customers have been asking for this
> > feature for all manner of VM families for years, and most of those VM
> > families are not slice-of-hardware, since we just launched our first such
> > offering a few months ago.
>
> But do you actually want to expose APERF/MPERF to those VMs? With my upstream
> hat on, what someone's customers are asking for isn't relevant. What's relevant
> is what that someone wants to deliver/enable.
>
> > > One idea for supporting APERF/MPERF in KVM would be to add a kernel param to
> > > disable/hide APERF/MPERF from the host, and then let KVM virtualize/passthrough
> > > APERF/MPERF if and only if the feature is supported in hardware, but hidden from
> > > the kernel. I.e. let the system admin gift APERF/MPERF to KVM.
> >
> > Part of our goal has been to enable guest APERF/MPERF without impacting the
> > use of host APERF/MPERF, since one of the first things our support teams look
> > at in response to a performance complaint is the effective frequencies of the
> > CPUs as reported on the host.
>
> But is looking at the host's view even useful if (a) the only thing running on
> those CPUs is a single vCPU, and (b) host userspace only sees the effective
> frequencies when _host_ code is running? Getting the effective frequency for
> when the userspace VMM is processing emulated I/O probably isn't going to be all
> that helpful.
>
> And gifting APERF/MPERF to VMs doesn't have to mean the host can't read the MSRs,
> e.g. via turbostat. It just means the kernel won't use APERF/MPERF for scheduling
> decisions or any other behaviors that rely on an accurate host view.
>
> > I can explain all of this in excruciating detail, but I'm not really
> > motivated by your initial response, which honestly seems a bit hostile.
>
> Probably because this series made me a bit grumpy :-) As presented, this feels
> way, way too much like KVM's existing PMU "virtualization". Mostly works if you
> stare at it just so, but devoid of details on why X was done instead of Y, and
> seemingly ignores multiple edge cases.
ah, I can understand your feelings :) In the existing vPMU
implementation, the guest counter value is genuinely hard to fetch
because part of it always lives in the perf subsystem. But in the
aperfmperf case, the guest value is always in one place while code is
inside the KVM run loop. Guest rdmsr is passed through to the hardware
MSRs. Writes need some adjustment, but the adjustment applies to the
host-level offset, not to the guest value, and the offset we maintain
is quite simple math.
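Concretely, the "simple math" plausibly reduces to something like this
sketch (the per-CPU offset variable is an assumed name for illustration):

  static DEFINE_PER_CPU(u64, host_aperf_offset);

  static u64 get_host_aperf(void)
  {
          u64 hw;

          /* Hardware may hold the guest's value; the per-CPU offset
           * reconstructs the host's logical view. */
          rdmsrl(MSR_IA32_APERF, hw);
          return hw + this_cpu_read(host_aperf_offset);
  }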
>
> I'm not saying you and Mingwei haven't thought about edge cases and design
> tradeoffs, but nothing in the cover letter, changelogs, comments (none), or
> testcases (also none) communicates those thoughts to others.
>
> > At least you looked at the code, which is a far warmer reception than I
> > usually get.
* Re: [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs
2024-12-04 1:59 ` Sean Christopherson
2024-12-04 4:00 ` Jim Mattson
2024-12-04 5:11 ` Mingwei Zhang
@ 2024-12-04 12:30 ` Jim Mattson
2024-12-06 16:34 ` Sean Christopherson
2 siblings, 1 reply; 35+ messages in thread
From: Jim Mattson @ 2024-12-04 12:30 UTC (permalink / raw)
To: Sean Christopherson
Cc: Mingwei Zhang, Paolo Bonzini, Huang Rui, Gautham R. Shenoy,
Mario Limonciello, Rafael J. Wysocki, Viresh Kumar,
Srinivas Pandruvada, Len Brown, H. Peter Anvin, Perry Yuan, kvm,
linux-kernel, linux-pm
Here is the sordid history behind the proposed APERFMPERF implementation.
First of all, I have never considered this feature for only
"slice-of-hardware" VMs, for two reasons:
(1) The original feature request was first opened in 2015, six months
before I joined Google, and long before Google Cloud had anything
remotely resembling a slice-of-hardware offering.
(2) One could argue that Google Cloud still has no slice-of-hardware
offering today.
Hence, an implementation that only works for "slice-of-hardware" is
essentially a science fair project. We might learn a lot, but there is
no ROI.
I dragged my feet on this for a long time, because
(1) Without actual guest C-state control, it seems essentially
pointless (though I probably didn't give sufficient weight to the warm
fuzzy feeling it might give customers).
(2) It's one of those things that's impossible to virtualize with
precision, and I can be a real pedant sometimes.
(3) I didn't want to expose a power side-channel that could be used to
spy on other tenants.
In 2019, Google Cloud launched the C2 VM family, with MWAIT-exiting
disabled for whole socket shapes. Though MWAIT hints aren't full
C-state control, and we still had the 1 MHz host timer tick that would
probably keep the guest out of deep C-states, my first objection
started to collapse. As I softened in my old age, the second objection
seemed indefensible, especially after I finally caved on nested posted
interrupt processing, which truly is unvirtualizable. But, especially
after the whole Meltdown/Spectre debacle, I was holding firm to my
third objection, despite counter-arguments that the same information
could be obtained without APERFMPERF. I guess I'm still smarting from
being proven completely wrong about RowHammer.
Finally, in December 2021, I thought I had a reasonable solution. We
could implement APERFMPERF in userspace, and the low fidelity would
make me feel comfortable about my third objection. "How would
userspace get this information," you may ask. Well, Google Cloud has
been carrying local patches to log guest {APERF, MPERF, TSC} deltas
since Ben Serebrin added it in 2017. Though the design document only
stipulated that the MSRs should be sampled at VMRUN entry and exit,
the original code actually sampled at VM-entry and VM-exit, with a
limitation of sampling at those additional points only if 500
microseconds had elapsed since the last samples were taken. Ben
calculated the effective frequency at each sample point to populate a
histogram, but that's not really relevant to APERFMPERF
virtualization. I just mention it to explain why those
VM-entry/VM-exit sample points were there. This code accounted for
everything between vcpu_load() and vcpu_put() when accumulating
"guest" APERF and MPERF, and this data eventually formed the basis of
our userspace implementation of APERFMPERF virtualization.
In 2022, Riley Gamson implemented APERFMPERF virtualization in
userspace, using KVM_X86_SET_MSR_FILTER to intercept guest accesses to
the MSRs, and using Ben's "turbostat" data to derive the values to be
returned. The APERF delta could be used as-is, but I was insistent
that MPERF needed to track guest TSC cycles while the vCPU was not
halted. My reasoning was this:
(1) The specification says so. Okay; it actually says that MPERF
"[i]ncrements at fixed interval (relative to TSC freq.) when the
logical processor is in C0," but even turbostat makes the
architecturally prohibited assumption that MPERF and TSC tick at the
same frequency.
(2) It would be disingenuous to claim the effective frequency *while
running* for a duty-cycle limited f1-micro or g2-small VM, or for
overcommitted VMs that are forced to timeshare with other tenants.
APERF is admittedly tricky to virtualize. For instance, how many
virtual "core clock counts at the coordinated clock frequency" should
we count while KVM is emulating CPUID? That's unclear. We're certainly
not trying to *emulate* APERF, so the number of cycles the physical
CPU takes to execute the instruction isn't relevant. The virtual CPU
just happens to take a few thousand cycles to execute CPUID. Consider
it a quirk. Similarly, in the case of zswap, some memory accesses take
a *really* long time. Or consider KVM time as the equivalent of SMM
time on physical hardware. SMM cycles are accumulated by APERF. It may
seem like a memory access just took 60 *milliseconds*, but most of
that time was spent in SMM. (That's taken from a recent real-world
example.) As much as I hate SMM, it provides a convenient rug to sweep
virtualization holes under.
At this point, I should note that Aaron Lewis eliminated the
rate-limited "turbostat" sampling at VM-entry/VM-exit early this year,
because it was too expensive. Admittedly, most of the cost was
attributed to reading MSR_C6_CORE_RESIDENCY, which Drew Schmitt added
to Ben's sampling in 2018. But this did factor into my thinking
regarding cost.
The target was the C3 VM family, which is not
"slice-of-hardware," and, ironically, does not disable MWAIT-exiting
even for full socket shapes (because we realized after launching C2
that that was a huge mistake). However, the userspace approach was
abandoned before C3 launch, because of performance issues. You may
laugh, but when we advertised APERFMPERF to Linux guests, we were
surprised to find that every vCPU started sampling these MSRs every
timer tick. I still haven't looked into why. I'm assuming it has
something to do with a perception of "fairness" in scheduling, and I
just hope that it doesn't give power-hungry instruction mixes like
AVX-512 and AMX an even greater fraction of CPU time because their
effective frequency is low. In any case, we were seeing 10% to 16%
performance degradations when APERFMPERF was advertised to Linux
guests, and that was a non-starter. If that seems excessive, it is. A
large part of this is due to contention for an unfortunate exclusive
lock on the legacy PIC that our userspace grabs and releases for each
KVM_RUN ioctl. That could be fixed with a reader/writer lock, but the
point is that we were seeing KVM exits at a much higher rate than ever
before. I accept full responsibility for this debacle. I thought maybe
these MSRs would get sampled once a second while running turbostat. I
had no idea that the Linux kernel was so enamored of these MSRs.
Just doing a back-of-the-envelope calculation based on a 250 Hz guest
tick and 50000 cycles for a KVM exit (two intercepted MSR reads per
tick is 500 exits/sec * 50000 cycles = 25M cycles/sec, about 1% of a
2.5 GHz core), this implementation was going to cost 1% or more in
guest performance. We certainly couldn't enable it
by default, but maybe we could enable it for the specific customers
who had been clamoring for the feature. However, when I asked Product
Management how much performance customers were willing to trade for
this feature, the answer was "none."
Okay. How do we achieve that? The obvious approach is to intercept
reads of these MSRs and do some math in KVM. I found that really
unpalatable, though. For most of our VM families, the dominant source
of consistent background VM-exits is the host timer tick. The second
highest source is the guest timer tick. With the host timer tick
finally going away on the C4 VM family, the guest timer tick now
dominates. On Intel parts, where we take advantage of hardware EOI
virtualization, we now have two VM-exits per guest timer tick (one for
writing the x2APIC initial count MSR, and one for the VMX-preemption
timer). I couldn't defend doubling that with intercepted reads of
APERF and MPERF.
Well, what about the simple hack of passing through the host values? I
had considered this, despite the fact that it would only work for
slice-of-hardware. I even coerced Josh Don into "fixing" our scheduler
so that it wouldn't allow two vCPU threads (a virtual hyperthread
pair) to flip-flop between hyperthreads on their assigned physical
core. However, I eventually dismissed this as
(1) too much of a hack
(2) broken with live migration
(3) disingenuous when multiple tasks are running on the logical processor.
Yes, (3) does happen, even with our C4 VM family. During copyless
migration, two vCPU threads share a logical processor. During live
migration, I believe the live migration threads compete with vCPU
threads. And there is still at least one kworker thread competing for
cycles.
Actually writing the guest values into the MSRs was initially
abhorrent to me, because of the inherent lossage on every write. But,
I eventually got over it, and partially assuaged my revulsion by
writing the MSRs infrequently. I would much have preferred APERF and
MPERF equivalents of IA32_TSC_ADJUST, but I don't have the patience to
wait for the CPU vendors. BTW, as an aside, just how is AMD's scaling
of MPERF by the TSC_RATIO MSR even remotely useful without an offset?
One requirement I refuse to budge on is that host usage of APERFMPERF
must continue to work exactly as before, modulo some very small loss
of precision. Most of the damage could be contained within KVM, if
you're willing to accept the constraint that these MSRs cannot be
accessed within an NMI handler (on Intel CPUs), but then you have to
swap the guest and host values every VM-entry/VM-exit. This approach
increases both the performance overhead (for which our budget is
"none") and the loss of precision over the current approach. Given the
amount of whining on this list over writing just one MSR on every
VM-entry/VM-exit (yes, IA32_SPEC_CTRL, I'm looking at you), I didn't
think it would be very popular to add two. And, to be honest, I
remembered that rate-limited *reads* of the "turbostat" MSRs were too
expensive, but I had forgotten that the real culprit there was the
egregiously slow MSR_C6_CORE_RESIDENCY.
I do recall the hullabaloo regarding KVM's usurpation of IA32_TSC_AUX,
an affront that went unnoticed until the advent of RDPID. Honestly,
that's where I expected pushback on this series. Initially, I tried to
keep the changes outside KVM to the minimum possible, replacing only
explicit reads of APERF or MPERF with the new accessor functions. I
wasn't going to touch the /dev/cpu/*/msr/* interface. After all, none
of the other KVM userspace return MSRs do anything special there. But,
then I discovered that turbostat on the host uses that interface, and
I really wanted that tool to continue to work as expected. So, the
rdmsr crosscalls picked up an ugly wart. Frankly, that was the
specific patch that I expected to unleash vitriol. As an aside, why
does turbostat have to use that interface for its own independent
reads of these MSRs, when the kernel is already reading them every
scheduler tick?!? Sure, for tickless kernels, maybe, but I digress.
Wherever the context-switching happens, I contend that there is no
"clean" virtualization of APERF. If it comes down to just a question
of VM-entry/VM-exit or vcpu_load()/vcpu_put(), we can collect some
performance numbers and try to come to a consensus, but if you're
fundamentally opposed to virtualizing APERF, because it's messy, then
I don't see any point in pursuing this further.
Thanks,
--jim
* Re: [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs
2024-11-21 18:52 [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs Mingwei Zhang
` (22 preceding siblings ...)
2024-12-03 23:19 ` [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs Sean Christopherson
@ 2024-12-05 8:59 ` Nikunj A Dadhania
2024-12-05 13:48 ` Jim Mattson
23 siblings, 1 reply; 35+ messages in thread
From: Nikunj A Dadhania @ 2024-12-05 8:59 UTC (permalink / raw)
To: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Huang Rui,
Gautham R. Shenoy, Mario Limonciello, Rafael J. Wysocki,
Viresh Kumar, Srinivas Pandruvada, Len Brown
Cc: H. Peter Anvin, Perry Yuan, kvm, linux-kernel, linux-pm,
Jim Mattson
On 11/22/2024 12:22 AM, Mingwei Zhang wrote:
> Linux guests read IA32_APERF and IA32_MPERF on every scheduler tick
> (250 Hz by default) to measure their effective CPU frequency. To avoid
> the overhead of intercepting these frequent MSR reads, allow the guest
> to read them directly by loading guest values into the hardware MSRs.
>
> These MSRs are continuously running counters whose values must be
> carefully tracked during all vCPU state transitions:
> - Guest IA32_APERF advances only during guest execution
> - Guest IA32_MPERF advances at the TSC frequency whenever the vCPU is
> in C0 state, even when not actively running
Any particular reason to treat APERF and MPERF differently?
AFAIU, APERF and MPERF architecturally both count when the CPU is in C0 state:
MPERF counts at a constant frequency and APERF counts at a variable
frequency. Shouldn't we treat APERF and MPERF equally and keep counting in C0
state, even when "not actively running"?
Can you clarify what you mean by "not actively running"?
Regards
Nikunj
* Re: [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs
2024-12-05 8:59 ` Nikunj A Dadhania
@ 2024-12-05 13:48 ` Jim Mattson
0 siblings, 0 replies; 35+ messages in thread
From: Jim Mattson @ 2024-12-05 13:48 UTC (permalink / raw)
To: Nikunj A Dadhania
Cc: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Huang Rui,
Gautham R. Shenoy, Mario Limonciello, Rafael J. Wysocki,
Viresh Kumar, Srinivas Pandruvada, Len Brown, H. Peter Anvin,
Perry Yuan, kvm, linux-kernel, linux-pm
On Thu, Dec 5, 2024 at 1:00 AM Nikunj A Dadhania <nikunj@amd.com> wrote:
>
> On 11/22/2024 12:22 AM, Mingwei Zhang wrote:
> > Linux guests read IA32_APERF and IA32_MPERF on every scheduler tick
> > (250 Hz by default) to measure their effective CPU frequency. To avoid
> > the overhead of intercepting these frequent MSR reads, allow the guest
> > to read them directly by loading guest values into the hardware MSRs.
> >
> > These MSRs are continuously running counters whose values must be
> > carefully tracked during all vCPU state transitions:
> > - Guest IA32_APERF advances only during guest execution
> > - Guest IA32_MPERF advances at the TSC frequency whenever the vCPU is
> > in C0 state, even when not actively running
>
> Any particular reason to treat APERF and MPERF differently?
Core cycles accumulated by the logical processor that do not
contribute to the execution of the virtual processor should not be
counted. For example, consider Google Cloud's e2-small VM type, which
is capped at a 25% duty cycle. Even if the logical processor is
humming along at an effective frequency of 3.6 GHz, an e2-small vCPU
task is only resident 25% of the time, so its effective frequency is
more like 0.9 GHz (over a sufficiently large period of time).
Similarly, if a logical processor running at 3.6 GHz is shared 50/50
by two vCPUs, the effective frequency of each is about 1.8 GHz (again,
over a sufficiently large period of time). Over smaller time periods,
the effective frequencies in these examples would look like square
waves, alternating between 3.6 GHz and 0, much like thermal
throttling. And, much like thermal throttling, MPERF reference cycles
continue to tick on at the fixed reference frequency, even when APERF
cycles drop to 0.
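Since only the ratio of the deltas is architecturally meaningful, the
effective frequency over a sampling window reduces to something like
this (hypothetical helper, not part of the series):

#include <linux/math64.h>

/* Effective frequency in kHz over one sampling window. */
static u64 effective_khz(u64 aperf_delta, u64 mperf_delta, u64 base_khz)
{
	if (!mperf_delta)
		return 0;
	return mul_u64_u64_div_u64(base_khz, aperf_delta, mperf_delta);
}

In the e2-small example, the MPERF delta accrues over the entire
window while the APERF delta accrues only 25% of the time, so the
computed frequency lands near 0.9 GHz regardless of the instantaneous
clock.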
> AFAIU, APERF and MPERF architecturally both count when the CPU is in C0 state:
> MPERF counts at a constant frequency and APERF counts at a variable
> frequency. Shouldn't we treat APERF and MPERF equally and keep counting in C0
> state, even when "not actively running"?
>
> Can you clarify what you mean by "not actively running"?
The current implementation considers the vCPU to be actively running
if the task is in the KVM_RUN ioctl, between vcpu_load() and
vcpu_put(). This also implies that the task itself is currently
running on a logical processor, since there is a vcpu_put() on
sched_out and a vcpu_load() on sched_in. As Sean points out, this is
only an approximation, since (a) such things as I/O completion in
userspace are not counted, and (b) such things as uncompressing a
zswapped page that happen in the vCPU task are counted.
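In code, those boundaries amount to roughly the following sketch. The
set_guest_*() and restore_host_*() helpers are from this series, but
their signatures are simplified here, and the vcpu fields are my
naming:

static void aperfmperf_vcpu_load(struct kvm_vcpu *vcpu)
{
	/* Stash the host values and make the guest counters live. */
	set_guest_aperf(vcpu->arch.guest_aperf);
	set_guest_mperf(vcpu->arch.guest_mperf);
}

static void aperfmperf_vcpu_put(struct kvm_vcpu *vcpu)
{
	/* Checkpoint the guest counters, then give the MSRs back. */
	rdmsrl(MSR_IA32_APERF, vcpu->arch.guest_aperf);
	rdmsrl(MSR_IA32_MPERF, vcpu->arch.guest_mperf);
	restore_host_aperf();
	restore_host_mperf();
}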
> Regards
> Nikunj
>
* Re: [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs
2024-12-04 12:30 ` Jim Mattson
@ 2024-12-06 16:34 ` Sean Christopherson
2024-12-18 22:23 ` Jim Mattson
0 siblings, 1 reply; 35+ messages in thread
From: Sean Christopherson @ 2024-12-06 16:34 UTC (permalink / raw)
To: Jim Mattson
Cc: Mingwei Zhang, Paolo Bonzini, Huang Rui, Gautham R. Shenoy,
Mario Limonciello, Rafael J. Wysocki, Viresh Kumar,
Srinivas Pandruvada, Len Brown, H. Peter Anvin, Perry Yuan, kvm,
linux-kernel, linux-pm
On Wed, Dec 04, 2024, Jim Mattson wrote:
> Wherever the context-switching happens, I contend that there is no
> "clean" virtualization of APERF. If it comes down to just a question
> of VM-entry/VM-exit or vcpu_load()/vcpu_put(), we can collect some
> performance numbers and try to come to a consensus, but if you're
> fundamentally opposed to virtualizing APERF, because it's messy, then
> I don't see any point in pursuing this further.
I'm not fundamentally opposed to virtualizing the feature. My complaints with
the series are that it doesn't provide sufficient information to make it feasible
for reviewers to provide useful feedback. The history you provided is a great
start, but that's still largely just background information. For a feature as
messy and subjective as APERF/MPERF, I think we need at least the following:
1. What use cases are being targeted (e.g. because targeting only SoH would
allow for a different implementation).
2. The exact requirements, especially with respect to host usage. And the
motivation behind those requirements.
3. The high level design choices, and what, if any, alternatives were considered.
4. Basic rules of thumb for what is/isn't accounted in APERF/MPERF, so that it's
feasible to actually maintain support long-term.
E.g. does the host need to retain access to APERF/MPERF at all times? If so, why?
Do we care about host kernel accesses, e.g. in the scheduler, or just userspace
accesses, e.g. turbostat?
What information is the host intended to see? E.g. should APERF and MPERF stop
when transitioning to the guest? If not, what are the intended semantics for the
host's view when running VMs with HLT-exiting disabled? If the host's view of
APERF and MPERF account guest time, how does that mesh with upcoming mediated PMU,
where the host is disallowed from observing the guest?
Is there a plan for supporting VMs with a different TSC frequency than the host?
How will live migration work, without generating too much slop/skew between MPERF
and GUEST_TSC?
I don't expect the series to answer every possible question upfront, but the RFC
provided _nothing_, just a "here's what we implemented, please review".
* Re: [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs
2024-12-06 16:34 ` Sean Christopherson
@ 2024-12-18 22:23 ` Jim Mattson
2025-01-13 19:15 ` Sean Christopherson
0 siblings, 1 reply; 35+ messages in thread
From: Jim Mattson @ 2024-12-18 22:23 UTC (permalink / raw)
To: Sean Christopherson
Cc: Mingwei Zhang, Paolo Bonzini, Huang Rui, Gautham R. Shenoy,
Mario Limonciello, Rafael J. Wysocki, Viresh Kumar,
Srinivas Pandruvada, Len Brown, H. Peter Anvin, Perry Yuan, kvm,
linux-kernel, linux-pm
On Fri, Dec 6, 2024 at 8:34 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Dec 04, 2024, Jim Mattson wrote:
> > Wherever the context-switching happens, I contend that there is no
> > "clean" virtualization of APERF. If it comes down to just a question
> > of VM-entry/VM-exit or vcpu_load()/vcpu_put(), we can collect some
> > performance numbers and try to come to a consensus, but if you're
> > fundamentally opposed to virtualizing APERF, because it's messy, then
> > I don't see any point in pursuing this further.
>
> I'm not fundamentally opposed to virtualizing the feature. My complaints with
> the series are that it doesn't provide sufficient information to make it feasible
> for reviewers to provide useful feedback. The history you provided is a great
> start, but that's still largely just background information. For a feature as
> messy and subjective as APERF/MPERF, I think we need at least the following:
>
> 1. What use cases are being targeted (e.g. because targeting only SoH would
> allow for a different implementation).
> 2. The exact requirements, especially with respect to host usage. And the
> motivation behind those requirements.
> 3. The high level design choices, and what, if any, alternatives were considered.
> 4. Basic rules of thumb for what is/isn't accounted in APERF/MPERF, so that it's
> feasible to actually maintain support long-term.
>
> E.g. does the host need to retain access to APERF/MPERF at all times? If so, why?
> Do we care about host kernel accesses, e.g. in the scheduler, or just userspace
> accesses, e.g. turbostat?
>
> What information is the host intended to see? E.g. should APERF and MPERF stop
> when transitioning to the guest? If not, what are the intended semantics for the
> host's view when running VMs with HLT-exiting disabled? If the host's view of
> APERF and MPERF account guest time, how does that mesh with upcoming mediated PMU,
> where the host is disallowed from observing the guest?
>
> Is there a plan for supporting VMs with a different TSC frequency than the host?
> How will live migration work, without generating too much slop/skew between MPERF
> and GUEST_TSC?
>
> I don't expect the series to answer every possible question upfront, but the RFC
> provided _nothing_, just a "here's what we implemented, please review".
Sean,
Thanks for the thoughtful pushback. You're right: the RFC needs more
context about our design choices and rationale.
My response has been delayed as I've been researching the reads of
IA32_APERF and IA32_MPERF on every scheduler tick. Thanks for pointing
me to commit 1567c3e3467c ("x86, sched: Add support for frequency
invariance") off-list. Frequency-invariant scheduling was, in fact,
the original source of these RDMSRs.
The code in the aforementioned commit would have allowed us to
virtualize APERFMPERF without triggering these frequent MSR reads,
because arch_scale_freq_tick() had an early return if
arch_scale_freq_invariant() was false. And since KVM does not
virtualize MSR_TURBO_RATIO_LIMIT, arch_scale_freq_invariant() would be
false in a VM.
However, in commit bb6e89df9028 ("x86/aperfmperf: Make parts of the
frequency invariance code unconditional"), the early return was
weakened to only bail out if
cpu_feature_enabled(X86_FEATURE_APERFMPERF) is false. Hence, even if
frequency-invariant scheduling is disabled, Linux will read the MSRs
every scheduler tick when APERFMPERF is available.
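For reference, the tick path now looks roughly like this (abridged):

void arch_scale_freq_tick(void)
{
	u64 aperf, mperf;

	/* Pre-bb6e89df9028, this also bailed when invariance was off. */
	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
		return;

	rdmsrl(MSR_IA32_APERF, aperf);
	rdmsrl(MSR_IA32_MPERF, mperf);

	/* ... update the per-CPU sample used by /proc/cpuinfo et al. ... */
}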
As we discussed off-list, it appears that the primary motivation for
this change was to minimize the crosscalls executed when examining
/proc/cpuinfo. I don't really think that use case justifies reading
these MSRs *every scheduler tick*, but I'm admittedly biased.
1. Guest Requirements
Unlike vPMU, which is primarily a development tool, our customers want
APERFMPERF enabled on their production VMs, and they are unwilling to
trade any amount of performance for the feature. They don't want
frequency-invariant scheduling; they just want to observe the
effective frequency (possibly via /proc/cpuinfo).
These requests are not limited to slice-of-hardware VMs. No one can
tell me what customers expect with respect to KVM "steal time," but it
seems to me that it would be disingenuous to ignore "steal time." By
analogy with HDC, the effective frequency should drop to zero when the
vCPU is "forced idle."
2. Host Requirements
The host needs reliable APERF/MPERF access for:
- Frequency-invariant scheduling
- Monitoring through /proc/cpuinfo
- Turbostat, maybe?
Our goal was for host APERFMPERF to work as it always has, counting
both host cycles and guest cycles. We lose cycles on every WRMSR, but
most of the time, the loss should be very small relative to the
measurement.
To be honest, we had not even considered breaking APERF/MPERF on the
host. We didn't think such an approach would have any chance of
upstream acceptance.
3. Design Choices
We evaluated three approaches:
a) Userspace virtualization via MSR filtering
This approach was implemented before we knew about
frequency-invariant scheduling. Because of the frequent guest
reads, we observed a 10-16% performance hit, depending on vCPU
count. The performance impact was exacerbated by contention for a
legacy PIC mutex on KVM_RUN, but even if the mutex were replaced
with a reader/writer lock, the performance impact would be too
high. Hence, we abandoned this approach.
b) KVM intercepts RDMSR of APERF/MPERF
This approach was ruled out by back-of-the-envelope
calculation. We're not going to be able to provide this feature for
free, but we could argue that 0.01% overhead is negligible. On a 2
GHz processor that gives us a budget of 200,000 cycles per
second. With a 250 Hz guest tick generating 500 RDMSR intercepts
per second, we have a budget of just 400 cycles per
intercept. That's likely to be insufficient for most platforms. A
guest with CONFIG_HZ_1000 would drop the budget to just 100 cycles
per intercept. That's unachievable.
We should have a discussion about just how much overhead is
negligible, and that may open the door to other implementation
options.
c) Current RDMSR pass-through approach
The biggest downside is the loss of cycles on every WRMSR. An NMI
or SMI in the critical region could result in millions of lost
cycles. However, the damage only persists until all in-progress
measurements are completed.
We had considered context-switching host and guest values on
VM-entry and VM-exit. This would have kept everything within KVM,
as long as the host doesn't access the MSRs during an NMI or
SMI. However, 4 additional RDMSRs and 4 additional WRMSRs on a
VM-enter/VM-exit round-trip would have blown the budget. Even
without APERFMPERF, an active guest vCPU takes a minimum of two
VM-exits per timer tick, so we have even less budget per
VM-enter/VM-exit round-trip than we had per RDMSR intercept in (b).
Internally, we have already moved the mediated vPMU context-switch
from VM-entry/VM-exit to the KVM_RUN loop boundaries, so it seemed
natural to do the same for APERFMPERF. I don't have a
back-of-the-envelope calculation for this overhead, but I have run
Virtuozzo's cpuid_rate benchmark in a guest with and without
APERFMPERF, 100 times for each configuration, and a Student's
t-test showed that there is no statistically significant difference
between the means of the two datasets.
4. APERF/MPERF Accounting
Virtual MPERF cycles are easy to define. They accumulate at the
virtual TSC frequency as long as the vCPU is in C0. There are only
a few ways the vCPU can leave C0. If HLT or MWAIT exiting is
disabled, then the vCPU can leave C0 in VMX non-root operation (or
AMD guest mode). If HLT exiting is enabled, then the vCPU will
leave C0 when a HLT instruction is intercepted, and it will reenter
C0 when it receives an interrupt (or a PV kick) and starts running
again.
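Concretely, the background MPERF accounting reduces to a sketch like
this, assuming CONSTANT_TSC/NONSTOP_TSC and no TSC scaling (the vcpu
fields are illustrative):

static void guest_mperf_accrue(struct kvm_vcpu *vcpu)
{
	u64 now = rdtsc();

	/*
	 * A runnable-but-not-running vCPU is still in C0, so its MPERF
	 * keeps ticking at the TSC frequency; a halted vCPU's does not.
	 */
	if (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE)
		vcpu->arch.guest_mperf += now - vcpu->arch.mperf_checkpoint;
	vcpu->arch.mperf_checkpoint = now;
}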
Virtual APERF cycles are more ambiguous, especially in VMX root
operation (or AMD host mode). I think we can all agree that they
should accumulate at some non-zero rate as long as the code being
executed on the logical processor contributes in some way to guest
vCPU progress, but should the virtual APERF accumulate cycles at
the same frequency as the physical APERF? Probably not. Ultimately,
the decision was pragmatic. Virtual APERF accumulates at the same
rate as physical APERF while the guest context is live in the
MSR. Doing anything else would have been too expensive.
5. Live Migration
The IA32_MPERF MSR is serialized independently of the
IA32_TIME_STAMP_COUNTER MSR. Yes, this means that the two MSRs do
not advance in lock step across live migration, but this is no
different from a general purpose vPMU counter programmed to count
"unhalted reference cycles." In general, our implementation of
guest IA32_MPERF is far superior to the vPMU implementation of
"unhalted reference cycles."
6. Guest TSC Scaling
It is not possible to support TSC scaling with IA32_MPERF
RDMSR-passthrough on Intel CPUs, because reads of IA32_MPERF in VMX
non-root operation are not scaled by the hardware. It is possible
to support TSC scaling with IA32_MPERF RDMSR-passthrough on AMD
CPUs, but the implementation is left as an exercise for the reader.
* Re: [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs
2024-12-18 22:23 ` Jim Mattson
@ 2025-01-13 19:15 ` Sean Christopherson
0 siblings, 0 replies; 35+ messages in thread
From: Sean Christopherson @ 2025-01-13 19:15 UTC (permalink / raw)
To: Jim Mattson
Cc: Mingwei Zhang, Paolo Bonzini, Huang Rui, Gautham R. Shenoy,
Mario Limonciello, Rafael J. Wysocki, Viresh Kumar,
Srinivas Pandruvada, Len Brown, H. Peter Anvin, Perry Yuan, kvm,
linux-kernel, linux-pm
On Wed, Dec 18, 2024, Jim Mattson wrote:
> On Fri, Dec 6, 2024 at 8:34 AM Sean Christopherson <seanjc@google.com> wrote:
> As we discussed off-list, it appears that the primary motivation for
> this change was to minimize the crosscalls executed when examining
> /proc/cpuinfo. I don't really think that use case justifies reading
> these MSRs *every scheduler tick*, but I'm admittedly biased.
Heh, yeah, we missed that boat by ~2 years. Or maybe KVM's "slow" emulation
would only have further angered the x86 maintainers :-)
> 1. Guest Requirements
>
> Unlike vPMU, which is primarily a development tool, our customers want
> APERFMPERF enabled on their production VMs, and they are unwilling to
> trade any amount of performance for the feature. They don't want
> frequency-invariant scheduling; they just want to observe the
> effective frequency (possibly via /proc/cpuinfo).
>
> These requests are not limited to slice-of-hardware VMs. No one can
> tell me what customers expect with respect to KVM "steal time," but it
> seems to me that it would be disingenuous to ignore "steal time." By
> analogy with HDC, the effective frequency should drop to zero when the
> vCPU is "forced idle."
>
> 2. Host Requirements
>
> The host needs reliable APERF/MPERF access for:
> - Frequency-invariant scheduling
> - Monitoring through /proc/cpuinfo
> - Turbostat, maybe?
>
> Our goal was for host APERFMPERF to work as it always has, counting
> both host cycles and guest cycles. We lose cycles on every WRMSR, but
> most of the time, the loss should be very small relative to the
> measurement.
>
> To be honest, we had not even considered breaking APERF/MPERF on the
> host. We didn't think such an approach would have any chance of
> upstream acceptance.
FWIW, my stance on gifting features to KVM guests is that it's a-ok so long as it
requires an explicit opt-in from the system admin, and that it's decoupled from
KVM. E.g. add a flag (or KConfig) to disable APERF/MPERF usage, at which point
there's no good reason to prevent KVM from virtualizing the feature.
Unfortunately, my idea of hiding a feature from the kernel has never panned out,
because apparently there's no feature that Linux can't squeeze some amount of
usefulness out of. :-)
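FWIW, a minimal shape for that opt-out would be something like this
(hypothetical parameter name):

/* Hide APERF/MPERF from the host kernel so KVM can gift it away. */
static int __init noaperfmperf_setup(char *str)
{
	setup_clear_cpu_cap(X86_FEATURE_APERFMPERF);
	return 0;
}
early_param("noaperfmperf", noaperfmperf_setup);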
> 3. Design Choices
>
> We evaluated three approaches:
>
> a) Userspace virtualization via MSR filtering
>
> This approach was implemented before we knew about
> frequency-invariant scheduling. Because of the frequent guest
> reads, we observed a 10-16% performance hit, depending on vCPU
> count. The performance impact was exacerbated by contention for a
> legacy PIC mutex on KVM_RUN, but even if the mutex were replaced
> with a reader/writer lock, the performance impact would be too
> high. Hence, we abandoned this approach.
>
> b) KVM intercepts RDMSR of APERF/MPERF
>
> This approach was ruled out by back-of-the-envelope
> calculation. We're not going to be able to provide this feature for
> free, but we could argue that 0.01% overhead is negligible. On a 2
> GHz processor that gives us a budget of 200,000 cycles per
> second. With a 250 Hz guest tick generating 500 RDMSR intercepts
> per second, we have a budget of just 400 cycles per
> intercept. That's likely to be insufficient for most platforms. A
> guest with CONFIG_HZ_1000 would drop the budget to just 100 cycles
> per intercept. That's unachievable.
I think we'd actually have a bit more headroom. The overhead would be relative
to bare metal, not absolute. RDMSR is typically ~80 cycles, so even if we are
super duper strict in how that 0.01% overhead is accounted, KVM would have more
like 150+ cycles? But I'm mostly just being pedantic, I'm pretty sure AMD CPUs
can't achieve 400 cycle roundtrips, i.e. hardware alone would exhaust the budget.
> We should have a discussion about just how much overhead is
> negligible, and that may open the door to other implementation
> options.
>
> c) Current RDMSR pass-through approach
>
> The biggest downside is the loss of cycles on every WRMSR. An NMI
> or SMI in the critical region could result in millions of lost
> cycles. However, the damage only persists until all in-progress
> measurements are completed.
FWIW, the NMI problem is solvable, e.g. by bumping a sequence counter if the CPU
takes an NMI in the critical section, and then retrying until there are no NMIs
(or maybe retry a very limited number of times to avoid creating a set of problems
that could be worse than the loss in accuracy).
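I.e., something along these lines (hand-wavy sketch; the NMI handler
hook is omitted and all names are invented):

/* Bumped by the NMI handler when it interrupts the switch window. */
static DEFINE_PER_CPU(unsigned int, aperf_nmi_count);

static void switch_aperf_to_guest(u64 guest_aperf, u64 *host_aperf)
{
	unsigned int seq;
	int retries = 4;

	/* Resample until no NMI has landed since the host sample. */
	do {
		seq = this_cpu_read(aperf_nmi_count);
		rdmsrl(MSR_IA32_APERF, *host_aperf);
	} while (seq != this_cpu_read(aperf_nmi_count) && --retries);

	/* An NMI can still land here, but the window is one WRMSR. */
	wrmsrl(MSR_IA32_APERF, guest_aperf);
}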
> We had considered context-switching host and guest values on
> VM-entry and VM-exit. This would have kept everything within KVM,
> as long as the host doesn't access the MSRs during an NMI or
> SMI. However, 4 additional RDMSRs and 4 additional WRMSRs on a
> VM-enter/VM-exit round-trip would have blown the budget. Even
> without APERFMPERF, an active guest vCPU takes a minimum of two
> VM-exits per timer tick, so we have even less budget per
> VM-enter/VM-exit round-trip than we had per RDMSR intercept in (b).
>
> Internally, we have already moved the mediated vPMU context-switch
> from VM-entry/VM-exit to the KVM_RUN loop boundaries, so it seemed
> natural to do the same for APERFMPERF. I don't have a
> back-of-the-envelope calculation for this overhead, but I have run
> Virtuozzo's cpuid_rate benchmark in a guest with and without
> APERFMPERF, 100 times for each configuration, and a Student's
> t-test showed that there is no statistically significant difference
> between the means of the two datasets.
>
> 4. APERF/MPERF Accounting
>
> Virtual MPERF cycles are easy to define. They accumulate at the
> virtual TSC frequency as long as the vCPU is in C0. There are only
> a few ways the vCPU can leave C0. If HLT or MWAIT exiting is
> disabled, then the vCPU can leave C0 in VMX non-root operation (or
> AMD guest mode). If HLT exiting is enabled, then the vCPU will
> leave C0 when a HLT instruction is intercepted, and it will reenter
> C0 when it receives an interrupt (or a PV kick) and starts running
> again.
>
> Virtual APERF cycles are more ambiguous, especially in VMX root
> operation (or AMD host mode). I think we can all agree that they
> should accumulate at some non-zero rate as long as the code being
> executed on the logical processor contributes in some way to guest
> vCPU progress, but should the virtual APERF accumulate cycles at
> the same frequency as the physical APERF? Probably not. Ultimately,
> the decision was pragmatic. Virtual APERF accumulates at the same
> rate as physical APERF while the guest context is live in the
> MSR. Doing anything else would have been too expensive.
Hmm, I'm ok stopping virtual APERF while the vCPU task is in userspace, and the
more I poke at it, the more I agree it's the only sane approach. However, I most
definitely want to document the various gotchas with the alternative.
At first glance, keeping KVM's preempt notifier registered on exits to userspace
would be very doable, but there are lurking complexities that make it very
unpalatable when digging deeper. E.g. handling the case where userspace
invokes KVM_RUN on a different task+CPU would likely require a per-CPU spinlock,
which is all kinds of gross. And userspace would need a way to disassociate a
task from a vCPU.
Maybe this would be a good candidate for Paolo's idea of using the merge commit
to capture information that doesn't belong in Documentation, but that is too
specific/detailed for a single commit's changelog.
> 5. Live Migration
>
> The IA32_MPERF MSR is serialized independently of the
> IA32_TIME_STAMP_COUNTER MSR. Yes, this means that the two MSRs do
> not advance in lock step across live migration, but this is no
> different from a general purpose vPMU counter programmed to count
> "unhalted reference cycles." In general, our implementation of
> guest IA32_MPERF is far superior to the vPMU implementation of
> "unhalted reference cycles."
Aha! The SDM gives us an out:
Only the IA32_APERF/IA32_MPERF ratio is architecturally defined; software should
not attach meaning to the content of the individual of IA32_APERF or IA32_MPERF
MSRs.
While the SDM kinda sorta implies that MPERF and TSC will operate in lock-step,
the above gives me confidence that some amount of drift is tolerable.
Off-list you floated the idea of tying save/restore to TSC as an offset, but I
think that's unnecessary complexity on two fronts. First, the writes to TSC and
MPERF must happen separately, so even if KVM does back-to-back WRMSRs, some amount
of drift is inevitable. Second, because virtual TSC doesn't stop on vcpu_{load,put},
there will be non-trivial drift irrespective of migration (and it might even be
worse?).
> 6. Guest TSC Scaling
>
> It is not possible to support TSC scaling with IA32_MPERF
> RDMSR-passthrough on Intel CPUs, because reads of IA32_MPERF in VMX
> non-root operation are not scaled by the hardware. It is possible
> to support TSC scaling with IA32_MPERF RDMSR-passthrough on AMD
> CPUs, but the implementation is left as an exercise for the reader.
So, what's the proposed solution? Either the limitation needs to be documented
as a KVM erratum, or KVM needs to actively prevent APERF/MPERF virtualization if
TSC scaling is in effect. I can't think of a third option off the top of my
head.
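The latter could be as simple as a check at feature-enablement time,
e.g. (sketch; field names approximate):

static bool kvm_can_virtualize_aperfmperf(struct kvm_vcpu *vcpu)
{
	/*
	 * MPERF passthrough can't follow a scaled guest TSC on Intel,
	 * so bail if userspace set a non-host TSC frequency.
	 */
	return vcpu->arch.virtual_tsc_khz == tsc_khz;
}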
I'm not sure how I feel about taking an erratum for this one. The SDM explicitly
states, in multiple places, that MPERF counts at a fixed frequency, e.g.
IA32_MPERF MSR (E7H) increments in proportion to a fixed frequency, which is
configured when the processor is booted.
Drift between TSC and MPERF is one thing, having MPERF suddenly count at a
different frequency is problematic on a different level.