From: Dongli Zhang <dongli.zhang@oracle.com>
To: qemu-devel@nongnu.org, kvm@vger.kernel.org, qemu-arm@nongnu.org,
qemu-ppc@nongnu.org, qemu-riscv@nongnu.org,
qemu-s390x@nongnu.org
Cc: pbonzini@redhat.com, zhao1.liu@intel.com, mtosatti@redhat.com,
sandipan.das@amd.com, babu.moger@amd.com, likexu@tencent.com,
like.xu.linux@gmail.com, groug@kaod.org, khorenko@virtuozzo.com,
alexander.ivanov@virtuozzo.com, den@virtuozzo.com,
davydov-max@yandex-team.ru, xiaoyao.li@intel.com,
dapeng1.mi@linux.intel.com, joe.jin@oracle.com,
peter.maydell@linaro.org, gaosong@loongson.cn,
chenhuacai@kernel.org, philmd@linaro.org, aurelien@aurel32.net,
jiaxun.yang@flygoat.com, arikalo@gmail.com, npiggin@gmail.com,
danielhb413@gmail.com, palmer@dabbelt.com,
alistair.francis@wdc.com, liwei1518@gmail.com,
zhiwei_liu@linux.alibaba.com, pasic@linux.ibm.com,
borntraeger@linux.ibm.com, richard.henderson@linaro.org,
david@redhat.com, iii@linux.ibm.com, thuth@redhat.com,
flavra@baylibre.com, ewanhai-oc@zhaoxin.com, ewanhai@zhaoxin.com,
cobechen@zhaoxin.com, louisqi@zhaoxin.com, liamni@zhaoxin.com,
frankzhu@zhaoxin.com, silviazhao@zhaoxin.com, kraxel@redhat.com,
berrange@redhat.com
Subject: [PATCH v5 08/10] target/i386/kvm: reset AMD PMU registers during VM reset
Date: Fri, 25 Apr 2025 14:30:05 -0700 [thread overview]
Message-ID: <20250425213037.8137-9-dongli.zhang@oracle.com> (raw)
In-Reply-To: <20250425213037.8137-1-dongli.zhang@oracle.com>
QEMU uses the kvm_get_msrs() function to save Intel PMU registers from KVM
and kvm_put_msrs() to restore them to KVM. However, there is no support for
AMD PMU registers. Currently, pmu_version and num_pmu_gp_counters are
initialized based on cpuid(0xa), which does not apply to AMD processors.
For AMD CPUs, prior to PerfMonV2, the number of general-purpose registers
is determined based on the CPU version.
To address this issue, we need to add support for AMD PMU registers.
Without this support, the following problems can arise:
1. If the VM is reset (e.g., via QEMU system_reset or VM kdump/kexec) while
running "perf top", the PMU registers are not disabled properly.
2. Despite x86_cpu_reset() resetting many registers to zero, kvm_put_msrs()
does not handle AMD PMU registers, causing some PMU events to remain
enabled in KVM.
3. The KVM kvm_pmc_speculative_in_use() function consistently returns true,
preventing the reclamation of these events. Consequently, the
kvm_pmc->perf_event remains active.
4. After a reboot, the VM kernel may report the following error:
[ 0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS detected, complain to your hardware vendor.
[ 0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR c0010200 is 530076)
5. In the worst case, the active kvm_pmc->perf_event may inject unknown
NMIs randomly into the VM kernel:
[...] Uhhuh. NMI received for unknown reason 30 on CPU 0.
To resolve these issues, we propose resetting AMD PMU registers during the
VM reset process.
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Sandipan Das <sandipan.das@amd.com>
---
Changed since v1:
- Modify "MSR_K7_EVNTSEL0 + 3" and "MSR_K7_PERFCTR0 + 3" by using
AMD64_NUM_COUNTERS (suggested by Sandipan Das).
- Use "AMD64_NUM_COUNTERS_CORE * 2 - 1", not "MSR_F15H_PERF_CTL0 + 0xb".
(suggested by Sandipan Das).
- Switch back to "-pmu" instead of using a global "pmu-cap-disabled".
- Don't initialize PMU info if kvm.enable_pmu=N.
Changed since v2:
- Remove 'static' from host_cpuid_vendorX.
- Change has_pmu_version to pmu_version.
- Use object_property_get_int() to get CPU family.
- Use cpuid_find_entry() instead of cpu_x86_cpuid().
- Send error log when host and guest are from different vendors.
- Move "if (!cpu->enable_pmu)" to begin of function. Add comments to
reminder developers.
- Add support to Zhaoxin. Change is_same_vendor() to
is_host_compat_vendor().
- Didn't add Reviewed-by from Sandipan because the change isn't minor.
Changed since v3:
- Use host_cpu_vendor_fms() from Zhao's patch.
- Check AMD directly makes the "compat" rule clear.
- Add comment to MAX_GP_COUNTERS.
- Skip PMU info initialization if !kvm_pmu_disabled.
Changed since v4:
- Add Reviewed-by from Zhao and Sandipan.
target/i386/cpu.h | 12 +++
target/i386/kvm/kvm.c | 175 +++++++++++++++++++++++++++++++++++++++++-
2 files changed, 183 insertions(+), 4 deletions(-)
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 9866595cd0..e85f43a4b9 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -490,6 +490,14 @@ typedef enum X86Seg {
#define MSR_CORE_PERF_GLOBAL_CTRL 0x38f
#define MSR_CORE_PERF_GLOBAL_OVF_CTRL 0x390
+#define MSR_K7_EVNTSEL0 0xc0010000
+#define MSR_K7_PERFCTR0 0xc0010004
+#define MSR_F15H_PERF_CTL0 0xc0010200
+#define MSR_F15H_PERF_CTR0 0xc0010201
+
+#define AMD64_NUM_COUNTERS 4
+#define AMD64_NUM_COUNTERS_CORE 6
+
#define MSR_MC0_CTL 0x400
#define MSR_MC0_STATUS 0x401
#define MSR_MC0_ADDR 0x402
@@ -1608,6 +1616,10 @@ typedef struct {
#endif
#define MAX_FIXED_COUNTERS 3
+/*
+ * This formula is based on Intel's MSR. The current size also meets AMD's
+ * needs.
+ */
#define MAX_GP_COUNTERS (MSR_IA32_PERF_STATUS - MSR_P6_EVNTSEL0)
#define NB_OPMASK_REGS 8
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index a173b345b9..2c7b0f6717 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -2077,7 +2077,7 @@ int kvm_arch_pre_create_vcpu(CPUState *cpu, Error **errp)
return 0;
}
-static void kvm_init_pmu_info(struct kvm_cpuid2 *cpuid)
+static void kvm_init_pmu_info_intel(struct kvm_cpuid2 *cpuid)
{
struct kvm_cpuid_entry2 *c;
@@ -2110,6 +2110,96 @@ static void kvm_init_pmu_info(struct kvm_cpuid2 *cpuid)
}
}
+static void kvm_init_pmu_info_amd(struct kvm_cpuid2 *cpuid, X86CPU *cpu)
+{
+ struct kvm_cpuid_entry2 *c;
+ int64_t family;
+
+ family = object_property_get_int(OBJECT(cpu), "family", NULL);
+ if (family < 0) {
+ return;
+ }
+
+ if (family < 6) {
+ error_report("AMD performance-monitoring is supported from "
+ "K7 and later");
+ return;
+ }
+
+ pmu_version = 1;
+ num_pmu_gp_counters = AMD64_NUM_COUNTERS;
+
+ c = cpuid_find_entry(cpuid, 0x80000001, 0);
+ if (!c) {
+ return;
+ }
+
+ if (!(c->ecx & CPUID_EXT3_PERFCORE)) {
+ return;
+ }
+
+ num_pmu_gp_counters = AMD64_NUM_COUNTERS_CORE;
+}
+
+static bool is_host_compat_vendor(CPUX86State *env)
+{
+ char host_vendor[CPUID_VENDOR_SZ + 1];
+
+ host_cpu_vendor_fms(host_vendor, NULL, NULL, NULL);
+
+ /*
+ * Intel and Zhaoxin are compatible.
+ */
+ if ((g_str_equal(host_vendor, CPUID_VENDOR_INTEL) ||
+ g_str_equal(host_vendor, CPUID_VENDOR_ZHAOXIN1) ||
+ g_str_equal(host_vendor, CPUID_VENDOR_ZHAOXIN2)) &&
+ (IS_INTEL_CPU(env) || IS_ZHAOXIN_CPU(env))) {
+ return true;
+ }
+
+ return g_str_equal(host_vendor, CPUID_VENDOR_AMD) &&
+ IS_AMD_CPU(env);
+}
+
+static void kvm_init_pmu_info(struct kvm_cpuid2 *cpuid, X86CPU *cpu)
+{
+ CPUX86State *env = &cpu->env;
+
+ /*
+ * The PMU virtualization is disabled by kvm.enable_pmu=N.
+ */
+ if (kvm_pmu_disabled) {
+ return;
+ }
+
+ /*
+ * If KVM_CAP_PMU_CAPABILITY is not supported, there is no way to
+ * disable the AMD PMU virtualization.
+ *
+ * Assume the user is aware of this when !cpu->enable_pmu. AMD PMU
+ * registers are not going to reset, even they are still available to
+ * guest VM.
+ */
+ if (!cpu->enable_pmu) {
+ return;
+ }
+
+ /*
+ * It is not supported to virtualize AMD PMU registers on Intel
+ * processors, nor to virtualize Intel PMU registers on AMD processors.
+ */
+ if (!is_host_compat_vendor(env)) {
+ error_report("host doesn't support requested feature: vPMU");
+ return;
+ }
+
+ if (IS_INTEL_CPU(env) || IS_ZHAOXIN_CPU(env)) {
+ kvm_init_pmu_info_intel(cpuid);
+ } else if (IS_AMD_CPU(env)) {
+ kvm_init_pmu_info_amd(cpuid, cpu);
+ }
+}
+
int kvm_arch_init_vcpu(CPUState *cs)
{
struct {
@@ -2292,7 +2382,7 @@ int kvm_arch_init_vcpu(CPUState *cs)
cpuid_i = kvm_x86_build_cpuid(env, cpuid_data.entries, cpuid_i);
cpuid_data.cpuid.nent = cpuid_i;
- kvm_init_pmu_info(&cpuid_data.cpuid);
+ kvm_init_pmu_info(&cpuid_data.cpuid, cpu);
if (((env->cpuid_version >> 8)&0xF) >= 6
&& (env->features[FEAT_1_EDX] & (CPUID_MCE | CPUID_MCA)) ==
@@ -4072,7 +4162,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, env->poll_control_msr);
}
- if (pmu_version > 0) {
+ if ((IS_INTEL_CPU(env) || IS_ZHAOXIN_CPU(env)) && pmu_version > 0) {
if (pmu_version > 1) {
/* Stop the counter. */
kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
@@ -4103,6 +4193,38 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
env->msr_global_ctrl);
}
}
+
+ if (IS_AMD_CPU(env) && pmu_version > 0) {
+ uint32_t sel_base = MSR_K7_EVNTSEL0;
+ uint32_t ctr_base = MSR_K7_PERFCTR0;
+ /*
+ * The address of the next selector or counter register is
+ * obtained by incrementing the address of the current selector
+ * or counter register by one.
+ */
+ uint32_t step = 1;
+
+ /*
+ * When PERFCORE is enabled, AMD PMU uses a separate set of
+ * addresses for the selector and counter registers.
+ * Additionally, the address of the next selector or counter
+ * register is determined by incrementing the address of the
+ * current register by two.
+ */
+ if (num_pmu_gp_counters == AMD64_NUM_COUNTERS_CORE) {
+ sel_base = MSR_F15H_PERF_CTL0;
+ ctr_base = MSR_F15H_PERF_CTR0;
+ step = 2;
+ }
+
+ for (i = 0; i < num_pmu_gp_counters; i++) {
+ kvm_msr_entry_add(cpu, ctr_base + i * step,
+ env->msr_gp_counters[i]);
+ kvm_msr_entry_add(cpu, sel_base + i * step,
+ env->msr_gp_evtsel[i]);
+ }
+ }
+
/*
* Hyper-V partition-wide MSRs: to avoid clearing them on cpu hot-add,
* only sync them to KVM on the first cpu
@@ -4550,7 +4672,8 @@ static int kvm_get_msrs(X86CPU *cpu)
if (env->features[FEAT_KVM] & CPUID_KVM_POLL_CONTROL) {
kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, 1);
}
- if (pmu_version > 0) {
+
+ if ((IS_INTEL_CPU(env) || IS_ZHAOXIN_CPU(env)) && pmu_version > 0) {
if (pmu_version > 1) {
kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
kvm_msr_entry_add(cpu, MSR_CORE_PERF_GLOBAL_CTRL, 0);
@@ -4566,6 +4689,35 @@ static int kvm_get_msrs(X86CPU *cpu)
}
}
+ if (IS_AMD_CPU(env) && pmu_version > 0) {
+ uint32_t sel_base = MSR_K7_EVNTSEL0;
+ uint32_t ctr_base = MSR_K7_PERFCTR0;
+ /*
+ * The address of the next selector or counter register is
+ * obtained by incrementing the address of the current selector
+ * or counter register by one.
+ */
+ uint32_t step = 1;
+
+ /*
+ * When PERFCORE is enabled, AMD PMU uses a separate set of
+ * addresses for the selector and counter registers.
+ * Additionally, the address of the next selector or counter
+ * register is determined by incrementing the address of the
+ * current register by two.
+ */
+ if (num_pmu_gp_counters == AMD64_NUM_COUNTERS_CORE) {
+ sel_base = MSR_F15H_PERF_CTL0;
+ ctr_base = MSR_F15H_PERF_CTR0;
+ step = 2;
+ }
+
+ for (i = 0; i < num_pmu_gp_counters; i++) {
+ kvm_msr_entry_add(cpu, ctr_base + i * step, 0);
+ kvm_msr_entry_add(cpu, sel_base + i * step, 0);
+ }
+ }
+
if (env->mcg_cap) {
kvm_msr_entry_add(cpu, MSR_MCG_STATUS, 0);
kvm_msr_entry_add(cpu, MSR_MCG_CTL, 0);
@@ -4877,6 +5029,21 @@ static int kvm_get_msrs(X86CPU *cpu)
case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL0 + MAX_GP_COUNTERS - 1:
env->msr_gp_evtsel[index - MSR_P6_EVNTSEL0] = msrs[i].data;
break;
+ case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL0 + AMD64_NUM_COUNTERS - 1:
+ env->msr_gp_evtsel[index - MSR_K7_EVNTSEL0] = msrs[i].data;
+ break;
+ case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR0 + AMD64_NUM_COUNTERS - 1:
+ env->msr_gp_counters[index - MSR_K7_PERFCTR0] = msrs[i].data;
+ break;
+ case MSR_F15H_PERF_CTL0 ...
+ MSR_F15H_PERF_CTL0 + AMD64_NUM_COUNTERS_CORE * 2 - 1:
+ index = index - MSR_F15H_PERF_CTL0;
+ if (index & 0x1) {
+ env->msr_gp_counters[index] = msrs[i].data;
+ } else {
+ env->msr_gp_evtsel[index] = msrs[i].data;
+ }
+ break;
case HV_X64_MSR_HYPERCALL:
env->msr_hv_hypercall = msrs[i].data;
break;
--
2.39.3
next prev parent reply other threads:[~2025-04-25 21:36 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-25 21:29 [PATCH v5 00/10] target/i386/kvm/pmu: PMU Enhancement, Bugfix and Cleanup Dongli Zhang
2025-04-25 21:29 ` [PATCH v5 01/10] target/i386: disable PerfMonV2 when PERFCORE unavailable Dongli Zhang
2025-04-25 21:29 ` [PATCH v5 02/10] target/i386: disable PERFCORE when "-pmu" is configured Dongli Zhang
2025-04-25 21:30 ` [PATCH v5 03/10] kvm: Introduce kvm_arch_pre_create_vcpu() Dongli Zhang
2025-04-25 21:30 ` [PATCH v5 04/10] target/i386/kvm: set KVM_PMU_CAP_DISABLE if "-pmu" is configured Dongli Zhang
2025-04-25 21:30 ` [PATCH v5 05/10] target/i386/kvm: extract unrelated code out of kvm_x86_build_cpuid() Dongli Zhang
2025-04-25 21:30 ` [PATCH v5 06/10] target/i386/kvm: rename architectural PMU variables Dongli Zhang
2025-04-25 21:30 ` [PATCH v5 07/10] target/i386/kvm: query kvm.enable_pmu parameter Dongli Zhang
2025-04-25 21:30 ` Dongli Zhang [this message]
2025-04-25 21:30 ` [PATCH v5 09/10] target/i386/kvm: support perfmon-v2 for reset Dongli Zhang
2025-04-25 21:30 ` [PATCH v5 10/10] target/i386/kvm: don't stop Intel PMU counters Dongli Zhang
2025-06-09 12:26 ` [PATCH v5 00/10] target/i386/kvm/pmu: PMU Enhancement, Bugfix and Cleanup Zhao Liu
2025-06-11 8:45 ` Dongli Zhang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250425213037.8137-9-dongli.zhang@oracle.com \
--to=dongli.zhang@oracle.com \
--cc=alexander.ivanov@virtuozzo.com \
--cc=alistair.francis@wdc.com \
--cc=arikalo@gmail.com \
--cc=aurelien@aurel32.net \
--cc=babu.moger@amd.com \
--cc=berrange@redhat.com \
--cc=borntraeger@linux.ibm.com \
--cc=chenhuacai@kernel.org \
--cc=cobechen@zhaoxin.com \
--cc=danielhb413@gmail.com \
--cc=dapeng1.mi@linux.intel.com \
--cc=david@redhat.com \
--cc=davydov-max@yandex-team.ru \
--cc=den@virtuozzo.com \
--cc=ewanhai-oc@zhaoxin.com \
--cc=ewanhai@zhaoxin.com \
--cc=flavra@baylibre.com \
--cc=frankzhu@zhaoxin.com \
--cc=gaosong@loongson.cn \
--cc=groug@kaod.org \
--cc=iii@linux.ibm.com \
--cc=jiaxun.yang@flygoat.com \
--cc=joe.jin@oracle.com \
--cc=khorenko@virtuozzo.com \
--cc=kraxel@redhat.com \
--cc=kvm@vger.kernel.org \
--cc=liamni@zhaoxin.com \
--cc=like.xu.linux@gmail.com \
--cc=likexu@tencent.com \
--cc=liwei1518@gmail.com \
--cc=louisqi@zhaoxin.com \
--cc=mtosatti@redhat.com \
--cc=npiggin@gmail.com \
--cc=palmer@dabbelt.com \
--cc=pasic@linux.ibm.com \
--cc=pbonzini@redhat.com \
--cc=peter.maydell@linaro.org \
--cc=philmd@linaro.org \
--cc=qemu-arm@nongnu.org \
--cc=qemu-devel@nongnu.org \
--cc=qemu-ppc@nongnu.org \
--cc=qemu-riscv@nongnu.org \
--cc=qemu-s390x@nongnu.org \
--cc=richard.henderson@linaro.org \
--cc=sandipan.das@amd.com \
--cc=silviazhao@zhaoxin.com \
--cc=thuth@redhat.com \
--cc=xiaoyao.li@intel.com \
--cc=zhao1.liu@intel.com \
--cc=zhiwei_liu@linux.alibaba.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).