[PATCH 0/7] target/i386/kvm/pmu: Enhancement, Bugfix and Cleanup

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/7] target/i386/kvm/pmu: Enhancement, Bugfix and Cleanup
@ 2024-11-04  9:40 Dongli Zhang
  2024-11-04  9:40 ` [PATCH 1/7] target/i386: disable PerfMonV2 when PERFCORE unavailable Dongli Zhang
                   ` (6 more replies)
  0 siblings, 7 replies; 32+ messages in thread
From: Dongli Zhang @ 2024-11-04  9:40 UTC (permalink / raw)
  To: qemu-devel, kvm
  Cc: pbonzini, mtosatti, sandipan.das, babu.moger, zhao1.liu, likexu,
	like.xu.linux, zhenyuw, groug, lyan, khorenko, alexander.ivanov,
	den, joe.jin, davydov-max

This patchset addresses three bugs related to AMD PMU virtualization.

1. The PerfMonV2 is still available if PERCORE if disabled via
"-cpu host,-perfctr-core".

2. The second issue is that using "-cpu host,-pmu" does not disable AMD PMU
virtualization. When using "-cpu EPYC" or "-cpu host,-pmu", AMD PMU
virtualization remains enabled. On the VM's Linux side, you might still
see:

[    0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.

instead of:

[    0.596381] Performance Events: PMU not available due to virtualization, using software events only.
[    0.600972] NMI watchdog: Perf NMI watchdog permanently disabled

To address this, we have introduced a new x86-specific accel/kvm property,
"pmu-cap-disabled=true", which disables PMU virtualization via
KVM_PMU_CAP_DISABLE.

Another previous solution to re-use '-cpu host,-pmu':
https://lore.kernel.org/all/20221119122901.2469-1-dongli.zhang@oracle.com/

3. The third issue is that unreclaimed performance events (after a QEMU
system_reset) in KVM may cause random, unwanted, or unknown NMIs to be
injected into the VM.

The AMD PMU registers are not reset during QEMU system_reset.

(1) If the VM is reset (e.g., via QEMU system_reset or VM kdump/kexec) while
running "perf top", the PMU registers are not disabled properly.

(2) Despite x86_cpu_reset() resetting many registers to zero, kvm_put_msrs()
does not handle AMD PMU registers, causing some PMU events to remain
enabled in KVM.

(3) The KVM kvm_pmc_speculative_in_use() function consistently returns true,
preventing the reclamation of these events. Consequently, the
kvm_pmc->perf_event remains active.

(4) After a reboot, the VM kernel may report the following error:

[    0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS detected, complain to your hardware vendor.
[    0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR c0010200 is 530076)

(5) In the worst case, the active kvm_pmc->perf_event may inject unknown
NMIs randomly into the VM kernel:

[...] Uhhuh. NMI received for unknown reason 30 on CPU 0.

To resolve these issues, we propose resetting AMD PMU registers during the
VM reset process

Dongli Zhang (7):
  target/i386: disable PerfMonV2 when PERFCORE unavailable
  target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE
  target/i386/kvm: init PMU information only once
  target/i386/kvm: rename architectural PMU variables
  target/i386/kvm: reset AMD PMU registers during VM reset
  target/i386/kvm: support perfmon-v2 for reset
  target/i386/kvm: don't stop Intel PMU counters

 accel/kvm/kvm-all.c        |   1 +
 include/sysemu/kvm_int.h   |   1 +
 qemu-options.hx            |   9 +-
 target/i386/cpu.c          |   3 +-
 target/i386/cpu.h          |  12 ++
 target/i386/kvm/kvm.c      | 340 ++++++++++++++++++++++++++++++++++------
 target/i386/kvm/kvm_i386.h |   2 +
 7 files changed, 319 insertions(+), 49 deletions(-)

base-commit: c94bee4cd6693c1c65ba43bb8970cf909dec378b

Thank you very much!

Dongli Zhang

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 1/7] target/i386: disable PerfMonV2 when PERFCORE unavailable
  2024-11-04  9:40 [PATCH 0/7] target/i386/kvm/pmu: Enhancement, Bugfix and Cleanup Dongli Zhang
@ 2024-11-04  9:40 ` Dongli Zhang
  2024-11-06  3:54   ` Zhao Liu
  2024-11-04  9:40 ` [PATCH 2/7] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE Dongli Zhang
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Dongli Zhang @ 2024-11-04  9:40 UTC (permalink / raw)
  To: qemu-devel, kvm
  Cc: pbonzini, mtosatti, sandipan.das, babu.moger, zhao1.liu, likexu,
	like.xu.linux, zhenyuw, groug, lyan, khorenko, alexander.ivanov,
	den, joe.jin, davydov-max

When the PERFCORE is disabled with "-cpu host,-perfctr-core", it is
reflected in in guest dmesg.

[    0.285136] Performance Events: AMD PMU driver.

However, the guest cpuid indicates the PerfMonV2 is still available.

CPU:
   Extended Performance Monitoring and Debugging (0x80000022):
      AMD performance monitoring V2         = true
      AMD LBR V2                            = false
      AMD LBR stack & PMC freezing          = false
      number of core perf ctrs              = 0x6 (6)
      number of LBR stack entries           = 0x0 (0)
      number of avail Northbridge perf ctrs = 0x0 (0)
      number of available UMC PMCs          = 0x0 (0)
      active UMCs bitmask                   = 0x0

Disable PerfMonV2 in cpuid when PERFCORE is disabled.

Fixes: 209b0ac12074 ("target/i386: Add PerfMonV2 feature bit")
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 target/i386/cpu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 3baa95481f..4490a7a8d6 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -7103,6 +7103,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
         *eax = *ebx = *ecx = *edx = 0;
         /* AMD Extended Performance Monitoring and Debug */
         if (kvm_enabled() && cpu->enable_pmu &&
+            (env->features[FEAT_8000_0001_ECX] & CPUID_EXT3_PERFCORE) &&
             (env->features[FEAT_8000_0022_EAX] & CPUID_8000_0022_EAX_PERFMON_V2)) {
             *eax |= CPUID_8000_0022_EAX_PERFMON_V2;
             *ebx |= kvm_arch_get_supported_cpuid(cs->kvm_state, index, count,
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 2/7] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE
  2024-11-04  9:40 [PATCH 0/7] target/i386/kvm/pmu: Enhancement, Bugfix and Cleanup Dongli Zhang
  2024-11-04  9:40 ` [PATCH 1/7] target/i386: disable PerfMonV2 when PERFCORE unavailable Dongli Zhang
@ 2024-11-04  9:40 ` Dongli Zhang
  2024-11-07  7:52   ` Zhao Liu
  2024-11-04  9:40 ` [PATCH 3/7] target/i386/kvm: init PMU information only once Dongli Zhang
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Dongli Zhang @ 2024-11-04  9:40 UTC (permalink / raw)
  To: qemu-devel, kvm
  Cc: pbonzini, mtosatti, sandipan.das, babu.moger, zhao1.liu, likexu,
	like.xu.linux, zhenyuw, groug, lyan, khorenko, alexander.ivanov,
	den, joe.jin, davydov-max

The AMD PMU virtualization is not disabled when configuring
"-cpu host,-pmu" in the QEMU command line on an AMD server. Neither
"-cpu host,-pmu" nor "-cpu EPYC" effectively disables AMD PMU
virtualization in such an environment.

As a result, VM logs typically show:

[    0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.

whereas the expected logs should be:

[    0.596381] Performance Events: PMU not available due to virtualization, using software events only.
[    0.600972] NMI watchdog: Perf NMI watchdog permanently disabled

This discrepancy occurs because AMD PMU does not use CPUID to determine
whether PMU virtualization is supported.

To address this, we introduce a new property, 'pmu-cap-disabled', for KVM
acceleration. This property sets KVM_PMU_CAP_DISABLE if
KVM_CAP_PMU_CAPABILITY is supported. Note that this feature currently
supports only x86 hosts, as KVM_CAP_PMU_CAPABILITY is used exclusively for
x86 systems.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
Another previous solution to re-use '-cpu host,-pmu':
https://lore.kernel.org/all/20221119122901.2469-1-dongli.zhang@oracle.com/

 accel/kvm/kvm-all.c        |  1 +
 include/sysemu/kvm_int.h   |  1 +
 qemu-options.hx            |  9 ++++++-
 target/i386/cpu.c          |  2 +-
 target/i386/kvm/kvm.c      | 52 ++++++++++++++++++++++++++++++++++++++
 target/i386/kvm/kvm_i386.h |  2 ++
 6 files changed, 65 insertions(+), 2 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 801cff16a5..8b5ba45cf7 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -3933,6 +3933,7 @@ static void kvm_accel_instance_init(Object *obj)
     s->xen_evtchn_max_pirq = 256;
     s->device = NULL;
     s->msr_energy.enable = false;
+    s->pmu_cap_disabled = false;
 }
 
 /**
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index a1e72763da..cdee1a481c 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -166,6 +166,7 @@ struct KVMState
     uint16_t xen_gnttab_max_frames;
     uint16_t xen_evtchn_max_pirq;
     char *device;
+    bool pmu_cap_disabled;
 };
 
 void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
diff --git a/qemu-options.hx b/qemu-options.hx
index dacc9790a4..9684424835 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -191,7 +191,8 @@ DEF("accel", HAS_ARG, QEMU_OPTION_accel,
     "                eager-split-size=n (KVM Eager Page Split chunk size, default 0, disabled. ARM only)\n"
     "                notify-vmexit=run|internal-error|disable,notify-window=n (enable notify VM exit and set notify window, x86 only)\n"
     "                thread=single|multi (enable multi-threaded TCG)\n"
-    "                device=path (KVM device path, default /dev/kvm)\n", QEMU_ARCH_ALL)
+    "                device=path (KVM device path, default /dev/kvm)\n"
+    "                pmu-cap-disabled=true|false (disable KVM_CAP_PMU_CAPABILITY, x86 only, default false)\n", QEMU_ARCH_ALL)
 SRST
 ``-accel name[,prop=value[,...]]``
     This is used to enable an accelerator. Depending on the target
@@ -277,6 +278,12 @@ SRST
         option can be used to pass the KVM device to use via a file descriptor
         by setting the value to ``/dev/fdset/NN``.
 
+    ``pmu-cap-disabled=true|false``
+        When using the KVM accelerator, it controls the disabling of
+        KVM_CAP_PMU_CAPABILITY via KVM_PMU_CAP_DISABLE. When this capability is
+        disabled, PMU virtualization is turned off at the KVM module level.
+        Note that this functionality is supported for x86 hosts only.
+
 ERST
 
 DEF("smp", HAS_ARG, QEMU_OPTION_smp,
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 4490a7a8d6..c97ed74a47 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -7102,7 +7102,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
     case 0x80000022:
         *eax = *ebx = *ecx = *edx = 0;
         /* AMD Extended Performance Monitoring and Debug */
-        if (kvm_enabled() && cpu->enable_pmu &&
+        if (kvm_enabled() && cpu->enable_pmu && !kvm_pmu_cap_disabled() &&
             (env->features[FEAT_8000_0001_ECX] & CPUID_EXT3_PERFCORE) &&
             (env->features[FEAT_8000_0022_EAX] & CPUID_8000_0022_EAX_PERFMON_V2)) {
             *eax |= CPUID_8000_0022_EAX_PERFMON_V2;
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 8e17942c3b..ed73b1e7e0 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -166,6 +166,7 @@ static bool has_msr_vmx_procbased_ctls2;
 static bool has_msr_perf_capabs;
 static bool has_msr_pkrs;
 static bool has_msr_hwcr;
+static bool has_pmu_cap;
 
 static uint32_t has_architectural_pmu_version;
 static uint32_t num_architectural_pmu_gp_counters;
@@ -3196,6 +3197,11 @@ static void kvm_vm_enable_energy_msrs(KVMState *s)
     return;
 }
 
+bool kvm_pmu_cap_disabled(void)
+{
+    return kvm_state->pmu_cap_disabled;
+}
+
 int kvm_arch_init(MachineState *ms, KVMState *s)
 {
     int ret;
@@ -3335,6 +3341,23 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
         }
     }
 
+    has_pmu_cap = kvm_check_extension(s, KVM_CAP_PMU_CAPABILITY);
+
+    if (s->pmu_cap_disabled) {
+        if (has_pmu_cap) {
+            ret = kvm_vm_enable_cap(s, KVM_CAP_PMU_CAPABILITY, 0,
+                                    KVM_PMU_CAP_DISABLE);
+            if (ret < 0) {
+                s->pmu_cap_disabled = false;
+                error_report("kvm: Failed to disable pmu cap: %s",
+                             strerror(-ret));
+            }
+        } else {
+            s->pmu_cap_disabled = false;
+            error_report("kvm: KVM_CAP_PMU_CAPABILITY is not supported");
+        }
+    }
+
     return 0;
 }
 
@@ -6525,6 +6548,29 @@ static void kvm_arch_set_xen_evtchn_max_pirq(Object *obj, Visitor *v,
     s->xen_evtchn_max_pirq = value;
 }
 
+static void kvm_set_pmu_cap_disabled(Object *obj, Visitor *v,
+                                     const char *name, void *opaque,
+                                     Error **errp)
+{
+    KVMState *s = KVM_STATE(obj);
+    bool pmu_cap_disabled;
+    Error *error = NULL;
+
+    if (s->fd != -1) {
+        error_setg(errp, "Cannot set properties after the accelerator has "
+                         "been initialized");
+        return;
+    }
+
+    visit_type_bool(v, name, &pmu_cap_disabled, &error);
+    if (error) {
+        error_propagate(errp, error);
+        return;
+    }
+
+    s->pmu_cap_disabled = pmu_cap_disabled;
+}
+
 void kvm_arch_accel_class_init(ObjectClass *oc)
 {
     object_class_property_add_enum(oc, "notify-vmexit", "NotifyVMexitOption",
@@ -6564,6 +6610,12 @@ void kvm_arch_accel_class_init(ObjectClass *oc)
                               NULL, NULL);
     object_class_property_set_description(oc, "xen-evtchn-max-pirq",
                                           "Maximum number of Xen PIRQs");
+
+    object_class_property_add(oc, "pmu-cap-disabled", "bool",
+                              NULL, kvm_set_pmu_cap_disabled,
+                              NULL, NULL);
+    object_class_property_set_description(oc, "pmu-cap-disabled",
+                                          "Disable KVM_CAP_PMU_CAPABILITY");
 }
 
 void kvm_set_max_apic_id(uint32_t max_apic_id)
diff --git a/target/i386/kvm/kvm_i386.h b/target/i386/kvm/kvm_i386.h
index 9de9c0d303..03522de9d1 100644
--- a/target/i386/kvm/kvm_i386.h
+++ b/target/i386/kvm/kvm_i386.h
@@ -49,6 +49,8 @@ uint64_t kvm_arch_get_supported_msr_feature(KVMState *s, uint32_t index);
 void kvm_set_max_apic_id(uint32_t max_apic_id);
 void kvm_request_xsave_components(X86CPU *cpu, uint64_t mask);
 
+bool kvm_pmu_cap_disabled(void);
+
 #ifdef CONFIG_KVM
 
 bool kvm_is_vm_type_supported(int type);
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 3/7] target/i386/kvm: init PMU information only once
  2024-11-04  9:40 [PATCH 0/7] target/i386/kvm/pmu: Enhancement, Bugfix and Cleanup Dongli Zhang
  2024-11-04  9:40 ` [PATCH 1/7] target/i386: disable PerfMonV2 when PERFCORE unavailable Dongli Zhang
  2024-11-04  9:40 ` [PATCH 2/7] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE Dongli Zhang
@ 2024-11-04  9:40 ` Dongli Zhang
  2024-11-10 15:29   ` Zhao Liu
  2024-11-04  9:40 ` [PATCH 4/7] target/i386/kvm: rename architectural PMU variables Dongli Zhang
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Dongli Zhang @ 2024-11-04  9:40 UTC (permalink / raw)
  To: qemu-devel, kvm
  Cc: pbonzini, mtosatti, sandipan.das, babu.moger, zhao1.liu, likexu,
	like.xu.linux, zhenyuw, groug, lyan, khorenko, alexander.ivanov,
	den, joe.jin, davydov-max

Currently, the 'has_architectural_pmu_version',
'num_architectural_pmu_gp_counters', and
'num_architectural_pmu_fixed_counters' are shared globally across all vCPUs
and are initialized during the setup of each vCPU.

To optimize, initialize PMU information only once for the first vCPU.

Additionally, the code extracted from kvm_x86_build_cpuid() is unrelated to
the process of building the CPUID.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 target/i386/kvm/kvm.c | 71 +++++++++++++++++++++++++++----------------
 1 file changed, 44 insertions(+), 27 deletions(-)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index ed73b1e7e0..5c0276f889 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -1961,33 +1961,6 @@ static uint32_t kvm_x86_build_cpuid(CPUX86State *env,
         }
     }
 
-    if (limit >= 0x0a) {
-        uint32_t eax, edx;
-
-        cpu_x86_cpuid(env, 0x0a, 0, &eax, &unused, &unused, &edx);
-
-        has_architectural_pmu_version = eax & 0xff;
-        if (has_architectural_pmu_version > 0) {
-            num_architectural_pmu_gp_counters = (eax & 0xff00) >> 8;
-
-            /* Shouldn't be more than 32, since that's the number of bits
-             * available in EBX to tell us _which_ counters are available.
-             * Play it safe.
-             */
-            if (num_architectural_pmu_gp_counters > MAX_GP_COUNTERS) {
-                num_architectural_pmu_gp_counters = MAX_GP_COUNTERS;
-            }
-
-            if (has_architectural_pmu_version > 1) {
-                num_architectural_pmu_fixed_counters = edx & 0x1f;
-
-                if (num_architectural_pmu_fixed_counters > MAX_FIXED_COUNTERS) {
-                    num_architectural_pmu_fixed_counters = MAX_FIXED_COUNTERS;
-                }
-            }
-        }
-    }
-
     cpu_x86_cpuid(env, 0x80000000, 0, &limit, &unused, &unused, &unused);
 
     for (i = 0x80000000; i <= limit; i++) {
@@ -2055,6 +2028,43 @@ full:
     abort();
 }
 
+static void kvm_init_pmu_info(CPUX86State *env)
+{
+    uint32_t eax, edx;
+    uint32_t unused;
+    uint32_t limit;
+
+    cpu_x86_cpuid(env, 0, 0, &limit, &unused, &unused, &unused);
+
+    if (limit < 0x0a) {
+        return;
+    }
+
+    cpu_x86_cpuid(env, 0x0a, 0, &eax, &unused, &unused, &edx);
+
+    has_architectural_pmu_version = eax & 0xff;
+    if (has_architectural_pmu_version > 0) {
+        num_architectural_pmu_gp_counters = (eax & 0xff00) >> 8;
+
+        /*
+         * Shouldn't be more than 32, since that's the number of bits
+         * available in EBX to tell us _which_ counters are available.
+         * Play it safe.
+         */
+        if (num_architectural_pmu_gp_counters > MAX_GP_COUNTERS) {
+            num_architectural_pmu_gp_counters = MAX_GP_COUNTERS;
+        }
+
+        if (has_architectural_pmu_version > 1) {
+            num_architectural_pmu_fixed_counters = edx & 0x1f;
+
+            if (num_architectural_pmu_fixed_counters > MAX_FIXED_COUNTERS) {
+                num_architectural_pmu_fixed_counters = MAX_FIXED_COUNTERS;
+            }
+        }
+    }
+}
+
 int kvm_arch_init_vcpu(CPUState *cs)
 {
     struct {
@@ -2237,6 +2247,13 @@ int kvm_arch_init_vcpu(CPUState *cs)
     cpuid_i = kvm_x86_build_cpuid(env, cpuid_data.entries, cpuid_i);
     cpuid_data.cpuid.nent = cpuid_i;
 
+    /*
+     * Initialize PMU information only once for the first vCPU.
+     */
+    if (cs == first_cpu) {
+        kvm_init_pmu_info(env);
+    }
+
     if (((env->cpuid_version >> 8)&0xF) >= 6
         && (env->features[FEAT_1_EDX] & (CPUID_MCE | CPUID_MCA)) ==
            (CPUID_MCE | CPUID_MCA)) {
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 4/7] target/i386/kvm: rename architectural PMU variables
  2024-11-04  9:40 [PATCH 0/7] target/i386/kvm/pmu: Enhancement, Bugfix and Cleanup Dongli Zhang
                   ` (2 preceding siblings ...)
  2024-11-04  9:40 ` [PATCH 3/7] target/i386/kvm: init PMU information only once Dongli Zhang
@ 2024-11-04  9:40 ` Dongli Zhang
  2024-11-04  9:40 ` [PATCH 5/7] target/i386/kvm: reset AMD PMU registers during VM reset Dongli Zhang
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Dongli Zhang @ 2024-11-04  9:40 UTC (permalink / raw)
  To: qemu-devel, kvm
  Cc: pbonzini, mtosatti, sandipan.das, babu.moger, zhao1.liu, likexu,
	like.xu.linux, zhenyuw, groug, lyan, khorenko, alexander.ivanov,
	den, joe.jin, davydov-max

AMD does not have what is commonly referred to as an architectural PMU.
Therefore, we need to rename the following variables to be applicable for
both Intel and AMD:

- has_architectural_pmu_version
- num_architectural_pmu_gp_counters
- num_architectural_pmu_fixed_counters

For Intel processors, the meaning of has_pmu_version remains unchanged.

For AMD processors:

has_pmu_version == 1 corresponds to versions before AMD PerfMonV2.
has_pmu_version == 2 corresponds to AMD PerfMonV2.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 target/i386/kvm/kvm.c | 49 ++++++++++++++++++++++++-------------------
 1 file changed, 28 insertions(+), 21 deletions(-)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 5c0276f889..ca2b644e2c 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -168,9 +168,16 @@ static bool has_msr_pkrs;
 static bool has_msr_hwcr;
 static bool has_pmu_cap;
 
-static uint32_t has_architectural_pmu_version;
-static uint32_t num_architectural_pmu_gp_counters;
-static uint32_t num_architectural_pmu_fixed_counters;
+/*
+ * For Intel processors, the meaning is the architectural PMU version
+ * number.
+ *
+ * For AMD processors: 1 corresponds to the prior versions, and 2
+ * corresponds to AMD PerfMonV2.
+ */
+static uint32_t has_pmu_version;
+static uint32_t num_pmu_gp_counters;
+static uint32_t num_pmu_fixed_counters;
 
 static int has_xsave2;
 static int has_xcrs;
@@ -2042,24 +2049,24 @@ static void kvm_init_pmu_info(CPUX86State *env)
 
     cpu_x86_cpuid(env, 0x0a, 0, &eax, &unused, &unused, &edx);
 
-    has_architectural_pmu_version = eax & 0xff;
-    if (has_architectural_pmu_version > 0) {
-        num_architectural_pmu_gp_counters = (eax & 0xff00) >> 8;
+    has_pmu_version = eax & 0xff;
+    if (has_pmu_version > 0) {
+        num_pmu_gp_counters = (eax & 0xff00) >> 8;
 
         /*
          * Shouldn't be more than 32, since that's the number of bits
          * available in EBX to tell us _which_ counters are available.
          * Play it safe.
          */
-        if (num_architectural_pmu_gp_counters > MAX_GP_COUNTERS) {
-            num_architectural_pmu_gp_counters = MAX_GP_COUNTERS;
+        if (num_pmu_gp_counters > MAX_GP_COUNTERS) {
+            num_pmu_gp_counters = MAX_GP_COUNTERS;
         }
 
-        if (has_architectural_pmu_version > 1) {
-            num_architectural_pmu_fixed_counters = edx & 0x1f;
+        if (has_pmu_version > 1) {
+            num_pmu_fixed_counters = edx & 0x1f;
 
-            if (num_architectural_pmu_fixed_counters > MAX_FIXED_COUNTERS) {
-                num_architectural_pmu_fixed_counters = MAX_FIXED_COUNTERS;
+            if (num_pmu_fixed_counters > MAX_FIXED_COUNTERS) {
+                num_pmu_fixed_counters = MAX_FIXED_COUNTERS;
             }
         }
     }
@@ -4020,25 +4027,25 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
             kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, env->poll_control_msr);
         }
 
-        if (has_architectural_pmu_version > 0) {
-            if (has_architectural_pmu_version > 1) {
+        if (has_pmu_version > 0) {
+            if (has_pmu_version > 1) {
                 /* Stop the counter.  */
                 kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
                 kvm_msr_entry_add(cpu, MSR_CORE_PERF_GLOBAL_CTRL, 0);
             }
 
             /* Set the counter values.  */
-            for (i = 0; i < num_architectural_pmu_fixed_counters; i++) {
+            for (i = 0; i < num_pmu_fixed_counters; i++) {
                 kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR0 + i,
                                   env->msr_fixed_counters[i]);
             }
-            for (i = 0; i < num_architectural_pmu_gp_counters; i++) {
+            for (i = 0; i < num_pmu_gp_counters; i++) {
                 kvm_msr_entry_add(cpu, MSR_P6_PERFCTR0 + i,
                                   env->msr_gp_counters[i]);
                 kvm_msr_entry_add(cpu, MSR_P6_EVNTSEL0 + i,
                                   env->msr_gp_evtsel[i]);
             }
-            if (has_architectural_pmu_version > 1) {
+            if (has_pmu_version > 1) {
                 kvm_msr_entry_add(cpu, MSR_CORE_PERF_GLOBAL_STATUS,
                                   env->msr_global_status);
                 kvm_msr_entry_add(cpu, MSR_CORE_PERF_GLOBAL_OVF_CTRL,
@@ -4496,17 +4503,17 @@ static int kvm_get_msrs(X86CPU *cpu)
     if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_POLL_CONTROL)) {
         kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, 1);
     }
-    if (has_architectural_pmu_version > 0) {
-        if (has_architectural_pmu_version > 1) {
+    if (has_pmu_version > 0) {
+        if (has_pmu_version > 1) {
             kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
             kvm_msr_entry_add(cpu, MSR_CORE_PERF_GLOBAL_CTRL, 0);
             kvm_msr_entry_add(cpu, MSR_CORE_PERF_GLOBAL_STATUS, 0);
             kvm_msr_entry_add(cpu, MSR_CORE_PERF_GLOBAL_OVF_CTRL, 0);
         }
-        for (i = 0; i < num_architectural_pmu_fixed_counters; i++) {
+        for (i = 0; i < num_pmu_fixed_counters; i++) {
             kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR0 + i, 0);
         }
-        for (i = 0; i < num_architectural_pmu_gp_counters; i++) {
+        for (i = 0; i < num_pmu_gp_counters; i++) {
             kvm_msr_entry_add(cpu, MSR_P6_PERFCTR0 + i, 0);
             kvm_msr_entry_add(cpu, MSR_P6_EVNTSEL0 + i, 0);
         }
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 5/7] target/i386/kvm: reset AMD PMU registers during VM reset
  2024-11-04  9:40 [PATCH 0/7] target/i386/kvm/pmu: Enhancement, Bugfix and Cleanup Dongli Zhang
                   ` (3 preceding siblings ...)
  2024-11-04  9:40 ` [PATCH 4/7] target/i386/kvm: rename architectural PMU variables Dongli Zhang
@ 2024-11-04  9:40 ` Dongli Zhang
  2024-11-06  9:58   ` Sandipan Das
  2024-11-07 21:00   ` Maksim Davydov
  2024-11-04  9:40 ` [PATCH 6/7] target/i386/kvm: support perfmon-v2 for reset Dongli Zhang
  2024-11-04  9:40 ` [PATCH 7/7] target/i386/kvm: don't stop Intel PMU counters Dongli Zhang
  6 siblings, 2 replies; 32+ messages in thread
From: Dongli Zhang @ 2024-11-04  9:40 UTC (permalink / raw)
  To: qemu-devel, kvm
  Cc: pbonzini, mtosatti, sandipan.das, babu.moger, zhao1.liu, likexu,
	like.xu.linux, zhenyuw, groug, lyan, khorenko, alexander.ivanov,
	den, joe.jin, davydov-max

QEMU uses the kvm_get_msrs() function to save Intel PMU registers from KVM
and kvm_put_msrs() to restore them to KVM. However, there is no support for
AMD PMU registers. Currently, has_pmu_version and num_pmu_gp_counters are
initialized based on cpuid(0xa), which does not apply to AMD processors.
For AMD CPUs, prior to PerfMonV2, the number of general-purpose registers
is determined based on the CPU version.

To address this issue, we need to add support for AMD PMU registers.
Without this support, the following problems can arise:

1. If the VM is reset (e.g., via QEMU system_reset or VM kdump/kexec) while
running "perf top", the PMU registers are not disabled properly.

2. Despite x86_cpu_reset() resetting many registers to zero, kvm_put_msrs()
does not handle AMD PMU registers, causing some PMU events to remain
enabled in KVM.

3. The KVM kvm_pmc_speculative_in_use() function consistently returns true,
preventing the reclamation of these events. Consequently, the
kvm_pmc->perf_event remains active.

4. After a reboot, the VM kernel may report the following error:

[    0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS detected, complain to your hardware vendor.
[    0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR c0010200 is 530076)

5. In the worst case, the active kvm_pmc->perf_event may inject unknown
NMIs randomly into the VM kernel:

[...] Uhhuh. NMI received for unknown reason 30 on CPU 0.

To resolve these issues, we propose resetting AMD PMU registers during the
VM reset process.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 target/i386/cpu.h     |   8 +++
 target/i386/kvm/kvm.c | 156 +++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 161 insertions(+), 3 deletions(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 59959b8b7a..0505eb3b08 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -488,6 +488,14 @@ typedef enum X86Seg {
 #define MSR_CORE_PERF_GLOBAL_CTRL       0x38f
 #define MSR_CORE_PERF_GLOBAL_OVF_CTRL   0x390
 
+#define MSR_K7_EVNTSEL0                 0xc0010000
+#define MSR_K7_PERFCTR0                 0xc0010004
+#define MSR_F15H_PERF_CTL0              0xc0010200
+#define MSR_F15H_PERF_CTR0              0xc0010201
+
+#define AMD64_NUM_COUNTERS              4
+#define AMD64_NUM_COUNTERS_CORE         6
+
 #define MSR_MC0_CTL                     0x400
 #define MSR_MC0_STATUS                  0x401
 #define MSR_MC0_ADDR                    0x402
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index ca2b644e2c..83ec85a9b9 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -2035,7 +2035,7 @@ full:
     abort();
 }
 
-static void kvm_init_pmu_info(CPUX86State *env)
+static void kvm_init_pmu_info_intel(CPUX86State *env)
 {
     uint32_t eax, edx;
     uint32_t unused;
@@ -2072,6 +2072,80 @@ static void kvm_init_pmu_info(CPUX86State *env)
     }
 }
 
+static void kvm_init_pmu_info_amd(CPUX86State *env)
+{
+    int64_t family;
+
+    has_pmu_version = 0;
+
+    /*
+     * To determine the CPU family, the following code is derived from
+     * x86_cpuid_version_get_family().
+     */
+    family = (env->cpuid_version >> 8) & 0xf;
+    if (family == 0xf) {
+        family += (env->cpuid_version >> 20) & 0xff;
+    }
+
+    /*
+     * Performance-monitoring supported from K7 and later.
+     */
+    if (family < 6) {
+        return;
+    }
+
+    has_pmu_version = 1;
+
+    if (!(env->features[FEAT_8000_0001_ECX] & CPUID_EXT3_PERFCORE)) {
+        num_pmu_gp_counters = AMD64_NUM_COUNTERS;
+        return;
+    }
+
+    num_pmu_gp_counters = AMD64_NUM_COUNTERS_CORE;
+}
+
+static bool is_same_vendor(CPUX86State *env)
+{
+    static uint32_t host_cpuid_vendor1;
+    static uint32_t host_cpuid_vendor2;
+    static uint32_t host_cpuid_vendor3;
+
+    host_cpuid(0x0, 0, NULL, &host_cpuid_vendor1, &host_cpuid_vendor3,
+               &host_cpuid_vendor2);
+
+    return env->cpuid_vendor1 == host_cpuid_vendor1 &&
+           env->cpuid_vendor2 == host_cpuid_vendor2 &&
+           env->cpuid_vendor3 == host_cpuid_vendor3;
+}
+
+static void kvm_init_pmu_info(CPUX86State *env)
+{
+    /*
+     * It is not supported to virtualize AMD PMU registers on Intel
+     * processors, nor to virtualize Intel PMU registers on AMD processors.
+     */
+    if (!is_same_vendor(env)) {
+        return;
+    }
+
+    /*
+     * If KVM_CAP_PMU_CAPABILITY is not supported, there is no way to
+     * disable the AMD pmu virtualization.
+     *
+     * If KVM_CAP_PMU_CAPABILITY is supported, kvm_state->pmu_cap_disabled
+     * indicates the KVM has already disabled the pmu virtualization.
+     */
+    if (kvm_state->pmu_cap_disabled) {
+        return;
+    }
+
+    if (IS_INTEL_CPU(env)) {
+        kvm_init_pmu_info_intel(env);
+    } else if (IS_AMD_CPU(env)) {
+        kvm_init_pmu_info_amd(env);
+    }
+}
+
 int kvm_arch_init_vcpu(CPUState *cs)
 {
     struct {
@@ -4027,7 +4101,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
             kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, env->poll_control_msr);
         }
 
-        if (has_pmu_version > 0) {
+        if (IS_INTEL_CPU(env) && has_pmu_version > 0) {
             if (has_pmu_version > 1) {
                 /* Stop the counter.  */
                 kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
@@ -4058,6 +4132,38 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
                                   env->msr_global_ctrl);
             }
         }
+
+        if (IS_AMD_CPU(env) && has_pmu_version > 0) {
+            uint32_t sel_base = MSR_K7_EVNTSEL0;
+            uint32_t ctr_base = MSR_K7_PERFCTR0;
+            /*
+             * The address of the next selector or counter register is
+             * obtained by incrementing the address of the current selector
+             * or counter register by one.
+             */
+            uint32_t step = 1;
+
+            /*
+             * When PERFCORE is enabled, AMD PMU uses a separate set of
+             * addresses for the selector and counter registers.
+             * Additionally, the address of the next selector or counter
+             * register is determined by incrementing the address of the
+             * current register by two.
+             */
+            if (num_pmu_gp_counters == AMD64_NUM_COUNTERS_CORE) {
+                sel_base = MSR_F15H_PERF_CTL0;
+                ctr_base = MSR_F15H_PERF_CTR0;
+                step = 2;
+            }
+
+            for (i = 0; i < num_pmu_gp_counters; i++) {
+                kvm_msr_entry_add(cpu, ctr_base + i * step,
+                                  env->msr_gp_counters[i]);
+                kvm_msr_entry_add(cpu, sel_base + i * step,
+                                  env->msr_gp_evtsel[i]);
+            }
+        }
+
         /*
          * Hyper-V partition-wide MSRs: to avoid clearing them on cpu hot-add,
          * only sync them to KVM on the first cpu
@@ -4503,7 +4609,8 @@ static int kvm_get_msrs(X86CPU *cpu)
     if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_POLL_CONTROL)) {
         kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, 1);
     }
-    if (has_pmu_version > 0) {
+
+    if (IS_INTEL_CPU(env) && has_pmu_version > 0) {
         if (has_pmu_version > 1) {
             kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
             kvm_msr_entry_add(cpu, MSR_CORE_PERF_GLOBAL_CTRL, 0);
@@ -4519,6 +4626,35 @@ static int kvm_get_msrs(X86CPU *cpu)
         }
     }
 
+    if (IS_AMD_CPU(env) && has_pmu_version > 0) {
+        uint32_t sel_base = MSR_K7_EVNTSEL0;
+        uint32_t ctr_base = MSR_K7_PERFCTR0;
+        /*
+         * The address of the next selector or counter register is
+         * obtained by incrementing the address of the current selector
+         * or counter register by one.
+         */
+        uint32_t step = 1;
+
+        /*
+         * When PERFCORE is enabled, AMD PMU uses a separate set of
+         * addresses for the selector and counter registers.
+         * Additionally, the address of the next selector or counter
+         * register is determined by incrementing the address of the
+         * current register by two.
+         */
+        if (num_pmu_gp_counters == AMD64_NUM_COUNTERS_CORE) {
+            sel_base = MSR_F15H_PERF_CTL0;
+            ctr_base = MSR_F15H_PERF_CTR0;
+            step = 2;
+        }
+
+        for (i = 0; i < num_pmu_gp_counters; i++) {
+            kvm_msr_entry_add(cpu, ctr_base + i * step, 0);
+            kvm_msr_entry_add(cpu, sel_base + i * step, 0);
+        }
+    }
+
     if (env->mcg_cap) {
         kvm_msr_entry_add(cpu, MSR_MCG_STATUS, 0);
         kvm_msr_entry_add(cpu, MSR_MCG_CTL, 0);
@@ -4830,6 +4966,20 @@ static int kvm_get_msrs(X86CPU *cpu)
         case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL0 + MAX_GP_COUNTERS - 1:
             env->msr_gp_evtsel[index - MSR_P6_EVNTSEL0] = msrs[i].data;
             break;
+        case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL0 + 3:
+            env->msr_gp_evtsel[index - MSR_K7_EVNTSEL0] = msrs[i].data;
+            break;
+        case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR0 + 3:
+            env->msr_gp_counters[index - MSR_K7_PERFCTR0] = msrs[i].data;
+            break;
+        case MSR_F15H_PERF_CTL0 ... MSR_F15H_PERF_CTL0 + 0xb:
+            index = index - MSR_F15H_PERF_CTL0;
+            if (index & 0x1) {
+                env->msr_gp_counters[index] = msrs[i].data;
+            } else {
+                env->msr_gp_evtsel[index] = msrs[i].data;
+            }
+            break;
         case HV_X64_MSR_HYPERCALL:
             env->msr_hv_hypercall = msrs[i].data;
             break;
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 6/7] target/i386/kvm: support perfmon-v2 for reset
  2024-11-04  9:40 [PATCH 0/7] target/i386/kvm/pmu: Enhancement, Bugfix and Cleanup Dongli Zhang
                   ` (4 preceding siblings ...)
  2024-11-04  9:40 ` [PATCH 5/7] target/i386/kvm: reset AMD PMU registers during VM reset Dongli Zhang
@ 2024-11-04  9:40 ` Dongli Zhang
  2024-11-08 13:09   ` Sandipan Das
  2024-11-04  9:40 ` [PATCH 7/7] target/i386/kvm: don't stop Intel PMU counters Dongli Zhang
  6 siblings, 1 reply; 32+ messages in thread
From: Dongli Zhang @ 2024-11-04  9:40 UTC (permalink / raw)
  To: qemu-devel, kvm
  Cc: pbonzini, mtosatti, sandipan.das, babu.moger, zhao1.liu, likexu,
	like.xu.linux, zhenyuw, groug, lyan, khorenko, alexander.ivanov,
	den, joe.jin, davydov-max

Since perfmon-v2, the AMD PMU supports additional registers. This update
includes get/put functionality for these extra registers.

Similar to the implementation in KVM:

- MSR_CORE_PERF_GLOBAL_STATUS and MSR_AMD64_PERF_CNTR_GLOBAL_STATUS both
use env->msr_global_status.
- MSR_CORE_PERF_GLOBAL_CTRL and MSR_AMD64_PERF_CNTR_GLOBAL_CTL both use
env->msr_global_ctrl.
- MSR_CORE_PERF_GLOBAL_OVF_CTRL and MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR
both use env->msr_global_ovf_ctrl.

No changes are needed for vmstate_msr_architectural_pmu or
pmu_enable_needed().

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 target/i386/cpu.h     |  4 ++++
 target/i386/kvm/kvm.c | 47 ++++++++++++++++++++++++++++++++++---------
 2 files changed, 42 insertions(+), 9 deletions(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 0505eb3b08..68ed798808 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -488,6 +488,10 @@ typedef enum X86Seg {
 #define MSR_CORE_PERF_GLOBAL_CTRL       0x38f
 #define MSR_CORE_PERF_GLOBAL_OVF_CTRL   0x390
 
+#define MSR_AMD64_PERF_CNTR_GLOBAL_STATUS       0xc0000300
+#define MSR_AMD64_PERF_CNTR_GLOBAL_CTL          0xc0000301
+#define MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR   0xc0000302
+
 #define MSR_K7_EVNTSEL0                 0xc0010000
 #define MSR_K7_PERFCTR0                 0xc0010004
 #define MSR_F15H_PERF_CTL0              0xc0010200
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 83ec85a9b9..918dcb61fe 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -2074,6 +2074,8 @@ static void kvm_init_pmu_info_intel(CPUX86State *env)
 
 static void kvm_init_pmu_info_amd(CPUX86State *env)
 {
+    uint32_t eax, ebx;
+    uint32_t unused;
     int64_t family;
 
     has_pmu_version = 0;
@@ -2102,6 +2104,13 @@ static void kvm_init_pmu_info_amd(CPUX86State *env)
     }
 
     num_pmu_gp_counters = AMD64_NUM_COUNTERS_CORE;
+
+    cpu_x86_cpuid(env, 0x80000022, 0, &eax, &ebx, &unused, &unused);
+
+    if (eax & CPUID_8000_0022_EAX_PERFMON_V2) {
+        has_pmu_version = 2;
+        num_pmu_gp_counters = ebx & 0xf;
+    }
 }
 
 static bool is_same_vendor(CPUX86State *env)
@@ -4144,13 +4153,14 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
             uint32_t step = 1;
 
             /*
-             * When PERFCORE is enabled, AMD PMU uses a separate set of
-             * addresses for the selector and counter registers.
-             * Additionally, the address of the next selector or counter
-             * register is determined by incrementing the address of the
-             * current register by two.
+             * When PERFCORE or PerfMonV2 is enabled, AMD PMU uses a
+             * separate set of addresses for the selector and counter
+             * registers. Additionally, the address of the next selector or
+             * counter register is determined by incrementing the address
+             * of the current register by two.
              */
-            if (num_pmu_gp_counters == AMD64_NUM_COUNTERS_CORE) {
+            if (num_pmu_gp_counters == AMD64_NUM_COUNTERS_CORE ||
+                has_pmu_version == 2) {
                 sel_base = MSR_F15H_PERF_CTL0;
                 ctr_base = MSR_F15H_PERF_CTR0;
                 step = 2;
@@ -4162,6 +4172,15 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
                 kvm_msr_entry_add(cpu, sel_base + i * step,
                                   env->msr_gp_evtsel[i]);
             }
+
+            if (has_pmu_version == 2) {
+                kvm_msr_entry_add(cpu, MSR_AMD64_PERF_CNTR_GLOBAL_STATUS,
+                                  env->msr_global_status);
+                kvm_msr_entry_add(cpu, MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR,
+                                  env->msr_global_ovf_ctrl);
+                kvm_msr_entry_add(cpu, MSR_AMD64_PERF_CNTR_GLOBAL_CTL,
+                                  env->msr_global_ctrl);
+            }
         }
 
         /*
@@ -4637,13 +4656,14 @@ static int kvm_get_msrs(X86CPU *cpu)
         uint32_t step = 1;
 
         /*
-         * When PERFCORE is enabled, AMD PMU uses a separate set of
-         * addresses for the selector and counter registers.
+         * When PERFCORE or PerfMonV2 is enabled, AMD PMU uses a separate
+         * set of addresses for the selector and counter registers.
          * Additionally, the address of the next selector or counter
          * register is determined by incrementing the address of the
          * current register by two.
          */
-        if (num_pmu_gp_counters == AMD64_NUM_COUNTERS_CORE) {
+        if (num_pmu_gp_counters == AMD64_NUM_COUNTERS_CORE ||
+            has_pmu_version == 2) {
             sel_base = MSR_F15H_PERF_CTL0;
             ctr_base = MSR_F15H_PERF_CTR0;
             step = 2;
@@ -4653,6 +4673,12 @@ static int kvm_get_msrs(X86CPU *cpu)
             kvm_msr_entry_add(cpu, ctr_base + i * step, 0);
             kvm_msr_entry_add(cpu, sel_base + i * step, 0);
         }
+
+        if (has_pmu_version == 2) {
+            kvm_msr_entry_add(cpu, MSR_AMD64_PERF_CNTR_GLOBAL_CTL, 0);
+            kvm_msr_entry_add(cpu, MSR_AMD64_PERF_CNTR_GLOBAL_STATUS, 0);
+            kvm_msr_entry_add(cpu, MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR, 0);
+        }
     }
 
     if (env->mcg_cap) {
@@ -4949,12 +4975,15 @@ static int kvm_get_msrs(X86CPU *cpu)
             env->msr_fixed_ctr_ctrl = msrs[i].data;
             break;
         case MSR_CORE_PERF_GLOBAL_CTRL:
+        case MSR_AMD64_PERF_CNTR_GLOBAL_CTL:
             env->msr_global_ctrl = msrs[i].data;
             break;
         case MSR_CORE_PERF_GLOBAL_STATUS:
+        case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS:
             env->msr_global_status = msrs[i].data;
             break;
         case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
+        case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR:
             env->msr_global_ovf_ctrl = msrs[i].data;
             break;
         case MSR_CORE_PERF_FIXED_CTR0 ... MSR_CORE_PERF_FIXED_CTR0 + MAX_FIXED_COUNTERS - 1:
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 7/7] target/i386/kvm: don't stop Intel PMU counters
  2024-11-04  9:40 [PATCH 0/7] target/i386/kvm/pmu: Enhancement, Bugfix and Cleanup Dongli Zhang
                   ` (5 preceding siblings ...)
  2024-11-04  9:40 ` [PATCH 6/7] target/i386/kvm: support perfmon-v2 for reset Dongli Zhang
@ 2024-11-04  9:40 ` Dongli Zhang
  6 siblings, 0 replies; 32+ messages in thread
From: Dongli Zhang @ 2024-11-04  9:40 UTC (permalink / raw)
  To: qemu-devel, kvm
  Cc: pbonzini, mtosatti, sandipan.das, babu.moger, zhao1.liu, likexu,
	like.xu.linux, zhenyuw, groug, lyan, khorenko, alexander.ivanov,
	den, joe.jin, davydov-max

The kvm_put_msrs() sets the MSRs using KVM_SET_MSRS. The x86 KVM processes
these MSRs one by one in a loop, only saving the config and triggering the
KVM_REQ_PMU request. This approach does not immediately stop the event
before updating PMC.

In additional, PMU MSRs are set only at levels >= KVM_PUT_RESET_STATE,
excluding runtime. Therefore, updating these MSRs without stopping events
should be acceptable.

Finally, KVM creates kernel perf events with host mode excluded
(exclude_host = 1). While the events remain active, they don't increment
the counter during QEMU vCPU userspace mode.

No Fixed tag is going to be added for the commit 0d89436786b0 ("kvm:
migrate vPMU state"), because this isn't a bugfix.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 target/i386/kvm/kvm.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 918dcb61fe..c4a11c2e80 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -4111,13 +4111,6 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
         }
 
         if (IS_INTEL_CPU(env) && has_pmu_version > 0) {
-            if (has_pmu_version > 1) {
-                /* Stop the counter.  */
-                kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
-                kvm_msr_entry_add(cpu, MSR_CORE_PERF_GLOBAL_CTRL, 0);
-            }
-
-            /* Set the counter values.  */
             for (i = 0; i < num_pmu_fixed_counters; i++) {
                 kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR0 + i,
                                   env->msr_fixed_counters[i]);
@@ -4133,8 +4126,6 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
                                   env->msr_global_status);
                 kvm_msr_entry_add(cpu, MSR_CORE_PERF_GLOBAL_OVF_CTRL,
                                   env->msr_global_ovf_ctrl);
-
-                /* Now start the PMU.  */
                 kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL,
                                   env->msr_fixed_ctr_ctrl);
                 kvm_msr_entry_add(cpu, MSR_CORE_PERF_GLOBAL_CTRL,
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/7] target/i386: disable PerfMonV2 when PERFCORE unavailable
  2024-11-04  9:40 ` [PATCH 1/7] target/i386: disable PerfMonV2 when PERFCORE unavailable Dongli Zhang
@ 2024-11-06  3:54   ` Zhao Liu
  2024-11-07  0:29     ` dongli.zhang
  0 siblings, 1 reply; 32+ messages in thread
From: Zhao Liu @ 2024-11-06  3:54 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: qemu-devel, kvm, pbonzini, mtosatti, sandipan.das, babu.moger,
	likexu, like.xu.linux, zhenyuw, groug, lyan, khorenko,
	alexander.ivanov, den, joe.jin, davydov-max

Hi Dongli,

On Mon, Nov 04, 2024 at 01:40:16AM -0800, Dongli Zhang wrote:
> Date: Mon,  4 Nov 2024 01:40:16 -0800
> From: Dongli Zhang <dongli.zhang@oracle.com>
> Subject: [PATCH 1/7] target/i386: disable PerfMonV2 when PERFCORE
>  unavailable
> X-Mailer: git-send-email 2.43.5
> 
> When the PERFCORE is disabled with "-cpu host,-perfctr-core", it is
> reflected in in guest dmesg.
> 
> [    0.285136] Performance Events: AMD PMU driver.
> 
> However, the guest cpuid indicates the PerfMonV2 is still available.
> 
> CPU:
>    Extended Performance Monitoring and Debugging (0x80000022):
>       AMD performance monitoring V2         = true
>       AMD LBR V2                            = false
>       AMD LBR stack & PMC freezing          = false
>       number of core perf ctrs              = 0x6 (6)
>       number of LBR stack entries           = 0x0 (0)
>       number of avail Northbridge perf ctrs = 0x0 (0)
>       number of available UMC PMCs          = 0x0 (0)
>       active UMCs bitmask                   = 0x0
> 
> Disable PerfMonV2 in cpuid when PERFCORE is disabled.
> 
> Fixes: 209b0ac12074 ("target/i386: Add PerfMonV2 feature bit")
> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> ---
>  target/i386/cpu.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> index 3baa95481f..4490a7a8d6 100644
> --- a/target/i386/cpu.c
> +++ b/target/i386/cpu.c
> @@ -7103,6 +7103,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>          *eax = *ebx = *ecx = *edx = 0;
>          /* AMD Extended Performance Monitoring and Debug */
>          if (kvm_enabled() && cpu->enable_pmu &&
> +            (env->features[FEAT_8000_0001_ECX] & CPUID_EXT3_PERFCORE) &&
>              (env->features[FEAT_8000_0022_EAX] & CPUID_8000_0022_EAX_PERFMON_V2)) {
>              *eax |= CPUID_8000_0022_EAX_PERFMON_V2;
>              *ebx |= kvm_arch_get_supported_cpuid(cs->kvm_state, index, count,

You can define dependency like this:

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 3baa95481fbc..99c69ec9f369 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -1803,6 +1803,10 @@ static FeatureDep feature_dependencies[] = {
         .from = { FEAT_7_1_EDX,             CPUID_7_1_EDX_AVX10 },
         .to = { FEAT_24_0_EBX,              ~0ull },
     },
+    {
+        .from = { FEAT_8000_0001_ECX,       CPUID_EXT3_PERFCORE },
+        .to = { FEAT_8000_0022_EAX,         CPUID_8000_0022_EAX_PERFMON_V2 }
+    }
 };

 typedef struct X86RegisterInfo32 {

---
Does this meet your needs?

Regards,
Zhao



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/7] target/i386/kvm: reset AMD PMU registers during VM reset
  2024-11-04  9:40 ` [PATCH 5/7] target/i386/kvm: reset AMD PMU registers during VM reset Dongli Zhang
@ 2024-11-06  9:58   ` Sandipan Das
  2024-11-07  0:33     ` dongli.zhang
  2024-11-07 21:00   ` Maksim Davydov
  1 sibling, 1 reply; 32+ messages in thread
From: Sandipan Das @ 2024-11-06  9:58 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: qemu-devel, kvm, pbonzini, mtosatti, babu.moger, zhao1.liu,
	likexu, like.xu.linux, zhenyuw, groug, lyan, khorenko,
	alexander.ivanov, den, joe.jin, davydov-max

On 11/4/2024 3:10 PM, Dongli Zhang wrote:
> QEMU uses the kvm_get_msrs() function to save Intel PMU registers from KVM
> and kvm_put_msrs() to restore them to KVM. However, there is no support for
> AMD PMU registers. Currently, has_pmu_version and num_pmu_gp_counters are
> initialized based on cpuid(0xa), which does not apply to AMD processors.
> For AMD CPUs, prior to PerfMonV2, the number of general-purpose registers
> is determined based on the CPU version.
> 
> To address this issue, we need to add support for AMD PMU registers.
> Without this support, the following problems can arise:
> 
> 1. If the VM is reset (e.g., via QEMU system_reset or VM kdump/kexec) while
> running "perf top", the PMU registers are not disabled properly.
> 
> 2. Despite x86_cpu_reset() resetting many registers to zero, kvm_put_msrs()
> does not handle AMD PMU registers, causing some PMU events to remain
> enabled in KVM.
> 
> 3. The KVM kvm_pmc_speculative_in_use() function consistently returns true,
> preventing the reclamation of these events. Consequently, the
> kvm_pmc->perf_event remains active.
> 
> 4. After a reboot, the VM kernel may report the following error:
> 
> [    0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS detected, complain to your hardware vendor.
> [    0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR c0010200 is 530076)
> 
> 5. In the worst case, the active kvm_pmc->perf_event may inject unknown
> NMIs randomly into the VM kernel:
> 
> [...] Uhhuh. NMI received for unknown reason 30 on CPU 0.
> 
> To resolve these issues, we propose resetting AMD PMU registers during the
> VM reset process.
> 
> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> ---
>  target/i386/cpu.h     |   8 +++
>  target/i386/kvm/kvm.c | 156 +++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 161 insertions(+), 3 deletions(-)
> 
> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> index 59959b8b7a..0505eb3b08 100644
> --- a/target/i386/cpu.h
> +++ b/target/i386/cpu.h
> @@ -488,6 +488,14 @@ typedef enum X86Seg {
>  #define MSR_CORE_PERF_GLOBAL_CTRL       0x38f
>  #define MSR_CORE_PERF_GLOBAL_OVF_CTRL   0x390
>  
> +#define MSR_K7_EVNTSEL0                 0xc0010000
> +#define MSR_K7_PERFCTR0                 0xc0010004
> +#define MSR_F15H_PERF_CTL0              0xc0010200
> +#define MSR_F15H_PERF_CTR0              0xc0010201
> +
> +#define AMD64_NUM_COUNTERS              4
> +#define AMD64_NUM_COUNTERS_CORE         6
> +
>  #define MSR_MC0_CTL                     0x400
>  #define MSR_MC0_STATUS                  0x401
>  #define MSR_MC0_ADDR                    0x402
> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> index ca2b644e2c..83ec85a9b9 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c
> @@ -2035,7 +2035,7 @@ full:
>      abort();
>  }
>  
> -static void kvm_init_pmu_info(CPUX86State *env)
> +static void kvm_init_pmu_info_intel(CPUX86State *env)
>  {
>      uint32_t eax, edx;
>      uint32_t unused;
> @@ -2072,6 +2072,80 @@ static void kvm_init_pmu_info(CPUX86State *env)
>      }
>  }
>  
> +static void kvm_init_pmu_info_amd(CPUX86State *env)
> +{
> +    int64_t family;
> +
> +    has_pmu_version = 0;
> +
> +    /*
> +     * To determine the CPU family, the following code is derived from
> +     * x86_cpuid_version_get_family().
> +     */
> +    family = (env->cpuid_version >> 8) & 0xf;
> +    if (family == 0xf) {
> +        family += (env->cpuid_version >> 20) & 0xff;
> +    }
> +
> +    /*
> +     * Performance-monitoring supported from K7 and later.
> +     */
> +    if (family < 6) {
> +        return;
> +    }
> +
> +    has_pmu_version = 1;
> +
> +    if (!(env->features[FEAT_8000_0001_ECX] & CPUID_EXT3_PERFCORE)) {
> +        num_pmu_gp_counters = AMD64_NUM_COUNTERS;
> +        return;
> +    }
> +
> +    num_pmu_gp_counters = AMD64_NUM_COUNTERS_CORE;
> +}
> +
> +static bool is_same_vendor(CPUX86State *env)
> +{
> +    static uint32_t host_cpuid_vendor1;
> +    static uint32_t host_cpuid_vendor2;
> +    static uint32_t host_cpuid_vendor3;
> +
> +    host_cpuid(0x0, 0, NULL, &host_cpuid_vendor1, &host_cpuid_vendor3,
> +               &host_cpuid_vendor2);
> +
> +    return env->cpuid_vendor1 == host_cpuid_vendor1 &&
> +           env->cpuid_vendor2 == host_cpuid_vendor2 &&
> +           env->cpuid_vendor3 == host_cpuid_vendor3;
> +}
> +
> +static void kvm_init_pmu_info(CPUX86State *env)
> +{
> +    /*
> +     * It is not supported to virtualize AMD PMU registers on Intel
> +     * processors, nor to virtualize Intel PMU registers on AMD processors.
> +     */
> +    if (!is_same_vendor(env)) {
> +        return;
> +    }
> +
> +    /*
> +     * If KVM_CAP_PMU_CAPABILITY is not supported, there is no way to
> +     * disable the AMD pmu virtualization.
> +     *
> +     * If KVM_CAP_PMU_CAPABILITY is supported, kvm_state->pmu_cap_disabled
> +     * indicates the KVM has already disabled the pmu virtualization.
> +     */
> +    if (kvm_state->pmu_cap_disabled) {
> +        return;
> +    }
> +
> +    if (IS_INTEL_CPU(env)) {
> +        kvm_init_pmu_info_intel(env);
> +    } else if (IS_AMD_CPU(env)) {
> +        kvm_init_pmu_info_amd(env);
> +    }
> +}
> +
>  int kvm_arch_init_vcpu(CPUState *cs)
>  {
>      struct {
> @@ -4027,7 +4101,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>              kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, env->poll_control_msr);
>          }
>  
> -        if (has_pmu_version > 0) {
> +        if (IS_INTEL_CPU(env) && has_pmu_version > 0) {
>              if (has_pmu_version > 1) {
>                  /* Stop the counter.  */
>                  kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
> @@ -4058,6 +4132,38 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>                                    env->msr_global_ctrl);
>              }
>          }
> +
> +        if (IS_AMD_CPU(env) && has_pmu_version > 0) {
> +            uint32_t sel_base = MSR_K7_EVNTSEL0;
> +            uint32_t ctr_base = MSR_K7_PERFCTR0;
> +            /*
> +             * The address of the next selector or counter register is
> +             * obtained by incrementing the address of the current selector
> +             * or counter register by one.
> +             */
> +            uint32_t step = 1;
> +
> +            /*
> +             * When PERFCORE is enabled, AMD PMU uses a separate set of
> +             * addresses for the selector and counter registers.
> +             * Additionally, the address of the next selector or counter
> +             * register is determined by incrementing the address of the
> +             * current register by two.
> +             */
> +            if (num_pmu_gp_counters == AMD64_NUM_COUNTERS_CORE) {
> +                sel_base = MSR_F15H_PERF_CTL0;
> +                ctr_base = MSR_F15H_PERF_CTR0;
> +                step = 2;
> +            }
> +
> +            for (i = 0; i < num_pmu_gp_counters; i++) {
> +                kvm_msr_entry_add(cpu, ctr_base + i * step,
> +                                  env->msr_gp_counters[i]);
> +                kvm_msr_entry_add(cpu, sel_base + i * step,
> +                                  env->msr_gp_evtsel[i]);
> +            }
> +        }
> +
>          /*
>           * Hyper-V partition-wide MSRs: to avoid clearing them on cpu hot-add,
>           * only sync them to KVM on the first cpu
> @@ -4503,7 +4609,8 @@ static int kvm_get_msrs(X86CPU *cpu)
>      if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_POLL_CONTROL)) {
>          kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, 1);
>      }
> -    if (has_pmu_version > 0) {
> +
> +    if (IS_INTEL_CPU(env) && has_pmu_version > 0) {
>          if (has_pmu_version > 1) {
>              kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>              kvm_msr_entry_add(cpu, MSR_CORE_PERF_GLOBAL_CTRL, 0);
> @@ -4519,6 +4626,35 @@ static int kvm_get_msrs(X86CPU *cpu)
>          }
>      }
>  
> +    if (IS_AMD_CPU(env) && has_pmu_version > 0) {
> +        uint32_t sel_base = MSR_K7_EVNTSEL0;
> +        uint32_t ctr_base = MSR_K7_PERFCTR0;
> +        /*
> +         * The address of the next selector or counter register is
> +         * obtained by incrementing the address of the current selector
> +         * or counter register by one.
> +         */
> +        uint32_t step = 1;
> +
> +        /*
> +         * When PERFCORE is enabled, AMD PMU uses a separate set of
> +         * addresses for the selector and counter registers.
> +         * Additionally, the address of the next selector or counter
> +         * register is determined by incrementing the address of the
> +         * current register by two.
> +         */
> +        if (num_pmu_gp_counters == AMD64_NUM_COUNTERS_CORE) {
> +            sel_base = MSR_F15H_PERF_CTL0;
> +            ctr_base = MSR_F15H_PERF_CTR0;
> +            step = 2;
> +        }
> +
> +        for (i = 0; i < num_pmu_gp_counters; i++) {
> +            kvm_msr_entry_add(cpu, ctr_base + i * step, 0);
> +            kvm_msr_entry_add(cpu, sel_base + i * step, 0);
> +        }
> +    }
> +
>      if (env->mcg_cap) {
>          kvm_msr_entry_add(cpu, MSR_MCG_STATUS, 0);
>          kvm_msr_entry_add(cpu, MSR_MCG_CTL, 0);
> @@ -4830,6 +4966,20 @@ static int kvm_get_msrs(X86CPU *cpu)
>          case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL0 + MAX_GP_COUNTERS - 1:
>              env->msr_gp_evtsel[index - MSR_P6_EVNTSEL0] = msrs[i].data;
>              break;
> +        case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL0 + 3:

The upper bound is very unlikely to change but rewriting MSR_K7_EVNTSEL0 + 3 as
MSR_K7_EVNTSEL0 + AMD64_NUM_COUNTERS - 1 may be more readable. Same applies to
MSR_K7_PERFCTR below.

> +            env->msr_gp_evtsel[index - MSR_K7_EVNTSEL0] = msrs[i].data;
> +            break;
> +        case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR0 + 3:
> +            env->msr_gp_counters[index - MSR_K7_PERFCTR0] = msrs[i].data;
> +            break;
> +        case MSR_F15H_PERF_CTL0 ... MSR_F15H_PERF_CTL0 + 0xb:

Same as above except this one needs AMD64_NUM_COUNTERS_CORE * 2 - 1.

> +            index = index - MSR_F15H_PERF_CTL0;
> +            if (index & 0x1) {
> +                env->msr_gp_counters[index] = msrs[i].data;
> +            } else {
> +                env->msr_gp_evtsel[index] = msrs[i].data;
> +            }
> +            break;
>          case HV_X64_MSR_HYPERCALL:
>              env->msr_hv_hypercall = msrs[i].data;
>              break;



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/7] target/i386: disable PerfMonV2 when PERFCORE unavailable
  2024-11-06  3:54   ` Zhao Liu
@ 2024-11-07  0:29     ` dongli.zhang
  2024-11-07  7:57       ` Zhao Liu
  0 siblings, 1 reply; 32+ messages in thread
From: dongli.zhang @ 2024-11-07  0:29 UTC (permalink / raw)
  To: Zhao Liu
  Cc: qemu-devel, kvm, pbonzini, mtosatti, sandipan.das, babu.moger,
	likexu, like.xu.linux, zhenyuw, groug, lyan, khorenko,
	alexander.ivanov, den, joe.jin, davydov-max

Hi Zhao,


> 
> You can define dependency like this:
> 
> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> index 3baa95481fbc..99c69ec9f369 100644
> --- a/target/i386/cpu.c
> +++ b/target/i386/cpu.c
> @@ -1803,6 +1803,10 @@ static FeatureDep feature_dependencies[] = {
>          .from = { FEAT_7_1_EDX,             CPUID_7_1_EDX_AVX10 },
>          .to = { FEAT_24_0_EBX,              ~0ull },
>      },
> +    {
> +        .from = { FEAT_8000_0001_ECX,       CPUID_EXT3_PERFCORE },
> +        .to = { FEAT_8000_0022_EAX,         CPUID_8000_0022_EAX_PERFMON_V2 }
> +    }
>  };
> 
>  typedef struct X86RegisterInfo32 {
> 
> ---
> Does this meet your needs?
> 

Thank you very much for the suggestion.

Yes, this works. The PERFCORE is a prerequisite of PERFMON_V2 (according to
Linux kernel source code).

1403 static int __init amd_core_pmu_init(void)
1404 {
1405         union cpuid_0x80000022_ebx ebx;
1406         u64 even_ctr_mask = 0ULL;
1407         int i;
1408
1409         if (!boot_cpu_has(X86_FEATURE_PERFCTR_CORE))
1410                 return 0;


If you don't mind, I will send the v2 with your Suggested-by.

Thank you very much!

Dongli Zhang


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/7] target/i386/kvm: reset AMD PMU registers during VM reset
  2024-11-06  9:58   ` Sandipan Das
@ 2024-11-07  0:33     ` dongli.zhang
  0 siblings, 0 replies; 32+ messages in thread
From: dongli.zhang @ 2024-11-07  0:33 UTC (permalink / raw)
  To: Sandipan Das
  Cc: qemu-devel, kvm, pbonzini, mtosatti, babu.moger, zhao1.liu,
	likexu, like.xu.linux, zhenyuw, groug, lyan, khorenko,
	alexander.ivanov, den, joe.jin, davydov-max

Hi Sandipan,

On 11/6/24 1:58 AM, Sandipan Das wrote:

[snip]

>> @@ -4830,6 +4966,20 @@ static int kvm_get_msrs(X86CPU *cpu)
>>          case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL0 + MAX_GP_COUNTERS - 1:
>>              env->msr_gp_evtsel[index - MSR_P6_EVNTSEL0] = msrs[i].data;
>>              break;
>> +        case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL0 + 3:
> 
> The upper bound is very unlikely to change but rewriting MSR_K7_EVNTSEL0 + 3 as
> MSR_K7_EVNTSEL0 + AMD64_NUM_COUNTERS - 1 may be more readable. Same applies to
> MSR_K7_PERFCTR below.
> 
>> +            env->msr_gp_evtsel[index - MSR_K7_EVNTSEL0] = msrs[i].data;
>> +            break;
>> +        case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR0 + 3:
>> +            env->msr_gp_counters[index - MSR_K7_PERFCTR0] = msrs[i].data;
>> +            break;
>> +        case MSR_F15H_PERF_CTL0 ... MSR_F15H_PERF_CTL0 + 0xb:
> 
> Same as above except this one needs AMD64_NUM_COUNTERS_CORE * 2 - 1.

Unlike the AMD PMU K7, I assume it is more likely the number of counters may be
changed in the future (so far it is still 6 on my test server with PerfMonV2).

I will make it more readable in v2 following your suggestion.

Thank you very much!

Dongli Zhang



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/7] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE
  2024-11-04  9:40 ` [PATCH 2/7] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE Dongli Zhang
@ 2024-11-07  7:52   ` Zhao Liu
  2024-11-07 23:44     ` dongli.zhang
  0 siblings, 1 reply; 32+ messages in thread
From: Zhao Liu @ 2024-11-07  7:52 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: qemu-devel, kvm, pbonzini, mtosatti, sandipan.das, babu.moger,
	likexu, like.xu.linux, zhenyuw, groug, lyan, khorenko,
	alexander.ivanov, den, joe.jin, davydov-max, dapeng1.mi,
	zide.chen

(+Dapang & Zide)

Hi Dongli,

On Mon, Nov 04, 2024 at 01:40:17AM -0800, Dongli Zhang wrote:
> Date: Mon,  4 Nov 2024 01:40:17 -0800
> From: Dongli Zhang <dongli.zhang@oracle.com>
> Subject: [PATCH 2/7] target/i386/kvm: introduce 'pmu-cap-disabled' to set
>  KVM_PMU_CAP_DISABLE
> X-Mailer: git-send-email 2.43.5
> 
> The AMD PMU virtualization is not disabled when configuring
> "-cpu host,-pmu" in the QEMU command line on an AMD server. Neither
> "-cpu host,-pmu" nor "-cpu EPYC" effectively disables AMD PMU
> virtualization in such an environment.
> 
> As a result, VM logs typically show:
> 
> [    0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
> 
> whereas the expected logs should be:
> 
> [    0.596381] Performance Events: PMU not available due to virtualization, using software events only.
> [    0.600972] NMI watchdog: Perf NMI watchdog permanently disabled
> 
> This discrepancy occurs because AMD PMU does not use CPUID to determine
> whether PMU virtualization is supported.

Intel platform doesn't have this issue since Linux kernel fails to check
the CPU family & model when "-cpu *,-pmu" option clears PMU version.

The difference between Intel and AMD platforms, however, is that it seems
Intel hardly ever reaches the “...due virtualization” message, but
instead reports an error because it recognizes a mismatched family/model.

This may be a drawback of the PMU driver's print message, but the result
is the same, it prevents the PMU driver from enabling.

So, please mention that KVM_PMU_CAP_DISABLE doesn't change the PMU
behavior on Intel platform because current "pmu" property works as
expected.

> To address this, we introduce a new property, 'pmu-cap-disabled', for KVM
> acceleration. This property sets KVM_PMU_CAP_DISABLE if
> KVM_CAP_PMU_CAPABILITY is supported. Note that this feature currently
> supports only x86 hosts, as KVM_CAP_PMU_CAPABILITY is used exclusively for
> x86 systems.
> 
> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> ---
> Another previous solution to re-use '-cpu host,-pmu':
> https://lore.kernel.org/all/20221119122901.2469-1-dongli.zhang@oracle.com/

IMO, I prefer the previous version. This VM-level KVM property is
difficult to integrate with the existing CPU properties. Pls refer later
comments for reasons.

>  accel/kvm/kvm-all.c        |  1 +
>  include/sysemu/kvm_int.h   |  1 +
>  qemu-options.hx            |  9 ++++++-
>  target/i386/cpu.c          |  2 +-
>  target/i386/kvm/kvm.c      | 52 ++++++++++++++++++++++++++++++++++++++
>  target/i386/kvm/kvm_i386.h |  2 ++
>  6 files changed, 65 insertions(+), 2 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 801cff16a5..8b5ba45cf7 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -3933,6 +3933,7 @@ static void kvm_accel_instance_init(Object *obj)
>      s->xen_evtchn_max_pirq = 256;
>      s->device = NULL;
>      s->msr_energy.enable = false;
> +    s->pmu_cap_disabled = false;
>  }

The CPU property "pmu" also defaults to "false"...but:

 * max CPU would override this and try to enable PMU by default in
   max_x86_cpu_initfn().

 * Other named CPU models keep the default setting to avoid affecting
   the migration.

The pmu_cap_disabled and “pmu” property look unbound and unassociated,
so this can cause the conflict when they are not synchronized. For
example,

-cpu host -accel kvm,pmu-cap-disabled=on

The above options will fail to launch a VM (on Intel platform).

Ideally, the “pmu” property and pmu-cap-disabled should be bound to each
other and be consistent. But it's not easy because:
 - There is no proper way to have pmu_cap_disabled set different default
   values (e.g., "false" for max CPU and "true" for named CPU models)
   based on different CPU models.
 - And, no proper place to check the consistency of pmu_cap_disabled and
   enable_pmu.

Therefore, I prefer your previous approach, to reuse current CPU "pmu"
property.

Further, considering that this is currently the only case that needs to
to set the VM level's capability in the CPU context, there is no need to
introduce a new kvm interface (in your previous patch), which can instead
be set in kvm_cpu_realizefn(), like:


diff --git a/target/i386/kvm/kvm-cpu.c b/target/i386/kvm/kvm-cpu.c
index 99d1941cf51c..05e9c9a1a0cf 100644
--- a/target/i386/kvm/kvm-cpu.c
+++ b/target/i386/kvm/kvm-cpu.c
@@ -42,6 +42,8 @@ static bool kvm_cpu_realizefn(CPUState *cs, Error **errp)
 {
     X86CPU *cpu = X86_CPU(cs);
     CPUX86State *env = &cpu->env;
+    KVMState *s = kvm_state;
+    static bool first = true;
     bool ret;

     /*
@@ -63,6 +65,29 @@ static bool kvm_cpu_realizefn(CPUState *cs, Error **errp)
      *   check/update ucode_rev, phys_bits, guest_phys_bits, mwait
      *   cpu_common_realizefn() (via xcc->parent_realize)
      */
+
+    if (first) {
+        first = false;
+
+        /*
+         * Since Linux v5.18, KVM provides a VM-level capability to easily
+         * disable PMUs; however, QEMU has been providing PMU property per
+         * CPU since v1.6. In order to accommodate both, have to configure
+         * the VM-level capability here.
+         */
+        if (!cpu->enable_pmu &&
+            kvm_check_extension(s, KVM_CAP_PMU_CAPABILITY)) {
+            int r = kvm_vm_enable_cap(s, KVM_CAP_PMU_CAPABILITY, 0,
+                                      KVM_PMU_CAP_DISABLE);
+
+            if (r < 0) {
+                error_setg(errp, "kvm: Failed to disable pmu cap: %s",
+                           strerror(-r));
+                return false;
+            }
+        }
+    }
+
     if (cpu->max_features) {
         if (enable_cpu_pm) {
             if (kvm_has_waitpkg()) {
---

In addition, if PMU is disabled, why not mask the perf related bits in
8000_0001_ECX? :)

Regards,
Zhao



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/7] target/i386: disable PerfMonV2 when PERFCORE unavailable
  2024-11-07  0:29     ` dongli.zhang
@ 2024-11-07  7:57       ` Zhao Liu
  0 siblings, 0 replies; 32+ messages in thread
From: Zhao Liu @ 2024-11-07  7:57 UTC (permalink / raw)
  To: dongli.zhang
  Cc: qemu-devel, kvm, pbonzini, mtosatti, sandipan.das, babu.moger,
	likexu, like.xu.linux, zhenyuw, groug, lyan, khorenko,
	alexander.ivanov, den, joe.jin, davydov-max

> Thank you very much for the suggestion.
> 
> Yes, this works. The PERFCORE is a prerequisite of PERFMON_V2 (according to
> Linux kernel source code).
> 
> 1403 static int __init amd_core_pmu_init(void)
> 1404 {
> 1405         union cpuid_0x80000022_ebx ebx;
> 1406         u64 even_ctr_mask = 0ULL;
> 1407         int i;
> 1408
> 1409         if (!boot_cpu_has(X86_FEATURE_PERFCTR_CORE))
> 1410                 return 0;
> 
> 
> If you don't mind, I will send the v2 with your Suggested-by.
> 
> Thank you very much!

Sure, welcome! :)

-Zhao

> Dongli Zhang


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/7] target/i386/kvm: reset AMD PMU registers during VM reset
  2024-11-04  9:40 ` [PATCH 5/7] target/i386/kvm: reset AMD PMU registers during VM reset Dongli Zhang
  2024-11-06  9:58   ` Sandipan Das
@ 2024-11-07 21:00   ` Maksim Davydov
  2024-11-08  1:19     ` dongli.zhang
  1 sibling, 1 reply; 32+ messages in thread
From: Maksim Davydov @ 2024-11-07 21:00 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: pbonzini, mtosatti, sandipan.das, babu.moger, zhao1.liu, likexu,
	like.xu.linux, zhenyuw, groug, lyan, khorenko, alexander.ivanov,
	den, joe.jin, qemu-devel, kvm



On 11/4/24 12:40, Dongli Zhang wrote:
> QEMU uses the kvm_get_msrs() function to save Intel PMU registers from KVM
> and kvm_put_msrs() to restore them to KVM. However, there is no support for
> AMD PMU registers. Currently, has_pmu_version and num_pmu_gp_counters are
> initialized based on cpuid(0xa), which does not apply to AMD processors.
> For AMD CPUs, prior to PerfMonV2, the number of general-purpose registers
> is determined based on the CPU version.
> 
> To address this issue, we need to add support for AMD PMU registers.
> Without this support, the following problems can arise:
> 
> 1. If the VM is reset (e.g., via QEMU system_reset or VM kdump/kexec) while
> running "perf top", the PMU registers are not disabled properly.
> 
> 2. Despite x86_cpu_reset() resetting many registers to zero, kvm_put_msrs()
> does not handle AMD PMU registers, causing some PMU events to remain
> enabled in KVM.
> 
> 3. The KVM kvm_pmc_speculative_in_use() function consistently returns true,
> preventing the reclamation of these events. Consequently, the
> kvm_pmc->perf_event remains active.
> 
> 4. After a reboot, the VM kernel may report the following error:
> 
> [    0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS detected, complain to your hardware vendor.
> [    0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR c0010200 is 530076)
> 
> 5. In the worst case, the active kvm_pmc->perf_event may inject unknown
> NMIs randomly into the VM kernel:
> 
> [...] Uhhuh. NMI received for unknown reason 30 on CPU 0.
> 
> To resolve these issues, we propose resetting AMD PMU registers during the
> VM reset process.
> 
> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> ---
>   target/i386/cpu.h     |   8 +++
>   target/i386/kvm/kvm.c | 156 +++++++++++++++++++++++++++++++++++++++++-
>   2 files changed, 161 insertions(+), 3 deletions(-)
> 
> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> index 59959b8b7a..0505eb3b08 100644
> --- a/target/i386/cpu.h
> +++ b/target/i386/cpu.h
> @@ -488,6 +488,14 @@ typedef enum X86Seg {
>   #define MSR_CORE_PERF_GLOBAL_CTRL       0x38f
>   #define MSR_CORE_PERF_GLOBAL_OVF_CTRL   0x390
>   
> +#define MSR_K7_EVNTSEL0                 0xc0010000
> +#define MSR_K7_PERFCTR0                 0xc0010004
> +#define MSR_F15H_PERF_CTL0              0xc0010200
> +#define MSR_F15H_PERF_CTR0              0xc0010201
> +
> +#define AMD64_NUM_COUNTERS              4
> +#define AMD64_NUM_COUNTERS_CORE         6
> +
>   #define MSR_MC0_CTL                     0x400
>   #define MSR_MC0_STATUS                  0x401
>   #define MSR_MC0_ADDR                    0x402
> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> index ca2b644e2c..83ec85a9b9 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c
> @@ -2035,7 +2035,7 @@ full:
>       abort();
>   }
>   
> -static void kvm_init_pmu_info(CPUX86State *env)
> +static void kvm_init_pmu_info_intel(CPUX86State *env)
>   {
>       uint32_t eax, edx;
>       uint32_t unused;
> @@ -2072,6 +2072,80 @@ static void kvm_init_pmu_info(CPUX86State *env)
>       }
>   }
>   
> +static void kvm_init_pmu_info_amd(CPUX86State *env)
> +{
> +    int64_t family;
> +
> +    has_pmu_version = 0;
> +
> +    /*
> +     * To determine the CPU family, the following code is derived from
> +     * x86_cpuid_version_get_family().
> +     */
> +    family = (env->cpuid_version >> 8) & 0xf;
> +    if (family == 0xf) {
> +        family += (env->cpuid_version >> 20) & 0xff;
> +    }
> +
> +    /*
> +     * Performance-monitoring supported from K7 and later.
> +     */
> +    if (family < 6) {
> +        return;
> +    }
> +
> +    has_pmu_version = 1;
> +
> +    if (!(env->features[FEAT_8000_0001_ECX] & CPUID_EXT3_PERFCORE)) {
> +        num_pmu_gp_counters = AMD64_NUM_COUNTERS;
> +        return;
> +    }
> +
> +    num_pmu_gp_counters = AMD64_NUM_COUNTERS_CORE;
> +}

It seems that AMD implementation has one issue.
KVM has parameter `enable_pmu`. So vPMU can be disabled in another way, 
not only via KVM_PMU_CAP_DISABLE. For Intel it's not a problem, because 
the vPMU initialization uses info from KVM_GET_SUPPORTED_CPUID. The 
enable_pmu state is reflected in KVM_GET_SUPPORTED_CPUID.  Thus no PMU 
MSRs in kvm_put_msrs/kvm_get_msrs will be used.

But on AMD we don't use information from KVM_GET_SUPPORTED_CPUID to set 
an appropriate number of PMU registers. So, if vPMU is disabled by KVM 
parameter `enable_pmu` and pmu-cap-disable=false, then has_pmu_version 
will be 1 after kvm_init_pmu_info_amd execution. It means that in 
kvm_put_msrs/kvm_get_msrs 4 PMU counters will be processed, but the 
correct behavior in that situation is to skip all PMU registers.
I think we should get info from KVM to fix that.

I tested this series on Zen2 and found that PMU MSRs were still 
processed during initialization even with enable_pmu=N. But it doesn't 
lead to any errors in QEMU

> +
> +static bool is_same_vendor(CPUX86State *env)
> +{
> +    static uint32_t host_cpuid_vendor1;
> +    static uint32_t host_cpuid_vendor2;
> +    static uint32_t host_cpuid_vendor3;
> +
> +    host_cpuid(0x0, 0, NULL, &host_cpuid_vendor1, &host_cpuid_vendor3,
> +               &host_cpuid_vendor2);
> +
> +    return env->cpuid_vendor1 == host_cpuid_vendor1 &&
> +           env->cpuid_vendor2 == host_cpuid_vendor2 &&
> +           env->cpuid_vendor3 == host_cpuid_vendor3;
> +}
> +
> +static void kvm_init_pmu_info(CPUX86State *env)
> +{
> +    /*
> +     * It is not supported to virtualize AMD PMU registers on Intel
> +     * processors, nor to virtualize Intel PMU registers on AMD processors.
> +     */
> +    if (!is_same_vendor(env)) {
> +        return;
> +    }
> +
> +    /*
> +     * If KVM_CAP_PMU_CAPABILITY is not supported, there is no way to
> +     * disable the AMD pmu virtualization.
> +     *
> +     * If KVM_CAP_PMU_CAPABILITY is supported, kvm_state->pmu_cap_disabled
> +     * indicates the KVM has already disabled the pmu virtualization.
> +     */
> +    if (kvm_state->pmu_cap_disabled) {
> +        return;
> +    }
> +

It seems that after these changes the issue concerning using
pmu-cap-disable=true with +pmu on Intel platform (that Zhao Liu has 
mentioned before) is fixed

> +    if (IS_INTEL_CPU(env)) {
> +        kvm_init_pmu_info_intel(env);
> +    } else if (IS_AMD_CPU(env)) {
> +        kvm_init_pmu_info_amd(env);
> +    }
> +}
> +
>   int kvm_arch_init_vcpu(CPUState *cs)
>   {
>       struct {
> @@ -4027,7 +4101,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>               kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, env->poll_control_msr);
>           }
>   
> -        if (has_pmu_version > 0) {
> +        if (IS_INTEL_CPU(env) && has_pmu_version > 0) {
>               if (has_pmu_version > 1) {
>                   /* Stop the counter.  */
>                   kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
> @@ -4058,6 +4132,38 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>                                     env->msr_global_ctrl);
>               }
>           }
> +
> +        if (IS_AMD_CPU(env) && has_pmu_version > 0) {
> +            uint32_t sel_base = MSR_K7_EVNTSEL0;
> +            uint32_t ctr_base = MSR_K7_PERFCTR0;
> +            /*
> +             * The address of the next selector or counter register is
> +             * obtained by incrementing the address of the current selector
> +             * or counter register by one.
> +             */
> +            uint32_t step = 1;
> +
> +            /*
> +             * When PERFCORE is enabled, AMD PMU uses a separate set of
> +             * addresses for the selector and counter registers.
> +             * Additionally, the address of the next selector or counter
> +             * register is determined by incrementing the address of the
> +             * current register by two.
> +             */
> +            if (num_pmu_gp_counters == AMD64_NUM_COUNTERS_CORE) {
> +                sel_base = MSR_F15H_PERF_CTL0;
> +                ctr_base = MSR_F15H_PERF_CTR0;
> +                step = 2;
> +            }
> +
> +            for (i = 0; i < num_pmu_gp_counters; i++) {
> +                kvm_msr_entry_add(cpu, ctr_base + i * step,
> +                                  env->msr_gp_counters[i]);
> +                kvm_msr_entry_add(cpu, sel_base + i * step,
> +                                  env->msr_gp_evtsel[i]);
> +            }
> +        }
> +
>           /*
>            * Hyper-V partition-wide MSRs: to avoid clearing them on cpu hot-add,
>            * only sync them to KVM on the first cpu
> @@ -4503,7 +4609,8 @@ static int kvm_get_msrs(X86CPU *cpu)
>       if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_POLL_CONTROL)) {
>           kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, 1);
>       }
> -    if (has_pmu_version > 0) {
> +
> +    if (IS_INTEL_CPU(env) && has_pmu_version > 0) {
>           if (has_pmu_version > 1) {
>               kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>               kvm_msr_entry_add(cpu, MSR_CORE_PERF_GLOBAL_CTRL, 0);
> @@ -4519,6 +4626,35 @@ static int kvm_get_msrs(X86CPU *cpu)
>           }
>       }
>   
> +    if (IS_AMD_CPU(env) && has_pmu_version > 0) {
> +        uint32_t sel_base = MSR_K7_EVNTSEL0;
> +        uint32_t ctr_base = MSR_K7_PERFCTR0;
> +        /*
> +         * The address of the next selector or counter register is
> +         * obtained by incrementing the address of the current selector
> +         * or counter register by one.
> +         */
> +        uint32_t step = 1;
> +
> +        /*
> +         * When PERFCORE is enabled, AMD PMU uses a separate set of
> +         * addresses for the selector and counter registers.
> +         * Additionally, the address of the next selector or counter
> +         * register is determined by incrementing the address of the
> +         * current register by two.
> +         */
> +        if (num_pmu_gp_counters == AMD64_NUM_COUNTERS_CORE) {
> +            sel_base = MSR_F15H_PERF_CTL0;
> +            ctr_base = MSR_F15H_PERF_CTR0;
> +            step = 2;
> +        }
> +
> +        for (i = 0; i < num_pmu_gp_counters; i++) {
> +            kvm_msr_entry_add(cpu, ctr_base + i * step, 0);
> +            kvm_msr_entry_add(cpu, sel_base + i * step, 0);
> +        }
> +    }
> +
>       if (env->mcg_cap) {
>           kvm_msr_entry_add(cpu, MSR_MCG_STATUS, 0);
>           kvm_msr_entry_add(cpu, MSR_MCG_CTL, 0);
> @@ -4830,6 +4966,20 @@ static int kvm_get_msrs(X86CPU *cpu)
>           case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL0 + MAX_GP_COUNTERS - 1:
>               env->msr_gp_evtsel[index - MSR_P6_EVNTSEL0] = msrs[i].data;
>               break;
> +        case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL0 + 3:
> +            env->msr_gp_evtsel[index - MSR_K7_EVNTSEL0] = msrs[i].data;
> +            break;
> +        case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR0 + 3:
> +            env->msr_gp_counters[index - MSR_K7_PERFCTR0] = msrs[i].data;
> +            break;
> +        case MSR_F15H_PERF_CTL0 ... MSR_F15H_PERF_CTL0 + 0xb:
> +            index = index - MSR_F15H_PERF_CTL0;
> +            if (index & 0x1) {
> +                env->msr_gp_counters[index] = msrs[i].data;
> +            } else {
> +                env->msr_gp_evtsel[index] = msrs[i].data;
> +            }
> +            break;
>           case HV_X64_MSR_HYPERCALL:
>               env->msr_hv_hypercall = msrs[i].data;
>               break;

-- 
Best regards,
Maksim Davydov


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/7] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE
  2024-11-07  7:52   ` Zhao Liu
@ 2024-11-07 23:44     ` dongli.zhang
  2024-11-08  2:32       ` Zhao Liu
                         ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread
From: dongli.zhang @ 2024-11-07 23:44 UTC (permalink / raw)
  To: Zhao Liu
  Cc: qemu-devel, kvm, pbonzini, mtosatti, sandipan.das, babu.moger,
	likexu, like.xu.linux, zhenyuw, groug, lyan, khorenko,
	alexander.ivanov, den, joe.jin, davydov-max, dapeng1.mi,
	zide.chen

Hi Zhao,


On 11/6/24 11:52 PM, Zhao Liu wrote:
> (+Dapang & Zide)
> 
> Hi Dongli,
> 
> On Mon, Nov 04, 2024 at 01:40:17AM -0800, Dongli Zhang wrote:
>> Date: Mon,  4 Nov 2024 01:40:17 -0800
>> From: Dongli Zhang <dongli.zhang@oracle.com>
>> Subject: [PATCH 2/7] target/i386/kvm: introduce 'pmu-cap-disabled' to set
>>  KVM_PMU_CAP_DISABLE
>> X-Mailer: git-send-email 2.43.5
>>
>> The AMD PMU virtualization is not disabled when configuring
>> "-cpu host,-pmu" in the QEMU command line on an AMD server. Neither
>> "-cpu host,-pmu" nor "-cpu EPYC" effectively disables AMD PMU
>> virtualization in such an environment.
>>
>> As a result, VM logs typically show:
>>
>> [    0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
>>
>> whereas the expected logs should be:
>>
>> [    0.596381] Performance Events: PMU not available due to virtualization, using software events only.
>> [    0.600972] NMI watchdog: Perf NMI watchdog permanently disabled
>>
>> This discrepancy occurs because AMD PMU does not use CPUID to determine
>> whether PMU virtualization is supported.
> 
> Intel platform doesn't have this issue since Linux kernel fails to check
> the CPU family & model when "-cpu *,-pmu" option clears PMU version.
> 
> The difference between Intel and AMD platforms, however, is that it seems
> Intel hardly ever reaches the “...due virtualization” message, but
> instead reports an error because it recognizes a mismatched family/model.
> 
> This may be a drawback of the PMU driver's print message, but the result
> is the same, it prevents the PMU driver from enabling.
> 
> So, please mention that KVM_PMU_CAP_DISABLE doesn't change the PMU
> behavior on Intel platform because current "pmu" property works as
> expected.

Sure. I will mention this in v2.

> 
>> To address this, we introduce a new property, 'pmu-cap-disabled', for KVM
>> acceleration. This property sets KVM_PMU_CAP_DISABLE if
>> KVM_CAP_PMU_CAPABILITY is supported. Note that this feature currently
>> supports only x86 hosts, as KVM_CAP_PMU_CAPABILITY is used exclusively for
>> x86 systems.
>>
>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
>> ---
>> Another previous solution to re-use '-cpu host,-pmu':
>> https://urldefense.com/v3/__https://lore.kernel.org/all/20221119122901.2469-1-dongli.zhang@oracle.com/__;!!ACWV5N9M2RV99hQ!Nm8Db-mwBoMIwKkRqzC9kgNi5uZ7SCIf43zUBn92Ar_NEbLXq-ZkrDDvpvDQ4cnS2i4VyKAp6CRVE12bRkMF$ 
> 
> IMO, I prefer the previous version. This VM-level KVM property is
> difficult to integrate with the existing CPU properties. Pls refer later
> comments for reasons.
> 
>>  accel/kvm/kvm-all.c        |  1 +
>>  include/sysemu/kvm_int.h   |  1 +
>>  qemu-options.hx            |  9 ++++++-
>>  target/i386/cpu.c          |  2 +-
>>  target/i386/kvm/kvm.c      | 52 ++++++++++++++++++++++++++++++++++++++
>>  target/i386/kvm/kvm_i386.h |  2 ++
>>  6 files changed, 65 insertions(+), 2 deletions(-)
>>
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index 801cff16a5..8b5ba45cf7 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -3933,6 +3933,7 @@ static void kvm_accel_instance_init(Object *obj)
>>      s->xen_evtchn_max_pirq = 256;
>>      s->device = NULL;
>>      s->msr_energy.enable = false;
>> +    s->pmu_cap_disabled = false;
>>  }
> 
> The CPU property "pmu" also defaults to "false"...but:
> 
>  * max CPU would override this and try to enable PMU by default in
>    max_x86_cpu_initfn().
> 
>  * Other named CPU models keep the default setting to avoid affecting
>    the migration.
> 
> The pmu_cap_disabled and “pmu” property look unbound and unassociated,
> so this can cause the conflict when they are not synchronized. For
> example,
> 
> -cpu host -accel kvm,pmu-cap-disabled=on
> 
> The above options will fail to launch a VM (on Intel platform).
> 
> Ideally, the “pmu” property and pmu-cap-disabled should be bound to each
> other and be consistent. But it's not easy because:
>  - There is no proper way to have pmu_cap_disabled set different default
>    values (e.g., "false" for max CPU and "true" for named CPU models)
>    based on different CPU models.
>  - And, no proper place to check the consistency of pmu_cap_disabled and
>    enable_pmu.
> 
> Therefore, I prefer your previous approach, to reuse current CPU "pmu"
> property.

Thank you very much for the suggestion and reasons.

I am going to follow your suggestion to switch back to the previous solution in v2.

> 
> Further, considering that this is currently the only case that needs to
> to set the VM level's capability in the CPU context, there is no need to
> introduce a new kvm interface (in your previous patch), which can instead
> be set in kvm_cpu_realizefn(), like:
> 
> 
> diff --git a/target/i386/kvm/kvm-cpu.c b/target/i386/kvm/kvm-cpu.c
> index 99d1941cf51c..05e9c9a1a0cf 100644
> --- a/target/i386/kvm/kvm-cpu.c
> +++ b/target/i386/kvm/kvm-cpu.c
> @@ -42,6 +42,8 @@ static bool kvm_cpu_realizefn(CPUState *cs, Error **errp)
>  {
>      X86CPU *cpu = X86_CPU(cs);
>      CPUX86State *env = &cpu->env;
> +    KVMState *s = kvm_state;
> +    static bool first = true;
>      bool ret;
> 
>      /*
> @@ -63,6 +65,29 @@ static bool kvm_cpu_realizefn(CPUState *cs, Error **errp)
>       *   check/update ucode_rev, phys_bits, guest_phys_bits, mwait
>       *   cpu_common_realizefn() (via xcc->parent_realize)
>       */
> +
> +    if (first) {
> +        first = false;
> +
> +        /*
> +         * Since Linux v5.18, KVM provides a VM-level capability to easily
> +         * disable PMUs; however, QEMU has been providing PMU property per
> +         * CPU since v1.6. In order to accommodate both, have to configure
> +         * the VM-level capability here.
> +         */
> +        if (!cpu->enable_pmu &&
> +            kvm_check_extension(s, KVM_CAP_PMU_CAPABILITY)) {
> +            int r = kvm_vm_enable_cap(s, KVM_CAP_PMU_CAPABILITY, 0,
> +                                      KVM_PMU_CAP_DISABLE);
> +
> +            if (r < 0) {
> +                error_setg(errp, "kvm: Failed to disable pmu cap: %s",
> +                           strerror(-r));
> +                return false;
> +            }
> +        }
> +    }
> +
>      if (cpu->max_features) {
>          if (enable_cpu_pm) {
>              if (kvm_has_waitpkg()) {
> ---

Sure. I will limit the change within only x86 + KVM.

> 
> In addition, if PMU is disabled, why not mask the perf related bits in
> 8000_0001_ECX? :)
> 

My fault. I have masked only 0x80000022, and I forgot 0x80000001 for AMD.

Thank you very much for the reminder.


I will wait for a day or maybe the weekend. I am going to switch to the previous
solution in v2 if there isn't any further objection with a more valid reason.

Thank you very much for the feedback!

Dongli Zhang



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/7] target/i386/kvm: reset AMD PMU registers during VM reset
  2024-11-07 21:00   ` Maksim Davydov
@ 2024-11-08  1:19     ` dongli.zhang
  2024-11-08 14:07       ` Maksim Davydov
  0 siblings, 1 reply; 32+ messages in thread
From: dongli.zhang @ 2024-11-08  1:19 UTC (permalink / raw)
  To: Maksim Davydov
  Cc: pbonzini, mtosatti, sandipan.das, babu.moger, zhao1.liu, likexu,
	like.xu.linux, zhenyuw, groug, lyan, khorenko, alexander.ivanov,
	den, joe.jin, qemu-devel, kvm

Hi Maksim,

On 11/7/24 1:00 PM, Maksim Davydov wrote:
> 
> 
> On 11/4/24 12:40, Dongli Zhang wrote:
>> QEMU uses the kvm_get_msrs() function to save Intel PMU registers from KVM
>> and kvm_put_msrs() to restore them to KVM. However, there is no support for
>> AMD PMU registers. Currently, has_pmu_version and num_pmu_gp_counters are
>> initialized based on cpuid(0xa), which does not apply to AMD processors.
>> For AMD CPUs, prior to PerfMonV2, the number of general-purpose registers
>> is determined based on the CPU version.
>>
>> To address this issue, we need to add support for AMD PMU registers.
>> Without this support, the following problems can arise:
>>
>> 1. If the VM is reset (e.g., via QEMU system_reset or VM kdump/kexec) while
>> running "perf top", the PMU registers are not disabled properly.
>>
>> 2. Despite x86_cpu_reset() resetting many registers to zero, kvm_put_msrs()
>> does not handle AMD PMU registers, causing some PMU events to remain
>> enabled in KVM.
>>
>> 3. The KVM kvm_pmc_speculative_in_use() function consistently returns true,
>> preventing the reclamation of these events. Consequently, the
>> kvm_pmc->perf_event remains active.
>>
>> 4. After a reboot, the VM kernel may report the following error:
>>
>> [    0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS detected,
>> complain to your hardware vendor.
>> [    0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR
>> c0010200 is 530076)
>>
>> 5. In the worst case, the active kvm_pmc->perf_event may inject unknown
>> NMIs randomly into the VM kernel:
>>
>> [...] Uhhuh. NMI received for unknown reason 30 on CPU 0.
>>
>> To resolve these issues, we propose resetting AMD PMU registers during the
>> VM reset process.
>>
>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
>> ---
>>   target/i386/cpu.h     |   8 +++
>>   target/i386/kvm/kvm.c | 156 +++++++++++++++++++++++++++++++++++++++++-
>>   2 files changed, 161 insertions(+), 3 deletions(-)
>>
>> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
>> index 59959b8b7a..0505eb3b08 100644
>> --- a/target/i386/cpu.h
>> +++ b/target/i386/cpu.h
>> @@ -488,6 +488,14 @@ typedef enum X86Seg {
>>   #define MSR_CORE_PERF_GLOBAL_CTRL       0x38f
>>   #define MSR_CORE_PERF_GLOBAL_OVF_CTRL   0x390
>>   +#define MSR_K7_EVNTSEL0                 0xc0010000
>> +#define MSR_K7_PERFCTR0                 0xc0010004
>> +#define MSR_F15H_PERF_CTL0              0xc0010200
>> +#define MSR_F15H_PERF_CTR0              0xc0010201
>> +
>> +#define AMD64_NUM_COUNTERS              4
>> +#define AMD64_NUM_COUNTERS_CORE         6
>> +
>>   #define MSR_MC0_CTL                     0x400
>>   #define MSR_MC0_STATUS                  0x401
>>   #define MSR_MC0_ADDR                    0x402
>> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
>> index ca2b644e2c..83ec85a9b9 100644
>> --- a/target/i386/kvm/kvm.c
>> +++ b/target/i386/kvm/kvm.c
>> @@ -2035,7 +2035,7 @@ full:
>>       abort();
>>   }
>>   -static void kvm_init_pmu_info(CPUX86State *env)
>> +static void kvm_init_pmu_info_intel(CPUX86State *env)
>>   {
>>       uint32_t eax, edx;
>>       uint32_t unused;
>> @@ -2072,6 +2072,80 @@ static void kvm_init_pmu_info(CPUX86State *env)
>>       }
>>   }
>>   +static void kvm_init_pmu_info_amd(CPUX86State *env)
>> +{
>> +    int64_t family;
>> +
>> +    has_pmu_version = 0;
>> +
>> +    /*
>> +     * To determine the CPU family, the following code is derived from
>> +     * x86_cpuid_version_get_family().
>> +     */
>> +    family = (env->cpuid_version >> 8) & 0xf;
>> +    if (family == 0xf) {
>> +        family += (env->cpuid_version >> 20) & 0xff;
>> +    }
>> +
>> +    /*
>> +     * Performance-monitoring supported from K7 and later.
>> +     */
>> +    if (family < 6) {
>> +        return;
>> +    }
>> +
>> +    has_pmu_version = 1;
>> +
>> +    if (!(env->features[FEAT_8000_0001_ECX] & CPUID_EXT3_PERFCORE)) {
>> +        num_pmu_gp_counters = AMD64_NUM_COUNTERS;
>> +        return;
>> +    }
>> +
>> +    num_pmu_gp_counters = AMD64_NUM_COUNTERS_CORE;
>> +}
> 
> It seems that AMD implementation has one issue.
> KVM has parameter `enable_pmu`. So vPMU can be disabled in another way, not only
> via KVM_PMU_CAP_DISABLE. For Intel it's not a problem, because the vPMU
> initialization uses info from KVM_GET_SUPPORTED_CPUID. The enable_pmu state is
> reflected in KVM_GET_SUPPORTED_CPUID.  Thus no PMU MSRs in kvm_put_msrs/
> kvm_get_msrs will be used.
> 
> But on AMD we don't use information from KVM_GET_SUPPORTED_CPUID to set an
> appropriate number of PMU registers. So, if vPMU is disabled by KVM parameter
> `enable_pmu` and pmu-cap-disable=false, then has_pmu_version will be 1 after
> kvm_init_pmu_info_amd execution. It means that in kvm_put_msrs/kvm_get_msrs 4
> PMU counters will be processed, but the correct behavior in that situation is to
> skip all PMU registers.
> I think we should get info from KVM to fix that.
> 
> I tested this series on Zen2 and found that PMU MSRs were still processed during
> initialization even with enable_pmu=N. But it doesn't lead to any errors in QEMU

Thank you very much for the feedback and helping catch the bug!

When enable_pmu=N, the QEMU (with this patchset) cannot tell if vPMU is
supported via KVM_CAP_PMU_CAPABILITY.

As it cannot disable the PMU, it falls to the legacy 4 counters.

It falls to 4 counters because KVM disableds PERFCORE on enable_pmu=Y, i.e.,

5220         if (enable_pmu) {
5221                 /*
5222                  * Enumerate support for PERFCTR_CORE if and only if KVM has
5223                  * access to enough counters to virtualize "core" support,
5224                  * otherwise limit vPMU support to the legacy number of
counters.
5225                  */
5226                 if (kvm_pmu_cap.num_counters_gp < AMD64_NUM_COUNTERS_CORE)
5227                         kvm_pmu_cap.num_counters_gp = min(AMD64_NUM_COUNTERS,
5228
kvm_pmu_cap.num_counters_gp);
5229                 else
5230                         kvm_cpu_cap_check_and_set(X86_FEATURE_PERFCTR_CORE);
5231
5232                 if (kvm_pmu_cap.version != 2 ||
5233                     !kvm_cpu_cap_has(X86_FEATURE_PERFCTR_CORE))
5234                         kvm_cpu_cap_clear(X86_FEATURE_PERFMON_V2);
5235         }


During the bootup and reset, the QEMU (with this patchset) erroneously resets
MSRs for the 4 PMCs, via line 3827.

3825 static int kvm_buf_set_msrs(X86CPU *cpu)
3826 {
3827     int ret = kvm_vcpu_ioctl(CPU(cpu), KVM_SET_MSRS, cpu->kvm_msr_buf);
3828     if (ret < 0) {
3829         return ret;
3830     }
3831
3832     if (ret < cpu->kvm_msr_buf->nmsrs) {
3833         struct kvm_msr_entry *e = &cpu->kvm_msr_buf->entries[ret];
3834         error_report("error: failed to set MSR 0x%" PRIx32 " to 0x%" PRIx64,
3835                      (uint32_t)e->index, (uint64_t)e->data);
3836     }
3837
3838     assert(ret == cpu->kvm_msr_buf->nmsrs);
3839     return 0;
3840 }

Because enable_pmu=N, the KVM doesn't support those registers. However, it
returns 0 (not 1), because the KVM does nothing in the implicit else (i.e., line
4144).

3847 int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
3848 {
... ...
4138         case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR3:
4139         case MSR_P6_PERFCTR0 ... MSR_P6_PERFCTR1:
4140         case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL3:
4141         case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL1:
4142                 if (kvm_pmu_is_valid_msr(vcpu, msr))
4143                         return kvm_pmu_set_msr(vcpu, msr_info);
4144
4145                 if (data)
4146                         kvm_pr_unimpl_wrmsr(vcpu, msr, data);
4147                 break;
... ...
4224         default:
4225                 if (kvm_pmu_is_valid_msr(vcpu, msr))
4226                         return kvm_pmu_set_msr(vcpu, msr_info);
4227
4228                 /*
4229                  * Userspace is allowed to write '0' to MSRs that KVM reports
4230                  * as to-be-saved, even if an MSRs isn't fully supported.
4231                  */
4232                 if (msr_info->host_initiated && !data &&
4233                     kvm_is_msr_to_save(msr))
4234                         break;
4235
4236                 return KVM_MSR_RET_INVALID;
4237         }
4238         return 0;
4239 }
4240 EXPORT_SYMBOL_GPL(kvm_set_msr_common);

Fortunately, it returns 0 at line 4238. No error is detected by QEMU.

Perhaps I may need to send message with a small patch to return 1 in the
implicit 'else' to kvm mailing list to confirm if that is expected.

However, the answer is very likely 'expected', because line 4229 to line 4230
already explain it.


Regarding the change in QEMU:

Since kvm_check_extension(s, KVM_CAP_PMU_CAPABILITY) returns 0 for both (1)
enable_pmu=Y, and (2) KVM_CAP_PMU_CAPABILITY not supported, I may need to use
g_file_get_contents() to read from the KVM sysfs parameters (similar to KVM
selftest kvm_is_pmu_enabled()).

> 
>> +
>> +static bool is_same_vendor(CPUX86State *env)
>> +{
>> +    static uint32_t host_cpuid_vendor1;
>> +    static uint32_t host_cpuid_vendor2;
>> +    static uint32_t host_cpuid_vendor3;
>> +
>> +    host_cpuid(0x0, 0, NULL, &host_cpuid_vendor1, &host_cpuid_vendor3,
>> +               &host_cpuid_vendor2);
>> +
>> +    return env->cpuid_vendor1 == host_cpuid_vendor1 &&
>> +           env->cpuid_vendor2 == host_cpuid_vendor2 &&
>> +           env->cpuid_vendor3 == host_cpuid_vendor3;
>> +}
>> +
>> +static void kvm_init_pmu_info(CPUX86State *env)
>> +{
>> +    /*
>> +     * It is not supported to virtualize AMD PMU registers on Intel
>> +     * processors, nor to virtualize Intel PMU registers on AMD processors.
>> +     */
>> +    if (!is_same_vendor(env)) {
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * If KVM_CAP_PMU_CAPABILITY is not supported, there is no way to
>> +     * disable the AMD pmu virtualization.
>> +     *
>> +     * If KVM_CAP_PMU_CAPABILITY is supported, kvm_state->pmu_cap_disabled
>> +     * indicates the KVM has already disabled the pmu virtualization.
>> +     */
>> +    if (kvm_state->pmu_cap_disabled) {
>> +        return;
>> +    }
>> +
> 
> It seems that after these changes the issue concerning using
> pmu-cap-disable=true with +pmu on Intel platform (that Zhao Liu has mentioned
> before) is fixed

Can I assume you were going to paste some code below?

Regardless, I am going to following Zhao's suggestion to revert back to my
previous solution.

Thank you very much for the feedback!

Dongli Zhang

> 
>> +    if (IS_INTEL_CPU(env)) {
>> +        kvm_init_pmu_info_intel(env);
>> +    } else if (IS_AMD_CPU(env)) {
>> +        kvm_init_pmu_info_amd(env);
>> +    }
>> +}
>> +
>>   int kvm_arch_init_vcpu(CPUState *cs)
>>   {
>>       struct {
>> @@ -4027,7 +4101,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>>               kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, env-
>> >poll_control_msr);
>>           }
>>   -        if (has_pmu_version > 0) {
>> +        if (IS_INTEL_CPU(env) && has_pmu_version > 0) {
>>               if (has_pmu_version > 1) {
>>                   /* Stop the counter.  */
>>                   kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>> @@ -4058,6 +4132,38 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>>                                     env->msr_global_ctrl);
>>               }
>>           }
>> +
>> +        if (IS_AMD_CPU(env) && has_pmu_version > 0) {
>> +            uint32_t sel_base = MSR_K7_EVNTSEL0;
>> +            uint32_t ctr_base = MSR_K7_PERFCTR0;
>> +            /*
>> +             * The address of the next selector or counter register is
>> +             * obtained by incrementing the address of the current selector
>> +             * or counter register by one.
>> +             */
>> +            uint32_t step = 1;
>> +
>> +            /*
>> +             * When PERFCORE is enabled, AMD PMU uses a separate set of
>> +             * addresses for the selector and counter registers.
>> +             * Additionally, the address of the next selector or counter
>> +             * register is determined by incrementing the address of the
>> +             * current register by two.
>> +             */
>> +            if (num_pmu_gp_counters == AMD64_NUM_COUNTERS_CORE) {
>> +                sel_base = MSR_F15H_PERF_CTL0;
>> +                ctr_base = MSR_F15H_PERF_CTR0;
>> +                step = 2;
>> +            }
>> +
>> +            for (i = 0; i < num_pmu_gp_counters; i++) {
>> +                kvm_msr_entry_add(cpu, ctr_base + i * step,
>> +                                  env->msr_gp_counters[i]);
>> +                kvm_msr_entry_add(cpu, sel_base + i * step,
>> +                                  env->msr_gp_evtsel[i]);
>> +            }
>> +        }
>> +
>>           /*
>>            * Hyper-V partition-wide MSRs: to avoid clearing them on cpu hot-add,
>>            * only sync them to KVM on the first cpu
>> @@ -4503,7 +4609,8 @@ static int kvm_get_msrs(X86CPU *cpu)
>>       if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_POLL_CONTROL)) {
>>           kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, 1);
>>       }
>> -    if (has_pmu_version > 0) {
>> +
>> +    if (IS_INTEL_CPU(env) && has_pmu_version > 0) {
>>           if (has_pmu_version > 1) {
>>               kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>>               kvm_msr_entry_add(cpu, MSR_CORE_PERF_GLOBAL_CTRL, 0);
>> @@ -4519,6 +4626,35 @@ static int kvm_get_msrs(X86CPU *cpu)
>>           }
>>       }
>>   +    if (IS_AMD_CPU(env) && has_pmu_version > 0) {
>> +        uint32_t sel_base = MSR_K7_EVNTSEL0;
>> +        uint32_t ctr_base = MSR_K7_PERFCTR0;
>> +        /*
>> +         * The address of the next selector or counter register is
>> +         * obtained by incrementing the address of the current selector
>> +         * or counter register by one.
>> +         */
>> +        uint32_t step = 1;
>> +
>> +        /*
>> +         * When PERFCORE is enabled, AMD PMU uses a separate set of
>> +         * addresses for the selector and counter registers.
>> +         * Additionally, the address of the next selector or counter
>> +         * register is determined by incrementing the address of the
>> +         * current register by two.
>> +         */
>> +        if (num_pmu_gp_counters == AMD64_NUM_COUNTERS_CORE) {
>> +            sel_base = MSR_F15H_PERF_CTL0;
>> +            ctr_base = MSR_F15H_PERF_CTR0;
>> +            step = 2;
>> +        }
>> +
>> +        for (i = 0; i < num_pmu_gp_counters; i++) {
>> +            kvm_msr_entry_add(cpu, ctr_base + i * step, 0);
>> +            kvm_msr_entry_add(cpu, sel_base + i * step, 0);
>> +        }
>> +    }
>> +
>>       if (env->mcg_cap) {
>>           kvm_msr_entry_add(cpu, MSR_MCG_STATUS, 0);
>>           kvm_msr_entry_add(cpu, MSR_MCG_CTL, 0);
>> @@ -4830,6 +4966,20 @@ static int kvm_get_msrs(X86CPU *cpu)
>>           case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL0 + MAX_GP_COUNTERS - 1:
>>               env->msr_gp_evtsel[index - MSR_P6_EVNTSEL0] = msrs[i].data;
>>               break;
>> +        case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL0 + 3:
>> +            env->msr_gp_evtsel[index - MSR_K7_EVNTSEL0] = msrs[i].data;
>> +            break;
>> +        case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR0 + 3:
>> +            env->msr_gp_counters[index - MSR_K7_PERFCTR0] = msrs[i].data;
>> +            break;
>> +        case MSR_F15H_PERF_CTL0 ... MSR_F15H_PERF_CTL0 + 0xb:
>> +            index = index - MSR_F15H_PERF_CTL0;
>> +            if (index & 0x1) {
>> +                env->msr_gp_counters[index] = msrs[i].data;
>> +            } else {
>> +                env->msr_gp_evtsel[index] = msrs[i].data;
>> +            }
>> +            break;
>>           case HV_X64_MSR_HYPERCALL:
>>               env->msr_hv_hypercall = msrs[i].data;
>>               break;
> 



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/7] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE
  2024-11-07 23:44     ` dongli.zhang
@ 2024-11-08  2:32       ` Zhao Liu
  2024-11-08 12:52       ` Sandipan Das
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 32+ messages in thread
From: Zhao Liu @ 2024-11-08  2:32 UTC (permalink / raw)
  To: dongli.zhang
  Cc: qemu-devel, kvm, pbonzini, mtosatti, sandipan.das, babu.moger,
	likexu, like.xu.linux, zhenyuw, groug, lyan, khorenko,
	alexander.ivanov, den, joe.jin, davydov-max, dapeng1.mi,
	zide.chen

> I will wait for a day or maybe the weekend. I am going to switch to the previous
> solution in v2 if there isn't any further objection with a more valid reason.
> 
> Thank you very much for the feedback!
> 

Welcome. It's now v9.2 soft frozen. I'm also continuing to review your
remaining patches.

Regards,
Zhao



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/7] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE
  2024-11-07 23:44     ` dongli.zhang
  2024-11-08  2:32       ` Zhao Liu
@ 2024-11-08 12:52       ` Sandipan Das
  2024-11-13 17:15       ` Zhao Liu
  2024-11-21 10:06       ` Mi, Dapeng
  3 siblings, 0 replies; 32+ messages in thread
From: Sandipan Das @ 2024-11-08 12:52 UTC (permalink / raw)
  To: dongli.zhang, Zhao Liu
  Cc: qemu-devel, kvm, pbonzini, mtosatti, babu.moger, likexu,
	like.xu.linux, zhenyuw, groug, lyan, khorenko, alexander.ivanov,
	den, joe.jin, davydov-max, dapeng1.mi, zide.chen

On 11/8/2024 5:14 AM, dongli.zhang@oracle.com wrote:
> Hi Zhao,
> 
> 
> On 11/6/24 11:52 PM, Zhao Liu wrote:
>> (+Dapang & Zide)
>>
>> Hi Dongli,
>>
>> On Mon, Nov 04, 2024 at 01:40:17AM -0800, Dongli Zhang wrote:
>>> Date: Mon,  4 Nov 2024 01:40:17 -0800
>>> From: Dongli Zhang <dongli.zhang@oracle.com>
>>> Subject: [PATCH 2/7] target/i386/kvm: introduce 'pmu-cap-disabled' to set
>>>  KVM_PMU_CAP_DISABLE
>>> X-Mailer: git-send-email 2.43.5
>>>
>>> The AMD PMU virtualization is not disabled when configuring
>>> "-cpu host,-pmu" in the QEMU command line on an AMD server. Neither
>>> "-cpu host,-pmu" nor "-cpu EPYC" effectively disables AMD PMU
>>> virtualization in such an environment.
>>>
>>> As a result, VM logs typically show:
>>>
>>> [    0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
>>>
>>> whereas the expected logs should be:
>>>
>>> [    0.596381] Performance Events: PMU not available due to virtualization, using software events only.
>>> [    0.600972] NMI watchdog: Perf NMI watchdog permanently disabled
>>>
>>> This discrepancy occurs because AMD PMU does not use CPUID to determine
>>> whether PMU virtualization is supported.
>>
>> Intel platform doesn't have this issue since Linux kernel fails to check
>> the CPU family & model when "-cpu *,-pmu" option clears PMU version.
>>
>> The difference between Intel and AMD platforms, however, is that it seems
>> Intel hardly ever reaches the “...due virtualization” message, but
>> instead reports an error because it recognizes a mismatched family/model.
>>
>> This may be a drawback of the PMU driver's print message, but the result
>> is the same, it prevents the PMU driver from enabling.
>>
>> So, please mention that KVM_PMU_CAP_DISABLE doesn't change the PMU
>> behavior on Intel platform because current "pmu" property works as
>> expected.
> 
> Sure. I will mention this in v2.
> 
>>
>>> To address this, we introduce a new property, 'pmu-cap-disabled', for KVM
>>> acceleration. This property sets KVM_PMU_CAP_DISABLE if
>>> KVM_CAP_PMU_CAPABILITY is supported. Note that this feature currently
>>> supports only x86 hosts, as KVM_CAP_PMU_CAPABILITY is used exclusively for
>>> x86 systems.
>>>
>>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
>>> ---
>>> Another previous solution to re-use '-cpu host,-pmu':
>>> https://urldefense.com/v3/__https://lore.kernel.org/all/20221119122901.2469-1-dongli.zhang@oracle.com/__;!!ACWV5N9M2RV99hQ!Nm8Db-mwBoMIwKkRqzC9kgNi5uZ7SCIf43zUBn92Ar_NEbLXq-ZkrDDvpvDQ4cnS2i4VyKAp6CRVE12bRkMF$ 
>>
>> IMO, I prefer the previous version. This VM-level KVM property is
>> difficult to integrate with the existing CPU properties. Pls refer later
>> comments for reasons.
>>
>>>  accel/kvm/kvm-all.c        |  1 +
>>>  include/sysemu/kvm_int.h   |  1 +
>>>  qemu-options.hx            |  9 ++++++-
>>>  target/i386/cpu.c          |  2 +-
>>>  target/i386/kvm/kvm.c      | 52 ++++++++++++++++++++++++++++++++++++++
>>>  target/i386/kvm/kvm_i386.h |  2 ++
>>>  6 files changed, 65 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>>> index 801cff16a5..8b5ba45cf7 100644
>>> --- a/accel/kvm/kvm-all.c
>>> +++ b/accel/kvm/kvm-all.c
>>> @@ -3933,6 +3933,7 @@ static void kvm_accel_instance_init(Object *obj)
>>>      s->xen_evtchn_max_pirq = 256;
>>>      s->device = NULL;
>>>      s->msr_energy.enable = false;
>>> +    s->pmu_cap_disabled = false;
>>>  }
>>
>> The CPU property "pmu" also defaults to "false"...but:
>>
>>  * max CPU would override this and try to enable PMU by default in
>>    max_x86_cpu_initfn().
>>
>>  * Other named CPU models keep the default setting to avoid affecting
>>    the migration.
>>
>> The pmu_cap_disabled and “pmu” property look unbound and unassociated,
>> so this can cause the conflict when they are not synchronized. For
>> example,
>>
>> -cpu host -accel kvm,pmu-cap-disabled=on
>>
>> The above options will fail to launch a VM (on Intel platform).
>>
>> Ideally, the “pmu” property and pmu-cap-disabled should be bound to each
>> other and be consistent. But it's not easy because:
>>  - There is no proper way to have pmu_cap_disabled set different default
>>    values (e.g., "false" for max CPU and "true" for named CPU models)
>>    based on different CPU models.
>>  - And, no proper place to check the consistency of pmu_cap_disabled and
>>    enable_pmu.
>>
>> Therefore, I prefer your previous approach, to reuse current CPU "pmu"
>> property.
> 
> Thank you very much for the suggestion and reasons.
> 
> I am going to follow your suggestion to switch back to the previous solution in v2.
> 
>>
>> Further, considering that this is currently the only case that needs to
>> to set the VM level's capability in the CPU context, there is no need to
>> introduce a new kvm interface (in your previous patch), which can instead
>> be set in kvm_cpu_realizefn(), like:
>>
>>
>> diff --git a/target/i386/kvm/kvm-cpu.c b/target/i386/kvm/kvm-cpu.c
>> index 99d1941cf51c..05e9c9a1a0cf 100644
>> --- a/target/i386/kvm/kvm-cpu.c
>> +++ b/target/i386/kvm/kvm-cpu.c
>> @@ -42,6 +42,8 @@ static bool kvm_cpu_realizefn(CPUState *cs, Error **errp)
>>  {
>>      X86CPU *cpu = X86_CPU(cs);
>>      CPUX86State *env = &cpu->env;
>> +    KVMState *s = kvm_state;
>> +    static bool first = true;
>>      bool ret;
>>
>>      /*
>> @@ -63,6 +65,29 @@ static bool kvm_cpu_realizefn(CPUState *cs, Error **errp)
>>       *   check/update ucode_rev, phys_bits, guest_phys_bits, mwait
>>       *   cpu_common_realizefn() (via xcc->parent_realize)
>>       */
>> +
>> +    if (first) {
>> +        first = false;
>> +
>> +        /*
>> +         * Since Linux v5.18, KVM provides a VM-level capability to easily
>> +         * disable PMUs; however, QEMU has been providing PMU property per
>> +         * CPU since v1.6. In order to accommodate both, have to configure
>> +         * the VM-level capability here.
>> +         */
>> +        if (!cpu->enable_pmu &&
>> +            kvm_check_extension(s, KVM_CAP_PMU_CAPABILITY)) {
>> +            int r = kvm_vm_enable_cap(s, KVM_CAP_PMU_CAPABILITY, 0,
>> +                                      KVM_PMU_CAP_DISABLE);
>> +
>> +            if (r < 0) {
>> +                error_setg(errp, "kvm: Failed to disable pmu cap: %s",
>> +                           strerror(-r));
>> +                return false;
>> +            }
>> +        }
>> +    }
>> +
>>      if (cpu->max_features) {
>>          if (enable_cpu_pm) {
>>              if (kvm_has_waitpkg()) {
>> ---
> 
> Sure. I will limit the change within only x86 + KVM.
> 
>>
>> In addition, if PMU is disabled, why not mask the perf related bits in
>> 8000_0001_ECX? :)
>>
> 
> My fault. I have masked only 0x80000022, and I forgot 0x80000001 for AMD.
> 
> Thank you very much for the reminder.
> 
> 

I think this may not show a functional impact because the guest kernel
would anyway not register the core PMU because of being unable to access
the PMC MSRs as a result of KVM_PMU_CAP_DISABLE but its the right thing
to do because the guest OS may not always be Linux-based :)


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 6/7] target/i386/kvm: support perfmon-v2 for reset
  2024-11-04  9:40 ` [PATCH 6/7] target/i386/kvm: support perfmon-v2 for reset Dongli Zhang
@ 2024-11-08 13:09   ` Sandipan Das
  2024-11-08 16:55     ` dongli.zhang
  0 siblings, 1 reply; 32+ messages in thread
From: Sandipan Das @ 2024-11-08 13:09 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: qemu-devel, kvm, pbonzini, mtosatti, babu.moger, zhao1.liu,
	likexu, like.xu.linux, zhenyuw, groug, lyan, khorenko,
	alexander.ivanov, den, joe.jin, davydov-max

On 11/4/2024 3:10 PM, Dongli Zhang wrote:
> Since perfmon-v2, the AMD PMU supports additional registers. This update
> includes get/put functionality for these extra registers.
> 
> Similar to the implementation in KVM:
> 
> - MSR_CORE_PERF_GLOBAL_STATUS and MSR_AMD64_PERF_CNTR_GLOBAL_STATUS both
> use env->msr_global_status.
> - MSR_CORE_PERF_GLOBAL_CTRL and MSR_AMD64_PERF_CNTR_GLOBAL_CTL both use
> env->msr_global_ctrl.
> - MSR_CORE_PERF_GLOBAL_OVF_CTRL and MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR
> both use env->msr_global_ovf_ctrl.
> 
> No changes are needed for vmstate_msr_architectural_pmu or
> pmu_enable_needed().
> 
> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> ---
>  target/i386/cpu.h     |  4 ++++
>  target/i386/kvm/kvm.c | 47 ++++++++++++++++++++++++++++++++++---------
>  2 files changed, 42 insertions(+), 9 deletions(-)
> 
> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> index 0505eb3b08..68ed798808 100644
> --- a/target/i386/cpu.h
> +++ b/target/i386/cpu.h
> @@ -488,6 +488,10 @@ typedef enum X86Seg {
>  #define MSR_CORE_PERF_GLOBAL_CTRL       0x38f
>  #define MSR_CORE_PERF_GLOBAL_OVF_CTRL   0x390
>  
> +#define MSR_AMD64_PERF_CNTR_GLOBAL_STATUS       0xc0000300
> +#define MSR_AMD64_PERF_CNTR_GLOBAL_CTL          0xc0000301
> +#define MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR   0xc0000302
> +
>  #define MSR_K7_EVNTSEL0                 0xc0010000
>  #define MSR_K7_PERFCTR0                 0xc0010004
>  #define MSR_F15H_PERF_CTL0              0xc0010200
> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> index 83ec85a9b9..918dcb61fe 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c
> @@ -2074,6 +2074,8 @@ static void kvm_init_pmu_info_intel(CPUX86State *env)
>  
>  static void kvm_init_pmu_info_amd(CPUX86State *env)
>  {
> +    uint32_t eax, ebx;
> +    uint32_t unused;
>      int64_t family;
>  
>      has_pmu_version = 0;
> @@ -2102,6 +2104,13 @@ static void kvm_init_pmu_info_amd(CPUX86State *env)
>      }
>  
>      num_pmu_gp_counters = AMD64_NUM_COUNTERS_CORE;
> +
> +    cpu_x86_cpuid(env, 0x80000022, 0, &eax, &ebx, &unused, &unused);
> +
> +    if (eax & CPUID_8000_0022_EAX_PERFMON_V2) {
> +        has_pmu_version = 2;
> +        num_pmu_gp_counters = ebx & 0xf;
> +    }
>  }
>  
>  static bool is_same_vendor(CPUX86State *env)
> @@ -4144,13 +4153,14 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>              uint32_t step = 1;
>  
>              /*
> -             * When PERFCORE is enabled, AMD PMU uses a separate set of
> -             * addresses for the selector and counter registers.
> -             * Additionally, the address of the next selector or counter
> -             * register is determined by incrementing the address of the
> -             * current register by two.
> +             * When PERFCORE or PerfMonV2 is enabled, AMD PMU uses a
> +             * separate set of addresses for the selector and counter
> +             * registers. Additionally, the address of the next selector or
> +             * counter register is determined by incrementing the address
> +             * of the current register by two.
>               */
> -            if (num_pmu_gp_counters == AMD64_NUM_COUNTERS_CORE) {
> +            if (num_pmu_gp_counters == AMD64_NUM_COUNTERS_CORE ||
> +                has_pmu_version == 2) {

Future PMU versions are expected to be backwards compatible. So it may be
better to look for has_pmu_version > 1.

>                  sel_base = MSR_F15H_PERF_CTL0;
>                  ctr_base = MSR_F15H_PERF_CTR0;
>                  step = 2;
> @@ -4162,6 +4172,15 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>                  kvm_msr_entry_add(cpu, sel_base + i * step,
>                                    env->msr_gp_evtsel[i]);
>              }
> +
> +            if (has_pmu_version == 2) {
> +                kvm_msr_entry_add(cpu, MSR_AMD64_PERF_CNTR_GLOBAL_STATUS,
> +                                  env->msr_global_status);
> +                kvm_msr_entry_add(cpu, MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR,
> +                                  env->msr_global_ovf_ctrl);
> +                kvm_msr_entry_add(cpu, MSR_AMD64_PERF_CNTR_GLOBAL_CTL,
> +                                  env->msr_global_ctrl);
> +            }
>          }
>  
>          /*
> @@ -4637,13 +4656,14 @@ static int kvm_get_msrs(X86CPU *cpu)
>          uint32_t step = 1;
>  
>          /*
> -         * When PERFCORE is enabled, AMD PMU uses a separate set of
> -         * addresses for the selector and counter registers.
> +         * When PERFCORE or PerfMonV2 is enabled, AMD PMU uses a separate
> +         * set of addresses for the selector and counter registers.
>           * Additionally, the address of the next selector or counter
>           * register is determined by incrementing the address of the
>           * current register by two.
>           */
> -        if (num_pmu_gp_counters == AMD64_NUM_COUNTERS_CORE) {
> +        if (num_pmu_gp_counters == AMD64_NUM_COUNTERS_CORE ||
> +            has_pmu_version == 2) {
>              sel_base = MSR_F15H_PERF_CTL0;
>              ctr_base = MSR_F15H_PERF_CTR0;
>              step = 2;
> @@ -4653,6 +4673,12 @@ static int kvm_get_msrs(X86CPU *cpu)
>              kvm_msr_entry_add(cpu, ctr_base + i * step, 0);
>              kvm_msr_entry_add(cpu, sel_base + i * step, 0);
>          }
> +
> +        if (has_pmu_version == 2) {
> +            kvm_msr_entry_add(cpu, MSR_AMD64_PERF_CNTR_GLOBAL_CTL, 0);
> +            kvm_msr_entry_add(cpu, MSR_AMD64_PERF_CNTR_GLOBAL_STATUS, 0);
> +            kvm_msr_entry_add(cpu, MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR, 0);
> +        }
>      }
>  
>      if (env->mcg_cap) {
> @@ -4949,12 +4975,15 @@ static int kvm_get_msrs(X86CPU *cpu)
>              env->msr_fixed_ctr_ctrl = msrs[i].data;
>              break;
>          case MSR_CORE_PERF_GLOBAL_CTRL:
> +        case MSR_AMD64_PERF_CNTR_GLOBAL_CTL:
>              env->msr_global_ctrl = msrs[i].data;
>              break;
>          case MSR_CORE_PERF_GLOBAL_STATUS:
> +        case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS:
>              env->msr_global_status = msrs[i].data;
>              break;
>          case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
> +        case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR:
>              env->msr_global_ovf_ctrl = msrs[i].data;
>              break;
>          case MSR_CORE_PERF_FIXED_CTR0 ... MSR_CORE_PERF_FIXED_CTR0 + MAX_FIXED_COUNTERS - 1:



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/7] target/i386/kvm: reset AMD PMU registers during VM reset
  2024-11-08  1:19     ` dongli.zhang
@ 2024-11-08 14:07       ` Maksim Davydov
  2024-11-08 18:04         ` dongli.zhang
  0 siblings, 1 reply; 32+ messages in thread
From: Maksim Davydov @ 2024-11-08 14:07 UTC (permalink / raw)
  To: dongli.zhang
  Cc: pbonzini, mtosatti, sandipan.das, babu.moger, zhao1.liu, likexu,
	like.xu.linux, zhenyuw, groug, lyan, khorenko, alexander.ivanov,
	den, joe.jin, qemu-devel, kvm



On 11/8/24 04:19, dongli.zhang@oracle.com wrote:
> Hi Maksim,
> 
> On 11/7/24 1:00 PM, Maksim Davydov wrote:
>>
>>
>> On 11/4/24 12:40, Dongli Zhang wrote:
>>> QEMU uses the kvm_get_msrs() function to save Intel PMU registers from KVM
>>> and kvm_put_msrs() to restore them to KVM. However, there is no support for
>>> AMD PMU registers. Currently, has_pmu_version and num_pmu_gp_counters are
>>> initialized based on cpuid(0xa), which does not apply to AMD processors.
>>> For AMD CPUs, prior to PerfMonV2, the number of general-purpose registers
>>> is determined based on the CPU version.
>>>
>>> To address this issue, we need to add support for AMD PMU registers.
>>> Without this support, the following problems can arise:
>>>
>>> 1. If the VM is reset (e.g., via QEMU system_reset or VM kdump/kexec) while
>>> running "perf top", the PMU registers are not disabled properly.
>>>
>>> 2. Despite x86_cpu_reset() resetting many registers to zero, kvm_put_msrs()
>>> does not handle AMD PMU registers, causing some PMU events to remain
>>> enabled in KVM.
>>>
>>> 3. The KVM kvm_pmc_speculative_in_use() function consistently returns true,
>>> preventing the reclamation of these events. Consequently, the
>>> kvm_pmc->perf_event remains active.
>>>
>>> 4. After a reboot, the VM kernel may report the following error:
>>>
>>> [    0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS detected,
>>> complain to your hardware vendor.
>>> [    0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR
>>> c0010200 is 530076)
>>>
>>> 5. In the worst case, the active kvm_pmc->perf_event may inject unknown
>>> NMIs randomly into the VM kernel:
>>>
>>> [...] Uhhuh. NMI received for unknown reason 30 on CPU 0.
>>>
>>> To resolve these issues, we propose resetting AMD PMU registers during the
>>> VM reset process.
>>>
>>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
>>> ---
>>>    target/i386/cpu.h     |   8 +++
>>>    target/i386/kvm/kvm.c | 156 +++++++++++++++++++++++++++++++++++++++++-
>>>    2 files changed, 161 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
>>> index 59959b8b7a..0505eb3b08 100644
>>> --- a/target/i386/cpu.h
>>> +++ b/target/i386/cpu.h
>>> @@ -488,6 +488,14 @@ typedef enum X86Seg {
>>>    #define MSR_CORE_PERF_GLOBAL_CTRL       0x38f
>>>    #define MSR_CORE_PERF_GLOBAL_OVF_CTRL   0x390
>>>    +#define MSR_K7_EVNTSEL0                 0xc0010000
>>> +#define MSR_K7_PERFCTR0                 0xc0010004
>>> +#define MSR_F15H_PERF_CTL0              0xc0010200
>>> +#define MSR_F15H_PERF_CTR0              0xc0010201
>>> +
>>> +#define AMD64_NUM_COUNTERS              4
>>> +#define AMD64_NUM_COUNTERS_CORE         6
>>> +
>>>    #define MSR_MC0_CTL                     0x400
>>>    #define MSR_MC0_STATUS                  0x401
>>>    #define MSR_MC0_ADDR                    0x402
>>> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
>>> index ca2b644e2c..83ec85a9b9 100644
>>> --- a/target/i386/kvm/kvm.c
>>> +++ b/target/i386/kvm/kvm.c
>>> @@ -2035,7 +2035,7 @@ full:
>>>        abort();
>>>    }
>>>    -static void kvm_init_pmu_info(CPUX86State *env)
>>> +static void kvm_init_pmu_info_intel(CPUX86State *env)
>>>    {
>>>        uint32_t eax, edx;
>>>        uint32_t unused;
>>> @@ -2072,6 +2072,80 @@ static void kvm_init_pmu_info(CPUX86State *env)
>>>        }
>>>    }
>>>    +static void kvm_init_pmu_info_amd(CPUX86State *env)
>>> +{
>>> +    int64_t family;
>>> +
>>> +    has_pmu_version = 0;
>>> +
>>> +    /*
>>> +     * To determine the CPU family, the following code is derived from
>>> +     * x86_cpuid_version_get_family().
>>> +     */
>>> +    family = (env->cpuid_version >> 8) & 0xf;
>>> +    if (family == 0xf) {
>>> +        family += (env->cpuid_version >> 20) & 0xff;
>>> +    }
>>> +
>>> +    /*
>>> +     * Performance-monitoring supported from K7 and later.
>>> +     */
>>> +    if (family < 6) {
>>> +        return;
>>> +    }
>>> +
>>> +    has_pmu_version = 1;
>>> +
>>> +    if (!(env->features[FEAT_8000_0001_ECX] & CPUID_EXT3_PERFCORE)) {
>>> +        num_pmu_gp_counters = AMD64_NUM_COUNTERS;
>>> +        return;
>>> +    }
>>> +
>>> +    num_pmu_gp_counters = AMD64_NUM_COUNTERS_CORE;
>>> +}
>>
>> It seems that AMD implementation has one issue.
>> KVM has parameter `enable_pmu`. So vPMU can be disabled in another way, not only
>> via KVM_PMU_CAP_DISABLE. For Intel it's not a problem, because the vPMU
>> initialization uses info from KVM_GET_SUPPORTED_CPUID. The enable_pmu state is
>> reflected in KVM_GET_SUPPORTED_CPUID.  Thus no PMU MSRs in kvm_put_msrs/
>> kvm_get_msrs will be used.
>>
>> But on AMD we don't use information from KVM_GET_SUPPORTED_CPUID to set an
>> appropriate number of PMU registers. So, if vPMU is disabled by KVM parameter
>> `enable_pmu` and pmu-cap-disable=false, then has_pmu_version will be 1 after
>> kvm_init_pmu_info_amd execution. It means that in kvm_put_msrs/kvm_get_msrs 4
>> PMU counters will be processed, but the correct behavior in that situation is to
>> skip all PMU registers.
>> I think we should get info from KVM to fix that.
>>
>> I tested this series on Zen2 and found that PMU MSRs were still processed during
>> initialization even with enable_pmu=N. But it doesn't lead to any errors in QEMU
> 
> Thank you very much for the feedback and helping catch the bug!
> 
> When enable_pmu=N, the QEMU (with this patchset) cannot tell if vPMU is
> supported via KVM_CAP_PMU_CAPABILITY.
> 
> As it cannot disable the PMU, it falls to the legacy 4 counters.
> 
> It falls to 4 counters because KVM disableds PERFCORE on enable_pmu=Y, i.e.,
> 
> 5220         if (enable_pmu) {
> 5221                 /*
> 5222                  * Enumerate support for PERFCTR_CORE if and only if KVM has
> 5223                  * access to enough counters to virtualize "core" support,
> 5224                  * otherwise limit vPMU support to the legacy number of
> counters.
> 5225                  */
> 5226                 if (kvm_pmu_cap.num_counters_gp < AMD64_NUM_COUNTERS_CORE)
> 5227                         kvm_pmu_cap.num_counters_gp = min(AMD64_NUM_COUNTERS,
> 5228
> kvm_pmu_cap.num_counters_gp);
> 5229                 else
> 5230                         kvm_cpu_cap_check_and_set(X86_FEATURE_PERFCTR_CORE);
> 5231
> 5232                 if (kvm_pmu_cap.version != 2 ||
> 5233                     !kvm_cpu_cap_has(X86_FEATURE_PERFCTR_CORE))
> 5234                         kvm_cpu_cap_clear(X86_FEATURE_PERFMON_V2);
> 5235         }
> 
> 
> During the bootup and reset, the QEMU (with this patchset) erroneously resets
> MSRs for the 4 PMCs, via line 3827.
> 
> 3825 static int kvm_buf_set_msrs(X86CPU *cpu)
> 3826 {
> 3827     int ret = kvm_vcpu_ioctl(CPU(cpu), KVM_SET_MSRS, cpu->kvm_msr_buf);
> 3828     if (ret < 0) {
> 3829         return ret;
> 3830     }
> 3831
> 3832     if (ret < cpu->kvm_msr_buf->nmsrs) {
> 3833         struct kvm_msr_entry *e = &cpu->kvm_msr_buf->entries[ret];
> 3834         error_report("error: failed to set MSR 0x%" PRIx32 " to 0x%" PRIx64,
> 3835                      (uint32_t)e->index, (uint64_t)e->data);
> 3836     }
> 3837
> 3838     assert(ret == cpu->kvm_msr_buf->nmsrs);
> 3839     return 0;
> 3840 }
> 
> Because enable_pmu=N, the KVM doesn't support those registers. However, it
> returns 0 (not 1), because the KVM does nothing in the implicit else (i.e., line
> 4144).
> 
> 3847 int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> 3848 {
> ... ...
> 4138         case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR3:
> 4139         case MSR_P6_PERFCTR0 ... MSR_P6_PERFCTR1:
> 4140         case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL3:
> 4141         case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL1:
> 4142                 if (kvm_pmu_is_valid_msr(vcpu, msr))
> 4143                         return kvm_pmu_set_msr(vcpu, msr_info);
> 4144
> 4145                 if (data)
> 4146                         kvm_pr_unimpl_wrmsr(vcpu, msr, data);
> 4147                 break;
> ... ...
> 4224         default:
> 4225                 if (kvm_pmu_is_valid_msr(vcpu, msr))
> 4226                         return kvm_pmu_set_msr(vcpu, msr_info);
> 4227
> 4228                 /*
> 4229                  * Userspace is allowed to write '0' to MSRs that KVM reports
> 4230                  * as to-be-saved, even if an MSRs isn't fully supported.
> 4231                  */
> 4232                 if (msr_info->host_initiated && !data &&
> 4233                     kvm_is_msr_to_save(msr))
> 4234                         break;
> 4235
> 4236                 return KVM_MSR_RET_INVALID;
> 4237         }
> 4238         return 0;
> 4239 }
> 4240 EXPORT_SYMBOL_GPL(kvm_set_msr_common);
> 
> Fortunately, it returns 0 at line 4238. No error is detected by QEMU.
> 
> Perhaps I may need to send message with a small patch to return 1 in the
> implicit 'else' to kvm mailing list to confirm if that is expected.
> 
> However, the answer is very likely 'expected', because line 4229 to line 4230
> already explain it.
> 

Sorry for confusing you. My fault.
I tested the previous series on Intel with the old kernel and QEMU 
failed with `error: failed to set MSR 0x38d to 0x0`. So I expected the 
same error.
But as I can see, AMD PMU registers are processed differently than the 
Intel ones. Also the default MSR behavior in KVM has been changed since 
2de154f541fc

I think that the current implementation with additional parameter 
pmu-cap-disabled does what we expect. The guest will see disabled PMU in 
the same two configurations:
* pmu-cap-disabled=true and enabled_pmu=N
* pmu-cap-disabled=true and enabled_pmu=Y
But in QEMU these two configurations will have different states 
(has_pmu_version 1 and 0 respectively). I think it should be taken into 
account in the implementation without pmu-cap-disabled (which was 
suggested before) to save guest-visible state during migration.

> 
> Regarding the change in QEMU:
> 
> Since kvm_check_extension(s, KVM_CAP_PMU_CAPABILITY) returns 0 for both (1)
> enable_pmu=Y, and (2) KVM_CAP_PMU_CAPABILITY not supported, I may need to use
> g_file_get_contents() to read from the KVM sysfs parameters (similar to KVM
> selftest kvm_is_pmu_enabled()).
> 
>>
>>> +
>>> +static bool is_same_vendor(CPUX86State *env)
>>> +{
>>> +    static uint32_t host_cpuid_vendor1;
>>> +    static uint32_t host_cpuid_vendor2;
>>> +    static uint32_t host_cpuid_vendor3;
>>> +
>>> +    host_cpuid(0x0, 0, NULL, &host_cpuid_vendor1, &host_cpuid_vendor3,
>>> +               &host_cpuid_vendor2);
>>> +
>>> +    return env->cpuid_vendor1 == host_cpuid_vendor1 &&
>>> +           env->cpuid_vendor2 == host_cpuid_vendor2 &&
>>> +           env->cpuid_vendor3 == host_cpuid_vendor3;
>>> +}
>>> +
>>> +static void kvm_init_pmu_info(CPUX86State *env)
>>> +{
>>> +    /*
>>> +     * It is not supported to virtualize AMD PMU registers on Intel
>>> +     * processors, nor to virtualize Intel PMU registers on AMD processors.
>>> +     */
>>> +    if (!is_same_vendor(env)) {
>>> +        return;
>>> +    }
>>> +
>>> +    /*
>>> +     * If KVM_CAP_PMU_CAPABILITY is not supported, there is no way to
>>> +     * disable the AMD pmu virtualization.
>>> +     *
>>> +     * If KVM_CAP_PMU_CAPABILITY is supported, kvm_state->pmu_cap_disabled
>>> +     * indicates the KVM has already disabled the pmu virtualization.
>>> +     */
>>> +    if (kvm_state->pmu_cap_disabled) {
>>> +        return;
>>> +    }
>>> +
>>
>> It seems that after these changes the issue concerning using
>> pmu-cap-disable=true with +pmu on Intel platform (that Zhao Liu has mentioned
>> before) is fixed
> 
> Can I assume you were going to paste some code below?
> 
> Regardless, I am going to following Zhao's suggestion to revert back to my
> previous solution.
> 
> Thank you very much for the feedback!
> 
> Dongli Zhang
> 
>>
>>> +    if (IS_INTEL_CPU(env)) {
>>> +        kvm_init_pmu_info_intel(env);
>>> +    } else if (IS_AMD_CPU(env)) {
>>> +        kvm_init_pmu_info_amd(env);
>>> +    }
>>> +}
>>> +
>>>    int kvm_arch_init_vcpu(CPUState *cs)
>>>    {
>>>        struct {
>>> @@ -4027,7 +4101,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>>>                kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, env-
>>>> poll_control_msr);
>>>            }
>>>    -        if (has_pmu_version > 0) {
>>> +        if (IS_INTEL_CPU(env) && has_pmu_version > 0) {
>>>                if (has_pmu_version > 1) {
>>>                    /* Stop the counter.  */
>>>                    kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>>> @@ -4058,6 +4132,38 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>>>                                      env->msr_global_ctrl);
>>>                }
>>>            }
>>> +
>>> +        if (IS_AMD_CPU(env) && has_pmu_version > 0) {
>>> +            uint32_t sel_base = MSR_K7_EVNTSEL0;
>>> +            uint32_t ctr_base = MSR_K7_PERFCTR0;
>>> +            /*
>>> +             * The address of the next selector or counter register is
>>> +             * obtained by incrementing the address of the current selector
>>> +             * or counter register by one.
>>> +             */
>>> +            uint32_t step = 1;
>>> +
>>> +            /*
>>> +             * When PERFCORE is enabled, AMD PMU uses a separate set of
>>> +             * addresses for the selector and counter registers.
>>> +             * Additionally, the address of the next selector or counter
>>> +             * register is determined by incrementing the address of the
>>> +             * current register by two.
>>> +             */
>>> +            if (num_pmu_gp_counters == AMD64_NUM_COUNTERS_CORE) {
>>> +                sel_base = MSR_F15H_PERF_CTL0;
>>> +                ctr_base = MSR_F15H_PERF_CTR0;
>>> +                step = 2;
>>> +            }
>>> +
>>> +            for (i = 0; i < num_pmu_gp_counters; i++) {
>>> +                kvm_msr_entry_add(cpu, ctr_base + i * step,
>>> +                                  env->msr_gp_counters[i]);
>>> +                kvm_msr_entry_add(cpu, sel_base + i * step,
>>> +                                  env->msr_gp_evtsel[i]);
>>> +            }
>>> +        }
>>> +
>>>            /*
>>>             * Hyper-V partition-wide MSRs: to avoid clearing them on cpu hot-add,
>>>             * only sync them to KVM on the first cpu
>>> @@ -4503,7 +4609,8 @@ static int kvm_get_msrs(X86CPU *cpu)
>>>        if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_POLL_CONTROL)) {
>>>            kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, 1);
>>>        }
>>> -    if (has_pmu_version > 0) {
>>> +
>>> +    if (IS_INTEL_CPU(env) && has_pmu_version > 0) {
>>>            if (has_pmu_version > 1) {
>>>                kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>>>                kvm_msr_entry_add(cpu, MSR_CORE_PERF_GLOBAL_CTRL, 0);
>>> @@ -4519,6 +4626,35 @@ static int kvm_get_msrs(X86CPU *cpu)
>>>            }
>>>        }
>>>    +    if (IS_AMD_CPU(env) && has_pmu_version > 0) {
>>> +        uint32_t sel_base = MSR_K7_EVNTSEL0;
>>> +        uint32_t ctr_base = MSR_K7_PERFCTR0;
>>> +        /*
>>> +         * The address of the next selector or counter register is
>>> +         * obtained by incrementing the address of the current selector
>>> +         * or counter register by one.
>>> +         */
>>> +        uint32_t step = 1;
>>> +
>>> +        /*
>>> +         * When PERFCORE is enabled, AMD PMU uses a separate set of
>>> +         * addresses for the selector and counter registers.
>>> +         * Additionally, the address of the next selector or counter
>>> +         * register is determined by incrementing the address of the
>>> +         * current register by two.
>>> +         */
>>> +        if (num_pmu_gp_counters == AMD64_NUM_COUNTERS_CORE) {
>>> +            sel_base = MSR_F15H_PERF_CTL0;
>>> +            ctr_base = MSR_F15H_PERF_CTR0;
>>> +            step = 2;
>>> +        }
>>> +
>>> +        for (i = 0; i < num_pmu_gp_counters; i++) {
>>> +            kvm_msr_entry_add(cpu, ctr_base + i * step, 0);
>>> +            kvm_msr_entry_add(cpu, sel_base + i * step, 0);
>>> +        }
>>> +    }
>>> +
>>>        if (env->mcg_cap) {
>>>            kvm_msr_entry_add(cpu, MSR_MCG_STATUS, 0);
>>>            kvm_msr_entry_add(cpu, MSR_MCG_CTL, 0);
>>> @@ -4830,6 +4966,20 @@ static int kvm_get_msrs(X86CPU *cpu)
>>>            case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL0 + MAX_GP_COUNTERS - 1:
>>>                env->msr_gp_evtsel[index - MSR_P6_EVNTSEL0] = msrs[i].data;
>>>                break;
>>> +        case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL0 + 3:
>>> +            env->msr_gp_evtsel[index - MSR_K7_EVNTSEL0] = msrs[i].data;
>>> +            break;
>>> +        case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR0 + 3:
>>> +            env->msr_gp_counters[index - MSR_K7_PERFCTR0] = msrs[i].data;
>>> +            break;
>>> +        case MSR_F15H_PERF_CTL0 ... MSR_F15H_PERF_CTL0 + 0xb:
>>> +            index = index - MSR_F15H_PERF_CTL0;
>>> +            if (index & 0x1) {
>>> +                env->msr_gp_counters[index] = msrs[i].data;
>>> +            } else {
>>> +                env->msr_gp_evtsel[index] = msrs[i].data;
>>> +            }
>>> +            break;
>>>            case HV_X64_MSR_HYPERCALL:
>>>                env->msr_hv_hypercall = msrs[i].data;
>>>                break;
>>
> 

-- 
Best regards,
Maksim Davydov


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 6/7] target/i386/kvm: support perfmon-v2 for reset
  2024-11-08 13:09   ` Sandipan Das
@ 2024-11-08 16:55     ` dongli.zhang
  0 siblings, 0 replies; 32+ messages in thread
From: dongli.zhang @ 2024-11-08 16:55 UTC (permalink / raw)
  To: Sandipan Das
  Cc: qemu-devel, kvm, pbonzini, mtosatti, babu.moger, zhao1.liu,
	likexu, like.xu.linux, zhenyuw, groug, lyan, khorenko,
	alexander.ivanov, den, joe.jin, davydov-max

Hi Sandipan,

On 11/8/24 5:09 AM, Sandipan Das wrote:
> On 11/4/2024 3:10 PM, Dongli Zhang wrote:

[snip]

>> +             * separate set of addresses for the selector and counter
>> +             * registers. Additionally, the address of the next selector or
>> +             * counter register is determined by incrementing the address
>> +             * of the current register by two.
>>               */
>> -            if (num_pmu_gp_counters == AMD64_NUM_COUNTERS_CORE) {
>> +            if (num_pmu_gp_counters == AMD64_NUM_COUNTERS_CORE ||
>> +                has_pmu_version == 2) {
> 
> Future PMU versions are expected to be backwards compatible. So it may be
> better to look for has_pmu_version > 1.
> 

Sure. I will change that in v2. Thank you very much!

Dongli Zhang


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/7] target/i386/kvm: reset AMD PMU registers during VM reset
  2024-11-08 14:07       ` Maksim Davydov
@ 2024-11-08 18:04         ` dongli.zhang
  0 siblings, 0 replies; 32+ messages in thread
From: dongli.zhang @ 2024-11-08 18:04 UTC (permalink / raw)
  To: Maksim Davydov
  Cc: pbonzini, mtosatti, sandipan.das, babu.moger, zhao1.liu, likexu,
	like.xu.linux, zhenyuw, groug, lyan, khorenko, alexander.ivanov,
	den, joe.jin, qemu-devel, kvm

Hi Maksim,

On 11/8/24 6:07 AM, Maksim Davydov wrote:
> 
> 

[snip]

>>>> +
>>>> +    num_pmu_gp_counters = AMD64_NUM_COUNTERS_CORE;
>>>> +}
>>>
>>> It seems that AMD implementation has one issue.
>>> KVM has parameter `enable_pmu`. So vPMU can be disabled in another way, not only
>>> via KVM_PMU_CAP_DISABLE. For Intel it's not a problem, because the vPMU
>>> initialization uses info from KVM_GET_SUPPORTED_CPUID. The enable_pmu state is
>>> reflected in KVM_GET_SUPPORTED_CPUID.  Thus no PMU MSRs in kvm_put_msrs/
>>> kvm_get_msrs will be used.
>>>
>>> But on AMD we don't use information from KVM_GET_SUPPORTED_CPUID to set an
>>> appropriate number of PMU registers. So, if vPMU is disabled by KVM parameter
>>> `enable_pmu` and pmu-cap-disable=false, then has_pmu_version will be 1 after
>>> kvm_init_pmu_info_amd execution. It means that in kvm_put_msrs/kvm_get_msrs 4
>>> PMU counters will be processed, but the correct behavior in that situation is to
>>> skip all PMU registers.
>>> I think we should get info from KVM to fix that.
>>>
>>> I tested this series on Zen2 and found that PMU MSRs were still processed during
>>> initialization even with enable_pmu=N. But it doesn't lead to any errors in QEMU
>>
>> Thank you very much for the feedback and helping catch the bug!
>>
>> When enable_pmu=N, the QEMU (with this patchset) cannot tell if vPMU is
>> supported via KVM_CAP_PMU_CAPABILITY.
>>
>> As it cannot disable the PMU, it falls to the legacy 4 counters.
>>
>> It falls to 4 counters because KVM disableds PERFCORE on enable_pmu=Y, i.e.,
>>
>> 5220         if (enable_pmu) {
>> 5221                 /*
>> 5222                  * Enumerate support for PERFCTR_CORE if and only if KVM has
>> 5223                  * access to enough counters to virtualize "core" support,
>> 5224                  * otherwise limit vPMU support to the legacy number of
>> counters.
>> 5225                  */
>> 5226                 if (kvm_pmu_cap.num_counters_gp < AMD64_NUM_COUNTERS_CORE)
>> 5227                         kvm_pmu_cap.num_counters_gp =
>> min(AMD64_NUM_COUNTERS,
>> 5228
>> kvm_pmu_cap.num_counters_gp);
>> 5229                 else
>> 5230                         kvm_cpu_cap_check_and_set(X86_FEATURE_PERFCTR_CORE);
>> 5231
>> 5232                 if (kvm_pmu_cap.version != 2 ||
>> 5233                     !kvm_cpu_cap_has(X86_FEATURE_PERFCTR_CORE))
>> 5234                         kvm_cpu_cap_clear(X86_FEATURE_PERFMON_V2);
>> 5235         }
>>
>>
>> During the bootup and reset, the QEMU (with this patchset) erroneously resets
>> MSRs for the 4 PMCs, via line 3827.
>>
>> 3825 static int kvm_buf_set_msrs(X86CPU *cpu)
>> 3826 {
>> 3827     int ret = kvm_vcpu_ioctl(CPU(cpu), KVM_SET_MSRS, cpu->kvm_msr_buf);
>> 3828     if (ret < 0) {
>> 3829         return ret;
>> 3830     }
>> 3831
>> 3832     if (ret < cpu->kvm_msr_buf->nmsrs) {
>> 3833         struct kvm_msr_entry *e = &cpu->kvm_msr_buf->entries[ret];
>> 3834         error_report("error: failed to set MSR 0x%" PRIx32 " to 0x%" PRIx64,
>> 3835                      (uint32_t)e->index, (uint64_t)e->data);
>> 3836     }
>> 3837
>> 3838     assert(ret == cpu->kvm_msr_buf->nmsrs);
>> 3839     return 0;
>> 3840 }
>>
>> Because enable_pmu=N, the KVM doesn't support those registers. However, it
>> returns 0 (not 1), because the KVM does nothing in the implicit else (i.e., line
>> 4144).
>>
>> 3847 int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>> 3848 {
>> ... ...
>> 4138         case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR3:
>> 4139         case MSR_P6_PERFCTR0 ... MSR_P6_PERFCTR1:
>> 4140         case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL3:
>> 4141         case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL1:
>> 4142                 if (kvm_pmu_is_valid_msr(vcpu, msr))
>> 4143                         return kvm_pmu_set_msr(vcpu, msr_info);
>> 4144
>> 4145                 if (data)
>> 4146                         kvm_pr_unimpl_wrmsr(vcpu, msr, data);
>> 4147                 break;
>> ... ...
>> 4224         default:
>> 4225                 if (kvm_pmu_is_valid_msr(vcpu, msr))
>> 4226                         return kvm_pmu_set_msr(vcpu, msr_info);
>> 4227
>> 4228                 /*
>> 4229                  * Userspace is allowed to write '0' to MSRs that KVM
>> reports
>> 4230                  * as to-be-saved, even if an MSRs isn't fully supported.
>> 4231                  */
>> 4232                 if (msr_info->host_initiated && !data &&
>> 4233                     kvm_is_msr_to_save(msr))
>> 4234                         break;
>> 4235
>> 4236                 return KVM_MSR_RET_INVALID;
>> 4237         }
>> 4238         return 0;
>> 4239 }
>> 4240 EXPORT_SYMBOL_GPL(kvm_set_msr_common);
>>
>> Fortunately, it returns 0 at line 4238. No error is detected by QEMU.
>>
>> Perhaps I may need to send message with a small patch to return 1 in the
>> implicit 'else' to kvm mailing list to confirm if that is expected.
>>
>> However, the answer is very likely 'expected', because line 4229 to line 4230
>> already explain it.
>>
> 
> Sorry for confusing you. My fault.
> I tested the previous series on Intel with the old kernel and QEMU failed with
> `error: failed to set MSR 0x38d to 0x0`. So I expected the same error.
> But as I can see, AMD PMU registers are processed differently than the Intel
> ones. Also the default MSR behavior in KVM has been changed since 2de154f541fc

I think the AMD PMU registers are treated equally with Intel ones as by the
commit 2de154f541fc. Both Intel and AMD PMU registers are in msrs_to_save_pmu[].

The objective was "to avoid spurious unsupported accesses".

> 
> I think that the current implementation with additional parameter pmu-cap-
> disabled does what we expect. The guest will see disabled PMU in the same two
> configurations:
> * pmu-cap-disabled=true and enabled_pmu=N
> * pmu-cap-disabled=true and enabled_pmu=Y
> But in QEMU these two configurations will have different states (has_pmu_version
> 1 and 0 respectively). I think it should be taken into account in the

Although the unsupported MSR write doesn't trigger any issue (thanks to
msrs_to_save_pmu[]), I agree this is the bug that I will address in v2.

Thanks to the reminder, indeed I have noticed another issue to be addressed in
v2: something unexpected may happen if we migrate from old KVM to new KVM
(assuming same QEMU versions).

Suppose one user never notice "-pmu" doesn't work on old AMD KVM, but still add
"-pmu" to QEMU command line.

old AMD KVM: "-pmu" doesn't take effect, due to the lack of KVM_CAP_PMU_CAPABILITY.

new AMD KVM: "-pmu" takes effect.

After the migration, the vPMU won't work any longer from guest's perspective.

> implementation without pmu-cap-disabled (which was suggested before) to save
> guest-visible state during migration.

Yes, I am going to revert back to my previous solution with "-pmu".

Thanks everyone's suggestion on "-pmu" vs. "pmu-cap-disabled". To finalize the
decision helps move forward.


Would you mind clarify "without pmu-cap-disabled (which was suggested before) to
save guest-visible state during migration."?

Would you mean the compatibility issue between old QEMU version (without
"pmu-cap-disabled") and new QEMU version (with "pmu-cap-disabled")?

Thank you very much!

Dongli Zhang


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/7] target/i386/kvm: init PMU information only once
  2024-11-04  9:40 ` [PATCH 3/7] target/i386/kvm: init PMU information only once Dongli Zhang
@ 2024-11-10 15:29   ` Zhao Liu
  2024-11-13  1:50     ` dongli.zhang
  0 siblings, 1 reply; 32+ messages in thread
From: Zhao Liu @ 2024-11-10 15:29 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: qemu-devel, kvm, pbonzini, mtosatti, sandipan.das, babu.moger,
	likexu, like.xu.linux, zhenyuw, groug, lyan, khorenko,
	alexander.ivanov, den, joe.jin, davydov-max

Hi Dongli,

>  int kvm_arch_init_vcpu(CPUState *cs)
>  {
>      struct {
> @@ -2237,6 +2247,13 @@ int kvm_arch_init_vcpu(CPUState *cs)
>      cpuid_i = kvm_x86_build_cpuid(env, cpuid_data.entries, cpuid_i);
>      cpuid_data.cpuid.nent = cpuid_i;
>  
> +    /*
> +     * Initialize PMU information only once for the first vCPU.
> +     */
> +    if (cs == first_cpu) {
> +        kvm_init_pmu_info(env);
> +    }
> +

Thank you for the optimization. However, I think it’s not necessary
because:

* This is not a hot path and not a performance bottleneck.
* Many CPUID leaves are consistent across CPUs, and 0xA is just one of them.
* And encoding them all in kvm_x86_build_cpuid() is a common pattern.
  Separating out 0xa disrupts code readability and fragments the CPUID encoding.

Therefore, code maintainability and correctness might be more important here,
than performance concern.

>      if (((env->cpuid_version >> 8)&0xF) >= 6
>          && (env->features[FEAT_1_EDX] & (CPUID_MCE | CPUID_MCA)) ==
>             (CPUID_MCE | CPUID_MCA)) {
> -- 
> 2.39.3
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/7] target/i386/kvm: init PMU information only once
  2024-11-10 15:29   ` Zhao Liu
@ 2024-11-13  1:50     ` dongli.zhang
  2024-11-13 16:48       ` Zhao Liu
  0 siblings, 1 reply; 32+ messages in thread
From: dongli.zhang @ 2024-11-13  1:50 UTC (permalink / raw)
  To: Zhao Liu
  Cc: qemu-devel, kvm, pbonzini, mtosatti, sandipan.das, babu.moger,
	likexu, like.xu.linux, zhenyuw, groug, lyan, khorenko,
	alexander.ivanov, den, joe.jin, davydov-max

Hi Zhao,

On 11/10/24 7:29 AM, Zhao Liu wrote:
> Hi Dongli,
> 
>>  int kvm_arch_init_vcpu(CPUState *cs)
>>  {
>>      struct {
>> @@ -2237,6 +2247,13 @@ int kvm_arch_init_vcpu(CPUState *cs)
>>      cpuid_i = kvm_x86_build_cpuid(env, cpuid_data.entries, cpuid_i);
>>      cpuid_data.cpuid.nent = cpuid_i;
>>  
>> +    /*
>> +     * Initialize PMU information only once for the first vCPU.
>> +     */
>> +    if (cs == first_cpu) {
>> +        kvm_init_pmu_info(env);
>> +    }
>> +
> 
> Thank you for the optimization. However, I think it’s not necessary
> because:
> 
> * This is not a hot path and not a performance bottleneck.
> * Many CPUID leaves are consistent across CPUs, and 0xA is just one of them.
> * And encoding them all in kvm_x86_build_cpuid() is a common pattern.
>   Separating out 0xa disrupts code readability and fragments the CPUID encoding.
> 
> Therefore, code maintainability and correctness might be more important here,
> than performance concern.

I am going to remove this patch in v2.

Just a reminder, we may have more code in this function by other patches,
including the initialization of both Intel and AMD PMU infortmation (PerfMonV2).

Thank you very much!

Dongli Zhang

> 
>>      if (((env->cpuid_version >> 8)&0xF) >= 6
>>          && (env->features[FEAT_1_EDX] & (CPUID_MCE | CPUID_MCA)) ==
>>             (CPUID_MCE | CPUID_MCA)) {
>> -- 
>> 2.39.3
>>



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/7] target/i386/kvm: init PMU information only once
  2024-11-13  1:50     ` dongli.zhang
@ 2024-11-13 16:48       ` Zhao Liu
  0 siblings, 0 replies; 32+ messages in thread
From: Zhao Liu @ 2024-11-13 16:48 UTC (permalink / raw)
  To: dongli.zhang
  Cc: qemu-devel, kvm, pbonzini, mtosatti, sandipan.das, babu.moger,
	likexu, like.xu.linux, zhenyuw, groug, lyan, khorenko,
	alexander.ivanov, den, joe.jin, davydov-max

On Tue, Nov 12, 2024 at 05:50:26PM -0800, dongli.zhang@oracle.com wrote:
> Date: Tue, 12 Nov 2024 17:50:26 -0800
> From: dongli.zhang@oracle.com
> Subject: Re: [PATCH 3/7] target/i386/kvm: init PMU information only once
> 
> Hi Zhao,
> 
> On 11/10/24 7:29 AM, Zhao Liu wrote:
> > Hi Dongli,
> > 
> >>  int kvm_arch_init_vcpu(CPUState *cs)
> >>  {
> >>      struct {
> >> @@ -2237,6 +2247,13 @@ int kvm_arch_init_vcpu(CPUState *cs)
> >>      cpuid_i = kvm_x86_build_cpuid(env, cpuid_data.entries, cpuid_i);
> >>      cpuid_data.cpuid.nent = cpuid_i;
> >>  
> >> +    /*
> >> +     * Initialize PMU information only once for the first vCPU.
> >> +     */
> >> +    if (cs == first_cpu) {
> >> +        kvm_init_pmu_info(env);
> >> +    }
> >> +
> > 
> > Thank you for the optimization. However, I think it’s not necessary
> > because:
> > 
> > * This is not a hot path and not a performance bottleneck.
> > * Many CPUID leaves are consistent across CPUs, and 0xA is just one of them.
> > * And encoding them all in kvm_x86_build_cpuid() is a common pattern.
> >   Separating out 0xa disrupts code readability and fragments the CPUID encoding.
> > 
> > Therefore, code maintainability and correctness might be more important here,
> > than performance concern.
> 
> I am going to remove this patch in v2.
> 
> Just a reminder, we may have more code in this function by other patches,
> including the initialization of both Intel and AMD PMU infortmation (PerfMonV2).

Yes, I mean we can just remove "first_cpu" check and move this function
into kvm_x86_build_cpuid() again. Your function wrapping is fine for me.

> Dongli Zhang
> 
> > 
> >>      if (((env->cpuid_version >> 8)&0xF) >= 6
> >>          && (env->features[FEAT_1_EDX] & (CPUID_MCE | CPUID_MCA)) ==
> >>             (CPUID_MCE | CPUID_MCA)) {
> >> -- 
> >> 2.39.3
> >>
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/7] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE
  2024-11-07 23:44     ` dongli.zhang
  2024-11-08  2:32       ` Zhao Liu
  2024-11-08 12:52       ` Sandipan Das
@ 2024-11-13 17:15       ` Zhao Liu
  2024-11-14  0:13         ` dongli.zhang
  2024-11-21 10:06       ` Mi, Dapeng
  3 siblings, 1 reply; 32+ messages in thread
From: Zhao Liu @ 2024-11-13 17:15 UTC (permalink / raw)
  To: dongli.zhang
  Cc: qemu-devel, kvm, pbonzini, mtosatti, sandipan.das, babu.moger,
	likexu, like.xu.linux, zhenyuw, groug, lyan, khorenko,
	alexander.ivanov, den, joe.jin, davydov-max, dapeng1.mi,
	zide.chen

> > Further, considering that this is currently the only case that needs to
> > to set the VM level's capability in the CPU context, there is no need to
> > introduce a new kvm interface (in your previous patch), which can instead
> > be set in kvm_cpu_realizefn(), like:
> >

Now your case is not the only user of kvm_arch_pre_create_vcpu(), and
TDX also needs this [*]. So, this is the support for bringing back your
previous solution (preferably with comments, as I suggested earlier,
explaining why it's necessary to handle VM-level cap in the CPU
context). :-)

[*]: https://lore.kernel.org/qemu-devel/20241105062408.3533704-8-xiaoyao.li@intel.com/



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/7] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE
  2024-11-13 17:15       ` Zhao Liu
@ 2024-11-14  0:13         ` dongli.zhang
  0 siblings, 0 replies; 32+ messages in thread
From: dongli.zhang @ 2024-11-14  0:13 UTC (permalink / raw)
  To: Zhao Liu
  Cc: qemu-devel, kvm, pbonzini, mtosatti, sandipan.das, babu.moger,
	likexu, like.xu.linux, zhenyuw, groug, lyan, khorenko,
	alexander.ivanov, den, joe.jin, davydov-max, dapeng1.mi,
	zide.chen



On 11/13/24 9:15 AM, Zhao Liu wrote:
>>> Further, considering that this is currently the only case that needs to
>>> to set the VM level's capability in the CPU context, there is no need to
>>> introduce a new kvm interface (in your previous patch), which can instead
>>> be set in kvm_cpu_realizefn(), like:
>>>
> 
> Now your case is not the only user of kvm_arch_pre_create_vcpu(), and
> TDX also needs this [*]. So, this is the support for bringing back your
> previous solution (preferably with comments, as I suggested earlier,
> explaining why it's necessary to handle VM-level cap in the CPU
> context). :-)
> 
> [*]: https://urldefense.com/v3/__https://lore.kernel.org/qemu-devel/20241105062408.3533704-8-xiaoyao.li@intel.com/__;!!ACWV5N9M2RV99hQ!PGJQ9Kv-1mo_4YX1fb4N9YvZZqHInebtCNcSniRi3qJNPIfG5zZnDp1b1gVmO-DdpYxMvO1hlH9owAHV5UMT$ 
> 

Thank you very much for the reminder!

Dongli Zhang


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/7] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE
  2024-11-07 23:44     ` dongli.zhang
                         ` (2 preceding siblings ...)
  2024-11-13 17:15       ` Zhao Liu
@ 2024-11-21 10:06       ` Mi, Dapeng
  2025-02-07  9:52         ` Mi, Dapeng
  3 siblings, 1 reply; 32+ messages in thread
From: Mi, Dapeng @ 2024-11-21 10:06 UTC (permalink / raw)
  To: dongli.zhang, Zhao Liu
  Cc: qemu-devel, kvm, pbonzini, mtosatti, sandipan.das, babu.moger,
	likexu, like.xu.linux, zhenyuw, groug, lyan, khorenko,
	alexander.ivanov, den, joe.jin, davydov-max, zide.chen


On 11/8/2024 7:44 AM, dongli.zhang@oracle.com wrote:
> Hi Zhao,
>
>
> On 11/6/24 11:52 PM, Zhao Liu wrote:
>> (+Dapang & Zide)
>>
>> Hi Dongli,
>>
>> On Mon, Nov 04, 2024 at 01:40:17AM -0800, Dongli Zhang wrote:
>>> Date: Mon,  4 Nov 2024 01:40:17 -0800
>>> From: Dongli Zhang <dongli.zhang@oracle.com>
>>> Subject: [PATCH 2/7] target/i386/kvm: introduce 'pmu-cap-disabled' to set
>>>  KVM_PMU_CAP_DISABLE
>>> X-Mailer: git-send-email 2.43.5
>>>
>>> The AMD PMU virtualization is not disabled when configuring
>>> "-cpu host,-pmu" in the QEMU command line on an AMD server. Neither
>>> "-cpu host,-pmu" nor "-cpu EPYC" effectively disables AMD PMU
>>> virtualization in such an environment.
>>>
>>> As a result, VM logs typically show:
>>>
>>> [    0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
>>>
>>> whereas the expected logs should be:
>>>
>>> [    0.596381] Performance Events: PMU not available due to virtualization, using software events only.
>>> [    0.600972] NMI watchdog: Perf NMI watchdog permanently disabled
>>>
>>> This discrepancy occurs because AMD PMU does not use CPUID to determine
>>> whether PMU virtualization is supported.
>> Intel platform doesn't have this issue since Linux kernel fails to check
>> the CPU family & model when "-cpu *,-pmu" option clears PMU version.
>>
>> The difference between Intel and AMD platforms, however, is that it seems
>> Intel hardly ever reaches the “...due virtualization” message, but
>> instead reports an error because it recognizes a mismatched family/model.
>>
>> This may be a drawback of the PMU driver's print message, but the result
>> is the same, it prevents the PMU driver from enabling.
>>
>> So, please mention that KVM_PMU_CAP_DISABLE doesn't change the PMU
>> behavior on Intel platform because current "pmu" property works as
>> expected.
> Sure. I will mention this in v2.
>
>>> To address this, we introduce a new property, 'pmu-cap-disabled', for KVM
>>> acceleration. This property sets KVM_PMU_CAP_DISABLE if
>>> KVM_CAP_PMU_CAPABILITY is supported. Note that this feature currently
>>> supports only x86 hosts, as KVM_CAP_PMU_CAPABILITY is used exclusively for
>>> x86 systems.
>>>
>>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
>>> ---
>>> Another previous solution to re-use '-cpu host,-pmu':
>>> https://urldefense.com/v3/__https://lore.kernel.org/all/20221119122901.2469-1-dongli.zhang@oracle.com/__;!!ACWV5N9M2RV99hQ!Nm8Db-mwBoMIwKkRqzC9kgNi5uZ7SCIf43zUBn92Ar_NEbLXq-ZkrDDvpvDQ4cnS2i4VyKAp6CRVE12bRkMF$ 
>> IMO, I prefer the previous version. This VM-level KVM property is
>> difficult to integrate with the existing CPU properties. Pls refer later
>> comments for reasons.
>>
>>>  accel/kvm/kvm-all.c        |  1 +
>>>  include/sysemu/kvm_int.h   |  1 +
>>>  qemu-options.hx            |  9 ++++++-
>>>  target/i386/cpu.c          |  2 +-
>>>  target/i386/kvm/kvm.c      | 52 ++++++++++++++++++++++++++++++++++++++
>>>  target/i386/kvm/kvm_i386.h |  2 ++
>>>  6 files changed, 65 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>>> index 801cff16a5..8b5ba45cf7 100644
>>> --- a/accel/kvm/kvm-all.c
>>> +++ b/accel/kvm/kvm-all.c
>>> @@ -3933,6 +3933,7 @@ static void kvm_accel_instance_init(Object *obj)
>>>      s->xen_evtchn_max_pirq = 256;
>>>      s->device = NULL;
>>>      s->msr_energy.enable = false;
>>> +    s->pmu_cap_disabled = false;
>>>  }
>> The CPU property "pmu" also defaults to "false"...but:
>>
>>  * max CPU would override this and try to enable PMU by default in
>>    max_x86_cpu_initfn().
>>
>>  * Other named CPU models keep the default setting to avoid affecting
>>    the migration.
>>
>> The pmu_cap_disabled and “pmu” property look unbound and unassociated,
>> so this can cause the conflict when they are not synchronized. For
>> example,
>>
>> -cpu host -accel kvm,pmu-cap-disabled=on
>>
>> The above options will fail to launch a VM (on Intel platform).
>>
>> Ideally, the “pmu” property and pmu-cap-disabled should be bound to each
>> other and be consistent. But it's not easy because:
>>  - There is no proper way to have pmu_cap_disabled set different default
>>    values (e.g., "false" for max CPU and "true" for named CPU models)
>>    based on different CPU models.
>>  - And, no proper place to check the consistency of pmu_cap_disabled and
>>    enable_pmu.
>>
>> Therefore, I prefer your previous approach, to reuse current CPU "pmu"
>> property.
> Thank you very much for the suggestion and reasons.
>
> I am going to follow your suggestion to switch back to the previous solution in v2.

+1.

 I also prefer to leverage current exist "+/-pmu" option instead of adding
a new option. More options, more complexity. When they are not
inconsistent, which has higher priority? all these are issues.

Although KVM_CAP_PMU_CAPABILITY is a VM-level PMU capability, but all CPUs
in a same VM should always share same PMU configuration (Don't consider
hybrid platforms which have many issues need to be handled specifically).


>
>> Further, considering that this is currently the only case that needs to
>> to set the VM level's capability in the CPU context, there is no need to
>> introduce a new kvm interface (in your previous patch), which can instead
>> be set in kvm_cpu_realizefn(), like:
>>
>>
>> diff --git a/target/i386/kvm/kvm-cpu.c b/target/i386/kvm/kvm-cpu.c
>> index 99d1941cf51c..05e9c9a1a0cf 100644
>> --- a/target/i386/kvm/kvm-cpu.c
>> +++ b/target/i386/kvm/kvm-cpu.c
>> @@ -42,6 +42,8 @@ static bool kvm_cpu_realizefn(CPUState *cs, Error **errp)
>>  {
>>      X86CPU *cpu = X86_CPU(cs);
>>      CPUX86State *env = &cpu->env;
>> +    KVMState *s = kvm_state;
>> +    static bool first = true;
>>      bool ret;
>>
>>      /*
>> @@ -63,6 +65,29 @@ static bool kvm_cpu_realizefn(CPUState *cs, Error **errp)
>>       *   check/update ucode_rev, phys_bits, guest_phys_bits, mwait
>>       *   cpu_common_realizefn() (via xcc->parent_realize)
>>       */
>> +
>> +    if (first) {
>> +        first = false;
>> +
>> +        /*
>> +         * Since Linux v5.18, KVM provides a VM-level capability to easily
>> +         * disable PMUs; however, QEMU has been providing PMU property per
>> +         * CPU since v1.6. In order to accommodate both, have to configure
>> +         * the VM-level capability here.
>> +         */
>> +        if (!cpu->enable_pmu &&
>> +            kvm_check_extension(s, KVM_CAP_PMU_CAPABILITY)) {
>> +            int r = kvm_vm_enable_cap(s, KVM_CAP_PMU_CAPABILITY, 0,
>> +                                      KVM_PMU_CAP_DISABLE);
>> +
>> +            if (r < 0) {
>> +                error_setg(errp, "kvm: Failed to disable pmu cap: %s",
>> +                           strerror(-r));
>> +                return false;
>> +            }
>> +        }

It seems KVM_CAP_PMU_CAPABILITY is called to only disable PMU here. From
point view of logic completeness,  KVM_CAP_PMU_CAPABILITY should be called
to enabled PMU as well when user wants to enable PMU.

I know currently we only need to disable PMU, but we may need to enable PMU
via KVM_CAP_PMU_CAPABILITY soon.

We are working on the new KVM mediated vPMU framework, Sean suggest to
leverage KVM_CAP_PMU_CAPABILITY to enable mediated vPMU dynamically
(https://lore.kernel.org/all/Zz4uhmuPcZl9vJVr@google.com/). So It would be
better if the enable logic can be added here as well.

Thanks.


>> +    }
>> +
>>      if (cpu->max_features) {
>>          if (enable_cpu_pm) {
>>              if (kvm_has_waitpkg()) {
>> ---
> Sure. I will limit the change within only x86 + KVM.
>
>> In addition, if PMU is disabled, why not mask the perf related bits in
>> 8000_0001_ECX? :)
>>
> My fault. I have masked only 0x80000022, and I forgot 0x80000001 for AMD.
>
> Thank you very much for the reminder.
>
>
> I will wait for a day or maybe the weekend. I am going to switch to the previous
> solution in v2 if there isn't any further objection with a more valid reason.
>
> Thank you very much for the feedback!
>
> Dongli Zhang
>
>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/7] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE
  2024-11-21 10:06       ` Mi, Dapeng
@ 2025-02-07  9:52         ` Mi, Dapeng
  2025-02-09 20:12           ` dongli.zhang
  0 siblings, 1 reply; 32+ messages in thread
From: Mi, Dapeng @ 2025-02-07  9:52 UTC (permalink / raw)
  To: dongli.zhang, Zhao Liu
  Cc: qemu-devel, kvm, pbonzini, mtosatti, sandipan.das, babu.moger,
	likexu, like.xu.linux, zhenyuw, groug, lyan, khorenko,
	alexander.ivanov, den, joe.jin, davydov-max, zide.chen


On 11/21/2024 6:06 PM, Mi, Dapeng wrote:
> On 11/8/2024 7:44 AM, dongli.zhang@oracle.com wrote:
>> Hi Zhao,
>>
>>
>> On 11/6/24 11:52 PM, Zhao Liu wrote:
>>> (+Dapang & Zide)
>>>
>>> Hi Dongli,
>>>
>>> On Mon, Nov 04, 2024 at 01:40:17AM -0800, Dongli Zhang wrote:
>>>> Date: Mon,  4 Nov 2024 01:40:17 -0800
>>>> From: Dongli Zhang <dongli.zhang@oracle.com>
>>>> Subject: [PATCH 2/7] target/i386/kvm: introduce 'pmu-cap-disabled' to set
>>>>  KVM_PMU_CAP_DISABLE
>>>> X-Mailer: git-send-email 2.43.5
>>>>
>>>> The AMD PMU virtualization is not disabled when configuring
>>>> "-cpu host,-pmu" in the QEMU command line on an AMD server. Neither
>>>> "-cpu host,-pmu" nor "-cpu EPYC" effectively disables AMD PMU
>>>> virtualization in such an environment.
>>>>
>>>> As a result, VM logs typically show:
>>>>
>>>> [    0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
>>>>
>>>> whereas the expected logs should be:
>>>>
>>>> [    0.596381] Performance Events: PMU not available due to virtualization, using software events only.
>>>> [    0.600972] NMI watchdog: Perf NMI watchdog permanently disabled
>>>>
>>>> This discrepancy occurs because AMD PMU does not use CPUID to determine
>>>> whether PMU virtualization is supported.
>>> Intel platform doesn't have this issue since Linux kernel fails to check
>>> the CPU family & model when "-cpu *,-pmu" option clears PMU version.
>>>
>>> The difference between Intel and AMD platforms, however, is that it seems
>>> Intel hardly ever reaches the “...due virtualization” message, but
>>> instead reports an error because it recognizes a mismatched family/model.
>>>
>>> This may be a drawback of the PMU driver's print message, but the result
>>> is the same, it prevents the PMU driver from enabling.
>>>
>>> So, please mention that KVM_PMU_CAP_DISABLE doesn't change the PMU
>>> behavior on Intel platform because current "pmu" property works as
>>> expected.
>> Sure. I will mention this in v2.
>>
>>>> To address this, we introduce a new property, 'pmu-cap-disabled', for KVM
>>>> acceleration. This property sets KVM_PMU_CAP_DISABLE if
>>>> KVM_CAP_PMU_CAPABILITY is supported. Note that this feature currently
>>>> supports only x86 hosts, as KVM_CAP_PMU_CAPABILITY is used exclusively for
>>>> x86 systems.
>>>>
>>>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
>>>> ---
>>>> Another previous solution to re-use '-cpu host,-pmu':
>>>> https://urldefense.com/v3/__https://lore.kernel.org/all/20221119122901.2469-1-dongli.zhang@oracle.com/__;!!ACWV5N9M2RV99hQ!Nm8Db-mwBoMIwKkRqzC9kgNi5uZ7SCIf43zUBn92Ar_NEbLXq-ZkrDDvpvDQ4cnS2i4VyKAp6CRVE12bRkMF$ 
>>> IMO, I prefer the previous version. This VM-level KVM property is
>>> difficult to integrate with the existing CPU properties. Pls refer later
>>> comments for reasons.
>>>
>>>>  accel/kvm/kvm-all.c        |  1 +
>>>>  include/sysemu/kvm_int.h   |  1 +
>>>>  qemu-options.hx            |  9 ++++++-
>>>>  target/i386/cpu.c          |  2 +-
>>>>  target/i386/kvm/kvm.c      | 52 ++++++++++++++++++++++++++++++++++++++
>>>>  target/i386/kvm/kvm_i386.h |  2 ++
>>>>  6 files changed, 65 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>>>> index 801cff16a5..8b5ba45cf7 100644
>>>> --- a/accel/kvm/kvm-all.c
>>>> +++ b/accel/kvm/kvm-all.c
>>>> @@ -3933,6 +3933,7 @@ static void kvm_accel_instance_init(Object *obj)
>>>>      s->xen_evtchn_max_pirq = 256;
>>>>      s->device = NULL;
>>>>      s->msr_energy.enable = false;
>>>> +    s->pmu_cap_disabled = false;
>>>>  }
>>> The CPU property "pmu" also defaults to "false"...but:
>>>
>>>  * max CPU would override this and try to enable PMU by default in
>>>    max_x86_cpu_initfn().
>>>
>>>  * Other named CPU models keep the default setting to avoid affecting
>>>    the migration.
>>>
>>> The pmu_cap_disabled and “pmu” property look unbound and unassociated,
>>> so this can cause the conflict when they are not synchronized. For
>>> example,
>>>
>>> -cpu host -accel kvm,pmu-cap-disabled=on
>>>
>>> The above options will fail to launch a VM (on Intel platform).
>>>
>>> Ideally, the “pmu” property and pmu-cap-disabled should be bound to each
>>> other and be consistent. But it's not easy because:
>>>  - There is no proper way to have pmu_cap_disabled set different default
>>>    values (e.g., "false" for max CPU and "true" for named CPU models)
>>>    based on different CPU models.
>>>  - And, no proper place to check the consistency of pmu_cap_disabled and
>>>    enable_pmu.
>>>
>>> Therefore, I prefer your previous approach, to reuse current CPU "pmu"
>>> property.
>> Thank you very much for the suggestion and reasons.
>>
>> I am going to follow your suggestion to switch back to the previous solution in v2.
> +1.
>
>  I also prefer to leverage current exist "+/-pmu" option instead of adding
> a new option. More options, more complexity. When they are not
> inconsistent, which has higher priority? all these are issues.
>
> Although KVM_CAP_PMU_CAPABILITY is a VM-level PMU capability, but all CPUs
> in a same VM should always share same PMU configuration (Don't consider
> hybrid platforms which have many issues need to be handled specifically).
>
>
>>> Further, considering that this is currently the only case that needs to
>>> to set the VM level's capability in the CPU context, there is no need to
>>> introduce a new kvm interface (in your previous patch), which can instead
>>> be set in kvm_cpu_realizefn(), like:
>>>
>>>
>>> diff --git a/target/i386/kvm/kvm-cpu.c b/target/i386/kvm/kvm-cpu.c
>>> index 99d1941cf51c..05e9c9a1a0cf 100644
>>> --- a/target/i386/kvm/kvm-cpu.c
>>> +++ b/target/i386/kvm/kvm-cpu.c
>>> @@ -42,6 +42,8 @@ static bool kvm_cpu_realizefn(CPUState *cs, Error **errp)
>>>  {
>>>      X86CPU *cpu = X86_CPU(cs);
>>>      CPUX86State *env = &cpu->env;
>>> +    KVMState *s = kvm_state;
>>> +    static bool first = true;
>>>      bool ret;
>>>
>>>      /*
>>> @@ -63,6 +65,29 @@ static bool kvm_cpu_realizefn(CPUState *cs, Error **errp)
>>>       *   check/update ucode_rev, phys_bits, guest_phys_bits, mwait
>>>       *   cpu_common_realizefn() (via xcc->parent_realize)
>>>       */
>>> +
>>> +    if (first) {
>>> +        first = false;
>>> +
>>> +        /*
>>> +         * Since Linux v5.18, KVM provides a VM-level capability to easily
>>> +         * disable PMUs; however, QEMU has been providing PMU property per
>>> +         * CPU since v1.6. In order to accommodate both, have to configure
>>> +         * the VM-level capability here.
>>> +         */
>>> +        if (!cpu->enable_pmu &&
>>> +            kvm_check_extension(s, KVM_CAP_PMU_CAPABILITY)) {
>>> +            int r = kvm_vm_enable_cap(s, KVM_CAP_PMU_CAPABILITY, 0,
>>> +                                      KVM_PMU_CAP_DISABLE);
>>> +
>>> +            if (r < 0) {
>>> +                error_setg(errp, "kvm: Failed to disable pmu cap: %s",
>>> +                           strerror(-r));
>>> +                return false;
>>> +            }
>>> +        }
> It seems KVM_CAP_PMU_CAPABILITY is called to only disable PMU here. From
> point view of logic completeness,  KVM_CAP_PMU_CAPABILITY should be called
> to enabled PMU as well when user wants to enable PMU.
>
> I know currently we only need to disable PMU, but we may need to enable PMU
> via KVM_CAP_PMU_CAPABILITY soon.
>
> We are working on the new KVM mediated vPMU framework, Sean suggest to
> leverage KVM_CAP_PMU_CAPABILITY to enable mediated vPMU dynamically
> (https://lore.kernel.org/all/Zz4uhmuPcZl9vJVr@google.com/). So It would be
> better if the enable logic can be added here as well.
>
> Thanks.

Hi Dongli,

May I know if you have plan to continue to update this patch recently? As
previous comment said, our KVM mediated vPMU solution needs qemu to
explicitly call KVM_CAP_PMU_CAPABILITY to enable mediated vPMU as well. If
you have no plan to update this patch recently, would you mind me to pick
up this patch
(https://lore.kernel.org/all/20221119122901.2469-2-dongli.zhang@oracle.com/)
and post with other our mediated vPMU related patches to enable mediated vPMU?

Thanks!

Dapeng Mi

>
>
>>> +    }
>>> +
>>>      if (cpu->max_features) {
>>>          if (enable_cpu_pm) {
>>>              if (kvm_has_waitpkg()) {
>>> ---
>> Sure. I will limit the change within only x86 + KVM.
>>
>>> In addition, if PMU is disabled, why not mask the perf related bits in
>>> 8000_0001_ECX? :)
>>>
>> My fault. I have masked only 0x80000022, and I forgot 0x80000001 for AMD.
>>
>> Thank you very much for the reminder.
>>
>>
>> I will wait for a day or maybe the weekend. I am going to switch to the previous
>> solution in v2 if there isn't any further objection with a more valid reason.
>>
>> Thank you very much for the feedback!
>>
>> Dongli Zhang
>>
>>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/7] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE
  2025-02-07  9:52         ` Mi, Dapeng
@ 2025-02-09 20:12           ` dongli.zhang
  2025-02-10  8:04             ` Mi, Dapeng
  0 siblings, 1 reply; 32+ messages in thread
From: dongli.zhang @ 2025-02-09 20:12 UTC (permalink / raw)
  To: Mi, Dapeng, Zhao Liu
  Cc: qemu-devel, kvm, pbonzini, mtosatti, sandipan.das, babu.moger,
	likexu, like.xu.linux, zhenyuw, groug, lyan, khorenko,
	alexander.ivanov, den, joe.jin, davydov-max, zide.chen

Hi Dapeng,

On 2/7/25 1:52 AM, Mi, Dapeng wrote:
> 
> On 11/21/2024 6:06 PM, Mi, Dapeng wrote:
>> On 11/8/2024 7:44 AM, dongli.zhang@oracle.com wrote:
>>> Hi Zhao,
>>>
>>>
>>> On 11/6/24 11:52 PM, Zhao Liu wrote:
>>>> (+Dapang & Zide)
>>>>
>>>> Hi Dongli,
>>>>
>>>> On Mon, Nov 04, 2024 at 01:40:17AM -0800, Dongli Zhang wrote:
>>>>> Date: Mon,  4 Nov 2024 01:40:17 -0800
>>>>> From: Dongli Zhang <dongli.zhang@oracle.com>
>>>>> Subject: [PATCH 2/7] target/i386/kvm: introduce 'pmu-cap-disabled' to set
>>>>>  KVM_PMU_CAP_DISABLE
>>>>> X-Mailer: git-send-email 2.43.5
>>>>>
>>>>> The AMD PMU virtualization is not disabled when configuring
>>>>> "-cpu host,-pmu" in the QEMU command line on an AMD server. Neither
>>>>> "-cpu host,-pmu" nor "-cpu EPYC" effectively disables AMD PMU
>>>>> virtualization in such an environment.
>>>>>
>>>>> As a result, VM logs typically show:
>>>>>
>>>>> [    0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
>>>>>
>>>>> whereas the expected logs should be:
>>>>>
>>>>> [    0.596381] Performance Events: PMU not available due to virtualization, using software events only.
>>>>> [    0.600972] NMI watchdog: Perf NMI watchdog permanently disabled
>>>>>
>>>>> This discrepancy occurs because AMD PMU does not use CPUID to determine
>>>>> whether PMU virtualization is supported.
>>>> Intel platform doesn't have this issue since Linux kernel fails to check
>>>> the CPU family & model when "-cpu *,-pmu" option clears PMU version.
>>>>
>>>> The difference between Intel and AMD platforms, however, is that it seems
>>>> Intel hardly ever reaches the “...due virtualization” message, but
>>>> instead reports an error because it recognizes a mismatched family/model.
>>>>
>>>> This may be a drawback of the PMU driver's print message, but the result
>>>> is the same, it prevents the PMU driver from enabling.
>>>>
>>>> So, please mention that KVM_PMU_CAP_DISABLE doesn't change the PMU
>>>> behavior on Intel platform because current "pmu" property works as
>>>> expected.
>>> Sure. I will mention this in v2.
>>>
>>>>> To address this, we introduce a new property, 'pmu-cap-disabled', for KVM
>>>>> acceleration. This property sets KVM_PMU_CAP_DISABLE if
>>>>> KVM_CAP_PMU_CAPABILITY is supported. Note that this feature currently
>>>>> supports only x86 hosts, as KVM_CAP_PMU_CAPABILITY is used exclusively for
>>>>> x86 systems.
>>>>>
>>>>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
>>>>> ---
>>>>> Another previous solution to re-use '-cpu host,-pmu':
>>>>> https://urldefense.com/v3/__https://lore.kernel.org/all/20221119122901.2469-1-dongli.zhang@oracle.com/__;!!ACWV5N9M2RV99hQ!Nm8Db-mwBoMIwKkRqzC9kgNi5uZ7SCIf43zUBn92Ar_NEbLXq-ZkrDDvpvDQ4cnS2i4VyKAp6CRVE12bRkMF$ 
>>>> IMO, I prefer the previous version. This VM-level KVM property is
>>>> difficult to integrate with the existing CPU properties. Pls refer later
>>>> comments for reasons.
>>>>
>>>>>  accel/kvm/kvm-all.c        |  1 +
>>>>>  include/sysemu/kvm_int.h   |  1 +
>>>>>  qemu-options.hx            |  9 ++++++-
>>>>>  target/i386/cpu.c          |  2 +-
>>>>>  target/i386/kvm/kvm.c      | 52 ++++++++++++++++++++++++++++++++++++++
>>>>>  target/i386/kvm/kvm_i386.h |  2 ++
>>>>>  6 files changed, 65 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>>>>> index 801cff16a5..8b5ba45cf7 100644
>>>>> --- a/accel/kvm/kvm-all.c
>>>>> +++ b/accel/kvm/kvm-all.c
>>>>> @@ -3933,6 +3933,7 @@ static void kvm_accel_instance_init(Object *obj)
>>>>>      s->xen_evtchn_max_pirq = 256;
>>>>>      s->device = NULL;
>>>>>      s->msr_energy.enable = false;
>>>>> +    s->pmu_cap_disabled = false;
>>>>>  }
>>>> The CPU property "pmu" also defaults to "false"...but:
>>>>
>>>>  * max CPU would override this and try to enable PMU by default in
>>>>    max_x86_cpu_initfn().
>>>>
>>>>  * Other named CPU models keep the default setting to avoid affecting
>>>>    the migration.
>>>>
>>>> The pmu_cap_disabled and “pmu” property look unbound and unassociated,
>>>> so this can cause the conflict when they are not synchronized. For
>>>> example,
>>>>
>>>> -cpu host -accel kvm,pmu-cap-disabled=on
>>>>
>>>> The above options will fail to launch a VM (on Intel platform).
>>>>
>>>> Ideally, the “pmu” property and pmu-cap-disabled should be bound to each
>>>> other and be consistent. But it's not easy because:
>>>>  - There is no proper way to have pmu_cap_disabled set different default
>>>>    values (e.g., "false" for max CPU and "true" for named CPU models)
>>>>    based on different CPU models.
>>>>  - And, no proper place to check the consistency of pmu_cap_disabled and
>>>>    enable_pmu.
>>>>
>>>> Therefore, I prefer your previous approach, to reuse current CPU "pmu"
>>>> property.
>>> Thank you very much for the suggestion and reasons.
>>>
>>> I am going to follow your suggestion to switch back to the previous solution in v2.
>> +1.
>>
>>  I also prefer to leverage current exist "+/-pmu" option instead of adding
>> a new option. More options, more complexity. When they are not
>> inconsistent, which has higher priority? all these are issues.
>>
>> Although KVM_CAP_PMU_CAPABILITY is a VM-level PMU capability, but all CPUs
>> in a same VM should always share same PMU configuration (Don't consider
>> hybrid platforms which have many issues need to be handled specifically).
>>
>>
>>>> Further, considering that this is currently the only case that needs to
>>>> to set the VM level's capability in the CPU context, there is no need to
>>>> introduce a new kvm interface (in your previous patch), which can instead
>>>> be set in kvm_cpu_realizefn(), like:
>>>>
>>>>
>>>> diff --git a/target/i386/kvm/kvm-cpu.c b/target/i386/kvm/kvm-cpu.c
>>>> index 99d1941cf51c..05e9c9a1a0cf 100644
>>>> --- a/target/i386/kvm/kvm-cpu.c
>>>> +++ b/target/i386/kvm/kvm-cpu.c
>>>> @@ -42,6 +42,8 @@ static bool kvm_cpu_realizefn(CPUState *cs, Error **errp)
>>>>  {
>>>>      X86CPU *cpu = X86_CPU(cs);
>>>>      CPUX86State *env = &cpu->env;
>>>> +    KVMState *s = kvm_state;
>>>> +    static bool first = true;
>>>>      bool ret;
>>>>
>>>>      /*
>>>> @@ -63,6 +65,29 @@ static bool kvm_cpu_realizefn(CPUState *cs, Error **errp)
>>>>       *   check/update ucode_rev, phys_bits, guest_phys_bits, mwait
>>>>       *   cpu_common_realizefn() (via xcc->parent_realize)
>>>>       */
>>>> +
>>>> +    if (first) {
>>>> +        first = false;
>>>> +
>>>> +        /*
>>>> +         * Since Linux v5.18, KVM provides a VM-level capability to easily
>>>> +         * disable PMUs; however, QEMU has been providing PMU property per
>>>> +         * CPU since v1.6. In order to accommodate both, have to configure
>>>> +         * the VM-level capability here.
>>>> +         */
>>>> +        if (!cpu->enable_pmu &&
>>>> +            kvm_check_extension(s, KVM_CAP_PMU_CAPABILITY)) {
>>>> +            int r = kvm_vm_enable_cap(s, KVM_CAP_PMU_CAPABILITY, 0,
>>>> +                                      KVM_PMU_CAP_DISABLE);
>>>> +
>>>> +            if (r < 0) {
>>>> +                error_setg(errp, "kvm: Failed to disable pmu cap: %s",
>>>> +                           strerror(-r));
>>>> +                return false;
>>>> +            }
>>>> +        }
>> It seems KVM_CAP_PMU_CAPABILITY is called to only disable PMU here. From
>> point view of logic completeness,  KVM_CAP_PMU_CAPABILITY should be called
>> to enabled PMU as well when user wants to enable PMU.
>>
>> I know currently we only need to disable PMU, but we may need to enable PMU
>> via KVM_CAP_PMU_CAPABILITY soon.
>>
>> We are working on the new KVM mediated vPMU framework, Sean suggest to
>> leverage KVM_CAP_PMU_CAPABILITY to enable mediated vPMU dynamically
>> (https://urldefense.com/v3/__https://lore.kernel.org/all/Zz4uhmuPcZl9vJVr@google.com/__;!!ACWV5N9M2RV99hQ!JQx8CdjEI-J6WbzbvB7vHcZ0nJPkzUhvl6TvWvDorAal1XAC17dwpRa6b6Xlva--pK-4ej3Ota0k9SJl3BUWXKTew70$ ). So It would be
>> better if the enable logic can be added here as well.
>>
>> Thanks.
> 
> Hi Dongli,
> 
> May I know if you have plan to continue to update this patch recently? As
> previous comment said, our KVM mediated vPMU solution needs qemu to
> explicitly call KVM_CAP_PMU_CAPABILITY to enable mediated vPMU as well. If
> you have no plan to update this patch recently, would you mind me to pick
> up this patch
> (https://urldefense.com/v3/__https://lore.kernel.org/all/20221119122901.2469-2-dongli.zhang@oracle.com/__;!!ACWV5N9M2RV99hQ!JQx8CdjEI-J6WbzbvB7vHcZ0nJPkzUhvl6TvWvDorAal1XAC17dwpRa6b6Xlva--pK-4ej3Ota0k9SJl3BUWzQmZ_yA$ )
> and post with other our mediated vPMU related patches to enable mediated vPMU?
> 
> Thanks!
> 
> Dapeng Mi


Sorry for the delay — it took some effort to learn about mediated vPMU (as
you suggested) to adapt this patch accordingly.

Yes, feel free to pick up this patch for mediated vPMU, as I don’t want to
block your work, although, I do plan to continue updating it.

I am continuing working on it, but my primary objective is to reset the AMD
PMU during QEMU reset, which depends on KVM_PMU_CAP_DISABLE.

[PATCH 5/7] target/i386/kvm: Reset AMD PMU registers during VM reset
[PATCH 6/7] target/i386/kvm: Support perfmon-v2 for reset

Would you mind keeping me updated on any changes/discussions you make to
QEMU on KVM_PMU_CAP_DISABLE for mediated vPMU? That way, I can adjust my
code accordingly once your QEMU patch for KVM_PMU_CAP_DISABLE is finalized.

In the meantime, I am continuing working on the entire patchset and I can
change the code when you post the relevant QEMU changes on
KVM_PMU_CAP_DISABLE soon.

Would that work for you?

Thank you very much!

Dongli Zhang


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/7] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE
  2025-02-09 20:12           ` dongli.zhang
@ 2025-02-10  8:04             ` Mi, Dapeng
  0 siblings, 0 replies; 32+ messages in thread
From: Mi, Dapeng @ 2025-02-10  8:04 UTC (permalink / raw)
  To: dongli.zhang, Zhao Liu
  Cc: qemu-devel, kvm, pbonzini, mtosatti, sandipan.das, babu.moger,
	likexu, like.xu.linux, zhenyuw, groug, lyan, khorenko,
	alexander.ivanov, den, joe.jin, davydov-max, zide.chen


On 2/10/2025 4:12 AM, dongli.zhang@oracle.com wrote:
> Hi Dapeng,
>
> On 2/7/25 1:52 AM, Mi, Dapeng wrote:
>> On 11/21/2024 6:06 PM, Mi, Dapeng wrote:
>>> On 11/8/2024 7:44 AM, dongli.zhang@oracle.com wrote:
>>>> Hi Zhao,
>>>>
>>>>
>>>> On 11/6/24 11:52 PM, Zhao Liu wrote:
>>>>> (+Dapang & Zide)
>>>>>
>>>>> Hi Dongli,
>>>>>
>>>>> On Mon, Nov 04, 2024 at 01:40:17AM -0800, Dongli Zhang wrote:
>>>>>> Date: Mon,  4 Nov 2024 01:40:17 -0800
>>>>>> From: Dongli Zhang <dongli.zhang@oracle.com>
>>>>>> Subject: [PATCH 2/7] target/i386/kvm: introduce 'pmu-cap-disabled' to set
>>>>>>  KVM_PMU_CAP_DISABLE
>>>>>> X-Mailer: git-send-email 2.43.5
>>>>>>
>>>>>> The AMD PMU virtualization is not disabled when configuring
>>>>>> "-cpu host,-pmu" in the QEMU command line on an AMD server. Neither
>>>>>> "-cpu host,-pmu" nor "-cpu EPYC" effectively disables AMD PMU
>>>>>> virtualization in such an environment.
>>>>>>
>>>>>> As a result, VM logs typically show:
>>>>>>
>>>>>> [    0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
>>>>>>
>>>>>> whereas the expected logs should be:
>>>>>>
>>>>>> [    0.596381] Performance Events: PMU not available due to virtualization, using software events only.
>>>>>> [    0.600972] NMI watchdog: Perf NMI watchdog permanently disabled
>>>>>>
>>>>>> This discrepancy occurs because AMD PMU does not use CPUID to determine
>>>>>> whether PMU virtualization is supported.
>>>>> Intel platform doesn't have this issue since Linux kernel fails to check
>>>>> the CPU family & model when "-cpu *,-pmu" option clears PMU version.
>>>>>
>>>>> The difference between Intel and AMD platforms, however, is that it seems
>>>>> Intel hardly ever reaches the “...due virtualization” message, but
>>>>> instead reports an error because it recognizes a mismatched family/model.
>>>>>
>>>>> This may be a drawback of the PMU driver's print message, but the result
>>>>> is the same, it prevents the PMU driver from enabling.
>>>>>
>>>>> So, please mention that KVM_PMU_CAP_DISABLE doesn't change the PMU
>>>>> behavior on Intel platform because current "pmu" property works as
>>>>> expected.
>>>> Sure. I will mention this in v2.
>>>>
>>>>>> To address this, we introduce a new property, 'pmu-cap-disabled', for KVM
>>>>>> acceleration. This property sets KVM_PMU_CAP_DISABLE if
>>>>>> KVM_CAP_PMU_CAPABILITY is supported. Note that this feature currently
>>>>>> supports only x86 hosts, as KVM_CAP_PMU_CAPABILITY is used exclusively for
>>>>>> x86 systems.
>>>>>>
>>>>>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
>>>>>> ---
>>>>>> Another previous solution to re-use '-cpu host,-pmu':
>>>>>> https://urldefense.com/v3/__https://lore.kernel.org/all/20221119122901.2469-1-dongli.zhang@oracle.com/__;!!ACWV5N9M2RV99hQ!Nm8Db-mwBoMIwKkRqzC9kgNi5uZ7SCIf43zUBn92Ar_NEbLXq-ZkrDDvpvDQ4cnS2i4VyKAp6CRVE12bRkMF$ 
>>>>> IMO, I prefer the previous version. This VM-level KVM property is
>>>>> difficult to integrate with the existing CPU properties. Pls refer later
>>>>> comments for reasons.
>>>>>
>>>>>>  accel/kvm/kvm-all.c        |  1 +
>>>>>>  include/sysemu/kvm_int.h   |  1 +
>>>>>>  qemu-options.hx            |  9 ++++++-
>>>>>>  target/i386/cpu.c          |  2 +-
>>>>>>  target/i386/kvm/kvm.c      | 52 ++++++++++++++++++++++++++++++++++++++
>>>>>>  target/i386/kvm/kvm_i386.h |  2 ++
>>>>>>  6 files changed, 65 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>>>>>> index 801cff16a5..8b5ba45cf7 100644
>>>>>> --- a/accel/kvm/kvm-all.c
>>>>>> +++ b/accel/kvm/kvm-all.c
>>>>>> @@ -3933,6 +3933,7 @@ static void kvm_accel_instance_init(Object *obj)
>>>>>>      s->xen_evtchn_max_pirq = 256;
>>>>>>      s->device = NULL;
>>>>>>      s->msr_energy.enable = false;
>>>>>> +    s->pmu_cap_disabled = false;
>>>>>>  }
>>>>> The CPU property "pmu" also defaults to "false"...but:
>>>>>
>>>>>  * max CPU would override this and try to enable PMU by default in
>>>>>    max_x86_cpu_initfn().
>>>>>
>>>>>  * Other named CPU models keep the default setting to avoid affecting
>>>>>    the migration.
>>>>>
>>>>> The pmu_cap_disabled and “pmu” property look unbound and unassociated,
>>>>> so this can cause the conflict when they are not synchronized. For
>>>>> example,
>>>>>
>>>>> -cpu host -accel kvm,pmu-cap-disabled=on
>>>>>
>>>>> The above options will fail to launch a VM (on Intel platform).
>>>>>
>>>>> Ideally, the “pmu” property and pmu-cap-disabled should be bound to each
>>>>> other and be consistent. But it's not easy because:
>>>>>  - There is no proper way to have pmu_cap_disabled set different default
>>>>>    values (e.g., "false" for max CPU and "true" for named CPU models)
>>>>>    based on different CPU models.
>>>>>  - And, no proper place to check the consistency of pmu_cap_disabled and
>>>>>    enable_pmu.
>>>>>
>>>>> Therefore, I prefer your previous approach, to reuse current CPU "pmu"
>>>>> property.
>>>> Thank you very much for the suggestion and reasons.
>>>>
>>>> I am going to follow your suggestion to switch back to the previous solution in v2.
>>> +1.
>>>
>>>  I also prefer to leverage current exist "+/-pmu" option instead of adding
>>> a new option. More options, more complexity. When they are not
>>> inconsistent, which has higher priority? all these are issues.
>>>
>>> Although KVM_CAP_PMU_CAPABILITY is a VM-level PMU capability, but all CPUs
>>> in a same VM should always share same PMU configuration (Don't consider
>>> hybrid platforms which have many issues need to be handled specifically).
>>>
>>>
>>>>> Further, considering that this is currently the only case that needs to
>>>>> to set the VM level's capability in the CPU context, there is no need to
>>>>> introduce a new kvm interface (in your previous patch), which can instead
>>>>> be set in kvm_cpu_realizefn(), like:
>>>>>
>>>>>
>>>>> diff --git a/target/i386/kvm/kvm-cpu.c b/target/i386/kvm/kvm-cpu.c
>>>>> index 99d1941cf51c..05e9c9a1a0cf 100644
>>>>> --- a/target/i386/kvm/kvm-cpu.c
>>>>> +++ b/target/i386/kvm/kvm-cpu.c
>>>>> @@ -42,6 +42,8 @@ static bool kvm_cpu_realizefn(CPUState *cs, Error **errp)
>>>>>  {
>>>>>      X86CPU *cpu = X86_CPU(cs);
>>>>>      CPUX86State *env = &cpu->env;
>>>>> +    KVMState *s = kvm_state;
>>>>> +    static bool first = true;
>>>>>      bool ret;
>>>>>
>>>>>      /*
>>>>> @@ -63,6 +65,29 @@ static bool kvm_cpu_realizefn(CPUState *cs, Error **errp)
>>>>>       *   check/update ucode_rev, phys_bits, guest_phys_bits, mwait
>>>>>       *   cpu_common_realizefn() (via xcc->parent_realize)
>>>>>       */
>>>>> +
>>>>> +    if (first) {
>>>>> +        first = false;
>>>>> +
>>>>> +        /*
>>>>> +         * Since Linux v5.18, KVM provides a VM-level capability to easily
>>>>> +         * disable PMUs; however, QEMU has been providing PMU property per
>>>>> +         * CPU since v1.6. In order to accommodate both, have to configure
>>>>> +         * the VM-level capability here.
>>>>> +         */
>>>>> +        if (!cpu->enable_pmu &&
>>>>> +            kvm_check_extension(s, KVM_CAP_PMU_CAPABILITY)) {
>>>>> +            int r = kvm_vm_enable_cap(s, KVM_CAP_PMU_CAPABILITY, 0,
>>>>> +                                      KVM_PMU_CAP_DISABLE);
>>>>> +
>>>>> +            if (r < 0) {
>>>>> +                error_setg(errp, "kvm: Failed to disable pmu cap: %s",
>>>>> +                           strerror(-r));
>>>>> +                return false;
>>>>> +            }
>>>>> +        }
>>> It seems KVM_CAP_PMU_CAPABILITY is called to only disable PMU here. From
>>> point view of logic completeness,  KVM_CAP_PMU_CAPABILITY should be called
>>> to enabled PMU as well when user wants to enable PMU.
>>>
>>> I know currently we only need to disable PMU, but we may need to enable PMU
>>> via KVM_CAP_PMU_CAPABILITY soon.
>>>
>>> We are working on the new KVM mediated vPMU framework, Sean suggest to
>>> leverage KVM_CAP_PMU_CAPABILITY to enable mediated vPMU dynamically
>>> (https://urldefense.com/v3/__https://lore.kernel.org/all/Zz4uhmuPcZl9vJVr@google.com/__;!!ACWV5N9M2RV99hQ!JQx8CdjEI-J6WbzbvB7vHcZ0nJPkzUhvl6TvWvDorAal1XAC17dwpRa6b6Xlva--pK-4ej3Ota0k9SJl3BUWXKTew70$ ). So It would be
>>> better if the enable logic can be added here as well.
>>>
>>> Thanks.
>> Hi Dongli,
>>
>> May I know if you have plan to continue to update this patch recently? As
>> previous comment said, our KVM mediated vPMU solution needs qemu to
>> explicitly call KVM_CAP_PMU_CAPABILITY to enable mediated vPMU as well. If
>> you have no plan to update this patch recently, would you mind me to pick
>> up this patch
>> (https://urldefense.com/v3/__https://lore.kernel.org/all/20221119122901.2469-2-dongli.zhang@oracle.com/__;!!ACWV5N9M2RV99hQ!JQx8CdjEI-J6WbzbvB7vHcZ0nJPkzUhvl6TvWvDorAal1XAC17dwpRa6b6Xlva--pK-4ej3Ota0k9SJl3BUWzQmZ_yA$ )
>> and post with other our mediated vPMU related patches to enable mediated vPMU?
>>
>> Thanks!
>>
>> Dapeng Mi
>
> Sorry for the delay — it took some effort to learn about mediated vPMU (as
> you suggested) to adapt this patch accordingly.
>
> Yes, feel free to pick up this patch for mediated vPMU, as I don’t want to
> block your work, although, I do plan to continue updating it.
>
> I am continuing working on it, but my primary objective is to reset the AMD
> PMU during QEMU reset, which depends on KVM_PMU_CAP_DISABLE.
>
> [PATCH 5/7] target/i386/kvm: Reset AMD PMU registers during VM reset
> [PATCH 6/7] target/i386/kvm: Support perfmon-v2 for reset
>
> Would you mind keeping me updated on any changes/discussions you make to
> QEMU on KVM_PMU_CAP_DISABLE for mediated vPMU? That way, I can adjust my
> code accordingly once your QEMU patch for KVM_PMU_CAP_DISABLE is finalized.
>
> In the meantime, I am continuing working on the entire patchset and I can
> change the code when you post the relevant QEMU changes on
> KVM_PMU_CAP_DISABLE soon.
>
> Would that work for you?

Dongli,

Thanks for your feedback. Sure, I would add you into the mail loop when
sending the qemu mediated vPMU patches.

BTW, I found Xiaoyao ever sent a quite familiar patch
(https://lore.kernel.org/qemu-devel/20220317135913.2166202-10-xiaoyao.li@intel.com/)
and he has updated the patch to latest qemu code base, I would pick up his
patch directly.

Thanks,

Dapeng Mi


>
> Thank you very much!
>
> Dongli Zhang


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2025-02-10  8:05 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-04  9:40 [PATCH 0/7] target/i386/kvm/pmu: Enhancement, Bugfix and Cleanup Dongli Zhang
2024-11-04  9:40 ` [PATCH 1/7] target/i386: disable PerfMonV2 when PERFCORE unavailable Dongli Zhang
2024-11-06  3:54   ` Zhao Liu
2024-11-07  0:29     ` dongli.zhang
2024-11-07  7:57       ` Zhao Liu
2024-11-04  9:40 ` [PATCH 2/7] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE Dongli Zhang
2024-11-07  7:52   ` Zhao Liu
2024-11-07 23:44     ` dongli.zhang
2024-11-08  2:32       ` Zhao Liu
2024-11-08 12:52       ` Sandipan Das
2024-11-13 17:15       ` Zhao Liu
2024-11-14  0:13         ` dongli.zhang
2024-11-21 10:06       ` Mi, Dapeng
2025-02-07  9:52         ` Mi, Dapeng
2025-02-09 20:12           ` dongli.zhang
2025-02-10  8:04             ` Mi, Dapeng
2024-11-04  9:40 ` [PATCH 3/7] target/i386/kvm: init PMU information only once Dongli Zhang
2024-11-10 15:29   ` Zhao Liu
2024-11-13  1:50     ` dongli.zhang
2024-11-13 16:48       ` Zhao Liu
2024-11-04  9:40 ` [PATCH 4/7] target/i386/kvm: rename architectural PMU variables Dongli Zhang
2024-11-04  9:40 ` [PATCH 5/7] target/i386/kvm: reset AMD PMU registers during VM reset Dongli Zhang
2024-11-06  9:58   ` Sandipan Das
2024-11-07  0:33     ` dongli.zhang
2024-11-07 21:00   ` Maksim Davydov
2024-11-08  1:19     ` dongli.zhang
2024-11-08 14:07       ` Maksim Davydov
2024-11-08 18:04         ` dongli.zhang
2024-11-04  9:40 ` [PATCH 6/7] target/i386/kvm: support perfmon-v2 for reset Dongli Zhang
2024-11-08 13:09   ` Sandipan Das
2024-11-08 16:55     ` dongli.zhang
2024-11-04  9:40 ` [PATCH 7/7] target/i386/kvm: don't stop Intel PMU counters Dongli Zhang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).