linux-kselftest.vger.kernel.org archive mirror
* [PATCH v4 00/38] Mediated vPMU 4.0 for x86
@ 2025-03-24 17:30 Mingwei Zhang
  2025-03-24 17:30 ` [PATCH v4 01/38] perf: Support get/put mediated PMU interfaces Mingwei Zhang
                   ` (40 more replies)
  0 siblings, 41 replies; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

With a joint effort from the upstream KVM community, we have come up with
the 4th version of the mediated vPMU for x86. We have made the following
changes on top of the previous RFC v3.

v3 -> v4
 - Rebase the whole patch set onto 6.14-rc3.
 - Address Peter's comments on the perf part.
 - Address Sean's comments on the KVM part.
   * Change the keyword "passthrough" to "mediated" in all patches.
   * Change static enabling to dynamic enabling from user space via
     KVM_CAP_PMU_CAPABILITY (see the sketch after this list).
   * Only support GLOBAL_CTRL save/restore with the VMCS exec_ctrl and drop
     the MSR save/restore list support for GLOBAL_CTRL; mediated vPMU
     support is therefore constrained to Sapphire Rapids and later CPUs on
     the Intel side.
   * Merge some small changes into a single patch.
 - Address Sandipan's comment on the invalid pmu pointer.
 - Add back "eventsel_hw" and "fixed_ctr_ctrl_hw" to avoid directly
   manipulating pmc->eventsel and pmu->fixed_ctr_ctrl.
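
For reference, a rough user-space sketch of the dynamic enabling mentioned
above (not taken from this series or the QEMU series; KVM_PMU_CAP_MEDIATE is
a placeholder for whatever flag the KVM/QEMU patches actually define):

  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Placeholder flag name; the real value comes from the KVM patches. */
  #define KVM_PMU_CAP_MEDIATE     (1U << 1)

  static int enable_mediated_vpmu(int vm_fd)
  {
          struct kvm_enable_cap cap = {
                  .cap  = KVM_CAP_PMU_CAPABILITY,
                  .args = { KVM_PMU_CAP_MEDIATE },
          };

          /* KVM_CAP_PMU_CAPABILITY must be set before creating vCPUs. */
          return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
  }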


Testing (Intel side):
 - Perf-based legacy vPMU (force emulation on/off)
   * Kselftests pmu_counters_test, pmu_event_filter_test and
     vmx_pmu_caps_test pass.
   * KUT PMU tests pmu, pmu_lbr, pmu_pebs pass.
   * Basic perf counting/sampling tests pass in all 3 scenarios:
     guest-only, host-only and host-guest coexistence.

 - Mediated vPMU (force emulation on/off)
   * Kselftests pmu_counters_test, pmu_event_filter_test and
     vmx_pmu_caps_test pass.
   * KUT PMU tests pmu, pmu_lbr, pmu_pebs pass.
   * Basic perf counting/sampling tests pass in all 3 scenarios:
     guest-only, host-only and host-guest coexistence.

 - Failures. All of the above tests also pass on Intel Granite Rapids,
   except for one failure in KUT/pmu_pebs.
   * GP counter 0 (0xfffffffffffe): PEBS record (written seq 0)
     is verified (including size, counters and cfg).
   * The pebs_data_cfg (0xb500000000) doesn't match the
     effective MSR_PEBS_DATA_CFG (0x0).
   * This failure has nothing to do with this mediated vPMU patch set. It
     is caused by timed PEBS, which Granite Rapids supports and which needs
     extra support in QEMU and KUT/pmu_pebs. That extra support will be
     sent in separate patches later.


Testing (AMD side):
 - Kselftests pmu_counters_test, pmu_event_filter_test and
   vmx_pmu_caps_test all pass

 - legacy guest with KUT/pmu:
   * qemu option: -cpu host, -perfctr-core
   * with force_emulation_prefix=1: passes
   * with force_emulation_prefix=0: passes
 - perfmon-v1 guest with KUT/pmu:
   * qemu option: -cpu host, -perfmon-v2
   * with force_emulation_prefix=1: passes
   * with force_emulation_prefix=0: passes
 - perfmon-v2 guest with KUT/pmu:
   * qemu option: -cpu host
   * with force_emulation_prefix=1: passes
   * with force_emulation_prefix=0: passes

 - perf_fuzzer (perfmon-v2):
   * fails with a soft lockup in the guest with the current version.
   * the culprit could be a KVM change between 6.13 and 6.14-rc3.
   * the series was tested on 6.12 and 6.13 without issue.

Note: a QEMU series is needed to run mediated vPMU v4:
 - https://lore.kernel.org/all/20250324123712.34096-1-dapeng1.mi@linux.intel.com/

History:
 - RFC v3: https://lore.kernel.org/all/20240801045907.4010984-1-mizhang@google.com/
 - RFC v2: https://lore.kernel.org/all/20240506053020.3911940-1-mizhang@google.com/
 - RFC v1: https://lore.kernel.org/all/20240126085444.324918-1-xiong.y.zhang@linux.intel.com/


Dapeng Mi (18):
  KVM: x86/pmu: Introduce enable_mediated_pmu global parameter
  KVM: x86/pmu: Check PMU cpuid configuration from user space
  KVM: x86: Rename vmx_vmentry/vmexit_ctrl() helpers
  KVM: x86/pmu: Add perf_capabilities field in struct kvm_host_values{}
  KVM: x86/pmu: Move PMU_CAP_{FW_WRITES,LBR_FMT} into msr-index.h header
  KVM: VMX: Add macros to wrap around
    {secondary,tertiary}_exec_controls_changebit()
  KVM: x86/pmu: Check if mediated vPMU can intercept rdpmc
  KVM: x86/pmu/vmx: Save/load guest IA32_PERF_GLOBAL_CTRL with
    vm_exit/entry_ctrl
  KVM: x86/pmu: Optimize intel/amd_pmu_refresh() helpers
  KVM: x86/pmu: Setup PMU MSRs' interception mode
  KVM: x86/pmu: Handle PMU MSRs interception and event filtering
  KVM: x86/pmu: Switch host/guest PMU context at vm-exit/vm-entry
  KVM: x86/pmu: Handle emulated instruction for mediated vPMU
  KVM: nVMX: Add macros to simplify nested MSR interception setting
  KVM: selftests: Add mediated vPMU supported for pmu tests
  KVM: Selftests: Support mediated vPMU for vmx_pmu_caps_test
  KVM: Selftests: Fix pmu_counters_test error for mediated vPMU
  KVM: x86/pmu: Expose enable_mediated_pmu parameter to user space

Kan Liang (8):
  perf: Support get/put mediated PMU interfaces
  perf: Skip pmu_ctx based on event_type
  perf: Clean up perf ctx time
  perf: Add a EVENT_GUEST flag
  perf: Add generic exclude_guest support
  perf: Add switch_guest_ctx() interface
  perf/x86: Support switch_guest_ctx interface
  perf/x86/intel: Support PERF_PMU_CAP_MEDIATED_VPMU

Mingwei Zhang (5):
  perf/x86: Forbid PMI handler when guest own PMU
  perf/x86/core: Plumb mediated PMU capability from x86_pmu to
    x86_pmu_cap
  KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot()
  KVM: x86/pmu: introduce eventsel_hw to prepare for pmu event filtering
  KVM: nVMX: Add nested virtualization support for mediated PMU

Sandipan Das (4):
  perf/x86/core: Do not set bit width for unavailable counters
  KVM: x86/pmu: Add AMD PMU registers to direct access list
  KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest
    write to event selectors
  perf/x86/amd: Support PERF_PMU_CAP_MEDIATED_VPMU for AMD host

Xiong Zhang (3):
  x86/irq: Factor out common code for installing kvm irq handler
  perf: core/x86: Register a new vector for KVM GUEST PMI
  KVM: x86/pmu: Register KVM_GUEST_PMI_VECTOR handler

 arch/x86/events/amd/core.c                    |   2 +
 arch/x86/events/core.c                        |  40 +-
 arch/x86/events/intel/core.c                  |   5 +
 arch/x86/include/asm/hardirq.h                |   1 +
 arch/x86/include/asm/idtentry.h               |   1 +
 arch/x86/include/asm/irq.h                    |   2 +-
 arch/x86/include/asm/irq_vectors.h            |   5 +-
 arch/x86/include/asm/kvm-x86-pmu-ops.h        |   2 +
 arch/x86/include/asm/kvm_host.h               |  10 +
 arch/x86/include/asm/msr-index.h              |  18 +-
 arch/x86/include/asm/perf_event.h             |   1 +
 arch/x86/include/asm/vmx.h                    |   1 +
 arch/x86/kernel/idt.c                         |   1 +
 arch/x86/kernel/irq.c                         |  39 +-
 arch/x86/kvm/cpuid.c                          |  15 +
 arch/x86/kvm/pmu.c                            | 254 ++++++++-
 arch/x86/kvm/pmu.h                            |  45 ++
 arch/x86/kvm/svm/pmu.c                        | 148 ++++-
 arch/x86/kvm/svm/svm.c                        |  26 +
 arch/x86/kvm/svm/svm.h                        |   2 +-
 arch/x86/kvm/vmx/capabilities.h               |  11 +-
 arch/x86/kvm/vmx/nested.c                     |  68 ++-
 arch/x86/kvm/vmx/pmu_intel.c                  | 224 ++++++--
 arch/x86/kvm/vmx/vmx.c                        |  89 +--
 arch/x86/kvm/vmx/vmx.h                        |  11 +-
 arch/x86/kvm/x86.c                            |  63 ++-
 arch/x86/kvm/x86.h                            |   2 +
 include/linux/perf_event.h                    |  47 +-
 kernel/events/core.c                          | 519 ++++++++++++++----
 .../beauty/arch/x86/include/asm/irq_vectors.h |   5 +-
 .../selftests/kvm/include/kvm_test_harness.h  |  13 +
 .../testing/selftests/kvm/include/kvm_util.h  |   3 +
 .../selftests/kvm/include/x86/processor.h     |   8 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  23 +
 .../selftests/kvm/x86/pmu_counters_test.c     |  24 +-
 .../selftests/kvm/x86/pmu_event_filter_test.c |   8 +-
 .../selftests/kvm/x86/vmx_pmu_caps_test.c     |   2 +-
 37 files changed, 1480 insertions(+), 258 deletions(-)


base-commit: 0ad2507d5d93f39619fc42372c347d6006b64319
-- 
2.49.0.395.g12beb8f557-goog



* [PATCH v4 01/38] perf: Support get/put mediated PMU interfaces
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
@ 2025-03-24 17:30 ` Mingwei Zhang
  2025-05-14 22:48   ` Sean Christopherson
  2025-03-24 17:30 ` [PATCH v4 02/38] perf: Skip pmu_ctx based on event_type Mingwei Zhang
                   ` (39 subsequent siblings)
  40 siblings, 1 reply; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Kan Liang <kan.liang@linux.intel.com>

Currently, the guest and host share the PMU resources when a guest is
running. KVM has to create extra virtual events to simulate the guest's
events, which brings several issues, e.g., high overhead and reduced
accuracy.

A new mediated PMU method is proposed to address the issue. It requires
that the PMU resources can be fully occupied by the guest while it's
running. Two new interfaces are implemented to fulfill the requirement.
The hypervisor should invoke the interfaces when creating a guest that
wants the mediated PMU capability.

The PMU resources should only be temporarily occupied as a whole while a
guest is running. When the guest exits, the PMU resources are shared
among different users again.

The exclude_guest event modifier is used to guarantee the exclusive
occupation of the PMU resources. When creating a guest, the hypervisor
should check whether there are !exclude_guest events in the system.
If there are, the creation should fail, because some PMU resources are
already occupied by other users.
If not, the PMU resources can be safely accessed by the guest directly.
Perf guarantees that no new !exclude_guest events are created while a
guest is running.

Only the mediated PMU is affected; other PMUs, e.g., uncore and software
PMUs, are not. The behavior of those PMUs is not changed. The guest
enter/exit interfaces should only impact the supported PMUs.
Add a new PERF_PMU_CAP_MEDIATED_VPMU flag to indicate the PMUs that
support the feature.

Add nr_include_guest_events to track the !exclude_guest events of PMUs
with PERF_PMU_CAP_MEDIATED_VPMU.
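
For illustration, the intended pairing of the two interfaces in a
hypervisor looks roughly like the sketch below (not part of this patch;
the actual KVM usage is added later in the series, and the function names
here are placeholders):

  /* At VM creation, for a VM that wants the mediated PMU. */
  static int hypervisor_enable_mediated_pmu(void)
  {
          int ret;

          /* Fails with -EBUSY if !exclude_guest events already exist. */
          ret = perf_get_mediated_pmu();
          if (ret)
                  return ret;

          /* ... continue creating the VM ... */
          return 0;
  }

  /* At VM destruction, allow !exclude_guest events to be created again. */
  static void hypervisor_disable_mediated_pmu(void)
  {
          perf_put_mediated_pmu();
  }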

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 include/linux/perf_event.h | 11 +++++++
 kernel/events/core.c       | 66 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 77 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 8333f132f4a9..54018dd0b2a4 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -301,6 +301,8 @@ struct perf_event_pmu_context;
 #define PERF_PMU_CAP_AUX_OUTPUT			0x0080
 #define PERF_PMU_CAP_EXTENDED_HW_TYPE		0x0100
 #define PERF_PMU_CAP_AUX_PAUSE			0x0200
+/* Support to pass through the whole PMU resource to the guest */
+#define PERF_PMU_CAP_MEDIATED_VPMU		0x0400
 
 /**
  * pmu::scope
@@ -1811,6 +1813,8 @@ extern void perf_event_task_tick(void);
 extern int perf_event_account_interrupt(struct perf_event *event);
 extern int perf_event_period(struct perf_event *event, u64 value);
 extern u64 perf_event_pause(struct perf_event *event, bool reset);
+int perf_get_mediated_pmu(void);
+void perf_put_mediated_pmu(void);
 #else /* !CONFIG_PERF_EVENTS: */
 static inline void *
 perf_aux_output_begin(struct perf_output_handle *handle,
@@ -1901,6 +1905,13 @@ static inline int perf_exclude_event(struct perf_event *event, struct pt_regs *r
 {
 	return 0;
 }
+
+static inline int perf_get_mediated_pmu(void)
+{
+	return 0;
+}
+
+static inline void perf_put_mediated_pmu(void)			{ }
 #endif
 
 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index bcb09e011e9e..be623701dc48 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -431,6 +431,20 @@ static atomic_t nr_bpf_events __read_mostly;
 static atomic_t nr_cgroup_events __read_mostly;
 static atomic_t nr_text_poke_events __read_mostly;
 static atomic_t nr_build_id_events __read_mostly;
+static atomic_t nr_include_guest_events __read_mostly;
+
+static atomic_t nr_mediated_pmu_vms;
+static DEFINE_MUTEX(perf_mediated_pmu_mutex);
+
+/* !exclude_guest event of PMU with PERF_PMU_CAP_MEDIATED_VPMU */
+static inline bool is_include_guest_event(struct perf_event *event)
+{
+	if ((event->pmu->capabilities & PERF_PMU_CAP_MEDIATED_VPMU) &&
+	    !event->attr.exclude_guest)
+		return true;
+
+	return false;
+}
 
 static LIST_HEAD(pmus);
 static DEFINE_MUTEX(pmus_lock);
@@ -5320,6 +5334,9 @@ static void _free_event(struct perf_event *event)
 
 	unaccount_event(event);
 
+	if (is_include_guest_event(event))
+		atomic_dec(&nr_include_guest_events);
+
 	security_perf_event_free(event);
 
 	if (event->rb) {
@@ -5877,6 +5894,36 @@ u64 perf_event_pause(struct perf_event *event, bool reset)
 }
 EXPORT_SYMBOL_GPL(perf_event_pause);
 
+/*
+ * Currently invoked at VM creation to
+ * - Check whether there are existing !exclude_guest events of PMU with
+ *   PERF_PMU_CAP_MEDIATED_VPMU
+ * - Set nr_mediated_pmu_vms to prevent !exclude_guest event creation on
+ *   PMUs with PERF_PMU_CAP_MEDIATED_VPMU
+ *
+ * No impact for the PMU without PERF_PMU_CAP_MEDIATED_VPMU. The perf
+ * still owns all the PMU resources.
+ */
+int perf_get_mediated_pmu(void)
+{
+	guard(mutex)(&perf_mediated_pmu_mutex);
+	if (atomic_inc_not_zero(&nr_mediated_pmu_vms))
+		return 0;
+
+	if (atomic_read(&nr_include_guest_events))
+		return -EBUSY;
+
+	atomic_inc(&nr_mediated_pmu_vms);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(perf_get_mediated_pmu);
+
+void perf_put_mediated_pmu(void)
+{
+	atomic_dec(&nr_mediated_pmu_vms);
+}
+EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
+
 /*
  * Holding the top-level event's child_mutex means that any
  * descendant process that has inherited this event will block
@@ -12210,6 +12257,17 @@ static void account_event(struct perf_event *event)
 	account_pmu_sb_event(event);
 }
 
+static int perf_account_include_guest_event(void)
+{
+	guard(mutex)(&perf_mediated_pmu_mutex);
+
+	if (atomic_read(&nr_mediated_pmu_vms))
+		return -EOPNOTSUPP;
+
+	atomic_inc(&nr_include_guest_events);
+	return 0;
+}
+
 /*
  * Allocate and initialize an event structure
  */
@@ -12435,11 +12493,19 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	if (err)
 		goto err_callchain_buffer;
 
+	if (is_include_guest_event(event)) {
+		err = perf_account_include_guest_event();
+		if (err)
+			goto err_security_alloc;
+	}
+
 	/* symmetric to unaccount_event() in _free_event() */
 	account_event(event);
 
 	return event;
 
+err_security_alloc:
+	security_perf_event_free(event);
 err_callchain_buffer:
 	if (!event->parent) {
 		if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
-- 
2.49.0.395.g12beb8f557-goog



* [PATCH v4 02/38] perf: Skip pmu_ctx based on event_type
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
  2025-03-24 17:30 ` [PATCH v4 01/38] perf: Support get/put mediated PMU interfaces Mingwei Zhang
@ 2025-03-24 17:30 ` Mingwei Zhang
  2025-03-24 17:30 ` [PATCH v4 03/38] perf: Clean up perf ctx time Mingwei Zhang
                   ` (38 subsequent siblings)
  40 siblings, 0 replies; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Kan Liang <kan.liang@linux.intel.com>

To optimize the cgroup context switch, the perf_event_pmu_context
iteration skips the PMUs without cgroup events. A bool 'cgroup' was
introduced to indicate the case. It works, but this way is hard to
extend to other cases, e.g., skipping non-passthrough PMUs. It doesn't
make sense to keep adding bool variables.

Pass the event_type instead of the specific bool variable. Check both
the event_type and the related pmu_ctx variables to decide whether to
skip a PMU.

Event flags, e.g., EVENT_CGROUP, should be cleared from ctx->is_active.
Add an EVENT_FLAGS mask to cover such event flags.

No functional change.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 kernel/events/core.c | 73 ++++++++++++++++++++++++--------------------
 1 file changed, 40 insertions(+), 33 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index be623701dc48..8d3a0cc59fb4 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -163,7 +163,7 @@ enum event_type_t {
 	/* see ctx_resched() for details */
 	EVENT_CPU	= 0x10,
 	EVENT_CGROUP	= 0x20,
-
+	EVENT_FLAGS	= EVENT_CGROUP,
 	/* compound helpers */
 	EVENT_ALL         = EVENT_FLEXIBLE | EVENT_PINNED,
 	EVENT_TIME_FROZEN = EVENT_TIME | EVENT_FROZEN,
@@ -733,27 +733,37 @@ do {									\
 	___p;								\
 })
 
-#define for_each_epc(_epc, _ctx, _pmu, _cgroup)				\
+static bool perf_skip_pmu_ctx(struct perf_event_pmu_context *pmu_ctx,
+			      enum event_type_t event_type)
+{
+	if ((event_type & EVENT_CGROUP) && !pmu_ctx->nr_cgroups)
+		return true;
+	return false;
+}
+
+#define for_each_epc(_epc, _ctx, _pmu, _event_type)			\
 	list_for_each_entry(_epc, &((_ctx)->pmu_ctx_list), pmu_ctx_entry) \
-		if (_cgroup && !_epc->nr_cgroups)			\
+		if (perf_skip_pmu_ctx(_epc, _event_type))		\
 			continue;					\
 		else if (_pmu && _epc->pmu != _pmu)			\
 			continue;					\
 		else
 
-static void perf_ctx_disable(struct perf_event_context *ctx, bool cgroup)
+static void perf_ctx_disable(struct perf_event_context *ctx,
+			     enum event_type_t event_type)
 {
 	struct perf_event_pmu_context *pmu_ctx;
 
-	for_each_epc(pmu_ctx, ctx, NULL, cgroup)
+	for_each_epc(pmu_ctx, ctx, NULL, event_type)
 		perf_pmu_disable(pmu_ctx->pmu);
 }
 
-static void perf_ctx_enable(struct perf_event_context *ctx, bool cgroup)
+static void perf_ctx_enable(struct perf_event_context *ctx,
+			    enum event_type_t event_type)
 {
 	struct perf_event_pmu_context *pmu_ctx;
 
-	for_each_epc(pmu_ctx, ctx, NULL, cgroup)
+	for_each_epc(pmu_ctx, ctx, NULL, event_type)
 		perf_pmu_enable(pmu_ctx->pmu);
 }
 
@@ -913,7 +923,7 @@ static void perf_cgroup_switch(struct task_struct *task)
 		return;
 
 	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
-	perf_ctx_disable(&cpuctx->ctx, true);
+	perf_ctx_disable(&cpuctx->ctx, EVENT_CGROUP);
 
 	ctx_sched_out(&cpuctx->ctx, NULL, EVENT_ALL|EVENT_CGROUP);
 	/*
@@ -929,7 +939,7 @@ static void perf_cgroup_switch(struct task_struct *task)
 	 */
 	ctx_sched_in(&cpuctx->ctx, NULL, EVENT_ALL|EVENT_CGROUP);
 
-	perf_ctx_enable(&cpuctx->ctx, true);
+	perf_ctx_enable(&cpuctx->ctx, EVENT_CGROUP);
 	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
 }
 
@@ -2796,11 +2806,11 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
 
 	event_type &= EVENT_ALL;
 
-	for_each_epc(epc, &cpuctx->ctx, pmu, false)
+	for_each_epc(epc, &cpuctx->ctx, pmu, 0)
 		perf_pmu_disable(epc->pmu);
 
 	if (task_ctx) {
-		for_each_epc(epc, task_ctx, pmu, false)
+		for_each_epc(epc, task_ctx, pmu, 0)
 			perf_pmu_disable(epc->pmu);
 
 		task_ctx_sched_out(task_ctx, pmu, event_type);
@@ -2820,11 +2830,11 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
 
 	perf_event_sched_in(cpuctx, task_ctx, pmu);
 
-	for_each_epc(epc, &cpuctx->ctx, pmu, false)
+	for_each_epc(epc, &cpuctx->ctx, pmu, 0)
 		perf_pmu_enable(epc->pmu);
 
 	if (task_ctx) {
-		for_each_epc(epc, task_ctx, pmu, false)
+		for_each_epc(epc, task_ctx, pmu, 0)
 			perf_pmu_enable(epc->pmu);
 	}
 }
@@ -3374,11 +3384,10 @@ static void
 ctx_sched_out(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t event_type)
 {
 	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
+	enum event_type_t active_type = event_type & ~EVENT_FLAGS;
 	struct perf_event_pmu_context *pmu_ctx;
 	int is_active = ctx->is_active;
-	bool cgroup = event_type & EVENT_CGROUP;
 
-	event_type &= ~EVENT_CGROUP;
 
 	lockdep_assert_held(&ctx->lock);
 
@@ -3409,7 +3418,7 @@ ctx_sched_out(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t
 	 * see __load_acquire() in perf_event_time_now()
 	 */
 	barrier();
-	ctx->is_active &= ~event_type;
+	ctx->is_active &= ~active_type;
 
 	if (!(ctx->is_active & EVENT_ALL)) {
 		/*
@@ -3430,7 +3439,7 @@ ctx_sched_out(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t
 
 	is_active ^= ctx->is_active; /* changed bits */
 
-	for_each_epc(pmu_ctx, ctx, pmu, cgroup)
+	for_each_epc(pmu_ctx, ctx, pmu, event_type)
 		__pmu_ctx_sched_out(pmu_ctx, is_active);
 }
 
@@ -3622,7 +3631,7 @@ perf_event_context_sched_out(struct task_struct *task, struct task_struct *next)
 		raw_spin_lock_nested(&next_ctx->lock, SINGLE_DEPTH_NESTING);
 		if (context_equiv(ctx, next_ctx)) {
 
-			perf_ctx_disable(ctx, false);
+			perf_ctx_disable(ctx, 0);
 
 			/* PMIs are disabled; ctx->nr_no_switch_fast is stable. */
 			if (local_read(&ctx->nr_no_switch_fast) ||
@@ -3647,7 +3656,7 @@ perf_event_context_sched_out(struct task_struct *task, struct task_struct *next)
 			perf_ctx_sched_task_cb(ctx, false);
 			perf_event_swap_task_ctx_data(ctx, next_ctx);
 
-			perf_ctx_enable(ctx, false);
+			perf_ctx_enable(ctx, 0);
 
 			/*
 			 * RCU_INIT_POINTER here is safe because we've not
@@ -3671,13 +3680,13 @@ perf_event_context_sched_out(struct task_struct *task, struct task_struct *next)
 
 	if (do_switch) {
 		raw_spin_lock(&ctx->lock);
-		perf_ctx_disable(ctx, false);
+		perf_ctx_disable(ctx, 0);
 
 inside_switch:
 		perf_ctx_sched_task_cb(ctx, false);
 		task_ctx_sched_out(ctx, NULL, EVENT_ALL);
 
-		perf_ctx_enable(ctx, false);
+		perf_ctx_enable(ctx, 0);
 		raw_spin_unlock(&ctx->lock);
 	}
 }
@@ -3981,11 +3990,9 @@ static void
 ctx_sched_in(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t event_type)
 {
 	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
+	enum event_type_t active_type = event_type & ~EVENT_FLAGS;
 	struct perf_event_pmu_context *pmu_ctx;
 	int is_active = ctx->is_active;
-	bool cgroup = event_type & EVENT_CGROUP;
-
-	event_type &= ~EVENT_CGROUP;
 
 	lockdep_assert_held(&ctx->lock);
 
@@ -4003,7 +4010,7 @@ ctx_sched_in(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t
 		barrier();
 	}
 
-	ctx->is_active |= (event_type | EVENT_TIME);
+	ctx->is_active |= active_type | EVENT_TIME;
 	if (ctx->task) {
 		if (!(is_active & EVENT_ALL))
 			cpuctx->task_ctx = ctx;
@@ -4018,13 +4025,13 @@ ctx_sched_in(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t
 	 * in order to give them the best chance of going on.
 	 */
 	if (is_active & EVENT_PINNED) {
-		for_each_epc(pmu_ctx, ctx, pmu, cgroup)
+		for_each_epc(pmu_ctx, ctx, pmu, event_type)
 			__pmu_ctx_sched_in(pmu_ctx, EVENT_PINNED);
 	}
 
 	/* Then walk through the lower prio flexible groups */
 	if (is_active & EVENT_FLEXIBLE) {
-		for_each_epc(pmu_ctx, ctx, pmu, cgroup)
+		for_each_epc(pmu_ctx, ctx, pmu, event_type)
 			__pmu_ctx_sched_in(pmu_ctx, EVENT_FLEXIBLE);
 	}
 }
@@ -4041,11 +4048,11 @@ static void perf_event_context_sched_in(struct task_struct *task)
 
 	if (cpuctx->task_ctx == ctx) {
 		perf_ctx_lock(cpuctx, ctx);
-		perf_ctx_disable(ctx, false);
+		perf_ctx_disable(ctx, 0);
 
 		perf_ctx_sched_task_cb(ctx, true);
 
-		perf_ctx_enable(ctx, false);
+		perf_ctx_enable(ctx, 0);
 		perf_ctx_unlock(cpuctx, ctx);
 		goto rcu_unlock;
 	}
@@ -4058,7 +4065,7 @@ static void perf_event_context_sched_in(struct task_struct *task)
 	if (!ctx->nr_events)
 		goto unlock;
 
-	perf_ctx_disable(ctx, false);
+	perf_ctx_disable(ctx, 0);
 	/*
 	 * We want to keep the following priority order:
 	 * cpu pinned (that don't need to move), task pinned,
@@ -4068,7 +4075,7 @@ static void perf_event_context_sched_in(struct task_struct *task)
 	 * events, no need to flip the cpuctx's events around.
 	 */
 	if (!RB_EMPTY_ROOT(&ctx->pinned_groups.tree)) {
-		perf_ctx_disable(&cpuctx->ctx, false);
+		perf_ctx_disable(&cpuctx->ctx, 0);
 		ctx_sched_out(&cpuctx->ctx, NULL, EVENT_FLEXIBLE);
 	}
 
@@ -4077,9 +4084,9 @@ static void perf_event_context_sched_in(struct task_struct *task)
 	perf_ctx_sched_task_cb(cpuctx->task_ctx, true);
 
 	if (!RB_EMPTY_ROOT(&ctx->pinned_groups.tree))
-		perf_ctx_enable(&cpuctx->ctx, false);
+		perf_ctx_enable(&cpuctx->ctx, 0);
 
-	perf_ctx_enable(ctx, false);
+	perf_ctx_enable(ctx, 0);
 
 unlock:
 	perf_ctx_unlock(cpuctx, ctx);
-- 
2.49.0.395.g12beb8f557-goog



* [PATCH v4 03/38] perf: Clean up perf ctx time
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
  2025-03-24 17:30 ` [PATCH v4 01/38] perf: Support get/put mediated PMU interfaces Mingwei Zhang
  2025-03-24 17:30 ` [PATCH v4 02/38] perf: Skip pmu_ctx based on event_type Mingwei Zhang
@ 2025-03-24 17:30 ` Mingwei Zhang
  2025-03-24 17:30 ` [PATCH v4 04/38] perf: Add a EVENT_GUEST flag Mingwei Zhang
                   ` (37 subsequent siblings)
  40 siblings, 0 replies; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Kan Liang <kan.liang@linux.intel.com>

Perf currently tracks two timestamps, one for the normal ctx and one for
the cgroup. The same type of variables and similar code are used to track
them. In a following patch, a third timestamp will be introduced to track
the guest time.
To avoid code duplication, add a new struct perf_time_ctx and factor
out a generic helper update_perf_time_ctx().

No functional change.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 include/linux/perf_event.h | 13 +++----
 kernel/events/core.c       | 70 +++++++++++++++++---------------------
 2 files changed, 39 insertions(+), 44 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 54018dd0b2a4..a2fd1bdc955c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -953,6 +953,11 @@ struct perf_event_groups {
 	u64		index;
 };
 
+struct perf_time_ctx {
+	u64		time;
+	u64		stamp;
+	u64		offset;
+};
 
 /**
  * struct perf_event_context - event context structure
@@ -992,9 +997,7 @@ struct perf_event_context {
 	/*
 	 * Context clock, runs when context enabled.
 	 */
-	u64				time;
-	u64				timestamp;
-	u64				timeoffset;
+	struct perf_time_ctx		time;
 
 	/*
 	 * These fields let us detect when two contexts have both
@@ -1085,9 +1088,7 @@ struct bpf_perf_event_data_kern {
  * This is a per-cpu dynamically allocated data structure.
  */
 struct perf_cgroup_info {
-	u64				time;
-	u64				timestamp;
-	u64				timeoffset;
+	struct perf_time_ctx		time;
 	int				active;
 };
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 8d3a0cc59fb4..e38c8b5e8086 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -770,6 +770,24 @@ static void perf_ctx_enable(struct perf_event_context *ctx,
 static void ctx_sched_out(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t event_type);
 static void ctx_sched_in(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t event_type);
 
+static inline void update_perf_time_ctx(struct perf_time_ctx *time, u64 now, bool adv)
+{
+	if (adv)
+		time->time += now - time->stamp;
+	time->stamp = now;
+
+	/*
+	 * The above: time' = time + (now - timestamp), can be re-arranged
+	 * into: time` = now + (time - timestamp), which gives a single value
+	 * offset to compute future time without locks on.
+	 *
+	 * See perf_event_time_now(), which can be used from NMI context where
+	 * it's (obviously) not possible to acquire ctx->lock in order to read
+	 * both the above values in a consistent manner.
+	 */
+	WRITE_ONCE(time->offset, time->time - time->stamp);
+}
+
 #ifdef CONFIG_CGROUP_PERF
 
 static inline bool
@@ -811,7 +829,7 @@ static inline u64 perf_cgroup_event_time(struct perf_event *event)
 	struct perf_cgroup_info *t;
 
 	t = per_cpu_ptr(event->cgrp->info, event->cpu);
-	return t->time;
+	return t->time.time;
 }
 
 static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
@@ -820,22 +838,11 @@ static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
 
 	t = per_cpu_ptr(event->cgrp->info, event->cpu);
 	if (!__load_acquire(&t->active))
-		return t->time;
-	now += READ_ONCE(t->timeoffset);
+		return t->time.time;
+	now += READ_ONCE(t->time.offset);
 	return now;
 }
 
-static inline void __update_cgrp_time(struct perf_cgroup_info *info, u64 now, bool adv)
-{
-	if (adv)
-		info->time += now - info->timestamp;
-	info->timestamp = now;
-	/*
-	 * see update_context_time()
-	 */
-	WRITE_ONCE(info->timeoffset, info->time - info->timestamp);
-}
-
 static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx, bool final)
 {
 	struct perf_cgroup *cgrp = cpuctx->cgrp;
@@ -849,7 +856,7 @@ static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx,
 			cgrp = container_of(css, struct perf_cgroup, css);
 			info = this_cpu_ptr(cgrp->info);
 
-			__update_cgrp_time(info, now, true);
+			update_perf_time_ctx(&info->time, now, true);
 			if (final)
 				__store_release(&info->active, 0);
 		}
@@ -872,7 +879,7 @@ static inline void update_cgrp_time_from_event(struct perf_event *event)
 	 * Do not update time when cgroup is not active
 	 */
 	if (info->active)
-		__update_cgrp_time(info, perf_clock(), true);
+		update_perf_time_ctx(&info->time, perf_clock(), true);
 }
 
 static inline void
@@ -896,7 +903,7 @@ perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
 	for (css = &cgrp->css; css; css = css->parent) {
 		cgrp = container_of(css, struct perf_cgroup, css);
 		info = this_cpu_ptr(cgrp->info);
-		__update_cgrp_time(info, ctx->timestamp, false);
+		update_perf_time_ctx(&info->time, ctx->time.stamp, false);
 		__store_release(&info->active, 1);
 	}
 }
@@ -1511,20 +1518,7 @@ static void __update_context_time(struct perf_event_context *ctx, bool adv)
 
 	lockdep_assert_held(&ctx->lock);
 
-	if (adv)
-		ctx->time += now - ctx->timestamp;
-	ctx->timestamp = now;
-
-	/*
-	 * The above: time' = time + (now - timestamp), can be re-arranged
-	 * into: time` = now + (time - timestamp), which gives a single value
-	 * offset to compute future time without locks on.
-	 *
-	 * See perf_event_time_now(), which can be used from NMI context where
-	 * it's (obviously) not possible to acquire ctx->lock in order to read
-	 * both the above values in a consistent manner.
-	 */
-	WRITE_ONCE(ctx->timeoffset, ctx->time - ctx->timestamp);
+	update_perf_time_ctx(&ctx->time, now, adv);
 }
 
 static void update_context_time(struct perf_event_context *ctx)
@@ -1542,7 +1536,7 @@ static u64 perf_event_time(struct perf_event *event)
 	if (is_cgroup_event(event))
 		return perf_cgroup_event_time(event);
 
-	return ctx->time;
+	return ctx->time.time;
 }
 
 static u64 perf_event_time_now(struct perf_event *event, u64 now)
@@ -1556,9 +1550,9 @@ static u64 perf_event_time_now(struct perf_event *event, u64 now)
 		return perf_cgroup_event_time_now(event, now);
 
 	if (!(__load_acquire(&ctx->is_active) & EVENT_TIME))
-		return ctx->time;
+		return ctx->time.time;
 
-	now += READ_ONCE(ctx->timeoffset);
+	now += READ_ONCE(ctx->time.offset);
 	return now;
 }
 
@@ -11533,14 +11527,14 @@ static void task_clock_event_update(struct perf_event *event, u64 now)
 
 static void task_clock_event_start(struct perf_event *event, int flags)
 {
-	local64_set(&event->hw.prev_count, event->ctx->time);
+	local64_set(&event->hw.prev_count, event->ctx->time.time);
 	perf_swevent_start_hrtimer(event);
 }
 
 static void task_clock_event_stop(struct perf_event *event, int flags)
 {
 	perf_swevent_cancel_hrtimer(event);
-	task_clock_event_update(event, event->ctx->time);
+	task_clock_event_update(event, event->ctx->time.time);
 }
 
 static int task_clock_event_add(struct perf_event *event, int flags)
@@ -11560,8 +11554,8 @@ static void task_clock_event_del(struct perf_event *event, int flags)
 static void task_clock_event_read(struct perf_event *event)
 {
 	u64 now = perf_clock();
-	u64 delta = now - event->ctx->timestamp;
-	u64 time = event->ctx->time + delta;
+	u64 delta = now - event->ctx->time.stamp;
+	u64 time = event->ctx->time.time + delta;
 
 	task_clock_event_update(event, time);
 }
-- 
2.49.0.395.g12beb8f557-goog



* [PATCH v4 04/38] perf: Add a EVENT_GUEST flag
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (2 preceding siblings ...)
  2025-03-24 17:30 ` [PATCH v4 03/38] perf: Clean up perf ctx time Mingwei Zhang
@ 2025-03-24 17:30 ` Mingwei Zhang
  2025-05-14 22:51   ` Sean Christopherson
                     ` (2 more replies)
  2025-03-24 17:30 ` [PATCH v4 05/38] perf: Add generic exclude_guest support Mingwei Zhang
                   ` (36 subsequent siblings)
  40 siblings, 3 replies; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Kan Liang <kan.liang@linux.intel.com>

Perf currently doesn't explicitly schedule out all exclude_guest events
while a guest is running. That is not a problem with the existing
emulated vPMU, because perf owns all the PMU counters. It can mask a
counter that is assigned to an exclude_guest event while a guest is
running (the Intel way), or set the corresponding HostOnly bit in the
event selector (the AMD way). The counter then doesn't count while the
guest is running.

However, neither way works with the introduced passthrough vPMU. A guest
owns all the PMU counters while it's running, so the host must not mask
any counters: a counter may be in use by the guest, and its event
selector may be overwritten.

Perf should explicitly schedule out all exclude_guest events to release
the PMU resources when entering a guest, and resume counting when
exiting the guest.

It's possible that an exclude_guest event is created while a guest is
running. Such a new event should not be scheduled in either.

The ctx time is shared among different PMUs. The time cannot be stopped
when a guest is running, because it is required to calculate the time for
events from other PMUs, e.g., uncore events. Add timeguest to track the
guest run time. For an exclude_guest event, the elapsed time equals
the ctx time minus the guest time.
Cgroups have dedicated times. Use the same method to deduct the guest
time from the cgroup time as well.
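
For illustration, the accounting for a single event reduces to the
subtraction below (a sketch of the arithmetic only; the real bookkeeping
lives in __perf_event_time_ctx() in the diff):

  /* e.g. ctx time = 10ms, of which 6ms were spent running the guest */
  static u64 event_elapsed_time(bool exclude_guest, u64 ctx_time,
                                u64 guest_time)
  {
          if (exclude_guest)
                  return ctx_time - guest_time;   /* 4ms */

          return ctx_time;                        /* 10ms */
  }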

Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 include/linux/perf_event.h |   6 ++
 kernel/events/core.c       | 209 +++++++++++++++++++++++++++++--------
 2 files changed, 169 insertions(+), 46 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index a2fd1bdc955c..7bda1e20be12 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -999,6 +999,11 @@ struct perf_event_context {
 	 */
 	struct perf_time_ctx		time;
 
+	/*
+	 * Context clock, runs when in the guest mode.
+	 */
+	struct perf_time_ctx		timeguest;
+
 	/*
 	 * These fields let us detect when two contexts have both
 	 * been cloned (inherited) from a common ancestor.
@@ -1089,6 +1094,7 @@ struct bpf_perf_event_data_kern {
  */
 struct perf_cgroup_info {
 	struct perf_time_ctx		time;
+	struct perf_time_ctx		timeguest;
 	int				active;
 };
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index e38c8b5e8086..7a2115b2c5c1 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -163,7 +163,8 @@ enum event_type_t {
 	/* see ctx_resched() for details */
 	EVENT_CPU	= 0x10,
 	EVENT_CGROUP	= 0x20,
-	EVENT_FLAGS	= EVENT_CGROUP,
+	EVENT_GUEST	= 0x40,
+	EVENT_FLAGS	= EVENT_CGROUP | EVENT_GUEST,
 	/* compound helpers */
 	EVENT_ALL         = EVENT_FLEXIBLE | EVENT_PINNED,
 	EVENT_TIME_FROZEN = EVENT_TIME | EVENT_FROZEN,
@@ -435,6 +436,7 @@ static atomic_t nr_include_guest_events __read_mostly;
 
 static atomic_t nr_mediated_pmu_vms;
 static DEFINE_MUTEX(perf_mediated_pmu_mutex);
+static DEFINE_PER_CPU(bool, perf_in_guest);
 
 /* !exclude_guest event of PMU with PERF_PMU_CAP_MEDIATED_VPMU */
 static inline bool is_include_guest_event(struct perf_event *event)
@@ -738,6 +740,9 @@ static bool perf_skip_pmu_ctx(struct perf_event_pmu_context *pmu_ctx,
 {
 	if ((event_type & EVENT_CGROUP) && !pmu_ctx->nr_cgroups)
 		return true;
+	if ((event_type & EVENT_GUEST) &&
+	    !(pmu_ctx->pmu->capabilities & PERF_PMU_CAP_MEDIATED_VPMU))
+		return true;
 	return false;
 }
 
@@ -788,6 +793,39 @@ static inline void update_perf_time_ctx(struct perf_time_ctx *time, u64 now, boo
 	WRITE_ONCE(time->offset, time->time - time->stamp);
 }
 
+static_assert(offsetof(struct perf_event_context, timeguest) -
+	      offsetof(struct perf_event_context, time) ==
+	      sizeof(struct perf_time_ctx));
+
+#define T_TOTAL		0
+#define T_GUEST		1
+
+static inline u64 __perf_event_time_ctx(struct perf_event *event,
+					struct perf_time_ctx *times)
+{
+	u64 time = times[T_TOTAL].time;
+
+	if (event->attr.exclude_guest)
+		time -= times[T_GUEST].time;
+
+	return time;
+}
+
+static inline u64 __perf_event_time_ctx_now(struct perf_event *event,
+					    struct perf_time_ctx *times,
+					    u64 now)
+{
+	if (event->attr.exclude_guest && __this_cpu_read(perf_in_guest)) {
+		/*
+		 * (now + times[total].offset) - (now + times[guest].offset) :=
+		 * times[total].offset - times[guest].offset
+		 */
+		return READ_ONCE(times[T_TOTAL].offset) - READ_ONCE(times[T_GUEST].offset);
+	}
+
+	return now + READ_ONCE(times[T_TOTAL].offset);
+}
+
 #ifdef CONFIG_CGROUP_PERF
 
 static inline bool
@@ -824,12 +862,16 @@ static inline int is_cgroup_event(struct perf_event *event)
 	return event->cgrp != NULL;
 }
 
+static_assert(offsetof(struct perf_cgroup_info, timeguest) -
+	      offsetof(struct perf_cgroup_info, time) ==
+	      sizeof(struct perf_time_ctx));
+
 static inline u64 perf_cgroup_event_time(struct perf_event *event)
 {
 	struct perf_cgroup_info *t;
 
 	t = per_cpu_ptr(event->cgrp->info, event->cpu);
-	return t->time.time;
+	return __perf_event_time_ctx(event, &t->time);
 }
 
 static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
@@ -838,9 +880,21 @@ static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
 
 	t = per_cpu_ptr(event->cgrp->info, event->cpu);
 	if (!__load_acquire(&t->active))
-		return t->time.time;
-	now += READ_ONCE(t->time.offset);
-	return now;
+		return __perf_event_time_ctx(event, &t->time);
+
+	return __perf_event_time_ctx_now(event, &t->time, now);
+}
+
+static inline void __update_cgrp_guest_time(struct perf_cgroup_info *info, u64 now, bool adv)
+{
+	update_perf_time_ctx(&info->timeguest, now, adv);
+}
+
+static inline void update_cgrp_time(struct perf_cgroup_info *info, u64 now)
+{
+	update_perf_time_ctx(&info->time, now, true);
+	if (__this_cpu_read(perf_in_guest))
+		__update_cgrp_guest_time(info, now, true);
 }
 
 static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx, bool final)
@@ -856,7 +910,7 @@ static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx,
 			cgrp = container_of(css, struct perf_cgroup, css);
 			info = this_cpu_ptr(cgrp->info);
 
-			update_perf_time_ctx(&info->time, now, true);
+			update_cgrp_time(info, now);
 			if (final)
 				__store_release(&info->active, 0);
 		}
@@ -879,11 +933,11 @@ static inline void update_cgrp_time_from_event(struct perf_event *event)
 	 * Do not update time when cgroup is not active
 	 */
 	if (info->active)
-		update_perf_time_ctx(&info->time, perf_clock(), true);
+		update_cgrp_time(info, perf_clock());
 }
 
 static inline void
-perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
+perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx, bool guest)
 {
 	struct perf_event_context *ctx = &cpuctx->ctx;
 	struct perf_cgroup *cgrp = cpuctx->cgrp;
@@ -903,8 +957,12 @@ perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
 	for (css = &cgrp->css; css; css = css->parent) {
 		cgrp = container_of(css, struct perf_cgroup, css);
 		info = this_cpu_ptr(cgrp->info);
-		update_perf_time_ctx(&info->time, ctx->time.stamp, false);
-		__store_release(&info->active, 1);
+		if (guest) {
+			__update_cgrp_guest_time(info, ctx->time.stamp, false);
+		} else {
+			update_perf_time_ctx(&info->time, ctx->time.stamp, false);
+			__store_release(&info->active, 1);
+		}
 	}
 }
 
@@ -1104,7 +1162,7 @@ static inline int perf_cgroup_connect(pid_t pid, struct perf_event *event,
 }
 
 static inline void
-perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
+perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx, bool guest)
 {
 }
 
@@ -1514,16 +1572,24 @@ static void perf_unpin_context(struct perf_event_context *ctx)
  */
 static void __update_context_time(struct perf_event_context *ctx, bool adv)
 {
-	u64 now = perf_clock();
+	lockdep_assert_held(&ctx->lock);
+
+	update_perf_time_ctx(&ctx->time, perf_clock(), adv);
+}
 
+static void __update_context_guest_time(struct perf_event_context *ctx, bool adv)
+{
 	lockdep_assert_held(&ctx->lock);
 
-	update_perf_time_ctx(&ctx->time, now, adv);
+	/* must be called after __update_context_time(); */
+	update_perf_time_ctx(&ctx->timeguest, ctx->time.stamp, adv);
 }
 
 static void update_context_time(struct perf_event_context *ctx)
 {
 	__update_context_time(ctx, true);
+	if (__this_cpu_read(perf_in_guest))
+		__update_context_guest_time(ctx, true);
 }
 
 static u64 perf_event_time(struct perf_event *event)
@@ -1536,7 +1602,7 @@ static u64 perf_event_time(struct perf_event *event)
 	if (is_cgroup_event(event))
 		return perf_cgroup_event_time(event);
 
-	return ctx->time.time;
+	return __perf_event_time_ctx(event, &ctx->time);
 }
 
 static u64 perf_event_time_now(struct perf_event *event, u64 now)
@@ -1550,10 +1616,9 @@ static u64 perf_event_time_now(struct perf_event *event, u64 now)
 		return perf_cgroup_event_time_now(event, now);
 
 	if (!(__load_acquire(&ctx->is_active) & EVENT_TIME))
-		return ctx->time.time;
+		return __perf_event_time_ctx(event, &ctx->time);
 
-	now += READ_ONCE(ctx->time.offset);
-	return now;
+	return __perf_event_time_ctx_now(event, &ctx->time, now);
 }
 
 static enum event_type_t get_event_type(struct perf_event *event)
@@ -2384,20 +2449,23 @@ group_sched_out(struct perf_event *group_event, struct perf_event_context *ctx)
 }
 
 static inline void
-__ctx_time_update(struct perf_cpu_context *cpuctx, struct perf_event_context *ctx, bool final)
+__ctx_time_update(struct perf_cpu_context *cpuctx, struct perf_event_context *ctx,
+		  bool final, enum event_type_t event_type)
 {
 	if (ctx->is_active & EVENT_TIME) {
 		if (ctx->is_active & EVENT_FROZEN)
 			return;
+
 		update_context_time(ctx);
-		update_cgrp_time_from_cpuctx(cpuctx, final);
+		/* vPMU should not stop time */
+		update_cgrp_time_from_cpuctx(cpuctx, !(event_type & EVENT_GUEST) && final);
 	}
 }
 
 static inline void
 ctx_time_update(struct perf_cpu_context *cpuctx, struct perf_event_context *ctx)
 {
-	__ctx_time_update(cpuctx, ctx, false);
+	__ctx_time_update(cpuctx, ctx, false, 0);
 }
 
 /*
@@ -3405,7 +3473,7 @@ ctx_sched_out(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t
 	 *
 	 * would only update time for the pinned events.
 	 */
-	__ctx_time_update(cpuctx, ctx, ctx == &cpuctx->ctx);
+	__ctx_time_update(cpuctx, ctx, ctx == &cpuctx->ctx, event_type);
 
 	/*
 	 * CPU-release for the below ->is_active store,
@@ -3431,7 +3499,18 @@ ctx_sched_out(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t
 			cpuctx->task_ctx = NULL;
 	}
 
-	is_active ^= ctx->is_active; /* changed bits */
+	if (event_type & EVENT_GUEST) {
+		/*
+		 * Schedule out all exclude_guest events of PMU
+		 * with PERF_PMU_CAP_MEDIATED_VPMU.
+		 */
+		is_active = EVENT_ALL;
+		__update_context_guest_time(ctx, false);
+		perf_cgroup_set_timestamp(cpuctx, true);
+		barrier();
+	} else {
+		is_active ^= ctx->is_active; /* changed bits */
+	}
 
 	for_each_epc(pmu_ctx, ctx, pmu, event_type)
 		__pmu_ctx_sched_out(pmu_ctx, is_active);
@@ -3926,10 +4005,15 @@ static inline void group_update_userpage(struct perf_event *group_event)
 		event_update_userpage(event);
 }
 
+struct merge_sched_data {
+	int can_add_hw;
+	enum event_type_t event_type;
+};
+
 static int merge_sched_in(struct perf_event *event, void *data)
 {
 	struct perf_event_context *ctx = event->ctx;
-	int *can_add_hw = data;
+	struct merge_sched_data *msd = data;
 
 	if (event->state <= PERF_EVENT_STATE_OFF)
 		return 0;
@@ -3937,13 +4021,22 @@ static int merge_sched_in(struct perf_event *event, void *data)
 	if (!event_filter_match(event))
 		return 0;
 
-	if (group_can_go_on(event, *can_add_hw)) {
+	/*
+	 * Don't schedule in any host events from PMU with
+	 * PERF_PMU_CAP_MEDIATED_VPMU, while a guest is running.
+	 */
+	if (__this_cpu_read(perf_in_guest) &&
+	    event->pmu_ctx->pmu->capabilities & PERF_PMU_CAP_MEDIATED_VPMU &&
+	    !(msd->event_type & EVENT_GUEST))
+		return 0;
+
+	if (group_can_go_on(event, msd->can_add_hw)) {
 		if (!group_sched_in(event, ctx))
 			list_add_tail(&event->active_list, get_event_list(event));
 	}
 
 	if (event->state == PERF_EVENT_STATE_INACTIVE) {
-		*can_add_hw = 0;
+		msd->can_add_hw = 0;
 		if (event->attr.pinned) {
 			perf_cgroup_event_disable(event, ctx);
 			perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
@@ -3962,11 +4055,15 @@ static int merge_sched_in(struct perf_event *event, void *data)
 
 static void pmu_groups_sched_in(struct perf_event_context *ctx,
 				struct perf_event_groups *groups,
-				struct pmu *pmu)
+				struct pmu *pmu,
+				enum event_type_t event_type)
 {
-	int can_add_hw = 1;
+	struct merge_sched_data msd = {
+		.can_add_hw = 1,
+		.event_type = event_type,
+	};
 	visit_groups_merge(ctx, groups, smp_processor_id(), pmu,
-			   merge_sched_in, &can_add_hw);
+			   merge_sched_in, &msd);
 }
 
 static void __pmu_ctx_sched_in(struct perf_event_pmu_context *pmu_ctx,
@@ -3975,9 +4072,9 @@ static void __pmu_ctx_sched_in(struct perf_event_pmu_context *pmu_ctx,
 	struct perf_event_context *ctx = pmu_ctx->ctx;
 
 	if (event_type & EVENT_PINNED)
-		pmu_groups_sched_in(ctx, &ctx->pinned_groups, pmu_ctx->pmu);
+		pmu_groups_sched_in(ctx, &ctx->pinned_groups, pmu_ctx->pmu, event_type);
 	if (event_type & EVENT_FLEXIBLE)
-		pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu_ctx->pmu);
+		pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu_ctx->pmu, event_type);
 }
 
 static void
@@ -3994,9 +4091,11 @@ ctx_sched_in(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t
 		return;
 
 	if (!(is_active & EVENT_TIME)) {
+		/* EVENT_TIME should be active while the guest runs */
+		WARN_ON_ONCE(event_type & EVENT_GUEST);
 		/* start ctx time */
 		__update_context_time(ctx, false);
-		perf_cgroup_set_timestamp(cpuctx);
+		perf_cgroup_set_timestamp(cpuctx, false);
 		/*
 		 * CPU-release for the below ->is_active store,
 		 * see __load_acquire() in perf_event_time_now()
@@ -4012,7 +4111,23 @@ ctx_sched_in(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t
 			WARN_ON_ONCE(cpuctx->task_ctx != ctx);
 	}
 
-	is_active ^= ctx->is_active; /* changed bits */
+	if (event_type & EVENT_GUEST) {
+		/*
+		 * Schedule in the required exclude_guest events of PMU
+		 * with PERF_PMU_CAP_MEDIATED_VPMU.
+		 */
+		is_active = event_type & EVENT_ALL;
+
+		/*
+		 * Update ctx time to set the new start time for
+		 * the exclude_guest events.
+		 */
+		update_context_time(ctx);
+		update_cgrp_time_from_cpuctx(cpuctx, false);
+		barrier();
+	} else {
+		is_active ^= ctx->is_active; /* changed bits */
+	}
 
 	/*
 	 * First go through the list and put on any pinned groups
@@ -4020,13 +4135,13 @@ ctx_sched_in(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t
 	 */
 	if (is_active & EVENT_PINNED) {
 		for_each_epc(pmu_ctx, ctx, pmu, event_type)
-			__pmu_ctx_sched_in(pmu_ctx, EVENT_PINNED);
+			__pmu_ctx_sched_in(pmu_ctx, EVENT_PINNED | (event_type & EVENT_GUEST));
 	}
 
 	/* Then walk through the lower prio flexible groups */
 	if (is_active & EVENT_FLEXIBLE) {
 		for_each_epc(pmu_ctx, ctx, pmu, event_type)
-			__pmu_ctx_sched_in(pmu_ctx, EVENT_FLEXIBLE);
+			__pmu_ctx_sched_in(pmu_ctx, EVENT_FLEXIBLE | (event_type & EVENT_GUEST));
 	}
 }
 
@@ -6285,23 +6400,25 @@ void perf_event_update_userpage(struct perf_event *event)
 	if (!rb)
 		goto unlock;
 
-	/*
-	 * compute total_time_enabled, total_time_running
-	 * based on snapshot values taken when the event
-	 * was last scheduled in.
-	 *
-	 * we cannot simply called update_context_time()
-	 * because of locking issue as we can be called in
-	 * NMI context
-	 */
-	calc_timer_values(event, &now, &enabled, &running);
-
-	userpg = rb->user_page;
 	/*
 	 * Disable preemption to guarantee consistent time stamps are stored to
 	 * the user page.
 	 */
 	preempt_disable();
+
+	/*
+	 * compute total_time_enabled, total_time_running
+	 * based on snapshot values taken when the event
+	 * was last scheduled in.
+	 *
+	 * we cannot simply called update_context_time()
+	 * because of locking issue as we can be called in
+	 * NMI context
+	 */
+	calc_timer_values(event, &now, &enabled, &running);
+
+	userpg = rb->user_page;
+
 	++userpg->lock;
 	barrier();
 	userpg->index = perf_event_index(event);
-- 
2.49.0.395.g12beb8f557-goog



* [PATCH v4 05/38] perf: Add generic exclude_guest support
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (3 preceding siblings ...)
  2025-03-24 17:30 ` [PATCH v4 04/38] perf: Add a EVENT_GUEST flag Mingwei Zhang
@ 2025-03-24 17:30 ` Mingwei Zhang
  2025-04-25 11:13   ` Peter Zijlstra
  2025-05-21 19:55   ` Namhyung Kim
  2025-03-24 17:30 ` [PATCH v4 06/38] x86/irq: Factor out common code for installing kvm irq handler Mingwei Zhang
                   ` (35 subsequent siblings)
  40 siblings, 2 replies; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Kan Liang <kan.liang@linux.intel.com>

Only KVM knows the exact time when a guest is entering or exiting. Expose
two interfaces to KVM to switch the ownership of the PMU resources.

All the pinned events must be scheduled in first. Extend the
perf_event_sched_in() helper to support an extra flag, e.g., EVENT_GUEST.
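
For illustration, a KVM-side caller is expected to bracket the guest run
roughly as sketched below (not part of this patch; the actual KVM hook-up
comes later in the series, and kvm_run_vcpu_hw() is a placeholder):

  static void kvm_mediated_pmu_run(struct kvm_vcpu *vcpu)
  {
          /* Both interfaces must be called with IRQs disabled. */
          local_irq_disable();

          /* Schedule out all host exclude_guest events. */
          perf_guest_enter();

          kvm_run_vcpu_hw(vcpu);          /* placeholder: enter the guest */

          /* Schedule the host exclude_guest events back in. */
          perf_guest_exit();

          local_irq_enable();
  }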

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 include/linux/perf_event.h |  4 ++
 kernel/events/core.c       | 80 ++++++++++++++++++++++++++++++++++----
 2 files changed, 77 insertions(+), 7 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 7bda1e20be12..37187ee8e226 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1822,6 +1822,8 @@ extern int perf_event_period(struct perf_event *event, u64 value);
 extern u64 perf_event_pause(struct perf_event *event, bool reset);
 int perf_get_mediated_pmu(void);
 void perf_put_mediated_pmu(void);
+void perf_guest_enter(void);
+void perf_guest_exit(void);
 #else /* !CONFIG_PERF_EVENTS: */
 static inline void *
 perf_aux_output_begin(struct perf_output_handle *handle,
@@ -1919,6 +1921,8 @@ static inline int perf_get_mediated_pmu(void)
 }
 
 static inline void perf_put_mediated_pmu(void)			{ }
+static inline void perf_guest_enter(void)			{ }
+static inline void perf_guest_exit(void)			{ }
 #endif
 
 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7a2115b2c5c1..d05487d465c9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2827,14 +2827,15 @@ static void task_ctx_sched_out(struct perf_event_context *ctx,
 
 static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
 				struct perf_event_context *ctx,
-				struct pmu *pmu)
+				struct pmu *pmu,
+				enum event_type_t event_type)
 {
-	ctx_sched_in(&cpuctx->ctx, pmu, EVENT_PINNED);
+	ctx_sched_in(&cpuctx->ctx, pmu, EVENT_PINNED | event_type);
 	if (ctx)
-		 ctx_sched_in(ctx, pmu, EVENT_PINNED);
-	ctx_sched_in(&cpuctx->ctx, pmu, EVENT_FLEXIBLE);
+		ctx_sched_in(ctx, pmu, EVENT_PINNED | event_type);
+	ctx_sched_in(&cpuctx->ctx, pmu, EVENT_FLEXIBLE | event_type);
 	if (ctx)
-		 ctx_sched_in(ctx, pmu, EVENT_FLEXIBLE);
+		ctx_sched_in(ctx, pmu, EVENT_FLEXIBLE | event_type);
 }
 
 /*
@@ -2890,7 +2891,7 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
 	else if (event_type & EVENT_PINNED)
 		ctx_sched_out(&cpuctx->ctx, pmu, EVENT_FLEXIBLE);
 
-	perf_event_sched_in(cpuctx, task_ctx, pmu);
+	perf_event_sched_in(cpuctx, task_ctx, pmu, 0);
 
 	for_each_epc(epc, &cpuctx->ctx, pmu, 0)
 		perf_pmu_enable(epc->pmu);
@@ -4188,7 +4189,7 @@ static void perf_event_context_sched_in(struct task_struct *task)
 		ctx_sched_out(&cpuctx->ctx, NULL, EVENT_FLEXIBLE);
 	}
 
-	perf_event_sched_in(cpuctx, ctx, NULL);
+	perf_event_sched_in(cpuctx, ctx, NULL, 0);
 
 	perf_ctx_sched_task_cb(cpuctx->task_ctx, true);
 
@@ -6040,6 +6041,71 @@ void perf_put_mediated_pmu(void)
 }
 EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
 
+static inline void perf_host_exit(struct perf_cpu_context *cpuctx)
+{
+	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
+	ctx_sched_out(&cpuctx->ctx, NULL, EVENT_GUEST);
+	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
+	if (cpuctx->task_ctx) {
+		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
+		task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);
+		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
+	}
+}
+
+/* When entering a guest, schedule out all exclude_guest events. */
+void perf_guest_enter(void)
+{
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
+
+	lockdep_assert_irqs_disabled();
+
+	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+	if (WARN_ON_ONCE(__this_cpu_read(perf_in_guest)))
+		goto unlock;
+
+	perf_host_exit(cpuctx);
+
+	__this_cpu_write(perf_in_guest, true);
+
+unlock:
+	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+}
+EXPORT_SYMBOL_GPL(perf_guest_enter);
+
+static inline void perf_host_enter(struct perf_cpu_context *cpuctx)
+{
+	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
+	if (cpuctx->task_ctx)
+		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
+
+	perf_event_sched_in(cpuctx, cpuctx->task_ctx, NULL, EVENT_GUEST);
+
+	if (cpuctx->task_ctx)
+		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
+	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
+}
+
+void perf_guest_exit(void)
+{
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
+
+	lockdep_assert_irqs_disabled();
+
+	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+	if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
+		goto unlock;
+
+	perf_host_enter(cpuctx);
+
+	__this_cpu_write(perf_in_guest, false);
+unlock:
+	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+}
+EXPORT_SYMBOL_GPL(perf_guest_exit);
+
 /*
  * Holding the top-level event's child_mutex means that any
  * descendant process that has inherited this event will block
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 06/38] x86/irq: Factor out common code for installing kvm irq handler
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (4 preceding siblings ...)
  2025-03-24 17:30 ` [PATCH v4 05/38] perf: Add generic exclude_guest support Mingwei Zhang
@ 2025-03-24 17:30 ` Mingwei Zhang
  2025-05-14 23:21   ` Sean Christopherson
  2025-03-24 17:30 ` [PATCH v4 07/38] perf: core/x86: Register a new vector for KVM GUEST PMI Mingwei Zhang
                   ` (34 subsequent siblings)
  40 siblings, 1 reply; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Xiong Zhang <xiong.y.zhang@linux.intel.com>

KVM will register irq handlers for both POSTED_INTR_WAKEUP_VECTOR and
KVM_GUEST_PMI_VECTOR. Rename the existing kvm_set_posted_intr_wakeup_handler()
to x86_set_kvm_irq_handler() and add a vector input parameter to distinguish
POSTED_INTR_WAKEUP_VECTOR from KVM_GUEST_PMI_VECTOR.

The caller should call x86_set_kvm_irq_handler() only once to register a
non-dummy handler for each vector. If a non-dummy handler is already
registered for a vector and the caller registers the same or a different
non-dummy handler again, the second call triggers a warning.
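
For illustration, a sketch of the register-once semantics (other_handler is a
placeholder; pi_wakeup_handler is the handler used below in vmx.c):

/* Register a real handler once per vector, e.g. at hardware setup. */
x86_set_kvm_irq_handler(POSTED_INTR_WAKEUP_VECTOR, pi_wakeup_handler);

/* Registering another non-dummy handler for the same vector only warns. */
x86_set_kvm_irq_handler(POSTED_INTR_WAKEUP_VECTOR, other_handler); /* WARN_ON_ONCE */

/* Passing NULL restores the dummy handler and synchronizes RCU. */
x86_set_kvm_irq_handler(POSTED_INTR_WAKEUP_VECTOR, NULL);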

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/include/asm/irq.h |  2 +-
 arch/x86/kernel/irq.c      | 18 ++++++++++++------
 arch/x86/kvm/vmx/vmx.c     |  4 ++--
 3 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
index 194dfff84cb1..050a247b69b4 100644
--- a/arch/x86/include/asm/irq.h
+++ b/arch/x86/include/asm/irq.h
@@ -30,7 +30,7 @@ struct irq_desc;
 extern void fixup_irqs(void);
 
 #if IS_ENABLED(CONFIG_KVM)
-extern void kvm_set_posted_intr_wakeup_handler(void (*handler)(void));
+void x86_set_kvm_irq_handler(u8 vector, void (*handler)(void));
 #endif
 
 extern void (*x86_platform_ipi_callback)(void);
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 385e3a5fc304..18cd418fe106 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -312,16 +312,22 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_x86_platform_ipi)
 static void dummy_handler(void) {}
 static void (*kvm_posted_intr_wakeup_handler)(void) = dummy_handler;
 
-void kvm_set_posted_intr_wakeup_handler(void (*handler)(void))
+void x86_set_kvm_irq_handler(u8 vector, void (*handler)(void))
 {
-	if (handler)
+	if (!handler)
+		handler = dummy_handler;
+
+	if (vector == POSTED_INTR_WAKEUP_VECTOR &&
+	    (handler == dummy_handler ||
+	     kvm_posted_intr_wakeup_handler == dummy_handler))
 		kvm_posted_intr_wakeup_handler = handler;
-	else {
-		kvm_posted_intr_wakeup_handler = dummy_handler;
+	else
+		WARN_ON_ONCE(1);
+
+	if (handler == dummy_handler)
 		synchronize_rcu();
-	}
 }
-EXPORT_SYMBOL_GPL(kvm_set_posted_intr_wakeup_handler);
+EXPORT_SYMBOL_GPL(x86_set_kvm_irq_handler);
 
 /*
  * Handler for POSTED_INTERRUPT_VECTOR.
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6c56d5235f0f..00ac94535c21 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8279,7 +8279,7 @@ void vmx_migrate_timers(struct kvm_vcpu *vcpu)
 
 void vmx_hardware_unsetup(void)
 {
-	kvm_set_posted_intr_wakeup_handler(NULL);
+	x86_set_kvm_irq_handler(POSTED_INTR_WAKEUP_VECTOR, NULL);
 
 	if (nested)
 		nested_vmx_hardware_unsetup();
@@ -8583,7 +8583,7 @@ __init int vmx_hardware_setup(void)
 	if (r && nested)
 		nested_vmx_hardware_unsetup();
 
-	kvm_set_posted_intr_wakeup_handler(pi_wakeup_handler);
+	x86_set_kvm_irq_handler(POSTED_INTR_WAKEUP_VECTOR, pi_wakeup_handler);
 
 	return r;
 }
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 07/38] perf: core/x86: Register a new vector for KVM GUEST PMI
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (5 preceding siblings ...)
  2025-03-24 17:30 ` [PATCH v4 06/38] x86/irq: Factor out common code for installing kvm irq handler Mingwei Zhang
@ 2025-03-24 17:30 ` Mingwei Zhang
  2025-05-14 23:24   ` Sean Christopherson
  2025-03-24 17:30 ` [PATCH v4 08/38] KVM: x86/pmu: Register KVM_GUEST_PMI_VECTOR handler Mingwei Zhang
                   ` (33 subsequent siblings)
  40 siblings, 1 reply; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Xiong Zhang <xiong.y.zhang@linux.intel.com>

Create a new vector in the host IDT for KVM guest PMI handling within the
mediated passthrough vPMU. In addition, add guest PMI handler registration
to x86_set_kvm_irq_handler().

This is preparatory work to let the mediated passthrough vPMU handle KVM
guest PMIs without interference from the host PMU's PMI handler.

Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/include/asm/hardirq.h                |  1 +
 arch/x86/include/asm/idtentry.h               |  1 +
 arch/x86/include/asm/irq_vectors.h            |  5 ++++-
 arch/x86/kernel/idt.c                         |  1 +
 arch/x86/kernel/irq.c                         | 21 +++++++++++++++++++
 .../beauty/arch/x86/include/asm/irq_vectors.h |  5 ++++-
 6 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 6ffa8b75f4cd..25fac35b9a29 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -19,6 +19,7 @@ typedef struct {
 	unsigned int kvm_posted_intr_ipis;
 	unsigned int kvm_posted_intr_wakeup_ipis;
 	unsigned int kvm_posted_intr_nested_ipis;
+	unsigned int kvm_guest_pmis;
 #endif
 	unsigned int x86_platform_ipis;	/* arch dependent */
 	unsigned int apic_perf_irqs;
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index ad5c68f0509d..b0cb3220e1bb 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -745,6 +745,7 @@ DECLARE_IDTENTRY_SYSVEC(IRQ_WORK_VECTOR,		sysvec_irq_work);
 DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_VECTOR,		sysvec_kvm_posted_intr_ipi);
 DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_WAKEUP_VECTOR,	sysvec_kvm_posted_intr_wakeup_ipi);
 DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_NESTED_VECTOR,	sysvec_kvm_posted_intr_nested_ipi);
+DECLARE_IDTENTRY_SYSVEC(KVM_GUEST_PMI_VECTOR,	        sysvec_kvm_guest_pmi_handler);
 #else
 # define fred_sysvec_kvm_posted_intr_ipi		NULL
 # define fred_sysvec_kvm_posted_intr_wakeup_ipi		NULL
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 47051871b436..250cdab11306 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -77,7 +77,10 @@
  */
 #define IRQ_WORK_VECTOR			0xf6
 
-/* 0xf5 - unused, was UV_BAU_MESSAGE */
+#if IS_ENABLED(CONFIG_KVM)
+#define KVM_GUEST_PMI_VECTOR		0xf5
+#endif
+
 #define DEFERRED_ERROR_VECTOR		0xf4
 
 /* Vector on which hypervisor callbacks will be delivered */
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index f445bec516a0..0bec4c7e2308 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -157,6 +157,7 @@ static const __initconst struct idt_data apic_idts[] = {
 	INTG(POSTED_INTR_VECTOR,		asm_sysvec_kvm_posted_intr_ipi),
 	INTG(POSTED_INTR_WAKEUP_VECTOR,		asm_sysvec_kvm_posted_intr_wakeup_ipi),
 	INTG(POSTED_INTR_NESTED_VECTOR,		asm_sysvec_kvm_posted_intr_nested_ipi),
+	INTG(KVM_GUEST_PMI_VECTOR,		asm_sysvec_kvm_guest_pmi_handler),
 # endif
 # ifdef CONFIG_IRQ_WORK
 	INTG(IRQ_WORK_VECTOR,			asm_sysvec_irq_work),
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 18cd418fe106..b29714e23fc4 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -183,6 +183,12 @@ int arch_show_interrupts(struct seq_file *p, int prec)
 		seq_printf(p, "%10u ",
 			   irq_stats(j)->kvm_posted_intr_wakeup_ipis);
 	seq_puts(p, "  Posted-interrupt wakeup event\n");
+
+	seq_printf(p, "%*s: ", prec, "VPMU");
+	for_each_online_cpu(j)
+		seq_printf(p, "%10u ",
+			   irq_stats(j)->kvm_guest_pmis);
+	seq_puts(p, " KVM GUEST PMI\n");
 #endif
 #ifdef CONFIG_X86_POSTED_MSI
 	seq_printf(p, "%*s: ", prec, "PMN");
@@ -311,6 +317,7 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_x86_platform_ipi)
 #if IS_ENABLED(CONFIG_KVM)
 static void dummy_handler(void) {}
 static void (*kvm_posted_intr_wakeup_handler)(void) = dummy_handler;
+static void (*kvm_guest_pmi_handler)(void) = dummy_handler;
 
 void x86_set_kvm_irq_handler(u8 vector, void (*handler)(void))
 {
@@ -321,6 +328,10 @@ void x86_set_kvm_irq_handler(u8 vector, void (*handler)(void))
 	    (handler == dummy_handler ||
 	     kvm_posted_intr_wakeup_handler == dummy_handler))
 		kvm_posted_intr_wakeup_handler = handler;
+	else if (vector == KVM_GUEST_PMI_VECTOR &&
+		 (handler == dummy_handler ||
+		  kvm_guest_pmi_handler == dummy_handler))
+		kvm_guest_pmi_handler = handler;
 	else
 		WARN_ON_ONCE(1);
 
@@ -356,6 +367,16 @@ DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_kvm_posted_intr_nested_ipi)
 	apic_eoi();
 	inc_irq_stat(kvm_posted_intr_nested_ipis);
 }
+
+/*
+ * Handler for KVM_GUEST_PMI_VECTOR.
+ */
+DEFINE_IDTENTRY_SYSVEC(sysvec_kvm_guest_pmi_handler)
+{
+	apic_eoi();
+	inc_irq_stat(kvm_guest_pmis);
+	kvm_guest_pmi_handler();
+}
 #endif
 
 #ifdef CONFIG_X86_POSTED_MSI
diff --git a/tools/perf/trace/beauty/arch/x86/include/asm/irq_vectors.h b/tools/perf/trace/beauty/arch/x86/include/asm/irq_vectors.h
index 47051871b436..250cdab11306 100644
--- a/tools/perf/trace/beauty/arch/x86/include/asm/irq_vectors.h
+++ b/tools/perf/trace/beauty/arch/x86/include/asm/irq_vectors.h
@@ -77,7 +77,10 @@
  */
 #define IRQ_WORK_VECTOR			0xf6
 
-/* 0xf5 - unused, was UV_BAU_MESSAGE */
+#if IS_ENABLED(CONFIG_KVM)
+#define KVM_GUEST_PMI_VECTOR		0xf5
+#endif
+
 #define DEFERRED_ERROR_VECTOR		0xf4
 
 /* Vector on which hypervisor callbacks will be delivered */
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 08/38] KVM: x86/pmu: Register KVM_GUEST_PMI_VECTOR handler
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (6 preceding siblings ...)
  2025-03-24 17:30 ` [PATCH v4 07/38] perf: core/x86: Register a new vector for KVM GUEST PMI Mingwei Zhang
@ 2025-03-24 17:30 ` Mingwei Zhang
  2025-03-24 17:30 ` [PATCH v4 09/38] perf: Add switch_guest_ctx() interface Mingwei Zhang
                   ` (32 subsequent siblings)
  40 siblings, 0 replies; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Xiong Zhang <xiong.y.zhang@linux.intel.com>

Add functions to register/unregister the guest KVM PMI handler at KVM module
initialization and exit. This allows a host PMU with the passthrough
capability enabled to switch the PMI handler at PMU context switch.
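
For context, a rough sketch of how the pended request is consumed on the next
VM entry; this reflects KVM's existing request processing, not code added by
this patch:

/* In vcpu_enter_guest(), among the other KVM_REQ_* checks: */
if (kvm_check_request(KVM_REQ_PMI, vcpu))
	kvm_pmu_deliver_pmi(vcpu);	/* inject the PMI through the guest LVTPC */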

Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/x86.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 02159c967d29..72995952978a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13984,6 +13984,16 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
 }
 EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);
 
+static void kvm_handle_guest_pmi(void)
+{
+	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
+
+	if (WARN_ON_ONCE(!vcpu))
+		return;
+
+	kvm_make_request(KVM_REQ_PMI, vcpu);
+}
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
@@ -14021,12 +14031,14 @@ static int __init kvm_x86_init(void)
 
 	kvm_mmu_x86_module_init();
 	mitigate_smt_rsb &= boot_cpu_has_bug(X86_BUG_SMT_RSB) && cpu_smt_possible();
+	x86_set_kvm_irq_handler(KVM_GUEST_PMI_VECTOR, kvm_handle_guest_pmi);
 	return 0;
 }
 module_init(kvm_x86_init);
 
 static void __exit kvm_x86_exit(void)
 {
+	x86_set_kvm_irq_handler(KVM_GUEST_PMI_VECTOR, NULL);
 	WARN_ON_ONCE(static_branch_unlikely(&kvm_has_noapic_vcpu));
 }
 module_exit(kvm_x86_exit);
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 09/38] perf: Add switch_guest_ctx() interface
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (7 preceding siblings ...)
  2025-03-24 17:30 ` [PATCH v4 08/38] KVM: x86/pmu: Register KVM_GUEST_PMI_VECTOR handler Mingwei Zhang
@ 2025-03-24 17:30 ` Mingwei Zhang
  2025-04-25 11:12   ` Peter Zijlstra
                     ` (2 more replies)
  2025-03-24 17:30 ` [PATCH v4 10/38] perf/x86: Support switch_guest_ctx interface Mingwei Zhang
                   ` (31 subsequent siblings)
  40 siblings, 3 replies; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Kan Liang <kan.liang@linux.intel.com>

When entering/exiting a guest, some context for the guest has to be
switched. For example, there is a dedicated interrupt vector for
guests on Intel platforms.

When the PMI is switched to a new guest vector, the guest_lvtpc value needs
to be reflected in hardware. For example, when the guest clears its PMI mask
bit, the hardware PMI mask bit should be cleared as well so that PMIs can
keep being generated for the guest. Therefore, a guest_lvtpc parameter is
added to perf_guest_enter() and switch_guest_ctx().

Add a dedicated list to track all the PMUs with the MEDIATED_VPMU cap, which
may require switching the guest context. This avoids walking the
full pmus list.
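
For illustration, a minimal sketch of how a PMU driver opts in to the new
callback (the driver name and the two program_*() helpers are placeholders;
the x86 implementation follows in the next patch):

static void my_pmu_switch_guest_ctx(bool enter, void *data)
{
	u32 guest_lvtpc = *(u32 *)data;

	/* Reprogram PMI delivery for guest entry/exit as needed. */
	if (enter)
		program_guest_pmi(guest_lvtpc);	/* placeholder */
	else
		program_host_pmi();		/* placeholder */
}

static struct pmu my_pmu = {
	/* MEDIATED_VPMU puts this PMU on the per-CPU mediated_pmus list. */
	.capabilities		= PERF_PMU_CAP_MEDIATED_VPMU,
	.switch_guest_ctx	= my_pmu_switch_guest_ctx,
	/* ... */
};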

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 include/linux/perf_event.h | 17 +++++++++++--
 kernel/events/core.c       | 51 +++++++++++++++++++++++++++++++++++++-
 2 files changed, 65 insertions(+), 3 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 37187ee8e226..58c1cf6939bf 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -584,6 +584,11 @@ struct pmu {
 	 * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
 	 */
 	int (*check_period)		(struct perf_event *event, u64 value); /* optional */
+
+	/*
+	 * Switch the guest context when a guest enters/exits, e.g., interrupt vectors.
+	 */
+	void (*switch_guest_ctx)	(bool enter, void *data); /* optional */
 };
 
 enum perf_addr_filter_action_t {
@@ -1030,6 +1035,11 @@ struct perf_event_context {
 	local_t				nr_no_switch_fast;
 };
 
+struct mediated_pmus_list {
+	raw_spinlock_t		lock;
+	struct list_head	list;
+};
+
 struct perf_cpu_pmu_context {
 	struct perf_event_pmu_context	epc;
 	struct perf_event_pmu_context	*task_epc;
@@ -1044,6 +1054,9 @@ struct perf_cpu_pmu_context {
 	struct hrtimer			hrtimer;
 	ktime_t				hrtimer_interval;
 	unsigned int			hrtimer_active;
+
+	/* Track the PMU with PERF_PMU_CAP_MEDIATED_VPMU cap */
+	struct list_head		mediated_entry;
 };
 
 /**
@@ -1822,7 +1835,7 @@ extern int perf_event_period(struct perf_event *event, u64 value);
 extern u64 perf_event_pause(struct perf_event *event, bool reset);
 int perf_get_mediated_pmu(void);
 void perf_put_mediated_pmu(void);
-void perf_guest_enter(void);
+void perf_guest_enter(u32 guest_lvtpc);
 void perf_guest_exit(void);
 #else /* !CONFIG_PERF_EVENTS: */
 static inline void *
@@ -1921,7 +1934,7 @@ static inline int perf_get_mediated_pmu(void)
 }
 
 static inline void perf_put_mediated_pmu(void)			{ }
-static inline void perf_guest_enter(void)			{ }
+static inline void perf_guest_enter(u32 guest_lvtpc)		{ }
 static inline void perf_guest_exit(void)			{ }
 #endif
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d05487d465c9..406b86641f02 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -451,6 +451,7 @@ static inline bool is_include_guest_event(struct perf_event *event)
 static LIST_HEAD(pmus);
 static DEFINE_MUTEX(pmus_lock);
 static struct srcu_struct pmus_srcu;
+static DEFINE_PER_CPU(struct mediated_pmus_list, mediated_pmus);
 static cpumask_var_t perf_online_mask;
 static cpumask_var_t perf_online_core_mask;
 static cpumask_var_t perf_online_die_mask;
@@ -6053,8 +6054,26 @@ static inline void perf_host_exit(struct perf_cpu_context *cpuctx)
 	}
 }
 
+static void perf_switch_guest_ctx(bool enter, u32 guest_lvtpc)
+{
+	struct mediated_pmus_list *pmus = this_cpu_ptr(&mediated_pmus);
+	struct perf_cpu_pmu_context *cpc;
+	struct pmu *pmu;
+
+	lockdep_assert_irqs_disabled();
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(cpc, &pmus->list, mediated_entry) {
+		pmu = cpc->epc.pmu;
+
+		if (pmu->switch_guest_ctx)
+			pmu->switch_guest_ctx(enter, (void *)&guest_lvtpc);
+	}
+	rcu_read_unlock();
+}
+
 /* When entering a guest, schedule out all exclude_guest events. */
-void perf_guest_enter(void)
+void perf_guest_enter(u32 guest_lvtpc)
 {
 	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
 
@@ -6067,6 +6086,8 @@ void perf_guest_enter(void)
 
 	perf_host_exit(cpuctx);
 
+	perf_switch_guest_ctx(true, guest_lvtpc);
+
 	__this_cpu_write(perf_in_guest, true);
 
 unlock:
@@ -6098,6 +6119,8 @@ void perf_guest_exit(void)
 	if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
 		goto unlock;
 
+	perf_switch_guest_ctx(false, 0);
+
 	perf_host_enter(cpuctx);
 
 	__this_cpu_write(perf_in_guest, false);
@@ -12104,6 +12127,15 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
 		cpc = per_cpu_ptr(pmu->cpu_pmu_context, cpu);
 		__perf_init_event_pmu_context(&cpc->epc, pmu);
 		__perf_mux_hrtimer_init(cpc, cpu);
+
+		if (pmu->capabilities & PERF_PMU_CAP_MEDIATED_VPMU) {
+			struct mediated_pmus_list *pmus;
+
+			pmus = per_cpu_ptr(&mediated_pmus, cpu);
+			raw_spin_lock(&pmus->lock);
+			list_add_rcu(&cpc->mediated_entry, &pmus->list);
+			raw_spin_unlock(&pmus->lock);
+		}
 	}
 
 	if (!pmu->start_txn) {
@@ -12162,6 +12194,20 @@ void perf_pmu_unregister(struct pmu *pmu)
 	mutex_lock(&pmus_lock);
 	list_del_rcu(&pmu->entry);
 
+	if (pmu->capabilities & PERF_PMU_CAP_MEDIATED_VPMU) {
+		struct mediated_pmus_list *pmus;
+		struct perf_cpu_pmu_context *cpc;
+		int cpu;
+
+		for_each_possible_cpu(cpu) {
+			cpc = per_cpu_ptr(pmu->cpu_pmu_context, cpu);
+			pmus = per_cpu_ptr(&mediated_pmus, cpu);
+			raw_spin_lock(&pmus->lock);
+			list_del_rcu(&cpc->mediated_entry);
+			raw_spin_unlock(&pmus->lock);
+		}
+	}
+
 	/*
 	 * We dereference the pmu list under both SRCU and regular RCU, so
 	 * synchronize against both of those.
@@ -14252,6 +14298,9 @@ static void __init perf_event_init_all_cpus(void)
 
 		INIT_LIST_HEAD(&per_cpu(sched_cb_list, cpu));
 
+		INIT_LIST_HEAD(&per_cpu(mediated_pmus.list, cpu));
+		raw_spin_lock_init(&per_cpu(mediated_pmus.lock, cpu));
+
 		cpuctx = per_cpu_ptr(&perf_cpu_context, cpu);
 		__perf_event_init_context(&cpuctx->ctx);
 		lockdep_set_class(&cpuctx->ctx.mutex, &cpuctx_mutex);
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 10/38] perf/x86: Support switch_guest_ctx interface
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (8 preceding siblings ...)
  2025-03-24 17:30 ` [PATCH v4 09/38] perf: Add switch_guest_ctx() interface Mingwei Zhang
@ 2025-03-24 17:30 ` Mingwei Zhang
  2025-04-25 11:15   ` Peter Zijlstra
  2025-03-24 17:30 ` [PATCH v4 11/38] perf/x86: Forbid PMI handler when guest own PMU Mingwei Zhang
                   ` (30 subsequent siblings)
  40 siblings, 1 reply; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Kan Liang <kan.liang@linux.intel.com>

Implement the switch_guest_ctx interface for the x86 PMU: switch the PMI to
the dedicated KVM_GUEST_PMI_VECTOR at perf guest entry, and switch it back
to NMI at perf guest exit.
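
For illustration, a hedged sketch of the guest_lvtpc value the KVM side is
expected to pass in (the actual KVM-side code arrives later in the series;
guest_lvtpc_masked is a placeholder for the guest's LVTPC mask state):

u32 guest_lvtpc = APIC_DM_FIXED | KVM_GUEST_PMI_VECTOR |
		  (guest_lvtpc_masked ? APIC_LVT_MASKED : 0);

/* x86_pmu_switch_guest_ctx(true, &guest_lvtpc) then writes this to LVTPC. */
perf_guest_enter(guest_lvtpc);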

Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/events/core.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 8f218ac0d445..28161d6ff26d 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2677,6 +2677,16 @@ static bool x86_pmu_filter(struct pmu *pmu, int cpu)
 	return ret;
 }
 
+static void x86_pmu_switch_guest_ctx(bool enter, void *data)
+{
+	u32 guest_lvtpc = *(u32 *)data;
+
+	if (enter)
+		apic_write(APIC_LVTPC, guest_lvtpc);
+	else
+		apic_write(APIC_LVTPC, APIC_DM_NMI);
+}
+
 static struct pmu pmu = {
 	.pmu_enable		= x86_pmu_enable,
 	.pmu_disable		= x86_pmu_disable,
@@ -2706,6 +2716,8 @@ static struct pmu pmu = {
 	.aux_output_match	= x86_pmu_aux_output_match,
 
 	.filter			= x86_pmu_filter,
+
+	.switch_guest_ctx	= x86_pmu_switch_guest_ctx,
 };
 
 void arch_perf_update_userpage(struct perf_event *event,
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 11/38] perf/x86: Forbid PMI handler when guest own PMU
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (9 preceding siblings ...)
  2025-03-24 17:30 ` [PATCH v4 10/38] perf/x86: Support switch_guest_ctx interface Mingwei Zhang
@ 2025-03-24 17:30 ` Mingwei Zhang
  2025-05-15  0:00   ` Sean Christopherson
  2025-03-24 17:30 ` [PATCH v4 12/38] perf/x86/core: Do not set bit width for unavailable counters Mingwei Zhang
                   ` (29 subsequent siblings)
  40 siblings, 1 reply; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

If a guest PMI is delivered after VM-exit, the KVM maskable interrupt will
be held pending until EFLAGS.IF is set. In the meantime, if the logical
processor receives an NMI for any reason at all, perf_event_nmi_handler()
will be invoked. If there is any active perf event anywhere on the system,
x86_pmu_handle_irq() will be invoked, and it will clear
IA32_PERF_GLOBAL_STATUS. By the time KVM's PMI handler is invoked, it will
be a mystery which counter(s) overflowed.

When the LVTPC is using the KVM PMI vector, the PMU is owned by the guest.
A host NMI still lets x86_pmu_handle_irq() run, which restores the PMU
vector to NMI and clears IA32_PERF_GLOBAL_STATUS; this breaks the guest
vPMU passthrough environment.

So modify perf_event_nmi_handler() to check the new pmi_vector_is_nmi
per-CPU variable and, when the vector is no longer NMI, simply return
without calling x86_pmu_handle_irq().

Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 28161d6ff26d..96a173bbbec2 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -54,6 +54,8 @@ DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events) = {
 	.pmu = &pmu,
 };
 
+static DEFINE_PER_CPU(bool, pmi_vector_is_nmi) = true;
+
 DEFINE_STATIC_KEY_FALSE(rdpmc_never_available_key);
 DEFINE_STATIC_KEY_FALSE(rdpmc_always_available_key);
 DEFINE_STATIC_KEY_FALSE(perf_is_hybrid);
@@ -1737,6 +1739,24 @@ perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
 	u64 finish_clock;
 	int ret;
 
+	/*
+	 * When the guest PMU context is loaded, this handler must be forbidden
+	 * from running, for the following reasons:
+	 * 1. After perf_guest_enter() is called, and before the CPU enters
+	 *    non-root mode, a host non-PMI NMI could occur, and x86_pmu_handle_irq()
+	 *    would restore the NMI vector, destroying the KVM PMI vector setting.
+	 * 2. While the VM is running, a host non-PMI NMI causes a VM exit; KVM
+	 *    calls the host NMI handler (vmx_vcpu_enter_exit()) before it saves
+	 *    the guest PMU context (kvm_pmu_put_guest_context()), and
+	 *    x86_pmu_handle_irq() would clear the global_status MSR, which now
+	 *    holds guest state, destroying the guest PMU state.
+	 * 3. After VM exit, but before KVM saves the guest PMU context, a host
+	 *    non-PMI NMI could occur and x86_pmu_handle_irq() would again clear
+	 *    the global_status MSR holding guest state.
+	 */
+	if (!this_cpu_read(pmi_vector_is_nmi))
+		return NMI_DONE;
+
 	/*
 	 * All PMUs/events that share this PMI handler should make sure to
 	 * increment active_events for their events.
@@ -2681,10 +2701,13 @@ static void x86_pmu_switch_guest_ctx(bool enter, void *data)
 {
 	u32 guest_lvtpc = *(u32 *)data;
 
-	if (enter)
+	if (enter) {
 		apic_write(APIC_LVTPC, guest_lvtpc);
-	else
+		this_cpu_write(pmi_vector_is_nmi, false);
+	} else {
 		apic_write(APIC_LVTPC, APIC_DM_NMI);
+		this_cpu_write(pmi_vector_is_nmi, true);
+	}
 }
 
 static struct pmu pmu = {
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 12/38] perf/x86/core: Do not set bit width for unavailable counters
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (10 preceding siblings ...)
  2025-03-24 17:30 ` [PATCH v4 11/38] perf/x86: Forbid PMI handler when guest own PMU Mingwei Zhang
@ 2025-03-24 17:30 ` Mingwei Zhang
  2025-03-24 17:30 ` [PATCH v4 13/38] perf/x86/core: Plumb mediated PMU capability from x86_pmu to x86_pmu_cap Mingwei Zhang
                   ` (28 subsequent siblings)
  40 siblings, 0 replies; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Sandipan Das <sandipan.das@amd.com>

Not all x86 processors have fixed counters. It may also be the case that
a processor has only fixed counters and no general-purpose counters. Set
the bit widths corresponding to each counter type only if such counters
are available.
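
For illustration, the effect on a hypothetical CPU that enumerates fixed
counters but no general-purpose counters:

struct x86_pmu_capability cap;

perf_get_x86_pmu_capability(&cap);

/*
 * With cap.num_counters_gp == 0, cap.bit_width_gp is now reported as 0
 * instead of x86_pmu.cntval_bits; cap.bit_width_fixed stays non-zero
 * because fixed counters are present.
 */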

Fixes: b3d9468a8bd2 ("perf, x86: Expose perf capability to other modules")
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/events/core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 96a173bbbec2..7c852ee3e217 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -3107,8 +3107,8 @@ void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
 	cap->version		= x86_pmu.version;
 	cap->num_counters_gp	= x86_pmu_num_counters(NULL);
 	cap->num_counters_fixed	= x86_pmu_num_counters_fixed(NULL);
-	cap->bit_width_gp	= x86_pmu.cntval_bits;
-	cap->bit_width_fixed	= x86_pmu.cntval_bits;
+	cap->bit_width_gp	= cap->num_counters_gp ? x86_pmu.cntval_bits : 0;
+	cap->bit_width_fixed	= cap->num_counters_fixed ? x86_pmu.cntval_bits : 0;
 	cap->events_mask	= (unsigned int)x86_pmu.events_maskl;
 	cap->events_mask_len	= x86_pmu.events_mask_len;
 	cap->pebs_ept		= x86_pmu.pebs_ept;
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 13/38] perf/x86/core: Plumb mediated PMU capability from x86_pmu to x86_pmu_cap
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (11 preceding siblings ...)
  2025-03-24 17:30 ` [PATCH v4 12/38] perf/x86/core: Do not set bit width for unavailable counters Mingwei Zhang
@ 2025-03-24 17:30 ` Mingwei Zhang
  2025-03-24 17:30 ` [PATCH v4 14/38] KVM: x86/pmu: Introduce enable_mediated_pmu global parameter Mingwei Zhang
                   ` (27 subsequent siblings)
  40 siblings, 0 replies; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

Plumb the mediated PMU capability into x86_pmu_cap so that any kernel
entity, such as KVM, can know that the host PMU supports the mediated PMU
mode and has the implementation in place.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/events/core.c            | 1 +
 arch/x86/include/asm/perf_event.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 7c852ee3e217..7a792486d9fb 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -3112,6 +3112,7 @@ void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
 	cap->events_mask	= (unsigned int)x86_pmu.events_maskl;
 	cap->events_mask_len	= x86_pmu.events_mask_len;
 	cap->pebs_ept		= x86_pmu.pebs_ept;
+	cap->mediated		= !!(pmu.capabilities & PERF_PMU_CAP_MEDIATED_VPMU);
 }
 EXPORT_SYMBOL_GPL(perf_get_x86_pmu_capability);
 
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 0ba8d20f2d1d..3aee76f3316c 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -285,6 +285,7 @@ struct x86_pmu_capability {
 	unsigned int	events_mask;
 	int		events_mask_len;
 	unsigned int	pebs_ept	:1;
+	unsigned int	mediated	:1;
 };
 
 /*
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 14/38] KVM: x86/pmu: Introduce enable_mediated_pmu global parameter
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (12 preceding siblings ...)
  2025-03-24 17:30 ` [PATCH v4 13/38] perf/x86/core: Plumb mediated PMU capability from x86_pmu to x86_pmu_cap Mingwei Zhang
@ 2025-03-24 17:30 ` Mingwei Zhang
  2025-05-15  0:09   ` Sean Christopherson
  2025-03-24 17:30 ` [PATCH v4 15/38] KVM: x86/pmu: Check PMU cpuid configuration from user space Mingwei Zhang
                   ` (26 subsequent siblings)
  40 siblings, 1 reply; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Dapeng Mi <dapeng1.mi@linux.intel.com>

Introduce the enable_mediated_pmu global parameter to control whether the
mediated vPMU can be enabled at the KVM level. Even if enable_mediated_pmu
is set to true in KVM, the user space hypervisor still needs to enable the
mediated vPMU explicitly via the KVM_CAP_PMU_CAPABILITY ioctl. This gives
the hypervisor the flexibility to enable or disable the mediated vPMU per VM.

The mediated vPMU depends on PMU features that only exist in higher PMU
versions, such as the PERF_GLOBAL_STATUS_SET MSR available in v4+ of the
Intel PMU. Thus introduce a pmu_ops field, MIN_MEDIATED_PMU_VERSION, to
indicate the minimum host PMU version that the mediated vPMU needs.

For now, enable_mediated_pmu is not exposed to user space as a module
parameter until all of the mediated vPMU code is in place.
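
For illustration, a hedged sketch of the user space opt-in through
KVM_CAP_PMU_CAPABILITY (vm_fd and the error handling are placeholders):

struct kvm_enable_cap cap = {
	.cap = KVM_CAP_PMU_CAPABILITY,
	.args[0] = 0,	/* 0 enables the vPMU; KVM_PMU_CAP_DISABLE disables it */
};

/* Must be issued on the VM fd before any vCPU is created. */
if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap))
	perror("KVM_ENABLE_CAP(KVM_CAP_PMU_CAPABILITY)");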

Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/kvm/pmu.c              |  3 ++-
 arch/x86/kvm/pmu.h              | 11 +++++++++
 arch/x86/kvm/svm/pmu.c          |  1 +
 arch/x86/kvm/vmx/capabilities.h |  3 ++-
 arch/x86/kvm/vmx/pmu_intel.c    |  5 ++++
 arch/x86/kvm/vmx/vmx.c          |  3 ++-
 arch/x86/kvm/x86.c              | 44 ++++++++++++++++++++++++++++++---
 arch/x86/kvm/x86.h              |  1 +
 8 files changed, 64 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 75e9cfc689f8..4f455afe4009 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -775,7 +775,8 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
 	pmu->pebs_data_cfg_rsvd = ~0ull;
 	bitmap_zero(pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX);
 
-	if (!vcpu->kvm->arch.enable_pmu)
+	if (!vcpu->kvm->arch.enable_pmu ||
+	    (!lapic_in_kernel(vcpu) && enable_mediated_pmu))
 		return;
 
 	kvm_pmu_call(refresh)(vcpu);
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index ad89d0bd6005..dd45a0c6be74 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -45,6 +45,7 @@ struct kvm_pmu_ops {
 	const u64 EVENTSEL_EVENT;
 	const int MAX_NR_GP_COUNTERS;
 	const int MIN_NR_GP_COUNTERS;
+	const int MIN_MEDIATED_PMU_VERSION;
 };
 
 void kvm_pmu_ops_update(const struct kvm_pmu_ops *pmu_ops);
@@ -63,6 +64,12 @@ static inline bool kvm_pmu_has_perf_global_ctrl(struct kvm_pmu *pmu)
 	return pmu->version > 1;
 }
 
+static inline bool kvm_mediated_pmu_enabled(struct kvm_vcpu *vcpu)
+{
+	return vcpu->kvm->arch.enable_pmu &&
+	       enable_mediated_pmu && vcpu_to_pmu(vcpu)->version;
+}
+
 /*
  * KVM tracks all counters in 64-bit bitmaps, with general purpose counters
  * mapped to bits 31:0 and fixed counters mapped to 63:32, e.g. fixed counter 0
@@ -210,6 +217,10 @@ static inline void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops)
 			enable_pmu = false;
 	}
 
+	if (!enable_pmu || !kvm_pmu_cap.mediated ||
+	    pmu_ops->MIN_MEDIATED_PMU_VERSION > kvm_pmu_cap.version)
+		enable_mediated_pmu = false;
+
 	if (!enable_pmu) {
 		memset(&kvm_pmu_cap, 0, sizeof(kvm_pmu_cap));
 		return;
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index 288f7f2a46f2..c8b9fd9b5350 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -239,4 +239,5 @@ struct kvm_pmu_ops amd_pmu_ops __initdata = {
 	.EVENTSEL_EVENT = AMD64_EVENTSEL_EVENT,
 	.MAX_NR_GP_COUNTERS = KVM_MAX_NR_AMD_GP_COUNTERS,
 	.MIN_NR_GP_COUNTERS = AMD64_NUM_COUNTERS,
+	.MIN_MEDIATED_PMU_VERSION = 2,
 };
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index cb6588238f46..fac2c80ddbab 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -390,7 +390,8 @@ static inline bool vmx_pt_mode_is_host_guest(void)
 
 static inline bool vmx_pebs_supported(void)
 {
-	return boot_cpu_has(X86_FEATURE_PEBS) && kvm_pmu_cap.pebs_ept;
+	return boot_cpu_has(X86_FEATURE_PEBS) &&
+	       !enable_mediated_pmu && kvm_pmu_cap.pebs_ept;
 }
 
 static inline bool cpu_has_notify_vmexit(void)
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 77012b2eca0e..425e93d4b1c6 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -739,4 +739,9 @@ struct kvm_pmu_ops intel_pmu_ops __initdata = {
 	.EVENTSEL_EVENT = ARCH_PERFMON_EVENTSEL_EVENT,
 	.MAX_NR_GP_COUNTERS = KVM_MAX_NR_INTEL_GP_COUNTERS,
 	.MIN_NR_GP_COUNTERS = 1,
+	/*
+	 * Intel mediated vPMU support depends on
+	 * MSR_CORE_PERF_GLOBAL_STATUS_SET which is supported from 4+.
+	 */
+	.MIN_MEDIATED_PMU_VERSION = 4,
 };
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 00ac94535c21..a4b5b6455c7b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7916,7 +7916,8 @@ static __init u64 vmx_get_perf_capabilities(void)
 	if (boot_cpu_has(X86_FEATURE_PDCM))
 		rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_perf_cap);
 
-	if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR)) {
+	if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR) &&
+	    !enable_mediated_pmu) {
 		x86_perf_get_lbr(&vmx_lbr_caps);
 
 		/*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 72995952978a..1ebe169b88b6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -188,6 +188,14 @@ bool __read_mostly enable_pmu = true;
 EXPORT_SYMBOL_GPL(enable_pmu);
 module_param(enable_pmu, bool, 0444);
 
+/*
+ * Enable/disable mediated passthrough PMU virtualization.
+ * Don't expose it to userspace as a module parameter until
+ * all mediated vPMU code is in place.
+ */
+bool __read_mostly enable_mediated_pmu;
+EXPORT_SYMBOL_GPL(enable_mediated_pmu);
+
 bool __read_mostly eager_page_split = true;
 module_param(eager_page_split, bool, 0644);
 
@@ -6643,9 +6651,28 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 			break;
 
 		mutex_lock(&kvm->lock);
-		if (!kvm->created_vcpus) {
-			kvm->arch.enable_pmu = !(cap->args[0] & KVM_PMU_CAP_DISABLE);
-			r = 0;
+		/*
+		 * To keep PMU configuration "simple", setting vPMU support is
+		 * disallowed if vCPUs are created, or if mediated PMU support
+		 * was already enabled for the VM.
+		 */
+		if (!kvm->created_vcpus &&
+		    (!enable_mediated_pmu || !kvm->arch.enable_pmu)) {
+			bool pmu_enable = !(cap->args[0] & KVM_PMU_CAP_DISABLE);
+
+			if (enable_mediated_pmu && pmu_enable) {
+				char *err_msg = "Fail to enable mediated vPMU, " \
+					"please disable system wide perf events or nmi_watchdog " \
+					"(echo 0 > /proc/sys/kernel/nmi_watchdog).\n";
+
+				r = perf_get_mediated_pmu();
+				if (r)
+					kvm_err("%s", err_msg);
+			} else
+				r = 0;
+
+			if (!r)
+				kvm->arch.enable_pmu = pmu_enable;
 		}
 		mutex_unlock(&kvm->lock);
 		break;
@@ -12723,7 +12750,14 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	kvm->arch.default_tsc_khz = max_tsc_khz ? : tsc_khz;
 	kvm->arch.apic_bus_cycle_ns = APIC_BUS_CYCLE_NS_DEFAULT;
 	kvm->arch.guest_can_read_msr_platform_info = true;
-	kvm->arch.enable_pmu = enable_pmu;
+
+	/*
+	 * PMU virtualization is opt-in when mediated PMU support is enabled.
+	 * KVM_CAP_PMU_CAPABILITY ioctl must be called explicitly to enable
+	 * mediated vPMU. For legacy perf-based vPMU, its behavior isn't changed,
+	 * KVM_CAP_PMU_CAPABILITY ioctl is optional.
+	 */
+	kvm->arch.enable_pmu = enable_pmu && !enable_mediated_pmu;
 
 #if IS_ENABLED(CONFIG_HYPERV)
 	spin_lock_init(&kvm->arch.hv_root_tdp_lock);
@@ -12876,6 +12910,8 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 		__x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, 0, 0);
 		mutex_unlock(&kvm->slots_lock);
 	}
+	if (kvm->arch.enable_pmu && enable_mediated_pmu)
+		perf_put_mediated_pmu();
 	kvm_unload_vcpu_mmus(kvm);
 	kvm_x86_call(vm_destroy)(kvm);
 	kvm_free_msr_filter(srcu_dereference_check(kvm->arch.msr_filter, &kvm->srcu, 1));
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 91e50a513100..dbf9973b3d09 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -391,6 +391,7 @@ extern struct kvm_caps kvm_caps;
 extern struct kvm_host_values kvm_host;
 
 extern bool enable_pmu;
+extern bool enable_mediated_pmu;
 
 /*
  * Get a filtered version of KVM's supported XCR0 that strips out dynamic
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 15/38] KVM: x86/pmu: Check PMU cpuid configuration from user space
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (13 preceding siblings ...)
  2025-03-24 17:30 ` [PATCH v4 14/38] KVM: x86/pmu: Introduce enable_mediated_pmu global parameter Mingwei Zhang
@ 2025-03-24 17:30 ` Mingwei Zhang
  2025-05-15  0:12   ` Sean Christopherson
  2025-03-24 17:30 ` [PATCH v4 16/38] KVM: x86: Rename vmx_vmentry/vmexit_ctrl() helpers Mingwei Zhang
                   ` (25 subsequent siblings)
  40 siblings, 1 reply; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Dapeng Mi <dapeng1.mi@linux.intel.com>

Check user space's PMU CPUID configuration and reject invalid
configurations.

Both the legacy perf-based vPMU and the mediated vPMU need an in-kernel
local APIC, otherwise a PMI has no way to be injected into the guest. If
the local APIC is not emulated in the kernel, reject user space's attempt
to enable the PMU CPUID.

The user space configured PMU version must be no larger than the maximum
PMU version KVM supports for the mediated vPMU, otherwise the guest could
manipulate unsupported or disallowed PMU MSRs, which is dangerous and harmful.

If the PMU version is larger than 1 but smaller than 5, CPUID.0AH.ECX
must be 0 as well, as required by the SDM.
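
For illustration, a hedged example of a leaf 0xA configuration that the new
check rejects (field values are placeholders):

/* Set via KVM_SET_CPUID2; version_id lives in CPUID.0AH:EAX[7:0]. */
struct kvm_cpuid_entry2 entry = {
	.function = 0xa,
	.eax = 0x03,	/* PMU version 3 */
	.ecx = 0x1,	/* rejected: versions 2..4 require ECX == 0 */
};

/*
 * Likewise, with enable_mediated_pmu, a version_id larger than the host's
 * kvm_pmu_cap.version makes kvm_check_cpuid() return -EINVAL.
 */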

Suggested-by: Zide Chen <zide.chen@intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/cpuid.c | 15 +++++++++++++++
 arch/x86/kvm/pmu.c   |  7 +++++--
 arch/x86/kvm/pmu.h   |  1 +
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 8eb3a88707f2..f849ced9deba 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -179,6 +179,21 @@ static int kvm_check_cpuid(struct kvm_vcpu *vcpu)
 			return -EINVAL;
 	}
 
+	best = kvm_find_cpuid_entry(vcpu, 0xa);
+	if (vcpu->kvm->arch.enable_pmu && best) {
+		union cpuid10_eax eax;
+
+		eax.full = best->eax;
+		if (enable_mediated_pmu &&
+		    eax.split.version_id > kvm_pmu_cap.version)
+			return -EINVAL;
+		if (eax.split.version_id > 0 && !vcpu_pmu_can_enable(vcpu))
+			return -EINVAL;
+		if (eax.split.version_id > 1 && eax.split.version_id < 5 &&
+		    best->ecx != 0)
+			return -EINVAL;
+	}
+
 	/*
 	 * Exposing dynamic xfeatures to the guest requires additional
 	 * enabling in the FPU, e.g. to expand the guest XSAVE state size.
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 4f455afe4009..92c742ead663 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -743,6 +743,10 @@ static void kvm_pmu_reset(struct kvm_vcpu *vcpu)
 	kvm_pmu_call(reset)(vcpu);
 }
 
+inline bool vcpu_pmu_can_enable(struct kvm_vcpu *vcpu)
+{
+	return vcpu->kvm->arch.enable_pmu && lapic_in_kernel(vcpu);
+}
 
 /*
  * Refresh the PMU configuration for the vCPU, e.g. if userspace changes CPUID
@@ -775,8 +779,7 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
 	pmu->pebs_data_cfg_rsvd = ~0ull;
 	bitmap_zero(pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX);
 
-	if (!vcpu->kvm->arch.enable_pmu ||
-	    (!lapic_in_kernel(vcpu) && enable_mediated_pmu))
+	if (!vcpu_pmu_can_enable(vcpu))
 		return;
 
 	kvm_pmu_call(refresh)(vcpu);
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index dd45a0c6be74..e1d0096f249b 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -284,6 +284,7 @@ void kvm_pmu_cleanup(struct kvm_vcpu *vcpu);
 void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
 int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
 void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel);
+bool vcpu_pmu_can_enable(struct kvm_vcpu *vcpu);
 
 bool is_vmware_backdoor_pmc(u32 pmc_idx);
 
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 16/38] KVM: x86: Rename vmx_vmentry/vmexit_ctrl() helpers
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (14 preceding siblings ...)
  2025-03-24 17:30 ` [PATCH v4 15/38] KVM: x86/pmu: Check PMU cpuid configuration from user space Mingwei Zhang
@ 2025-03-24 17:30 ` Mingwei Zhang
  2025-03-24 17:30 ` [PATCH v4 17/38] KVM: x86/pmu: Add perf_capabilities field in struct kvm_host_values{} Mingwei Zhang
                   ` (24 subsequent siblings)
  40 siblings, 0 replies; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Dapeng Mi <dapeng1.mi@linux.intel.com>

Rename the two helpers vmx_vmentry/vmexit_ctrl() to
vmx_get_initial_vmentry/vmexit_ctrl() to better reflect their actual
purpose: computing the initial VM-entry/VM-exit controls.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/vmx/vmx.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a4b5b6455c7b..acd3582874b9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4424,7 +4424,7 @@ static u32 vmx_pin_based_exec_ctrl(struct vcpu_vmx *vmx)
 	return pin_based_exec_ctrl;
 }
 
-static u32 vmx_vmentry_ctrl(void)
+static u32 vmx_get_initial_vmentry_ctrl(void)
 {
 	u32 vmentry_ctrl = vmcs_config.vmentry_ctrl;
 
@@ -4441,7 +4441,7 @@ static u32 vmx_vmentry_ctrl(void)
 	return vmentry_ctrl;
 }
 
-static u32 vmx_vmexit_ctrl(void)
+static u32 vmx_get_initial_vmexit_ctrl(void)
 {
 	u32 vmexit_ctrl = vmcs_config.vmexit_ctrl;
 
@@ -4806,10 +4806,10 @@ static void init_vmcs(struct vcpu_vmx *vmx)
 	if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
 		vmcs_write64(GUEST_IA32_PAT, vmx->vcpu.arch.pat);
 
-	vm_exit_controls_set(vmx, vmx_vmexit_ctrl());
+	vm_exit_controls_set(vmx, vmx_get_initial_vmexit_ctrl());
 
 	/* 22.2.1, 20.8.1 */
-	vm_entry_controls_set(vmx, vmx_vmentry_ctrl());
+	vm_entry_controls_set(vmx, vmx_get_initial_vmentry_ctrl());
 
 	vmx->vcpu.arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits();
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr0_guest_owned_bits);
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 17/38] KVM: x86/pmu: Add perf_capabilities field in struct kvm_host_values{}
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (15 preceding siblings ...)
  2025-03-24 17:30 ` [PATCH v4 16/38] KVM: x86: Rename vmx_vmentry/vmexit_ctrl() helpers Mingwei Zhang
@ 2025-03-24 17:30 ` Mingwei Zhang
  2025-05-15  0:12   ` Sean Christopherson
  2025-03-24 17:30 ` [PATCH v4 18/38] KVM: x86/pmu: Move PMU_CAP_{FW_WRITES,LBR_FMT} into msr-index.h header Mingwei Zhang
                   ` (23 subsequent siblings)
  40 siblings, 1 reply; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Dapeng Mi <dapeng1.mi@linux.intel.com>

Add a perf_capabilities field to the kvm_host_values{} structure to record
the host perf capabilities. KVM needs to know whether the host supports
certain PMU capabilities in order to decide whether to pass through or
intercept some PMU MSRs or instructions such as RDPMC. For example, if the
host supports PERF_METRICS but the guest is configured not to, the RDPMC
instruction needs to be intercepted.

Co-developed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/kvm/vmx/vmx.c | 8 ++------
 arch/x86/kvm/x86.c     | 3 +++
 arch/x86/kvm/x86.h     | 1 +
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index acd3582874b9..ca1c53f855e0 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7908,14 +7908,10 @@ void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 static __init u64 vmx_get_perf_capabilities(void)
 {
 	u64 perf_cap = PMU_CAP_FW_WRITES;
-	u64 host_perf_cap = 0;
 
 	if (!enable_pmu)
 		return 0;
 
-	if (boot_cpu_has(X86_FEATURE_PDCM))
-		rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_perf_cap);
-
 	if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR) &&
 	    !enable_mediated_pmu) {
 		x86_perf_get_lbr(&vmx_lbr_caps);
@@ -7928,11 +7924,11 @@ static __init u64 vmx_get_perf_capabilities(void)
 		if (!vmx_lbr_caps.has_callstack)
 			memset(&vmx_lbr_caps, 0, sizeof(vmx_lbr_caps));
 		else if (vmx_lbr_caps.nr)
-			perf_cap |= host_perf_cap & PMU_CAP_LBR_FMT;
+			perf_cap |= kvm_host.perf_capabilities & PMU_CAP_LBR_FMT;
 	}
 
 	if (vmx_pebs_supported()) {
-		perf_cap |= host_perf_cap & PERF_CAP_PEBS_MASK;
+		perf_cap |= kvm_host.perf_capabilities & PERF_CAP_PEBS_MASK;
 
 		/*
 		 * Disallow adaptive PEBS as it is functionally broken, can be
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1ebe169b88b6..578e5f110b6c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9786,6 +9786,9 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
 	if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
 		rdmsrl(MSR_IA32_ARCH_CAPABILITIES, kvm_host.arch_capabilities);
 
+	if (boot_cpu_has(X86_FEATURE_PDCM))
+		rdmsrl(MSR_IA32_PERF_CAPABILITIES, kvm_host.perf_capabilities);
+
 	r = ops->hardware_setup();
 	if (r != 0)
 		goto out_mmu_exit;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index dbf9973b3d09..b1df4ad2341b 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -46,6 +46,7 @@ struct kvm_host_values {
 	u64 xcr0;
 	u64 xss;
 	u64 arch_capabilities;
+	u64 perf_capabilities;
 };
 
 void kvm_spurious_fault(void);
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 18/38] KVM: x86/pmu: Move PMU_CAP_{FW_WRITES,LBR_FMT} into msr-index.h header
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (16 preceding siblings ...)
  2025-03-24 17:30 ` [PATCH v4 17/38] KVM: x86/pmu: Add perf_capabilities field in struct kvm_host_values{} Mingwei Zhang
@ 2025-03-24 17:30 ` Mingwei Zhang
  2025-03-24 17:30 ` [PATCH v4 19/38] KVM: VMX: Add macros to wrap around {secondary,tertiary}_exec_controls_changebit() Mingwei Zhang
                   ` (22 subsequent siblings)
  40 siblings, 0 replies; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Dapeng Mi <dapeng1.mi@linux.intel.com>

Move PMU_CAP_{FW_WRITES,LBR_FMT} into msr-index.h and rename them with the
PERF_CAP prefix to stay consistent with the other perf capability macros.

No functional change intended.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/include/asm/msr-index.h | 15 +++++++++------
 arch/x86/kvm/vmx/capabilities.h  |  3 ---
 arch/x86/kvm/vmx/pmu_intel.c     |  4 ++--
 arch/x86/kvm/vmx/vmx.c           | 12 ++++++------
 4 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 72765b2fe0d8..ca70846ffd55 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -305,12 +305,15 @@
 #define PERF_CAP_PT_IDX			16
 
 #define MSR_PEBS_LD_LAT_THRESHOLD	0x000003f6
-#define PERF_CAP_PEBS_TRAP             BIT_ULL(6)
-#define PERF_CAP_ARCH_REG              BIT_ULL(7)
-#define PERF_CAP_PEBS_FORMAT           0xf00
-#define PERF_CAP_PEBS_BASELINE         BIT_ULL(14)
-#define PERF_CAP_PEBS_MASK	(PERF_CAP_PEBS_TRAP | PERF_CAP_ARCH_REG | \
-				 PERF_CAP_PEBS_FORMAT | PERF_CAP_PEBS_BASELINE)
+
+#define PERF_CAP_LBR_FMT		0x3f
+#define PERF_CAP_PEBS_TRAP		BIT_ULL(6)
+#define PERF_CAP_ARCH_REG		BIT_ULL(7)
+#define PERF_CAP_PEBS_FORMAT		0xf00
+#define PERF_CAP_FW_WRITES		BIT_ULL(13)
+#define PERF_CAP_PEBS_BASELINE		BIT_ULL(14)
+#define PERF_CAP_PEBS_MASK		(PERF_CAP_PEBS_TRAP | PERF_CAP_ARCH_REG | \
+					 PERF_CAP_PEBS_FORMAT | PERF_CAP_PEBS_BASELINE)
 
 #define MSR_IA32_RTIT_CTL		0x00000570
 #define RTIT_CTL_TRACEEN		BIT(0)
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index fac2c80ddbab..013536fde10b 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -21,9 +21,6 @@ extern int __read_mostly pt_mode;
 #define PT_MODE_SYSTEM		0
 #define PT_MODE_HOST_GUEST	1
 
-#define PMU_CAP_FW_WRITES	(1ULL << 13)
-#define PMU_CAP_LBR_FMT		0x3f
-
 struct nested_vmx_msrs {
 	/*
 	 * We only store the "true" versions of the VMX capability MSRs. We
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 425e93d4b1c6..fc017e9a6a0c 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -118,7 +118,7 @@ static inline u64 vcpu_get_perf_capabilities(struct kvm_vcpu *vcpu)
 
 static inline bool fw_writes_is_enabled(struct kvm_vcpu *vcpu)
 {
-	return (vcpu_get_perf_capabilities(vcpu) & PMU_CAP_FW_WRITES) != 0;
+	return (vcpu_get_perf_capabilities(vcpu) & PERF_CAP_FW_WRITES) != 0;
 }
 
 static inline struct kvm_pmc *get_fw_gp_pmc(struct kvm_pmu *pmu, u32 msr)
@@ -543,7 +543,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 
 	perf_capabilities = vcpu_get_perf_capabilities(vcpu);
 	if (cpuid_model_is_consistent(vcpu) &&
-	    (perf_capabilities & PMU_CAP_LBR_FMT))
+	    (perf_capabilities & PERF_CAP_LBR_FMT))
 		memcpy(&lbr_desc->records, &vmx_lbr_caps, sizeof(vmx_lbr_caps));
 	else
 		lbr_desc->records.nr = 0;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ca1c53f855e0..9c4b3c2b1d65 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2188,7 +2188,7 @@ static u64 vmx_get_supported_debugctl(struct kvm_vcpu *vcpu, bool host_initiated
 	    (host_initiated || guest_cpu_cap_has(vcpu, X86_FEATURE_BUS_LOCK_DETECT)))
 		debugctl |= DEBUGCTLMSR_BUS_LOCK_DETECT;
 
-	if ((kvm_caps.supported_perf_cap & PMU_CAP_LBR_FMT) &&
+	if ((kvm_caps.supported_perf_cap & PERF_CAP_LBR_FMT) &&
 	    (host_initiated || intel_pmu_lbr_is_enabled(vcpu)))
 		debugctl |= DEBUGCTLMSR_LBR | DEBUGCTLMSR_FREEZE_LBRS_ON_PMI;
 
@@ -2464,9 +2464,9 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			vmx->pt_desc.guest.addr_a[index / 2] = data;
 		break;
 	case MSR_IA32_PERF_CAPABILITIES:
-		if (data & PMU_CAP_LBR_FMT) {
-			if ((data & PMU_CAP_LBR_FMT) !=
-			    (kvm_caps.supported_perf_cap & PMU_CAP_LBR_FMT))
+		if (data & PERF_CAP_LBR_FMT) {
+			if ((data & PERF_CAP_LBR_FMT) !=
+			    (kvm_caps.supported_perf_cap & PERF_CAP_LBR_FMT))
 				return 1;
 			if (!cpuid_model_is_consistent(vcpu))
 				return 1;
@@ -7907,7 +7907,7 @@ void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 
 static __init u64 vmx_get_perf_capabilities(void)
 {
-	u64 perf_cap = PMU_CAP_FW_WRITES;
+	u64 perf_cap = PERF_CAP_FW_WRITES;
 
 	if (!enable_pmu)
 		return 0;
@@ -7924,7 +7924,7 @@ static __init u64 vmx_get_perf_capabilities(void)
 		if (!vmx_lbr_caps.has_callstack)
 			memset(&vmx_lbr_caps, 0, sizeof(vmx_lbr_caps));
 		else if (vmx_lbr_caps.nr)
-			perf_cap |= kvm_host.perf_capabilities & PMU_CAP_LBR_FMT;
+			perf_cap |= kvm_host.perf_capabilities & PERF_CAP_LBR_FMT;
 	}
 
 	if (vmx_pebs_supported()) {
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 19/38] KVM: VMX: Add macros to wrap around {secondary,tertiary}_exec_controls_changebit()
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (17 preceding siblings ...)
  2025-03-24 17:30 ` [PATCH v4 18/38] KVM: x86/pmu: Move PMU_CAP_{FW_WRITES,LBR_FMT} into msr-index.h header Mingwei Zhang
@ 2025-03-24 17:30 ` Mingwei Zhang
  2025-03-24 17:31 ` [PATCH v4 20/38] KVM: x86/pmu: Check if mediated vPMU can intercept rdpmc Mingwei Zhang
                   ` (21 subsequent siblings)
  40 siblings, 0 replies; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Dapeng Mi <dapeng1.mi@linux.intel.com>

Add wrapper macros around the helpers that change VMCS control bits to
simplify setting and clearing the VMX exec ctrl bits.

No functional change intended.
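
As an illustration only (not part of the patch itself), a caller that
previously open-coded the set/clear pair can collapse into a single call;
the "intercept_rdpmc" flag below is a hypothetical stand-in for whatever
condition the caller evaluates:

	/* Before: branch on the desired state. */
	if (intercept_rdpmc)
		exec_controls_setbit(vmx, CPU_BASED_RDPMC_EXITING);
	else
		exec_controls_clearbit(vmx, CPU_BASED_RDPMC_EXITING);

	/* After: the _changebit() wrapper picks setbit/clearbit itself. */
	exec_controls_changebit(vmx, CPU_BASED_RDPMC_EXITING, intercept_rdpmc);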

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/vmx/vmx.c | 20 +++++++-------------
 arch/x86/kvm/vmx/vmx.h |  8 ++++++++
 2 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9c4b3c2b1d65..ff66f17d6358 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4471,19 +4471,13 @@ void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 
 	pin_controls_set(vmx, vmx_pin_based_exec_ctrl(vmx));
 
-	if (kvm_vcpu_apicv_active(vcpu)) {
-		secondary_exec_controls_setbit(vmx,
-					       SECONDARY_EXEC_APIC_REGISTER_VIRT |
-					       SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
-		if (enable_ipiv)
-			tertiary_exec_controls_setbit(vmx, TERTIARY_EXEC_IPI_VIRT);
-	} else {
-		secondary_exec_controls_clearbit(vmx,
-						 SECONDARY_EXEC_APIC_REGISTER_VIRT |
-						 SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
-		if (enable_ipiv)
-			tertiary_exec_controls_clearbit(vmx, TERTIARY_EXEC_IPI_VIRT);
-	}
+	secondary_exec_controls_changebit(vmx,
+					  SECONDARY_EXEC_APIC_REGISTER_VIRT |
+					  SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY,
+					  kvm_vcpu_apicv_active(vcpu));
+	if (enable_ipiv)
+		tertiary_exec_controls_changebit(vmx, TERTIARY_EXEC_IPI_VIRT,
+						 kvm_vcpu_apicv_active(vcpu));
 
 	vmx_update_msr_bitmap_x2apic(vcpu);
 }
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 8b111ce1087c..5c505af553c8 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -612,6 +612,14 @@ static __always_inline void lname##_controls_clearbit(struct vcpu_vmx *vmx, u##b
 {												\
 	BUILD_BUG_ON(!(val & (KVM_REQUIRED_VMX_##uname | KVM_OPTIONAL_VMX_##uname)));		\
 	lname##_controls_set(vmx, lname##_controls_get(vmx) & ~val);				\
+}												\
+static __always_inline void lname##_controls_changebit(struct vcpu_vmx *vmx, u##bits val,	\
+						       bool set)				\
+{												\
+	if (set)										\
+		lname##_controls_setbit(vmx, val);						\
+	else											\
+		lname##_controls_clearbit(vmx, val);						\
 }
 BUILD_CONTROLS_SHADOW(vm_entry, VM_ENTRY_CONTROLS, 32)
 BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS, 32)
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 20/38] KVM: x86/pmu: Check if mediated vPMU can intercept rdpmc
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (18 preceding siblings ...)
  2025-03-24 17:30 ` [PATCH v4 19/38] KVM: VMX: Add macros to wrap around {secondary,tertiary}_exec_controls_changebit() Mingwei Zhang
@ 2025-03-24 17:31 ` Mingwei Zhang
  2025-05-15  0:19   ` Sean Christopherson
  2025-05-26  6:15   ` Sandipan Das
  2025-03-24 17:31 ` [PATCH v4 21/38] KVM: x86/pmu/vmx: Save/load guest IA32_PERF_GLOBAL_CTRL with vm_exit/entry_ctrl Mingwei Zhang
                   ` (20 subsequent siblings)
  40 siblings, 2 replies; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:31 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Dapeng Mi <dapeng1.mi@linux.intel.com>

Check whether RDPMC needs to be intercepted for the mediated vPMU.
Simply speaking, if the guest owns all PMU counters under the mediated
vPMU, then RDPMC interception should be disabled to mitigate the
performance impact; otherwise RDPMC has to be intercepted to prevent the
guest from reading host counter data via the RDPMC instruction.
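
A condensed view of how the two vendor refresh paths consume the new
helper (lifted from the hunks below, shown here only to summarize the
intent):

	/* VMX: toggle the RDPMC-exiting execution control. */
	exec_controls_changebit(to_vmx(vcpu), CPU_BASED_RDPMC_EXITING,
				!kvm_rdpmc_in_guest(vcpu));

	/* SVM: toggle the RDPMC intercept in the VMCB. */
	if (kvm_rdpmc_in_guest(vcpu))
		svm_clr_intercept(to_svm(vcpu), INTERCEPT_RDPMC);
	else
		svm_set_intercept(to_svm(vcpu), INTERCEPT_RDPMC);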

Co-developed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Co-developed-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/include/asm/msr-index.h |  1 +
 arch/x86/kvm/pmu.c               | 34 ++++++++++++++++++++++++++++++++
 arch/x86/kvm/pmu.h               | 19 ++++++++++++++++++
 arch/x86/kvm/svm/pmu.c           | 14 ++++++++++++-
 arch/x86/kvm/vmx/pmu_intel.c     | 18 ++++++++---------
 5 files changed, 76 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index ca70846ffd55..337f4b0a2998 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -312,6 +312,7 @@
 #define PERF_CAP_PEBS_FORMAT		0xf00
 #define PERF_CAP_FW_WRITES		BIT_ULL(13)
 #define PERF_CAP_PEBS_BASELINE		BIT_ULL(14)
+#define PERF_CAP_PERF_METRICS		BIT_ULL(15)
 #define PERF_CAP_PEBS_MASK		(PERF_CAP_PEBS_TRAP | PERF_CAP_ARCH_REG | \
 					 PERF_CAP_PEBS_FORMAT | PERF_CAP_PEBS_BASELINE)
 
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 92c742ead663..6ad71752be4b 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -604,6 +604,40 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
 	return 0;
 }
 
+inline bool kvm_rdpmc_in_guest(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+	if (!kvm_mediated_pmu_enabled(vcpu))
+		return false;
+
+	/*
+	 * VMware allows access to these Pseudo-PMCs even when read via RDPMC
+	 * in Ring3 when CR4.PCE=0.
+	 */
+	if (enable_vmware_backdoor)
+		return false;
+
+	/*
+	 * FIXME: In theory, perf metrics is always paired with fixed
+	 *	  counter 3, so comparing the guest and host fixed counter
+	 *	  numbers would suffice and perf metrics wouldn't need an
+	 *	  explicit check. However, kvm_pmu_cap.num_counters_fixed is
+	 *	  currently capped at KVM_MAX_NR_FIXED_COUNTERS (3) because
+	 *	  fixed counter 3 isn't supported yet, so perf metrics still
+	 *	  has to be checked explicitly here. Once fixed counter 3 is
+	 *	  supported, this perf metrics check can be removed.
+	 */
+	return pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp &&
+	       pmu->nr_arch_fixed_counters == kvm_pmu_cap.num_counters_fixed &&
+	       vcpu_has_perf_metrics(vcpu) == kvm_host_has_perf_metrics() &&
+	       pmu->counter_bitmask[KVM_PMC_GP] ==
+				(BIT_ULL(kvm_pmu_cap.bit_width_gp) - 1) &&
+	       pmu->counter_bitmask[KVM_PMC_FIXED] ==
+				(BIT_ULL(kvm_pmu_cap.bit_width_fixed) - 1);
+}
+EXPORT_SYMBOL_GPL(kvm_rdpmc_in_guest);
+
 void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu)
 {
 	if (lapic_in_kernel(vcpu)) {
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index e1d0096f249b..509c995b7871 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -271,6 +271,24 @@ static inline bool pmc_is_globally_enabled(struct kvm_pmc *pmc)
 	return test_bit(pmc->idx, (unsigned long *)&pmu->global_ctrl);
 }
 
+static inline u64 vcpu_get_perf_capabilities(struct kvm_vcpu *vcpu)
+{
+	if (!guest_cpu_cap_has(vcpu, X86_FEATURE_PDCM))
+		return 0;
+
+	return vcpu->arch.perf_capabilities;
+}
+
+static inline bool vcpu_has_perf_metrics(struct kvm_vcpu *vcpu)
+{
+	return !!(vcpu_get_perf_capabilities(vcpu) & PERF_CAP_PERF_METRICS);
+}
+
+static inline bool kvm_host_has_perf_metrics(void)
+{
+	return !!(kvm_host.perf_capabilities & PERF_CAP_PERF_METRICS);
+}
+
 void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu);
 void kvm_pmu_handle_event(struct kvm_vcpu *vcpu);
 int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
@@ -287,6 +305,7 @@ void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel);
 bool vcpu_pmu_can_enable(struct kvm_vcpu *vcpu);
 
 bool is_vmware_backdoor_pmc(u32 pmc_idx);
+bool kvm_rdpmc_in_guest(struct kvm_vcpu *vcpu);
 
 extern struct kvm_pmu_ops intel_pmu_ops;
 extern struct kvm_pmu_ops amd_pmu_ops;
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index c8b9fd9b5350..153972e944eb 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -173,7 +173,7 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	return 1;
 }
 
-static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
+static void __amd_pmu_refresh(struct kvm_vcpu *vcpu)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
 	union cpuid_0x80000022_ebx ebx;
@@ -212,6 +212,18 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
 	bitmap_set(pmu->all_valid_pmc_idx, 0, pmu->nr_arch_gp_counters);
 }
 
+static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	__amd_pmu_refresh(vcpu);
+
+	if (kvm_rdpmc_in_guest(vcpu))
+		svm_clr_intercept(svm, INTERCEPT_RDPMC);
+	else
+		svm_set_intercept(svm, INTERCEPT_RDPMC);
+}
+
 static void amd_pmu_init(struct kvm_vcpu *vcpu)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index fc017e9a6a0c..2a5f79206b02 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -108,14 +108,6 @@ static struct kvm_pmc *intel_rdpmc_ecx_to_pmc(struct kvm_vcpu *vcpu,
 	return &counters[array_index_nospec(idx, num_counters)];
 }
 
-static inline u64 vcpu_get_perf_capabilities(struct kvm_vcpu *vcpu)
-{
-	if (!guest_cpu_cap_has(vcpu, X86_FEATURE_PDCM))
-		return 0;
-
-	return vcpu->arch.perf_capabilities;
-}
-
 static inline bool fw_writes_is_enabled(struct kvm_vcpu *vcpu)
 {
 	return (vcpu_get_perf_capabilities(vcpu) & PERF_CAP_FW_WRITES) != 0;
@@ -456,7 +448,7 @@ static void intel_pmu_enable_fixed_counter_bits(struct kvm_pmu *pmu, u64 bits)
 		pmu->fixed_ctr_ctrl_rsvd &= ~intel_fixed_bits_by_idx(i, bits);
 }
 
-static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
+static void __intel_pmu_refresh(struct kvm_vcpu *vcpu)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
 	struct lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu);
@@ -564,6 +556,14 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 	}
 }
 
+static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
+{
+	__intel_pmu_refresh(vcpu);
+
+	exec_controls_changebit(to_vmx(vcpu), CPU_BASED_RDPMC_EXITING,
+				!kvm_rdpmc_in_guest(vcpu));
+}
+
 static void intel_pmu_init(struct kvm_vcpu *vcpu)
 {
 	int i;
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 21/38] KVM: x86/pmu/vmx: Save/load guest IA32_PERF_GLOBAL_CTRL with vm_exit/entry_ctrl
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (19 preceding siblings ...)
  2025-03-24 17:31 ` [PATCH v4 20/38] KVM: x86/pmu: Check if mediated vPMU can intercept rdpmc Mingwei Zhang
@ 2025-03-24 17:31 ` Mingwei Zhang
  2025-03-26 16:51   ` Chen, Zide
  2025-03-24 17:31 ` [PATCH v4 22/38] KVM: x86/pmu: Optimize intel/amd_pmu_refresh() helpers Mingwei Zhang
                   ` (19 subsequent siblings)
  40 siblings, 1 reply; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:31 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Dapeng Mi <dapeng1.mi@linux.intel.com>

Intel processors (VMX) provide the capability to save/load the guest's
IA32_PERF_GLOBAL_CTRL at VM-exit/VM-entry by setting the
VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL bit in the VM-exit controls or the
VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL bit in the VM-entry controls.

The mediated vPMU leverages both capabilities to save/load guest
IA32_PERF_GLOBAL_CTRL automatically at VM-exit/VM-entry. Note that the
former (the save-on-exit control) was only introduced on Sapphire Rapids
and later Intel CPUs.

If VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL is unavailable, the mediated PMU is
disabled. The mediated PMU could instead fall back to the atomic MSR
save/restore lists, but that would add extra overhead to every
VM-entry/exit.

Since these VMX capability bits save/restore the PMU global ctrl between
the VMCS and the HW MSR automatically, no synchronization is performed
between the HW MSR and pmu->global_ctrl, the value cached by KVM.
Therefore, whenever KVM needs to use this variable, it must explicitly
read the value back into pmu->global_ctrl first. This is especially
important when the guest doesn't own all PMU counters, i.e. when
IA32_PERF_GLOBAL_CTRL is intercepted by the mediated PMU.
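
As a concrete illustration (a minimal sketch; the helper added by this
patch goes through kvm_pmu_call(get_msr) rather than reading the VMCS
directly from common code), a consumer of the cached value first pulls it
out of the VMCS:

	/*
	 * The CPU saved guest PERF_GLOBAL_CTRL into the VMCS at VM-exit, so
	 * refresh the SW cache before using it.
	 */
	if (kvm_mediated_pmu_enabled(vcpu))
		pmu->global_ctrl = vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL);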

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Co-developed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/include/asm/kvm_host.h |  4 ++++
 arch/x86/include/asm/vmx.h      |  1 +
 arch/x86/kvm/pmu.c              | 30 ++++++++++++++++++++++++-
 arch/x86/kvm/vmx/capabilities.h |  5 +++++
 arch/x86/kvm/vmx/nested.c       |  3 ++-
 arch/x86/kvm/vmx/pmu_intel.c    | 39 ++++++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/vmx.c          | 22 ++++++++++++++++++-
 arch/x86/kvm/vmx/vmx.h          |  3 ++-
 8 files changed, 102 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0b7af5902ff7..4b3bfefc2d05 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -553,6 +553,10 @@ struct kvm_pmu {
 	unsigned available_event_types;
 	u64 fixed_ctr_ctrl;
 	u64 fixed_ctr_ctrl_rsvd;
+	/*
+	 * kvm_pmu_sync_global_ctrl_from_vmcs() must be called to update
+	 * this SW-maintained global_ctrl for mediated vPMU before accessing it.
+	 */
 	u64 global_ctrl;
 	u64 global_status;
 	u64 counter_bitmask[2];
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index f7fd4369b821..48e137560f17 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -106,6 +106,7 @@
 #define VM_EXIT_CLEAR_BNDCFGS                   0x00800000
 #define VM_EXIT_PT_CONCEAL_PIP			0x01000000
 #define VM_EXIT_CLEAR_IA32_RTIT_CTL		0x02000000
+#define VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL	0x40000000
 
 #define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR	0x00036dff
 
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 6ad71752be4b..4e8cefcce7ab 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -646,6 +646,30 @@ void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu)
 	}
 }
 
+static void kvm_pmu_sync_global_ctrl_from_vmcs(struct kvm_vcpu *vcpu)
+{
+	struct msr_data msr_info = { .index = MSR_CORE_PERF_GLOBAL_CTRL };
+
+	if (!kvm_mediated_pmu_enabled(vcpu))
+		return;
+
+	/* Sync pmu->global_ctrl from GUEST_IA32_PERF_GLOBAL_CTRL. */
+	kvm_pmu_call(get_msr)(vcpu, &msr_info);
+}
+
+static void kvm_pmu_sync_global_ctrl_to_vmcs(struct kvm_vcpu *vcpu, u64 global_ctrl)
+{
+	struct msr_data msr_info = {
+		.index = MSR_CORE_PERF_GLOBAL_CTRL,
+		.data = global_ctrl };
+
+	if (!kvm_mediated_pmu_enabled(vcpu))
+		return;
+
+	/* Sync pmu->global_ctrl to GUEST_IA32_PERF_GLOBAL_CTRL. */
+	kvm_pmu_call(set_msr)(vcpu, &msr_info);
+}
+
 bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
 {
 	switch (msr) {
@@ -680,7 +704,6 @@ int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		msr_info->data = pmu->global_status;
 		break;
 	case MSR_AMD64_PERF_CNTR_GLOBAL_CTL:
-	case MSR_CORE_PERF_GLOBAL_CTRL:
 		msr_info->data = pmu->global_ctrl;
 		break;
 	case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR:
@@ -731,6 +754,9 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			diff = pmu->global_ctrl ^ data;
 			pmu->global_ctrl = data;
 			reprogram_counters(pmu, diff);
+
+			/* Propagate guest global_ctrl to GUEST_IA32_PERF_GLOBAL_CTRL. */
+			kvm_pmu_sync_global_ctrl_to_vmcs(vcpu, data);
 		}
 		break;
 	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
@@ -907,6 +933,8 @@ void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel)
 
 	BUILD_BUG_ON(sizeof(pmu->global_ctrl) * BITS_PER_BYTE != X86_PMC_IDX_MAX);
 
+	kvm_pmu_sync_global_ctrl_from_vmcs(vcpu);
+
 	if (!kvm_pmu_has_perf_global_ctrl(pmu))
 		bitmap_copy(bitmap, pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX);
 	else if (!bitmap_and(bitmap, pmu->all_valid_pmc_idx,
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 013536fde10b..cc63bd4ab87c 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -101,6 +101,11 @@ static inline bool cpu_has_load_perf_global_ctrl(void)
 	return vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
 }
 
+static inline bool cpu_has_save_perf_global_ctrl(void)
+{
+	return vmcs_config.vmexit_ctrl & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL;
+}
+
 static inline bool cpu_has_vmx_mpx(void)
 {
 	return vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_BNDCFGS;
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 8a7af02d466e..ecf72394684d 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -7004,7 +7004,8 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
 		VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
 		VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
 		VM_EXIT_SAVE_VMX_PREEMPTION_TIMER | VM_EXIT_ACK_INTR_ON_EXIT |
-		VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
+		VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL |
+		VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL;
 
 	/* We support free control of debug control saving. */
 	msrs->exit_ctls_low &= ~VM_EXIT_SAVE_DEBUG_CONTROLS;
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 2a5f79206b02..04a893e56135 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -294,6 +294,11 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	u32 msr = msr_info->index;
 
 	switch (msr) {
+	case MSR_CORE_PERF_GLOBAL_CTRL:
+		if (kvm_mediated_pmu_enabled(vcpu))
+			pmu->global_ctrl = vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL);
+		msr_info->data = pmu->global_ctrl;
+		break;
 	case MSR_CORE_PERF_FIXED_CTR_CTRL:
 		msr_info->data = pmu->fixed_ctr_ctrl;
 		break;
@@ -339,6 +344,11 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	u64 reserved_bits, diff;
 
 	switch (msr) {
+	case MSR_CORE_PERF_GLOBAL_CTRL:
+		if (kvm_mediated_pmu_enabled(vcpu))
+			vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL,
+				     pmu->global_ctrl);
+		break;
 	case MSR_CORE_PERF_FIXED_CTR_CTRL:
 		if (data & pmu->fixed_ctr_ctrl_rsvd)
 			return 1;
@@ -558,10 +568,37 @@ static void __intel_pmu_refresh(struct kvm_vcpu *vcpu)
 
 static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 {
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	bool mediated;
+
 	__intel_pmu_refresh(vcpu);
 
-	exec_controls_changebit(to_vmx(vcpu), CPU_BASED_RDPMC_EXITING,
+	exec_controls_changebit(vmx, CPU_BASED_RDPMC_EXITING,
 				!kvm_rdpmc_in_guest(vcpu));
+
+	mediated = kvm_mediated_pmu_enabled(vcpu);
+	if (cpu_has_load_perf_global_ctrl()) {
+		vm_entry_controls_changebit(vmx,
+			VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL, mediated);
+		/*
+		 * Initialize guest PERF_GLOBAL_CTRL to its reset value per the SDM.
+		 *
+		 * Note: GUEST_IA32_PERF_GLOBAL_CTRL must be initialized to
+		 * "BIT_ULL(pmu->nr_arch_gp_counters) - 1" instead of pmu->global_ctrl
+		 * since pmu->global_ctrl is only initialized when the guest's
+		 * pmu->version > 1. Otherwise, if pmu->version is 1, pmu->global_ctrl
+		 * is 0 and the guest counters are never really enabled.
+		 */
+		if (mediated)
+			vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL,
+				     BIT_ULL(pmu->nr_arch_gp_counters) - 1);
+	}
+
+	if (cpu_has_save_perf_global_ctrl())
+		vm_exit_controls_changebit(vmx,
+			VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL |
+			VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL, mediated);
 }
 
 static void intel_pmu_init(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ff66f17d6358..38ecf3c116bd 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4390,6 +4390,13 @@ void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
 
 	if (cpu_has_load_ia32_efer())
 		vmcs_write64(HOST_IA32_EFER, kvm_host.efer);
+
+	/*
+	 * Initialize host PERF_GLOBAL_CTRL to 0 to disable all counters
+	 * immediately at VM-exit. The mediated vPMU then calls perf_guest_exit()
+	 * to re-enable host perf events.
+	 */
+	vmcs_write64(HOST_IA32_PERF_GLOBAL_CTRL, 0);
 }
 
 void set_cr4_guest_host_mask(struct vcpu_vmx *vmx)
@@ -4457,7 +4464,8 @@ static u32 vmx_get_initial_vmexit_ctrl(void)
 				 VM_EXIT_CLEAR_IA32_RTIT_CTL);
 	/* Loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically */
 	return vmexit_ctrl &
-		~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER);
+		~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER |
+		  VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL);
 }
 
 void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
@@ -7196,6 +7204,9 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
 	struct perf_guest_switch_msr *msrs;
 	struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
 
+	if (kvm_mediated_pmu_enabled(&vmx->vcpu))
+		return;
+
 	pmu->host_cross_mapped_mask = 0;
 	if (pmu->pebs_enable & pmu->global_ctrl)
 		intel_pmu_cross_mapped_check(pmu);
@@ -8451,6 +8462,15 @@ __init int vmx_hardware_setup(void)
 		enable_sgx = false;
 #endif
 
+	/*
+	 * All CPUs that support a mediated PMU are expected to support loading
+	 * and saving PERF_GLOBAL_CTRL via dedicated VMCS fields.
+	 */
+	if (enable_mediated_pmu &&
+	    (WARN_ON_ONCE(!cpu_has_load_perf_global_ctrl() ||
+			  !cpu_has_save_perf_global_ctrl())))
+		enable_mediated_pmu = false;
+
 	/*
 	 * set_apic_access_page_addr() is used to reload apic access
 	 * page upon invalidation.  No need to do anything if not
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 5c505af553c8..b282165f98a6 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -510,7 +510,8 @@ static inline u8 vmx_get_rvi(void)
 	       VM_EXIT_LOAD_IA32_EFER |					\
 	       VM_EXIT_CLEAR_BNDCFGS |					\
 	       VM_EXIT_PT_CONCEAL_PIP |					\
-	       VM_EXIT_CLEAR_IA32_RTIT_CTL)
+	       VM_EXIT_CLEAR_IA32_RTIT_CTL |				\
+	       VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL)
 
 #define KVM_REQUIRED_VMX_PIN_BASED_VM_EXEC_CONTROL			\
 	(PIN_BASED_EXT_INTR_MASK |					\
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 22/38] KVM: x86/pmu: Optimize intel/amd_pmu_refresh() helpers
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (20 preceding siblings ...)
  2025-03-24 17:31 ` [PATCH v4 21/38] KVM: x86/pmu/vmx: Save/load guest IA32_PERF_GLOBAL_CTRL with vm_exit/entry_ctrl Mingwei Zhang
@ 2025-03-24 17:31 ` Mingwei Zhang
  2025-05-15  0:37   ` Sean Christopherson
  2025-03-24 17:31 ` [PATCH v4 23/38] KVM: x86/pmu: Configure the interception of PMU MSRs Mingwei Zhang
                   ` (18 subsequent siblings)
  40 siblings, 1 reply; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:31 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Dapeng Mi <dapeng1.mi@linux.intel.com>

Currently pmu->global_ctrl is initialized in the common kvm_pmu_refresh()
helper since both Intel and AMD CPUs set the enable bits for all GP
counters in the PERF_GLOBAL_CTRL MSR. But that may not be the best place
to initialize pmu->global_ctrl. Strictly speaking, pmu->global_ctrl is
vendor specific and there is already a lot of global_ctrl related
processing in the intel/amd_pmu_refresh() helpers, so it is better to
handle it in the same place. Thus move the pmu->global_ctrl
initialization into the intel/amd_pmu_refresh() helpers.

Besides, intel_pmu_refresh() doesn't handle global_ctrl_rsvd and
global_status_rsvd properly; fix that as well.
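
For example, on a PMU with 6 GP counters the RESET value works out to
0x3f. A sketch of the AMD-side initialization (the Intel side additionally
carves the fixed-counter bits out of the reserved mask):

	/* All GP counter enable bits are set at RESET, e.g. 0x3f for 6 counters. */
	pmu->global_ctrl = BIT_ULL(pmu->nr_arch_gp_counters) - 1;
	pmu->global_ctrl_rsvd = ~pmu->global_ctrl;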

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/pmu.c           | 10 -------
 arch/x86/kvm/svm/pmu.c       | 14 +++++++--
 arch/x86/kvm/vmx/pmu_intel.c | 55 ++++++++++++++++++------------------
 3 files changed, 39 insertions(+), 40 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 4e8cefcce7ab..2ac4c039de8b 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -843,16 +843,6 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
 		return;
 
 	kvm_pmu_call(refresh)(vcpu);
-
-	/*
-	 * At RESET, both Intel and AMD CPUs set all enable bits for general
-	 * purpose counters in IA32_PERF_GLOBAL_CTRL (so that software that
-	 * was written for v1 PMUs don't unknowingly leave GP counters disabled
-	 * in the global controls).  Emulate that behavior when refreshing the
-	 * PMU so that userspace doesn't need to manually set PERF_GLOBAL_CTRL.
-	 */
-	if (kvm_pmu_has_perf_global_ctrl(pmu) && pmu->nr_arch_gp_counters)
-		pmu->global_ctrl = GENMASK_ULL(pmu->nr_arch_gp_counters - 1, 0);
 }
 
 void kvm_pmu_init(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index 153972e944eb..eba086ef5eca 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -198,12 +198,20 @@ static void __amd_pmu_refresh(struct kvm_vcpu *vcpu)
 	pmu->nr_arch_gp_counters = min_t(unsigned int, pmu->nr_arch_gp_counters,
 					 kvm_pmu_cap.num_counters_gp);
 
-	if (pmu->version > 1) {
-		pmu->global_ctrl_rsvd = ~((1ull << pmu->nr_arch_gp_counters) - 1);
+	if (kvm_pmu_cap.version > 1) {
+		/*
+		 * At RESET, AMD CPUs set all enable bits for general purpose counters in
+		 * IA32_PERF_GLOBAL_CTRL (so that software that was written for v1 PMUs
+		 * don't unknowingly leave GP counters disabled in the global controls).
+		 * Emulate that behavior when refreshing the PMU so that userspace doesn't
+		 * need to manually set PERF_GLOBAL_CTRL.
+		 */
+		pmu->global_ctrl = BIT_ULL(pmu->nr_arch_gp_counters) - 1;
+		pmu->global_ctrl_rsvd = ~pmu->global_ctrl;
 		pmu->global_status_rsvd = pmu->global_ctrl_rsvd;
 	}
 
-	pmu->counter_bitmask[KVM_PMC_GP] = ((u64)1 << 48) - 1;
+	pmu->counter_bitmask[KVM_PMC_GP] = BIT_ULL(48) - 1;
 	pmu->reserved_bits = 0xfffffff000280000ull;
 	pmu->raw_event_mask = AMD64_RAW_EVENT_MASK;
 	/* not applicable to AMD; but clean them to prevent any fall out */
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 04a893e56135..c30c6c5e36c8 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -466,7 +466,6 @@ static void __intel_pmu_refresh(struct kvm_vcpu *vcpu)
 	union cpuid10_eax eax;
 	union cpuid10_edx edx;
 	u64 perf_capabilities;
-	u64 counter_rsvd;
 
 	memset(&lbr_desc->records, 0, sizeof(lbr_desc->records));
 
@@ -493,11 +492,10 @@ static void __intel_pmu_refresh(struct kvm_vcpu *vcpu)
 					 kvm_pmu_cap.num_counters_gp);
 	eax.split.bit_width = min_t(int, eax.split.bit_width,
 				    kvm_pmu_cap.bit_width_gp);
-	pmu->counter_bitmask[KVM_PMC_GP] = ((u64)1 << eax.split.bit_width) - 1;
+	pmu->counter_bitmask[KVM_PMC_GP] = BIT_ULL(eax.split.bit_width) - 1;
 	eax.split.mask_length = min_t(int, eax.split.mask_length,
 				      kvm_pmu_cap.events_mask_len);
-	pmu->available_event_types = ~entry->ebx &
-					((1ull << eax.split.mask_length) - 1);
+	pmu->available_event_types = ~entry->ebx & (BIT_ULL(eax.split.mask_length) - 1);
 
 	if (pmu->version == 1) {
 		pmu->nr_arch_fixed_counters = 0;
@@ -506,29 +504,34 @@ static void __intel_pmu_refresh(struct kvm_vcpu *vcpu)
 						    kvm_pmu_cap.num_counters_fixed);
 		edx.split.bit_width_fixed = min_t(int, edx.split.bit_width_fixed,
 						  kvm_pmu_cap.bit_width_fixed);
-		pmu->counter_bitmask[KVM_PMC_FIXED] =
-			((u64)1 << edx.split.bit_width_fixed) - 1;
+		pmu->counter_bitmask[KVM_PMC_FIXED] = BIT_ULL(edx.split.bit_width_fixed) - 1;
 	}
 
 	intel_pmu_enable_fixed_counter_bits(pmu, INTEL_FIXED_0_KERNEL |
 						 INTEL_FIXED_0_USER |
 						 INTEL_FIXED_0_ENABLE_PMI);
 
-	counter_rsvd = ~(((1ull << pmu->nr_arch_gp_counters) - 1) |
-		(((1ull << pmu->nr_arch_fixed_counters) - 1) << KVM_FIXED_PMC_BASE_IDX));
-	pmu->global_ctrl_rsvd = counter_rsvd;
+	if (kvm_pmu_has_perf_global_ctrl(pmu)) {
+		/*
+		 * At RESET, Intel CPUs set all enable bits for general purpose counters
+		 * in IA32_PERF_GLOBAL_CTRL. Emulate this behavior.
+		 */
+		pmu->global_ctrl = BIT_ULL(pmu->nr_arch_gp_counters) - 1;
+		pmu->global_ctrl_rsvd = ~((BIT_ULL(pmu->nr_arch_gp_counters) - 1) |
+					  ((BIT_ULL(pmu->nr_arch_fixed_counters) - 1) <<
+					   KVM_FIXED_PMC_BASE_IDX));
 
-	/*
-	 * GLOBAL_STATUS and GLOBAL_OVF_CONTROL (a.k.a. GLOBAL_STATUS_RESET)
-	 * share reserved bit definitions.  The kernel just happens to use
-	 * OVF_CTRL for the names.
-	 */
-	pmu->global_status_rsvd = pmu->global_ctrl_rsvd
-			& ~(MSR_CORE_PERF_GLOBAL_OVF_CTRL_OVF_BUF |
-			    MSR_CORE_PERF_GLOBAL_OVF_CTRL_COND_CHGD);
-	if (vmx_pt_mode_is_host_guest())
-		pmu->global_status_rsvd &=
-				~MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI;
+		/*
+		 * GLOBAL_STATUS and GLOBAL_OVF_CONTROL (a.k.a. GLOBAL_STATUS_RESET)
+		 * share reserved bit definitions.  The kernel just happens to use
+		 * OVF_CTRL for the names.
+		 */
+		pmu->global_status_rsvd = pmu->global_ctrl_rsvd &
+					  ~(MSR_CORE_PERF_GLOBAL_OVF_CTRL_OVF_BUF |
+					    MSR_CORE_PERF_GLOBAL_OVF_CTRL_COND_CHGD);
+		if (vmx_pt_mode_is_host_guest())
+			pmu->global_status_rsvd &= ~MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI;
+	}
 
 	entry = kvm_find_cpuid_entry_index(vcpu, 7, 0);
 	if (entry &&
@@ -538,10 +541,9 @@ static void __intel_pmu_refresh(struct kvm_vcpu *vcpu)
 		pmu->raw_event_mask |= (HSW_IN_TX|HSW_IN_TX_CHECKPOINTED);
 	}
 
-	bitmap_set(pmu->all_valid_pmc_idx,
-		0, pmu->nr_arch_gp_counters);
-	bitmap_set(pmu->all_valid_pmc_idx,
-		INTEL_PMC_MAX_GENERIC, pmu->nr_arch_fixed_counters);
+	bitmap_set(pmu->all_valid_pmc_idx, 0, pmu->nr_arch_gp_counters);
+	bitmap_set(pmu->all_valid_pmc_idx, INTEL_PMC_MAX_GENERIC,
+		   pmu->nr_arch_fixed_counters);
 
 	perf_capabilities = vcpu_get_perf_capabilities(vcpu);
 	if (cpuid_model_is_consistent(vcpu) &&
@@ -555,13 +557,12 @@ static void __intel_pmu_refresh(struct kvm_vcpu *vcpu)
 
 	if (perf_capabilities & PERF_CAP_PEBS_FORMAT) {
 		if (perf_capabilities & PERF_CAP_PEBS_BASELINE) {
-			pmu->pebs_enable_rsvd = counter_rsvd;
+			pmu->pebs_enable_rsvd = pmu->global_ctrl_rsvd;
 			pmu->reserved_bits &= ~ICL_EVENTSEL_ADAPTIVE;
 			pmu->pebs_data_cfg_rsvd = ~0xff00000full;
 			intel_pmu_enable_fixed_counter_bits(pmu, ICL_FIXED_0_ADAPTIVE);
 		} else {
-			pmu->pebs_enable_rsvd =
-				~((1ull << pmu->nr_arch_gp_counters) - 1);
+			pmu->pebs_enable_rsvd = ~(BIT_ULL(pmu->nr_arch_gp_counters) - 1);
 		}
 	}
 }
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 23/38] KVM: x86/pmu: Configure the interception of PMU MSRs
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (21 preceding siblings ...)
  2025-03-24 17:31 ` [PATCH v4 22/38] KVM: x86/pmu: Optimize intel/amd_pmu_refresh() helpers Mingwei Zhang
@ 2025-03-24 17:31 ` Mingwei Zhang
  2025-05-15  0:41   ` Sean Christopherson
  2025-05-16 13:34   ` Sean Christopherson
  2025-03-24 17:31 ` [PATCH v4 24/38] KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot() Mingwei Zhang
                   ` (17 subsequent siblings)
  40 siblings, 2 replies; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:31 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Dapeng Mi <dapeng1.mi@linux.intel.com>

Add the helper intel_pmu_update_msr_intercepts() to configure the
interception of PMU MSRs.

For the mediated vPMU, intercept all of the guest-owned GP counter
EVENTSELx MSRs and the fixed counter FIX_CTR_CTRL MSR (Intel only). This
is because KVM needs to intercept the event configuration and filter out
malicious guest events and events that might cause CPU glitches.

In addition, pass through all of the guest-owned perf counter MSRs to
reduce the performance impact. Note that PMU MSRs not owned by the guest
are always intercepted; accessing them always causes a #GP.

As for the globally shared MSRs, pass them through to the guest only if
the guest owns all PMU resources. Otherwise, intercept them all to keep
the guest from accessing host-owned counters.
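
A condensed sketch of the per-counter policy on the Intel side (the full
loops, including the fixed counters and the counters not exposed to the
guest, are in the hunk below):

	bool intercept = !kvm_mediated_pmu_enabled(vcpu);

	/*
	 * For each guest-owned GP counter i: the counter MSR is passed
	 * through, while its aliased full-width MSR additionally requires
	 * FW_WRITES to be enabled for the guest.
	 */
	vmx_set_intercept_for_msr(vcpu, MSR_IA32_PERFCTR0 + i, MSR_TYPE_RW,
				  intercept);
	vmx_set_intercept_for_msr(vcpu, MSR_IA32_PMC0 + i, MSR_TYPE_RW,
				  intercept || !fw_writes_is_enabled(vcpu));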

Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Co-developed-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/include/asm/msr-index.h |  1 +
 arch/x86/kvm/svm/pmu.c           | 63 ++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/pmu_intel.c     | 44 ++++++++++++++++++++++
 3 files changed, 108 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 337f4b0a2998..a4d8356e9b53 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -719,6 +719,7 @@
 #define MSR_AMD64_PERF_CNTR_GLOBAL_STATUS	0xc0000300
 #define MSR_AMD64_PERF_CNTR_GLOBAL_CTL		0xc0000301
 #define MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR	0xc0000302
+#define MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET	0xc0000303
 
 /* AMD Last Branch Record MSRs */
 #define MSR_AMD64_LBR_SELECT			0xc000010e
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index eba086ef5eca..4fc809c74ba8 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -220,6 +220,67 @@ static void __amd_pmu_refresh(struct kvm_vcpu *vcpu)
 	bitmap_set(pmu->all_valid_pmc_idx, 0, pmu->nr_arch_gp_counters);
 }
 
+static void amd_pmu_update_msr_intercepts(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct vcpu_svm *svm = to_svm(vcpu);
+	int msr_clear = !!(kvm_mediated_pmu_enabled(vcpu));
+	int i;
+
+	for (i = 0; i < min(pmu->nr_arch_gp_counters, AMD64_NUM_COUNTERS); i++) {
+		/*
+		 * Legacy counters are always available irrespective of any
+		 * CPUID feature bits and when X86_FEATURE_PERFCTR_CORE is set,
+		 * PERF_LEGACY_CTLx and PERF_LEGACY_CTRx registers are mirrored
+		 * with PERF_CTLx and PERF_CTRx respectively.
+		 */
+		set_msr_interception(vcpu, svm->msrpm, MSR_K7_EVNTSEL0 + i, 0, 0);
+		set_msr_interception(vcpu, svm->msrpm, MSR_K7_PERFCTR0 + i,
+				     msr_clear, msr_clear);
+	}
+
+	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
+		/*
+		 * PERF_CTLx registers require interception in order to clear
+		 * HostOnly bit and set GuestOnly bit. This is to prevent the
+		 * PERF_CTRx registers from counting before VM entry and after
+		 * VM exit.
+		 */
+		set_msr_interception(vcpu, svm->msrpm, MSR_F15H_PERF_CTL + 2 * i, 0, 0);
+		/*
+		 * Pass through counters exposed to the guest and intercept
+		 * counters that are unexposed. Do this explicitly since this
+		 * function may be called multiple times before the vCPU runs.
+		 */
+		set_msr_interception(vcpu, svm->msrpm, MSR_F15H_PERF_CTR + 2 * i,
+				     msr_clear, msr_clear);
+	}
+
+	for ( ; i < kvm_pmu_cap.num_counters_gp; i++) {
+		set_msr_interception(vcpu, svm->msrpm, MSR_F15H_PERF_CTL + 2 * i, 0, 0);
+		set_msr_interception(vcpu, svm->msrpm, MSR_F15H_PERF_CTR + 2 * i, 0, 0);
+	}
+
+	/*
+	 * In mediated vPMU, intercept global PMU MSRs when guest PMU only owns
+	 * a subset of counters provided in HW or its version is less than 2.
+	 */
+	if (kvm_mediated_pmu_enabled(vcpu) && kvm_pmu_has_perf_global_ctrl(pmu) &&
+	    pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp)
+		msr_clear = 1;
+	else
+		msr_clear = 0;
+
+	set_msr_interception(vcpu, svm->msrpm, MSR_AMD64_PERF_CNTR_GLOBAL_CTL,
+			     msr_clear, msr_clear);
+	set_msr_interception(vcpu, svm->msrpm, MSR_AMD64_PERF_CNTR_GLOBAL_STATUS,
+			     msr_clear, msr_clear);
+	set_msr_interception(vcpu, svm->msrpm, MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR,
+			     msr_clear, msr_clear);
+	set_msr_interception(vcpu, svm->msrpm, MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET,
+			     msr_clear, msr_clear);
+}
+
 static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -230,6 +291,8 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
 		svm_clr_intercept(svm, INTERCEPT_RDPMC);
 	else
 		svm_set_intercept(svm, INTERCEPT_RDPMC);
+
+	amd_pmu_update_msr_intercepts(vcpu);
 }
 
 static void amd_pmu_init(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index c30c6c5e36c8..450f9e5b9e40 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -567,6 +567,48 @@ static void __intel_pmu_refresh(struct kvm_vcpu *vcpu)
 	}
 }
 
+static void intel_pmu_update_msr_intercepts(struct kvm_vcpu *vcpu)
+{
+	bool intercept = !kvm_mediated_pmu_enabled(vcpu);
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	int i;
+
+	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
+		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PERFCTR0 + i,
+					  MSR_TYPE_RW, intercept);
+		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PMC0 + i, MSR_TYPE_RW,
+					  intercept || !fw_writes_is_enabled(vcpu));
+	}
+	for ( ; i < kvm_pmu_cap.num_counters_gp; i++) {
+		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PERFCTR0 + i,
+					  MSR_TYPE_RW, true);
+		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PMC0 + i,
+					  MSR_TYPE_RW, true);
+	}
+
+	for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
+		vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_FIXED_CTR0 + i,
+					  MSR_TYPE_RW, intercept);
+	for ( ; i < kvm_pmu_cap.num_counters_fixed; i++)
+		vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_FIXED_CTR0 + i,
+					  MSR_TYPE_RW, true);
+
+	if (kvm_mediated_pmu_enabled(vcpu) && kvm_pmu_has_perf_global_ctrl(pmu) &&
+	    vcpu_has_perf_metrics(vcpu) == kvm_host_has_perf_metrics() &&
+	    pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp &&
+	    pmu->nr_arch_fixed_counters == kvm_pmu_cap.num_counters_fixed)
+		intercept = false;
+	else
+		intercept = true;
+
+	vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_GLOBAL_STATUS,
+				  MSR_TYPE_RW, intercept);
+	vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
+				  MSR_TYPE_RW, intercept);
+	vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_GLOBAL_OVF_CTRL,
+				  MSR_TYPE_RW, intercept);
+}
+
 static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -578,6 +620,8 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 	exec_controls_changebit(vmx, CPU_BASED_RDPMC_EXITING,
 				!kvm_rdpmc_in_guest(vcpu));
 
+	intel_pmu_update_msr_intercepts(vcpu);
+
 	mediated = kvm_mediated_pmu_enabled(vcpu);
 	if (cpu_has_load_perf_global_ctrl()) {
 		vm_entry_controls_changebit(vmx,
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 24/38] KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot()
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (22 preceding siblings ...)
  2025-03-24 17:31 ` [PATCH v4 23/38] KVM: x86/pmu: Configure the interception of PMU MSRs Mingwei Zhang
@ 2025-03-24 17:31 ` Mingwei Zhang
  2025-05-16 13:35   ` Sean Christopherson
  2025-03-24 17:31 ` [PATCH v4 25/38] KVM: x86/pmu: Add AMD PMU registers to direct access list Mingwei Zhang
                   ` (16 subsequent siblings)
  40 siblings, 1 reply; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:31 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

Reject PMU MSRs explicitly in vmx_get_passthrough_msr_slot() since
interception of PMU MSRs is handled specially in
intel_pmu_update_msr_intercepts().

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/kvm/vmx/vmx.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 38ecf3c116bd..7bb16bed08da 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -165,7 +165,7 @@ module_param(allow_smaller_maxphyaddr, bool, S_IRUGO);
 
 /*
  * List of MSRs that can be directly passed to the guest.
- * In addition to these x2apic, PT and LBR MSRs are handled specially.
+ * In addition to these x2apic, PMU, PT and LBR MSRs are handled specially.
  */
 static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
 	MSR_IA32_SPEC_CTRL,
@@ -691,6 +691,16 @@ static int vmx_get_passthrough_msr_slot(u32 msr)
 	case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 8:
 	case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8:
 		/* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
+	case MSR_IA32_PMC0 ...
+		MSR_IA32_PMC0 + KVM_MAX_NR_GP_COUNTERS - 1:
+	case MSR_IA32_PERFCTR0 ...
+		MSR_IA32_PERFCTR0 + KVM_MAX_NR_GP_COUNTERS - 1:
+	case MSR_CORE_PERF_FIXED_CTR0 ...
+		MSR_CORE_PERF_FIXED_CTR0 + KVM_MAX_NR_FIXED_COUNTERS - 1:
+	case MSR_CORE_PERF_GLOBAL_STATUS:
+	case MSR_CORE_PERF_GLOBAL_CTRL:
+	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
+		/* PMU MSRs. These are handled in intel_pmu_update_msr_intercepts() */
 		return -ENOENT;
 	}
 
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 25/38] KVM: x86/pmu: Add AMD PMU registers to direct access list
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (23 preceding siblings ...)
  2025-03-24 17:31 ` [PATCH v4 24/38] KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot() Mingwei Zhang
@ 2025-03-24 17:31 ` Mingwei Zhang
  2025-05-16 13:36   ` Sean Christopherson
  2025-03-24 17:31 ` [PATCH v4 26/38] KVM: x86/pmu: Introduce eventsel_hw to prepare for pmu event filtering Mingwei Zhang
                   ` (15 subsequent siblings)
  40 siblings, 1 reply; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:31 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Sandipan Das <sandipan.das@amd.com>

Add all PMU-related MSRs (including the legacy K7 MSRs) to the list of
possible direct-access MSRs.  Most of them will not be intercepted when
using the mediated vPMU.

Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/svm/svm.c | 24 ++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.h |  2 +-
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index a713c803a3a3..bff351992468 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -143,6 +143,30 @@ static const struct svm_direct_access_msrs {
 	{ .index = X2APIC_MSR(APIC_TMICT),		.always = false },
 	{ .index = X2APIC_MSR(APIC_TMCCT),		.always = false },
 	{ .index = X2APIC_MSR(APIC_TDCR),		.always = false },
+	{ .index = MSR_K7_EVNTSEL0,			.always = false },
+	{ .index = MSR_K7_PERFCTR0,			.always = false },
+	{ .index = MSR_K7_EVNTSEL1,			.always = false },
+	{ .index = MSR_K7_PERFCTR1,			.always = false },
+	{ .index = MSR_K7_EVNTSEL2,			.always = false },
+	{ .index = MSR_K7_PERFCTR2,			.always = false },
+	{ .index = MSR_K7_EVNTSEL3,			.always = false },
+	{ .index = MSR_K7_PERFCTR3,			.always = false },
+	{ .index = MSR_F15H_PERF_CTL0,			.always = false },
+	{ .index = MSR_F15H_PERF_CTR0,			.always = false },
+	{ .index = MSR_F15H_PERF_CTL1,			.always = false },
+	{ .index = MSR_F15H_PERF_CTR1,			.always = false },
+	{ .index = MSR_F15H_PERF_CTL2,			.always = false },
+	{ .index = MSR_F15H_PERF_CTR2,			.always = false },
+	{ .index = MSR_F15H_PERF_CTL3,			.always = false },
+	{ .index = MSR_F15H_PERF_CTR3,			.always = false },
+	{ .index = MSR_F15H_PERF_CTL4,			.always = false },
+	{ .index = MSR_F15H_PERF_CTR4,			.always = false },
+	{ .index = MSR_F15H_PERF_CTL5,			.always = false },
+	{ .index = MSR_F15H_PERF_CTR5,			.always = false },
+	{ .index = MSR_AMD64_PERF_CNTR_GLOBAL_CTL,	.always = false },
+	{ .index = MSR_AMD64_PERF_CNTR_GLOBAL_STATUS,	.always = false },
+	{ .index = MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR,	.always = false },
+	{ .index = MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET,	.always = false },
 	{ .index = MSR_INVALID,				.always = false },
 };
 
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 9d7cdb8fbf87..ae71bf5f12d0 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -44,7 +44,7 @@ static inline struct page *__sme_pa_to_page(unsigned long pa)
 #define	IOPM_SIZE PAGE_SIZE * 3
 #define	MSRPM_SIZE PAGE_SIZE * 2
 
-#define MAX_DIRECT_ACCESS_MSRS	48
+#define MAX_DIRECT_ACCESS_MSRS	72
 #define MSRPM_OFFSETS	32
 extern u32 msrpm_offsets[MSRPM_OFFSETS] __read_mostly;
 extern bool npt_enabled;
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 26/38] KVM: x86/pmu: Introduce eventsel_hw to prepare for pmu event filtering
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (24 preceding siblings ...)
  2025-03-24 17:31 ` [PATCH v4 25/38] KVM: x86/pmu: Add AMD PMU registers to direct access list Mingwei Zhang
@ 2025-03-24 17:31 ` Mingwei Zhang
  2025-05-15  0:42   ` Sean Christopherson
  2025-03-24 17:31 ` [PATCH v4 27/38] KVM: x86/pmu: Handle PMU MSRs interception and " Mingwei Zhang
                   ` (14 subsequent siblings)
  40 siblings, 1 reply; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:31 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

Introduce eventsel_hw and fixed_ctr_ctrl_hw to store the values that are
actually programmed into the HW PMU event selector MSRs. The mediated PMU
checks events before allowing the event values to be written to the PMU
MSRs. However, to match HW behavior, when a PMU event check fails, KVM
should still allow the guest to read back the value it wrote.

This essentially requires an extra variable to separate the guest-requested
value from the actual PMU MSR value. Note this only applies to event
selectors.
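
A minimal sketch of the intended split (mirroring the set_msr hunks below;
the filtering that makes the _hw copy diverge from the guest-written value
is added in a later patch):

	/* Guest WRMSR: record the guest's value and, for now, the HW value. */
	pmc->eventsel = data;
	pmc->eventsel_hw = data;

	/*
	 * Guest RDMSR: always reflect what the guest wrote, even if the
	 * event is later stripped from eventsel_hw by the event filter.
	 */
	msr_info->data = pmc->eventsel;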

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/include/asm/kvm_host.h | 2 ++
 arch/x86/kvm/pmu.c              | 7 +++++--
 arch/x86/kvm/svm/pmu.c          | 1 +
 arch/x86/kvm/vmx/pmu_intel.c    | 2 ++
 4 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4b3bfefc2d05..7ee74bbbb0aa 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -524,6 +524,7 @@ struct kvm_pmc {
 	 */
 	u64 emulated_counter;
 	u64 eventsel;
+	u64 eventsel_hw;
 	struct perf_event *perf_event;
 	struct kvm_vcpu *vcpu;
 	/*
@@ -552,6 +553,7 @@ struct kvm_pmu {
 	unsigned nr_arch_fixed_counters;
 	unsigned available_event_types;
 	u64 fixed_ctr_ctrl;
+	u64 fixed_ctr_ctrl_hw;
 	u64 fixed_ctr_ctrl_rsvd;
 	/*
 	 * kvm_pmu_sync_global_ctrl_from_vmcs() must be called to update
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 2ac4c039de8b..63143eeb5c44 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -794,11 +794,14 @@ static void kvm_pmu_reset(struct kvm_vcpu *vcpu)
 		pmc->counter = 0;
 		pmc->emulated_counter = 0;
 
-		if (pmc_is_gp(pmc))
+		if (pmc_is_gp(pmc)) {
 			pmc->eventsel = 0;
+			pmc->eventsel_hw = 0;
+		}
 	}
 
-	pmu->fixed_ctr_ctrl = pmu->global_ctrl = pmu->global_status = 0;
+	pmu->fixed_ctr_ctrl = pmu->fixed_ctr_ctrl_hw = 0;
+	pmu->global_ctrl = pmu->global_status = 0;
 
 	kvm_pmu_call(reset)(vcpu);
 }
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index 4fc809c74ba8..9feaca739b96 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -165,6 +165,7 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		data &= ~pmu->reserved_bits;
 		if (data != pmc->eventsel) {
 			pmc->eventsel = data;
+			pmc->eventsel_hw = data;
 			kvm_pmu_request_counter_reprogram(pmc);
 		}
 		return 0;
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 450f9e5b9e40..796b7bc4affe 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -41,6 +41,7 @@ static void reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data)
 	int i;
 
 	pmu->fixed_ctr_ctrl = data;
+	pmu->fixed_ctr_ctrl_hw = data;
 	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
 		u8 new_ctrl = fixed_ctrl_field(data, i);
 		u8 old_ctrl = fixed_ctrl_field(old_fixed_ctr_ctrl, i);
@@ -403,6 +404,7 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 
 			if (data != pmc->eventsel) {
 				pmc->eventsel = data;
+				pmc->eventsel_hw = data;
 				kvm_pmu_request_counter_reprogram(pmc);
 			}
 			break;
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 27/38] KVM: x86/pmu: Handle PMU MSRs interception and event filtering
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (25 preceding siblings ...)
  2025-03-24 17:31 ` [PATCH v4 26/38] KVM: x86/pmu: Introduce eventsel_hw to prepare for pmu event filtering Mingwei Zhang
@ 2025-03-24 17:31 ` Mingwei Zhang
  2025-05-15  0:43   ` Sean Christopherson
  2025-05-16  1:26   ` Mi, Dapeng
  2025-03-24 17:31 ` [PATCH v4 28/38] KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest writes to event selectors Mingwei Zhang
                   ` (13 subsequent siblings)
  40 siblings, 2 replies; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:31 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Dapeng Mi <dapeng1.mi@linux.intel.com>

The mediated vPMU needs to intercept the EVENTSELx and FIXED_CNTR_CTRL
MSRs to filter out malicious guest perf events. Either writing these MSRs
or updating the event filters eventually calls reprogram_counter(). Thus,
check in reprogram_counter() whether the guest event should be filtered
out. If so, clear the corresponding EVENTSELx enable bit or
FIXED_CNTR_CTRL field to ensure the guest event is never actually enabled
at VM-entry.

Besides, the mediated vPMU intercepts the MSRs of counters that are not
owned by the guest; for those, KVM simply needs to read/write
pmc->counter.
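
The counter read path illustrates the same principle: with the mediated
PMU there is no perf_event backing a guest counter, so an intercepted
access reduces to the cached pmc->counter (a sketch of the
pmc_read_counter() hunk below):

	if (kvm_mediated_pmu_enabled(pmc->vcpu))
		return pmc->counter & pmc_bitmask(pmc);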

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Co-developed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/pmu.c | 27 +++++++++++++++++++++++++++
 arch/x86/kvm/pmu.h |  3 +++
 2 files changed, 30 insertions(+)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 63143eeb5c44..e9100dc49fdc 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -305,6 +305,11 @@ static void pmc_update_sample_period(struct kvm_pmc *pmc)
 
 void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
 {
+	if (kvm_mediated_pmu_enabled(pmc->vcpu)) {
+		pmc->counter = val & pmc_bitmask(pmc);
+		return;
+	}
+
 	/*
 	 * Drop any unconsumed accumulated counts, the WRMSR is a write, not a
 	 * read-modify-write.  Adjust the counter value so that its value is
@@ -455,6 +460,28 @@ static int reprogram_counter(struct kvm_pmc *pmc)
 	bool emulate_overflow;
 	u8 fixed_ctr_ctrl;
 
+	if (kvm_mediated_pmu_enabled(pmu_to_vcpu(pmu))) {
+		bool allowed = check_pmu_event_filter(pmc);
+
+		if (pmc_is_gp(pmc)) {
+			if (allowed)
+				pmc->eventsel_hw |= pmc->eventsel &
+						    ARCH_PERFMON_EVENTSEL_ENABLE;
+			else
+				pmc->eventsel_hw &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
+		} else {
+			int idx = pmc->idx - KVM_FIXED_PMC_BASE_IDX;
+
+			if (allowed)
+				pmu->fixed_ctr_ctrl_hw = pmu->fixed_ctr_ctrl;
+			else
+				pmu->fixed_ctr_ctrl_hw &=
+					~intel_fixed_bits_by_idx(idx, 0xf);
+		}
+
+		return 0;
+	}
+
 	emulate_overflow = pmc_pause_counter(pmc);
 
 	if (!pmc_event_is_allowed(pmc))
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 509c995b7871..6289f523d893 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -113,6 +113,9 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
 {
 	u64 counter, enabled, running;
 
+	if (kvm_mediated_pmu_enabled(pmc->vcpu))
+		return pmc->counter & pmc_bitmask(pmc);
+
 	counter = pmc->counter + pmc->emulated_counter;
 
 	if (pmc->perf_event && !pmc->is_paused)
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 28/38] KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest writes to event selectors
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (26 preceding siblings ...)
  2025-03-24 17:31 ` [PATCH v4 27/38] KVM: x86/pmu: Handle PMU MSRs interception and " Mingwei Zhang
@ 2025-03-24 17:31 ` Mingwei Zhang
  2025-03-24 17:31 ` [PATCH v4 29/38] KVM: x86/pmu: Switch host/guest PMU context at vm-exit/vm-entry Mingwei Zhang
                   ` (12 subsequent siblings)
  40 siblings, 0 replies; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:31 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Sandipan Das <sandipan.das@amd.com>

On AMD platforms, there is no way to automatically restore
PerfCntrGlobalCtl at VM-entry or clear it at VM-exit. Since the register
state is restored before entering and saved after exiting guest context,
the counters can keep ticking and even overflow, leading to chaos while
still in host context.

To avoid this, intercept the event selectors, which is already done by the
mediated PMU. In addition, always set the GuestOnly bit and clear the
HostOnly bit for PMU selectors on AMD. Doing so allows the counters to run
only in guest context even if their enable bits are still set after VM-exit
and before the host/guest PMU context switch.
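
For reference, AMD64_EVENTSEL_GUESTONLY is bit 40 and AMD64_EVENTSEL_HOSTONLY
is bit 41 of the event selector. A hedged sketch of the hardware value that
results from a guest write of 'data':

	/* Force guest-only counting regardless of what the guest asked for. */
	u64 eventsel_hw = (data & ~AMD64_EVENTSEL_HOSTONLY) |	/* clear bit 41 */
			  AMD64_EVENTSEL_GUESTONLY;		/* set bit 40 */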

Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/svm/pmu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index 9feaca739b96..1a7e3a897fdf 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -165,7 +165,8 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		data &= ~pmu->reserved_bits;
 		if (data != pmc->eventsel) {
 			pmc->eventsel = data;
-			pmc->eventsel_hw = data;
+			pmc->eventsel_hw = (data & ~AMD64_EVENTSEL_HOSTONLY) |
+					   AMD64_EVENTSEL_GUESTONLY;
 			kvm_pmu_request_counter_reprogram(pmc);
 		}
 		return 0;
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 29/38] KVM: x86/pmu: Switch host/guest PMU context at vm-exit/vm-entry
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (27 preceding siblings ...)
  2025-03-24 17:31 ` [PATCH v4 28/38] KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest writes to event selectors Mingwei Zhang
@ 2025-03-24 17:31 ` Mingwei Zhang
  2025-05-15 16:29   ` Sean Christopherson
  2025-05-16 13:26   ` Sean Christopherson
  2025-03-24 17:31 ` [PATCH v4 30/38] KVM: x86/pmu: Handle emulated instruction for mediated vPMU Mingwei Zhang
                   ` (11 subsequent siblings)
  40 siblings, 2 replies; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:31 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Dapeng Mi <dapeng1.mi@linux.intel.com>

Switch host/guest PMU context at VM-exit/VM-entry for mediated vPMU.

In detail, kvm_pmu_put_guest_context() is called at VM-exit to save the
guest PMU context and load the host PMU context, and
kvm_pmu_load_guest_context() is called at VM-entry to save the host PMU
context and load the guest PMU context.

A pair of pmu_ops callbacks, *put_guest_context() and
*load_guest_context(), is added to save/restore vendor-specific PMU
MSRs.
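
For context, the per-PMU MSR base/stride fields added here feed a small
pmc_msr_addr() helper (base + idx * cntr_shift). A hedged illustration of
the two layouts it has to cover:

	/*
	 * AMD with PERFCTR_CORE interleaves selector/counter MSRs, so the
	 * stride is 2; Intel (and legacy AMD) use contiguous ranges with a
	 * stride of 1.
	 *
	 *   AMD:   MSR_F15H_PERF_CTL0 + 2 * i  /  MSR_F15H_PERF_CTR0 + 2 * i
	 *   Intel: MSR_P6_EVNTSEL0    + 1 * i  /  MSR_IA32_PMC0      + 1 * i
	 */
	u32 eventsel_msr = pmc_msr_addr(pmu, pmu->gp_eventsel_base, i);
	u32 counter_msr  = pmc_msr_addr(pmu, pmu->gp_counter_base, i);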

Co-developed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Co-developed-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/include/asm/kvm-x86-pmu-ops.h |  2 +
 arch/x86/include/asm/kvm_host.h        |  4 ++
 arch/x86/include/asm/msr-index.h       |  1 +
 arch/x86/kvm/pmu.c                     | 96 ++++++++++++++++++++++++++
 arch/x86/kvm/pmu.h                     | 11 +++
 arch/x86/kvm/svm/pmu.c                 | 54 +++++++++++++++
 arch/x86/kvm/vmx/pmu_intel.c           | 59 ++++++++++++++++
 arch/x86/kvm/x86.c                     |  4 ++
 8 files changed, 231 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h
index 9159bf1a4730..35f27366c277 100644
--- a/arch/x86/include/asm/kvm-x86-pmu-ops.h
+++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h
@@ -22,6 +22,8 @@ KVM_X86_PMU_OP(init)
 KVM_X86_PMU_OP_OPTIONAL(reset)
 KVM_X86_PMU_OP_OPTIONAL(deliver_pmi)
 KVM_X86_PMU_OP_OPTIONAL(cleanup)
+KVM_X86_PMU_OP(put_guest_context)
+KVM_X86_PMU_OP(load_guest_context)
 
 #undef KVM_X86_PMU_OP
 #undef KVM_X86_PMU_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7ee74bbbb0aa..4117a382739a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -568,6 +568,10 @@ struct kvm_pmu {
 	u64 raw_event_mask;
 	struct kvm_pmc gp_counters[KVM_MAX_NR_GP_COUNTERS];
 	struct kvm_pmc fixed_counters[KVM_MAX_NR_FIXED_COUNTERS];
+	u32 gp_eventsel_base;
+	u32 gp_counter_base;
+	u32 fixed_base;
+	u32 cntr_shift;
 
 	/*
 	 * Overlay the bitmap with a 64-bit atomic so that all bits can be
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index a4d8356e9b53..df33a4f026a1 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1153,6 +1153,7 @@
 #define MSR_CORE_PERF_GLOBAL_STATUS	0x0000038e
 #define MSR_CORE_PERF_GLOBAL_CTRL	0x0000038f
 #define MSR_CORE_PERF_GLOBAL_OVF_CTRL	0x00000390
+#define MSR_CORE_PERF_GLOBAL_STATUS_SET	0x00000391
 
 #define MSR_PERF_METRICS		0x00000329
 
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index e9100dc49fdc..68f203454bbc 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -1127,3 +1127,99 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
 	kfree(filter);
 	return r;
 }
+
+void kvm_pmu_put_guest_pmcs(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct kvm_pmc *pmc;
+	u32 eventsel_msr;
+	u32 counter_msr;
+	u32 i;
+
+	/*
+	 * Clear the hardware selector MSRs and counters to avoid information
+	 * leakage and to prevent the guest's GP counters from being
+	 * accidentally enabled while the host is running with global ctrl
+	 * enabled.
+	 */
+	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
+		pmc = &pmu->gp_counters[i];
+		eventsel_msr = pmc_msr_addr(pmu, pmu->gp_eventsel_base, i);
+		counter_msr = pmc_msr_addr(pmu, pmu->gp_counter_base, i);
+
+		rdpmcl(i, pmc->counter);
+		rdmsrl(eventsel_msr, pmc->eventsel_hw);
+		if (pmc->counter)
+			wrmsrl(counter_msr, 0);
+		if (pmc->eventsel_hw)
+			wrmsrl(eventsel_msr, 0);
+	}
+
+	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
+		pmc = &pmu->fixed_counters[i];
+		counter_msr = pmc_msr_addr(pmu, pmu->fixed_base, i);
+
+		rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
+		if (pmc->counter)
+			wrmsrl(counter_msr, 0);
+	}
+
+}
+EXPORT_SYMBOL_GPL(kvm_pmu_put_guest_pmcs);
+
+void kvm_pmu_load_guest_pmcs(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct kvm_pmc *pmc;
+	u32 eventsel_msr;
+	u32 counter_msr;
+	u32 i;
+
+	/*
+	 * No need to zero out unexposed GP/fixed counters/selectors since RDPMC
+	 * in this case will be intercepted. Accesses to these counters and
+	 * selectors will cause #GP in the guest.
+	 */
+	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
+		pmc = &pmu->gp_counters[i];
+		eventsel_msr = pmc_msr_addr(pmu, pmu->gp_eventsel_base, i);
+		counter_msr = pmc_msr_addr(pmu, pmu->gp_counter_base, i);
+
+		wrmsrl(counter_msr, pmc->counter);
+		wrmsrl(eventsel_msr, pmc->eventsel_hw);
+	}
+	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
+		pmc = &pmu->fixed_counters[i];
+		counter_msr = pmc_msr_addr(pmu, pmu->fixed_base, i);
+
+		wrmsrl(counter_msr, pmc->counter);
+	}
+}
+EXPORT_SYMBOL_GPL(kvm_pmu_load_guest_pmcs);
+
+void kvm_pmu_put_guest_context(struct kvm_vcpu *vcpu)
+{
+	if (!kvm_mediated_pmu_enabled(vcpu))
+		return;
+
+	lockdep_assert_irqs_disabled();
+
+	kvm_pmu_call(put_guest_context)(vcpu);
+
+	perf_guest_exit();
+}
+
+void kvm_pmu_load_guest_context(struct kvm_vcpu *vcpu)
+{
+	u32 guest_lvtpc;
+
+	if (!kvm_mediated_pmu_enabled(vcpu))
+		return;
+
+	lockdep_assert_irqs_disabled();
+
+	guest_lvtpc = APIC_DM_FIXED | KVM_GUEST_PMI_VECTOR |
+		(kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVTPC) & APIC_LVT_MASKED);
+	perf_guest_enter(guest_lvtpc);
+
+	kvm_pmu_call(load_guest_context)(vcpu);
+}
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 6289f523d893..d5da3a9a3bd5 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -41,6 +41,8 @@ struct kvm_pmu_ops {
 	void (*reset)(struct kvm_vcpu *vcpu);
 	void (*deliver_pmi)(struct kvm_vcpu *vcpu);
 	void (*cleanup)(struct kvm_vcpu *vcpu);
+	void (*put_guest_context)(struct kvm_vcpu *vcpu);
+	void (*load_guest_context)(struct kvm_vcpu *vcpu);
 
 	const u64 EVENTSEL_EVENT;
 	const int MAX_NR_GP_COUNTERS;
@@ -292,6 +294,11 @@ static inline bool kvm_host_has_perf_metrics(void)
 	return !!(kvm_host.perf_capabilities & PERF_CAP_PERF_METRICS);
 }
 
+static inline u32 pmc_msr_addr(struct kvm_pmu *pmu, u32 base, int idx)
+{
+	return base + idx * pmu->cntr_shift;
+}
+
 void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu);
 void kvm_pmu_handle_event(struct kvm_vcpu *vcpu);
 int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
@@ -306,6 +313,10 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
 int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
 void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel);
 bool vcpu_pmu_can_enable(struct kvm_vcpu *vcpu);
+void kvm_pmu_put_guest_pmcs(struct kvm_vcpu *vcpu);
+void kvm_pmu_load_guest_pmcs(struct kvm_vcpu *vcpu);
+void kvm_pmu_put_guest_context(struct kvm_vcpu *vcpu);
+void kvm_pmu_load_guest_context(struct kvm_vcpu *vcpu);
 
 bool is_vmware_backdoor_pmc(u32 pmc_idx);
 bool kvm_rdpmc_in_guest(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index 1a7e3a897fdf..7e0d84d50b74 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -175,6 +175,22 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	return 1;
 }
 
+static inline void amd_update_msr_base(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+	if (kvm_pmu_has_perf_global_ctrl(pmu) ||
+	    guest_cpu_cap_has(vcpu, X86_FEATURE_PERFCTR_CORE)) {
+		pmu->gp_eventsel_base = MSR_F15H_PERF_CTL0;
+		pmu->gp_counter_base = MSR_F15H_PERF_CTR0;
+		pmu->cntr_shift = 2;
+	} else {
+		pmu->gp_eventsel_base = MSR_K7_EVNTSEL0;
+		pmu->gp_counter_base = MSR_K7_PERFCTR0;
+		pmu->cntr_shift = 1;
+	}
+}
+
 static void __amd_pmu_refresh(struct kvm_vcpu *vcpu)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -220,6 +236,8 @@ static void __amd_pmu_refresh(struct kvm_vcpu *vcpu)
 	pmu->counter_bitmask[KVM_PMC_FIXED] = 0;
 	pmu->nr_arch_fixed_counters = 0;
 	bitmap_set(pmu->all_valid_pmc_idx, 0, pmu->nr_arch_gp_counters);
+
+	amd_update_msr_base(vcpu);
 }
 
 static void amd_pmu_update_msr_intercepts(struct kvm_vcpu *vcpu)
@@ -312,6 +330,40 @@ static void amd_pmu_init(struct kvm_vcpu *vcpu)
 	}
 }
 
+
+static void amd_put_guest_context(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+	rdmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, pmu->global_ctrl);
+	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, 0);
+	rdmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS, pmu->global_status);
+
+	/* Clear global status bits if non-zero */
+	if (pmu->global_status)
+		wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR, pmu->global_status);
+
+	kvm_pmu_put_guest_pmcs(vcpu);
+}
+
+static void amd_load_guest_context(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	u64 global_status;
+
+	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, 0);
+
+	kvm_pmu_load_guest_pmcs(vcpu);
+
+	rdmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS, global_status);
+	/* Clear host global_status MSR if non-zero. */
+	if (global_status)
+		wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR, global_status);
+
+	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET, pmu->global_status);
+	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, pmu->global_ctrl);
+}
+
 struct kvm_pmu_ops amd_pmu_ops __initdata = {
 	.rdpmc_ecx_to_pmc = amd_rdpmc_ecx_to_pmc,
 	.msr_idx_to_pmc = amd_msr_idx_to_pmc,
@@ -321,6 +373,8 @@ struct kvm_pmu_ops amd_pmu_ops __initdata = {
 	.set_msr = amd_pmu_set_msr,
 	.refresh = amd_pmu_refresh,
 	.init = amd_pmu_init,
+	.put_guest_context = amd_put_guest_context,
+	.load_guest_context = amd_load_guest_context,
 	.EVENTSEL_EVENT = AMD64_EVENTSEL_EVENT,
 	.MAX_NR_GP_COUNTERS = KVM_MAX_NR_AMD_GP_COUNTERS,
 	.MIN_NR_GP_COUNTERS = AMD64_NUM_COUNTERS,
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 796b7bc4affe..ed17ab198dfb 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -460,6 +460,17 @@ static void intel_pmu_enable_fixed_counter_bits(struct kvm_pmu *pmu, u64 bits)
 		pmu->fixed_ctr_ctrl_rsvd &= ~intel_fixed_bits_by_idx(i, bits);
 }
 
+static inline void intel_update_msr_base(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+	pmu->gp_eventsel_base = MSR_P6_EVNTSEL0;
+	pmu->gp_counter_base = fw_writes_is_enabled(vcpu) ?
+			       MSR_IA32_PMC0 : MSR_IA32_PERFCTR0;
+	pmu->fixed_base = MSR_CORE_PERF_FIXED_CTR0;
+	pmu->cntr_shift = 1;
+}
+
 static void __intel_pmu_refresh(struct kvm_vcpu *vcpu)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -567,6 +578,8 @@ static void __intel_pmu_refresh(struct kvm_vcpu *vcpu)
 			pmu->pebs_enable_rsvd = ~(BIT_ULL(pmu->nr_arch_gp_counters) - 1);
 		}
 	}
+
+	intel_update_msr_base(vcpu);
 }
 
 static void intel_pmu_update_msr_intercepts(struct kvm_vcpu *vcpu)
@@ -809,6 +822,50 @@ void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu)
 	}
 }
 
+static void intel_put_guest_context(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+	/* Global ctrl register is already saved at VM-exit. */
+	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
+
+	/* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
+	if (pmu->global_status)
+		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
+
+	rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl_hw);
+
+	/*
+	 * Clear the hardware FIXED_CTR_CTRL MSR to avoid information leakage
+	 * and to prevent the guest's fixed counters from being accidentally
+	 * enabled while the host is running with global ctrl enabled.
+	 */
+	if (pmu->fixed_ctr_ctrl_hw)
+		wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
+
+	kvm_pmu_put_guest_pmcs(vcpu);
+}
+
+static void intel_load_guest_context(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	u64 global_status, toggle;
+
+	/* Clear host global_ctrl MSR if non-zero. */
+	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
+
+	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
+	toggle = pmu->global_status ^ global_status;
+	if (global_status & toggle)
+		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status & toggle);
+	if (pmu->global_status & toggle)
+		wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status & toggle);
+
+	wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl_hw);
+
+	kvm_pmu_load_guest_pmcs(vcpu);
+}
+
 struct kvm_pmu_ops intel_pmu_ops __initdata = {
 	.rdpmc_ecx_to_pmc = intel_rdpmc_ecx_to_pmc,
 	.msr_idx_to_pmc = intel_msr_idx_to_pmc,
@@ -820,6 +877,8 @@ struct kvm_pmu_ops intel_pmu_ops __initdata = {
 	.reset = intel_pmu_reset,
 	.deliver_pmi = intel_pmu_deliver_pmi,
 	.cleanup = intel_pmu_cleanup,
+	.put_guest_context = intel_put_guest_context,
+	.load_guest_context = intel_load_guest_context,
 	.EVENTSEL_EVENT = ARCH_PERFMON_EVENTSEL_EVENT,
 	.MAX_NR_GP_COUNTERS = KVM_MAX_NR_INTEL_GP_COUNTERS,
 	.MIN_NR_GP_COUNTERS = 1,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 578e5f110b6c..d35afa8d9cbb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10998,6 +10998,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		set_debugreg(0, 7);
 	}
 
+	kvm_pmu_load_guest_context(vcpu);
+
 	guest_timing_enter_irqoff();
 
 	for (;;) {
@@ -11027,6 +11029,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		++vcpu->stat.exits;
 	}
 
+	kvm_pmu_put_guest_context(vcpu);
+
 	/*
 	 * Do this here before restoring debug registers on the host.  And
 	 * since we do this before handling the vmexit, a DR access vmexit
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 30/38] KVM: x86/pmu: Handle emulated instruction for mediated vPMU
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (28 preceding siblings ...)
  2025-03-24 17:31 ` [PATCH v4 29/38] KVM: x86/pmu: Switch host/guest PMU context at vm-exit/vm-entry Mingwei Zhang
@ 2025-03-24 17:31 ` Mingwei Zhang
  2025-05-16  1:10   ` Sean Christopherson
  2025-03-24 17:31 ` [PATCH v4 31/38] KVM: nVMX: Add macros to simplify nested MSR interception setting Mingwei Zhang
                   ` (10 subsequent siblings)
  40 siblings, 1 reply; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:31 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Dapeng Mi <dapeng1.mi@linux.intel.com>

Mediated vPMU needs to accumulate emulated instructions into the counter
and load the counter into hardware at VM-entry.

Moreover, if the accumulation leads to counter overflow, KVM needs to
update GLOBAL_STATUS and inject a PMI into the guest as well.
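
As a concrete, hedged example of the overflow check used below: GP counters
are masked by pmc_bitmask(), so with a 48-bit counter an emulated increment
of a counter sitting at its maximum value wraps it to zero, which is treated
as overflow:

	u64 mask = (1ULL << 48) - 1;	/* pmc_bitmask() for 48-bit counters */
	u64 ctr  = mask;		/* one emulated event away from wrapping */

	ctr = (ctr + 1) & mask;		/* ctr == 0 -> overflow */
	if (!ctr) {
		/* set BIT_ULL(pmc->idx) in GLOBAL_STATUS, raise KVM_REQ_PMI if enabled */
	}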

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/pmu.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 68f203454bbc..f71009ec92cf 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -911,10 +911,50 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
 	kvm_pmu_reset(vcpu);
 }
 
+static bool pmc_pmi_enabled(struct kvm_pmc *pmc)
+{
+	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
+	u8 fixed_ctr_ctrl;
+	bool pmi_enabled;
+
+	if (pmc_is_gp(pmc)) {
+		pmi_enabled = pmc->eventsel & ARCH_PERFMON_EVENTSEL_INT;
+	} else {
+		fixed_ctr_ctrl = fixed_ctrl_field(pmu->fixed_ctr_ctrl,
+						  pmc->idx - KVM_FIXED_PMC_BASE_IDX);
+		pmi_enabled = fixed_ctr_ctrl & INTEL_FIXED_0_ENABLE_PMI;
+	}
+
+	return pmi_enabled;
+}
+
 static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
 {
-	pmc->emulated_counter++;
-	kvm_pmu_request_counter_reprogram(pmc);
+	struct kvm_vcpu *vcpu = pmc->vcpu;
+
+	/*
+	 * For perf-based PMUs, accumulate software-emulated events separately
+	 * from pmc->counter, as pmc->counter is offset by the count of the
+	 * associated perf event. Request reprogramming, which will consult
+	 * both emulated and hardware-generated events to detect overflow.
+	 */
+	if (!kvm_mediated_pmu_enabled(vcpu)) {
+		pmc->emulated_counter++;
+		kvm_pmu_request_counter_reprogram(pmc);
+		return;
+	}
+
+	/*
+	 * For mediated PMUs, pmc->counter is updated when the vCPU's PMU is
+	 * put, and will be loaded into hardware when the PMU is loaded. Simply
+	 * increment the counter and signal overflow if it wraps to zero.
+	 */
+	pmc->counter = (pmc->counter + 1) & pmc_bitmask(pmc);
+	if (!pmc->counter) {
+		pmc_to_pmu(pmc)->global_status |= BIT_ULL(pmc->idx);
+		if (pmc_pmi_enabled(pmc))
+			kvm_make_request(KVM_REQ_PMI, vcpu);
+	}
 }
 
 static inline bool cpl_is_matched(struct kvm_pmc *pmc)
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 31/38] KVM: nVMX: Add macros to simplify nested MSR interception setting
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (29 preceding siblings ...)
  2025-03-24 17:31 ` [PATCH v4 30/38] KVM: x86/pmu: Handle emulated instruction for mediated vPMU Mingwei Zhang
@ 2025-03-24 17:31 ` Mingwei Zhang
  2025-03-24 17:31 ` [PATCH v4 32/38] KVM: nVMX: Add nested virtualization support for mediated PMU Mingwei Zhang
                   ` (9 subsequent siblings)
  40 siblings, 0 replies; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:31 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Dapeng Mi <dapeng1.mi@linux.intel.com>

Add nested_vmx_merge_msr_bitmaps_xxx() macros to simplify nested MSR
interception setting. No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/vmx/nested.c | 35 +++++++++++++++++++----------------
 1 file changed, 19 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index ecf72394684d..cf557acf91f8 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -613,6 +613,19 @@ static inline void nested_vmx_set_intercept_for_msr(struct vcpu_vmx *vmx,
 						   msr_bitmap_l0, msr);
 }
 
+#define nested_vmx_merge_msr_bitmaps(msr, type)	\
+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,	\
+					 msr_bitmap_l0, msr, type)
+
+#define nested_vmx_merge_msr_bitmaps_read(msr) \
+	nested_vmx_merge_msr_bitmaps(msr, MSR_TYPE_R)
+
+#define nested_vmx_merge_msr_bitmaps_write(msr) \
+	nested_vmx_merge_msr_bitmaps(msr, MSR_TYPE_W)
+
+#define nested_vmx_merge_msr_bitmaps_rw(msr) \
+	nested_vmx_merge_msr_bitmaps(msr, MSR_TYPE_RW)
+
 /*
  * Merge L0's and L1's MSR bitmap, return false to indicate that
  * we do not use the hardware.
@@ -696,23 +709,13 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
 	 * other runtime changes to vmcs01's bitmap, e.g. dynamic pass-through.
 	 */
 #ifdef CONFIG_X86_64
-	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
-					 MSR_FS_BASE, MSR_TYPE_RW);
-
-	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
-					 MSR_GS_BASE, MSR_TYPE_RW);
-
-	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
-					 MSR_KERNEL_GS_BASE, MSR_TYPE_RW);
+	nested_vmx_merge_msr_bitmaps_rw(MSR_FS_BASE);
+	nested_vmx_merge_msr_bitmaps_rw(MSR_GS_BASE);
+	nested_vmx_merge_msr_bitmaps_rw(MSR_KERNEL_GS_BASE);
 #endif
-	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
-					 MSR_IA32_SPEC_CTRL, MSR_TYPE_RW);
-
-	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
-					 MSR_IA32_PRED_CMD, MSR_TYPE_W);
-
-	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
-					 MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
+	nested_vmx_merge_msr_bitmaps_rw(MSR_IA32_SPEC_CTRL);
+	nested_vmx_merge_msr_bitmaps_write(MSR_IA32_PRED_CMD);
+	nested_vmx_merge_msr_bitmaps_write(MSR_IA32_FLUSH_CMD);
 
 	kvm_vcpu_unmap(vcpu, &map);
 
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 32/38] KVM: nVMX: Add nested virtualization support for mediated PMU
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (30 preceding siblings ...)
  2025-03-24 17:31 ` [PATCH v4 31/38] KVM: nVMX: Add macros to simplify nested MSR interception setting Mingwei Zhang
@ 2025-03-24 17:31 ` Mingwei Zhang
  2025-05-16 13:33   ` Sean Christopherson
  2025-03-24 17:31 ` [PATCH v4 33/38] perf/x86/intel: Support PERF_PMU_CAP_MEDIATED_VPMU Mingwei Zhang
                   ` (8 subsequent siblings)
  40 siblings, 1 reply; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:31 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

Add nested virtualization support for mediated PMU by combining the MSR
interception bitmaps of vmcs01 and vmcs12. Readers may argue that even
without this patch, nested virtualization works for mediated PMU because
L1 will see Perfmon v2 and will have to use the legacy vPMU implementation
if it is Linux. However, any assumption made about L1 may be invalid,
e.g., L1 may not even be Linux.

If both L0 and L1 pass through PMU MSRs, the correct behavior is to let
MSR accesses from L2 touch the HW MSRs directly, since both L0 and L1
pass through the access.

However, without adding anything for nested, the current implementation
always sets the MSR interception bits in vmcs02. As a result, L0 emulates
all MSR reads/writes for L2, leading to errors, since the current mediated
vPMU never implements set_msr() and get_msr() for any counter access
except accesses from the VMM side.

So fix the issue by setting up the correct MSR interception for PMU MSRs.
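
Conceptually, the merge performed below reduces to: L2 may access a PMU MSR
directly only if neither L0 (vmcs01 bitmap) nor L1 (vmcs12 bitmap) intercepts
it. A minimal, hedged sketch of that rule:

	/* Intercept the MSR for L2 unless both L0 and L1 pass it through. */
	static bool l2_msr_intercepted(bool l0_intercepts, bool l1_intercepts)
	{
		return l0_intercepts || l1_intercepts;
	}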

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/kvm/vmx/nested.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index cf557acf91f8..dbec40cb55bc 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -626,6 +626,36 @@ static inline void nested_vmx_set_intercept_for_msr(struct vcpu_vmx *vmx,
 #define nested_vmx_merge_msr_bitmaps_rw(msr) \
 	nested_vmx_merge_msr_bitmaps(msr, MSR_TYPE_RW)
 
+/*
+ * Disable PMU MSR interception for the nested VM if both L0 and L1
+ * use the mediated vPMU.
+ */
+static void nested_vmx_merge_pmu_msr_bitmaps(struct kvm_vcpu *vcpu,
+					     unsigned long *msr_bitmap_l1,
+					     unsigned long *msr_bitmap_l0)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	int i;
+
+	if (!kvm_mediated_pmu_enabled(vcpu))
+		return;
+
+	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
+		nested_vmx_merge_msr_bitmaps_rw(MSR_ARCH_PERFMON_EVENTSEL0 + i);
+		nested_vmx_merge_msr_bitmaps_rw(MSR_IA32_PERFCTR0 + i);
+		nested_vmx_merge_msr_bitmaps_rw(MSR_IA32_PMC0 + i);
+	}
+
+	for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
+		nested_vmx_merge_msr_bitmaps_rw(MSR_CORE_PERF_FIXED_CTR0 + i);
+
+	nested_vmx_merge_msr_bitmaps_rw(MSR_CORE_PERF_FIXED_CTR_CTRL);
+	nested_vmx_merge_msr_bitmaps_rw(MSR_CORE_PERF_GLOBAL_CTRL);
+	nested_vmx_merge_msr_bitmaps_read(MSR_CORE_PERF_GLOBAL_STATUS);
+	nested_vmx_merge_msr_bitmaps_write(MSR_CORE_PERF_GLOBAL_OVF_CTRL);
+}
+
 /*
  * Merge L0's and L1's MSR bitmap, return false to indicate that
  * we do not use the hardware.
@@ -717,6 +747,8 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
 	nested_vmx_merge_msr_bitmaps_write(MSR_IA32_PRED_CMD);
 	nested_vmx_merge_msr_bitmaps_write(MSR_IA32_FLUSH_CMD);
 
+	nested_vmx_merge_pmu_msr_bitmaps(vcpu, msr_bitmap_l1, msr_bitmap_l0);
+
 	kvm_vcpu_unmap(vcpu, &map);
 
 	vmx->nested.force_msr_bitmap_recalc = false;
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 33/38] perf/x86/intel: Support PERF_PMU_CAP_MEDIATED_VPMU
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (31 preceding siblings ...)
  2025-03-24 17:31 ` [PATCH v4 32/38] KVM: nVMX: Add nested virtualization support for mediated PMU Mingwei Zhang
@ 2025-03-24 17:31 ` Mingwei Zhang
  2025-03-24 17:31 ` [PATCH v4 34/38] perf/x86/amd: Support PERF_PMU_CAP_MEDIATED_VPMU for AMD host Mingwei Zhang
                   ` (7 subsequent siblings)
  40 siblings, 0 replies; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:31 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Kan Liang <kan.liang@linux.intel.com>

Apply PERF_PMU_CAP_MEDIATED_VPMU to the Intel core PMU. The capability only
indicates that the perf side of the core PMU is ready to support the
passthrough vPMU. Besides the capability, the hypervisor still needs to
check the PMU version and other capabilities to decide whether to enable
the mediated vPMU.
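
As a purely illustrative, hedged sketch of that extra hypervisor-side gating
(the helper name below is hypothetical, not from this series):

	/* Hypothetical check: the perf capability flag alone is not enough. */
	static bool can_enable_mediated_vpmu(struct x86_pmu_capability *host)
	{
		/* 'host' is filled by perf_get_x86_pmu_capability() */
		return host->version >= 1 /* && vendor-specific capability checks */;
	}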

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/events/intel/core.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index e86333eee266..ab74fdfa6a66 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4943,6 +4943,8 @@ static void intel_pmu_check_hybrid_pmus(struct x86_hybrid_pmu *pmu)
 	else
 		pmu->intel_ctrl &= ~(1ULL << GLOBAL_CTRL_EN_PERF_METRICS);
 
+	pmu->pmu.capabilities |= PERF_PMU_CAP_MEDIATED_VPMU;
+
 	intel_pmu_check_event_constraints(pmu->event_constraints,
 					  pmu->cntr_mask64,
 					  pmu->fixed_cntr_mask64,
@@ -6535,6 +6537,9 @@ __init int intel_pmu_init(void)
 			pr_cont(" AnyThread deprecated, ");
 	}
 
+	/* The perf side of core PMU is ready to support the mediated vPMU. */
+	x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_MEDIATED_VPMU;
+
 	/*
 	 * Install the hw-cache-events table:
 	 */
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 34/38] perf/x86/amd: Support PERF_PMU_CAP_MEDIATED_VPMU for AMD host
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (32 preceding siblings ...)
  2025-03-24 17:31 ` [PATCH v4 33/38] perf/x86/intel: Support PERF_PMU_CAP_MEDIATED_VPMU Mingwei Zhang
@ 2025-03-24 17:31 ` Mingwei Zhang
  2025-05-21 20:00   ` Namhyung Kim
  2025-03-24 17:31 ` [PATCH v4 35/38] KVM: x86/pmu: Expose enable_mediated_pmu parameter to user space Mingwei Zhang
                   ` (6 subsequent siblings)
  40 siblings, 1 reply; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:31 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Sandipan Das <sandipan.das@amd.com>

Apply the PERF_PMU_CAP_MEDIATED_VPMU flag for version 2 and later
implementations of the core PMU. Aside from having Global Control and
Status registers, virtualizing the PMU using the passthrough model
requires an interface to set or clear the overflow bits in the Global
Status MSRs while restoring or saving the PMU context of a vCPU.

PerfMonV2-capable hardware has additional MSRs for this purpose, namely
PerfCntrGlobalStatusSet and PerfCntrGlobalStatusClr, making it suitable
for use with the mediated vPMU.
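
A hedged sketch of how these MSRs are used when restoring a vCPU's overflow
state (mirroring the AMD context-load path later in this series; guest_status
is a placeholder for the saved guest PerfCntrGlobalStatus value):

	u64 host_status;

	rdmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS, host_status);
	if (host_status)	/* drop stale host overflow bits */
		wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR, host_status);
	/* re-arm the guest's overflow bits before VM-entry */
	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET, guest_status);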

Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/events/amd/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index 30d6ceb4c8ad..a8b537dd2ddb 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -1433,6 +1433,8 @@ static int __init amd_core_pmu_init(void)
 
 		amd_pmu_global_cntr_mask = x86_pmu.cntr_mask64;
 
+		x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_MEDIATED_VPMU;
+
 		/* Update PMC handling functions */
 		x86_pmu.enable_all = amd_pmu_v2_enable_all;
 		x86_pmu.disable_all = amd_pmu_v2_disable_all;
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 35/38] KVM: x86/pmu: Expose enable_mediated_pmu parameter to user space
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (33 preceding siblings ...)
  2025-03-24 17:31 ` [PATCH v4 34/38] perf/x86/amd: Support PERF_PMU_CAP_MEDIATED_VPMU for AMD host Mingwei Zhang
@ 2025-03-24 17:31 ` Mingwei Zhang
  2025-03-24 17:31 ` [PATCH v4 36/38] KVM: selftests: Add mediated vPMU supported for pmu tests Mingwei Zhang
                   ` (5 subsequent siblings)
  40 siblings, 0 replies; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:31 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Dapeng Mi <dapeng1.mi@linux.intel.com>

Expose the enable_mediated_pmu parameter to user space so that users can
enable/disable the mediated vPMU on demand.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/svm/svm.c | 2 ++
 arch/x86/kvm/vmx/vmx.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index bff351992468..a7ccac624dd3 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -265,6 +265,8 @@ module_param(intercept_smi, bool, 0444);
 bool vnmi = true;
 module_param(vnmi, bool, 0444);
 
+module_param(enable_mediated_pmu, bool, 0444);
+
 static bool svm_gp_erratum_intercept = true;
 
 static u8 rsm_ins_bytes[] = "\x0f\xaa";
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 7bb16bed08da..af9e7b917335 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -147,6 +147,8 @@ module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
 extern bool __read_mostly allow_smaller_maxphyaddr;
 module_param(allow_smaller_maxphyaddr, bool, S_IRUGO);
 
+module_param(enable_mediated_pmu, bool, 0444);
+
 #define KVM_VM_CR0_ALWAYS_OFF (X86_CR0_NW | X86_CR0_CD)
 #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST X86_CR0_NE
 #define KVM_VM_CR0_ALWAYS_ON				\
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 36/38] KVM: selftests: Add mediated vPMU supported for pmu tests
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (34 preceding siblings ...)
  2025-03-24 17:31 ` [PATCH v4 35/38] KVM: x86/pmu: Expose enable_mediated_pmu parameter to user space Mingwei Zhang
@ 2025-03-24 17:31 ` Mingwei Zhang
  2025-03-24 17:31 ` [PATCH v4 37/38] KVM: Selftests: Support mediated vPMU for vmx_pmu_caps_test Mingwei Zhang
                   ` (4 subsequent siblings)
  40 siblings, 0 replies; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:31 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Dapeng Mi <dapeng1.mi@linux.intel.com>

Mediated vPMU requires enabling the KVM_CAP_PMU_CAPABILITY capability on
the VM. Thus add a helper, vm_create_with_one_vcpu_with_pmu(), to create a
PMU-enabled VM, and replace the vm_create_with_one_vcpu() helper with this
new helper in the PMU tests.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 .../testing/selftests/kvm/include/kvm_util.h  |  3 +++
 tools/testing/selftests/kvm/lib/kvm_util.c    | 23 +++++++++++++++++++
 .../selftests/kvm/x86/pmu_counters_test.c     |  4 +++-
 .../selftests/kvm/x86/pmu_event_filter_test.c |  8 ++++---
 4 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 4c4e5a847f67..a73b0b98be5e 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -961,6 +961,9 @@ static inline struct kvm_vm *vm_create_shape_with_one_vcpu(struct vm_shape shape
 	return __vm_create_shape_with_one_vcpu(shape, vcpu, 0, guest_code);
 }
 
+struct kvm_vm *vm_create_with_one_vcpu_with_pmu(struct kvm_vcpu **vcpu,
+						void *guest_code);
+
 struct kvm_vcpu *vm_recreate_with_one_vcpu(struct kvm_vm *vm);
 
 void kvm_pin_this_task_to_pcpu(uint32_t pcpu);
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 33fefeb3ca44..18143ec2e751 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -545,6 +545,29 @@ struct kvm_vcpu *vm_recreate_with_one_vcpu(struct kvm_vm *vm)
 	return vm_vcpu_recreate(vm, 0);
 }
 
+struct kvm_vm *vm_create_with_one_vcpu_with_pmu(struct kvm_vcpu **vcpu,
+						void *guest_code)
+{
+	struct kvm_vm *vm;
+	int r;
+
+	r = kvm_check_cap(KVM_CAP_PMU_CAPABILITY);
+	if (!(r & KVM_PMU_CAP_DISABLE))
+		return NULL;
+
+	vm = vm_create(1);
+
+	/*
+	 * KVM_CAP_PMU_CAPABILITY ioctl must be explicitly called to enable
+	 * mediated vPMU.
+	 */
+	vm_enable_cap(vm, KVM_CAP_PMU_CAPABILITY, !KVM_PMU_CAP_DISABLE);
+
+	*vcpu = vm_vcpu_add(vm, 0, guest_code);
+
+	return vm;
+}
+
 void kvm_pin_this_task_to_pcpu(uint32_t pcpu)
 {
 	cpu_set_t mask;
diff --git a/tools/testing/selftests/kvm/x86/pmu_counters_test.c b/tools/testing/selftests/kvm/x86/pmu_counters_test.c
index 698cb36989db..441c66f314fb 100644
--- a/tools/testing/selftests/kvm/x86/pmu_counters_test.c
+++ b/tools/testing/selftests/kvm/x86/pmu_counters_test.c
@@ -40,7 +40,9 @@ static struct kvm_vm *pmu_vm_create_with_one_vcpu(struct kvm_vcpu **vcpu,
 {
 	struct kvm_vm *vm;
 
-	vm = vm_create_with_one_vcpu(vcpu, guest_code);
+	vm = vm_create_with_one_vcpu_with_pmu(vcpu, guest_code);
+	assert(vm);
+
 	sync_global_to_guest(vm, kvm_pmu_version);
 
 	/*
diff --git a/tools/testing/selftests/kvm/x86/pmu_event_filter_test.c b/tools/testing/selftests/kvm/x86/pmu_event_filter_test.c
index c15513cd74d1..1c7d265a0003 100644
--- a/tools/testing/selftests/kvm/x86/pmu_event_filter_test.c
+++ b/tools/testing/selftests/kvm/x86/pmu_event_filter_test.c
@@ -822,8 +822,9 @@ static void test_fixed_counter_bitmap(void)
 	 * fixed performance counters.
 	 */
 	for (idx = 0; idx < nr_fixed_counters; idx++) {
-		vm = vm_create_with_one_vcpu(&vcpu,
-					     intel_run_fixed_counter_guest_code);
+		vm = vm_create_with_one_vcpu_with_pmu(&vcpu,
+						      intel_run_fixed_counter_guest_code);
+		assert(vm);
 		vcpu_args_set(vcpu, 1, idx);
 		__test_fixed_counter_bitmap(vcpu, idx, nr_fixed_counters);
 		kvm_vm_free(vm);
@@ -843,7 +844,8 @@ int main(int argc, char *argv[])
 	TEST_REQUIRE(use_intel_pmu() || use_amd_pmu());
 	guest_code = use_intel_pmu() ? intel_guest_code : amd_guest_code;
 
-	vm = vm_create_with_one_vcpu(&vcpu, guest_code);
+	vm = vm_create_with_one_vcpu_with_pmu(&vcpu, guest_code);
+	assert(vm);
 
 	TEST_REQUIRE(sanity_check_pmu(vcpu));
 
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 37/38] KVM: Selftests: Support mediated vPMU for vmx_pmu_caps_test
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (35 preceding siblings ...)
  2025-03-24 17:31 ` [PATCH v4 36/38] KVM: selftests: Add mediated vPMU supported for pmu tests Mingwei Zhang
@ 2025-03-24 17:31 ` Mingwei Zhang
  2025-03-24 17:31 ` [PATCH v4 38/38] KVM: Selftests: Fix pmu_counters_test error for mediated vPMU Mingwei Zhang
                   ` (3 subsequent siblings)
  40 siblings, 0 replies; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:31 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Dapeng Mi <dapeng1.mi@linux.intel.com>

Define a KVM_ONE_VCPU_PMU_TEST_SUITE macro which calls
vm_create_with_one_vcpu_with_pmu() to create a mediated-vPMU-enabled VM.

With this, vmx_pmu_caps_test can also validate the mediated vPMU.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 .../selftests/kvm/include/kvm_test_harness.h        | 13 +++++++++++++
 tools/testing/selftests/kvm/x86/vmx_pmu_caps_test.c |  2 +-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_test_harness.h b/tools/testing/selftests/kvm/include/kvm_test_harness.h
index 8f7c6858e8e2..4efde79708ce 100644
--- a/tools/testing/selftests/kvm/include/kvm_test_harness.h
+++ b/tools/testing/selftests/kvm/include/kvm_test_harness.h
@@ -23,6 +23,19 @@
 		kvm_vm_free(self->vcpu->vm);				\
 	}
 
+#define KVM_ONE_VCPU_PMU_TEST_SUITE(name)					\
+	FIXTURE(name) {								\
+		struct kvm_vcpu *vcpu;						\
+	};									\
+										\
+	FIXTURE_SETUP(name) {							\
+		(void)vm_create_with_one_vcpu_with_pmu(&self->vcpu, NULL);	\
+	}									\
+										\
+	FIXTURE_TEARDOWN(name) {						\
+		kvm_vm_free(self->vcpu->vm);					\
+	}
+
 #define KVM_ONE_VCPU_TEST(suite, test, guestcode)			\
 static void __suite##_##test(struct kvm_vcpu *vcpu);			\
 									\
diff --git a/tools/testing/selftests/kvm/x86/vmx_pmu_caps_test.c b/tools/testing/selftests/kvm/x86/vmx_pmu_caps_test.c
index a1f5ff45d518..d23610131acb 100644
--- a/tools/testing/selftests/kvm/x86/vmx_pmu_caps_test.c
+++ b/tools/testing/selftests/kvm/x86/vmx_pmu_caps_test.c
@@ -73,7 +73,7 @@ static void guest_code(uint64_t current_val)
 	GUEST_DONE();
 }
 
-KVM_ONE_VCPU_TEST_SUITE(vmx_pmu_caps);
+KVM_ONE_VCPU_PMU_TEST_SUITE(vmx_pmu_caps);
 
 /*
  * Verify that guest WRMSRs to PERF_CAPABILITIES #GP regardless of the value
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* [PATCH v4 38/38] KVM: Selftests: Fix pmu_counters_test error for mediated vPMU
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (36 preceding siblings ...)
  2025-03-24 17:31 ` [PATCH v4 37/38] KVM: Selftests: Support mediated vPMU for vmx_pmu_caps_test Mingwei Zhang
@ 2025-03-24 17:31 ` Mingwei Zhang
  2025-04-16  7:22 ` [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mi, Dapeng
                   ` (2 subsequent siblings)
  40 siblings, 0 replies; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-24 17:31 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Mingwei Zhang, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Das Sandipan, Shukla Manali, Nikunj Dadhania

From: Dapeng Mi <dapeng1.mi@linux.intel.com>

As of the earlier commit f8905c638eb7 ("KVM: x86/pmu: Check PMU cpuid
configuration from user space"), KVM checks whether the user-space
configured PMU version is larger than KVM's supported maximum PMU version
for mediated vPMU, or whether the fixed counter bitmap is configured
incorrectly, and returns an error if so.

This enhanced check makes pmu_counters_test fail, thus limit
pmu_counters_test to validate only KVM-supported PMU versions for mediated
vPMU, and to validate only a zero fixed counter bitmap if the PMU version
is less than 5.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 .../selftests/kvm/include/x86/processor.h     |  8 ++++++++
 .../selftests/kvm/x86/pmu_counters_test.c     | 20 ++++++++++++++++---
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/x86/processor.h b/tools/testing/selftests/kvm/include/x86/processor.h
index d60da8966772..7db34f48427a 100644
--- a/tools/testing/selftests/kvm/include/x86/processor.h
+++ b/tools/testing/selftests/kvm/include/x86/processor.h
@@ -1311,6 +1311,14 @@ static inline bool kvm_is_pmu_enabled(void)
 	return get_kvm_param_bool("enable_pmu");
 }
 
+static inline bool kvm_is_mediated_pmu_enabled(void)
+{
+	if (host_cpu_is_intel)
+		return get_kvm_intel_param_bool("enable_mediated_pmu");
+	else
+		return get_kvm_amd_param_bool("enable_mediated_pmu");
+}
+
 static inline bool kvm_is_forced_emulation_enabled(void)
 {
 	return !!get_kvm_param_integer("force_emulation_prefix");
diff --git a/tools/testing/selftests/kvm/x86/pmu_counters_test.c b/tools/testing/selftests/kvm/x86/pmu_counters_test.c
index 441c66f314fb..4745f82ce860 100644
--- a/tools/testing/selftests/kvm/x86/pmu_counters_test.c
+++ b/tools/testing/selftests/kvm/x86/pmu_counters_test.c
@@ -564,8 +564,14 @@ static void test_intel_counters(void)
 	 * Test up to PMU v5, which is the current maximum version defined by
 	 * Intel, i.e. is the last version that is guaranteed to be backwards
 	 * compatible with KVM's existing behavior.
+	 *
+	 * Whereas for mediated vPMU, limit max_pmu_version to KVM's supported
+	 * maximum PMU version, since KVM rejects larger PMU versions to keep
+	 * the guest from directly manipulating unsupported or disallowed PMU
+	 * MSRs.
 	 */
-	uint8_t max_pmu_version = max_t(typeof(pmu_version), pmu_version, 5);
+	uint8_t max_pmu_version = kvm_is_mediated_pmu_enabled() ?
+				  pmu_version : max_t(typeof(pmu_version), pmu_version, 5);
 
 	/*
 	 * Detect the existence of events that aren't supported by selftests.
@@ -622,8 +628,16 @@ static void test_intel_counters(void)
 			pr_info("Testing fixed counters, PMU version %u, perf_caps = %lx\n",
 				v, perf_caps[i]);
 			for (j = 0; j <= nr_fixed_counters; j++) {
-				for (k = 0; k <= (BIT(nr_fixed_counters) - 1); k++)
-					test_fixed_counters(v, perf_caps[i], j, k);
+				/*
+				 * PMU versions below 5 don't support the fixed counter
+				 * bitmap, so only test a fixed counter bitmap of 0.
+				 */
+				if (v < 5) {
+					test_fixed_counters(v, perf_caps[i], j, 0);
+				} else {
+					for (k = 0; k <= (BIT(nr_fixed_counters) - 1); k++)
+						test_fixed_counters(v, perf_caps[i], j, k);
+				}
 			}
 		}
 	}
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 21/38] KVM: x86/pmu/vmx: Save/load guest IA32_PERF_GLOBAL_CTRL with vm_exit/entry_ctrl
  2025-03-24 17:31 ` [PATCH v4 21/38] KVM: x86/pmu/vmx: Save/load guest IA32_PERF_GLOBAL_CTRL with vm_exit/entry_ctrl Mingwei Zhang
@ 2025-03-26 16:51   ` Chen, Zide
  2025-03-26 20:09     ` Mingwei Zhang
  0 siblings, 1 reply; 127+ messages in thread
From: Chen, Zide @ 2025-03-26 16:51 UTC (permalink / raw)
  To: Mingwei Zhang, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Sean Christopherson,
	Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Yongwei Ma, Xiong Zhang,
	Dapeng Mi, Jim Mattson, Sandipan Das, Eranian Stephane,
	Shukla Manali, Nikunj Dadhania



On 3/24/2025 10:31 AM, Mingwei Zhang wrote:
> From: Dapeng Mi <dapeng1.mi@linux.intel.com>
> 
> Intel processor (vmx) provides capability to save/load guest
> IA32_PERF_GLOBAL_CTRL at vm-exit/vm-entry by setting
> VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL bit in VM-exit-ctrl or
> VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL bit in VM-entry-ctrl.
> 
> Mediated vPMU leverages both capabilities to save/load guest
> IA32_PERF_GLOBAL_CTRL automatically at vm-exit/vm-entry. Note that the
> former was introduced in SapphireRapids and later Intel CPUs.
> 
> If VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL is unavailable, mediated PMU will be
> disabled. Note that mediated PMU can be enabled by falling back to atomic
> msr save/retore list. However, that would cause extra overhead per
> VM-enter/exit.
> 
> Since these VMX capability bits perform automatic saving/restoring of the
> PMU global ctrl between VMCS and the HW MSR. No synchronization was
> performed betwen HW MSR and pmu->global_ctrli, the KVM cached value .
> Therefore, whenever KVM needs to use this variable, it will need to
> explicitly read the value from MSR to pmu->global_ctrl. This is especially
> so when guest doesn't own all PMU counters, i.e., when
> IA32_PERF_GLOBAL_CTRL is interceped by mediated PMU.
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Co-developed-by: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  4 ++++
>  arch/x86/include/asm/vmx.h      |  1 +
>  arch/x86/kvm/pmu.c              | 30 ++++++++++++++++++++++++-
>  arch/x86/kvm/vmx/capabilities.h |  5 +++++
>  arch/x86/kvm/vmx/nested.c       |  3 ++-
>  arch/x86/kvm/vmx/pmu_intel.c    | 39 ++++++++++++++++++++++++++++++++-
>  arch/x86/kvm/vmx/vmx.c          | 22 ++++++++++++++++++-
>  arch/x86/kvm/vmx/vmx.h          |  3 ++-
>  8 files changed, 102 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 0b7af5902ff7..4b3bfefc2d05 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -553,6 +553,10 @@ struct kvm_pmu {
>  	unsigned available_event_types;
>  	u64 fixed_ctr_ctrl;
>  	u64 fixed_ctr_ctrl_rsvd;
> +	/*
> +	 * kvm_pmu_sync_global_ctrl_from_vmcs() must be called to update
> +	 * this SW-maintained global_ctrl for mediated vPMU before accessing it.
> +	 */
>  	u64 global_ctrl;
>  	u64 global_status;
>  	u64 counter_bitmask[2];
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index f7fd4369b821..48e137560f17 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -106,6 +106,7 @@
>  #define VM_EXIT_CLEAR_BNDCFGS                   0x00800000
>  #define VM_EXIT_PT_CONCEAL_PIP			0x01000000
>  #define VM_EXIT_CLEAR_IA32_RTIT_CTL		0x02000000
> +#define VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL	0x40000000
>  
>  #define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR	0x00036dff
>  
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 6ad71752be4b..4e8cefcce7ab 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -646,6 +646,30 @@ void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu)
>  	}
>  }
>  
> +static void kvm_pmu_sync_global_ctrl_from_vmcs(struct kvm_vcpu *vcpu)
> +{
> +	struct msr_data msr_info = { .index = MSR_CORE_PERF_GLOBAL_CTRL };
> +
> +	if (!kvm_mediated_pmu_enabled(vcpu))
> +		return;
> +
> +	/* Sync pmu->global_ctrl from GUEST_IA32_PERF_GLOBAL_CTRL. */
> +	kvm_pmu_call(get_msr)(vcpu, &msr_info);
> +}
> +
> +static void kvm_pmu_sync_global_ctrl_to_vmcs(struct kvm_vcpu *vcpu, u64 global_ctrl)
> +{
> +	struct msr_data msr_info = {
> +		.index = MSR_CORE_PERF_GLOBAL_CTRL,
> +		.data = global_ctrl };
> +
> +	if (!kvm_mediated_pmu_enabled(vcpu))
> +		return;
> +
> +	/* Sync pmu->global_ctrl to GUEST_IA32_PERF_GLOBAL_CTRL. */
> +	kvm_pmu_call(set_msr)(vcpu, &msr_info);
> +}
> +
>  bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
>  {
>  	switch (msr) {
> @@ -680,7 +704,6 @@ int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  		msr_info->data = pmu->global_status;
>  		break;
>  	case MSR_AMD64_PERF_CNTR_GLOBAL_CTL:
> -	case MSR_CORE_PERF_GLOBAL_CTRL:
>  		msr_info->data = pmu->global_ctrl;
>  		break;
>  	case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR:
> @@ -731,6 +754,9 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)


pmu->global_ctrl doesn't always hold the up-to-date guest value; it needs
to be synced from the VMCS/VMCB before comparing it against 'data'.

+               kvm_pmu_sync_global_ctrl_from_vmcs(vcpu);
                if (pmu->global_ctrl != data) {

>  			diff = pmu->global_ctrl ^ data;
>  			pmu->global_ctrl = data;
>  			reprogram_counters(pmu, diff);
> +
> +			/* Propagate guest global_ctrl to GUEST_IA32_PERF_GLOBAL_CTRL. */
> +			kvm_pmu_sync_global_ctrl_to_vmcs(vcpu, data);
>  		}
>  		break;
>  	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
> @@ -907,6 +933,8 @@ void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel)
>  
>  	BUILD_BUG_ON(sizeof(pmu->global_ctrl) * BITS_PER_BYTE != X86_PMC_IDX_MAX);
>  
> +	kvm_pmu_sync_global_ctrl_from_vmcs(vcpu);
> +
>  	if (!kvm_pmu_has_perf_global_ctrl(pmu))
>  		bitmap_copy(bitmap, pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX);
>  	else if (!bitmap_and(bitmap, pmu->all_valid_pmc_idx,
> diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
> index 013536fde10b..cc63bd4ab87c 100644
> --- a/arch/x86/kvm/vmx/capabilities.h
> +++ b/arch/x86/kvm/vmx/capabilities.h
> @@ -101,6 +101,11 @@ static inline bool cpu_has_load_perf_global_ctrl(void)
>  	return vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
>  }
>  
> +static inline bool cpu_has_save_perf_global_ctrl(void)
> +{
> +	return vmcs_config.vmexit_ctrl & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL;
> +}
> +
>  static inline bool cpu_has_vmx_mpx(void)
>  {
>  	return vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_BNDCFGS;
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 8a7af02d466e..ecf72394684d 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -7004,7 +7004,8 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
>  		VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
>  		VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
>  		VM_EXIT_SAVE_VMX_PREEMPTION_TIMER | VM_EXIT_ACK_INTR_ON_EXIT |
> -		VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
> +		VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL |
> +		VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL;
>  
>  	/* We support free control of debug control saving. */
>  	msrs->exit_ctls_low &= ~VM_EXIT_SAVE_DEBUG_CONTROLS;
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 2a5f79206b02..04a893e56135 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -294,6 +294,11 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  	u32 msr = msr_info->index;
>  
>  	switch (msr) {
> +	case MSR_CORE_PERF_GLOBAL_CTRL:
> +		if (kvm_mediated_pmu_enabled(vcpu))
> +			pmu->global_ctrl = vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL);
> +		msr_info->data = pmu->global_ctrl;
> +		break;
>  	case MSR_CORE_PERF_FIXED_CTR_CTRL:
>  		msr_info->data = pmu->fixed_ctr_ctrl;
>  		break;
> @@ -339,6 +344,11 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  	u64 reserved_bits, diff;
>  
>  	switch (msr) {
> +	case MSR_CORE_PERF_GLOBAL_CTRL:
> +		if (kvm_mediated_pmu_enabled(vcpu))
> +			vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL,
> +				     pmu->global_ctrl);
> +		break;
>  	case MSR_CORE_PERF_FIXED_CTR_CTRL:
>  		if (data & pmu->fixed_ctr_ctrl_rsvd)
>  			return 1;
> @@ -558,10 +568,37 @@ static void __intel_pmu_refresh(struct kvm_vcpu *vcpu)
>  
>  static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
>  {
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	bool mediated;
> +
>  	__intel_pmu_refresh(vcpu);
>  
> -	exec_controls_changebit(to_vmx(vcpu), CPU_BASED_RDPMC_EXITING,
> +	exec_controls_changebit(vmx, CPU_BASED_RDPMC_EXITING,
>  				!kvm_rdpmc_in_guest(vcpu));
> +
> +	mediated = kvm_mediated_pmu_enabled(vcpu);
> +	if (cpu_has_load_perf_global_ctrl()) {
> +		vm_entry_controls_changebit(vmx,
> +			VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL, mediated);
> +		/*
> +		 * Initialize guest PERF_GLOBAL_CTRL to the reset value defined by the SDM.
> +		 *
> +		 * Note: GUEST_IA32_PERF_GLOBAL_CTRL must be initialized to
> +		 * "BIT_ULL(pmu->nr_arch_gp_counters) - 1" instead of pmu->global_ctrl
> +		 * since pmu->global_ctrl is only initialized when guest
> +		 * pmu->version > 1. Otherwise, if pmu->version is 1, pmu->global_ctrl
> +		 * is 0 and guest counters are never really enabled.
> +		 */
> +		if (mediated)
> +			vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL,
> +				     BIT_ULL(pmu->nr_arch_gp_counters) - 1);
> +	}
> +
> +	if (cpu_has_save_perf_global_ctrl())
> +		vm_exit_controls_changebit(vmx,
> +			VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL |
> +			VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL, mediated);
>  }
>  
>  static void intel_pmu_init(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index ff66f17d6358..38ecf3c116bd 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4390,6 +4390,13 @@ void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
>  
>  	if (cpu_has_load_ia32_efer())
>  		vmcs_write64(HOST_IA32_EFER, kvm_host.efer);
> +
> +	/*
> +	 * Initialize host PERF_GLOBAL_CTRL to 0 to disable all counters
> +	 * immediately once VM exits. The mediated vPMU then calls perf_guest_exit()
> +	 * to re-enable host perf events.
> +	 */
> +	vmcs_write64(HOST_IA32_PERF_GLOBAL_CTRL, 0);
>  }
>  
>  void set_cr4_guest_host_mask(struct vcpu_vmx *vmx)
> @@ -4457,7 +4464,8 @@ static u32 vmx_get_initial_vmexit_ctrl(void)
>  				 VM_EXIT_CLEAR_IA32_RTIT_CTL);
>  	/* Loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically */
>  	return vmexit_ctrl &
> -		~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER);
> +		~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER |
> +		  VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL);
>  }
>  
>  void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
> @@ -7196,6 +7204,9 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
>  	struct perf_guest_switch_msr *msrs;
>  	struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
>  
> +	if (kvm_mediated_pmu_enabled(&vmx->vcpu))
> +		return;
> +
>  	pmu->host_cross_mapped_mask = 0;
>  	if (pmu->pebs_enable & pmu->global_ctrl)
>  		intel_pmu_cross_mapped_check(pmu);
> @@ -8451,6 +8462,15 @@ __init int vmx_hardware_setup(void)
>  		enable_sgx = false;
>  #endif
>  
> +	/*
> +	 * All CPUs that support a mediated PMU are expected to support loading
> +	 * and saving PERF_GLOBAL_CTRL via dedicated VMCS fields.
> +	 */
> +	if (enable_mediated_pmu &&
> +	    (WARN_ON_ONCE(!cpu_has_load_perf_global_ctrl() ||
> +			  !cpu_has_save_perf_global_ctrl())))
> +		enable_mediated_pmu = false;
> +
>  	/*
>  	 * set_apic_access_page_addr() is used to reload apic access
>  	 * page upon invalidation.  No need to do anything if not
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 5c505af553c8..b282165f98a6 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -510,7 +510,8 @@ static inline u8 vmx_get_rvi(void)
>  	       VM_EXIT_LOAD_IA32_EFER |					\
>  	       VM_EXIT_CLEAR_BNDCFGS |					\
>  	       VM_EXIT_PT_CONCEAL_PIP |					\
> -	       VM_EXIT_CLEAR_IA32_RTIT_CTL)
> +	       VM_EXIT_CLEAR_IA32_RTIT_CTL |				\
> +	       VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL)
>  
>  #define KVM_REQUIRED_VMX_PIN_BASED_VM_EXEC_CONTROL			\
>  	(PIN_BASED_EXT_INTR_MASK |					\


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 21/38] KVM: x86/pmu/vmx: Save/load guest IA32_PERF_GLOBAL_CTRL with vm_exit/entry_ctrl
  2025-03-26 16:51   ` Chen, Zide
@ 2025-03-26 20:09     ` Mingwei Zhang
  2025-05-15  0:33       ` Sean Christopherson
  0 siblings, 1 reply; 127+ messages in thread
From: Mingwei Zhang @ 2025-03-26 20:09 UTC (permalink / raw)
  To: Chen, Zide
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Liang,
	Kan, H. Peter Anvin, linux-perf-users, linux-kernel, kvm,
	linux-kselftest, Yongwei Ma, Xiong Zhang, Dapeng Mi, Jim Mattson,
	Sandipan Das, Eranian Stephane, Shukla Manali, Nikunj Dadhania

On Wed, Mar 26, 2025 at 9:51 AM Chen, Zide <zide.chen@intel.com> wrote:
>
>
>
> On 3/24/2025 10:31 AM, Mingwei Zhang wrote:
> > From: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >
> > Intel processors (VMX) provide the capability to save/load the guest
> > IA32_PERF_GLOBAL_CTRL at VM-exit/VM-entry by setting the
> > VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL bit in the VM-exit controls or the
> > VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL bit in the VM-entry controls.
> >
> > Mediated vPMU leverages both capabilities to save/load guest
> > IA32_PERF_GLOBAL_CTRL automatically at vm-exit/vm-entry. Note that the
> > former was introduced in SapphireRapids and later Intel CPUs.
> >
> > If VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL is unavailable, mediated PMU will be
> > disabled. Note that mediated PMU could still be enabled by falling back to
> > the atomic MSR save/restore list, but that would add extra overhead on every
> > VM-entry/exit.
> >
> > Since these VMX capability bits make the CPU save/restore the PMU global
> > ctrl between the VMCS and the HW MSR automatically, no synchronization is
> > performed between the HW MSR and pmu->global_ctrl, the KVM-cached value.
> > Therefore, whenever KVM needs to use this variable, it must first explicitly
> > read the value back into pmu->global_ctrl. This is especially so when the
> > guest doesn't own all PMU counters, i.e., when IA32_PERF_GLOBAL_CTRL is
> > intercepted by the mediated PMU.
> >
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> > Co-developed-by: Mingwei Zhang <mizhang@google.com>
> > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  4 ++++
> >  arch/x86/include/asm/vmx.h      |  1 +
> >  arch/x86/kvm/pmu.c              | 30 ++++++++++++++++++++++++-
> >  arch/x86/kvm/vmx/capabilities.h |  5 +++++
> >  arch/x86/kvm/vmx/nested.c       |  3 ++-
> >  arch/x86/kvm/vmx/pmu_intel.c    | 39 ++++++++++++++++++++++++++++++++-
> >  arch/x86/kvm/vmx/vmx.c          | 22 ++++++++++++++++++-
> >  arch/x86/kvm/vmx/vmx.h          |  3 ++-
> >  8 files changed, 102 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 0b7af5902ff7..4b3bfefc2d05 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -553,6 +553,10 @@ struct kvm_pmu {
> >       unsigned available_event_types;
> >       u64 fixed_ctr_ctrl;
> >       u64 fixed_ctr_ctrl_rsvd;
> > +     /*
> > +      * kvm_pmu_sync_global_ctrl_from_vmcs() must be called to update
> > +      * this SW-maintained global_ctrl for mediated vPMU before accessing it.
> > +      */
> >       u64 global_ctrl;
> >       u64 global_status;
> >       u64 counter_bitmask[2];
> > diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> > index f7fd4369b821..48e137560f17 100644
> > --- a/arch/x86/include/asm/vmx.h
> > +++ b/arch/x86/include/asm/vmx.h
> > @@ -106,6 +106,7 @@
> >  #define VM_EXIT_CLEAR_BNDCFGS                   0x00800000
> >  #define VM_EXIT_PT_CONCEAL_PIP                       0x01000000
> >  #define VM_EXIT_CLEAR_IA32_RTIT_CTL          0x02000000
> > +#define VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL   0x40000000
> >
> >  #define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR    0x00036dff
> >
> > diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> > index 6ad71752be4b..4e8cefcce7ab 100644
> > --- a/arch/x86/kvm/pmu.c
> > +++ b/arch/x86/kvm/pmu.c
> > @@ -646,6 +646,30 @@ void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu)
> >       }
> >  }
> >
> > +static void kvm_pmu_sync_global_ctrl_from_vmcs(struct kvm_vcpu *vcpu)
> > +{
> > +     struct msr_data msr_info = { .index = MSR_CORE_PERF_GLOBAL_CTRL };
> > +
> > +     if (!kvm_mediated_pmu_enabled(vcpu))
> > +             return;
> > +
> > +     /* Sync pmu->global_ctrl from GUEST_IA32_PERF_GLOBAL_CTRL. */
> > +     kvm_pmu_call(get_msr)(vcpu, &msr_info);
> > +}
> > +
> > +static void kvm_pmu_sync_global_ctrl_to_vmcs(struct kvm_vcpu *vcpu, u64 global_ctrl)
> > +{
> > +     struct msr_data msr_info = {
> > +             .index = MSR_CORE_PERF_GLOBAL_CTRL,
> > +             .data = global_ctrl };
> > +
> > +     if (!kvm_mediated_pmu_enabled(vcpu))
> > +             return;
> > +
> > +     /* Sync pmu->global_ctrl to GUEST_IA32_PERF_GLOBAL_CTRL. */
> > +     kvm_pmu_call(set_msr)(vcpu, &msr_info);
> > +}
> > +
> >  bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
> >  {
> >       switch (msr) {
> > @@ -680,7 +704,6 @@ int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> >               msr_info->data = pmu->global_status;
> >               break;
> >       case MSR_AMD64_PERF_CNTR_GLOBAL_CTL:
> > -     case MSR_CORE_PERF_GLOBAL_CTRL:
> >               msr_info->data = pmu->global_ctrl;
> >               break;
> >       case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR:
> > @@ -731,6 +754,9 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>
>
> pmu->global_ctrl doesn't always have the up-to-date guest value; it needs to
> be synced from the vmcs/vmcb before comparing it against 'data'.
>
> +               kvm_pmu_sync_global_ctrl_from_vmcs(vcpu);
>                 if (pmu->global_ctrl != data) {

Good catch. Thanks!

This is why I really prefer just unconditionally syncing the global
ctrl from VMCS to pmu->global_ctrl and vice versa.

We might get into similar problems as well in the future.
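
A minimal sketch of the change being discussed, reusing the helpers this patch
introduces (the enclosing switch-case of kvm_pmu_set_msr() is omitted, so treat
the placement as approximate rather than the final diff):

		kvm_pmu_sync_global_ctrl_from_vmcs(vcpu);
		if (pmu->global_ctrl != data) {
			diff = pmu->global_ctrl ^ data;
			pmu->global_ctrl = data;
			reprogram_counters(pmu, diff);

			/* Propagate guest global_ctrl to GUEST_IA32_PERF_GLOBAL_CTRL. */
			kvm_pmu_sync_global_ctrl_to_vmcs(vcpu, data);
		}

The added first line refreshes the KVM-cached value from hardware state, so the
comparison is made against the value the guest last ran with instead of a stale
cached copy.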

>
> >                       diff = pmu->global_ctrl ^ data;
> >                       pmu->global_ctrl = data;
> >                       reprogram_counters(pmu, diff);
> > +
> > +                     /* Propagate guest global_ctrl to GUEST_IA32_PERF_GLOBAL_CTRL. */
> > +                     kvm_pmu_sync_global_ctrl_to_vmcs(vcpu, data);
> >               }
> >               break;
> >       case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
> > @@ -907,6 +933,8 @@ void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel)
> >
> >       BUILD_BUG_ON(sizeof(pmu->global_ctrl) * BITS_PER_BYTE != X86_PMC_IDX_MAX);
> >
> > +     kvm_pmu_sync_global_ctrl_from_vmcs(vcpu);
> > +
> >       if (!kvm_pmu_has_perf_global_ctrl(pmu))
> >               bitmap_copy(bitmap, pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX);
> >       else if (!bitmap_and(bitmap, pmu->all_valid_pmc_idx,
> > diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
> > index 013536fde10b..cc63bd4ab87c 100644
> > --- a/arch/x86/kvm/vmx/capabilities.h
> > +++ b/arch/x86/kvm/vmx/capabilities.h
> > @@ -101,6 +101,11 @@ static inline bool cpu_has_load_perf_global_ctrl(void)
> >       return vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
> >  }
> >
> > +static inline bool cpu_has_save_perf_global_ctrl(void)
> > +{
> > +     return vmcs_config.vmexit_ctrl & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL;
> > +}
> > +
> >  static inline bool cpu_has_vmx_mpx(void)
> >  {
> >       return vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_BNDCFGS;
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index 8a7af02d466e..ecf72394684d 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -7004,7 +7004,8 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
> >               VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
> >               VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
> >               VM_EXIT_SAVE_VMX_PREEMPTION_TIMER | VM_EXIT_ACK_INTR_ON_EXIT |
> > -             VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
> > +             VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL |
> > +             VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL;
> >
> >       /* We support free control of debug control saving. */
> >       msrs->exit_ctls_low &= ~VM_EXIT_SAVE_DEBUG_CONTROLS;
> > diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> > index 2a5f79206b02..04a893e56135 100644
> > --- a/arch/x86/kvm/vmx/pmu_intel.c
> > +++ b/arch/x86/kvm/vmx/pmu_intel.c
> > @@ -294,6 +294,11 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> >       u32 msr = msr_info->index;
> >
> >       switch (msr) {
> > +     case MSR_CORE_PERF_GLOBAL_CTRL:
> > +             if (kvm_mediated_pmu_enabled(vcpu))
> > +                     pmu->global_ctrl = vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL);
> > +             msr_info->data = pmu->global_ctrl;
> > +             break;
> >       case MSR_CORE_PERF_FIXED_CTR_CTRL:
> >               msr_info->data = pmu->fixed_ctr_ctrl;
> >               break;
> > @@ -339,6 +344,11 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> >       u64 reserved_bits, diff;
> >
> >       switch (msr) {
> > +     case MSR_CORE_PERF_GLOBAL_CTRL:
> > +             if (kvm_mediated_pmu_enabled(vcpu))
> > +                     vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL,
> > +                                  pmu->global_ctrl);
> > +             break;
> >       case MSR_CORE_PERF_FIXED_CTR_CTRL:
> >               if (data & pmu->fixed_ctr_ctrl_rsvd)
> >                       return 1;
> > @@ -558,10 +568,37 @@ static void __intel_pmu_refresh(struct kvm_vcpu *vcpu)
> >
> >  static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
> >  {
> > +     struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> > +     struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +     bool mediated;
> > +
> >       __intel_pmu_refresh(vcpu);
> >
> > -     exec_controls_changebit(to_vmx(vcpu), CPU_BASED_RDPMC_EXITING,
> > +     exec_controls_changebit(vmx, CPU_BASED_RDPMC_EXITING,
> >                               !kvm_rdpmc_in_guest(vcpu));
> > +
> > +     mediated = kvm_mediated_pmu_enabled(vcpu);
> > +     if (cpu_has_load_perf_global_ctrl()) {
> > +             vm_entry_controls_changebit(vmx,
> > +                     VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL, mediated);
> > +             /*
> > +              * Initialize guest PERF_GLOBAL_CTRL to the reset value defined by the SDM.
> > +              *
> > +              * Note: GUEST_IA32_PERF_GLOBAL_CTRL must be initialized to
> > +              * "BIT_ULL(pmu->nr_arch_gp_counters) - 1" instead of pmu->global_ctrl
> > +              * since pmu->global_ctrl is only initialized when guest
> > +              * pmu->version > 1. Otherwise, if pmu->version is 1, pmu->global_ctrl
> > +              * is 0 and guest counters are never really enabled.
> > +              */
> > +             if (mediated)
> > +                     vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL,
> > +                                  BIT_ULL(pmu->nr_arch_gp_counters) - 1);
> > +     }
> > +
> > +     if (cpu_has_save_perf_global_ctrl())
> > +             vm_exit_controls_changebit(vmx,
> > +                     VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL |
> > +                     VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL, mediated);
> >  }
> >
> >  static void intel_pmu_init(struct kvm_vcpu *vcpu)
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index ff66f17d6358..38ecf3c116bd 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -4390,6 +4390,13 @@ void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
> >
> >       if (cpu_has_load_ia32_efer())
> >               vmcs_write64(HOST_IA32_EFER, kvm_host.efer);
> > +
> > +     /*
> > +      * Initialize host PERF_GLOBAL_CTRL to 0 to disable all counters
> > +      * immediately once VM exits. The mediated vPMU then calls perf_guest_exit()
> > +      * to re-enable host perf events.
> > +      */
> > +     vmcs_write64(HOST_IA32_PERF_GLOBAL_CTRL, 0);
> >  }
> >
> >  void set_cr4_guest_host_mask(struct vcpu_vmx *vmx)
> > @@ -4457,7 +4464,8 @@ static u32 vmx_get_initial_vmexit_ctrl(void)
> >                                VM_EXIT_CLEAR_IA32_RTIT_CTL);
> >       /* Loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically */
> >       return vmexit_ctrl &
> > -             ~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER);
> > +             ~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER |
> > +               VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL);
> >  }
> >
> >  void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
> > @@ -7196,6 +7204,9 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
> >       struct perf_guest_switch_msr *msrs;
> >       struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
> >
> > +     if (kvm_mediated_pmu_enabled(&vmx->vcpu))
> > +             return;
> > +
> >       pmu->host_cross_mapped_mask = 0;
> >       if (pmu->pebs_enable & pmu->global_ctrl)
> >               intel_pmu_cross_mapped_check(pmu);
> > @@ -8451,6 +8462,15 @@ __init int vmx_hardware_setup(void)
> >               enable_sgx = false;
> >  #endif
> >
> > +     /*
> > +      * All CPUs that support a mediated PMU are expected to support loading
> > +      * and saving PERF_GLOBAL_CTRL via dedicated VMCS fields.
> > +      */
> > +     if (enable_mediated_pmu &&
> > +         (WARN_ON_ONCE(!cpu_has_load_perf_global_ctrl() ||
> > +                       !cpu_has_save_perf_global_ctrl())))
> > +             enable_mediated_pmu = false;
> > +
> >       /*
> >        * set_apic_access_page_addr() is used to reload apic access
> >        * page upon invalidation.  No need to do anything if not
> > diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> > index 5c505af553c8..b282165f98a6 100644
> > --- a/arch/x86/kvm/vmx/vmx.h
> > +++ b/arch/x86/kvm/vmx/vmx.h
> > @@ -510,7 +510,8 @@ static inline u8 vmx_get_rvi(void)
> >              VM_EXIT_LOAD_IA32_EFER |                                 \
> >              VM_EXIT_CLEAR_BNDCFGS |                                  \
> >              VM_EXIT_PT_CONCEAL_PIP |                                 \
> > -            VM_EXIT_CLEAR_IA32_RTIT_CTL)
> > +            VM_EXIT_CLEAR_IA32_RTIT_CTL |                            \
> > +            VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL)
> >
> >  #define KVM_REQUIRED_VMX_PIN_BASED_VM_EXEC_CONTROL                   \
> >       (PIN_BASED_EXT_INTR_MASK |                                      \
>

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 00/38] Mediated vPMU 4.0 for x86
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (37 preceding siblings ...)
  2025-03-24 17:31 ` [PATCH v4 38/38] KVM: Selftests: Fix pmu_counters_test error for mediated vPMU Mingwei Zhang
@ 2025-04-16  7:22 ` Mi, Dapeng
  2025-04-25 12:27   ` Peter Zijlstra
  2025-05-06  9:57 ` Mi, Dapeng
  2025-05-15  0:49 ` Sean Christopherson
  40 siblings, 1 reply; 127+ messages in thread
From: Mi, Dapeng @ 2025-04-16  7:22 UTC (permalink / raw)
  To: Mingwei Zhang, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Sean Christopherson,
	Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Yongwei Ma, Xiong Zhang,
	Jim Mattson, Sandipan Das, Zide Chen, Eranian Stephane,
	Shukla Manali, Nikunj Dadhania

Kindly ping... Any comments on this patch series? Thanks.

On 3/25/2025 1:30 AM, Mingwei Zhang wrote:
> With joint effort from the upstream KVM community, we come up with the
> 4th version of mediated vPMU for x86. We have made the following changes
> on top of the previous RFC v3.
>
> v3 -> v4
>  - Rebase whole patchset on 6.14-rc3 base.
>  - Address Peter's comments on Perf part.
>  - Address Sean's comments on KVM part.
>    * Change key word "passthrough" to "mediated" in all patches
>    * Change static enabling to user space dynamic enabling via KVM_CAP_PMU_CAPABILITY.
>    * Only support GLOBAL_CTRL save/restore with VMCS exec_ctrl, drop the MSR
>      save/restore list support for GLOBAL_CTRL, thus the support of mediated
>      vPMU is constrained to SapphireRapids and later CPUs on Intel side.
>    * Merge some small changes into a single patch.
>  - Address Sandipan's comment on invalid pmu pointer.
>  - Add back "eventsel_hw" and "fixed_ctr_ctrl_hw" to avoid to directly
>    manipulate pmc->eventsel and pmu->fixed_ctr_ctrl.
>
>
> Testing (Intel side):
>  - Perf-based legacy vPMU (force emulation on/off)
>    * Kselftests pmu_counters_test, pmu_event_filter_test and
>      vmx_pmu_caps_test pass.
>    * KUT PMU tests pmu, pmu_lbr, pmu_pebs pass.
>    * Basic perf counting/sampling tests in 3 scenarios, guest-only,
>      host-only and host-guest coexistence all pass.
>
>  - Mediated vPMU (force emulation on/off)
>    * Kselftests pmu_counters_test, pmu_event_filter_test and
>      vmx_pmu_caps_test pass.
>    * KUT PMU tests pmu, pmu_lbr, pmu_pebs pass.
>    * Basic perf counting/sampling tests in 3 scenarios, guest-only,
>      host-only and host-guest coexistence all pass.
>
>  - Failures. All above tests passed on Intel Granite Rapids as well
>    except a failure on KUT/pmu_pebs.
>    * GP counter 0 (0xfffffffffffe): PEBS record (written seq 0)
>      is verified (including size, counters and cfg).
>    * The pebs_data_cfg (0xb500000000) doesn't match with the
>      effective MSR_PEBS_DATA_CFG (0x0).
>    * This failure has nothing to do with this mediated vPMU patch set. The
>      failure is caused by Granite Rapids supported timed PEBS which needs
>      extra support on Qemu and KUT/pmu_pebs. These extra support would be
>      sent in separate patches later.
>
>
> Testing (AMD side):
>  - Kselftests pmu_counters_test, pmu_event_filter_test and
>    vmx_pmu_caps_test all pass
>
>  - legacy guest with KUT/pmu:
>    * qemu option: -cpu host, -perfctr-core
>    * when set force_emulation_prefix=1, passes
>    * when set force_emulation_prefix=0, passes
>  - perfmon-v1 guest with KUT/pmu:
>    * qemu option: -cpu host, -perfmon-v2
>    * when set force_emulation_prefix=1, passes
>    * when set force_emulation_prefix=0, passes
>  - perfmon-v2 guest with KUT/pmu:
>    * qemu option: -cpu host
>    * when set force_emulation_prefix=1, passes
>    * when set force_emulation_prefix=0, passes
>
>  - perf_fuzzer (perfmon-v2):
>    * fails with soft lockup in guest in current version.
>    * culprit could be between 6.13 ~ 6.14-rc3 within KVM
>    * Series tested on 6.12 and 6.13 without issue.
>
> Note: a QEMU series is needed to run mediated vPMU v4:
>  - https://lore.kernel.org/all/20250324123712.34096-1-dapeng1.mi@linux.intel.com/
>
> History:
>  - RFC v3: https://lore.kernel.org/all/20240801045907.4010984-1-mizhang@google.com/
>  - RFC v2: https://lore.kernel.org/all/20240506053020.3911940-1-mizhang@google.com/
>  - RFC v1: https://lore.kernel.org/all/20240126085444.324918-1-xiong.y.zhang@linux.intel.com/
>
>
> Dapeng Mi (18):
>   KVM: x86/pmu: Introduce enable_mediated_pmu global parameter
>   KVM: x86/pmu: Check PMU cpuid configuration from user space
>   KVM: x86: Rename vmx_vmentry/vmexit_ctrl() helpers
>   KVM: x86/pmu: Add perf_capabilities field in struct kvm_host_values{}
>   KVM: x86/pmu: Move PMU_CAP_{FW_WRITES,LBR_FMT} into msr-index.h header
>   KVM: VMX: Add macros to wrap around
>     {secondary,tertiary}_exec_controls_changebit()
>   KVM: x86/pmu: Check if mediated vPMU can intercept rdpmc
>   KVM: x86/pmu/vmx: Save/load guest IA32_PERF_GLOBAL_CTRL with
>     vm_exit/entry_ctrl
>   KVM: x86/pmu: Optimize intel/amd_pmu_refresh() helpers
>   KVM: x86/pmu: Setup PMU MSRs' interception mode
>   KVM: x86/pmu: Handle PMU MSRs interception and event filtering
>   KVM: x86/pmu: Switch host/guest PMU context at vm-exit/vm-entry
>   KVM: x86/pmu: Handle emulated instruction for mediated vPMU
>   KVM: nVMX: Add macros to simplify nested MSR interception setting
>   KVM: selftests: Add mediated vPMU supported for pmu tests
>   KVM: Selftests: Support mediated vPMU for vmx_pmu_caps_test
>   KVM: Selftests: Fix pmu_counters_test error for mediated vPMU
>   KVM: x86/pmu: Expose enable_mediated_pmu parameter to user space
>
> Kan Liang (8):
>   perf: Support get/put mediated PMU interfaces
>   perf: Skip pmu_ctx based on event_type
>   perf: Clean up perf ctx time
>   perf: Add a EVENT_GUEST flag
>   perf: Add generic exclude_guest support
>   perf: Add switch_guest_ctx() interface
>   perf/x86: Support switch_guest_ctx interface
>   perf/x86/intel: Support PERF_PMU_CAP_MEDIATED_VPMU
>
> Mingwei Zhang (5):
>   perf/x86: Forbid PMI handler when guest own PMU
>   perf/x86/core: Plumb mediated PMU capability from x86_pmu to
>     x86_pmu_cap
>   KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot()
>   KVM: x86/pmu: introduce eventsel_hw to prepare for pmu event filtering
>   KVM: nVMX: Add nested virtualization support for mediated PMU
>
> Sandipan Das (4):
>   perf/x86/core: Do not set bit width for unavailable counters
>   KVM: x86/pmu: Add AMD PMU registers to direct access list
>   KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest
>     write to event selectors
>   perf/x86/amd: Support PERF_PMU_CAP_MEDIATED_VPMU for AMD host
>
> Xiong Zhang (3):
>   x86/irq: Factor out common code for installing kvm irq handler
>   perf: core/x86: Register a new vector for KVM GUEST PMI
>   KVM: x86/pmu: Register KVM_GUEST_PMI_VECTOR handler
>
>  arch/x86/events/amd/core.c                    |   2 +
>  arch/x86/events/core.c                        |  40 +-
>  arch/x86/events/intel/core.c                  |   5 +
>  arch/x86/include/asm/hardirq.h                |   1 +
>  arch/x86/include/asm/idtentry.h               |   1 +
>  arch/x86/include/asm/irq.h                    |   2 +-
>  arch/x86/include/asm/irq_vectors.h            |   5 +-
>  arch/x86/include/asm/kvm-x86-pmu-ops.h        |   2 +
>  arch/x86/include/asm/kvm_host.h               |  10 +
>  arch/x86/include/asm/msr-index.h              |  18 +-
>  arch/x86/include/asm/perf_event.h             |   1 +
>  arch/x86/include/asm/vmx.h                    |   1 +
>  arch/x86/kernel/idt.c                         |   1 +
>  arch/x86/kernel/irq.c                         |  39 +-
>  arch/x86/kvm/cpuid.c                          |  15 +
>  arch/x86/kvm/pmu.c                            | 254 ++++++++-
>  arch/x86/kvm/pmu.h                            |  45 ++
>  arch/x86/kvm/svm/pmu.c                        | 148 ++++-
>  arch/x86/kvm/svm/svm.c                        |  26 +
>  arch/x86/kvm/svm/svm.h                        |   2 +-
>  arch/x86/kvm/vmx/capabilities.h               |  11 +-
>  arch/x86/kvm/vmx/nested.c                     |  68 ++-
>  arch/x86/kvm/vmx/pmu_intel.c                  | 224 ++++++--
>  arch/x86/kvm/vmx/vmx.c                        |  89 +--
>  arch/x86/kvm/vmx/vmx.h                        |  11 +-
>  arch/x86/kvm/x86.c                            |  63 ++-
>  arch/x86/kvm/x86.h                            |   2 +
>  include/linux/perf_event.h                    |  47 +-
>  kernel/events/core.c                          | 519 ++++++++++++++----
>  .../beauty/arch/x86/include/asm/irq_vectors.h |   5 +-
>  .../selftests/kvm/include/kvm_test_harness.h  |  13 +
>  .../testing/selftests/kvm/include/kvm_util.h  |   3 +
>  .../selftests/kvm/include/x86/processor.h     |   8 +
>  tools/testing/selftests/kvm/lib/kvm_util.c    |  23 +
>  .../selftests/kvm/x86/pmu_counters_test.c     |  24 +-
>  .../selftests/kvm/x86/pmu_event_filter_test.c |   8 +-
>  .../selftests/kvm/x86/vmx_pmu_caps_test.c     |   2 +-
>  37 files changed, 1480 insertions(+), 258 deletions(-)
>
>
> base-commit: 0ad2507d5d93f39619fc42372c347d6006b64319

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 09/38] perf: Add switch_guest_ctx() interface
  2025-03-24 17:30 ` [PATCH v4 09/38] perf: Add switch_guest_ctx() interface Mingwei Zhang
@ 2025-04-25 11:12   ` Peter Zijlstra
  2025-05-14 23:30   ` Sean Christopherson
  2025-05-21 20:01   ` Namhyung Kim
  2 siblings, 0 replies; 127+ messages in thread
From: Peter Zijlstra @ 2025-04-25 11:12 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Sean Christopherson, Paolo Bonzini, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Liang,
	Kan, H. Peter Anvin, linux-perf-users, linux-kernel, kvm,
	linux-kselftest, Yongwei Ma, Xiong Zhang, Dapeng Mi, Jim Mattson,
	Sandipan Das, Zide Chen, Eranian Stephane, Shukla Manali,
	Nikunj Dadhania

On Mon, Mar 24, 2025 at 05:30:49PM +0000, Mingwei Zhang wrote:

> @@ -1822,7 +1835,7 @@ extern int perf_event_period(struct perf_event *event, u64 value);
>  extern u64 perf_event_pause(struct perf_event *event, bool reset);
>  int perf_get_mediated_pmu(void);
>  void perf_put_mediated_pmu(void);
> -void perf_guest_enter(void);
> +void perf_guest_enter(u32 guest_lvtpc);
>  void perf_guest_exit(void);
>  #else /* !CONFIG_PERF_EVENTS: */
>  static inline void *
> @@ -1921,7 +1934,7 @@ static inline int perf_get_mediated_pmu(void)
>  }
>  
>  static inline void perf_put_mediated_pmu(void)			{ }
> -static inline void perf_guest_enter(void)			{ }
> +static inline void perf_guest_enter(u32 guest_lvtpc)		{ }
>  static inline void perf_guest_exit(void)			{ }
>  #endif
>  
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index d05487d465c9..406b86641f02 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -451,6 +451,7 @@ static inline bool is_include_guest_event(struct perf_event *event)
>  static LIST_HEAD(pmus);
>  static DEFINE_MUTEX(pmus_lock);
>  static struct srcu_struct pmus_srcu;
> +static DEFINE_PER_CPU(struct mediated_pmus_list, mediated_pmus);
>  static cpumask_var_t perf_online_mask;
>  static cpumask_var_t perf_online_core_mask;
>  static cpumask_var_t perf_online_die_mask;
> @@ -6053,8 +6054,26 @@ static inline void perf_host_exit(struct perf_cpu_context *cpuctx)
>  	}
>  }
>  
> +static void perf_switch_guest_ctx(bool enter, u32 guest_lvtpc)
> +{
> +	struct mediated_pmus_list *pmus = this_cpu_ptr(&mediated_pmus);
> +	struct perf_cpu_pmu_context *cpc;
> +	struct pmu *pmu;
> +
> +	lockdep_assert_irqs_disabled();
> +
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(cpc, &pmus->list, mediated_entry) {
> +		pmu = cpc->epc.pmu;
> +
> +		if (pmu->switch_guest_ctx)
> +			pmu->switch_guest_ctx(enter, (void *)&guest_lvtpc);
> +	}
> +	rcu_read_unlock();
> +}
> +
>  /* When entering a guest, schedule out all exclude_guest events. */
> -void perf_guest_enter(void)
> +void perf_guest_enter(u32 guest_lvtpc)
>  {
>  	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>  
> @@ -6067,6 +6086,8 @@ void perf_guest_enter(void)
>  
>  	perf_host_exit(cpuctx);
>  
> +	perf_switch_guest_ctx(true, guest_lvtpc);
> +
>  	__this_cpu_write(perf_in_guest, true);
>  
>  unlock:

This, I'm still utterly hating on that lvtpc argument. That doesn't
belong here. Make it go away.


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 05/38] perf: Add generic exclude_guest support
  2025-03-24 17:30 ` [PATCH v4 05/38] perf: Add generic exclude_guest support Mingwei Zhang
@ 2025-04-25 11:13   ` Peter Zijlstra
  2025-05-14 23:19     ` Sean Christopherson
  2025-05-21 19:55   ` Namhyung Kim
  1 sibling, 1 reply; 127+ messages in thread
From: Peter Zijlstra @ 2025-04-25 11:13 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Sean Christopherson, Paolo Bonzini, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Liang,
	Kan, H. Peter Anvin, linux-perf-users, linux-kernel, kvm,
	linux-kselftest, Yongwei Ma, Xiong Zhang, Dapeng Mi, Jim Mattson,
	Sandipan Das, Zide Chen, Eranian Stephane, Shukla Manali,
	Nikunj Dadhania

On Mon, Mar 24, 2025 at 05:30:45PM +0000, Mingwei Zhang wrote:

> @@ -6040,6 +6041,71 @@ void perf_put_mediated_pmu(void)
>  }
>  EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
>  
> +static inline void perf_host_exit(struct perf_cpu_context *cpuctx)
> +{
> +	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
> +	ctx_sched_out(&cpuctx->ctx, NULL, EVENT_GUEST);
> +	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
> +	if (cpuctx->task_ctx) {
> +		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
> +		task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);
> +		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
> +	}
> +}
> +
> +/* When entering a guest, schedule out all exclude_guest events. */
> +void perf_guest_enter(void)
> +{
> +	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
> +
> +	lockdep_assert_irqs_disabled();
> +
> +	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> +
> +	if (WARN_ON_ONCE(__this_cpu_read(perf_in_guest)))
> +		goto unlock;
> +
> +	perf_host_exit(cpuctx);
> +
> +	__this_cpu_write(perf_in_guest, true);
> +
> +unlock:
> +	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> +}
> +EXPORT_SYMBOL_GPL(perf_guest_enter);
> +
> +static inline void perf_host_enter(struct perf_cpu_context *cpuctx)
> +{
> +	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
> +	if (cpuctx->task_ctx)
> +		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
> +
> +	perf_event_sched_in(cpuctx, cpuctx->task_ctx, NULL, EVENT_GUEST);
> +
> +	if (cpuctx->task_ctx)
> +		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
> +	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
> +}
> +
> +void perf_guest_exit(void)
> +{
> +	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
> +
> +	lockdep_assert_irqs_disabled();
> +
> +	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> +
> +	if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
> +		goto unlock;
> +
> +	perf_host_enter(cpuctx);
> +
> +	__this_cpu_write(perf_in_guest, false);
> +unlock:
> +	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> +}
> +EXPORT_SYMBOL_GPL(perf_guest_exit);

This naming is confusing on purpose? Pick either guest/host and stick
with it.
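
Spelled out, the complaint is that perf_guest_enter() calls a helper named
perf_host_exit() while perf_guest_exit() calls perf_host_enter(), so the reader
flips host/guest twice. Purely as an illustration of one consistent scheme
(hypothetical names; the body is the same code quoted above):

	/* What entering the guest does to the host context, under one prefix. */
	static inline void perf_sched_out_host_events(struct perf_cpu_context *cpuctx)
	{
		perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
		ctx_sched_out(&cpuctx->ctx, NULL, EVENT_GUEST);
		perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
		if (cpuctx->task_ctx) {
			perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
			task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);
			perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
		}
	}

perf_guest_enter() would call this, perf_guest_exit() would call a matching
perf_sched_in_host_events(), and the guest/host names would no longer point in
opposite directions.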

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 10/38] perf/x86: Support switch_guest_ctx interface
  2025-03-24 17:30 ` [PATCH v4 10/38] perf/x86: Support switch_guest_ctx interface Mingwei Zhang
@ 2025-04-25 11:15   ` Peter Zijlstra
  2025-04-25 13:06     ` Liang, Kan
  0 siblings, 1 reply; 127+ messages in thread
From: Peter Zijlstra @ 2025-04-25 11:15 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Sean Christopherson, Paolo Bonzini, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Liang,
	Kan, H. Peter Anvin, linux-perf-users, linux-kernel, kvm,
	linux-kselftest, Yongwei Ma, Xiong Zhang, Dapeng Mi, Jim Mattson,
	Sandipan Das, Zide Chen, Eranian Stephane, Shukla Manali,
	Nikunj Dadhania

On Mon, Mar 24, 2025 at 05:30:50PM +0000, Mingwei Zhang wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> Implement switch_guest_ctx interface for x86 PMU, switch PMI to dedicated
> KVM_GUEST_PMI_VECTOR at perf guest enter, and switch PMI back to
> NMI at perf guest exit.
> 
> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/events/core.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 8f218ac0d445..28161d6ff26d 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2677,6 +2677,16 @@ static bool x86_pmu_filter(struct pmu *pmu, int cpu)
>  	return ret;
>  }
>  
> +static void x86_pmu_switch_guest_ctx(bool enter, void *data)
> +{
> +	u32 guest_lvtpc = *(u32 *)data;
> +
> +	if (enter)
> +		apic_write(APIC_LVTPC, guest_lvtpc);
> +	else
> +		apic_write(APIC_LVTPC, APIC_DM_NMI);
> +}

This, why can't it use x86_pmu.guest_lvtpc here and call it a day? Why
is that argument passed around through the generic code only to get back
here?

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 00/38] Mediated vPMU 4.0 for x86
  2025-04-16  7:22 ` [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mi, Dapeng
@ 2025-04-25 12:27   ` Peter Zijlstra
  0 siblings, 0 replies; 127+ messages in thread
From: Peter Zijlstra @ 2025-04-25 12:27 UTC (permalink / raw)
  To: Mi, Dapeng
  Cc: Mingwei Zhang, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Liang,
	Kan, H. Peter Anvin, linux-perf-users, linux-kernel, kvm,
	linux-kselftest, Yongwei Ma, Xiong Zhang, Jim Mattson,
	Sandipan Das, Zide Chen, Eranian Stephane, Shukla Manali,
	Nikunj Dadhania

On Wed, Apr 16, 2025 at 03:22:02PM +0800, Mi, Dapeng wrote:
> Kindly ping... Any comments on this patch series? Thanks.

I suppose it's mostly okay, just a few nits.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 10/38] perf/x86: Support switch_guest_ctx interface
  2025-04-25 11:15   ` Peter Zijlstra
@ 2025-04-25 13:06     ` Liang, Kan
  2025-04-25 13:43       ` Peter Zijlstra
  0 siblings, 1 reply; 127+ messages in thread
From: Liang, Kan @ 2025-04-25 13:06 UTC (permalink / raw)
  To: Peter Zijlstra, Mingwei Zhang
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Sean Christopherson, Paolo Bonzini, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Liang,
	H. Peter Anvin, linux-perf-users, linux-kernel, kvm,
	linux-kselftest, Yongwei Ma, Xiong Zhang, Dapeng Mi, Jim Mattson,
	Sandipan Das, Zide Chen, Eranian Stephane, Shukla Manali,
	Nikunj Dadhania



On 2025-04-25 7:15 a.m., Peter Zijlstra wrote:
> On Mon, Mar 24, 2025 at 05:30:50PM +0000, Mingwei Zhang wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> Implement switch_guest_ctx interface for x86 PMU, switch PMI to dedicated
>> KVM_GUEST_PMI_VECTOR at perf guest enter, and switch PMI back to
>> NMI at perf guest exit.
>>
>> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>  arch/x86/events/core.c | 12 ++++++++++++
>>  1 file changed, 12 insertions(+)
>>
>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> index 8f218ac0d445..28161d6ff26d 100644
>> --- a/arch/x86/events/core.c
>> +++ b/arch/x86/events/core.c
>> @@ -2677,6 +2677,16 @@ static bool x86_pmu_filter(struct pmu *pmu, int cpu)
>>  	return ret;
>>  }
>>  
>> +static void x86_pmu_switch_guest_ctx(bool enter, void *data)
>> +{
>> +	u32 guest_lvtpc = *(u32 *)data;
>> +
>> +	if (enter)
>> +		apic_write(APIC_LVTPC, guest_lvtpc);
>> +	else
>> +		apic_write(APIC_LVTPC, APIC_DM_NMI);
>> +}
> 
> This, why can't it use x86_pmu.guest_lvtpc here and call it a day? Why
> is that argument passed around through the generic code only to get back
> here?

The vector has to come from KVM. However, the current interfaces only
let KVM read perf variables, e.g., perf_get_x86_pmu_capability and
perf_get_hw_event_config.
We need to add a new interface to allow KVM to write a perf variable,
e.g., perf_set_guest_lvtpc.

Thanks,
Kan
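
For illustration, one possible shape for such a hook, keeping the LVTPC value
entirely inside arch/x86 (a hypothetical sketch following the name suggested
above, not an existing perf API):

	/* arch/x86/events/core.c */
	static DEFINE_PER_CPU(u32, perf_guest_lvtpc);

	void perf_set_guest_lvtpc(u32 lvtpc)
	{
		/* Called by KVM, with IRQs disabled, before perf_guest_enter(). */
		this_cpu_write(perf_guest_lvtpc, lvtpc);
	}
	EXPORT_SYMBOL_GPL(perf_set_guest_lvtpc);

	static void x86_pmu_switch_guest_ctx(bool enter, void *data)
	{
		/* 'data' is unused here; the vector was recorded via perf_set_guest_lvtpc(). */
		if (enter)
			apic_write(APIC_LVTPC, this_cpu_read(perf_guest_lvtpc));
		else
			apic_write(APIC_LVTPC, APIC_DM_NMI);
	}

With something along those lines, perf_guest_enter() and the generic
switch_guest_ctx() callback can drop the guest_lvtpc argument entirely.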

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 10/38] perf/x86: Support switch_guest_ctx interface
  2025-04-25 13:06     ` Liang, Kan
@ 2025-04-25 13:43       ` Peter Zijlstra
  2025-04-25 13:56         ` Liang, Kan
  0 siblings, 1 reply; 127+ messages in thread
From: Peter Zijlstra @ 2025-04-25 13:43 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Mingwei Zhang, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Liang,
	H. Peter Anvin, linux-perf-users, linux-kernel, kvm,
	linux-kselftest, Yongwei Ma, Xiong Zhang, Dapeng Mi, Jim Mattson,
	Sandipan Das, Zide Chen, Eranian Stephane, Shukla Manali,
	Nikunj Dadhania

On Fri, Apr 25, 2025 at 09:06:26AM -0400, Liang, Kan wrote:
> 
> 
> On 2025-04-25 7:15 a.m., Peter Zijlstra wrote:
> > On Mon, Mar 24, 2025 at 05:30:50PM +0000, Mingwei Zhang wrote:
> >> From: Kan Liang <kan.liang@linux.intel.com>
> >>
> >> Implement switch_guest_ctx interface for x86 PMU, switch PMI to dedicated
> >> KVM_GUEST_PMI_VECTOR at perf guest enter, and switch PMI back to
> >> NMI at perf guest exit.
> >>
> >> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> >> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> >> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
> >> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> >> ---
> >>  arch/x86/events/core.c | 12 ++++++++++++
> >>  1 file changed, 12 insertions(+)
> >>
> >> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> >> index 8f218ac0d445..28161d6ff26d 100644
> >> --- a/arch/x86/events/core.c
> >> +++ b/arch/x86/events/core.c
> >> @@ -2677,6 +2677,16 @@ static bool x86_pmu_filter(struct pmu *pmu, int cpu)
> >>  	return ret;
> >>  }
> >>  
> >> +static void x86_pmu_switch_guest_ctx(bool enter, void *data)
> >> +{
> >> +	u32 guest_lvtpc = *(u32 *)data;
> >> +
> >> +	if (enter)
> >> +		apic_write(APIC_LVTPC, guest_lvtpc);
> >> +	else
> >> +		apic_write(APIC_LVTPC, APIC_DM_NMI);
> >> +}
> > 
> > This, why can't it use x86_pmu.guest_lvtpc here and call it a day? Why
> > is that argument passed around through the generic code only to get back
> > here?
> 
> The vector has to come from KVM. However, the current interfaces only
> let KVM read perf variables, e.g., perf_get_x86_pmu_capability and
> perf_get_hw_event_config.
> We need to add a new interface to allow KVM to write a perf variable,
> e.g., perf_set_guest_lvtpc.

But all that should remain in x86; there is no reason whatsoever to
leak this into generic code.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 10/38] perf/x86: Support switch_guest_ctx interface
  2025-04-25 13:43       ` Peter Zijlstra
@ 2025-04-25 13:56         ` Liang, Kan
  2025-07-30  0:31           ` Sean Christopherson
  0 siblings, 1 reply; 127+ messages in thread
From: Liang, Kan @ 2025-04-25 13:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mingwei Zhang, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Sean Christopherson, Paolo Bonzini, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Liang,
	H. Peter Anvin, linux-perf-users, linux-kernel, kvm,
	linux-kselftest, Yongwei Ma, Xiong Zhang, Dapeng Mi, Jim Mattson,
	Sandipan Das, Zide Chen, Eranian Stephane, Shukla Manali,
	Nikunj Dadhania



On 2025-04-25 9:43 a.m., Peter Zijlstra wrote:
> On Fri, Apr 25, 2025 at 09:06:26AM -0400, Liang, Kan wrote:
>>
>>
>> On 2025-04-25 7:15 a.m., Peter Zijlstra wrote:
>>> On Mon, Mar 24, 2025 at 05:30:50PM +0000, Mingwei Zhang wrote:
>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>
>>>> Implement switch_guest_ctx interface for x86 PMU, switch PMI to dedicated
>>>> KVM_GUEST_PMI_VECTOR at perf guest enter, and switch PMI back to
>>>> NMI at perf guest exit.
>>>>
>>>> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>>> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
>>>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>>>> ---
>>>>  arch/x86/events/core.c | 12 ++++++++++++
>>>>  1 file changed, 12 insertions(+)
>>>>
>>>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>>>> index 8f218ac0d445..28161d6ff26d 100644
>>>> --- a/arch/x86/events/core.c
>>>> +++ b/arch/x86/events/core.c
>>>> @@ -2677,6 +2677,16 @@ static bool x86_pmu_filter(struct pmu *pmu, int cpu)
>>>>  	return ret;
>>>>  }
>>>>  
>>>> +static void x86_pmu_switch_guest_ctx(bool enter, void *data)
>>>> +{
>>>> +	u32 guest_lvtpc = *(u32 *)data;
>>>> +
>>>> +	if (enter)
>>>> +		apic_write(APIC_LVTPC, guest_lvtpc);
>>>> +	else
>>>> +		apic_write(APIC_LVTPC, APIC_DM_NMI);
>>>> +}
>>>
>>> This, why can't it use x86_pmu.guest_lvtpc here and call it a day? Why
>>> is that argument passed around through the generic code only to get back
>>> here?
>>
>> The vector has to come from KVM. However, the current interfaces only
>> let KVM read perf variables, e.g., perf_get_x86_pmu_capability and
>> perf_get_hw_event_config.
>> We need to add a new interface to allow KVM to write a perf variable,
>> e.g., perf_set_guest_lvtpc.
> 
> But all that should remain in x86; there is no reason whatsoever to
> leak this into generic code.
> 

Sure. I will change it in the V5.

Thanks,
Kan

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 00/38] Mediated vPMU 4.0 for x86
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (38 preceding siblings ...)
  2025-04-16  7:22 ` [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mi, Dapeng
@ 2025-05-06  9:57 ` Mi, Dapeng
  2025-05-06 19:45   ` Sean Christopherson
  2025-05-15  0:49 ` Sean Christopherson
  40 siblings, 1 reply; 127+ messages in thread
From: Mi, Dapeng @ 2025-05-06  9:57 UTC (permalink / raw)
  To: Mingwei Zhang, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Sean Christopherson,
	Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Yongwei Ma, Xiong Zhang,
	Jim Mattson, Sandipan Das, Zide Chen, Eranian Stephane,
	Shukla Manali, Nikunj Dadhania

Hi Sean,

Not sure if you have bandwidth to review this mediated vPMU v4 patchset?
All your comments in v3 patchset have been addressed.

Thanks.

Dapeng Mi

On 3/25/2025 1:30 AM, Mingwei Zhang wrote:
> With joint effort from the upstream KVM community, we come up with the
> 4th version of mediated vPMU for x86. We have made the following changes
> on top of the previous RFC v3.
>
> v3 -> v4
>  - Rebase whole patchset on 6.14-rc3 base.
>  - Address Peter's comments on Perf part.
>  - Address Sean's comments on KVM part.
>    * Change key word "passthrough" to "mediated" in all patches
>    * Change static enabling to user space dynamic enabling via KVM_CAP_PMU_CAPABILITY.
>    * Only support GLOBAL_CTRL save/restore with VMCS exec_ctrl, drop the MSR
>      save/restore list support for GLOBAL_CTRL, thus the support of mediated
>      vPMU is constrained to SapphireRapids and later CPUs on Intel side.
>    * Merge some small changes into a single patch.
>  - Address Sandipan's comment on invalid pmu pointer.
>  - Add back "eventsel_hw" and "fixed_ctr_ctrl_hw" to avoid to directly
>    manipulate pmc->eventsel and pmu->fixed_ctr_ctrl.
>
>
> Testing (Intel side):
>  - Perf-based legacy vPMU (force emulation on/off)
>    * Kselftests pmu_counters_test, pmu_event_filter_test and
>      vmx_pmu_caps_test pass.
>    * KUT PMU tests pmu, pmu_lbr, pmu_pebs pass.
>    * Basic perf counting/sampling tests in 3 scenarios, guest-only,
>      host-only and host-guest coexistence all pass.
>
>  - Mediated vPMU (force emulation on/off)
>    * Kselftests pmu_counters_test, pmu_event_filter_test and
>      vmx_pmu_caps_test pass.
>    * KUT PMU tests pmu, pmu_lbr, pmu_pebs pass.
>    * Basic perf counting/sampling tests in 3 scenarios, guest-only,
>      host-only and host-guest coexistence all pass.
>
>  - Failures. All above tests passed on Intel Granite Rapids as well
>    except a failure on KUT/pmu_pebs.
>    * GP counter 0 (0xfffffffffffe): PEBS record (written seq 0)
>      is verified (including size, counters and cfg).
>    * The pebs_data_cfg (0xb500000000) doesn't match with the
>      effective MSR_PEBS_DATA_CFG (0x0).
>    * This failure has nothing to do with this mediated vPMU patch set. The
>      failure is caused by Granite Rapids supported timed PEBS which needs
>      extra support on Qemu and KUT/pmu_pebs. These extra support would be
>      sent in separate patches later.
>
>
> Testing (AMD side):
>  - Kselftests pmu_counters_test, pmu_event_filter_test and
>    vmx_pmu_caps_test all pass
>
>  - legacy guest with KUT/pmu:
>    * qemu option: -cpu host, -perfctr-core
>    * when set force_emulation_prefix=1, passes
>    * when set force_emulation_prefix=0, passes
>  - perfmon-v1 guest with KUT/pmu:
>    * qemu option: -cpu host, -perfmon-v2
>    * when set force_emulation_prefix=1, passes
>    * when set force_emulation_prefix=0, passes
>  - perfmon-v2 guest with KUT/pmu:
>    * qemu option: -cpu host
>    * when set force_emulation_prefix=1, passes
>    * when set force_emulation_prefix=0, passes
>
>  - perf_fuzzer (perfmon-v2):
>    * fails with soft lockup in guest in current version.
>    * culprit could be between 6.13 ~ 6.14-rc3 within KVM
>    * Series tested on 6.12 and 6.13 without issue.
>
> Note: a QEMU series is needed to run mediated vPMU v4:
>  - https://lore.kernel.org/all/20250324123712.34096-1-dapeng1.mi@linux.intel.com/
>
> History:
>  - RFC v3: https://lore.kernel.org/all/20240801045907.4010984-1-mizhang@google.com/
>  - RFC v2: https://lore.kernel.org/all/20240506053020.3911940-1-mizhang@google.com/
>  - RFC v1: https://lore.kernel.org/all/20240126085444.324918-1-xiong.y.zhang@linux.intel.com/
>
>
> Dapeng Mi (18):
>   KVM: x86/pmu: Introduce enable_mediated_pmu global parameter
>   KVM: x86/pmu: Check PMU cpuid configuration from user space
>   KVM: x86: Rename vmx_vmentry/vmexit_ctrl() helpers
>   KVM: x86/pmu: Add perf_capabilities field in struct kvm_host_values{}
>   KVM: x86/pmu: Move PMU_CAP_{FW_WRITES,LBR_FMT} into msr-index.h header
>   KVM: VMX: Add macros to wrap around
>     {secondary,tertiary}_exec_controls_changebit()
>   KVM: x86/pmu: Check if mediated vPMU can intercept rdpmc
>   KVM: x86/pmu/vmx: Save/load guest IA32_PERF_GLOBAL_CTRL with
>     vm_exit/entry_ctrl
>   KVM: x86/pmu: Optimize intel/amd_pmu_refresh() helpers
>   KVM: x86/pmu: Setup PMU MSRs' interception mode
>   KVM: x86/pmu: Handle PMU MSRs interception and event filtering
>   KVM: x86/pmu: Switch host/guest PMU context at vm-exit/vm-entry
>   KVM: x86/pmu: Handle emulated instruction for mediated vPMU
>   KVM: nVMX: Add macros to simplify nested MSR interception setting
>   KVM: selftests: Add mediated vPMU supported for pmu tests
>   KVM: Selftests: Support mediated vPMU for vmx_pmu_caps_test
>   KVM: Selftests: Fix pmu_counters_test error for mediated vPMU
>   KVM: x86/pmu: Expose enable_mediated_pmu parameter to user space
>
> Kan Liang (8):
>   perf: Support get/put mediated PMU interfaces
>   perf: Skip pmu_ctx based on event_type
>   perf: Clean up perf ctx time
>   perf: Add a EVENT_GUEST flag
>   perf: Add generic exclude_guest support
>   perf: Add switch_guest_ctx() interface
>   perf/x86: Support switch_guest_ctx interface
>   perf/x86/intel: Support PERF_PMU_CAP_MEDIATED_VPMU
>
> Mingwei Zhang (5):
>   perf/x86: Forbid PMI handler when guest own PMU
>   perf/x86/core: Plumb mediated PMU capability from x86_pmu to
>     x86_pmu_cap
>   KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot()
>   KVM: x86/pmu: introduce eventsel_hw to prepare for pmu event filtering
>   KVM: nVMX: Add nested virtualization support for mediated PMU
>
> Sandipan Das (4):
>   perf/x86/core: Do not set bit width for unavailable counters
>   KVM: x86/pmu: Add AMD PMU registers to direct access list
>   KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest
>     write to event selectors
>   perf/x86/amd: Support PERF_PMU_CAP_MEDIATED_VPMU for AMD host
>
> Xiong Zhang (3):
>   x86/irq: Factor out common code for installing kvm irq handler
>   perf: core/x86: Register a new vector for KVM GUEST PMI
>   KVM: x86/pmu: Register KVM_GUEST_PMI_VECTOR handler
>
>  arch/x86/events/amd/core.c                    |   2 +
>  arch/x86/events/core.c                        |  40 +-
>  arch/x86/events/intel/core.c                  |   5 +
>  arch/x86/include/asm/hardirq.h                |   1 +
>  arch/x86/include/asm/idtentry.h               |   1 +
>  arch/x86/include/asm/irq.h                    |   2 +-
>  arch/x86/include/asm/irq_vectors.h            |   5 +-
>  arch/x86/include/asm/kvm-x86-pmu-ops.h        |   2 +
>  arch/x86/include/asm/kvm_host.h               |  10 +
>  arch/x86/include/asm/msr-index.h              |  18 +-
>  arch/x86/include/asm/perf_event.h             |   1 +
>  arch/x86/include/asm/vmx.h                    |   1 +
>  arch/x86/kernel/idt.c                         |   1 +
>  arch/x86/kernel/irq.c                         |  39 +-
>  arch/x86/kvm/cpuid.c                          |  15 +
>  arch/x86/kvm/pmu.c                            | 254 ++++++++-
>  arch/x86/kvm/pmu.h                            |  45 ++
>  arch/x86/kvm/svm/pmu.c                        | 148 ++++-
>  arch/x86/kvm/svm/svm.c                        |  26 +
>  arch/x86/kvm/svm/svm.h                        |   2 +-
>  arch/x86/kvm/vmx/capabilities.h               |  11 +-
>  arch/x86/kvm/vmx/nested.c                     |  68 ++-
>  arch/x86/kvm/vmx/pmu_intel.c                  | 224 ++++++--
>  arch/x86/kvm/vmx/vmx.c                        |  89 +--
>  arch/x86/kvm/vmx/vmx.h                        |  11 +-
>  arch/x86/kvm/x86.c                            |  63 ++-
>  arch/x86/kvm/x86.h                            |   2 +
>  include/linux/perf_event.h                    |  47 +-
>  kernel/events/core.c                          | 519 ++++++++++++++----
>  .../beauty/arch/x86/include/asm/irq_vectors.h |   5 +-
>  .../selftests/kvm/include/kvm_test_harness.h  |  13 +
>  .../testing/selftests/kvm/include/kvm_util.h  |   3 +
>  .../selftests/kvm/include/x86/processor.h     |   8 +
>  tools/testing/selftests/kvm/lib/kvm_util.c    |  23 +
>  .../selftests/kvm/x86/pmu_counters_test.c     |  24 +-
>  .../selftests/kvm/x86/pmu_event_filter_test.c |   8 +-
>  .../selftests/kvm/x86/vmx_pmu_caps_test.c     |   2 +-
>  37 files changed, 1480 insertions(+), 258 deletions(-)
>
>
> base-commit: 0ad2507d5d93f39619fc42372c347d6006b64319

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 00/38] Mediated vPMU 4.0 for x86
  2025-05-06  9:57 ` Mi, Dapeng
@ 2025-05-06 19:45   ` Sean Christopherson
  2025-05-07  0:46     ` Mi, Dapeng
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-05-06 19:45 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Mingwei Zhang, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Paolo Bonzini,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Yongwei Ma, Xiong Zhang,
	Jim Mattson, Sandipan Das, Zide Chen, Eranian Stephane,
	Shukla Manali, Nikunj Dadhania

On Tue, May 06, 2025, Dapeng Mi wrote:
> Hi Sean,
> 
> Not sure if you have bandwidth to review this mediated vPMU v4 patchset?

I'm getting there.  I wanted to get through all the stuff I thought would likely
be ready for 6.16 as-is before moving onto the larger series.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 00/38] Mediated vPMU 4.0 for x86
  2025-05-06 19:45   ` Sean Christopherson
@ 2025-05-07  0:46     ` Mi, Dapeng
  0 siblings, 0 replies; 127+ messages in thread
From: Mi, Dapeng @ 2025-05-07  0:46 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Mingwei Zhang, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Paolo Bonzini,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Yongwei Ma, Xiong Zhang,
	Jim Mattson, Sandipan Das, Zide Chen, Eranian Stephane,
	Shukla Manali, Nikunj Dadhania


On 5/7/2025 3:45 AM, Sean Christopherson wrote:
> On Tue, May 06, 2025, Dapeng Mi wrote:
>> Hi Sean,
>>
>> Not sure if you have bandwidth to review this mediated vPMU v4 patchset?
> I'm getting there.  I wanted to get through all the stuff I thought would likely
> be ready for 6.16 as-is before moving onto the larger series.

Got it. Thanks.


>

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 01/38] perf: Support get/put mediated PMU interfaces
  2025-03-24 17:30 ` [PATCH v4 01/38] perf: Support get/put mediated PMU interfaces Mingwei Zhang
@ 2025-05-14 22:48   ` Sean Christopherson
  2025-05-15  1:31     ` Mi, Dapeng
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-05-14 22:48 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania

On Mon, Mar 24, 2025, Mingwei Zhang wrote:
> +/*
> + * Currently invoked at VM creation to
> + * - Check whether there are existing !exclude_guest events of PMU with
> + *   PERF_PMU_CAP_MEDIATED_VPMU
> + * - Set nr_mediated_pmu_vms to prevent !exclude_guest event creation on
> + *   PMUs with PERF_PMU_CAP_MEDIATED_VPMU
> + *
> + * No impact for the PMU without PERF_PMU_CAP_MEDIATED_VPMU. The perf
> + * still owns all the PMU resources.
> + */
> +int perf_get_mediated_pmu(void)
> +{
> +	guard(mutex)(&perf_mediated_pmu_mutex);
> +	if (atomic_inc_not_zero(&nr_mediated_pmu_vms))
> +		return 0;
> +
> +	if (atomic_read(&nr_include_guest_events))
> +		return -EBUSY;
> +
> +	atomic_inc(&nr_mediated_pmu_vms);
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(perf_get_mediated_pmu);

IMO, all of the mediated PMU logic should be guarded with a Kconfig.  I strongly
suspect KVM x86 will be the only user for the foreseeable future, e.g. arm64 is trending
toward a partitioned PMU approach, and subjecting other architectures to the (minor)
overhead associated with e.g. nr_mediated_pmu_vms seems pointless.  The other
nicety is that it helps encapsulate the mediated PMU code, which for those of us
that haven't been living and breathing this for the last few months, is immensely
helpful.
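
For illustration only (not from the series), a minimal sketch of how such a guard
could look; CONFIG_PERF_GUEST_MEDIATED_PMU is a hypothetical symbol that KVM x86
would select:

/* include/linux/perf_event.h -- hypothetical Kconfig guard, names not from the series */
#ifdef CONFIG_PERF_GUEST_MEDIATED_PMU
int perf_get_mediated_pmu(void);
void perf_put_mediated_pmu(void);
#else
static inline int perf_get_mediated_pmu(void) { return 0; }
static inline void perf_put_mediated_pmu(void) { }
#endif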

> +void perf_put_mediated_pmu(void)

To avoid confusion with perf_put_guest_context() in future patches, I think it
makes sense to go with something like perf_{create,release}_mediated_pmu().  I
actually like the get/put terminology in isolation, but they look weird side-by-side.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 04/38] perf: Add a EVENT_GUEST flag
  2025-03-24 17:30 ` [PATCH v4 04/38] perf: Add a EVENT_GUEST flag Mingwei Zhang
@ 2025-05-14 22:51   ` Sean Christopherson
  2025-05-15  1:35     ` Mi, Dapeng
  2025-05-19  6:58   ` Namhyung Kim
  2025-05-21 19:46   ` Namhyung Kim
  2 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-05-14 22:51 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania

On Mon, Mar 24, 2025, Mingwei Zhang wrote:
> +{
> +	if (event->attr.exclude_guest && __this_cpu_read(perf_in_guest)) {

My vote is for s/perf_in_guest/guest_ctx_loaded, because "perf in guest" doesn't
accurately describe just the mediated PMU case.  E.g. perf itself is running in
KVM guests when using an emulated vPMU, or no vPMU at all.

And with a Kconfig to guard the mediated PMU, this check (and others) can be
elided at compile time for architectures that don't support a mediated PMU (or
if KVM is disabled).

> +		/*
> +		 * (now + times[total].offset) - (now + times[guest].offset) :=
> +		 * times[total].offset - times[guest].offset
> +		 */
> +		return READ_ONCE(times[T_TOTAL].offset) - READ_ONCE(times[T_GUEST].offset);
> +	}
> +
> +	return now + READ_ONCE(times[T_TOTAL].offset);
> +}
> +

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 05/38] perf: Add generic exclude_guest support
  2025-04-25 11:13   ` Peter Zijlstra
@ 2025-05-14 23:19     ` Sean Christopherson
  2025-05-15  1:37       ` Mi, Dapeng
  2025-05-15 18:39       ` Liang, Kan
  0 siblings, 2 replies; 127+ messages in thread
From: Sean Christopherson @ 2025-05-14 23:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mingwei Zhang, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania

On Fri, Apr 25, 2025, Peter Zijlstra wrote:
> On Mon, Mar 24, 2025 at 05:30:45PM +0000, Mingwei Zhang wrote:
> 
> > @@ -6040,6 +6041,71 @@ void perf_put_mediated_pmu(void)
> >  }
> >  EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
> >  
> > +static inline void perf_host_exit(struct perf_cpu_context *cpuctx)
> > +{
> > +	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
> > +	ctx_sched_out(&cpuctx->ctx, NULL, EVENT_GUEST);
> > +	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
> > +	if (cpuctx->task_ctx) {
> > +		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
> > +		task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);
> > +		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
> > +	}
> > +}
> > +
> > +/* When entering a guest, schedule out all exclude_guest events. */
> > +void perf_guest_enter(void)
> > +{
> > +	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
> > +
> > +	lockdep_assert_irqs_disabled();
> > +
> > +	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> > +
> > +	if (WARN_ON_ONCE(__this_cpu_read(perf_in_guest)))
> > +		goto unlock;
> > +
> > +	perf_host_exit(cpuctx);
> > +
> > +	__this_cpu_write(perf_in_guest, true);
> > +
> > +unlock:
> > +	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> > +}
> > +EXPORT_SYMBOL_GPL(perf_guest_enter);
> > +
> > +static inline void perf_host_enter(struct perf_cpu_context *cpuctx)
> > +{
> > +	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
> > +	if (cpuctx->task_ctx)
> > +		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
> > +
> > +	perf_event_sched_in(cpuctx, cpuctx->task_ctx, NULL, EVENT_GUEST);
> > +
> > +	if (cpuctx->task_ctx)
> > +		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
> > +	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
> > +}
> > +
> > +void perf_guest_exit(void)
> > +{
> > +	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
> > +
> > +	lockdep_assert_irqs_disabled();
> > +
> > +	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> > +
> > +	if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
> > +		goto unlock;
> > +
> > +	perf_host_enter(cpuctx);
> > +
> > +	__this_cpu_write(perf_in_guest, false);
> > +unlock:
> > +	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> > +}
> > +EXPORT_SYMBOL_GPL(perf_guest_exit);
> 
> This naming is confusing on purpose? Pick either guest/host and stick
> with it.

+1.  I also think the inner perf_host_{enter,exit}() helpers are superfluous;
these flows are short enough to open-code directly in the callers.

After a bit of hacking, and with a few spoilers, this is what I ended up with
(not anywhere near fully tested).  I like following KVM's kvm_xxx_{load,put}()
nomenclature to tie everything together, so I went with "guest" instead of "host"
even though the majority of work being done is to schedule out/in host context.

/* When loading a guest's mediated PMU, schedule out all exclude_guest events. */
void perf_load_guest_context(unsigned long data)
{
	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);

	lockdep_assert_irqs_disabled();

	perf_ctx_lock(cpuctx, cpuctx->task_ctx);

	if (WARN_ON_ONCE(__this_cpu_read(guest_ctx_loaded)))
		goto unlock;

	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
	ctx_sched_out(&cpuctx->ctx, NULL, EVENT_GUEST);
	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
	if (cpuctx->task_ctx) {
		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
		task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);
		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
	}

	arch_perf_load_guest_context(data);

	__this_cpu_write(guest_ctx_loaded, true);

unlock:
	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
}
EXPORT_SYMBOL_GPL(perf_load_guest_context);

void perf_put_guest_context(void)
{
	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);

	lockdep_assert_irqs_disabled();

	perf_ctx_lock(cpuctx, cpuctx->task_ctx);

	if (WARN_ON_ONCE(!__this_cpu_read(guest_ctx_loaded)))
		goto unlock;

	arch_perf_put_guest_context();

	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
	if (cpuctx->task_ctx)
		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);

	perf_event_sched_in(cpuctx, cpuctx->task_ctx, NULL, EVENT_GUEST);

	if (cpuctx->task_ctx)
		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);

	__this_cpu_write(guest_ctx_loaded, false);
unlock:
	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
}
EXPORT_SYMBOL_GPL(perf_put_guest_context);
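
For context, a rough sketch of how KVM might bracket the mediated PMU load/put with
the API above; the call sites and the meaning of @data are assumptions, not the
series' actual code:

/* Illustrative only; actual hook points in KVM may differ. */
static void kvm_mediated_pmu_load(struct kvm_vcpu *vcpu)
{
	lockdep_assert_irqs_disabled();

	/* Assumption: @data carries the guest's LVTPC value. */
	perf_load_guest_context(kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVTPC));
}

static void kvm_mediated_pmu_put(struct kvm_vcpu *vcpu)
{
	lockdep_assert_irqs_disabled();

	perf_put_guest_context();
}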

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 06/38] x86/irq: Factor out common code for installing kvm irq handler
  2025-03-24 17:30 ` [PATCH v4 06/38] x86/irq: Factor out common code for installing kvm irq handler Mingwei Zhang
@ 2025-05-14 23:21   ` Sean Christopherson
  2025-05-15  2:10     ` Mi, Dapeng
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-05-14 23:21 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania

On Mon, Mar 24, 2025, Mingwei Zhang wrote:
> diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
> index 385e3a5fc304..18cd418fe106 100644
> --- a/arch/x86/kernel/irq.c
> +++ b/arch/x86/kernel/irq.c
> @@ -312,16 +312,22 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_x86_platform_ipi)
>  static void dummy_handler(void) {}
>  static void (*kvm_posted_intr_wakeup_handler)(void) = dummy_handler;
>  
> -void kvm_set_posted_intr_wakeup_handler(void (*handler)(void))
> +void x86_set_kvm_irq_handler(u8 vector, void (*handler)(void))
>  {
> -	if (handler)
> +	if (!handler)
> +		handler = dummy_handler;
> +
> +	if (vector == POSTED_INTR_WAKEUP_VECTOR &&
> +	    (handler == dummy_handler ||
> +	     kvm_posted_intr_wakeup_handler == dummy_handler))
>  		kvm_posted_intr_wakeup_handler = handler;
> -	else {
> -		kvm_posted_intr_wakeup_handler = dummy_handler;
> +	else
> +		WARN_ON_ONCE(1);
> +
> +	if (handler == dummy_handler)

Eww.  Aside from the fact that the dummy_handler implementation is pointless
overhead, I don't think KVM should own the IRQ vector.  Given that perf owns the
LVTPC, i.e. is responsible for switching between NMI and the mediated PMI IRQ, I
think perf should also own the vector.  KVM can then use the existing perf guest
callbacks to wire up its PMI handler.

And with that, this patch can be dropped.
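
A hedged sketch of what routing through perf could look like; the handle_guest_pmi
member and the perf_guest_handle_pmi() wrapper are hypothetical (today's
struct perf_guest_info_callbacks only has state/get_ip/handle_intel_pt_intr), and a
real implementation would presumably go through perf's static-call wrappers:

/* Hypothetical extension of the existing perf guest callbacks. */
struct perf_guest_info_callbacks {
	unsigned int	(*state)(void);
	unsigned long	(*get_ip)(void);
	unsigned int	(*handle_intel_pt_intr)(void);
	void		(*handle_guest_pmi)(void);	/* new: forward mediated PMIs to KVM */
};

/* Vector handler owned by perf, not KVM (name illustrative). */
DEFINE_IDTENTRY_SYSVEC(sysvec_perf_guest_pmi)
{
	apic_eoi();
	perf_guest_handle_pmi();	/* hypothetical wrapper over the callback */
}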

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 07/38] perf: core/x86: Register a new vector for KVM GUEST PMI
  2025-03-24 17:30 ` [PATCH v4 07/38] perf: core/x86: Register a new vector for KVM GUEST PMI Mingwei Zhang
@ 2025-05-14 23:24   ` Sean Christopherson
  2025-05-15  1:40     ` Mi, Dapeng
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-05-14 23:24 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania

On Mon, Mar 24, 2025, Mingwei Zhang wrote:
> diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
> index ad5c68f0509d..b0cb3220e1bb 100644
> --- a/arch/x86/include/asm/idtentry.h
> +++ b/arch/x86/include/asm/idtentry.h
> @@ -745,6 +745,7 @@ DECLARE_IDTENTRY_SYSVEC(IRQ_WORK_VECTOR,		sysvec_irq_work);
>  DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_VECTOR,		sysvec_kvm_posted_intr_ipi);
>  DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_WAKEUP_VECTOR,	sysvec_kvm_posted_intr_wakeup_ipi);
>  DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_NESTED_VECTOR,	sysvec_kvm_posted_intr_nested_ipi);
> +DECLARE_IDTENTRY_SYSVEC(KVM_GUEST_PMI_VECTOR,	        sysvec_kvm_guest_pmi_handler);

I would prefer to keep KVM out of the name, and as mentioned in the previous patch,
route this through perf.

>  #else
>  # define fred_sysvec_kvm_posted_intr_ipi		NULL
>  # define fred_sysvec_kvm_posted_intr_wakeup_ipi		NULL

Y'all forgot to wire up the FRED handling.  I.e. the mediated PMI IRQs would get
treated as spurious when running with FRED.

> diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
> index 47051871b436..250cdab11306 100644
> --- a/arch/x86/include/asm/irq_vectors.h
> +++ b/arch/x86/include/asm/irq_vectors.h
> @@ -77,7 +77,10 @@
>   */
>  #define IRQ_WORK_VECTOR			0xf6
>  
> -/* 0xf5 - unused, was UV_BAU_MESSAGE */
> +#if IS_ENABLED(CONFIG_KVM)
> +#define KVM_GUEST_PMI_VECTOR		0xf5
> +#endif

Conditionally defining the vector sounds good on paper, but it's problematic, e.g.
for connecting the handler to FRED's array, and doesn't really add much value.
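
For illustration, a sketch of defining the vector unconditionally and wiring up both
the IDT and FRED; the vector name is illustrative and the entry-table macros are the
ones used for the existing posted-interrupt vectors:

/* arch/x86/include/asm/irq_vectors.h -- no #ifdef CONFIG_KVM */
#define PERF_GUEST_MEDIATED_PMI_VECTOR	0xf5

/* arch/x86/kernel/idt.c */
	INTG(PERF_GUEST_MEDIATED_PMI_VECTOR,	asm_sysvec_perf_guest_pmi),

/* arch/x86/entry/entry_fred.c -- the piece the quoted patch is missing */
	SYSVEC(PERF_GUEST_MEDIATED_PMI_VECTOR,	perf_guest_pmi),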

>  #define DEFERRED_ERROR_VECTOR		0xf4
>  
>  /* Vector on which hypervisor callbacks will be delivered */
> diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
> index f445bec516a0..0bec4c7e2308 100644
> --- a/arch/x86/kernel/idt.c
> +++ b/arch/x86/kernel/idt.c
> @@ -157,6 +157,7 @@ static const __initconst struct idt_data apic_idts[] = {
>  	INTG(POSTED_INTR_VECTOR,		asm_sysvec_kvm_posted_intr_ipi),
>  	INTG(POSTED_INTR_WAKEUP_VECTOR,		asm_sysvec_kvm_posted_intr_wakeup_ipi),
>  	INTG(POSTED_INTR_NESTED_VECTOR,		asm_sysvec_kvm_posted_intr_nested_ipi),

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 09/38] perf: Add switch_guest_ctx() interface
  2025-03-24 17:30 ` [PATCH v4 09/38] perf: Add switch_guest_ctx() interface Mingwei Zhang
  2025-04-25 11:12   ` Peter Zijlstra
@ 2025-05-14 23:30   ` Sean Christopherson
  2025-05-15  1:45     ` Mi, Dapeng
  2025-05-21 20:01   ` Namhyung Kim
  2 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-05-14 23:30 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania

On Mon, Mar 24, 2025, Mingwei Zhang wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> When entering/exiting a guest, some guest contexts have to be
> switched. For example, there is a dedicated interrupt vector for
> guests on Intel platforms.
> 
> When the PMI switches to the guest vector, the guest_lvtpc value needs
> to be reflected onto HW; e.g., if the guest clears the PMI mask bit, the
> HW PMI mask bit should be cleared too so that PMIs keep being generated
> for the guest. So a guest_lvtpc parameter is added to perf_guest_enter()
> and switch_guest_ctx().
> 
> Add a dedicated list to track all the pmus with the PASSTHROUGH cap, which
> may require switching the guest context. This avoids walking the full
> pmus list.
> 
> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  include/linux/perf_event.h | 17 +++++++++++--
>  kernel/events/core.c       | 51 +++++++++++++++++++++++++++++++++++++-
>  2 files changed, 65 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 37187ee8e226..58c1cf6939bf 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -584,6 +584,11 @@ struct pmu {
>  	 * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
>  	 */
>  	int (*check_period)		(struct perf_event *event, u64 value); /* optional */
> +
> +	/*
> +	 * Switch guest context when a guest enter/exit, e.g., interrupt vectors.
> +	 */
> +	void (*switch_guest_ctx)	(bool enter, void *data); /* optional */

IMO, putting this in "struct pmu" is unnecessarily convoluted and complex, and a
poor fit for what needs to be done.  The only usage of the hook is for the CPU to
swap the LVTPC, and the @data payload communicates exactly that.  I.e. this has
one user, and can't reasonably be extended to other users without some ugliness.

And if by some miracle there's no CPU pmu in perf, KVM's mediated PMU still needs
to swap to its PMI IRQ.  So rather than a per-PMU hook along with a list and a
spinlock, just make this an arch hook.  And if all of the mediated PMU code is
guarded by a Kconfig, then perf doesn't even need __weak stubs.
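
A minimal sketch of such an arch hook, doing the one thing the per-PMU hook exists
for (swapping the LVTPC); the function names follow the earlier
perf_load/put_guest_context() sketch and the masked-bit handling is an assumption
based on the quoted changelog:

void arch_perf_load_guest_context(unsigned long data)
{
	u32 masked = data & APIC_LVT_MASKED;

	/* Route PMIs to the dedicated guest vector while the guest owns the PMU. */
	apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_GUEST_PMI_VECTOR | masked);
}

void arch_perf_put_guest_context(void)
{
	/* Hand the PMI back to the host NMI handler. */
	apic_write(APIC_LVTPC, APIC_DM_NMI);
}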

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 11/38] perf/x86: Forbid PMI handler when guest own PMU
  2025-03-24 17:30 ` [PATCH v4 11/38] perf/x86: Forbid PMI handler when guest own PMU Mingwei Zhang
@ 2025-05-15  0:00   ` Sean Christopherson
  2025-05-15  1:52     ` Mi, Dapeng
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-05-15  0:00 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania

On Mon, Mar 24, 2025, Mingwei Zhang wrote:
> If a guest PMI is delivered after VM-exit, the KVM maskable interrupt will
> be held pending until EFLAGS.IF is set. In the meantime, if the logical
> processor receives an NMI for any reason at all, perf_event_nmi_handler()
> will be invoked. If there is any active perf event anywhere on the system,
> x86_pmu_handle_irq() will be invoked, and it will clear
> IA32_PERF_GLOBAL_STATUS. By the time KVM's PMI handler is invoked, it will
> be a mystery which counter(s) overflowed.
> 
> When the LVTPC is using the KVM PMI vector, the PMU is owned by the guest. A
> host NMI lets x86_pmu_handle_irq() run, which restores the PMU vector to NMI
> and clears IA32_PERF_GLOBAL_STATUS; this breaks the guest vPMU passthrough
> environment.
> 
> So modify perf_event_nmi_handler() to check the perf_in_guest per-CPU
> variable, and if it is set, simply return without calling x86_pmu_handle_irq().
> 
> Suggested-by: Jim Mattson <jmattson@google.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
>  arch/x86/events/core.c | 27 +++++++++++++++++++++++++--
>  1 file changed, 25 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 28161d6ff26d..96a173bbbec2 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -54,6 +54,8 @@ DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events) = {
>  	.pmu = &pmu,
>  };
>  
> +static DEFINE_PER_CPU(bool, pmi_vector_is_nmi) = true;

I strongly prefer guest_ctx_loaded.  pmi_vector_is_nmi is very inflexible and
doesn't communicate *why* perf's NMI handler needs to ignore NMIs.

>  DEFINE_STATIC_KEY_FALSE(rdpmc_never_available_key);
>  DEFINE_STATIC_KEY_FALSE(rdpmc_always_available_key);
>  DEFINE_STATIC_KEY_FALSE(perf_is_hybrid);
> @@ -1737,6 +1739,24 @@ perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
>  	u64 finish_clock;
>  	int ret;
>  
> +	/*
> +	 * When the guest PMU context is loaded, this handler must not run, for
> +	 * these reasons:
> +	 * 1. After perf_guest_enter() is called, but before the CPU enters
> +	 *    non-root mode, a host non-PMI NMI could happen; x86_pmu_handle_irq()
> +	 *    would restore the PMU to the NMI vector, destroying the KVM PMI
> +	 *    vector setting.
> +	 * 2. While the VM is running, a host non-PMI NMI causes a VM exit and KVM
> +	 *    calls the host NMI handler (vmx_vcpu_enter_exit()) before saving the
> +	 *    guest PMU context (kvm_pmu_put_guest_context()); x86_pmu_handle_irq()
> +	 *    would clear the global_status MSR, which now holds guest state.
> +	 * 3. After VM exit, but before KVM saves the guest PMU context, a host
> +	 *    non-PMI NMI could happen; again x86_pmu_handle_irq() would clear the
> +	 *    global_status MSR, which still holds guest state.
> +	 */

This *might* be useful for a changelog, but even then it's probably overkill.
NMIs can happen at any time, that's the full story.  Enumerating the exact
edge cases adds a lot of noise and not much value.
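
For reference, a sketch of what the handler could look like with the preferred
naming and without the long comment; this mirrors the existing
perf_event_nmi_handler(), with the only addition being the early return while the
guest context is loaded:

static int perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
{
	u64 start_clock, finish_clock;
	int ret;

	if (!atomic_read(&active_events))
		return NMI_DONE;

	/* The guest owns the PMU; its PMIs arrive on a separate vector, so this NMI isn't perf's. */
	if (this_cpu_read(guest_ctx_loaded))
		return NMI_DONE;

	start_clock = sched_clock();
	ret = static_call(x86_pmu_handle_irq)(regs);
	finish_clock = sched_clock();

	perf_sample_event_took(finish_clock - start_clock);

	return ret;
}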

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 14/38] KVM: x86/pmu: Introduce enable_mediated_pmu global parameter
  2025-03-24 17:30 ` [PATCH v4 14/38] KVM: x86/pmu: Introduce enable_mediated_pmu global parameter Mingwei Zhang
@ 2025-05-15  0:09   ` Sean Christopherson
  2025-05-15  2:53     ` Mi, Dapeng
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-05-15  0:09 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania

On Mon, Mar 24, 2025, Mingwei Zhang wrote:
> From: Dapeng Mi <dapeng1.mi@linux.intel.com>
> 
> Introduce the enable_mediated_pmu global parameter to control whether the
> mediated vPMU can be enabled at the KVM level. Even if enable_mediated_pmu
> is set to true in KVM, the user space hypervisor still needs to enable the
> mediated vPMU explicitly via the KVM_CAP_PMU_CAPABILITY ioctl. This gives
> the hypervisor the flexibility to enable or disable the mediated vPMU per VM.
> 
> Mediated vPMU depends on PMU features found only in newer PMU versions, like
> the PERF_GLOBAL_STATUS_SET MSR in v4+ for the Intel PMU. Thus introduce a
> pmu_ops field, MIN_MEDIATED_PMU_VERSION, to indicate the minimum host PMU
> version that the mediated vPMU needs.
> 
> enable_mediated_pmu is not yet exposed to user space as a module parameter;
> that will happen once all mediated vPMU code is in place.
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
>  arch/x86/kvm/pmu.c              |  3 ++-
>  arch/x86/kvm/pmu.h              | 11 +++++++++
>  arch/x86/kvm/svm/pmu.c          |  1 +
>  arch/x86/kvm/vmx/capabilities.h |  3 ++-
>  arch/x86/kvm/vmx/pmu_intel.c    |  5 ++++
>  arch/x86/kvm/vmx/vmx.c          |  3 ++-
>  arch/x86/kvm/x86.c              | 44 ++++++++++++++++++++++++++++++---
>  arch/x86/kvm/x86.h              |  1 +
>  8 files changed, 64 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 75e9cfc689f8..4f455afe4009 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -775,7 +775,8 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
>  	pmu->pebs_data_cfg_rsvd = ~0ull;
>  	bitmap_zero(pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX);
>  
> -	if (!vcpu->kvm->arch.enable_pmu)
> +	if (!vcpu->kvm->arch.enable_pmu ||
> +	    (!lapic_in_kernel(vcpu) && enable_mediated_pmu))

This check belongs in KVM_CAP_PMU_CAPABILITY, i.e. KVM needs to reject enabling
a mediated PMU without an in-kernel local APIC, not silently drop the PMU.

>  		return;
>  
>  	kvm_pmu_call(refresh)(vcpu);
> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index ad89d0bd6005..dd45a0c6be74 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -45,6 +45,7 @@ struct kvm_pmu_ops {
>  	const u64 EVENTSEL_EVENT;
>  	const int MAX_NR_GP_COUNTERS;
>  	const int MIN_NR_GP_COUNTERS;
> +	const int MIN_MEDIATED_PMU_VERSION;

I like the idea, but simply checking the PMU version is insufficient on Intel,
i.e. just add a callback.
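
For illustration, what a vendor callback could look like in place of the constant
(the names and the exact version checks are assumptions):

static bool intel_pmu_is_mediated_pmu_supported(void)
{
	/* Needs PERF_GLOBAL_STATUS_SET, i.e. arch PMU v4+, plus the VMCS controls. */
	return kvm_pmu_cap.version >= 4;
}

static bool amd_pmu_is_mediated_pmu_supported(void)
{
	/* PerfMonV2 provides the PERF_CNTR_GLOBAL_* MSRs. */
	return kvm_pmu_cap.version >= 2;
}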

>  };
>  
>  void kvm_pmu_ops_update(const struct kvm_pmu_ops *pmu_ops);
> @@ -63,6 +64,12 @@ static inline bool kvm_pmu_has_perf_global_ctrl(struct kvm_pmu *pmu)
>  	return pmu->version > 1;
>  }
>  
> +static inline bool kvm_mediated_pmu_enabled(struct kvm_vcpu *vcpu)

kvm_vcpu_has_mediated_pmu() to align with e.g. guest_cpu_cap_has(), and because
kvm_mediated_pmu_enabled() sounds like a VM-scoped or module-scoped helper.

> +{
> +	return vcpu->kvm->arch.enable_pmu &&

This is superfluous, pmu->version should never be non-zero without the PMU being
enabled at the VM level.

> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 77012b2eca0e..425e93d4b1c6 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -739,4 +739,9 @@ struct kvm_pmu_ops intel_pmu_ops __initdata = {
>  	.EVENTSEL_EVENT = ARCH_PERFMON_EVENTSEL_EVENT,
>  	.MAX_NR_GP_COUNTERS = KVM_MAX_NR_INTEL_GP_COUNTERS,
>  	.MIN_NR_GP_COUNTERS = 1,
> +	/*
> +	 * Intel mediated vPMU support depends on
> +	 * MSR_CORE_PERF_GLOBAL_STATUS_SET which is supported from 4+.
> +	 */
> +	.MIN_MEDIATED_PMU_VERSION = 4,
>  };
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 00ac94535c21..a4b5b6455c7b 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7916,7 +7916,8 @@ static __init u64 vmx_get_perf_capabilities(void)
>  	if (boot_cpu_has(X86_FEATURE_PDCM))
>  		rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_perf_cap);
>  
> -	if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR)) {
> +	if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR) &&
> +	    !enable_mediated_pmu) {
>  		x86_perf_get_lbr(&vmx_lbr_caps);
>  
>  		/*

There's a bit too much going on in this patch.  I think it makes sense to split
the vendor chunks out to separate patches, so that each can elaborate on the
exact requirements.

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 72995952978a..1ebe169b88b6 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -188,6 +188,14 @@ bool __read_mostly enable_pmu = true;
>  EXPORT_SYMBOL_GPL(enable_pmu);
>  module_param(enable_pmu, bool, 0444);
>  
> +/*
> + * Enable/disable mediated passthrough PMU virtualization.
> + * Don't expose it to userspace as a module paramerter until
> + * all mediated vPMU code is in place.
> + */

No need for the comment, documenting this in the changelog is sufficient.

> +bool __read_mostly enable_mediated_pmu;
> +EXPORT_SYMBOL_GPL(enable_mediated_pmu);
> +
>  bool __read_mostly eager_page_split = true;
>  module_param(eager_page_split, bool, 0644);
>  
> @@ -6643,9 +6651,28 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>  			break;
>  
>  		mutex_lock(&kvm->lock);
> -		if (!kvm->created_vcpus) {
> -			kvm->arch.enable_pmu = !(cap->args[0] & KVM_PMU_CAP_DISABLE);
> -			r = 0;
> +		/*
> +		 * To keep PMU configuration "simple", setting vPMU support is
> +		 * disallowed if vCPUs are created, or if mediated PMU support
> +		 * was already enabled for the VM.
> +		 */
> +		if (!kvm->created_vcpus &&
> +		    (!enable_mediated_pmu || !kvm->arch.enable_pmu)) {
> +			bool pmu_enable = !(cap->args[0] & KVM_PMU_CAP_DISABLE);
> +
> +			if (enable_mediated_pmu && pmu_enable) {

Local APIC check goes here.

> +				char *err_msg = "Fail to enable mediated vPMU, " \
> +					"please disable system wide perf events or nmi_watchdog " \
> +					"(echo 0 > /proc/sys/kernel/nmi_watchdog).\n";
> +
> +				r = perf_get_mediated_pmu();
> +				if (r)
> +					kvm_err("%s", err_msg);


#define MEDIATED_PMU_MSG "Fail to enable mediated vPMU, disable system wide perf events and nmi_watchdog.\n"

				r = perf_create_mediated_pmu();
				if (r)
					kvm_err(MEDIATED_PMU_MSG);

> +			} else
> +				r = 0;
> +
> +			if (!r)
> +				kvm->arch.enable_pmu = pmu_enable;
>  		}
>  		mutex_unlock(&kvm->lock);
>  		break;
> @@ -12723,7 +12750,14 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>  	kvm->arch.default_tsc_khz = max_tsc_khz ? : tsc_khz;
>  	kvm->arch.apic_bus_cycle_ns = APIC_BUS_CYCLE_NS_DEFAULT;
>  	kvm->arch.guest_can_read_msr_platform_info = true;
> -	kvm->arch.enable_pmu = enable_pmu;
> +
> +	/*
> +	 * PMU virtualization is opt-in when mediated PMU support is enabled.
> +	 * KVM_CAP_PMU_CAPABILITY ioctl must be called explicitly to enable
> +	 * mediated vPMU. For legacy perf-based vPMU, its behavior isn't changed,
> +	 * KVM_CAP_PMU_CAPABILITY ioctl is optional.
> +	 */

Again, too much extraneous info, the exception proves the rule.  I.e. by calling
out that mediated PMU is special, it's clear the rule is that PMUs are enabled by
default in the !mediated case.

	/*
	 * Userspace must explicitly opt-in to PMU virtualization when mediated
	 * PMU support is enabled (see KVM_CAP_PMU_CAPABILITY).
	 */

> +	kvm->arch.enable_pmu = enable_pmu && !enable_mediated_pmu;

So I tried to run a QEMU with this and it failed, because QEMU expected the PMU
to be enabled and tried to write to PMU MSRs.  I haven't dug through the QEMU
code, but I assume that QEMU rightly expects that passing in PMU in CPUID when
KVM_GET_SUPPORTED_CPUID says it's supported will result in the VM having a PMU.

I.e. by trying to get cute with backwards compatibility, I think we broke backwards
compatibility.  At this point, I'm leaning toward making the module param off-by-default,
but otherwise not messing with the behavior of kvm->arch.enable_pmu.  Not sure if
that has implications for KVM_PMU_CAP_DISABLE though.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 15/38] KVM: x86/pmu: Check PMU cpuid configuration from user space
  2025-03-24 17:30 ` [PATCH v4 15/38] KVM: x86/pmu: Check PMU cpuid configuration from user space Mingwei Zhang
@ 2025-05-15  0:12   ` Sean Christopherson
  2025-05-15  3:00     ` Mi, Dapeng
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-05-15  0:12 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania

On Mon, Mar 24, 2025, Mingwei Zhang wrote:
> From: Dapeng Mi <dapeng1.mi@linux.intel.com>
> 
> Check user space's PMU CPUID configuration and reject invalid
> configurations.
> 
> Both the legacy perf-based vPMU and the mediated vPMU need an in-kernel
> local APIC, otherwise there is no way to inject a PMI into the guest. If
> the kernel doesn't provide a local APIC, reject user space's attempt to
> enable the PMU via CPUID.
> 
> For the mediated vPMU, the user-space-configured PMU version must be no
> larger than the maximum PMU version KVM supports, otherwise the guest
> could manipulate unsupported or disallowed PMU MSRs, which is dangerous.
> 
> If the PMU version is larger than 1 but smaller than 5, CPUID.0AH.ECX
> must be 0 as well, as required by the SDM.
> 
> Suggested-by: Zide Chen <zide.chen@intel.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/cpuid.c | 15 +++++++++++++++
>  arch/x86/kvm/pmu.c   |  7 +++++--
>  arch/x86/kvm/pmu.h   |  1 +
>  3 files changed, 21 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 8eb3a88707f2..f849ced9deba 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -179,6 +179,21 @@ static int kvm_check_cpuid(struct kvm_vcpu *vcpu)
>  			return -EINVAL;
>  	}
>  
> +	best = kvm_find_cpuid_entry(vcpu, 0xa);
> +	if (vcpu->kvm->arch.enable_pmu && best) {
> +		union cpuid10_eax eax;
> +
> +		eax.full = best->eax;
> +		if (enable_mediated_pmu &&
> +		    eax.split.version_id > kvm_pmu_cap.version)
> +			return -EINVAL;
> +		if (eax.split.version_id > 0 && !vcpu_pmu_can_enable(vcpu))
> +			return -EINVAL;
> +		if (eax.split.version_id > 1 && eax.split.version_id < 5 &&
> +		    best->ecx != 0)
> +			return -EINVAL;

NAK, unless there is a really, *really* strong need for this.  I do not want to
get in the business of vetting the vCPU model presented to the guest.  If KVM
needs to constrain things for its own safety, then by all means, but AFAICT these
are nothing more than sanity checks on userspace.

> +	}
> +
>  	/*
>  	 * Exposing dynamic xfeatures to the guest requires additional
>  	 * enabling in the FPU, e.g. to expand the guest XSAVE state size.
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 4f455afe4009..92c742ead663 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -743,6 +743,10 @@ static void kvm_pmu_reset(struct kvm_vcpu *vcpu)
>  	kvm_pmu_call(reset)(vcpu);
>  }
>  
> +inline bool vcpu_pmu_can_enable(struct kvm_vcpu *vcpu)
> +{
> +	return vcpu->kvm->arch.enable_pmu && lapic_in_kernel(vcpu);

Again, the APIC check belongs in the VM enablement path, not here.  Hmm, that
may require more thought with respect to enabling the PMU by default.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 17/38] KVM: x86/pmu: Add perf_capabilities field in struct kvm_host_values{}
  2025-03-24 17:30 ` [PATCH v4 17/38] KVM: x86/pmu: Add perf_capabilities field in struct kvm_host_values{} Mingwei Zhang
@ 2025-05-15  0:12   ` Sean Christopherson
  2025-05-15  3:04     ` Mi, Dapeng
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-05-15  0:12 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania

On Mon, Mar 24, 2025, Mingwei Zhang wrote:
> From: Dapeng Mi <dapeng1.mi@linux.intel.com>
> 
> Add a perf_capabilities field to the kvm_host_values{} structure to record
> the host perf capabilities. KVM needs to know whether the host supports
> certain PMU capabilities to decide whether to pass through or intercept some
> PMU MSRs and instructions such as RDPMC. E.g. if the host supports
> PERF_METRICS but the guest is configured without it, the RDPMC instruction
> needs to be intercepted.

This is wrong (spoiler alert).  This patch can be dropped.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 20/38] KVM: x86/pmu: Check if mediated vPMU can intercept rdpmc
  2025-03-24 17:31 ` [PATCH v4 20/38] KVM: x86/pmu: Check if mediated vPMU can intercept rdpmc Mingwei Zhang
@ 2025-05-15  0:19   ` Sean Christopherson
  2025-05-15  3:23     ` Mi, Dapeng
  2025-05-26  6:15   ` Sandipan Das
  1 sibling, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-05-15  0:19 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania

The shortlog is wildly inaccurate.  KVM is not simply checking, KVM is actively
disabling RDPMC interception.  *That* needs to be the focus of the shortlog and
changelog.

> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 92c742ead663..6ad71752be4b 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -604,6 +604,40 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
>  	return 0;
>  }
>  
> +inline bool kvm_rdpmc_in_guest(struct kvm_vcpu *vcpu)

Strongly prefer kvm_need_rdpmc_intercept(), e.g. to follow vmx_need_pf_intercept(),
and because it makes the users more obviously correct.  The "in_guest" terminology
from kvm_{hlt,mwait,pause,cstate}_in_guest() isn't great, but at least in those
flows it's not awful because they are very direct reflections of knobs that control
interception, whereas this helper is making a variety of runtime checks.

> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +
> +	if (!kvm_mediated_pmu_enabled(vcpu))
> +		return false;
> +
> +	/*
> +	 * VMware allows access to these Pseudo-PMCs even when read via RDPMC
> +	 * in Ring3 when CR4.PCE=0.
> +	 */
> +	if (enable_vmware_backdoor)
> +		return false;
> +
> +	/*
> +	 * FIXME: In theory, perf metrics is always paired with fixed counter 3,
> +	 *	  so comparing the guest and host fixed counter numbers would be
> +	 *	  enough and perf metrics wouldn't need an explicit check.
> +	 *	  However, kvm_pmu_cap.num_counters_fixed is limited to
> +	 *	  KVM_MAX_NR_FIXED_COUNTERS (3) since fixed counter 3 isn't
> +	 *	  supported yet, so perf metrics still needs to be checked
> +	 *	  explicitly here. Once fixed counter 3 is supported, the perf
> +	 *	  metrics check can be removed.
> +	 */

And then what happens when hardware supports fixed counter #4?  KVM has the same
problem, and we can't check for features that KVM doesn't know about.

The entire problem is that this code is checking for *KVM* support, but what the
guest can see and access needs to be checked against *hardware* support.  Handling
that is simple, just take a snapshot of the host PMU capabilities before KVM
generates kvm_pmu_cap, and use the unadulterated snapshot here (and everywhere
else with similar checks).
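
A rough sketch of the snapshot idea (kvm_host_pmu is a hypothetical name); the raw
hardware capabilities are captured before KVM applies its own limits to kvm_pmu_cap:

/* Unadulterated hardware view, captured before KVM clamps kvm_pmu_cap. */
struct x86_pmu_capability __read_mostly kvm_host_pmu;

void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops)
{
	perf_get_x86_pmu_capability(&kvm_host_pmu);

	kvm_pmu_cap = kvm_host_pmu;
	kvm_pmu_cap.num_counters_gp = min(kvm_pmu_cap.num_counters_gp,
					  pmu_ops->MAX_NR_GP_COUNTERS);
	kvm_pmu_cap.num_counters_fixed = min(kvm_pmu_cap.num_counters_fixed,
					     KVM_MAX_NR_FIXED_COUNTERS);
	/* ... the existing clamping/sanity checks continue to apply ... */
}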

> +	return pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp &&
> +	       pmu->nr_arch_fixed_counters == kvm_pmu_cap.num_counters_fixed &&
> +	       vcpu_has_perf_metrics(vcpu) == kvm_host_has_perf_metrics() &&
> +	       pmu->counter_bitmask[KVM_PMC_GP] ==
> +				(BIT_ULL(kvm_pmu_cap.bit_width_gp) - 1) &&
> +	       pmu->counter_bitmask[KVM_PMC_FIXED] ==
> +				(BIT_ULL(kvm_pmu_cap.bit_width_fixed) - 1);
> +}
> @@ -212,6 +212,18 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
>  	bitmap_set(pmu->all_valid_pmc_idx, 0, pmu->nr_arch_gp_counters);
>  }
>  
> +static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_svm *svm = to_svm(vcpu);
> +
> +	__amd_pmu_refresh(vcpu);

To better communicate the roles of the two paths to refresh():

	amd_pmu_refresh_capabilities(vcpu);

	amd_pmu_refresh_controls(vcpu);

Ditto for Intel.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 21/38] KVM: x86/pmu/vmx: Save/load guest IA32_PERF_GLOBAL_CTRL with vm_exit/entry_ctrl
  2025-03-26 20:09     ` Mingwei Zhang
@ 2025-05-15  0:33       ` Sean Christopherson
  2025-05-15  3:45         ` Mi, Dapeng
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-05-15  0:33 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Zide Chen, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania

On Wed, Mar 26, 2025, Mingwei Zhang wrote:
> On Wed, Mar 26, 2025 at 9:51 AM Chen, Zide <zide.chen@intel.com> wrote:
> > > diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> > > index 6ad71752be4b..4e8cefcce7ab 100644
> > > --- a/arch/x86/kvm/pmu.c
> > > +++ b/arch/x86/kvm/pmu.c
> > > @@ -646,6 +646,30 @@ void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu)
> > >       }
> > >  }
> > >
> > > +static void kvm_pmu_sync_global_ctrl_from_vmcs(struct kvm_vcpu *vcpu)
> > > +{
> > > +     struct msr_data msr_info = { .index = MSR_CORE_PERF_GLOBAL_CTRL };
> > > +
> > > +     if (!kvm_mediated_pmu_enabled(vcpu))
> > > +             return;
> > > +
> > > +     /* Sync pmu->global_ctrl from GUEST_IA32_PERF_GLOBAL_CTRL. */
> > > +     kvm_pmu_call(get_msr)(vcpu, &msr_info);
> > > +}
> > > +
> > > +static void kvm_pmu_sync_global_ctrl_to_vmcs(struct kvm_vcpu *vcpu, u64 global_ctrl)
> > > +{
> > > +     struct msr_data msr_info = {
> > > +             .index = MSR_CORE_PERF_GLOBAL_CTRL,
> > > +             .data = global_ctrl };
> > > +
> > > +     if (!kvm_mediated_pmu_enabled(vcpu))
> > > +             return;
> > > +
> > > +     /* Sync pmu->global_ctrl to GUEST_IA32_PERF_GLOBAL_CTRL. */
> > > +     kvm_pmu_call(set_msr)(vcpu, &msr_info);

Eh, just add a dedicated kvm_pmu_ops hook.  Feeding this through set_msr() avoids
adding another hook, but makes the code hard to follow and requires the above
ugly boilerplate.

> > > +}
> > > +
> > >  bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
> > >  {
> > >       switch (msr) {
> > > @@ -680,7 +704,6 @@ int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> > >               msr_info->data = pmu->global_status;
> > >               break;
> > >       case MSR_AMD64_PERF_CNTR_GLOBAL_CTL:
> > > -     case MSR_CORE_PERF_GLOBAL_CTRL:
> > >               msr_info->data = pmu->global_ctrl;
> > >               break;
> > >       case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR:
> > > @@ -731,6 +754,9 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> >
> >
> > pmu->global_ctrl doesn't always have the up-to-date guest value, need to
> > sync from vmcs/vmbc before comparing it against 'data'.
> >
> > +               kvm_pmu_sync_global_ctrl_from_vmcs(vcpu);
> >                 if (pmu->global_ctrl != data) {
> 
> Good catch. Thanks!
> 
> This is why I really prefer just unconditionally syncing the global
> ctrl from VMCS to pmu->global_ctrl and vice versa.
> 
> We might get into similar problems as well in the future.

The problem isn't conditional synchronization, it's that y'all reinvented the
wheel, poorly.  This is a solved problem via EXREG and wrappers.

That said, I went through the exercise of adding a PERF_GLOBAL_CTRL EXREG and
associated wrappers, and didn't love the result.  Host writes should be rare, so
the dirty tracking is overkill.  For reads, the cost of VMREAD is lower than
VMWRITE (doesn't trigger consistency check re-evaluation on VM-Enter), and is
dwarfed by the cost of switching all other PMU state.

So I think for the initial implementation, it makes sense to propagate writes
to the VMCS on demand, but do VMREAD after VM-Exit (if VM-Enter was successful).
We can always revisit the optimization if/when we optimize the PMU world switches,
e.g. to defer them if there are no active host events.
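
For illustration, a sketch of that approach with a dedicated hook instead of routing
through set_msr(); the function names are illustrative:

/* Propagate host-initiated writes to the VMCS on demand. */
static void intel_pmu_write_global_ctrl(struct kvm_vcpu *vcpu, u64 data)
{
	vcpu_to_pmu(vcpu)->global_ctrl = data;
	vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, data);
}

/* Refresh the shadow value after a successful VM-Enter/VM-Exit round trip. */
static void intel_pmu_sync_global_ctrl_on_exit(struct kvm_vcpu *vcpu)
{
	vcpu_to_pmu(vcpu)->global_ctrl = vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL);
}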

> > > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > > index 8a7af02d466e..ecf72394684d 100644
> > > --- a/arch/x86/kvm/vmx/nested.c
> > > +++ b/arch/x86/kvm/vmx/nested.c
> > > @@ -7004,7 +7004,8 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
> > >               VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
> > >               VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
> > >               VM_EXIT_SAVE_VMX_PREEMPTION_TIMER | VM_EXIT_ACK_INTR_ON_EXIT |
> > > -             VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
> > > +             VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL |
> > > +             VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL;

This is completely wrong.  Stuffing VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL here
advertises support for KVM emulation of the control, and that support is non-existent
in this patch (and series).

Just drop this, emulation of VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL can be done
separately.

> > > +     mediated = kvm_mediated_pmu_enabled(vcpu);
> > > +     if (cpu_has_load_perf_global_ctrl()) {
> > > +             vm_entry_controls_changebit(vmx,
> > > +                     VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL, mediated);
> > > +             /*
> > > +              * Initialize guest PERF_GLOBAL_CTRL to the reset value defined by the SDM.
> > > +              *
> > > +              * Note: GUEST_IA32_PERF_GLOBAL_CTRL must be initialized to
> > > +              * "BIT_ULL(pmu->nr_arch_gp_counters) - 1" instead of pmu->global_ctrl
> > > +              * since pmu->global_ctrl is only initialized when the guest
> > > +              * pmu->version > 1. If pmu->version is 1, pmu->global_ctrl
> > > +              * is 0 and the guest counters would never really be enabled.
> > > +              */
> > > +             if (mediated)
> > > +                     vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL,
> > > +                                  BIT_ULL(pmu->nr_arch_gp_counters) - 1);

This belongs in common code, as a call to the aforementioned hook to propagate
PERF_GLOBAL_CTRL to hardware.

> > > +     }
> > > +
> > > +     if (cpu_has_save_perf_global_ctrl())
> > > +             vm_exit_controls_changebit(vmx,
> > > +                     VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL |
> > > +                     VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL, mediated);
> > >  }
> > >
> > >  static void intel_pmu_init(struct kvm_vcpu *vcpu)
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index ff66f17d6358..38ecf3c116bd 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -4390,6 +4390,13 @@ void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
> > >
> > >       if (cpu_has_load_ia32_efer())
> > >               vmcs_write64(HOST_IA32_EFER, kvm_host.efer);
> > > +
> > > +     /*
> > > +      * Initialize host PERF_GLOBAL_CTRL to 0 to disable all counters
> > > +      * immediately once the VM exits. Mediated vPMU then calls perf_guest_exit()
> > > +      * to re-enable host perf events.
> > > +      */
> > > +     vmcs_write64(HOST_IA32_PERF_GLOBAL_CTRL, 0);

This needs to be conditioned on the mediated PMU being enabled, because this field
is not constant when using the emulated PMU (or no vPMU).

> > > @@ -8451,6 +8462,15 @@ __init int vmx_hardware_setup(void)
> > >               enable_sgx = false;
> > >  #endif
> > >
> > > +     /*
> > > +      * All CPUs that support a mediated PMU are expected to support loading
> > > +      * and saving PERF_GLOBAL_CTRL via dedicated VMCS fields.
> > > +      */
> > > +     if (enable_mediated_pmu &&
> > > +         (WARN_ON_ONCE(!cpu_has_load_perf_global_ctrl() ||
> > > +                       !cpu_has_save_perf_global_ctrl())))

This needs to be conditioned on !HYPERVISOR, or it *will* fire.

And placing this check here, without *any* mention of *why* you did so, is evil
and made me very grumpy.  I had to discover the hard way that you checked the
VMCS fields here, instead of in kvm_init_pmu_capability() where it logically
belongs, because the VMCS configuration isn't yet initialized.

Grumpiness aside, I don't like this late clear of enable_mediated_pmu, as it risks
a variation of the problem you're trying to avoid, i.e. risks consuming the variable
between kvm_init_pmu_capability() and here.

I don't see any reason why setup_vmcs_config() can't be called before
kvm_x86_vendor_init(), so unless I'm missing/forgetting something, let's just do
that, and move these checks where they belong.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 22/38] KVM: x86/pmu: Optimize intel/amd_pmu_refresh() helpers
  2025-03-24 17:31 ` [PATCH v4 22/38] KVM: x86/pmu: Optimize intel/amd_pmu_refresh() helpers Mingwei Zhang
@ 2025-05-15  0:37   ` Sean Christopherson
  2025-05-15  5:09     ` Mi, Dapeng
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-05-15  0:37 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania

This is not an optimization in any sane interpretation of that word.

On Mon, Mar 24, 2025, Mingwei Zhang wrote:
> From: Dapeng Mi <dapeng1.mi@linux.intel.com>
> 
> Currently pmu->global_ctrl is initialized in the common kvm_pmu_refresh()
> helper since both Intel and AMD CPUs set enable bits for all GP counters
> for the PERF_GLOBAL_CTRL MSR. But it may not be the best place to initialize
> pmu->global_ctrl. Strictly speaking, pmu->global_ctrl is vendor specific

And?  There's mounds of KVM code that show it's very, very easy to manage
global_ctrl in common code.

> and there is a lot of global_ctrl related processing in the
> intel/amd_pmu_refresh() helpers, so it is better to handle it in the same
> place. Thus move the pmu->global_ctrl initialization into the
> intel/amd_pmu_refresh() helpers.
> 
> Besides, intel_pmu_refresh() doesn't handle global_ctrl_rsvd and
> global_status_rsvd properly; fix that.

Really?  You mention a bug fix in passing, and squash it into an opinionated
refactoring that is advertised as "optimizations" without even stating what the
bug is?  C'mon.

> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/pmu.c           | 10 -------
>  arch/x86/kvm/svm/pmu.c       | 14 +++++++--
>  arch/x86/kvm/vmx/pmu_intel.c | 55 ++++++++++++++++++------------------
>  3 files changed, 39 insertions(+), 40 deletions(-)
> 
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 4e8cefcce7ab..2ac4c039de8b 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -843,16 +843,6 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
>  		return;
>  
>  	kvm_pmu_call(refresh)(vcpu);
> -
> -	/*
> -	 * At RESET, both Intel and AMD CPUs set all enable bits for general
> -	 * purpose counters in IA32_PERF_GLOBAL_CTRL (so that software that
> -	 * was written for v1 PMUs don't unknowingly leave GP counters disabled
> -	 * in the global controls).  Emulate that behavior when refreshing the
> -	 * PMU so that userspace doesn't need to manually set PERF_GLOBAL_CTRL.
> -	 */
> -	if (kvm_pmu_has_perf_global_ctrl(pmu) && pmu->nr_arch_gp_counters)
> -		pmu->global_ctrl = GENMASK_ULL(pmu->nr_arch_gp_counters - 1, 0);

Absolutely not, this code stays where it is.

>  }
>  
>  void kvm_pmu_init(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> index 153972e944eb..eba086ef5eca 100644
> --- a/arch/x86/kvm/svm/pmu.c
> +++ b/arch/x86/kvm/svm/pmu.c
> @@ -198,12 +198,20 @@ static void __amd_pmu_refresh(struct kvm_vcpu *vcpu)
>  	pmu->nr_arch_gp_counters = min_t(unsigned int, pmu->nr_arch_gp_counters,
>  					 kvm_pmu_cap.num_counters_gp);
>  
> -	if (pmu->version > 1) {
> -		pmu->global_ctrl_rsvd = ~((1ull << pmu->nr_arch_gp_counters) - 1);
> +	if (kvm_pmu_cap.version > 1) {

It's not just global_ctrl.  PEBS and the fixed counters also depend on v2+ (the
SDM contradicts itself; KVM's ABI is that they're v2+).

> +		/*
> +		 * At RESET, AMD CPUs set all enable bits for general purpose counters in
> +		 * IA32_PERF_GLOBAL_CTRL (so that software that was written for v1 PMUs
> +		 * don't unknowingly leave GP counters disabled in the global controls).
> +		 * Emulate that behavior when refreshing the PMU so that userspace doesn't
> +		 * need to manually set PERF_GLOBAL_CTRL.
> +		 */
> +		pmu->global_ctrl = BIT_ULL(pmu->nr_arch_gp_counters) - 1;
> +		pmu->global_ctrl_rsvd = ~pmu->global_ctrl;
>  		pmu->global_status_rsvd = pmu->global_ctrl_rsvd;
>  	}
>  
> -	pmu->counter_bitmask[KVM_PMC_GP] = ((u64)1 << 48) - 1;
> +	pmu->counter_bitmask[KVM_PMC_GP] = BIT_ULL(48) - 1;

I like these cleanups, but they too belong in a separate patch.

>  	pmu->reserved_bits = 0xfffffff000280000ull;
>  	pmu->raw_event_mask = AMD64_RAW_EVENT_MASK;
>  	/* not applicable to AMD; but clean them to prevent any fall out */

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 23/38] KVM: x86/pmu: Configure the interception of PMU MSRs
  2025-03-24 17:31 ` [PATCH v4 23/38] KVM: x86/pmu: Configure the interception of PMU MSRs Mingwei Zhang
@ 2025-05-15  0:41   ` Sean Christopherson
  2025-05-15  5:37     ` Mi, Dapeng
  2025-05-16 13:34   ` Sean Christopherson
  1 sibling, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-05-15  0:41 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania

Again, use more precise language.  "Configure interceptions" is akin to "do work".
It gives readers a vague idea of what's going on, but this

  KVM: x86/pmu: Disable interception of select PMU MSRs for mediated vPMUs

is just as concise, and more descriptive.

> +	/*
> +	 * With the mediated vPMU, intercept the global PMU MSRs when the guest PMU
> +	 * only owns a subset of the HW counters or its version is less than 2.
> +	 */
> +	if (kvm_mediated_pmu_enabled(vcpu) && kvm_pmu_has_perf_global_ctrl(pmu) &&
> +	    pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp)

This logic belongs in common code.  Just because AMD doesn't have fixed counters
doesn't mean KVM can't have a superfluous "0 == 0" check.

> +	if (kvm_mediated_pmu_enabled(vcpu) && kvm_pmu_has_perf_global_ctrl(pmu) &&

Just require the guest to have PERF_GLOBAL_CTRL, I don't see any reason to support
v1 PMUs.  It adds complexity and weirdness, and I can't imagine there's a use case.
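
For illustration, the common-code check could end up looking something like this
(the helper name is illustrative, using the kvm_vcpu_has_mediated_pmu() naming
suggested earlier; per the earlier feedback, the counter comparisons may want to
target a raw host-capability snapshot rather than kvm_pmu_cap):

static bool kvm_pmu_can_disable_msr_interception(struct kvm_pmu *pmu)
{
	struct kvm_vcpu *vcpu = pmu_to_vcpu(pmu);

	return kvm_vcpu_has_mediated_pmu(vcpu) &&
	       kvm_pmu_has_perf_global_ctrl(pmu) &&
	       pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp &&
	       pmu->nr_arch_fixed_counters == kvm_pmu_cap.num_counters_fixed;
}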

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 26/38] KVM: x86/pmu: Introduce eventsel_hw to prepare for pmu event filtering
  2025-03-24 17:31 ` [PATCH v4 26/38] KVM: x86/pmu: Introduce eventsel_hw to prepare for pmu event filtering Mingwei Zhang
@ 2025-05-15  0:42   ` Sean Christopherson
  2025-05-15  5:34     ` Mi, Dapeng
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-05-15  0:42 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania

On Mon, Mar 24, 2025, Mingwei Zhang wrote:
> -	pmu->fixed_ctr_ctrl = pmu->global_ctrl = pmu->global_status = 0;
> +	pmu->fixed_ctr_ctrl = pmu->fixed_ctr_ctrl_hw = 0;
> +	pmu->global_ctrl = pmu->global_status = 0;

VMCS needs to be updated.
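
One possible shape for that, as a rough sketch (the write_global_ctrl hook is
hypothetical; any mechanism that writes GUEST_IA32_PERF_GLOBAL_CTRL would do):

	pmu->fixed_ctr_ctrl = pmu->fixed_ctr_ctrl_hw = 0;
	pmu->global_ctrl = pmu->global_status = 0;

	/*
	 * Sketch: with the mediated PMU the live guest value sits in the
	 * VMCS, so the zeroed global_ctrl must also be propagated to
	 * GUEST_IA32_PERF_GLOBAL_CTRL, e.g. via a vendor hook.
	 */
	if (kvm_mediated_pmu_enabled(vcpu))
		kvm_pmu_call(write_global_ctrl)(vcpu, pmu->global_ctrl);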

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 27/38] KVM: x86/pmu: Handle PMU MSRs interception and event filtering
  2025-03-24 17:31 ` [PATCH v4 27/38] KVM: x86/pmu: Handle PMU MSRs interception and " Mingwei Zhang
@ 2025-05-15  0:43   ` Sean Christopherson
  2025-05-15  5:38     ` Mi, Dapeng
  2025-05-16  1:26   ` Mi, Dapeng
  1 sibling, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-05-15  0:43 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania

Again, be more precise.

On Mon, Mar 24, 2025, Mingwei Zhang wrote:
> From: Dapeng Mi <dapeng1.mi@linux.intel.com>
> 
> Mediated vPMU needs to intercept EVENTSELx and FIXED_CNTR_CTRL MSRs to
> filter out guest malicious perf events. Either writing these MSRs or
> updating event filters would call reprogram_counter() eventually. Thus
> check if the guest event should be filtered out in reprogram_counter().
> If so, clear corresponding EVENTSELx MSR or FIXED_CNTR_CTRL field to
> ensure the guest event won't be really enabled at vm-entry.
> 
> Besides, the mediated vPMU intercepts the MSRs of the counters not owned by
> the guest, and for those it simply needs to read/write pmc->counter.
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Co-developed-by: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/pmu.c | 27 +++++++++++++++++++++++++++
>  arch/x86/kvm/pmu.h |  3 +++
>  2 files changed, 30 insertions(+)
> 
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 63143eeb5c44..e9100dc49fdc 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -305,6 +305,11 @@ static void pmc_update_sample_period(struct kvm_pmc *pmc)
>  
>  void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
>  {
> +	if (kvm_mediated_pmu_enabled(pmc->vcpu)) {
> +		pmc->counter = val & pmc_bitmask(pmc);
> +		return;
> +	}
> +
>  	/*
>  	 * Drop any unconsumed accumulated counts, the WRMSR is a write, not a
>  	 * read-modify-write.  Adjust the counter value so that its value is
> @@ -455,6 +460,28 @@ static int reprogram_counter(struct kvm_pmc *pmc)
>  	bool emulate_overflow;
>  	u8 fixed_ctr_ctrl;
>  
> +	if (kvm_mediated_pmu_enabled(pmu_to_vcpu(pmu))) {
> +		bool allowed = check_pmu_event_filter(pmc);
> +
> +		if (pmc_is_gp(pmc)) {
> +			if (allowed)
> +				pmc->eventsel_hw |= pmc->eventsel &
> +						    ARCH_PERFMON_EVENTSEL_ENABLE;
> +			else
> +				pmc->eventsel_hw &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
> +		} else {
> +			int idx = pmc->idx - KVM_FIXED_PMC_BASE_IDX;
> +
> +			if (allowed)
> +				pmu->fixed_ctr_ctrl_hw = pmu->fixed_ctr_ctrl;
> +			else
> +				pmu->fixed_ctr_ctrl_hw &=
> +					~intel_fixed_bits_by_idx(idx, 0xf);
> +		}
> +
> +		return 0;

I think it's worth adding a helper for this, as it makes things a bit more
self-documenting in terms of when KVM needs to "reprogram" mediated PMU PMCs.
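
A sketch of such a helper, with the logic lifted straight from the hunk above
(the function name is illustrative):

static void kvm_pmu_reprogram_mediated_pmc(struct kvm_pmc *pmc)
{
	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
	bool allowed = check_pmu_event_filter(pmc);

	if (pmc_is_gp(pmc)) {
		if (allowed)
			pmc->eventsel_hw |= pmc->eventsel &
					    ARCH_PERFMON_EVENTSEL_ENABLE;
		else
			pmc->eventsel_hw &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
	} else {
		int idx = pmc->idx - KVM_FIXED_PMC_BASE_IDX;

		if (allowed)
			pmu->fixed_ctr_ctrl_hw = pmu->fixed_ctr_ctrl;
		else
			pmu->fixed_ctr_ctrl_hw &=
				~intel_fixed_bits_by_idx(idx, 0xf);
	}
}

reprogram_counter() would then reduce to calling the helper and returning 0
when the mediated PMU is enabled.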

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 00/38] Mediated vPMU 4.0 for x86
  2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
                   ` (39 preceding siblings ...)
  2025-05-06  9:57 ` Mi, Dapeng
@ 2025-05-15  0:49 ` Sean Christopherson
  2025-05-15  5:45   ` Mi, Dapeng
  40 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-05-15  0:49 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania

On Mon, Mar 24, 2025, Mingwei Zhang wrote:
> Dapeng Mi (18):
>   KVM: x86/pmu: Introduce enable_mediated_pmu global parameter
>   KVM: x86/pmu: Check PMU cpuid configuration from user space
>   KVM: x86: Rename vmx_vmentry/vmexit_ctrl() helpers
>   KVM: x86/pmu: Add perf_capabilities field in struct kvm_host_values{}
>   KVM: x86/pmu: Move PMU_CAP_{FW_WRITES,LBR_FMT} into msr-index.h header
>   KVM: VMX: Add macros to wrap around
>     {secondary,tertiary}_exec_controls_changebit()
>   KVM: x86/pmu: Check if mediated vPMU can intercept rdpmc
>   KVM: x86/pmu/vmx: Save/load guest IA32_PERF_GLOBAL_CTRL with
>     vm_exit/entry_ctrl
>   KVM: x86/pmu: Optimize intel/amd_pmu_refresh() helpers
>   KVM: x86/pmu: Setup PMU MSRs' interception mode
>   KVM: x86/pmu: Handle PMU MSRs interception and event filtering
>   KVM: x86/pmu: Switch host/guest PMU context at vm-exit/vm-entry
>   KVM: x86/pmu: Handle emulated instruction for mediated vPMU
>   KVM: nVMX: Add macros to simplify nested MSR interception setting
>   KVM: selftests: Add mediated vPMU supported for pmu tests
>   KVM: Selftests: Support mediated vPMU for vmx_pmu_caps_test
>   KVM: Selftests: Fix pmu_counters_test error for mediated vPMU
>   KVM: x86/pmu: Expose enable_mediated_pmu parameter to user space
> 
> Kan Liang (8):
>   perf: Support get/put mediated PMU interfaces
>   perf: Skip pmu_ctx based on event_type
>   perf: Clean up perf ctx time
>   perf: Add a EVENT_GUEST flag
>   perf: Add generic exclude_guest support
>   perf: Add switch_guest_ctx() interface
>   perf/x86: Support switch_guest_ctx interface
>   perf/x86/intel: Support PERF_PMU_CAP_MEDIATED_VPMU
> 
> Mingwei Zhang (5):
>   perf/x86: Forbid PMI handler when guest own PMU
>   perf/x86/core: Plumb mediated PMU capability from x86_pmu to
>     x86_pmu_cap
>   KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot()
>   KVM: x86/pmu: introduce eventsel_hw to prepare for pmu event filtering
>   KVM: nVMX: Add nested virtualization support for mediated PMU
> 
> Sandipan Das (4):
>   perf/x86/core: Do not set bit width for unavailable counters
>   KVM: x86/pmu: Add AMD PMU registers to direct access list
>   KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest
>     write to event selectors
>   perf/x86/amd: Support PERF_PMU_CAP_MEDIATED_VPMU for AMD host
> 
> Xiong Zhang (3):
>   x86/irq: Factor out common code for installing kvm irq handler
>   perf: core/x86: Register a new vector for KVM GUEST PMI
>   KVM: x86/pmu: Register KVM_GUEST_PMI_VECTOR handler

I ran out of time today and didn't get emails sent for all patches.  I'm planning
on getting that done tomorrow.

I already have most of the proposed changes implemented:

  https://github.com/sean-jc/linux.git x86/mediated_pmu

It compiles and doesn't explode, but it's not fully functional (PMU tests fail).
I'll poke at it over the next few days, but if someone is itching to figure out
what I broke, then by all means.

Given that I've already made many modifications (I have a hard time reviewing a
series this big without editing as I go), unless someone objects, I'll post v5
(and v6+ as needed), though that'll likely be days/weeks as I need to get it working,
and want to do more passes over the code, shortlogs, and changelogs. 

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 01/38] perf: Support get/put mediated PMU interfaces
  2025-05-14 22:48   ` Sean Christopherson
@ 2025-05-15  1:31     ` Mi, Dapeng
  0 siblings, 0 replies; 127+ messages in thread
From: Mi, Dapeng @ 2025-05-15  1:31 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania


On 5/15/2025 6:48 AM, Sean Christopherson wrote:
> On Mon, Mar 24, 2025, Mingwei Zhang wrote:
>> +/*
>> + * Currently invoked at VM creation to
>> + * - Check whether there are existing !exclude_guest events of PMU with
>> + *   PERF_PMU_CAP_MEDIATED_VPMU
>> + * - Set nr_mediated_pmu_vms to prevent !exclude_guest event creation on
>> + *   PMUs with PERF_PMU_CAP_MEDIATED_VPMU
>> + *
>> + * No impact for the PMU without PERF_PMU_CAP_MEDIATED_VPMU. The perf
>> + * still owns all the PMU resources.
>> + */
>> +int perf_get_mediated_pmu(void)
>> +{
>> +	guard(mutex)(&perf_mediated_pmu_mutex);
>> +	if (atomic_inc_not_zero(&nr_mediated_pmu_vms))
>> +		return 0;
>> +
>> +	if (atomic_read(&nr_include_guest_events))
>> +		return -EBUSY;
>> +
>> +	atomic_inc(&nr_mediated_pmu_vms);
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(perf_get_mediated_pmu);
> IMO, all of the mediated PMU logic should be guarded with a Kconfig.  I strongly
> suspect KVM x86 will be the only user for the foreseeable, e.g. arm64 is trending
> toward a partitioned PMU approach, and subjecting other architectures to the (minor)
> overhead associated with e.g. nr_mediated_pmu_vms seems pointless.  The other
> nicety is that it helps encapsulate the mediated PMU code, which for those of us
> that haven't been living and breathing this for the last few months, is immensely
> helpful.
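
A rough sketch of what such a guard could look like (the Kconfig symbol name
is made up here; the function body is unchanged from the quoted patch):

config PERF_GUEST_MEDIATED_PMU
	bool
	depends on PERF_EVENTS
	# selected by KVM x86 (or similar) when mediated vPMU support is built

and in kernel/events/core.c:

#ifdef CONFIG_PERF_GUEST_MEDIATED_PMU
int perf_get_mediated_pmu(void)
{
	guard(mutex)(&perf_mediated_pmu_mutex);
	if (atomic_inc_not_zero(&nr_mediated_pmu_vms))
		return 0;

	if (atomic_read(&nr_include_guest_events))
		return -EBUSY;

	atomic_inc(&nr_mediated_pmu_vms);
	return 0;
}
EXPORT_SYMBOL_GPL(perf_get_mediated_pmu);
#endif

Callers outside the #ifdef would see static inline stubs, so e.g.
nr_mediated_pmu_vms wouldn't exist at all on other architectures.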

I'm fine with this.


>
>> +void perf_put_mediated_pmu(void)
> To avoid confusion with perf_put_guest_context() in future patches, I think it
> makes sense to go with something like perf_{create,release}_mediated_pmu().  I
> actually like the get/put terminology in isolation, but they look weird side-by-side.

Agree.



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 04/38] perf: Add a EVENT_GUEST flag
  2025-05-14 22:51   ` Sean Christopherson
@ 2025-05-15  1:35     ` Mi, Dapeng
  0 siblings, 0 replies; 127+ messages in thread
From: Mi, Dapeng @ 2025-05-15  1:35 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania


On 5/15/2025 6:51 AM, Sean Christopherson wrote:
> On Mon, Mar 24, 2025, Mingwei Zhang wrote:
>> +{
>> +	if (event->attr.exclude_guest && __this_cpu_read(perf_in_guest)) {
> My vote is for s/perf_in_guest/guest_ctx_loaded, because "perf in guest" doesn't
> accurately describe just the mediated PMU case.  E.g. perf itself is running in
> KVM guests when using an emulated vPMU, or no vPMU at all.

Agree.


>
> And with a Kconfig to guard the mediated PMU, this check (and others) can be
> elided at compile time for architectures that don't support a mediated PMU (or
> if KVM is disabled).
>
>> +		/*
>> +		 * (now + times[total].offset) - (now + times[guest].offset) :=
>> +		 * times[total].offset - times[guest].offset
>> +		 */
>> +		return READ_ONCE(times[T_TOTAL].offset) - READ_ONCE(times[T_GUEST].offset);
>> +	}
>> +
>> +	return now + READ_ONCE(times[T_TOTAL].offset);
>> +}
>> +

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 05/38] perf: Add generic exclude_guest support
  2025-05-14 23:19     ` Sean Christopherson
@ 2025-05-15  1:37       ` Mi, Dapeng
  2025-05-15 18:39       ` Liang, Kan
  1 sibling, 0 replies; 127+ messages in thread
From: Mi, Dapeng @ 2025-05-15  1:37 UTC (permalink / raw)
  To: Sean Christopherson, Peter Zijlstra
  Cc: Mingwei Zhang, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania


On 5/15/2025 7:19 AM, Sean Christopherson wrote:
> On Fri, Apr 25, 2025, Peter Zijlstra wrote:
>> On Mon, Mar 24, 2025 at 05:30:45PM +0000, Mingwei Zhang wrote:
>>
>>> @@ -6040,6 +6041,71 @@ void perf_put_mediated_pmu(void)
>>>  }
>>>  EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
>>>  
>>> +static inline void perf_host_exit(struct perf_cpu_context *cpuctx)
>>> +{
>>> +	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>>> +	ctx_sched_out(&cpuctx->ctx, NULL, EVENT_GUEST);
>>> +	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
>>> +	if (cpuctx->task_ctx) {
>>> +		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
>>> +		task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);
>>> +		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>>> +	}
>>> +}
>>> +
>>> +/* When entering a guest, schedule out all exclude_guest events. */
>>> +void perf_guest_enter(void)
>>> +{
>>> +	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>>> +
>>> +	lockdep_assert_irqs_disabled();
>>> +
>>> +	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>>> +
>>> +	if (WARN_ON_ONCE(__this_cpu_read(perf_in_guest)))
>>> +		goto unlock;
>>> +
>>> +	perf_host_exit(cpuctx);
>>> +
>>> +	__this_cpu_write(perf_in_guest, true);
>>> +
>>> +unlock:
>>> +	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>>> +}
>>> +EXPORT_SYMBOL_GPL(perf_guest_enter);
>>> +
>>> +static inline void perf_host_enter(struct perf_cpu_context *cpuctx)
>>> +{
>>> +	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>>> +	if (cpuctx->task_ctx)
>>> +		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
>>> +
>>> +	perf_event_sched_in(cpuctx, cpuctx->task_ctx, NULL, EVENT_GUEST);
>>> +
>>> +	if (cpuctx->task_ctx)
>>> +		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>>> +	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
>>> +}
>>> +
>>> +void perf_guest_exit(void)
>>> +{
>>> +	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>>> +
>>> +	lockdep_assert_irqs_disabled();
>>> +
>>> +	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>>> +
>>> +	if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
>>> +		goto unlock;
>>> +
>>> +	perf_host_enter(cpuctx);
>>> +
>>> +	__this_cpu_write(perf_in_guest, false);
>>> +unlock:
>>> +	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>>> +}
>>> +EXPORT_SYMBOL_GPL(perf_guest_exit);
>> This naming is confusing on purpose? Pick either guest/host and stick
>> with it.
> +1.  I also think the inner perf_host_{enter,exit}() helpers are superfluous.
>
> After a bit of hacking, and with a few spoilers, this is what I ended up with
> (not anywhere near fully tested).  I like following KVM's kvm_xxx_{load,put}()
> nomenclature to tie everything together, so I went with "guest" instead of "host"
> even though the majority of the work being done is to schedule out/in host context.
>
> /* When loading a guest's mediated PMU, schedule out all exclude_guest events. */
> void perf_load_guest_context(unsigned long data)
> {
> 	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>
> 	lockdep_assert_irqs_disabled();
>
> 	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>
> 	if (WARN_ON_ONCE(__this_cpu_read(guest_ctx_loaded)))
> 		goto unlock;
>
> 	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
> 	ctx_sched_out(&cpuctx->ctx, NULL, EVENT_GUEST);
> 	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
> 	if (cpuctx->task_ctx) {
> 		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
> 		task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);
> 		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
> 	}
>
> 	arch_perf_load_guest_context(data);
>
> 	__this_cpu_write(guest_ctx_loaded, true);
>
> unlock:
> 	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> }
> EXPORT_SYMBOL_GPL(perf_load_guest_context);
>
> void perf_put_guest_context(void)
> {
> 	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>
> 	lockdep_assert_irqs_disabled();
>
> 	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>
> 	if (WARN_ON_ONCE(!__this_cpu_read(guest_ctx_loaded)))
> 		goto unlock;
>
> 	arch_perf_put_guest_context();
>
> 	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
> 	if (cpuctx->task_ctx)
> 		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
>
> 	perf_event_sched_in(cpuctx, cpuctx->task_ctx, NULL, EVENT_GUEST);
>
> 	if (cpuctx->task_ctx)
> 		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
> 	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
>
> 	__this_cpu_write(guest_ctx_loaded, false);
> unlock:
> 	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> }
> EXPORT_SYMBOL_GPL(perf_put_guest_context);

I'm fine with the name. Thanks.


>

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 07/38] perf: core/x86: Register a new vector for KVM GUEST PMI
  2025-05-14 23:24   ` Sean Christopherson
@ 2025-05-15  1:40     ` Mi, Dapeng
  0 siblings, 0 replies; 127+ messages in thread
From: Mi, Dapeng @ 2025-05-15  1:40 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania


On 5/15/2025 7:24 AM, Sean Christopherson wrote:
> On Mon, Mar 24, 2025, Mingwei Zhang wrote:
>> diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
>> index ad5c68f0509d..b0cb3220e1bb 100644
>> --- a/arch/x86/include/asm/idtentry.h
>> +++ b/arch/x86/include/asm/idtentry.h
>> @@ -745,6 +745,7 @@ DECLARE_IDTENTRY_SYSVEC(IRQ_WORK_VECTOR,		sysvec_irq_work);
>>  DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_VECTOR,		sysvec_kvm_posted_intr_ipi);
>>  DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_WAKEUP_VECTOR,	sysvec_kvm_posted_intr_wakeup_ipi);
>>  DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_NESTED_VECTOR,	sysvec_kvm_posted_intr_nested_ipi);
>> +DECLARE_IDTENTRY_SYSVEC(KVM_GUEST_PMI_VECTOR,	        sysvec_kvm_guest_pmi_handler);
> I would prefer to keep KVM out of the name, and as mentioned in the previous patch,
> route this through perf.

Sure.


>
>>  #else
>>  # define fred_sysvec_kvm_posted_intr_ipi		NULL
>>  # define fred_sysvec_kvm_posted_intr_wakeup_ipi		NULL
> Y'all forgot to wire up the FRED handling.  I.e. the mediated PMI IRQs would get
> treated as spurious when running with FRED.

Oh, yes, we missed that. We will look into it. Thanks for the reminder.


>
>> diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
>> index 47051871b436..250cdab11306 100644
>> --- a/arch/x86/include/asm/irq_vectors.h
>> +++ b/arch/x86/include/asm/irq_vectors.h
>> @@ -77,7 +77,10 @@
>>   */
>>  #define IRQ_WORK_VECTOR			0xf6
>>  
>> -/* 0xf5 - unused, was UV_BAU_MESSAGE */
>> +#if IS_ENABLED(CONFIG_KVM)
>> +#define KVM_GUEST_PMI_VECTOR		0xf5
>> +#endif
> Conditionally defining the vector sounds good on paper, but it's problematic, e.g.
> for connecting the handler to FRED's array, and doesn't really add much value.
>
>>  #define DEFERRED_ERROR_VECTOR		0xf4
>>  
>>  /* Vector on which hypervisor callbacks will be delivered */
>> diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
>> index f445bec516a0..0bec4c7e2308 100644
>> --- a/arch/x86/kernel/idt.c
>> +++ b/arch/x86/kernel/idt.c
>> @@ -157,6 +157,7 @@ static const __initconst struct idt_data apic_idts[] = {
>>  	INTG(POSTED_INTR_VECTOR,		asm_sysvec_kvm_posted_intr_ipi),
>>  	INTG(POSTED_INTR_WAKEUP_VECTOR,		asm_sysvec_kvm_posted_intr_wakeup_ipi),
>>  	INTG(POSTED_INTR_NESTED_VECTOR,		asm_sysvec_kvm_posted_intr_nested_ipi),

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 09/38] perf: Add switch_guest_ctx() interface
  2025-05-14 23:30   ` Sean Christopherson
@ 2025-05-15  1:45     ` Mi, Dapeng
  0 siblings, 0 replies; 127+ messages in thread
From: Mi, Dapeng @ 2025-05-15  1:45 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania


On 5/15/2025 7:30 AM, Sean Christopherson wrote:
> On Mon, Mar 24, 2025, Mingwei Zhang wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> When entering/exiting a guest, some contexts for a guest have to be
>> switched. For examples, there is a dedicated interrupt vector for
>> guests on Intel platforms.
>>
>> When the PMI is switched to a new guest vector, the guest_lvtpc value needs
>> to be reflected onto the HW, e.g., if the guest clears its PMI mask bit, the
>> HW PMI mask bit should be cleared as well so that PMIs can continue to be
>> generated for the guest. The guest_lvtpc parameter is therefore added to
>> perf_guest_enter() and switch_guest_ctx().
>>
>> Add a dedicated list to track all the PMUs with the PASSTHROUGH cap, which
>> may require switching the guest context; this avoids walking the huge list
>> of all PMUs.
>>
>> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>  include/linux/perf_event.h | 17 +++++++++++--
>>  kernel/events/core.c       | 51 +++++++++++++++++++++++++++++++++++++-
>>  2 files changed, 65 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 37187ee8e226..58c1cf6939bf 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -584,6 +584,11 @@ struct pmu {
>>  	 * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
>>  	 */
>>  	int (*check_period)		(struct perf_event *event, u64 value); /* optional */
>> +
>> +	/*
>> +	 * Switch guest context when a guest enter/exit, e.g., interrupt vectors.
>> +	 */
>> +	void (*switch_guest_ctx)	(bool enter, void *data); /* optional */
> IMO, putting this in "struct pmu" is unnecessarily convoluted and complex, and a
> poor fit for what needs to be done.  The only usage of the hook is for the CPU to
> swap the LVTPC, and the @data payload communicates exactly that.  I.e. this has
> one user, and can't reasonably be extended to other users without some ugliness.
>
> And if by some miracle there's no CPU pmu in perf, KVM's mediated PMU still needs
> to swap to its PMI IRQ.  So rather than a per-PMU hook along with a list and a
> spinlock, just make this an arch hook.  And if all of the mediated PMU code is
> guarded by a Kconfig, then perf doesn't even need __weak stubs.
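
For illustration, on x86 the arch hooks Sean sketches elsewhere in the thread
could boil down to roughly this (sketch only; guest_ctx_loaded is the renamed
per-CPU flag from patch 11, and the exact ordering is illustrative):

/* arch/x86/events/core.c */
void arch_perf_load_guest_context(unsigned long guest_lvtpc)
{
	/* Route PMIs to the mediated-PMI vector instead of NMI. */
	apic_write(APIC_LVTPC, guest_lvtpc);
	this_cpu_write(guest_ctx_loaded, true);
}

void arch_perf_put_guest_context(void)
{
	/* Restore the host's NMI-based PMI delivery. */
	this_cpu_write(guest_ctx_loaded, false);
	apic_write(APIC_LVTPC, APIC_DM_NMI);
}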

Sounds good for me.


>

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 11/38] perf/x86: Forbid PMI handler when guest own PMU
  2025-05-15  0:00   ` Sean Christopherson
@ 2025-05-15  1:52     ` Mi, Dapeng
  0 siblings, 0 replies; 127+ messages in thread
From: Mi, Dapeng @ 2025-05-15  1:52 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania


On 5/15/2025 8:00 AM, Sean Christopherson wrote:
> On Mon, Mar 24, 2025, Mingwei Zhang wrote:
>> If a guest PMI is delivered after VM-exit, the KVM maskable interrupt will
>> be held pending until EFLAGS.IF is set. In the meantime, if the logical
>> processor receives an NMI for any reason at all, perf_event_nmi_handler()
>> will be invoked. If there is any active perf event anywhere on the system,
>> x86_pmu_handle_irq() will be invoked, and it will clear
>> IA32_PERF_GLOBAL_STATUS. By the time KVM's PMI handler is invoked, it will
>> be a mystery which counter(s) overflowed.
>>
>> When the LVTPC is using the KVM PMI vector, the PMU is owned by the guest.
>> A host NMI lets x86_pmu_handle_irq() run, which restores the PMU vector to
>> NMI and clears IA32_PERF_GLOBAL_STATUS, breaking the guest vPMU passthrough
>> environment.
>>
>> So modify perf_event_nmi_handler() to check the perf_in_guest per-CPU
>> variable and, if it is set, simply return without calling x86_pmu_handle_irq().
>>
>> Suggested-by: Jim Mattson <jmattson@google.com>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> ---
>>  arch/x86/events/core.c | 27 +++++++++++++++++++++++++--
>>  1 file changed, 25 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> index 28161d6ff26d..96a173bbbec2 100644
>> --- a/arch/x86/events/core.c
>> +++ b/arch/x86/events/core.c
>> @@ -54,6 +54,8 @@ DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events) = {
>>  	.pmu = &pmu,
>>  };
>>  
>> +static DEFINE_PER_CPU(bool, pmi_vector_is_nmi) = true;
> I strongly prefer guest_ctx_loaded.  pmi_vector_is_nmi is very inflexible and
> doesn't communicate *why* perf's NMI handler needs to ignore NMIs.

Sure.


>
>>  DEFINE_STATIC_KEY_FALSE(rdpmc_never_available_key);
>>  DEFINE_STATIC_KEY_FALSE(rdpmc_always_available_key);
>>  DEFINE_STATIC_KEY_FALSE(perf_is_hybrid);
>> @@ -1737,6 +1739,24 @@ perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
>>  	u64 finish_clock;
>>  	int ret;
>>  
>> +	/*
>> +	 * When guest pmu context is loaded this handler should be forbidden from
>> +	 * running, the reasons are:
>> +	 * 1. After perf_guest_enter() is called, and before cpu enter into
>> +	 *    non-root mode, host non-PMI NMI could happen, but x86_pmu_handle_irq()
>> +	 *    restore PMU to use NMI vector, which destroy KVM PMI vector setting.
>> +	 * 2. When VM is running, host non-PMI NMI causes VM exit, KVM will
>> +	 *    call host NMI handler (vmx_vcpu_enter_exit()) first before KVM save
>> +	 *    guest PMU context (kvm_pmu_put_guest_context()), as x86_pmu_handle_irq()
>> +	 *    clear global_status MSR which has guest status now, then this destroy
>> +	 *    guest PMU status.
>> +	 * 3. After VM exit, but before KVM save guest PMU context, host non-PMI NMI
>> +	 *    could happen, x86_pmu_handle_irq() clear global_status MSR which has
>> +	 *    guest status now, then this destroy guest PMU status.
>> +	 */
> This *might* be useful for a changelog, but even then it's probably overkill.
> NMIs can happen at any time, that's the full story.  Enumerating the exact
> edge cases adds a lot of noise and not much value.

OK, we just wanted it to be easier to understand. :)


>

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 06/38] x86/irq: Factor out common code for installing kvm irq handler
  2025-05-14 23:21   ` Sean Christopherson
@ 2025-05-15  2:10     ` Mi, Dapeng
  0 siblings, 0 replies; 127+ messages in thread
From: Mi, Dapeng @ 2025-05-15  2:10 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania


On 5/15/2025 7:21 AM, Sean Christopherson wrote:
> On Mon, Mar 24, 2025, Mingwei Zhang wrote:
>> diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
>> index 385e3a5fc304..18cd418fe106 100644
>> --- a/arch/x86/kernel/irq.c
>> +++ b/arch/x86/kernel/irq.c
>> @@ -312,16 +312,22 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_x86_platform_ipi)
>>  static void dummy_handler(void) {}
>>  static void (*kvm_posted_intr_wakeup_handler)(void) = dummy_handler;
>>  
>> -void kvm_set_posted_intr_wakeup_handler(void (*handler)(void))
>> +void x86_set_kvm_irq_handler(u8 vector, void (*handler)(void))
>>  {
>> -	if (handler)
>> +	if (!handler)
>> +		handler = dummy_handler;
>> +
>> +	if (vector == POSTED_INTR_WAKEUP_VECTOR &&
>> +	    (handler == dummy_handler ||
>> +	     kvm_posted_intr_wakeup_handler == dummy_handler))
>>  		kvm_posted_intr_wakeup_handler = handler;
>> -	else {
>> -		kvm_posted_intr_wakeup_handler = dummy_handler;
>> +	else
>> +		WARN_ON_ONCE(1);
>> +
>> +	if (handler == dummy_handler)
> Eww.  Aside from the fact that the dummy_handler implementation is pointless
> overhead, I don't think KVM should own the IRQ vector.  Given that perf owns the
> LVTPC, i.e. is responsible for switching between NMI and the mediated PMI IRQ, I
> think perf should also own the vector.  KVM can then use the existing perf guest
> callbacks to wire up its PMI handler.
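
For example (purely illustrative; .handle_mediated_pmi does not exist in the
current perf_guest_info_callbacks and would be a new member, and
kvm_mediated_pmi_handler is a hypothetical KVM function):

/* arch/x86/kvm/x86.c */
static struct perf_guest_info_callbacks kvm_guest_cbs = {
	.state			= kvm_guest_state,
	.get_ip			= kvm_guest_get_ip,
	.handle_intel_pt_intr	= kvm_handle_intel_pt_intr,
	.handle_mediated_pmi	= kvm_mediated_pmi_handler,	/* new, hypothetical */
};

perf would own the IDT/FRED entry for the new vector and dispatch to the
registered callback, so KVM never touches the vector itself.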

Hmm, yes, makes sense.


>
> And with that, this patch can be dropped.
>

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 14/38] KVM: x86/pmu: Introduce enable_mediated_pmu global parameter
  2025-05-15  0:09   ` Sean Christopherson
@ 2025-05-15  2:53     ` Mi, Dapeng
  2025-05-21 18:43       ` Sean Christopherson
  0 siblings, 1 reply; 127+ messages in thread
From: Mi, Dapeng @ 2025-05-15  2:53 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania


On 5/15/2025 8:09 AM, Sean Christopherson wrote:
> On Mon, Mar 24, 2025, Mingwei Zhang wrote:
>> From: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>
>> Introduce enable_mediated_pmu global parameter to control if mediated
>> vPMU can be enabled on the KVM level. Even if enable_mediated_pmu is set to
>> true in KVM, the user space hypervisor still needs to enable the mediated vPMU
>> explicitly by calling the KVM_CAP_PMU_CAPABILITY ioctl. This gives the
>> hypervisor the flexibility to enable or disable the mediated vPMU for each VM.
>>
>> Mediated vPMU depends on some PMU features from higher PMU versions, like
>> the PERF_GLOBAL_STATUS_SET MSR in v4+ for the Intel PMU. Thus introduce a
>> pmu_ops variable, MIN_MEDIATED_PMU_VERSION, to indicate the minimum host
>> PMU version that the mediated vPMU needs.
>>
>> Currently enable_mediated_pmu is not exposed to user space as a module
>> parameter until all mediated vPMU code is in place.
>>
>> Suggested-by: Sean Christopherson <seanjc@google.com>
>> Co-developed-by: Mingwei Zhang <mizhang@google.com>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> ---
>>  arch/x86/kvm/pmu.c              |  3 ++-
>>  arch/x86/kvm/pmu.h              | 11 +++++++++
>>  arch/x86/kvm/svm/pmu.c          |  1 +
>>  arch/x86/kvm/vmx/capabilities.h |  3 ++-
>>  arch/x86/kvm/vmx/pmu_intel.c    |  5 ++++
>>  arch/x86/kvm/vmx/vmx.c          |  3 ++-
>>  arch/x86/kvm/x86.c              | 44 ++++++++++++++++++++++++++++++---
>>  arch/x86/kvm/x86.h              |  1 +
>>  8 files changed, 64 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>> index 75e9cfc689f8..4f455afe4009 100644
>> --- a/arch/x86/kvm/pmu.c
>> +++ b/arch/x86/kvm/pmu.c
>> @@ -775,7 +775,8 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
>>  	pmu->pebs_data_cfg_rsvd = ~0ull;
>>  	bitmap_zero(pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX);
>>  
>> -	if (!vcpu->kvm->arch.enable_pmu)
>> +	if (!vcpu->kvm->arch.enable_pmu ||
>> +	    (!lapic_in_kernel(vcpu) && enable_mediated_pmu))
> This check belongs in KVM_CAP_PMU_CAPABILITY, i.e. KVM needs to reject enabling
> a mediated PMU without an in-kernel local APIC, not silently drop the PMU.
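
A minimal sketch of the rejection in the KVM_CAP_PMU_CAPABILITY handler (the
perf_get_mediated_pmu()/perf_create_mediated_pmu() handling is omitted for
brevity, and whether irqchip_in_kernel() is the right predicate at this point
is an open question):

	mutex_lock(&kvm->lock);
	if (!kvm->created_vcpus) {
		bool enable = !(cap->args[0] & KVM_PMU_CAP_DISABLE);

		if (enable && enable_mediated_pmu && !irqchip_in_kernel(kvm)) {
			r = -EINVAL;	/* mediated vPMU needs an in-kernel APIC */
		} else {
			kvm->arch.enable_pmu = enable;
			r = 0;
		}
	}
	mutex_unlock(&kvm->lock);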

Good idea.


>
>>  		return;
>>  
>>  	kvm_pmu_call(refresh)(vcpu);
>> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
>> index ad89d0bd6005..dd45a0c6be74 100644
>> --- a/arch/x86/kvm/pmu.h
>> +++ b/arch/x86/kvm/pmu.h
>> @@ -45,6 +45,7 @@ struct kvm_pmu_ops {
>>  	const u64 EVENTSEL_EVENT;
>>  	const int MAX_NR_GP_COUNTERS;
>>  	const int MIN_NR_GP_COUNTERS;
>> +	const int MIN_MEDIATED_PMU_VERSION;
> I like the idea, but simply checking the PMU version is insufficient on Intel,
> i.e. just add a callback.
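
E.g., a sketch of replacing the constant with a vendor callback (names are
illustrative):

struct kvm_pmu_ops {
	/* existing members elided; new callback (illustrative): */
	bool (*is_mediated_pmu_supported)(void);
};

static bool intel_is_mediated_pmu_supported(void)
{
	/* v4+ is what guarantees MSR_CORE_PERF_GLOBAL_STATUS_SET exists. */
	return kvm_pmu_cap.version >= 4;
}

On Intel the VMCS save/load controls would also need to be checked, which ties
into the setup ordering discussed later in the thread.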

sure.


>
>>  };
>>  
>>  void kvm_pmu_ops_update(const struct kvm_pmu_ops *pmu_ops);
>> @@ -63,6 +64,12 @@ static inline bool kvm_pmu_has_perf_global_ctrl(struct kvm_pmu *pmu)
>>  	return pmu->version > 1;
>>  }
>>  
>> +static inline bool kvm_mediated_pmu_enabled(struct kvm_vcpu *vcpu)
> kvm_vcpu_has_mediated_pmu() to align with e.g. guest_cpu_cap_has(), and because
> kvm_mediated_pmu_enabled() sounds like a VM-scoped or module-scoped helper.

exactly.


>
>> +{
>> +	return vcpu->kvm->arch.enable_pmu &&
> This is superfluous, pmu->version should never be non-zero without the PMU being
> enabled at the VM level.

Strictly speaking, "arch.enable_pmu" and pmu->version don't indicate exactly
the same thing.  "arch.enable_pmu" indicates whether the PMU function is
enabled in KVM, while "pmu->version" comes from the user space configuration.
In theory user space could configure a PMU version of "0", just like
pmu_counters_test does. I'm not yet sure whether the check for "pmu->version"
can be removed; let me double-check.


>
>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
>> index 77012b2eca0e..425e93d4b1c6 100644
>> --- a/arch/x86/kvm/vmx/pmu_intel.c
>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
>> @@ -739,4 +739,9 @@ struct kvm_pmu_ops intel_pmu_ops __initdata = {
>>  	.EVENTSEL_EVENT = ARCH_PERFMON_EVENTSEL_EVENT,
>>  	.MAX_NR_GP_COUNTERS = KVM_MAX_NR_INTEL_GP_COUNTERS,
>>  	.MIN_NR_GP_COUNTERS = 1,
>> +	/*
>> +	 * Intel mediated vPMU support depends on
>> +	 * MSR_CORE_PERF_GLOBAL_STATUS_SET which is supported from 4+.
>> +	 */
>> +	.MIN_MEDIATED_PMU_VERSION = 4,
>>  };
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 00ac94535c21..a4b5b6455c7b 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -7916,7 +7916,8 @@ static __init u64 vmx_get_perf_capabilities(void)
>>  	if (boot_cpu_has(X86_FEATURE_PDCM))
>>  		rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_perf_cap);
>>  
>> -	if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR)) {
>> +	if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR) &&
>> +	    !enable_mediated_pmu) {
>>  		x86_perf_get_lbr(&vmx_lbr_caps);
>>  
>>  		/*
> There's a bit too much going on in this patch.  I think it makes sense to split
> the vendor chunks out to separate patches, so that each can elaborate on the
> exact requirements.

Sure.


>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 72995952978a..1ebe169b88b6 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -188,6 +188,14 @@ bool __read_mostly enable_pmu = true;
>>  EXPORT_SYMBOL_GPL(enable_pmu);
>>  module_param(enable_pmu, bool, 0444);
>>  
>> +/*
>> + * Enable/disable mediated passthrough PMU virtualization.
>> + * Don't expose it to userspace as a module parameter until
>> + * all mediated vPMU code is in place.
>> + */
> No need for the comment, documenting this in the changelog is sufficient.

Sure.


>
>> +bool __read_mostly enable_mediated_pmu;
>> +EXPORT_SYMBOL_GPL(enable_mediated_pmu);
>> +
>>  bool __read_mostly eager_page_split = true;
>>  module_param(eager_page_split, bool, 0644);
>>  
>> @@ -6643,9 +6651,28 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>>  			break;
>>  
>>  		mutex_lock(&kvm->lock);
>> -		if (!kvm->created_vcpus) {
>> -			kvm->arch.enable_pmu = !(cap->args[0] & KVM_PMU_CAP_DISABLE);
>> -			r = 0;
>> +		/*
>> +		 * To keep PMU configuration "simple", setting vPMU support is
>> +		 * disallowed if vCPUs are created, or if mediated PMU support
>> +		 * was already enabled for the VM.
>> +		 */
>> +		if (!kvm->created_vcpus &&
>> +		    (!enable_mediated_pmu || !kvm->arch.enable_pmu)) {
>> +			bool pmu_enable = !(cap->args[0] & KVM_PMU_CAP_DISABLE);
>> +
>> +			if (enable_mediated_pmu && pmu_enable) {
> Local APIC check goes here.

Yes.


>
>> +				char *err_msg = "Fail to enable mediated vPMU, " \
>> +					"please disable system wide perf events or nmi_watchdog " \
>> +					"(echo 0 > /proc/sys/kernel/nmi_watchdog).\n";
>> +
>> +				r = perf_get_mediated_pmu();
>> +				if (r)
>> +					kvm_err("%s", err_msg);
>
> #define MEDIATED_PMU_MSG "Fail to enable mediated vPMU, disable system wide perf events and nmi_watchdog.\n"

Sure.


>
> 				r = perf_create_mediated_pmu();
> 				if (r)
> 					kvm_err(MEDIATED_PMU_MSG);
>
>> +			} else
>> +				r = 0;
>> +
>> +			if (!r)
>> +				kvm->arch.enable_pmu = pmu_enable;
>>  		}
>>  		mutex_unlock(&kvm->lock);
>>  		break;
>> @@ -12723,7 +12750,14 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>>  	kvm->arch.default_tsc_khz = max_tsc_khz ? : tsc_khz;
>>  	kvm->arch.apic_bus_cycle_ns = APIC_BUS_CYCLE_NS_DEFAULT;
>>  	kvm->arch.guest_can_read_msr_platform_info = true;
>> -	kvm->arch.enable_pmu = enable_pmu;
>> +
>> +	/*
>> +	 * PMU virtualization is opt-in when mediated PMU support is enabled.
>> +	 * KVM_CAP_PMU_CAPABILITY ioctl must be called explicitly to enable
>> +	 * mediated vPMU. For legacy perf-based vPMU, its behavior isn't changed,
>> +	 * KVM_CAP_PMU_CAPABILITY ioctl is optional.
>> +	 */
> Again, too much extraneous info, the exception proves the rule.  I.e. by calling
> out that mediated PMU is special, it's clear the rule is that PMUs are enabled by
> default in the !mediated case.

Sure.


>
> 	/*
> 	 * Userspace must explicitly opt-in to PMU virtualization when mediated
> 	 * PMU support is enabled (see KVM_CAP_PMU_CAPABILITY).
> 	 */
>
>> +	kvm->arch.enable_pmu = enable_pmu && !enable_mediated_pmu;
> So I tried to run a QEMU with this and it failed, because QEMU expected the PMU
> to be enabled and tried to write to PMU MSRs.  I haven't dug through the QEMU
> code, but I assume that QEMU rightly expects that passing in PMU in CPUID when
> KVM_GET_SUPPORTED_CPUID says it's supported will result in the VM having a PMU.

Once the module parameter "enable_mediated_pmu" is enabled, QEMU needs the
extra code below to enable the mediated vPMU; otherwise the PMU is disabled
in KVM.

https://lore.kernel.org/all/20250324123712.34096-1-dapeng1.mi@linux.intel.com/

> I.e. by trying to get cute with backwards compatibility, I think we broke backwards
> compatibility.  At this point, I'm leaning toward making the module param off-by-default,
> but otherwise not messing with the behavior of kvm->arch.enable_pmu.  Not sure if
> that has implications for KVM_PMU_CAP_DISABLE though.

I'm not sure this really breaks backwards compatibility.  As long as
"enable_mediated_pmu" is not enabled, QEMU doesn't need any changes and the
legacy vPMU can still be enabled by old QEMU versions. But if users want to
enable the mediated vPMU, they should use a new QEMU version which has the
capability to enable it; that sounds reasonable to me.



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 15/38] KVM: x86/pmu: Check PMU cpuid configuration from user space
  2025-05-15  0:12   ` Sean Christopherson
@ 2025-05-15  3:00     ` Mi, Dapeng
  0 siblings, 0 replies; 127+ messages in thread
From: Mi, Dapeng @ 2025-05-15  3:00 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania


On 5/15/2025 8:12 AM, Sean Christopherson wrote:
> On Mon, Mar 24, 2025, Mingwei Zhang wrote:
>> From: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>
>> Check user space's PMU cpuid configuration and filter the invalid
>> configuration.
>>
>> Both the legacy perf-based vPMU and the mediated vPMU need the kernel to
>> support the local APIC, otherwise there is no way to inject PMIs into the
>> guest. If the kernel doesn't support the local APIC, reject user space's
>> attempt to enable the PMU in CPUID.
>>
>> For the mediated vPMU, the user-space-configured PMU version must be no
>> larger than the maximum PMU version supported by KVM, otherwise the guest
>> could manipulate unsupported or disallowed PMU MSRs, which is dangerous.
>>
>> If the PMU version is larger than 1 but smaller than 5, CPUID.0AH.ECX must
>> be 0 as well, as required by the SDM.
>>
>> Suggested-by: Zide Chen <zide.chen@intel.com>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>  arch/x86/kvm/cpuid.c | 15 +++++++++++++++
>>  arch/x86/kvm/pmu.c   |  7 +++++--
>>  arch/x86/kvm/pmu.h   |  1 +
>>  3 files changed, 21 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
>> index 8eb3a88707f2..f849ced9deba 100644
>> --- a/arch/x86/kvm/cpuid.c
>> +++ b/arch/x86/kvm/cpuid.c
>> @@ -179,6 +179,21 @@ static int kvm_check_cpuid(struct kvm_vcpu *vcpu)
>>  			return -EINVAL;
>>  	}
>>  
>> +	best = kvm_find_cpuid_entry(vcpu, 0xa);
>> +	if (vcpu->kvm->arch.enable_pmu && best) {
>> +		union cpuid10_eax eax;
>> +
>> +		eax.full = best->eax;
>> +		if (enable_mediated_pmu &&
>> +		    eax.split.version_id > kvm_pmu_cap.version)
>> +			return -EINVAL;
>> +		if (eax.split.version_id > 0 && !vcpu_pmu_can_enable(vcpu))
>> +			return -EINVAL;
>> +		if (eax.split.version_id > 1 && eax.split.version_id < 5 &&
>> +		    best->ecx != 0)
>> +			return -EINVAL;
> NAK, unless there is a really, *really* strong need for this.  I do not want to
> get in the business of vetting the vCPU model presented to the guest.  If KVM
> needs to constrain things for its own safety, then by all means, but AFAICT these
> are nothing more than sanity checks on userspace.

Ok.


>
>> +	}
>> +
>>  	/*
>>  	 * Exposing dynamic xfeatures to the guest requires additional
>>  	 * enabling in the FPU, e.g. to expand the guest XSAVE state size.
>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>> index 4f455afe4009..92c742ead663 100644
>> --- a/arch/x86/kvm/pmu.c
>> +++ b/arch/x86/kvm/pmu.c
>> @@ -743,6 +743,10 @@ static void kvm_pmu_reset(struct kvm_vcpu *vcpu)
>>  	kvm_pmu_call(reset)(vcpu);
>>  }
>>  
>> +inline bool vcpu_pmu_can_enable(struct kvm_vcpu *vcpu)
>> +{
>> +	return vcpu->kvm->arch.enable_pmu && lapic_in_kernel(vcpu);
> Again, the APIC check belongs in the VM enablement path, not here.  Hmm, that
> may require more thought with respect to enabling the PMU by default.

Sure.



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 17/38] KVM: x86/pmu: Add perf_capabilities field in struct kvm_host_values{}
  2025-05-15  0:12   ` Sean Christopherson
@ 2025-05-15  3:04     ` Mi, Dapeng
  0 siblings, 0 replies; 127+ messages in thread
From: Mi, Dapeng @ 2025-05-15  3:04 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania


On 5/15/2025 8:12 AM, Sean Christopherson wrote:
> On Mon, Mar 24, 2025, Mingwei Zhang wrote:
>> From: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>
>> Add a perf_capabilities field in the kvm_host_values{} structure to record
>> the host perf capabilities. KVM needs to know if the host supports some PMU
>> capabilities and then decide whether to pass through or intercept some PMU
>> MSRs or instructions like RDPMC, e.g. if the host supports PERF_METRICS but
>> the guest is configured not to support it, then the RDPMC instruction needs
>> to be intercepted.
> This is wrong (spoiler alert).  This patch can be dropped.

Sean, why? I don't get your point here. Could you please give more
information? Thanks.



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 20/38] KVM: x86/pmu: Check if mediated vPMU can intercept rdpmc
  2025-05-15  0:19   ` Sean Christopherson
@ 2025-05-15  3:23     ` Mi, Dapeng
  0 siblings, 0 replies; 127+ messages in thread
From: Mi, Dapeng @ 2025-05-15  3:23 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania


On 5/15/2025 8:19 AM, Sean Christopherson wrote:
> The shortlog is wildly inaccurate.  KVM is not simply checking, KVM is actively
> disabling RDPMC interception.  *That* needs to be the focus of the shortlog and
> changelog.

Sure.


>
>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>> index 92c742ead663..6ad71752be4b 100644
>> --- a/arch/x86/kvm/pmu.c
>> +++ b/arch/x86/kvm/pmu.c
>> @@ -604,6 +604,40 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
>>  	return 0;
>>  }
>>  
>> +inline bool kvm_rdpmc_in_guest(struct kvm_vcpu *vcpu)
> Strongly prefer kvm_need_rdpmc_intercept(), e.g. to follow vmx_need_pf_intercept(),
> and because it makes the users more obviously correct.  The "in_guest" terminology
> from kvm_{hlt,mwait,pause,cstate}_in_guest() isn't great, but at least in those
> flows it's not awful because they are very direct reflections of knobs that control
> interception, whereas this helper is making a variety of runtime checks.

Sure.


>
>> +{
>> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> +
>> +	if (!kvm_mediated_pmu_enabled(vcpu))
>> +		return false;
>> +
>> +	/*
>> +	 * VMware allows access to these Pseduo-PMCs even when read via RDPMC
>> +	 * in Ring3 when CR4.PCE=0.
>> +	 */
>> +	if (enable_vmware_backdoor)
>> +		return false;
>> +
>> +	/*
>> +	 * FIXME: In theory, perf metrics is always combined with fixed
>> +	 *	  counter 3. it's fair enough to compare the guest and host
>> +	 *	  fixed counter number and don't need to check perf metrics
>> +	 *	  explicitly. However kvm_pmu_cap.num_counters_fixed is limited
>> +	 *	  KVM_MAX_NR_FIXED_COUNTERS (3) as fixed counter 3 is not
>> +	 *	  supported now. perf metrics is still needed to be checked
>> +	 *	  explicitly here. Once fixed counter 3 is supported, the perf
>> +	 *	  metrics checking can be removed.
>> +	 */
> And then what happens when hardware supports fixed counter #4?  KVM has the same
> problem, and we can't check for features that KVM doesn't know about.
>
> The entire problem is that this code is checking for *KVM* support, but what the
> guest can see and access needs to be checked against *hardware* support.  Handling
> that is simple, just take a snapshot of the host PMU capabilities before KVM
> generates kvm_pmu_cap, and use the unadulterated snapshot here (and everywhere
> else with similar checks).
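
A sketch of the snapshot idea (perf_get_x86_pmu_capability() is the existing
perf API; kvm_host_pmu is a made-up name):

/* Raw, unclamped hardware view, captured before kvm_pmu_cap is derived. */
static struct x86_pmu_capability __ro_after_init kvm_host_pmu;

void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops)
{
	perf_get_x86_pmu_capability(&kvm_host_pmu);

	/* existing code then derives the clamped kvm_pmu_cap from this */
}

The interception check would then compare pmu->nr_arch_gp_counters and friends
against kvm_host_pmu instead of kvm_pmu_cap.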

Yes, that's correct. Whether to disable interception should be checked
against the HW capability instead of the KVM PMU capability, since the host
perf subsystem may hide some PMU features.


>
>> +	return pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp &&
>> +	       pmu->nr_arch_fixed_counters == kvm_pmu_cap.num_counters_fixed &&
>> +	       vcpu_has_perf_metrics(vcpu) == kvm_host_has_perf_metrics() &&
>> +	       pmu->counter_bitmask[KVM_PMC_GP] ==
>> +				(BIT_ULL(kvm_pmu_cap.bit_width_gp) - 1) &&
>> +	       pmu->counter_bitmask[KVM_PMC_FIXED] ==
>> +				(BIT_ULL(kvm_pmu_cap.bit_width_fixed) - 1);
>> +}
>> @@ -212,6 +212,18 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
>>  	bitmap_set(pmu->all_valid_pmc_idx, 0, pmu->nr_arch_gp_counters);
>>  }
>>  
>> +static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
>> +{
>> +	struct vcpu_svm *svm = to_svm(vcpu);
>> +
>> +	__amd_pmu_refresh(vcpu);
> To better communicate the roles of the two paths to refresh():
>
> 	amd_pmu_refresh_capabilities(vcpu);
>
> 	amd_pmu_refresh_controls(vcpu);
>
> Ditto for Intel.

Sure.



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 21/38] KVM: x86/pmu/vmx: Save/load guest IA32_PERF_GLOBAL_CTRL with vm_exit/entry_ctrl
  2025-05-15  0:33       ` Sean Christopherson
@ 2025-05-15  3:45         ` Mi, Dapeng
  0 siblings, 0 replies; 127+ messages in thread
From: Mi, Dapeng @ 2025-05-15  3:45 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Zide Chen, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Jim Mattson, Sandipan Das, Eranian Stephane,
	Shukla Manali, Nikunj Dadhania


On 5/15/2025 8:33 AM, Sean Christopherson wrote:
> On Wed, Mar 26, 2025, Mingwei Zhang wrote:
>> On Wed, Mar 26, 2025 at 9:51 AM Chen, Zide <zide.chen@intel.com> wrote:
>>>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>>>> index 6ad71752be4b..4e8cefcce7ab 100644
>>>> --- a/arch/x86/kvm/pmu.c
>>>> +++ b/arch/x86/kvm/pmu.c
>>>> @@ -646,6 +646,30 @@ void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu)
>>>>       }
>>>>  }
>>>>
>>>> +static void kvm_pmu_sync_global_ctrl_from_vmcs(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +     struct msr_data msr_info = { .index = MSR_CORE_PERF_GLOBAL_CTRL };
>>>> +
>>>> +     if (!kvm_mediated_pmu_enabled(vcpu))
>>>> +             return;
>>>> +
>>>> +     /* Sync pmu->global_ctrl from GUEST_IA32_PERF_GLOBAL_CTRL. */
>>>> +     kvm_pmu_call(get_msr)(vcpu, &msr_info);
>>>> +}
>>>> +
>>>> +static void kvm_pmu_sync_global_ctrl_to_vmcs(struct kvm_vcpu *vcpu, u64 global_ctrl)
>>>> +{
>>>> +     struct msr_data msr_info = {
>>>> +             .index = MSR_CORE_PERF_GLOBAL_CTRL,
>>>> +             .data = global_ctrl };
>>>> +
>>>> +     if (!kvm_mediated_pmu_enabled(vcpu))
>>>> +             return;
>>>> +
>>>> +     /* Sync pmu->global_ctrl to GUEST_IA32_PERF_GLOBAL_CTRL. */
>>>> +     kvm_pmu_call(set_msr)(vcpu, &msr_info);
> Eh, just add a dedicated kvm_pmu_ops hook.  Feeding this through set_msr() avoids
> adding another hook, but makes the code hard to follow and requires the above
> ugly boilerplate.

Sure. I originally wondered whether it was worthwhile to add a new
kvm_pmu_ops hook since only Intel platforms need it.


>
>>>> +}
>>>> +
>>>>  bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
>>>>  {
>>>>       switch (msr) {
>>>> @@ -680,7 +704,6 @@ int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>>>               msr_info->data = pmu->global_status;
>>>>               break;
>>>>       case MSR_AMD64_PERF_CNTR_GLOBAL_CTL:
>>>> -     case MSR_CORE_PERF_GLOBAL_CTRL:
>>>>               msr_info->data = pmu->global_ctrl;
>>>>               break;
>>>>       case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR:
>>>> @@ -731,6 +754,9 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>>
>>> pmu->global_ctrl doesn't always have the up-to-date guest value, need to
>>> sync from vmcs/vmcb before comparing it against 'data'.
>>>
>>> +               kvm_pmu_sync_global_ctrl_from_vmcs(vcpu);
>>>                 if (pmu->global_ctrl != data) {
>> Good catch. Thanks!
>>
>> This is why I really prefer just unconditionally syncing the global
>> ctrl from VMCS to pmu->global_ctrl and vice versa.
>>
>> We might get into similar problems as well in the future.
> The problem isn't conditional synchronization, it's that y'all reinvented the
> wheel, poorly.  This is a solved problem via EXREG and wrappers.
>
> That said, I went through the exercise of adding a PERF_GLOBAL_CTRL EXREG and
> associated wrappers, and didn't love the result.  Host writes should be rare, so
> the dirty tracking is overkill.  For reads, the cost of VMREAD is lower than
> VMWRITE (doesn't trigger consistency check re-evaluation on VM-Enter), and is
> dwarfed by the cost of switching all other PMU state.
>
> So I think for the initial implementation, it makes sense to propagated writes
> to the VMCS on demand, but do VMREAD after VM-Exit (if VM-Enter was successful).
> We can always revisit the optimization if/when we optimize the PMU world switches,
> e.g. to defer them if there are no active host events.
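
Roughly, the suggested split could look like this (function names are
illustrative):

/* Guest WRMSR(PERF_GLOBAL_CTRL): propagate to the VMCS on demand. */
static void intel_pmu_write_global_ctrl(struct kvm_vcpu *vcpu, u64 data)
{
	vcpu_to_pmu(vcpu)->global_ctrl = data;
	vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, data);
}

/* After a successful VM-Enter/VM-Exit round trip, refresh the cached value. */
static void intel_pmu_sync_global_ctrl(struct kvm_vcpu *vcpu)
{
	vcpu_to_pmu(vcpu)->global_ctrl =
		vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL);
}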

Sure.


>
>>>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>>>> index 8a7af02d466e..ecf72394684d 100644
>>>> --- a/arch/x86/kvm/vmx/nested.c
>>>> +++ b/arch/x86/kvm/vmx/nested.c
>>>> @@ -7004,7 +7004,8 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
>>>>               VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
>>>>               VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
>>>>               VM_EXIT_SAVE_VMX_PREEMPTION_TIMER | VM_EXIT_ACK_INTR_ON_EXIT |
>>>> -             VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
>>>> +             VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL |
>>>> +             VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL;
> This is completely wrong.  Stuffing VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL here
> advertises support for KVM emulation of the control, and that support is non-existent
> in this patch (and series).
>
> Just drop this, emulation of VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL can be done
> separately.

Sure.


>
>>>> +     mediated = kvm_mediated_pmu_enabled(vcpu);
>>>> +     if (cpu_has_load_perf_global_ctrl()) {
>>>> +             vm_entry_controls_changebit(vmx,
>>>> +                     VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL, mediated);
>>>> +             /*
>>>> +              * Initialize guest PERF_GLOBAL_CTRL to reset value as SDM rules.
>>>> +              *
>>>> +              * Note: GUEST_IA32_PERF_GLOBAL_CTRL must be initialized to
>>>> +              * "BIT_ULL(pmu->nr_arch_gp_counters) - 1" instead of pmu->global_ctrl
>>>> +              * since pmu->global_ctrl is only be initialized when guest
>>>> +              * pmu->version > 1. Otherwise if pmu->version is 1, pmu->global_ctrl
>>>> +              * is 0 and guest counters are never really enabled.
>>>> +              */
>>>> +             if (mediated)
>>>> +                     vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL,
>>>> +                                  BIT_ULL(pmu->nr_arch_gp_counters) - 1);
> This belongs in common code, as a call to the aforementioned hook to propagate
> PERF_GLOBAL_CTRL to hardware.

Sure.


>
>>>> +     }
>>>> +
>>>> +     if (cpu_has_save_perf_global_ctrl())
>>>> +             vm_exit_controls_changebit(vmx,
>>>> +                     VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL |
>>>> +                     VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL, mediated);
>>>>  }
>>>>
>>>>  static void intel_pmu_init(struct kvm_vcpu *vcpu)
>>>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>>>> index ff66f17d6358..38ecf3c116bd 100644
>>>> --- a/arch/x86/kvm/vmx/vmx.c
>>>> +++ b/arch/x86/kvm/vmx/vmx.c
>>>> @@ -4390,6 +4390,13 @@ void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
>>>>
>>>>       if (cpu_has_load_ia32_efer())
>>>>               vmcs_write64(HOST_IA32_EFER, kvm_host.efer);
>>>> +
>>>> +     /*
>>>> +      * Initialize host PERF_GLOBAL_CTRL to 0 to disable all counters
>>>> +      * immediately once VM exits. Mediated vPMU then call perf_guest_exit()
>>>> +      * to re-enable host perf events.
>>>> +      */
>>>> +     vmcs_write64(HOST_IA32_PERF_GLOBAL_CTRL, 0);
> This needs to be conditioned on the mediated PMU being enabled, because this field
> is not constant when using the emulated PMU (or no vPMU).
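
I.e., something along these lines (sketch):

	if (cpu_has_load_ia32_efer())
		vmcs_write64(HOST_IA32_EFER, kvm_host.efer);

	/*
	 * HOST_IA32_PERF_GLOBAL_CTRL is only a constant (0) when the
	 * mediated PMU is in use; leave it alone otherwise.
	 */
	if (enable_mediated_pmu)
		vmcs_write64(HOST_IA32_PERF_GLOBAL_CTRL, 0);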

Yes.


>
>>>> @@ -8451,6 +8462,15 @@ __init int vmx_hardware_setup(void)
>>>>               enable_sgx = false;
>>>>  #endif
>>>>
>>>> +     /*
>>>> +      * All CPUs that support a mediated PMU are expected to support loading
>>>> +      * and saving PERF_GLOBAL_CTRL via dedicated VMCS fields.
>>>> +      */
>>>> +     if (enable_mediated_pmu &&
>>>> +         (WARN_ON_ONCE(!cpu_has_load_perf_global_ctrl() ||
>>>> +                       !cpu_has_save_perf_global_ctrl())))
> This needs to be conditioned on !HYPERVISOR, or it *will* fire.
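
E.g., a sketch that only warns on bare metal but still disables the feature
either way:

	if (enable_mediated_pmu &&
	    (!cpu_has_load_perf_global_ctrl() ||
	     !cpu_has_save_perf_global_ctrl())) {
		/* Missing controls are expected when running virtualized. */
		WARN_ON_ONCE(!boot_cpu_has(X86_FEATURE_HYPERVISOR));
		enable_mediated_pmu = false;
	}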

Ok.


>
> And placing this check here, without *any* mention of *why* you did so, is evil
> and made me very grumpy.  I had to discover the hard way that you checked the
> VMCS fields here, instead of in kvm_init_pmu_capability() where it logically
> belongs, because the VMCS configuration isn't yet initialized.
>
> Grumpiness aside, I don't like this late clear of enable_mediated_pmu, as it risks
> a variation of the problem you're trying to avoid, i.e. risks consuming the variable
> between kvm_init_pmu_capability() and here.

Yes.


>
> I don't see any reason why setup_vmcs_config() can't be called before
> kvm_x86_vendor_init(), so unless I'm missing/forgetting something, let's just do
> that, and move these checks where they belong.

I'm not quite sure about this. Let me double check.



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 22/38] KVM: x86/pmu: Optimize intel/amd_pmu_refresh() helpers
  2025-05-15  0:37   ` Sean Christopherson
@ 2025-05-15  5:09     ` Mi, Dapeng
  2025-05-15 19:22       ` Sean Christopherson
  0 siblings, 1 reply; 127+ messages in thread
From: Mi, Dapeng @ 2025-05-15  5:09 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania


On 5/15/2025 8:37 AM, Sean Christopherson wrote:
> This is not an optimization in any sane interpretation of that word.

Yes, maybe clean up or bug fix is more accurate.


>
> On Mon, Mar 24, 2025, Mingwei Zhang wrote:
>> From: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>
>> Currently pmu->global_ctrl is initialized in the common kvm_pmu_refresh()
>> helper since both Intel and AMD CPUs set enable bits for all GP counters
>> for PERF_GLOBAL_CTRL MSR. But it may be not the best place to initialize
>> pmu->global_ctrl. Strictly speaking, pmu->global_ctrl is vendor specific
> And?  There's mounds of KVM code that show it's very, very easy to manage
> global_ctrl in common code.

The original intention was to put all of the initialization code in the same
place, which looked easier to maintain. But if you don't like it, I'll drop
the change.


>
>> and there are lots of global_ctrl related processing in
>> intel/amd_pmu_refresh() helpers, so better handle them in same place.
>> Thus move pmu->global_ctrl initialization into intel/amd_pmu_refresh()
>> helpers.
>>
>> Besides, intel_pmu_refresh() doesn't handle global_ctrl_rsvd and
>> global_status_rsvd properly and fix it.
> Really?  You mention a bug fix in passing, and squash it into an opinionated
> refactoring that is advertised as "optimizations" without even stating what the
> bug is?  C'mon.

Sorry for not describing the issue clearly. global_ctrl_rsvd and
global_status_rsvd should be updated only when pmu->version >= 2, but the
original code doesn't check that.


>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>  arch/x86/kvm/pmu.c           | 10 -------
>>  arch/x86/kvm/svm/pmu.c       | 14 +++++++--
>>  arch/x86/kvm/vmx/pmu_intel.c | 55 ++++++++++++++++++------------------
>>  3 files changed, 39 insertions(+), 40 deletions(-)
>>
>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>> index 4e8cefcce7ab..2ac4c039de8b 100644
>> --- a/arch/x86/kvm/pmu.c
>> +++ b/arch/x86/kvm/pmu.c
>> @@ -843,16 +843,6 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
>>  		return;
>>  
>>  	kvm_pmu_call(refresh)(vcpu);
>> -
>> -	/*
>> -	 * At RESET, both Intel and AMD CPUs set all enable bits for general
>> -	 * purpose counters in IA32_PERF_GLOBAL_CTRL (so that software that
>> -	 * was written for v1 PMUs don't unknowingly leave GP counters disabled
>> -	 * in the global controls).  Emulate that behavior when refreshing the
>> -	 * PMU so that userspace doesn't need to manually set PERF_GLOBAL_CTRL.
>> -	 */
>> -	if (kvm_pmu_has_perf_global_ctrl(pmu) && pmu->nr_arch_gp_counters)
>> -		pmu->global_ctrl = GENMASK_ULL(pmu->nr_arch_gp_counters - 1, 0);
> Absolutely not, this code stays where it is.

Sure.


>
>>  }
>>  
>>  void kvm_pmu_init(struct kvm_vcpu *vcpu)
>> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
>> index 153972e944eb..eba086ef5eca 100644
>> --- a/arch/x86/kvm/svm/pmu.c
>> +++ b/arch/x86/kvm/svm/pmu.c
>> @@ -198,12 +198,20 @@ static void __amd_pmu_refresh(struct kvm_vcpu *vcpu)
>>  	pmu->nr_arch_gp_counters = min_t(unsigned int, pmu->nr_arch_gp_counters,
>>  					 kvm_pmu_cap.num_counters_gp);
>>  
>> -	if (pmu->version > 1) {
>> -		pmu->global_ctrl_rsvd = ~((1ull << pmu->nr_arch_gp_counters) - 1);
>> +	if (kvm_pmu_cap.version > 1) {
> It's not just global_ctrl.  PEBS and the fixed counters also depend on v2+ (the
> SDM contradicts itself; KVM's ABI is that they're v2+).
>
>> +		/*
>> +		 * At RESET, AMD CPUs set all enable bits for general purpose counters in
>> +		 * IA32_PERF_GLOBAL_CTRL (so that software that was written for v1 PMUs
>> +		 * don't unknowingly leave GP counters disabled in the global controls).
>> +		 * Emulate that behavior when refreshing the PMU so that userspace doesn't
>> +		 * need to manually set PERF_GLOBAL_CTRL.
>> +		 */
>> +		pmu->global_ctrl = BIT_ULL(pmu->nr_arch_gp_counters) - 1;
>> +		pmu->global_ctrl_rsvd = ~pmu->global_ctrl;
>>  		pmu->global_status_rsvd = pmu->global_ctrl_rsvd;
>>  	}
>>  
>> -	pmu->counter_bitmask[KVM_PMC_GP] = ((u64)1 << 48) - 1;
>> +	pmu->counter_bitmask[KVM_PMC_GP] = BIT_ULL(48) - 1;
> I like these cleanups, but they too belong in a separate patch.

Sure.


>
>>  	pmu->reserved_bits = 0xfffffff000280000ull;
>>  	pmu->raw_event_mask = AMD64_RAW_EVENT_MASK;
>>  	/* not applicable to AMD; but clean them to prevent any fall out */

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 26/38] KVM: x86/pmu: Introduce eventsel_hw to prepare for pmu event filtering
  2025-05-15  0:42   ` Sean Christopherson
@ 2025-05-15  5:34     ` Mi, Dapeng
  0 siblings, 0 replies; 127+ messages in thread
From: Mi, Dapeng @ 2025-05-15  5:34 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania


On 5/15/2025 8:42 AM, Sean Christopherson wrote:
> On Mon, Mar 24, 2025, Mingwei Zhang wrote:
>> -	pmu->fixed_ctr_ctrl = pmu->global_ctrl = pmu->global_status = 0;
>> +	pmu->fixed_ctr_ctrl = pmu->fixed_ctr_ctrl_hw = 0;
>> +	pmu->global_ctrl = pmu->global_status = 0;
> VMCS needs to be updated.

Yes. Thanks.



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 23/38] KVM: x86/pmu: Configure the interception of PMU MSRs
  2025-05-15  0:41   ` Sean Christopherson
@ 2025-05-15  5:37     ` Mi, Dapeng
  2025-05-15 19:06       ` Sean Christopherson
  0 siblings, 1 reply; 127+ messages in thread
From: Mi, Dapeng @ 2025-05-15  5:37 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania


On 5/15/2025 8:41 AM, Sean Christopherson wrote:
> Again, use more precise language.  "Configure interceptions" is akin to "do work".
> It gives readers a vague idea of what's going on, but this
>
>   KVM: x86/pmu: Disable interception of select PMU MSRs for mediated vPMUs
>
> is just as concise, and more descriptive.

Yes, absolutely. Thanks.


>
>> +	/*
>> +	 * In mediated vPMU, intercept global PMU MSRs when guest PMU only owns
>> +	 * a subset of counters provided in HW or its version is less than 2.
>> +	 */
>> +	if (kvm_mediated_pmu_enabled(vcpu) && kvm_pmu_has_perf_global_ctrl(pmu) &&
>> +	    pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp)
> This logic belongs in common code.  Just because AMD doesn't have fixed counters
> doesn't mean KVM can't have a superfluous "0 == 0" check.

Yes.


>
>> +	if (kvm_mediated_pmu_enabled(vcpu) && kvm_pmu_has_perf_global_ctrl(pmu) &&
> Just require the guest to have PERF_GLOBAL_CTRL, I don't see any reason to support
> v1 PMUs.  It adds complexity and weirdness, and I can't imagine there's a use case.

Ok.


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 27/38] KVM: x86/pmu: Handle PMU MSRs interception and event filtering
  2025-05-15  0:43   ` Sean Christopherson
@ 2025-05-15  5:38     ` Mi, Dapeng
  0 siblings, 0 replies; 127+ messages in thread
From: Mi, Dapeng @ 2025-05-15  5:38 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania


On 5/15/2025 8:43 AM, Sean Christopherson wrote:
> Again, be more precise.
>
> On Mon, Mar 24, 2025, Mingwei Zhang wrote:
>> From: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>
>> Mediated vPMU needs to intercept EVENTSELx and FIXED_CNTR_CTRL MSRs to
>> filter out guest malicious perf events. Either writing these MSRs or
>> updating event filters would call reprogram_counter() eventually. Thus
>> check if the guest event should be filtered out in reprogram_counter().
>> If so, clear corresponding EVENTSELx MSR or FIXED_CNTR_CTRL field to
>> ensure the guest event won't be really enabled at vm-entry.
>>
>> Besides, mediated vPMU intercepts the MSRs of these guest not owned
>> counters and it just needs simply to read/write from/to pmc->counter.
>>
>> Suggested-by: Sean Christopherson <seanjc@google.com>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> Co-developed-by: Mingwei Zhang <mizhang@google.com>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>  arch/x86/kvm/pmu.c | 27 +++++++++++++++++++++++++++
>>  arch/x86/kvm/pmu.h |  3 +++
>>  2 files changed, 30 insertions(+)
>>
>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>> index 63143eeb5c44..e9100dc49fdc 100644
>> --- a/arch/x86/kvm/pmu.c
>> +++ b/arch/x86/kvm/pmu.c
>> @@ -305,6 +305,11 @@ static void pmc_update_sample_period(struct kvm_pmc *pmc)
>>  
>>  void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
>>  {
>> +	if (kvm_mediated_pmu_enabled(pmc->vcpu)) {
>> +		pmc->counter = val & pmc_bitmask(pmc);
>> +		return;
>> +	}
>> +
>>  	/*
>>  	 * Drop any unconsumed accumulated counts, the WRMSR is a write, not a
>>  	 * read-modify-write.  Adjust the counter value so that its value is
>> @@ -455,6 +460,28 @@ static int reprogram_counter(struct kvm_pmc *pmc)
>>  	bool emulate_overflow;
>>  	u8 fixed_ctr_ctrl;
>>  
>> +	if (kvm_mediated_pmu_enabled(pmu_to_vcpu(pmu))) {
>> +		bool allowed = check_pmu_event_filter(pmc);
>> +
>> +		if (pmc_is_gp(pmc)) {
>> +			if (allowed)
>> +				pmc->eventsel_hw |= pmc->eventsel &
>> +						    ARCH_PERFMON_EVENTSEL_ENABLE;
>> +			else
>> +				pmc->eventsel_hw &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
>> +		} else {
>> +			int idx = pmc->idx - KVM_FIXED_PMC_BASE_IDX;
>> +
>> +			if (allowed)
>> +				pmu->fixed_ctr_ctrl_hw = pmu->fixed_ctr_ctrl;
>> +			else
>> +				pmu->fixed_ctr_ctrl_hw &=
>> +					~intel_fixed_bits_by_idx(idx, 0xf);
>> +		}
>> +
>> +		return 0;
> I think it's worth adding a helper for this, as it makes things a bit more
> self-documenting in terms of when KVM needs to "reprogram" mediated PMU PMCs.

Sure. Thanks.



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 00/38] Mediated vPMU 4.0 for x86
  2025-05-15  0:49 ` Sean Christopherson
@ 2025-05-15  5:45   ` Mi, Dapeng
  0 siblings, 0 replies; 127+ messages in thread
From: Mi, Dapeng @ 2025-05-15  5:45 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania


On 5/15/2025 8:49 AM, Sean Christopherson wrote:
> On Mon, Mar 24, 2025, Mingwei Zhang wrote:
>> Dapeng Mi (18):
>>   KVM: x86/pmu: Introduce enable_mediated_pmu global parameter
>>   KVM: x86/pmu: Check PMU cpuid configuration from user space
>>   KVM: x86: Rename vmx_vmentry/vmexit_ctrl() helpers
>>   KVM: x86/pmu: Add perf_capabilities field in struct kvm_host_values{}
>>   KVM: x86/pmu: Move PMU_CAP_{FW_WRITES,LBR_FMT} into msr-index.h header
>>   KVM: VMX: Add macros to wrap around
>>     {secondary,tertiary}_exec_controls_changebit()
>>   KVM: x86/pmu: Check if mediated vPMU can intercept rdpmc
>>   KVM: x86/pmu/vmx: Save/load guest IA32_PERF_GLOBAL_CTRL with
>>     vm_exit/entry_ctrl
>>   KVM: x86/pmu: Optimize intel/amd_pmu_refresh() helpers
>>   KVM: x86/pmu: Setup PMU MSRs' interception mode
>>   KVM: x86/pmu: Handle PMU MSRs interception and event filtering
>>   KVM: x86/pmu: Switch host/guest PMU context at vm-exit/vm-entry
>>   KVM: x86/pmu: Handle emulated instruction for mediated vPMU
>>   KVM: nVMX: Add macros to simplify nested MSR interception setting
>>   KVM: selftests: Add mediated vPMU supported for pmu tests
>>   KVM: Selftests: Support mediated vPMU for vmx_pmu_caps_test
>>   KVM: Selftests: Fix pmu_counters_test error for mediated vPMU
>>   KVM: x86/pmu: Expose enable_mediated_pmu parameter to user space
>>
>> Kan Liang (8):
>>   perf: Support get/put mediated PMU interfaces
>>   perf: Skip pmu_ctx based on event_type
>>   perf: Clean up perf ctx time
>>   perf: Add a EVENT_GUEST flag
>>   perf: Add generic exclude_guest support
>>   perf: Add switch_guest_ctx() interface
>>   perf/x86: Support switch_guest_ctx interface
>>   perf/x86/intel: Support PERF_PMU_CAP_MEDIATED_VPMU
>>
>> Mingwei Zhang (5):
>>   perf/x86: Forbid PMI handler when guest own PMU
>>   perf/x86/core: Plumb mediated PMU capability from x86_pmu to
>>     x86_pmu_cap
>>   KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot()
>>   KVM: x86/pmu: introduce eventsel_hw to prepare for pmu event filtering
>>   KVM: nVMX: Add nested virtualization support for mediated PMU
>>
>> Sandipan Das (4):
>>   perf/x86/core: Do not set bit width for unavailable counters
>>   KVM: x86/pmu: Add AMD PMU registers to direct access list
>>   KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest
>>     write to event selectors
>>   perf/x86/amd: Support PERF_PMU_CAP_MEDIATED_VPMU for AMD host
>>
>> Xiong Zhang (3):
>>   x86/irq: Factor out common code for installing kvm irq handler
>>   perf: core/x86: Register a new vector for KVM GUEST PMI
>>   KVM: x86/pmu: Register KVM_GUEST_PMI_VECTOR handler
> I ran out of time today and didn't get emails send for all patches.  I'm planning
> on getting that done tomorrow.
>
> I already have most of the proposed changes implemented:
>
>   https://github.com/sean-jc/linux.git x86/mediated_pmu
>
> It compiles and doesn't explode, but it's not fully functional (PMU tests fail).
> I'll poke at it over the next few days, but if someone is itching to figure out
> what I broke, then by all means.
>
> Given that I've already made many modifications (I have a hard time reviewing a
> series this big without editing as I go), unless someone objects, I'll post v5
> (and v6+ as needed), though that'll likely be days/weeks as I need to get it working,
> and want to do more passes over the code, shortlogs, and changelogs. 

Sean, thank you very much for reviewing and refactoring the patchset. I
will look at the code and check it locally in the next several days as well.



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 29/38] KVM: x86/pmu: Switch host/guest PMU context at vm-exit/vm-entry
  2025-03-24 17:31 ` [PATCH v4 29/38] KVM: x86/pmu: Switch host/guest PMU context at vm-exit/vm-entry Mingwei Zhang
@ 2025-05-15 16:29   ` Sean Christopherson
  2025-05-16  2:37     ` Mi, Dapeng
  2025-05-16 13:26   ` Sean Christopherson
  1 sibling, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-05-15 16:29 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania

On Mon, Mar 24, 2025, Mingwei Zhang wrote:
> diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h
> index 9159bf1a4730..35f27366c277 100644
> --- a/arch/x86/include/asm/kvm-x86-pmu-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h
> @@ -22,6 +22,8 @@ KVM_X86_PMU_OP(init)
>  KVM_X86_PMU_OP_OPTIONAL(reset)
>  KVM_X86_PMU_OP_OPTIONAL(deliver_pmi)
>  KVM_X86_PMU_OP_OPTIONAL(cleanup)
> +KVM_X86_PMU_OP(put_guest_context)
> +KVM_X86_PMU_OP(load_guest_context)

For KVM, the "guest_context" part is largely superfluous, as KVM always operates
on guest state, e.g. kvm_fpu_{load,put}().

I do think we should squeeze in "mediated" somewhere, otherwise it's hard to
see that these are specific to the mediated PMU.

So probably mediated_{load,put}()?
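
i.e. something like:

	KVM_X86_PMU_OP(mediated_put)
	KVM_X86_PMU_OP(mediated_load)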

>  #undef KVM_X86_PMU_OP
>  #undef KVM_X86_PMU_OP_OPTIONAL
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7ee74bbbb0aa..4117a382739a 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -568,6 +568,10 @@ struct kvm_pmu {
>  	u64 raw_event_mask;
>  	struct kvm_pmc gp_counters[KVM_MAX_NR_GP_COUNTERS];
>  	struct kvm_pmc fixed_counters[KVM_MAX_NR_FIXED_COUNTERS];
> +	u32 gp_eventsel_base;
> +	u32 gp_counter_base;
> +	u32 fixed_base;
> +	u32 cntr_shift;

Gah, my bad, "shift" was a terrible suggestion.  It should be "stride".

> @@ -306,6 +313,10 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
>  int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
>  void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel);
>  bool vcpu_pmu_can_enable(struct kvm_vcpu *vcpu);
> +void kvm_pmu_put_guest_pmcs(struct kvm_vcpu *vcpu);
> +void kvm_pmu_load_guest_pmcs(struct kvm_vcpu *vcpu);
> +void kvm_pmu_put_guest_context(struct kvm_vcpu *vcpu);
> +void kvm_pmu_load_guest_context(struct kvm_vcpu *vcpu);
>  
>  bool is_vmware_backdoor_pmc(u32 pmc_idx);
>  bool kvm_rdpmc_in_guest(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> index 1a7e3a897fdf..7e0d84d50b74 100644
> --- a/arch/x86/kvm/svm/pmu.c
> +++ b/arch/x86/kvm/svm/pmu.c
> @@ -175,6 +175,22 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  	return 1;
>  }
>  
> +static inline void amd_update_msr_base(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +
> +	if (kvm_pmu_has_perf_global_ctrl(pmu) ||
> +	    guest_cpu_cap_has(vcpu, X86_FEATURE_PERFCTR_CORE)) {
> +		pmu->gp_eventsel_base = MSR_F15H_PERF_CTL0;
> +		pmu->gp_counter_base = MSR_F15H_PERF_CTR0;
> +		pmu->cntr_shift = 2;
> +	} else {
> +		pmu->gp_eventsel_base = MSR_K7_EVNTSEL0;
> +		pmu->gp_counter_base = MSR_K7_PERFCTR0;
> +		pmu->cntr_shift = 1;
> +	}
> +}

Moving quoted text around to organize responses...

> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 796b7bc4affe..ed17ab198dfb 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -460,6 +460,17 @@ static void intel_pmu_enable_fixed_counter_bits(struct kvm_pmu *pmu, u64 bits)
>  		pmu->fixed_ctr_ctrl_rsvd &= ~intel_fixed_bits_by_idx(i, bits);
>  }
>  
> +static inline void intel_update_msr_base(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +
> +	pmu->gp_eventsel_base = MSR_P6_EVNTSEL0;
> +	pmu->gp_counter_base = fw_writes_is_enabled(vcpu) ?
> +			       MSR_IA32_PMC0 : MSR_IA32_PERFCTR0;

This is wrong.  And I unintentionally proved that it's wrong, by goofing when I
fixed up this code and using MSR_IA32_PERFCTR0 instead of MSR_IA32_PMC0.

Whether or not the guest supports full-width writes is irrelevant, because support
for FW writes doesn't change the width of the counters.  Just because the *guest* 
can't directly write all e.g. 48 bits doesn't mean clobbering bits 47:32 is ok.

Similarly, on the AMD side, using the legacy interface in KVM is unnecessary.
The guest may be limited to those MSRs, but KVM has a hard dependency on PMU v2,
so just unconditionally use MSR_F15H_PERF_CTR0 (and for the record, because I
had to look it up, the newfangled MSRs on AMD are aliased to the legacy MSRs for
0..3).

Very happily, that means the MSRs don't need to be per-PMU, and they don't even
need to be configured at runtime for a given vendor.  Simply require FW writes
on Intel to enable the mediated PMU, and then hardcode the GP base to MSR_IA32_PMC0.
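
Roughly (illustrative sketch only; the struct and variable names below are
made up, the MSR constants are the ones already used in the diff):

	struct kvm_pmu_msr_layout {
		u32 gp_eventsel_base;
		u32 gp_counter_base;
		u32 gp_msr_stride;
	};

	/* Intel: contiguous eventsels, full-width counter MSRs. */
	static const struct kvm_pmu_msr_layout intel_msr_layout = {
		.gp_eventsel_base = MSR_P6_EVNTSEL0,
		.gp_counter_base  = MSR_IA32_PMC0,
		.gp_msr_stride    = 1,
	};

	/* AMD: core-extension MSRs, CTL/CTR pairs interleaved (stride of 2). */
	static const struct kvm_pmu_msr_layout amd_msr_layout = {
		.gp_eventsel_base = MSR_F15H_PERF_CTL0,
		.gp_counter_base  = MSR_F15H_PERF_CTR0,
		.gp_msr_stride    = 2,
	};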

> +static void amd_put_guest_context(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +
> +	rdmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, pmu->global_ctrl);
> +	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, 0);
> +	rdmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS, pmu->global_status);
> +
> +	/* Clear global status bits if non-zero */
> +	if (pmu->global_status)
> +		wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR, pmu->global_status);
> +
> +	kvm_pmu_put_guest_pmcs(vcpu);
> +}
> +
> +static void amd_load_guest_context(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +	u64 global_status;
> +
> +	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, 0);

Back when I suggested we give up on trying to handle PMCs and eventsels in common
x86, this WRMSR didn't exist.  Now that it does, I don't see anything that prevents
invoking kvm_pmu_{load,put}_guest_pmcs() from common x86, KVM just needs to clear
GLOBAL_CTRL before setting eventsels and PMCs.

For the load path:

	/*
	 * Disable all counters before loading event selectors and PMCs so that
	 * KVM doesn't enable or load guest counters while host events are
	 * active.  VMX will enable/disabled counters at VM-Enter/VM-Exit by
	 * atomically loading PERF_GLOBAL_CONTROL.  SVM effectively performs
	 * the switch by configuring all events to be GUEST_ONLY.
	 */
	wrmsrl(kvm_pmu_ops.PERF_GLOBAL_CTRL, 0);

	kvm_pmu_load_guest_pmcs(vcpu);

	kvm_pmu_call(mediated_load)(vcpu);

And for the put path, just reverse the ordering:

	/*
	 * Defer handling of PERF_GLOBAL_CTRL to vendor code.  On Intel, it's
	 * atomically cleared on VM-Exit, i.e. doesn't need to be clear here.
	 */
	kvm_pmu_call(mediated_put)(vcpu);

	kvm_pmu_put_guest_pmcs(vcpu);

	perf_put_guest_context();

On Intel, PERF_GLOBAL_CTRL is cleared on VM-Exit, and on AMD, the vendor hook
will clear it.  The fact that vendor code sets other MSRs is irrelevant, what
matters is that all counters are quieseced.

I think it's still worth having helpers, but they can be static locals.

> +
> +	kvm_pmu_load_guest_pmcs(vcpu);
> +
> +	rdmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS, global_status);
> +	/* Clear host global_status MSR if non-zero. */
> +	if (global_status)
> +		wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR, global_status);
> +
> +	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET, pmu->global_status);
> +	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, pmu->global_ctrl);
> +}
> +
>  static void intel_pmu_update_msr_intercepts(struct kvm_vcpu *vcpu)
> @@ -809,6 +822,50 @@ void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu)
>  	}
>  }
>  
> +static void intel_put_guest_context(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +
> +	/* Global ctrl register is already saved at VM-exit. */
> +	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
> +
> +	/* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
> +	if (pmu->global_status)
> +		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
> +
> +	rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl_hw);
> +
> +	/*
> +	 * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
> +	 * also avoid these guest fixed counters get accidentially enabled
> +	 * during host running when host enable global ctrl.
> +	 */
> +	if (pmu->fixed_ctr_ctrl_hw)
> +		wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
> +
> +	kvm_pmu_put_guest_pmcs(vcpu);
> +}
> +
> +static void intel_load_guest_context(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +	u64 global_status, toggle;
> +
> +	/* Clear host global_ctrl MSR if non-zero. */
> +	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
> +
> +	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
> +	toggle = pmu->global_status ^ global_status;
> +	if (global_status & toggle)
> +		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status & toggle);
> +	if (pmu->global_status & toggle)
> +		wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status & toggle);
> +
> +	wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl_hw);
> +
> +	kvm_pmu_load_guest_pmcs(vcpu);
> +}

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 05/38] perf: Add generic exclude_guest support
  2025-05-14 23:19     ` Sean Christopherson
  2025-05-15  1:37       ` Mi, Dapeng
@ 2025-05-15 18:39       ` Liang, Kan
  2025-05-15 19:25         ` Sean Christopherson
  1 sibling, 1 reply; 127+ messages in thread
From: Liang, Kan @ 2025-05-15 18:39 UTC (permalink / raw)
  To: Sean Christopherson, Peter Zijlstra
  Cc: Mingwei Zhang, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania



On 2025-05-14 7:19 p.m., Sean Christopherson wrote:
> On Fri, Apr 25, 2025, Peter Zijlstra wrote:
>> On Mon, Mar 24, 2025 at 05:30:45PM +0000, Mingwei Zhang wrote:
>>
>>> @@ -6040,6 +6041,71 @@ void perf_put_mediated_pmu(void)
>>>  }
>>>  EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
>>>  
>>> +static inline void perf_host_exit(struct perf_cpu_context *cpuctx)
>>> +{
>>> +	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>>> +	ctx_sched_out(&cpuctx->ctx, NULL, EVENT_GUEST);
>>> +	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
>>> +	if (cpuctx->task_ctx) {
>>> +		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
>>> +		task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);
>>> +		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>>> +	}
>>> +}
>>> +
>>> +/* When entering a guest, schedule out all exclude_guest events. */
>>> +void perf_guest_enter(void)
>>> +{
>>> +	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>>> +
>>> +	lockdep_assert_irqs_disabled();
>>> +
>>> +	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>>> +
>>> +	if (WARN_ON_ONCE(__this_cpu_read(perf_in_guest)))
>>> +		goto unlock;
>>> +
>>> +	perf_host_exit(cpuctx);
>>> +
>>> +	__this_cpu_write(perf_in_guest, true);
>>> +
>>> +unlock:
>>> +	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>>> +}
>>> +EXPORT_SYMBOL_GPL(perf_guest_enter);
>>> +
>>> +static inline void perf_host_enter(struct perf_cpu_context *cpuctx)
>>> +{
>>> +	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>>> +	if (cpuctx->task_ctx)
>>> +		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
>>> +
>>> +	perf_event_sched_in(cpuctx, cpuctx->task_ctx, NULL, EVENT_GUEST);
>>> +
>>> +	if (cpuctx->task_ctx)
>>> +		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>>> +	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
>>> +}
>>> +
>>> +void perf_guest_exit(void)
>>> +{
>>> +	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>>> +
>>> +	lockdep_assert_irqs_disabled();
>>> +
>>> +	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>>> +
>>> +	if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
>>> +		goto unlock;
>>> +
>>> +	perf_host_enter(cpuctx);
>>> +
>>> +	__this_cpu_write(perf_in_guest, false);
>>> +unlock:
>>> +	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>>> +}
>>> +EXPORT_SYMBOL_GPL(perf_guest_exit);
>>
>> This naming is confusing on purpose? Pick either guest/host and stick
>> with it.
> 
> +1.  I also think the inner perf_host_{enter,exit}() helpers are superfluous.
> These flows
> 
> After a bit of hacking, and with a few spoilers, this is what I ended up with
> (not anywhere near fully tested).  I like following KVM's kvm_xxx_{load,put}()
> nomenclature to tie everything together, so I went with "guest" instead of "host"
> even though the majority of work being done is to schedule out/in host context.
> 
> /* When loading a guest's mediated PMU, schedule out all exclude_guest events. */
> void perf_load_guest_context(unsigned long data)
> {
> 	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
> 
> 	lockdep_assert_irqs_disabled();
> 
> 	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> 
> 	if (WARN_ON_ONCE(__this_cpu_read(guest_ctx_loaded)))
> 		goto unlock;
> 
> 	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
> 	ctx_sched_out(&cpuctx->ctx, NULL, EVENT_GUEST);
> 	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
> 	if (cpuctx->task_ctx) {
> 		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
> 		task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);
> 		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
> 	}
> 
> 	arch_perf_load_guest_context(data);
> 
> 	__this_cpu_write(guest_ctx_loaded, true);
> 
> unlock:
> 	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> }
> EXPORT_SYMBOL_GPL(perf_load_guest_context);
> 
> void perf_put_guest_context(void)
> {
> 	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
> 
> 	lockdep_assert_irqs_disabled();
> 
> 	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> 
> 	if (WARN_ON_ONCE(!__this_cpu_read(guest_ctx_loaded)))
> 		goto unlock;
> 
> 	arch_perf_put_guest_context();

It will set the guest_ctx_loaded to false.
The update_context_time() invoked in the perf_event_sched_in() will not
get a chance to update the guest time.

I think something as below should work.

- Disable all in the PMU (disable global control)
- schedule in the host counters (but not run yet since the global
control of the PMU is disabled)
- arch_perf_put_guest_context()
- Enable all in the PMU (Enable global control. The host counters now start)

void perf_put_guest_context(void)
{
	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);

	lockdep_assert_irqs_disabled();

	perf_ctx_lock(cpuctx, cpuctx->task_ctx);

	if (WARN_ON_ONCE(!__this_cpu_read(guest_ctx_loaded)))
		goto unlock;

	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
	if (cpuctx->task_ctx)
		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);

	perf_event_sched_in(cpuctx, cpuctx->task_ctx, NULL, EVENT_GUEST);

	arch_perf_put_guest_context();

	if (cpuctx->task_ctx)
		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);

unlock:
	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
}



Similar to the perf_load_guest_context().

- Disable all in the PMU (disable global control)
- schedule out all the host counters
- arch_perf_load_guest_context()
- Enable all in the PMU (enable global control)

void perf_load_guest_context(unsigned long data)
{
	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);

	lockdep_assert_irqs_disabled();

	perf_ctx_lock(cpuctx, cpuctx->task_ctx);

	if (WARN_ON_ONCE(__this_cpu_read(guest_ctx_loaded)))
		goto unlock;

	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
	ctx_sched_out(&cpuctx->ctx, NULL, EVENT_GUEST);
	if (cpuctx->task_ctx) {
		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
		task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);
	}

	arch_perf_load_guest_context(data);

	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
	if (cpuctx->task_ctx)
		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
unlock:
	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
}


Thanks,
Kan

> 
> 	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
> 	if (cpuctx->task_ctx)
> 		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
> 
> 	perf_event_sched_in(cpuctx, cpuctx->task_ctx, NULL, EVENT_GUEST);
> 
> 	if (cpuctx->task_ctx)
> 		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
> 	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
> 
> 	__this_cpu_write(guest_ctx_loaded, false);
> unlock:
> 	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> }
> EXPORT_SYMBOL_GPL(perf_put_guest_context);


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 23/38] KVM: x86/pmu: Configure the interception of PMU MSRs
  2025-05-15  5:37     ` Mi, Dapeng
@ 2025-05-15 19:06       ` Sean Christopherson
  0 siblings, 0 replies; 127+ messages in thread
From: Sean Christopherson @ 2025-05-15 19:06 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Mingwei Zhang, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Paolo Bonzini,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Yongwei Ma, Xiong Zhang,
	Jim Mattson, Sandipan Das, Zide Chen, Eranian Stephane,
	Shukla Manali, Nikunj Dadhania

On Thu, May 15, 2025, Dapeng Mi wrote:
> On 5/15/2025 8:41 AM, Sean Christopherson wrote:
> >> +	if (kvm_mediated_pmu_enabled(vcpu) && kvm_pmu_has_perf_global_ctrl(pmu) &&
> > Just require the guest to have PERF_GLOBAL_CTRL, I don't see any reason to support
> > v1 PMUs.  It adds complexity and weirdness, and I can't imagine there's a use case.

I take that back, there absolutely are use cases, especially for AMD.  Any VM
shape that exists today should be compatible with the mediated PMU.  And I was
dead wrong about adding complexity; KVM already needs to intercept GLOBAL_CTRL
if the guest has fewer PMCs than hardware, so incorporating this check is all of
two lines of code.
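
E.g. something like (sketch, reusing the helpers from the quoted diff):

	/*
	 * Pass PERF_GLOBAL_CTRL through only if the guest has a v2+ PMU and
	 * owns all hardware GP counters; otherwise keep intercepting it.
	 */
	bool intercept_global_ctrl = !kvm_pmu_has_perf_global_ctrl(pmu) ||
				     pmu->nr_arch_gp_counters < kvm_pmu_cap.num_counters_gp;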

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 22/38] KVM: x86/pmu: Optimize intel/amd_pmu_refresh() helpers
  2025-05-15  5:09     ` Mi, Dapeng
@ 2025-05-15 19:22       ` Sean Christopherson
  2025-05-16  1:03         ` Mi, Dapeng
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-05-15 19:22 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Mingwei Zhang, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Paolo Bonzini,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Yongwei Ma, Xiong Zhang,
	Jim Mattson, Sandipan Das, Zide Chen, Eranian Stephane,
	Shukla Manali, Nikunj Dadhania

On Thu, May 15, 2025, Dapeng Mi wrote:
> On 5/15/2025 8:37 AM, Sean Christopherson wrote:
> >> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> >> index 153972e944eb..eba086ef5eca 100644
> >> --- a/arch/x86/kvm/svm/pmu.c
> >> +++ b/arch/x86/kvm/svm/pmu.c
> >> @@ -198,12 +198,20 @@ static void __amd_pmu_refresh(struct kvm_vcpu *vcpu)
> >>  	pmu->nr_arch_gp_counters = min_t(unsigned int, pmu->nr_arch_gp_counters,
> >>  					 kvm_pmu_cap.num_counters_gp);
> >>  
> >> -	if (pmu->version > 1) {
> >> -		pmu->global_ctrl_rsvd = ~((1ull << pmu->nr_arch_gp_counters) - 1);
> >> +	if (kvm_pmu_cap.version > 1) {

ARGH!!!!!  I saw the pmu->version => kvm_pmu_cap.version change when going through
this patch, but assumed it was simply a refactoring bug and ignored it.

Nope, 100% intentional, as I discovered after spending the better part of an hour
debugging.  Stuffing pmu->global_ctrl when it doesn't exist is necessary when the
mediated PMU is enabled, because pmu->global_ctrl will always be loaded in hardware.
And so loading '0' means the PMU is effectively disabled, because KVM disallows the
guest from writing the MSR.

_That_ is the type of thing that absolutely needs a comment and a lengthy explanation
in the changelog.
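
Something like this, purely as a sketch of the missing comment:

	/*
	 * Stuff pmu->global_ctrl even if the guest PMU is v1, i.e. key off of
	 * the host's kvm_pmu_cap.version: with the mediated PMU,
	 * pmu->global_ctrl is always loaded into hardware, and KVM disallows
	 * a v1 guest from writing PERF_GLOBAL_CTRL, so leaving it '0' would
	 * effectively disable all guest counters.
	 */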

Intel also needs the same treatment for PMU v1.  And since there's no hackery that
I can see, that suggests PMU v1 guests haven't been tested with the mediated PMU
on Intel.

I guess congratulations are in order, because this patch just became my new goto
example of why I'm so strict about the "one thing per patch" rule.

> > It's not just global_ctrl.  PEBS and the fixed counters also depend on v2+ (the
> > SDM contradicts itself; KVM's ABI is that they're v2+).
> >
> >> +		/*
> >> +		 * At RESET, AMD CPUs set all enable bits for general purpose counters in
> >> +		 * IA32_PERF_GLOBAL_CTRL (so that software that was written for v1 PMUs
> >> +		 * don't unknowingly leave GP counters disabled in the global controls).
> >> +		 * Emulate that behavior when refreshing the PMU so that userspace doesn't
> >> +		 * need to manually set PERF_GLOBAL_CTRL.
> >> +		 */
> >> +		pmu->global_ctrl = BIT_ULL(pmu->nr_arch_gp_counters) - 1;
> >> +		pmu->global_ctrl_rsvd = ~pmu->global_ctrl;
> >>  		pmu->global_status_rsvd = pmu->global_ctrl_rsvd;
> >>  	}

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 05/38] perf: Add generic exclude_guest support
  2025-05-15 18:39       ` Liang, Kan
@ 2025-05-15 19:25         ` Sean Christopherson
  2025-05-15 20:18           ` Liang, Kan
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-05-15 19:25 UTC (permalink / raw)
  To: Kan Liang
  Cc: Peter Zijlstra, Mingwei Zhang, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Paolo Bonzini,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Yongwei Ma, Xiong Zhang,
	Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen, Eranian Stephane,
	Shukla Manali, Nikunj Dadhania

On Thu, May 15, 2025, Kan Liang wrote:
> On 2025-05-14 7:19 p.m., Sean Christopherson wrote:
> >> This naming is confusing on purpose? Pick either guest/host and stick
> >> with it.
> > 
> > +1.  I also think the inner perf_host_{enter,exit}() helpers are superfluous.
> > These flows
> > 
> > After a bit of hacking, and with a few spoilers, this is what I ended up with
> > (not anywhere near fully tested).  I like following KVM's kvm_xxx_{load,put}()
> > nomenclature to tie everything together, so I went with "guest" instead of "host"
> > even though the majority of work being done is to schedule out/in host context.
> > 
> > /* When loading a guest's mediated PMU, schedule out all exclude_guest events. */
> > void perf_load_guest_context(unsigned long data)
> > {
> > 	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
> > 
> > 	lockdep_assert_irqs_disabled();
> > 
> > 	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> > 
> > 	if (WARN_ON_ONCE(__this_cpu_read(guest_ctx_loaded)))
> > 		goto unlock;
> > 
> > 	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
> > 	ctx_sched_out(&cpuctx->ctx, NULL, EVENT_GUEST);
> > 	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
> > 	if (cpuctx->task_ctx) {
> > 		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
> > 		task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);
> > 		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
> > 	}
> > 
> > 	arch_perf_load_guest_context(data);
> > 
> > 	__this_cpu_write(guest_ctx_loaded, true);
> > 
> > unlock:
> > 	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> > }
> > EXPORT_SYMBOL_GPL(perf_load_guest_context);
> > 
> > void perf_put_guest_context(void)
> > {
> > 	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
> > 
> > 	lockdep_assert_irqs_disabled();
> > 
> > 	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> > 
> > 	if (WARN_ON_ONCE(!__this_cpu_read(guest_ctx_loaded)))
> > 		goto unlock;
> > 
> > 	arch_perf_put_guest_context();
> 
> It will set the guest_ctx_loaded to false.
> The update_context_time() invoked in the perf_event_sched_in() will not
> get a chance to update the guest time.

The guest_ctx_loaded in arch/x86/events/core.c is a different variable, it just
happens to have the same name.

It's completely gross, but exposing guest_ctx_loaded outside of kernel/events/core.c
didn't seem like a great alternative.  If we wanted to use a single variable,
then the writes in arch_perf_{load,put}_guest_context() can simply go away.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 05/38] perf: Add generic exclude_guest support
  2025-05-15 19:25         ` Sean Christopherson
@ 2025-05-15 20:18           ` Liang, Kan
  0 siblings, 0 replies; 127+ messages in thread
From: Liang, Kan @ 2025-05-15 20:18 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Peter Zijlstra, Mingwei Zhang, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Paolo Bonzini,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Yongwei Ma, Xiong Zhang,
	Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen, Eranian Stephane,
	Shukla Manali, Nikunj Dadhania



On 2025-05-15 3:25 p.m., Sean Christopherson wrote:
> On Thu, May 15, 2025, Kan Liang wrote:
>> On 2025-05-14 7:19 p.m., Sean Christopherson wrote:
>>>> This naming is confusing on purpose? Pick either guest/host and stick
>>>> with it.
>>>
>>> +1.  I also think the inner perf_host_{enter,exit}() helpers are superfluous.
>>> These flows
>>>
>>> After a bit of hacking, and with a few spoilers, this is what I ended up with
>>> (not anywhere near fully tested).  I like following KVM's kvm_xxx_{load,put}()
>>> nomenclature to tie everything together, so I went with "guest" instead of "host"
>>> even though the majority of work being done is to schedule out/in host context.
>>>
>>> /* When loading a guest's mediated PMU, schedule out all exclude_guest events. */
>>> void perf_load_guest_context(unsigned long data)
>>> {
>>> 	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>>>
>>> 	lockdep_assert_irqs_disabled();
>>>
>>> 	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>>>
>>> 	if (WARN_ON_ONCE(__this_cpu_read(guest_ctx_loaded)))
>>> 		goto unlock;
>>>
>>> 	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>>> 	ctx_sched_out(&cpuctx->ctx, NULL, EVENT_GUEST);
>>> 	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
>>> 	if (cpuctx->task_ctx) {
>>> 		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
>>> 		task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);
>>> 		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>>> 	}
>>>
>>> 	arch_perf_load_guest_context(data);
>>>
>>> 	__this_cpu_write(guest_ctx_loaded, true);
>>>
>>> unlock:
>>> 	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>>> }
>>> EXPORT_SYMBOL_GPL(perf_load_guest_context);
>>>
>>> void perf_put_guest_context(void)
>>> {
>>> 	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>>>
>>> 	lockdep_assert_irqs_disabled();
>>>
>>> 	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>>>
>>> 	if (WARN_ON_ONCE(!__this_cpu_read(guest_ctx_loaded)))
>>> 		goto unlock;
>>>
>>> 	arch_perf_put_guest_context();
>>
>> It will set the guest_ctx_loaded to false.
>> The update_context_time() invoked in the perf_event_sched_in() will not
>> get a chance to update the guest time.
> 
> The guest_ctx_loaded in arch/x86/events/core.c is a different variable, it just
> happens to have the same name.
>

Ah, I thought it was the same variable. Then there should be no issue
with the guest time.

But the same variable name may bring confusion. Maybe add an x86_pmu/x86
prefix to the variable in the x86 code.
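
E.g. (just a sketch):

	/*
	 * arch/x86/events/core.c: hypothetical rename to avoid the clash with
	 * the identically named flag in kernel/events/core.c.
	 */
	static DEFINE_PER_CPU(bool, x86_guest_ctx_loaded);
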
> It's completely gross, but exposing guest_ctx_loaded outside of kernel/events/core.c
> didn't seem like a great alternative.  If we wanted to use a single variable,
> then the writes in arch_perf_{load,put}_guest_context() can simply go away.
> 
Either two variables or a single variable is fine with me, as long as
they can be easily distinguished.

But I think we should put arch_perf_{load,put}_guest_context() and
guest_ctx_loaded between the perf_ctx_disable/enable() pair.
Perf should only update the state when the PMU is completely disabled.
That matches the way the rest of the perf code works.
It would also avoid some potential issues, e.g., a counter firing before
the state has been completely updated.

Thanks,
Kan



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 22/38] KVM: x86/pmu: Optimize intel/amd_pmu_refresh() helpers
  2025-05-15 19:22       ` Sean Christopherson
@ 2025-05-16  1:03         ` Mi, Dapeng
  0 siblings, 0 replies; 127+ messages in thread
From: Mi, Dapeng @ 2025-05-16  1:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Mingwei Zhang, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Paolo Bonzini,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Yongwei Ma, Xiong Zhang,
	Jim Mattson, Sandipan Das, Zide Chen, Eranian Stephane,
	Shukla Manali, Nikunj Dadhania


On 5/16/2025 3:22 AM, Sean Christopherson wrote:
> On Thu, May 15, 2025, Dapeng Mi wrote:
>> On 5/15/2025 8:37 AM, Sean Christopherson wrote:
>>>> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
>>>> index 153972e944eb..eba086ef5eca 100644
>>>> --- a/arch/x86/kvm/svm/pmu.c
>>>> +++ b/arch/x86/kvm/svm/pmu.c
>>>> @@ -198,12 +198,20 @@ static void __amd_pmu_refresh(struct kvm_vcpu *vcpu)
>>>>  	pmu->nr_arch_gp_counters = min_t(unsigned int, pmu->nr_arch_gp_counters,
>>>>  					 kvm_pmu_cap.num_counters_gp);
>>>>  
>>>> -	if (pmu->version > 1) {
>>>> -		pmu->global_ctrl_rsvd = ~((1ull << pmu->nr_arch_gp_counters) - 1);
>>>> +	if (kvm_pmu_cap.version > 1) {
> ARGH!!!!!  I saw the pmu->version => kvm_pmu_cap.version change when going through
> this patch, but assumed it was simply a refactoring bug and ignored it.
>
> Nope, 100% intentional, as I discovered after spending the better part of an hour
> debugging.  Stuffing pmu->global_ctrl when it doesn't exist is necessary when the
> mediated PMU is enabled, because pmu->global_ctrl will always be loaded in hardware.
> And so loading '0' means the PMU is effectively disabled, because KVM disallows the
> guest from writing the MSR.
>
> _That_ is the type of thing that absolutely needs a comment and a lengthy explanation
> in the changelog.

Yes, this change is intended.  As long as the HW supports the GLOBAL_CTRL
MSR, KVM should write a valid value into it at vm-entry regardless of whether
the guest PMU has GLOBAL_CTRL (pmu version < 2); otherwise the guest counters
won't really be enabled in HW.

Yeah, it's indeed easy to get confused by, and it deserves a comment here. My fault ...


>
> Intel also needs the same treatment for PMU v1.  And since there's no hackery that
> I can see, that suggests PMU v1 guests haven't been tested with the mediated PMU
> on Intel.
>
> I guess congratulations are in order, because this patch just became my new goto
> example of why I'm so strict about on the "one thing per patch" rule.

Embarrassing ....


>
>>> It's not just global_ctrl.  PEBS and the fixed counters also depend on v2+ (the
>>> SDM contradicts itself; KVM's ABI is that they're v2+).
>>>
>>>> +		/*
>>>> +		 * At RESET, AMD CPUs set all enable bits for general purpose counters in
>>>> +		 * IA32_PERF_GLOBAL_CTRL (so that software that was written for v1 PMUs
>>>> +		 * don't unknowingly leave GP counters disabled in the global controls).
>>>> +		 * Emulate that behavior when refreshing the PMU so that userspace doesn't
>>>> +		 * need to manually set PERF_GLOBAL_CTRL.
>>>> +		 */
>>>> +		pmu->global_ctrl = BIT_ULL(pmu->nr_arch_gp_counters) - 1;
>>>> +		pmu->global_ctrl_rsvd = ~pmu->global_ctrl;
>>>>  		pmu->global_status_rsvd = pmu->global_ctrl_rsvd;
>>>>  	}

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 30/38] KVM: x86/pmu: Handle emulated instruction for mediated vPMU
  2025-03-24 17:31 ` [PATCH v4 30/38] KVM: x86/pmu: Handle emulated instruction for mediated vPMU Mingwei Zhang
@ 2025-05-16  1:10   ` Sean Christopherson
  0 siblings, 0 replies; 127+ messages in thread
From: Sean Christopherson @ 2025-05-16  1:10 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania

On Mon, Mar 24, 2025, Mingwei Zhang wrote:
>  static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
>  {
> -	pmc->emulated_counter++;
> -	kvm_pmu_request_counter_reprogram(pmc);
> +	struct kvm_vcpu *vcpu = pmc->vcpu;
> +
> +	/*
> +	 * For perf-based PMUs, accumulate software-emulated events separately
> +	 * from pmc->counter, as pmc->counter is offset by the count of the
> +	 * associated perf event. Request reprogramming, which will consult
> +	 * both emulated and hardware-generated events to detect overflow.
> +	 */
> +	if (!kvm_mediated_pmu_enabled(vcpu)) {
> +		pmc->emulated_counter++;
> +		kvm_pmu_request_counter_reprogram(pmc);
> +		return;
> +	}
> +
> +	/*
> +	 * For mediated PMUs, pmc->counter is updated when the vCPU's PMU is
> +	 * put, and will be loaded into hardware when the PMU is loaded. Simply
> +	 * increment the counter and signal overflow if it wraps to zero.
> +	 */
> +	pmc->counter = (pmc->counter + 1) & pmc_bitmask(pmc);
> +	if (!pmc->counter) {

Ugh, this is broken for the fastpath.  If kvm_skip_emulated_instruction() is
invoked by handle_fastpath_set_msr_irqoff() or handle_fastpath_hlt(), KVM may
consume stale information (GLOBAL_CTRL, GLOBAL_STATUS and PMCs), and even if KVM
gets lucky and those are all fresh, the PMC and GLOBAL_STATUS changes won't be
propagated back to hardware.

The best idea I have is to track whether or not the guest may be counting branches
and/or instruction based on eventsels, and then bail from fastpaths that need to
skip instructions.  That flag would also be useful to further optimize
kvm_pmu_trigger_event().
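
Rough sketch (the field and helper names below are hypothetical):

	static bool kvm_pmu_counts_skipped_instructions(struct kvm_vcpu *vcpu)
	{
		/*
		 * Hypothetical flag, set during reprogramming if any enabled
		 * eventsel may count retired instructions or branches.
		 */
		return vcpu_to_pmu(vcpu)->may_count_instructions;
	}

	/* ...and in the fastpaths that would skip an instruction: */
	if (kvm_pmu_counts_skipped_instructions(vcpu))
		return EXIT_FASTPATH_NONE;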

> +		pmc_to_pmu(pmc)->global_status |= BIT_ULL(pmc->idx);
> +		if (pmc_pmi_enabled(pmc))
> +			kvm_make_request(KVM_REQ_PMI, vcpu);
> +	}
>  }
>  
>  static inline bool cpl_is_matched(struct kvm_pmc *pmc)
> -- 
> 2.49.0.395.g12beb8f557-goog
> 

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 27/38] KVM: x86/pmu: Handle PMU MSRs interception and event filtering
  2025-03-24 17:31 ` [PATCH v4 27/38] KVM: x86/pmu: Handle PMU MSRs interception and " Mingwei Zhang
  2025-05-15  0:43   ` Sean Christopherson
@ 2025-05-16  1:26   ` Mi, Dapeng
  2025-05-16 20:54     ` Sean Christopherson
  1 sibling, 1 reply; 127+ messages in thread
From: Mi, Dapeng @ 2025-05-16  1:26 UTC (permalink / raw)
  To: Mingwei Zhang, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Sean Christopherson,
	Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Yongwei Ma, Xiong Zhang,
	Jim Mattson, Sandipan Das, Zide Chen, Eranian Stephane,
	Shukla Manali, Nikunj Dadhania


On 3/25/2025 1:31 AM, Mingwei Zhang wrote:
> From: Dapeng Mi <dapeng1.mi@linux.intel.com>
>
> Mediated vPMU needs to intercept EVENTSELx and FIXED_CNTR_CTRL MSRs to
> filter out guest malicious perf events. Either writing these MSRs or
> updating event filters would call reprogram_counter() eventually. Thus
> check if the guest event should be filtered out in reprogram_counter().
> If so, clear corresponding EVENTSELx MSR or FIXED_CNTR_CTRL field to
> ensure the guest event won't be really enabled at vm-entry.
>
> Besides, mediated vPMU intercepts the MSRs of these guest not owned
> counters and it just needs simply to read/write from/to pmc->counter.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Co-developed-by: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/pmu.c | 27 +++++++++++++++++++++++++++
>  arch/x86/kvm/pmu.h |  3 +++
>  2 files changed, 30 insertions(+)
>
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 63143eeb5c44..e9100dc49fdc 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -305,6 +305,11 @@ static void pmc_update_sample_period(struct kvm_pmc *pmc)
>  
>  void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
>  {
> +	if (kvm_mediated_pmu_enabled(pmc->vcpu)) {
> +		pmc->counter = val & pmc_bitmask(pmc);
> +		return;
> +	}
> +
>  	/*
>  	 * Drop any unconsumed accumulated counts, the WRMSR is a write, not a
>  	 * read-modify-write.  Adjust the counter value so that its value is
> @@ -455,6 +460,28 @@ static int reprogram_counter(struct kvm_pmc *pmc)
>  	bool emulate_overflow;
>  	u8 fixed_ctr_ctrl;
>  
> +	if (kvm_mediated_pmu_enabled(pmu_to_vcpu(pmu))) {
> +		bool allowed = check_pmu_event_filter(pmc);
> +
> +		if (pmc_is_gp(pmc)) {
> +			if (allowed)
> +				pmc->eventsel_hw |= pmc->eventsel &
> +						    ARCH_PERFMON_EVENTSEL_ENABLE;
> +			else
> +				pmc->eventsel_hw &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
> +		} else {
> +			int idx = pmc->idx - KVM_FIXED_PMC_BASE_IDX;
> +
> +			if (allowed)
> +				pmu->fixed_ctr_ctrl_hw = pmu->fixed_ctr_ctrl;

Sean, I just found a potential bug here.  The whole "pmu->fixed_ctr_ctrl"
should not be assigned to "pmu->fixed_ctr_ctrl_hw" here, otherwise other,
filtered fixed counters (not just this allowed fixed counter) could be
enabled accidentally.

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index ba9d336f1d1d..f32e5f66f73b 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -473,7 +473,8 @@ static int reprogram_counter(struct kvm_pmc *pmc)
                        int idx = pmc->idx - KVM_FIXED_PMC_BASE_IDX;

                        if (allowed)
-                               pmu->fixed_ctr_ctrl_hw = pmu->fixed_ctr_ctrl;
+                               pmu->fixed_ctr_ctrl_hw |= pmu->fixed_ctr_ctrl &
+                                               intel_fixed_bits_by_idx(idx, 0xf);
                        else
                                pmu->fixed_ctr_ctrl_hw &=
                                        ~intel_fixed_bits_by_idx(idx, 0xf);

> +			else
> +				pmu->fixed_ctr_ctrl_hw &=
> +					~intel_fixed_bits_by_idx(idx, 0xf);
> +		}
> +
> +		return 0;
> +	}
> +
>  	emulate_overflow = pmc_pause_counter(pmc);
>  
>  	if (!pmc_event_is_allowed(pmc))
> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index 509c995b7871..6289f523d893 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -113,6 +113,9 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
>  {
>  	u64 counter, enabled, running;
>  
> +	if (kvm_mediated_pmu_enabled(pmc->vcpu))
> +		return pmc->counter & pmc_bitmask(pmc);
> +
>  	counter = pmc->counter + pmc->emulated_counter;
>  
>  	if (pmc->perf_event && !pmc->is_paused)

^ permalink raw reply related	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 29/38] KVM: x86/pmu: Switch host/guest PMU context at vm-exit/vm-entry
  2025-05-15 16:29   ` Sean Christopherson
@ 2025-05-16  2:37     ` Mi, Dapeng
  0 siblings, 0 replies; 127+ messages in thread
From: Mi, Dapeng @ 2025-05-16  2:37 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania


On 5/16/2025 12:29 AM, Sean Christopherson wrote:
> On Mon, Mar 24, 2025, Mingwei Zhang wrote:
>> diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h
>> index 9159bf1a4730..35f27366c277 100644
>> --- a/arch/x86/include/asm/kvm-x86-pmu-ops.h
>> +++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h
>> @@ -22,6 +22,8 @@ KVM_X86_PMU_OP(init)
>>  KVM_X86_PMU_OP_OPTIONAL(reset)
>>  KVM_X86_PMU_OP_OPTIONAL(deliver_pmi)
>>  KVM_X86_PMU_OP_OPTIONAL(cleanup)
>> +KVM_X86_PMU_OP(put_guest_context)
>> +KVM_X86_PMU_OP(load_guest_context)
> For KVM, the "guest_context" part is largely superfluous, as KVM always operates
> on guest state, e.g. kvm_fpu_{load,put}().
>
> I do think we should squeeze in "mediated" somewhere, otherwise the it's hard to
> see that these are specific to the mediated PMU.
>
> So probably mediated_{load,put}()?

After moving all of the PMC manipulation into KVM common code, these two
helpers actually only manipulate the globally shared MSRs right now, so I'm not
sure whether adding an extra "global" to the op names would make them more
descriptive.  But considering we may add more MSRs, e.g. the PEBS MSRs, to
these two helpers, I'm ok with mediated_{load,put}.
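
The op table change would then presumably look something like this (just a
sketch, assuming the rename lands as suggested):

	KVM_X86_PMU_OP_OPTIONAL(cleanup)
	KVM_X86_PMU_OP(mediated_load)
	KVM_X86_PMU_OP(mediated_put)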


>
>>  #undef KVM_X86_PMU_OP
>>  #undef KVM_X86_PMU_OP_OPTIONAL
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 7ee74bbbb0aa..4117a382739a 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -568,6 +568,10 @@ struct kvm_pmu {
>>  	u64 raw_event_mask;
>>  	struct kvm_pmc gp_counters[KVM_MAX_NR_GP_COUNTERS];
>>  	struct kvm_pmc fixed_counters[KVM_MAX_NR_FIXED_COUNTERS];
>> +	u32 gp_eventsel_base;
>> +	u32 gp_counter_base;
>> +	u32 fixed_base;
>> +	u32 cntr_shift;
> Gah, my bad, "shift" was a terrible suggestion.  It should be "stride".

Yes.
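
For reference, the stride would then be used something like this (sketch only;
the actual pmc_msr_addr() definition in the series may differ):

	u32 gp_eventsel_base;
	u32 gp_counter_base;
	u32 fixed_base;
	u32 cntr_stride;

	/* hypothetical helper: base + index * stride */
	static inline u32 pmc_msr_addr(struct kvm_pmu *pmu, u32 base, int idx)
	{
		return base + idx * pmu->cntr_stride;
	}

which makes the MSR_F15H_PERF_CTL0/CTR0 layout (stride 2) and the Intel MSRs
(stride 1) fall out naturally.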


>
>> @@ -306,6 +313,10 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
>>  int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
>>  void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel);
>>  bool vcpu_pmu_can_enable(struct kvm_vcpu *vcpu);
>> +void kvm_pmu_put_guest_pmcs(struct kvm_vcpu *vcpu);
>> +void kvm_pmu_load_guest_pmcs(struct kvm_vcpu *vcpu);
>> +void kvm_pmu_put_guest_context(struct kvm_vcpu *vcpu);
>> +void kvm_pmu_load_guest_context(struct kvm_vcpu *vcpu);
>>  
>>  bool is_vmware_backdoor_pmc(u32 pmc_idx);
>>  bool kvm_rdpmc_in_guest(struct kvm_vcpu *vcpu);
>> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
>> index 1a7e3a897fdf..7e0d84d50b74 100644
>> --- a/arch/x86/kvm/svm/pmu.c
>> +++ b/arch/x86/kvm/svm/pmu.c
>> @@ -175,6 +175,22 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>  	return 1;
>>  }
>>  
>> +static inline void amd_update_msr_base(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> +
>> +	if (kvm_pmu_has_perf_global_ctrl(pmu) ||
>> +	    guest_cpu_cap_has(vcpu, X86_FEATURE_PERFCTR_CORE)) {
>> +		pmu->gp_eventsel_base = MSR_F15H_PERF_CTL0;
>> +		pmu->gp_counter_base = MSR_F15H_PERF_CTR0;
>> +		pmu->cntr_shift = 2;
>> +	} else {
>> +		pmu->gp_eventsel_base = MSR_K7_EVNTSEL0;
>> +		pmu->gp_counter_base = MSR_K7_PERFCTR0;
>> +		pmu->cntr_shift = 1;
>> +	}
>> +}
> Moving quoted text around to organize responses...
>
>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
>> index 796b7bc4affe..ed17ab198dfb 100644
>> --- a/arch/x86/kvm/vmx/pmu_intel.c
>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
>> @@ -460,6 +460,17 @@ static void intel_pmu_enable_fixed_counter_bits(struct kvm_pmu *pmu, u64 bits)
>>  		pmu->fixed_ctr_ctrl_rsvd &= ~intel_fixed_bits_by_idx(i, bits);
>>  }
>>  
>> +static inline void intel_update_msr_base(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> +
>> +	pmu->gp_eventsel_base = MSR_P6_EVNTSEL0;
>> +	pmu->gp_counter_base = fw_writes_is_enabled(vcpu) ?
>> +			       MSR_IA32_PMC0 : MSR_IA32_PERFCTR0;
> This is wrong.  And I unintentionally proved that it's wrong, by goofing when I
> fixed up this code and using MSR_IA32_PERFCTR0 instead of MSR_IA32_PMC0.
>
> Whether or not the guest supports full-width writes is irrelevant, because support
> for FW writes doesn't change the width of the counters.  Just because the *guest* 
> can't directly write all e.g. 48 bits doesn't mean clobbering bits 47:32 is ok.
>
> Similarly, on the AMD side, using the legacy interface in KVM is unnecessary.
> The guest may be limited to those MSRs, but KVM has a hard dependency on PMU v2,
> so just unconditionally use MSR_F15H_PERF_CTR0 (and for the record, because I
> had to look it up, the newfangled MSRs on AMD are aliased to the legacy MSRs for
> 0..3).
>
> Very happily, that means the MSRs don't need to be per-PMU, and they don't even
> need to be configured at runtime for a given vendor.  Simply require FW writes
> on Intel to enable the mediated PMU, and then hardcode the GP base to MSR_IA32_PMC0.

Since the mediated vPMU requires HW PMU version 4 at least, I think it's safe
to set the GP base to MSR_IA32_PMC0.
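
With the bases hardcoded, the save/restore loops could simply do something
like (untested sketch):

	/*
	 * Intel: FW writes are required for the mediated PMU, so MSR_IA32_PMC0
	 * is always valid and full-width.
	 */
	wrmsrl(MSR_IA32_PMC0 + i, pmc->counter);

	/*
	 * AMD: PMU v2 is a hard requirement, and the extended MSRs alias the
	 * legacy K7 MSRs for counters 0..3.
	 */
	wrmsrl(MSR_F15H_PERF_CTR0 + 2 * i, pmc->counter);

and the per-PMU gp_counter_base/gp_eventsel_base fields can go away.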


>
>> +static void amd_put_guest_context(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> +
>> +	rdmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, pmu->global_ctrl);
>> +	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, 0);
>> +	rdmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS, pmu->global_status);
>> +
>> +	/* Clear global status bits if non-zero */
>> +	if (pmu->global_status)
>> +		wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR, pmu->global_status);
>> +
>> +	kvm_pmu_put_guest_pmcs(vcpu);
>> +}
>> +
>> +static void amd_load_guest_context(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> +	u64 global_status;
>> +
>> +	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, 0);
> Back when I suggested we give up on trying to handle PMCs and eventsels in common
> x86, this WRMSR didn't exist.  Now that it does, I don't see anything that prevents
> invoking kvm_pmu_{load,put}_guest_pmcs() from common x86, KVM just needs to clear
> GLOBAL_CTRL before setting eventsels and PMCs.
>
> For the load path:
>
> 	/*
> 	 * Disable all counters before loading event selectors and PMCs so that
> 	 * KVM doesn't enable or load guest counters while host events are
> 	 * active.  VMX will enable/disable counters at VM-Enter/VM-Exit by
> 	 * atomically loading PERF_GLOBAL_CONTROL.  SVM effectively performs
> 	 * the switch by configuring all events to be GUEST_ONLY.
> 	 */
> 	wrmsrl(kvm_pmu_ops.PERF_GLOBAL_CTRL, 0);
>
> 	kvm_pmu_load_guest_pmcs(vcpu);
>
> 	kvm_pmu_call(mediated_load)(vcpu);
>
> And for the put path, just reverse the ordering:
>
> 	/*
> 	 * Defer handling of PERF_GLOBAL_CTRL to vendor code.  On Intel, it's
> 	 * atomically cleared on VM-Exit, i.e. doesn't need to be cleared here.
> 	 */
> 	kvm_pmu_call(mediated_put)(vcpu);
>
> 	kvm_pmu_put_guest_pmcs(vcpu);
>
> 	perf_put_guest_context();
>
> On Intel, PERF_GLOBAL_CTRL is cleared on VM-Exit, and on AMD, the vendor hook
> will clear it.  The fact that vendor code sets other MSRs is irrelevant, what
> matters is that all counters are quiesced.
>
> I think it's still worth having helpers, but they can be static locals.

Agree.


>
>> +
>> +	kvm_pmu_load_guest_pmcs(vcpu);
>> +
>> +	rdmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS, global_status);
>> +	/* Clear host global_status MSR if non-zero. */
>> +	if (global_status)
>> +		wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR, global_status);
>> +
>> +	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET, pmu->global_status);
>> +	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, pmu->global_ctrl);
>> +}
>> +
>>  static void intel_pmu_update_msr_intercepts(struct kvm_vcpu *vcpu)
>> @@ -809,6 +822,50 @@ void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu)
>>  	}
>>  }
>>  
>> +static void intel_put_guest_context(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> +
>> +	/* Global ctrl register is already saved at VM-exit. */
>> +	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
>> +
>> +	/* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
>> +	if (pmu->global_status)
>> +		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
>> +
>> +	rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl_hw);
>> +
>> +	/*
>> +	 * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
>> +	 * also avoid these guest fixed counters get accidentially enabled
>> +	 * during host running when host enable global ctrl.
>> +	 */
>> +	if (pmu->fixed_ctr_ctrl_hw)
>> +		wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>> +
>> +	kvm_pmu_put_guest_pmcs(vcpu);
>> +}
>> +
>> +static void intel_load_guest_context(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> +	u64 global_status, toggle;
>> +
>> +	/* Clear host global_ctrl MSR if non-zero. */
>> +	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
>> +
>> +	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
>> +	toggle = pmu->global_status ^ global_status;
>> +	if (global_status & toggle)
>> +		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status & toggle);
>> +	if (pmu->global_status & toggle)
>> +		wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status & toggle);
>> +
>> +	wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl_hw);
>> +
>> +	kvm_pmu_load_guest_pmcs(vcpu);
>> +}

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 29/38] KVM: x86/pmu: Switch host/guest PMU context at vm-exit/vm-entry
  2025-03-24 17:31 ` [PATCH v4 29/38] KVM: x86/pmu: Switch host/guest PMU context at vm-exit/vm-entry Mingwei Zhang
  2025-05-15 16:29   ` Sean Christopherson
@ 2025-05-16 13:26   ` Sean Christopherson
  2025-05-19  5:07     ` Mi, Dapeng
  1 sibling, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-05-16 13:26 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania

On Mon, Mar 24, 2025, Mingwei Zhang wrote:
> +	/*
> +	 * Clear hardware selector MSR content and its counter to avoid
> +	 * leakage and also avoid this guest GP counter get accidentally
> +	 * enabled during host running when host enable global ctrl.
> +	 */
> +	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> +		pmc = &pmu->gp_counters[i];
> +		eventsel_msr = pmc_msr_addr(pmu, pmu->gp_eventsel_base, i);
> +		counter_msr = pmc_msr_addr(pmu, pmu->gp_counter_base, i);
> +
> +		rdpmcl(i, pmc->counter);
> +		rdmsrl(eventsel_msr, pmc->eventsel_hw);

As pointed out by Dapeng offlist, this RDMSR is unnecessary since event selector
MSRs are always intercepted.

> +		if (pmc->counter)
> +			wrmsrl(counter_msr, 0);
> +		if (pmc->eventsel_hw)
> +			wrmsrl(eventsel_msr, 0);
> +	}
> +
> +	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> +		pmc = &pmu->fixed_counters[i];
> +		counter_msr = pmc_msr_addr(pmu, pmu->fixed_base, i);
> +
> +		rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
> +		if (pmc->counter)
> +			wrmsrl(counter_msr, 0);
> +	}
> +
> +}
> +static void intel_put_guest_context(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +
> +	/* Global ctrl register is already saved at VM-exit. */
> +	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
> +
> +	/* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
> +	if (pmu->global_status)
> +		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
> +
> +	rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl_hw);

And same thing here.  Though I'm confused as to why KVM always intercepts
FIXED_CTR_CTRL.

/me rummages around the SDM

Ahh, because there are Any Thread bits in there.  That absolutely needs to be
called out, probably in the interception logic in pmu_intel.c.  I'll add a comment.
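
Something along these lines, perhaps (wording is just a sketch):

	/*
	 * Always intercept FIXED_CTR_CTRL: it contains AnyThread bits that
	 * must never be exposed to the guest, so writes have to be sanitized
	 * by KVM before they reach hardware.
	 */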

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 32/38] KVM: nVMX: Add nested virtualization support for mediated PMU
  2025-03-24 17:31 ` [PATCH v4 32/38] KVM: nVMX: Add nested virtualization support for mediated PMU Mingwei Zhang
@ 2025-05-16 13:33   ` Sean Christopherson
  2025-05-19  5:24     ` Mi, Dapeng
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-05-16 13:33 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania

This shortlog is unnecessarily confusing.  It reads as if support for running
L2 in a vCPU with a mediated PMU is somehow lacking.

On Mon, Mar 24, 2025, Mingwei Zhang wrote:
> Add nested virtualization support for mediated PMU by combining the MSR
> interception bitmaps of vmcs01 and vmcs12.

Do not phrase changelogs related to nested virtualization in terms of enabling a
_KVM_ feature.  KVM has no control over what hypervisor runs in L1.  It's a-ok to
provide example use cases, but they need to be just that, examples.

> Readers may argue even without this patch, nested virtualization works for
> mediated PMU because L1 will see Perfmon v2 and will have to use legacy vPMU
> implementation if it is Linux. However, any assumption made on L1 may be
> invalid, e.g., L1 may not even be Linux.
> 
> If both L0 and L1 pass through PMU MSRs, the correct behavior is to allow
> MSR access from L2 directly touch HW MSRs, since both L0 and L1 passthrough
> the access.
> 
> However, in current implementation, if without adding anything for nested,
> KVM always set MSR interception bits in vmcs02. This leads to the fact that
> L0 will emulate all MSR read/writes for L2, leading to errors, since the
> current mediated vPMU never implements set_msr() and get_msr() for any
> counter access except counter accesses from the VMM side.
> 
> So fix the issue by setting up the correct MSR interception for PMU MSRs.

This is not a fix.  

    KVM: nVMX: Disable PMU MSR interception as appropriate while running L2
    
    Merge KVM's PMU MSR interception bitmaps with those of L1, i.e. merge the
    bitmaps of vmcs01 and vmcs12, e.g. so that KVM doesn't interpose on MSR
    accesses unnecessarily if L1 exposes a mediated PMU (or equivalent) to L2.

> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 32 ++++++++++++++++++++++++++++++++
>  1 file changed, 32 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index cf557acf91f8..dbec40cb55bc 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -626,6 +626,36 @@ static inline void nested_vmx_set_intercept_for_msr(struct vcpu_vmx *vmx,
>  #define nested_vmx_merge_msr_bitmaps_rw(msr) \
>  	nested_vmx_merge_msr_bitmaps(msr, MSR_TYPE_RW)
>  
> +/*
> + * Disable PMU MSRs interception for nested VM if L0 and L1 are
> + * both mediated vPMU.
> + */

Again, KVM has no idea what is running in L1.  Drop this.

> +static void nested_vmx_merge_pmu_msr_bitmaps(struct kvm_vcpu *vcpu,
> +					     unsigned long *msr_bitmap_l1,
> +					     unsigned long *msr_bitmap_l0)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	int i;
> +
> +	if (!kvm_mediated_pmu_enabled(vcpu))

This is a worthwhile check, but a comment would be helpful:

	/*
	 * Skip the merges if the vCPU doesn't have a mediated PMU MSR, i.e. if
	 * none of the MSRs can possibly be passed through to L1.
	 */
	if (!kvm_vcpu_has_mediated_pmu(vcpu))
		return;

> +		return;
> +
> +	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> +		nested_vmx_merge_msr_bitmaps_rw(MSR_ARCH_PERFMON_EVENTSEL0 + i);

This is unnecessary, KVM always intercepts event selectors.

> +		nested_vmx_merge_msr_bitmaps_rw(MSR_IA32_PERFCTR0 + i);
> +		nested_vmx_merge_msr_bitmaps_rw(MSR_IA32_PMC0 + i);
> +	}
> +
> +	for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
> +		nested_vmx_merge_msr_bitmaps_rw(MSR_CORE_PERF_FIXED_CTR0 + i);
> +
> +	nested_vmx_merge_msr_bitmaps_rw(MSR_CORE_PERF_FIXED_CTR_CTRL);

Same thing here.

> +	nested_vmx_merge_msr_bitmaps_rw(MSR_CORE_PERF_GLOBAL_CTRL);
> +	nested_vmx_merge_msr_bitmaps_read(MSR_CORE_PERF_GLOBAL_STATUS);
> +	nested_vmx_merge_msr_bitmaps_write(MSR_CORE_PERF_GLOBAL_OVF_CTRL);
> +}
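
Putting the above together, the helper would end up looking something like this
(untested sketch; event selectors and FIXED_CTR_CTRL stay intercepted, so
they're dropped from the merge):

	static void nested_vmx_merge_pmu_msr_bitmaps(struct kvm_vcpu *vcpu,
						     unsigned long *msr_bitmap_l1,
						     unsigned long *msr_bitmap_l0)
	{
		struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
		struct vcpu_vmx *vmx = to_vmx(vcpu);
		int i;

		/*
		 * Skip the merges if the vCPU doesn't have a mediated PMU,
		 * i.e. if none of the MSRs can possibly be passed through to L1.
		 */
		if (!kvm_mediated_pmu_enabled(vcpu))
			return;

		for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
			nested_vmx_merge_msr_bitmaps_rw(MSR_IA32_PERFCTR0 + i);
			nested_vmx_merge_msr_bitmaps_rw(MSR_IA32_PMC0 + i);
		}

		for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
			nested_vmx_merge_msr_bitmaps_rw(MSR_CORE_PERF_FIXED_CTR0 + i);

		nested_vmx_merge_msr_bitmaps_rw(MSR_CORE_PERF_GLOBAL_CTRL);
		nested_vmx_merge_msr_bitmaps_read(MSR_CORE_PERF_GLOBAL_STATUS);
		nested_vmx_merge_msr_bitmaps_write(MSR_CORE_PERF_GLOBAL_OVF_CTRL);
	}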

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 23/38] KVM: x86/pmu: Configure the interception of PMU MSRs
  2025-03-24 17:31 ` [PATCH v4 23/38] KVM: x86/pmu: Configure the interception of PMU MSRs Mingwei Zhang
  2025-05-15  0:41   ` Sean Christopherson
@ 2025-05-16 13:34   ` Sean Christopherson
  2025-05-19  5:18     ` Mi, Dapeng
  1 sibling, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-05-16 13:34 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania

On Mon, Mar 24, 2025, Mingwei Zhang wrote:
> +static void amd_pmu_update_msr_intercepts(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +	struct vcpu_svm *svm = to_svm(vcpu);
> +	int msr_clear = !!(kvm_mediated_pmu_enabled(vcpu));
> +	int i;
> +
> +	for (i = 0; i < min(pmu->nr_arch_gp_counters, AMD64_NUM_COUNTERS); i++) {
> +		/*
> +		 * Legacy counters are always available irrespective of any
> +		 * CPUID feature bits and when X86_FEATURE_PERFCTR_CORE is set,
> +		 * PERF_LEGACY_CTLx and PERF_LEGACY_CTRx registers are mirrored
> +		 * with PERF_CTLx and PERF_CTRx respectively.
> +		 */
> +		set_msr_interception(vcpu, svm->msrpm, MSR_K7_EVNTSEL0 + i, 0, 0);

This is pointless.  Simply do nothing and KVM will always intercept event selectors.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 24/38] KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot()
  2025-03-24 17:31 ` [PATCH v4 24/38] KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot() Mingwei Zhang
@ 2025-05-16 13:35   ` Sean Christopherson
  2025-05-16 14:45     ` Sean Christopherson
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-05-16 13:35 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania

On Mon, Mar 24, 2025, Mingwei Zhang wrote:
> Reject PMU MSRs interception explicitly in
> vmx_get_passthrough_msr_slot() since interception of PMU MSRs are
> specially handled in intel_passthrough_pmu_msrs().
> 
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
>  arch/x86/kvm/vmx/vmx.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 38ecf3c116bd..7bb16bed08da 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -165,7 +165,7 @@ module_param(allow_smaller_maxphyaddr, bool, S_IRUGO);
>  
>  /*
>   * List of MSRs that can be directly passed to the guest.
> - * In addition to these x2apic, PT and LBR MSRs are handled specially.
> + * In addition to these x2apic, PMU, PT and LBR MSRs are handled specially.
>   */
>  static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
>  	MSR_IA32_SPEC_CTRL,
> @@ -691,6 +691,16 @@ static int vmx_get_passthrough_msr_slot(u32 msr)
>  	case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 8:
>  	case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8:
>  		/* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
> +	case MSR_IA32_PMC0 ...
> +		MSR_IA32_PMC0 + KVM_MAX_NR_GP_COUNTERS - 1:
> +	case MSR_IA32_PERFCTR0 ...
> +		MSR_IA32_PERFCTR0 + KVM_MAX_NR_GP_COUNTERS - 1:
> +	case MSR_CORE_PERF_FIXED_CTR0 ...
> +		MSR_CORE_PERF_FIXED_CTR0 + KVM_MAX_NR_FIXED_COUNTERS - 1:
> +	case MSR_CORE_PERF_GLOBAL_STATUS:
> +	case MSR_CORE_PERF_GLOBAL_CTRL:
> +	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
> +		/* PMU MSRs. These are handled in intel_passthrough_pmu_msrs() */
>  		return -ENOENT;
>  	}

This belongs in the patch that configures interception.  A better split would be
to have an Intel patch and an AMD patch, not three patches with logic splattered
all over.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 25/38] KVM: x86/pmu: Add AMD PMU registers to direct access list
  2025-03-24 17:31 ` [PATCH v4 25/38] KVM: x86/pmu: Add AMD PMU registers to direct access list Mingwei Zhang
@ 2025-05-16 13:36   ` Sean Christopherson
  0 siblings, 0 replies; 127+ messages in thread
From: Sean Christopherson @ 2025-05-16 13:36 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania

On Mon, Mar 24, 2025, Mingwei Zhang wrote:
> From: Sandipan Das <sandipan.das@amd.com>
> 
> Add all PMU-related MSRs (including legacy K7 MSRs) to the list of
> possible direct access MSRs.  Most of them will not be intercepted when
> using passthrough PMU.
> 
> Signed-off-by: Sandipan Das <sandipan.das@amd.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/svm/svm.c | 24 ++++++++++++++++++++++++
>  arch/x86/kvm/svm/svm.h |  2 +-
>  2 files changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index a713c803a3a3..bff351992468 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -143,6 +143,30 @@ static const struct svm_direct_access_msrs {
>  	{ .index = X2APIC_MSR(APIC_TMICT),		.always = false },
>  	{ .index = X2APIC_MSR(APIC_TMCCT),		.always = false },
>  	{ .index = X2APIC_MSR(APIC_TDCR),		.always = false },
> +	{ .index = MSR_K7_EVNTSEL0,			.always = false },

These are always intercepted, i.e. don't belong in this list.

> +	{ .index = MSR_K7_PERFCTR0,			.always = false },
> +	{ .index = MSR_K7_EVNTSEL1,			.always = false },
> +	{ .index = MSR_K7_PERFCTR1,			.always = false },
> +	{ .index = MSR_K7_EVNTSEL2,			.always = false },
> +	{ .index = MSR_K7_PERFCTR2,			.always = false },
> +	{ .index = MSR_K7_EVNTSEL3,			.always = false },
> +	{ .index = MSR_K7_PERFCTR3,			.always = false },
> +	{ .index = MSR_F15H_PERF_CTL0,			.always = false },
> +	{ .index = MSR_F15H_PERF_CTR0,			.always = false },
> +	{ .index = MSR_F15H_PERF_CTL1,			.always = false },
> +	{ .index = MSR_F15H_PERF_CTR1,			.always = false },
> +	{ .index = MSR_F15H_PERF_CTL2,			.always = false },
> +	{ .index = MSR_F15H_PERF_CTR2,			.always = false },
> +	{ .index = MSR_F15H_PERF_CTL3,			.always = false },
> +	{ .index = MSR_F15H_PERF_CTR3,			.always = false },
> +	{ .index = MSR_F15H_PERF_CTL4,			.always = false },
> +	{ .index = MSR_F15H_PERF_CTR4,			.always = false },
> +	{ .index = MSR_F15H_PERF_CTL5,			.always = false },
> +	{ .index = MSR_F15H_PERF_CTR5,			.always = false },
> +	{ .index = MSR_AMD64_PERF_CNTR_GLOBAL_CTL,	.always = false },
> +	{ .index = MSR_AMD64_PERF_CNTR_GLOBAL_STATUS,	.always = false },
> +	{ .index = MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR,	.always = false },
> +	{ .index = MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET,	.always = false },
>  	{ .index = MSR_INVALID,				.always = false },
>  };

As with the Intel patch, this absolutely belongs in the patch that supports
disabling intercepts.

> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 9d7cdb8fbf87..ae71bf5f12d0 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -44,7 +44,7 @@ static inline struct page *__sme_pa_to_page(unsigned long pa)
>  #define	IOPM_SIZE PAGE_SIZE * 3
>  #define	MSRPM_SIZE PAGE_SIZE * 2
>  
> -#define MAX_DIRECT_ACCESS_MSRS	48
> +#define MAX_DIRECT_ACCESS_MSRS	72
>  #define MSRPM_OFFSETS	32
>  extern u32 msrpm_offsets[MSRPM_OFFSETS] __read_mostly;
>  extern bool npt_enabled;
> -- 
> 2.49.0.395.g12beb8f557-goog
> 

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 24/38] KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot()
  2025-05-16 13:35   ` Sean Christopherson
@ 2025-05-16 14:45     ` Sean Christopherson
  2025-05-19  5:21       ` Mi, Dapeng
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-05-16 14:45 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania

On Fri, May 16, 2025, Sean Christopherson wrote:
> On Mon, Mar 24, 2025, Mingwei Zhang wrote:
> > Reject PMU MSRs interception explicitly in
> > vmx_get_passthrough_msr_slot() since interception of PMU MSRs are
> > specially handled in intel_passthrough_pmu_msrs().
> > 
> > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> > Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> > ---
> >  arch/x86/kvm/vmx/vmx.c | 12 +++++++++++-
> >  1 file changed, 11 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 38ecf3c116bd..7bb16bed08da 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -165,7 +165,7 @@ module_param(allow_smaller_maxphyaddr, bool, S_IRUGO);
> >  
> >  /*
> >   * List of MSRs that can be directly passed to the guest.
> > - * In addition to these x2apic, PT and LBR MSRs are handled specially.
> > + * In addition to these x2apic, PMU, PT and LBR MSRs are handled specially.

Except y'all forgot to actually do the "special" handling, vmx_msr_filter_changed()
needs to refresh the PMU MSR filters.  Only the x2APIC MSRs are excluded from
userspace filtering (see kvm_msr_allowed()), everything else can be intercepted
by userspace.  E.g. if an MSR filter is set _before_ PMU refresh, KVM's behavior
will diverge from a filter that is set after PMU refresh.

> >   */
> >  static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
> >  	MSR_IA32_SPEC_CTRL,
> > @@ -691,6 +691,16 @@ static int vmx_get_passthrough_msr_slot(u32 msr)
> >  	case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 8:
> >  	case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8:
> >  		/* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
> > +	case MSR_IA32_PMC0 ...
> > +		MSR_IA32_PMC0 + KVM_MAX_NR_GP_COUNTERS - 1:
> > +	case MSR_IA32_PERFCTR0 ...
> > +		MSR_IA32_PERFCTR0 + KVM_MAX_NR_GP_COUNTERS - 1:
> > +	case MSR_CORE_PERF_FIXED_CTR0 ...
> > +		MSR_CORE_PERF_FIXED_CTR0 + KVM_MAX_NR_FIXED_COUNTERS - 1:
> > +	case MSR_CORE_PERF_GLOBAL_STATUS:
> > +	case MSR_CORE_PERF_GLOBAL_CTRL:
> > +	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
> > +		/* PMU MSRs. These are handled in intel_passthrough_pmu_msrs() */
> >  		return -ENOENT;
> >  	}
> 
> This belongs in the patch that configures interception.  A better split would be
> to have an Intel patch and an AMD patch, 

I take that back, splitting the Intel and AMD logic is just as messy,
because the control logic is very much shared between VMX and SVM.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 27/38] KVM: x86/pmu: Handle PMU MSRs interception and event filtering
  2025-05-16  1:26   ` Mi, Dapeng
@ 2025-05-16 20:54     ` Sean Christopherson
  2025-05-19  4:16       ` Mi, Dapeng
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-05-16 20:54 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Mingwei Zhang, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Paolo Bonzini,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Yongwei Ma, Xiong Zhang,
	Jim Mattson, Sandipan Das, Zide Chen, Eranian Stephane,
	Shukla Manali, Nikunj Dadhania

On Fri, May 16, 2025, Dapeng Mi wrote:
> On 3/25/2025 1:31 AM, Mingwei Zhang wrote:
> > +	if (kvm_mediated_pmu_enabled(pmu_to_vcpu(pmu))) {
> > +		bool allowed = check_pmu_event_filter(pmc);
> > +
> > +		if (pmc_is_gp(pmc)) {
> > +			if (allowed)
> > +				pmc->eventsel_hw |= pmc->eventsel &
> > +						    ARCH_PERFMON_EVENTSEL_ENABLE;
> > +			else
> > +				pmc->eventsel_hw &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
> > +		} else {
> > +			int idx = pmc->idx - KVM_FIXED_PMC_BASE_IDX;
> > +
> > +			if (allowed)
> > +				pmu->fixed_ctr_ctrl_hw = pmu->fixed_ctr_ctrl;
> 
> Sean, I just found a potential bug here.  The full "pmu->fixed_ctr_ctrl" value
> should not be assigned to "pmu->fixed_ctr_ctrl_hw" here, otherwise other
> filtered-out fixed counters (not just this allowed fixed counter) could be
> enabled accidentally.
> 
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index ba9d336f1d1d..f32e5f66f73b 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -473,7 +473,8 @@ static int reprogram_counter(struct kvm_pmc *pmc)
>                         int idx = pmc->idx - KVM_FIXED_PMC_BASE_IDX;
> 
>                         if (allowed)
> -                               pmu->fixed_ctr_ctrl_hw = pmu->fixed_ctr_ctrl;
> +                               pmu->fixed_ctr_ctrl_hw |= pmu->fixed_ctr_ctrl &
> +                                               intel_fixed_bits_by_idx(idx, 0xf);

Hmm, I think that's fine, since pmu->fixed_ctr_ctrl should have changed?  But I'd
rather play it safe and do (completely untested):

	if (pmc_is_gp(pmc)) {
		pmc->eventsel_hw &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
		if (allowed)
			pmc->eventsel_hw |= pmc->eventsel &
					    ARCH_PERFMON_EVENTSEL_ENABLE;
	} else {
		u64 mask = intel_fixed_bits_by_idx(pmc->idx - KVM_FIXED_PMC_BASE_IDX, 0xf);

		pmu->fixed_ctr_ctrl_hw &= ~mask;
		if (allowed)
			pmu->fixed_ctr_ctrl_hw |= pmu->fixed_ctr_ctrl & mask;
	}

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 27/38] KVM: x86/pmu: Handle PMU MSRs interception and event filtering
  2025-05-16 20:54     ` Sean Christopherson
@ 2025-05-19  4:16       ` Mi, Dapeng
  0 siblings, 0 replies; 127+ messages in thread
From: Mi, Dapeng @ 2025-05-19  4:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Mingwei Zhang, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Paolo Bonzini,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Yongwei Ma, Xiong Zhang,
	Jim Mattson, Sandipan Das, Zide Chen, Eranian Stephane,
	Shukla Manali, Nikunj Dadhania


On 5/17/2025 4:54 AM, Sean Christopherson wrote:
> On Fri, May 16, 2025, Dapeng Mi wrote:
>> On 3/25/2025 1:31 AM, Mingwei Zhang wrote:
>>> +	if (kvm_mediated_pmu_enabled(pmu_to_vcpu(pmu))) {
>>> +		bool allowed = check_pmu_event_filter(pmc);
>>> +
>>> +		if (pmc_is_gp(pmc)) {
>>> +			if (allowed)
>>> +				pmc->eventsel_hw |= pmc->eventsel &
>>> +						    ARCH_PERFMON_EVENTSEL_ENABLE;
>>> +			else
>>> +				pmc->eventsel_hw &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
>>> +		} else {
>>> +			int idx = pmc->idx - KVM_FIXED_PMC_BASE_IDX;
>>> +
>>> +			if (allowed)
>>> +				pmu->fixed_ctr_ctrl_hw = pmu->fixed_ctr_ctrl;
>> Sean, I just found a potential bug here.  The full "pmu->fixed_ctr_ctrl" value
>> should not be assigned to "pmu->fixed_ctr_ctrl_hw" here, otherwise other
>> filtered-out fixed counters (not just this allowed fixed counter) could be
>> enabled accidentally.
>>
>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>> index ba9d336f1d1d..f32e5f66f73b 100644
>> --- a/arch/x86/kvm/pmu.c
>> +++ b/arch/x86/kvm/pmu.c
>> @@ -473,7 +473,8 @@ static int reprogram_counter(struct kvm_pmc *pmc)
>>                         int idx = pmc->idx - KVM_FIXED_PMC_BASE_IDX;
>>
>>                         if (allowed)
>> -                               pmu->fixed_ctr_ctrl_hw = pmu->fixed_ctr_ctrl;
>> +                               pmu->fixed_ctr_ctrl_hw |= pmu->fixed_ctr_ctrl &
>> +                                               intel_fixed_bits_by_idx(idx, 0xf);
> Hmm, I think that's fine, since pmu->fixed_ctr_ctrl should have changed?  But I'd

Not exactly.  Assume the guest enables fixed counters 0 and 1, so
pmu->fixed_ctr_ctrl is 0xbb.  First, the user disables fixed counter 0 via the
KVM event filter, so pmu->fixed_ctr_ctrl_hw becomes 0xb0.  Second, the user
disables fixed counter 1 as well, so pmu->fixed_ctr_ctrl_hw becomes 0x0.
Third, the user re-enables fixed counter 1, so pmu->fixed_ctr_ctrl_hw is
expected to become 0xb0, but it doesn't.  Although the first call to
reprogram_counter() (for fixed counter 0) disables counter 0, the second call
(for fixed counter 1) accidentally re-enables it, because
pmu->fixed_ctr_ctrl_hw is overwritten with the full pmu->fixed_ctr_ctrl and
eventually becomes 0xbb.  This is incorrect.
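
To spell out the bit values (assuming 0xb per fixed counter, as above):

	guest value:            fixed_ctr_ctrl    = 0xbb  (fixed counters 0 and 1)
	filter out counter 0:   fixed_ctr_ctrl_hw = 0xb0
	filter out counter 1:   fixed_ctr_ctrl_hw = 0x00
	re-allow counter 1:
	  reprogram(counter 0, !allowed): hw &= ~0x0f                -> 0x00
	  reprogram(counter 1,  allowed): hw  = fixed_ctr_ctrl (bug) -> 0xbb
	  with the masked OR instead:     hw |= 0xbb & 0xf0          -> 0xb0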


> rather play it safe and do (completely untested):
>
> 	if (pmc_is_gp(pmc)) {
> 		pmc->eventsel_hw &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
> 		if (allowed)
> 			pmc->eventsel_hw |= pmc->eventsel &
> 					    ARCH_PERFMON_EVENTSEL_ENABLE;
> 	} else {
> 		u64 mask = intel_fixed_bits_by_idx(pmc->idx - KVM_FIXED_PMC_BASE_IDX, 0xf);
>
> 		pmu->fixed_ctr_ctrl_hw &= ~mask;
> 		if (allowed)
> 			pmu->fixed_ctr_ctrl_hw |= pmu->fixed_ctr_ctrl & mask;
> 	}

The code looks good.



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 29/38] KVM: x86/pmu: Switch host/guest PMU context at vm-exit/vm-entry
  2025-05-16 13:26   ` Sean Christopherson
@ 2025-05-19  5:07     ` Mi, Dapeng
  0 siblings, 0 replies; 127+ messages in thread
From: Mi, Dapeng @ 2025-05-19  5:07 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania


On 5/16/2025 9:26 PM, Sean Christopherson wrote:
> On Mon, Mar 24, 2025, Mingwei Zhang wrote:
>> +	/*
>> +	 * Clear hardware selector MSR content and its counter to avoid
>> +	 * leakage and also avoid this guest GP counter get accidentally
>> +	 * enabled during host running when host enable global ctrl.
>> +	 */
>> +	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
>> +		pmc = &pmu->gp_counters[i];
>> +		eventsel_msr = pmc_msr_addr(pmu, pmu->gp_eventsel_base, i);
>> +		counter_msr = pmc_msr_addr(pmu, pmu->gp_counter_base, i);
>> +
>> +		rdpmcl(i, pmc->counter);
>> +		rdmsrl(eventsel_msr, pmc->eventsel_hw);
> As pointed out by Dapeng offlist, this RDMSR is unnecessary since event selector
> MSRs are always intercepted.
>
>> +		if (pmc->counter)
>> +			wrmsrl(counter_msr, 0);
>> +		if (pmc->eventsel_hw)
>> +			wrmsrl(eventsel_msr, 0);
>> +	}
>> +
>> +	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
>> +		pmc = &pmu->fixed_counters[i];
>> +		counter_msr = pmc_msr_addr(pmu, pmu->fixed_base, i);
>> +
>> +		rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
>> +		if (pmc->counter)
>> +			wrmsrl(counter_msr, 0);
>> +	}
>> +
>> +}
>> +static void intel_put_guest_context(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> +
>> +	/* Global ctrl register is already saved at VM-exit. */
>> +	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
>> +
>> +	/* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
>> +	if (pmu->global_status)
>> +		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
>> +
>> +	rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl_hw);
> And same thing here.  Though I'm confused as to why KVM always intercepts
> FIXED_CTR_CTRL.
>
> /me rummages around the SDM
>
> Ahh, because there are Any Thread bits in there.  That absolutely needs to be
> called out, probably in the interception logic in pmu_intel.c.  I'll add a comment.

Another reason is the event filter.  Userspace may filter out some but not all
fixed counters, so PERF_FIXED_CTR_CTRL has to be intercepted.


>

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 23/38] KVM: x86/pmu: Configure the interception of PMU MSRs
  2025-05-16 13:34   ` Sean Christopherson
@ 2025-05-19  5:18     ` Mi, Dapeng
  0 siblings, 0 replies; 127+ messages in thread
From: Mi, Dapeng @ 2025-05-19  5:18 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania


On 5/16/2025 9:34 PM, Sean Christopherson wrote:
> On Mon, Mar 24, 2025, Mingwei Zhang wrote:
>> +static void amd_pmu_update_msr_intercepts(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> +	struct vcpu_svm *svm = to_svm(vcpu);
>> +	int msr_clear = !!(kvm_mediated_pmu_enabled(vcpu));
>> +	int i;
>> +
>> +	for (i = 0; i < min(pmu->nr_arch_gp_counters, AMD64_NUM_COUNTERS); i++) {
>> +		/*
>> +		 * Legacy counters are always available irrespective of any
>> +		 * CPUID feature bits and when X86_FEATURE_PERFCTR_CORE is set,
>> +		 * PERF_LEGACY_CTLx and PERF_LEGACY_CTRx registers are mirrored
>> +		 * with PERF_CTLx and PERF_CTRx respectively.
>> +		 */
>> +		set_msr_interception(vcpu, svm->msrpm, MSR_K7_EVNTSEL0 + i, 0, 0);
> This is pointless.  Simply do nothing and KVM will always intercept event selectors.

Yes.
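
i.e. the loop reduces to just toggling the counter MSRs, something like this
(untested sketch, keeping the current set_msr_interception() signature):

	for (i = 0; i < min(pmu->nr_arch_gp_counters, AMD64_NUM_COUNTERS); i++) {
		/* Event selectors stay intercepted unconditionally. */
		set_msr_interception(vcpu, svm->msrpm, MSR_K7_PERFCTR0 + i,
				     msr_clear, msr_clear);
	}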



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 24/38] KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot()
  2025-05-16 14:45     ` Sean Christopherson
@ 2025-05-19  5:21       ` Mi, Dapeng
  0 siblings, 0 replies; 127+ messages in thread
From: Mi, Dapeng @ 2025-05-19  5:21 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania


On 5/16/2025 10:45 PM, Sean Christopherson wrote:
> On Fri, May 16, 2025, Sean Christopherson wrote:
>> On Mon, Mar 24, 2025, Mingwei Zhang wrote:
>>> Reject PMU MSRs interception explicitly in
>>> vmx_get_passthrough_msr_slot() since interception of PMU MSRs are
>>> specially handled in intel_passthrough_pmu_msrs().
>>>
>>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>> ---
>>>  arch/x86/kvm/vmx/vmx.c | 12 +++++++++++-
>>>  1 file changed, 11 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>>> index 38ecf3c116bd..7bb16bed08da 100644
>>> --- a/arch/x86/kvm/vmx/vmx.c
>>> +++ b/arch/x86/kvm/vmx/vmx.c
>>> @@ -165,7 +165,7 @@ module_param(allow_smaller_maxphyaddr, bool, S_IRUGO);
>>>  
>>>  /*
>>>   * List of MSRs that can be directly passed to the guest.
>>> - * In addition to these x2apic, PT and LBR MSRs are handled specially.
>>> + * In addition to these x2apic, PMU, PT and LBR MSRs are handled specially.
> Except y'all forgot to actually do the "special" handling, vmx_msr_filter_changed()
> needs to refresh the PMU MSR filters.  Only the x2APIC MSRs are excluded from
> userspace filtering (see kvm_msr_allowed()), everything else can be intercepted
> by userspace.  E.g. if an MSR filter is set _before_ PMU refresh, KVM's behavior
> will diverge from a filter that is set after PMU refresh.

Yes, that handling is indeed missing; we need to add it. Thanks for the reminder.
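
One possible shape (just a sketch, and the helper name below is hypothetical)
would be to re-run the PMU interception update from vmx_msr_filter_changed()
after the possible-passthrough list is re-applied:

	static void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
	{
		...
		/*
		 * Userspace MSR filters override pass-through, so re-evaluate
		 * the PMU MSR intercepts whenever the filter changes.
		 */
		kvm_pmu_update_msr_intercepts(vcpu);
	}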


>
>>>   */
>>>  static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
>>>  	MSR_IA32_SPEC_CTRL,
>>> @@ -691,6 +691,16 @@ static int vmx_get_passthrough_msr_slot(u32 msr)
>>>  	case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 8:
>>>  	case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8:
>>>  		/* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
>>> +	case MSR_IA32_PMC0 ...
>>> +		MSR_IA32_PMC0 + KVM_MAX_NR_GP_COUNTERS - 1:
>>> +	case MSR_IA32_PERFCTR0 ...
>>> +		MSR_IA32_PERFCTR0 + KVM_MAX_NR_GP_COUNTERS - 1:
>>> +	case MSR_CORE_PERF_FIXED_CTR0 ...
>>> +		MSR_CORE_PERF_FIXED_CTR0 + KVM_MAX_NR_FIXED_COUNTERS - 1:
>>> +	case MSR_CORE_PERF_GLOBAL_STATUS:
>>> +	case MSR_CORE_PERF_GLOBAL_CTRL:
>>> +	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
>>> +		/* PMU MSRs. These are handled in intel_passthrough_pmu_msrs() */
>>>  		return -ENOENT;
>>>  	}
>> This belongs in the patch that configures interception.  A better split would be
>> to have an Intel patch and an AMD patch, 
> I take that back, splitting the Intel and AMD logic is just as messy,
> because the control logic is very much shared between VMX and SVM.

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 32/38] KVM: nVMX: Add nested virtualization support for mediated PMU
  2025-05-16 13:33   ` Sean Christopherson
@ 2025-05-19  5:24     ` Mi, Dapeng
  0 siblings, 0 replies; 127+ messages in thread
From: Mi, Dapeng @ 2025-05-19  5:24 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Paolo Bonzini, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, Liang, Kan, H. Peter Anvin,
	linux-perf-users, linux-kernel, kvm, linux-kselftest, Yongwei Ma,
	Xiong Zhang, Jim Mattson, Sandipan Das, Zide Chen,
	Eranian Stephane, Shukla Manali, Nikunj Dadhania


On 5/16/2025 9:33 PM, Sean Christopherson wrote:
> This shortlog is unnecessarily confusing.  It reads as if support for running
> L2 in a vCPU with a mediated PMU is somehow lacking.
>
> On Mon, Mar 24, 2025, Mingwei Zhang wrote:
>> Add nested virtualization support for mediated PMU by combining the MSR
>> interception bitmaps of vmcs01 and vmcs12.
> Do not phrase changelogs related to nested virtualization in terms of enabling a
> _KVM_ feature.  KVM has no control over what hypervisor runs in L1.  It's a-ok to
> provide example use cases, but they need to be just that, examples.
>
>> Readers may argue even without this patch, nested virtualization works for
>> mediated PMU because L1 will see Perfmon v2 and will have to use legacy vPMU
>> implementation if it is Linux. However, any assumption made on L1 may be
>> invalid, e.g., L1 may not even be Linux.
>>
>> If both L0 and L1 pass through PMU MSRs, the correct behavior is to allow
>> MSR access from L2 directly touch HW MSRs, since both L0 and L1 passthrough
>> the access.
>>
>> However, in current implementation, if without adding anything for nested,
>> KVM always set MSR interception bits in vmcs02. This leads to the fact that
>> L0 will emulate all MSR read/writes for L2, leading to errors, since the
>> current mediated vPMU never implements set_msr() and get_msr() for any
>> counter access except counter accesses from the VMM side.
>>
>> So fix the issue by setting up the correct MSR interception for PMU MSRs.
> This is not a fix.  
>
>     KVM: nVMX: Disable PMU MSR interception as appropriate while running L2
>     
>     Merge KVM's PMU MSR interception bitmaps with those of L1, i.e. merge the
>     bitmaps of vmcs01 and vmcs12, e.g. so that KVM doesn't interpose on MSR
>     accesses unnecessarily if L1 exposes a mediated PMU (or equivalent) to L2.

Sure. Thanks.


>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> ---
>>  arch/x86/kvm/vmx/nested.c | 32 ++++++++++++++++++++++++++++++++
>>  1 file changed, 32 insertions(+)
>>
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index cf557acf91f8..dbec40cb55bc 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -626,6 +626,36 @@ static inline void nested_vmx_set_intercept_for_msr(struct vcpu_vmx *vmx,
>>  #define nested_vmx_merge_msr_bitmaps_rw(msr) \
>>  	nested_vmx_merge_msr_bitmaps(msr, MSR_TYPE_RW)
>>  
>> +/*
>> + * Disable PMU MSRs interception for nested VM if L0 and L1 are
>> + * both mediated vPMU.
>> + */
> Again, KVM has no idea what is running in L1.  Drop this.

Yes. thanks.


>
>> +static void nested_vmx_merge_pmu_msr_bitmaps(struct kvm_vcpu *vcpu,
>> +					     unsigned long *msr_bitmap_l1,
>> +					     unsigned long *msr_bitmap_l0)
>> +{
>> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>> +	int i;
>> +
>> +	if (!kvm_mediated_pmu_enabled(vcpu))
> This is a worthwhile check, but a comment would be helpful:
>
> 	/*
> 	 * Skip the merges if the vCPU doesn't have a mediated PMU MSR, i.e. if
> 	 * none of the MSRs can possibly be passed through to L1.
> 	 */
> 	if (!kvm_vcpu_has_mediated_pmu(vcpu))
> 		return;

Sure. thanks.


>
>> +		return;
>> +
>> +	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
>> +		nested_vmx_merge_msr_bitmaps_rw(MSR_ARCH_PERFMON_EVENTSEL0 + i);
> This is unnecessary, KVM always intercepts event selectors.

Yes.


>
>> +		nested_vmx_merge_msr_bitmaps_rw(MSR_IA32_PERFCTR0 + i);
>> +		nested_vmx_merge_msr_bitmaps_rw(MSR_IA32_PMC0 + i);
>> +	}
>> +
>> +	for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
>> +		nested_vmx_merge_msr_bitmaps_rw(MSR_CORE_PERF_FIXED_CTR0 + i);
>> +
>> +	nested_vmx_merge_msr_bitmaps_rw(MSR_CORE_PERF_FIXED_CTR_CTRL);
> Same thing here.

Yes.


>
>> +	nested_vmx_merge_msr_bitmaps_rw(MSR_CORE_PERF_GLOBAL_CTRL);
>> +	nested_vmx_merge_msr_bitmaps_read(MSR_CORE_PERF_GLOBAL_STATUS);
>> +	nested_vmx_merge_msr_bitmaps_write(MSR_CORE_PERF_GLOBAL_OVF_CTRL);
>> +}

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 04/38] perf: Add a EVENT_GUEST flag
  2025-03-24 17:30 ` [PATCH v4 04/38] perf: Add a EVENT_GUEST flag Mingwei Zhang
  2025-05-14 22:51   ` Sean Christopherson
@ 2025-05-19  6:58   ` Namhyung Kim
  2025-05-20 16:09     ` Liang, Kan
  2025-05-21 19:46   ` Namhyung Kim
  2 siblings, 1 reply; 127+ messages in thread
From: Namhyung Kim @ 2025-05-19  6:58 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Sean Christopherson, Paolo Bonzini, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Liang,
	Kan, H. Peter Anvin, linux-perf-users, linux-kernel, kvm,
	linux-kselftest, Yongwei Ma, Xiong Zhang, Dapeng Mi, Jim Mattson,
	Sandipan Das, Zide Chen, Eranian Stephane, Shukla Manali,
	Nikunj Dadhania

Hello,

On Mon, Mar 24, 2025 at 05:30:44PM +0000, Mingwei Zhang wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> Current perf doesn't explicitly schedule out all exclude_guest events
> while the guest is running. There is no problem with the current
> emulated vPMU. Because perf owns all the PMU counters. It can mask the
> counter which is assigned to an exclude_guest event when a guest is
> running (Intel way), or set the corresponding HOSTONLY bit in the eventsel
> (AMD way). The counter doesn't count when a guest is running.
> 
> However, either way doesn't work with the introduced passthrough vPMU.
> A guest owns all the PMU counters when it's running. The host should not
> mask any counters. The counter may be used by the guest. The eventsel
> may be overwritten.
> 
> Perf should explicitly schedule out all exclude_guest events to release
> the PMU resources when entering a guest, and resume the counting when
> exiting the guest.
> 
> It's possible that an exclude_guest event is created when a guest is
> running. The new event should not be scheduled in as well.
> 
> The ctx time is shared among different PMUs. The time cannot be stopped
> when a guest is running. It is required to calculate the time for events
> from other PMUs, e.g., uncore events. Add timeguest to track the guest
> run time. For an exclude_guest event, the elapsed time equals
> the ctx time - guest time.
> Cgroup has dedicated times. Use the same method to deduct the guest time
> from the cgroup time as well.
> 
> Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  include/linux/perf_event.h |   6 ++
>  kernel/events/core.c       | 209 +++++++++++++++++++++++++++++--------
>  2 files changed, 169 insertions(+), 46 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index a2fd1bdc955c..7bda1e20be12 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -999,6 +999,11 @@ struct perf_event_context {
>  	 */
>  	struct perf_time_ctx		time;
>  
> +	/*
> +	 * Context clock, runs when in the guest mode.
> +	 */
> +	struct perf_time_ctx		timeguest;

Why not make it an array as you use it later?
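
i.e. something like (sketch):

	/*
	 * Context clock; [T_TOTAL] always runs, [T_GUEST] only advances while
	 * a guest is running.
	 */
	struct perf_time_ctx		time[2];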

> +
>  	/*
>  	 * These fields let us detect when two contexts have both
>  	 * been cloned (inherited) from a common ancestor.
> @@ -1089,6 +1094,7 @@ struct bpf_perf_event_data_kern {
>   */
>  struct perf_cgroup_info {
>  	struct perf_time_ctx		time;
> +	struct perf_time_ctx		timeguest;
>  	int				active;
>  };
>  
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index e38c8b5e8086..7a2115b2c5c1 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -163,7 +163,8 @@ enum event_type_t {
>  	/* see ctx_resched() for details */
>  	EVENT_CPU	= 0x10,
>  	EVENT_CGROUP	= 0x20,
> -	EVENT_FLAGS	= EVENT_CGROUP,
> +	EVENT_GUEST	= 0x40,

It's not clear to me if this flag is for events to include guests or
exclude them.  Can you please add a comment?
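
Maybe something like (sketch):

	/* schedule exclude_guest events of MEDIATED_VPMU PMUs out/in around guest entry/exit */
	EVENT_GUEST	= 0x40,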

Thanks,
Namhyung


> +	EVENT_FLAGS	= EVENT_CGROUP | EVENT_GUEST,
>  	/* compound helpers */
>  	EVENT_ALL         = EVENT_FLEXIBLE | EVENT_PINNED,
>  	EVENT_TIME_FROZEN = EVENT_TIME | EVENT_FROZEN,
> @@ -435,6 +436,7 @@ static atomic_t nr_include_guest_events __read_mostly;
>  
>  static atomic_t nr_mediated_pmu_vms;
>  static DEFINE_MUTEX(perf_mediated_pmu_mutex);
> +static DEFINE_PER_CPU(bool, perf_in_guest);
>  
>  /* !exclude_guest event of PMU with PERF_PMU_CAP_MEDIATED_VPMU */
>  static inline bool is_include_guest_event(struct perf_event *event)
> @@ -738,6 +740,9 @@ static bool perf_skip_pmu_ctx(struct perf_event_pmu_context *pmu_ctx,
>  {
>  	if ((event_type & EVENT_CGROUP) && !pmu_ctx->nr_cgroups)
>  		return true;
> +	if ((event_type & EVENT_GUEST) &&
> +	    !(pmu_ctx->pmu->capabilities & PERF_PMU_CAP_MEDIATED_VPMU))
> +		return true;
>  	return false;
>  }
>  
> @@ -788,6 +793,39 @@ static inline void update_perf_time_ctx(struct perf_time_ctx *time, u64 now, boo
>  	WRITE_ONCE(time->offset, time->time - time->stamp);
>  }
>  
> +static_assert(offsetof(struct perf_event_context, timeguest) -
> +	      offsetof(struct perf_event_context, time) ==
> +	      sizeof(struct perf_time_ctx));
> +
> +#define T_TOTAL		0
> +#define T_GUEST		1
> +
> +static inline u64 __perf_event_time_ctx(struct perf_event *event,
> +					struct perf_time_ctx *times)
> +{
> +	u64 time = times[T_TOTAL].time;
> +
> +	if (event->attr.exclude_guest)
> +		time -= times[T_GUEST].time;
> +
> +	return time;
> +}
> +
> +static inline u64 __perf_event_time_ctx_now(struct perf_event *event,
> +					    struct perf_time_ctx *times,
> +					    u64 now)
> +{
> +	if (event->attr.exclude_guest && __this_cpu_read(perf_in_guest)) {
> +		/*
> +		 * (now + times[total].offset) - (now + times[guest].offset) :=
> +		 * times[total].offset - times[guest].offset
> +		 */
> +		return READ_ONCE(times[T_TOTAL].offset) - READ_ONCE(times[T_GUEST].offset);
> +	}
> +
> +	return now + READ_ONCE(times[T_TOTAL].offset);
> +}
> +
>  #ifdef CONFIG_CGROUP_PERF
>  
>  static inline bool
> @@ -824,12 +862,16 @@ static inline int is_cgroup_event(struct perf_event *event)
>  	return event->cgrp != NULL;
>  }
>  
> +static_assert(offsetof(struct perf_cgroup_info, timeguest) -
> +	      offsetof(struct perf_cgroup_info, time) ==
> +	      sizeof(struct perf_time_ctx));
> +
>  static inline u64 perf_cgroup_event_time(struct perf_event *event)
>  {
>  	struct perf_cgroup_info *t;
>  
>  	t = per_cpu_ptr(event->cgrp->info, event->cpu);
> -	return t->time.time;
> +	return __perf_event_time_ctx(event, &t->time);
>  }
>  
>  static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
> @@ -838,9 +880,21 @@ static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
>  
>  	t = per_cpu_ptr(event->cgrp->info, event->cpu);
>  	if (!__load_acquire(&t->active))
> -		return t->time.time;
> -	now += READ_ONCE(t->time.offset);
> -	return now;
> +		return __perf_event_time_ctx(event, &t->time);
> +
> +	return __perf_event_time_ctx_now(event, &t->time, now);
> +}
> +
> +static inline void __update_cgrp_guest_time(struct perf_cgroup_info *info, u64 now, bool adv)
> +{
> +	update_perf_time_ctx(&info->timeguest, now, adv);
> +}
> +
> +static inline void update_cgrp_time(struct perf_cgroup_info *info, u64 now)
> +{
> +	update_perf_time_ctx(&info->time, now, true);
> +	if (__this_cpu_read(perf_in_guest))
> +		__update_cgrp_guest_time(info, now, true);
>  }
>  
>  static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx, bool final)
> @@ -856,7 +910,7 @@ static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx,
>  			cgrp = container_of(css, struct perf_cgroup, css);
>  			info = this_cpu_ptr(cgrp->info);
>  
> -			update_perf_time_ctx(&info->time, now, true);
> +			update_cgrp_time(info, now);
>  			if (final)
>  				__store_release(&info->active, 0);
>  		}
> @@ -879,11 +933,11 @@ static inline void update_cgrp_time_from_event(struct perf_event *event)
>  	 * Do not update time when cgroup is not active
>  	 */
>  	if (info->active)
> -		update_perf_time_ctx(&info->time, perf_clock(), true);
> +		update_cgrp_time(info, perf_clock());
>  }
>  
>  static inline void
> -perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
> +perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx, bool guest)
>  {
>  	struct perf_event_context *ctx = &cpuctx->ctx;
>  	struct perf_cgroup *cgrp = cpuctx->cgrp;
> @@ -903,8 +957,12 @@ perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
>  	for (css = &cgrp->css; css; css = css->parent) {
>  		cgrp = container_of(css, struct perf_cgroup, css);
>  		info = this_cpu_ptr(cgrp->info);
> -		update_perf_time_ctx(&info->time, ctx->time.stamp, false);
> -		__store_release(&info->active, 1);
> +		if (guest) {
> +			__update_cgrp_guest_time(info, ctx->time.stamp, false);
> +		} else {
> +			update_perf_time_ctx(&info->time, ctx->time.stamp, false);
> +			__store_release(&info->active, 1);
> +		}
>  	}
>  }
>  
> @@ -1104,7 +1162,7 @@ static inline int perf_cgroup_connect(pid_t pid, struct perf_event *event,
>  }
>  
>  static inline void
> -perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
> +perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx, bool guest)
>  {
>  }
>  
> @@ -1514,16 +1572,24 @@ static void perf_unpin_context(struct perf_event_context *ctx)
>   */
>  static void __update_context_time(struct perf_event_context *ctx, bool adv)
>  {
> -	u64 now = perf_clock();
> +	lockdep_assert_held(&ctx->lock);
> +
> +	update_perf_time_ctx(&ctx->time, perf_clock(), adv);
> +}
>  
> +static void __update_context_guest_time(struct perf_event_context *ctx, bool adv)
> +{
>  	lockdep_assert_held(&ctx->lock);
>  
> -	update_perf_time_ctx(&ctx->time, now, adv);
> +	/* must be called after __update_context_time(); */
> +	update_perf_time_ctx(&ctx->timeguest, ctx->time.stamp, adv);
>  }
>  
>  static void update_context_time(struct perf_event_context *ctx)
>  {
>  	__update_context_time(ctx, true);
> +	if (__this_cpu_read(perf_in_guest))
> +		__update_context_guest_time(ctx, true);
>  }
>  
>  static u64 perf_event_time(struct perf_event *event)
> @@ -1536,7 +1602,7 @@ static u64 perf_event_time(struct perf_event *event)
>  	if (is_cgroup_event(event))
>  		return perf_cgroup_event_time(event);
>  
> -	return ctx->time.time;
> +	return __perf_event_time_ctx(event, &ctx->time);
>  }
>  
>  static u64 perf_event_time_now(struct perf_event *event, u64 now)
> @@ -1550,10 +1616,9 @@ static u64 perf_event_time_now(struct perf_event *event, u64 now)
>  		return perf_cgroup_event_time_now(event, now);
>  
>  	if (!(__load_acquire(&ctx->is_active) & EVENT_TIME))
> -		return ctx->time.time;
> +		return __perf_event_time_ctx(event, &ctx->time);
>  
> -	now += READ_ONCE(ctx->time.offset);
> -	return now;
> +	return __perf_event_time_ctx_now(event, &ctx->time, now);
>  }
>  
>  static enum event_type_t get_event_type(struct perf_event *event)
> @@ -2384,20 +2449,23 @@ group_sched_out(struct perf_event *group_event, struct perf_event_context *ctx)
>  }
>  
>  static inline void
> -__ctx_time_update(struct perf_cpu_context *cpuctx, struct perf_event_context *ctx, bool final)
> +__ctx_time_update(struct perf_cpu_context *cpuctx, struct perf_event_context *ctx,
> +		  bool final, enum event_type_t event_type)
>  {
>  	if (ctx->is_active & EVENT_TIME) {
>  		if (ctx->is_active & EVENT_FROZEN)
>  			return;
> +
>  		update_context_time(ctx);
> -		update_cgrp_time_from_cpuctx(cpuctx, final);
> +		/* vPMU should not stop time */
> +		update_cgrp_time_from_cpuctx(cpuctx, !(event_type & EVENT_GUEST) && final);
>  	}
>  }
>  
>  static inline void
>  ctx_time_update(struct perf_cpu_context *cpuctx, struct perf_event_context *ctx)
>  {
> -	__ctx_time_update(cpuctx, ctx, false);
> +	__ctx_time_update(cpuctx, ctx, false, 0);
>  }
>  
>  /*
> @@ -3405,7 +3473,7 @@ ctx_sched_out(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t
>  	 *
>  	 * would only update time for the pinned events.
>  	 */
> -	__ctx_time_update(cpuctx, ctx, ctx == &cpuctx->ctx);
> +	__ctx_time_update(cpuctx, ctx, ctx == &cpuctx->ctx, event_type);
>  
>  	/*
>  	 * CPU-release for the below ->is_active store,
> @@ -3431,7 +3499,18 @@ ctx_sched_out(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t
>  			cpuctx->task_ctx = NULL;
>  	}
>  
> -	is_active ^= ctx->is_active; /* changed bits */
> +	if (event_type & EVENT_GUEST) {
> +		/*
> +		 * Schedule out all exclude_guest events of PMU
> +		 * with PERF_PMU_CAP_MEDIATED_VPMU.
> +		 */
> +		is_active = EVENT_ALL;
> +		__update_context_guest_time(ctx, false);
> +		perf_cgroup_set_timestamp(cpuctx, true);
> +		barrier();
> +	} else {
> +		is_active ^= ctx->is_active; /* changed bits */
> +	}
>  
>  	for_each_epc(pmu_ctx, ctx, pmu, event_type)
>  		__pmu_ctx_sched_out(pmu_ctx, is_active);
> @@ -3926,10 +4005,15 @@ static inline void group_update_userpage(struct perf_event *group_event)
>  		event_update_userpage(event);
>  }
>  
> +struct merge_sched_data {
> +	int can_add_hw;
> +	enum event_type_t event_type;
> +};
> +
>  static int merge_sched_in(struct perf_event *event, void *data)
>  {
>  	struct perf_event_context *ctx = event->ctx;
> -	int *can_add_hw = data;
> +	struct merge_sched_data *msd = data;
>  
>  	if (event->state <= PERF_EVENT_STATE_OFF)
>  		return 0;
> @@ -3937,13 +4021,22 @@ static int merge_sched_in(struct perf_event *event, void *data)
>  	if (!event_filter_match(event))
>  		return 0;
>  
> -	if (group_can_go_on(event, *can_add_hw)) {
> +	/*
> +	 * Don't schedule in any host events from PMU with
> +	 * PERF_PMU_CAP_MEDIATED_VPMU, while a guest is running.
> +	 */
> +	if (__this_cpu_read(perf_in_guest) &&
> +	    event->pmu_ctx->pmu->capabilities & PERF_PMU_CAP_MEDIATED_VPMU &&
> +	    !(msd->event_type & EVENT_GUEST))
> +		return 0;
> +
> +	if (group_can_go_on(event, msd->can_add_hw)) {
>  		if (!group_sched_in(event, ctx))
>  			list_add_tail(&event->active_list, get_event_list(event));
>  	}
>  
>  	if (event->state == PERF_EVENT_STATE_INACTIVE) {
> -		*can_add_hw = 0;
> +		msd->can_add_hw = 0;
>  		if (event->attr.pinned) {
>  			perf_cgroup_event_disable(event, ctx);
>  			perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
> @@ -3962,11 +4055,15 @@ static int merge_sched_in(struct perf_event *event, void *data)
>  
>  static void pmu_groups_sched_in(struct perf_event_context *ctx,
>  				struct perf_event_groups *groups,
> -				struct pmu *pmu)
> +				struct pmu *pmu,
> +				enum event_type_t event_type)
>  {
> -	int can_add_hw = 1;
> +	struct merge_sched_data msd = {
> +		.can_add_hw = 1,
> +		.event_type = event_type,
> +	};
>  	visit_groups_merge(ctx, groups, smp_processor_id(), pmu,
> -			   merge_sched_in, &can_add_hw);
> +			   merge_sched_in, &msd);
>  }
>  
>  static void __pmu_ctx_sched_in(struct perf_event_pmu_context *pmu_ctx,
> @@ -3975,9 +4072,9 @@ static void __pmu_ctx_sched_in(struct perf_event_pmu_context *pmu_ctx,
>  	struct perf_event_context *ctx = pmu_ctx->ctx;
>  
>  	if (event_type & EVENT_PINNED)
> -		pmu_groups_sched_in(ctx, &ctx->pinned_groups, pmu_ctx->pmu);
> +		pmu_groups_sched_in(ctx, &ctx->pinned_groups, pmu_ctx->pmu, event_type);
>  	if (event_type & EVENT_FLEXIBLE)
> -		pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu_ctx->pmu);
> +		pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu_ctx->pmu, event_type);
>  }
>  
>  static void
> @@ -3994,9 +4091,11 @@ ctx_sched_in(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t
>  		return;
>  
>  	if (!(is_active & EVENT_TIME)) {
> +		/* EVENT_TIME should be active while the guest runs */
> +		WARN_ON_ONCE(event_type & EVENT_GUEST);
>  		/* start ctx time */
>  		__update_context_time(ctx, false);
> -		perf_cgroup_set_timestamp(cpuctx);
> +		perf_cgroup_set_timestamp(cpuctx, false);
>  		/*
>  		 * CPU-release for the below ->is_active store,
>  		 * see __load_acquire() in perf_event_time_now()
> @@ -4012,7 +4111,23 @@ ctx_sched_in(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t
>  			WARN_ON_ONCE(cpuctx->task_ctx != ctx);
>  	}
>  
> -	is_active ^= ctx->is_active; /* changed bits */
> +	if (event_type & EVENT_GUEST) {
> +		/*
> +		 * Schedule in the required exclude_guest events of PMU
> +		 * with PERF_PMU_CAP_MEDIATED_VPMU.
> +		 */
> +		is_active = event_type & EVENT_ALL;
> +
> +		/*
> +		 * Update ctx time to set the new start time for
> +		 * the exclude_guest events.
> +		 */
> +		update_context_time(ctx);
> +		update_cgrp_time_from_cpuctx(cpuctx, false);
> +		barrier();
> +	} else {
> +		is_active ^= ctx->is_active; /* changed bits */
> +	}
>  
>  	/*
>  	 * First go through the list and put on any pinned groups
> @@ -4020,13 +4135,13 @@ ctx_sched_in(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t
>  	 */
>  	if (is_active & EVENT_PINNED) {
>  		for_each_epc(pmu_ctx, ctx, pmu, event_type)
> -			__pmu_ctx_sched_in(pmu_ctx, EVENT_PINNED);
> +			__pmu_ctx_sched_in(pmu_ctx, EVENT_PINNED | (event_type & EVENT_GUEST));
>  	}
>  
>  	/* Then walk through the lower prio flexible groups */
>  	if (is_active & EVENT_FLEXIBLE) {
>  		for_each_epc(pmu_ctx, ctx, pmu, event_type)
> -			__pmu_ctx_sched_in(pmu_ctx, EVENT_FLEXIBLE);
> +			__pmu_ctx_sched_in(pmu_ctx, EVENT_FLEXIBLE | (event_type & EVENT_GUEST));
>  	}
>  }
>  
> @@ -6285,23 +6400,25 @@ void perf_event_update_userpage(struct perf_event *event)
>  	if (!rb)
>  		goto unlock;
>  
> -	/*
> -	 * compute total_time_enabled, total_time_running
> -	 * based on snapshot values taken when the event
> -	 * was last scheduled in.
> -	 *
> -	 * we cannot simply called update_context_time()
> -	 * because of locking issue as we can be called in
> -	 * NMI context
> -	 */
> -	calc_timer_values(event, &now, &enabled, &running);
> -
> -	userpg = rb->user_page;
>  	/*
>  	 * Disable preemption to guarantee consistent time stamps are stored to
>  	 * the user page.
>  	 */
>  	preempt_disable();
> +
> +	/*
> +	 * compute total_time_enabled, total_time_running
> +	 * based on snapshot values taken when the event
> +	 * was last scheduled in.
> +	 *
> +	 * we cannot simply called update_context_time()
> +	 * because of locking issue as we can be called in
> +	 * NMI context
> +	 */
> +	calc_timer_values(event, &now, &enabled, &running);
> +
> +	userpg = rb->user_page;
> +
>  	++userpg->lock;
>  	barrier();
>  	userpg->index = perf_event_index(event);
> -- 
> 2.49.0.395.g12beb8f557-goog
> 

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 04/38] perf: Add a EVENT_GUEST flag
  2025-05-19  6:58   ` Namhyung Kim
@ 2025-05-20 16:09     ` Liang, Kan
  2025-05-20 17:51       ` Namhyung Kim
  0 siblings, 1 reply; 127+ messages in thread
From: Liang, Kan @ 2025-05-20 16:09 UTC (permalink / raw)
  To: Namhyung Kim, Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Sean Christopherson, Paolo Bonzini, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Liang,
	H. Peter Anvin, linux-perf-users, linux-kernel, kvm,
	linux-kselftest, Yongwei Ma, Xiong Zhang, Dapeng Mi, Jim Mattson,
	Sandipan Das, Zide Chen, Eranian Stephane, Shukla Manali,
	Nikunj Dadhania



On 2025-05-19 2:58 a.m., Namhyung Kim wrote:
> Hello,
> 
> On Mon, Mar 24, 2025 at 05:30:44PM +0000, Mingwei Zhang wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> Current perf doesn't explicitly schedule out all exclude_guest events
>> while the guest is running. There is no problem with the current
>> emulated vPMU. Because perf owns all the PMU counters. It can mask the
>> counter which is assigned to an exclude_guest event when a guest is
>> running (Intel way), or set the corresponding HOSTONLY bit in evsentsel
>> (AMD way). The counter doesn't count when a guest is running.
>>
>> However, either way doesn't work with the introduced passthrough vPMU.
>> A guest owns all the PMU counters when it's running. The host should not
>> mask any counters. The counter may be used by the guest. The evsentsel
>> may be overwritten.
>>
>> Perf should explicitly schedule out all exclude_guest events to release
>> the PMU resources when entering a guest, and resume the counting when
>> exiting the guest.
>>
>> It's possible that an exclude_guest event is created when a guest is
>> running. The new event should not be scheduled in as well.
>>
>> The ctx time is shared among different PMUs. The time cannot be stopped
>> when a guest is running. It is required to calculate the time for events
>> from other PMUs, e.g., uncore events. Add timeguest to track the guest
>> run time. For an exclude_guest event, the elapsed time equals
>> the ctx time - guest time.
>> Cgroup has dedicated times. Use the same method to deduct the guest time
>> from the cgroup time as well.
>>
>> Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>  include/linux/perf_event.h |   6 ++
>>  kernel/events/core.c       | 209 +++++++++++++++++++++++++++++--------
>>  2 files changed, 169 insertions(+), 46 deletions(-)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index a2fd1bdc955c..7bda1e20be12 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -999,6 +999,11 @@ struct perf_event_context {
>>  	 */
>>  	struct perf_time_ctx		time;
>>  
>> +	/*
>> +	 * Context clock, runs when in the guest mode.
>> +	 */
>> +	struct perf_time_ctx		timeguest;
> 
> Why not make it an array as you use it later?

Do you mean
struct perf_time_ctx	times[2]?

I don't see a big benefit of using times[T_GUEST] vs. timeguest.
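
FWIW, a minimal standalone sketch of the layout trick both spellings rely
on (the struct fields and T_* indices are taken from the quoted hunks; the
wrapper struct and helper here are illustrative only, not the kernel code):

#include <stddef.h>
#include <stdint.h>

typedef uint64_t u64;

/* Fields as referenced by update_perf_time_ctx() in the quoted hunk. */
struct perf_time_ctx {
        u64 time;
        u64 stamp;
        u64 offset;
};

#define T_TOTAL 0
#define T_GUEST 1

/* Stand-in for the two adjacent fields in perf_event_context. */
struct ctx_times {
        struct perf_time_ctx time;              /* acts as times[T_TOTAL] */
        struct perf_time_ctx timeguest;         /* acts as times[T_GUEST] */
};

/* Same guarantee the patch encodes with static_assert(). */
_Static_assert(offsetof(struct ctx_times, timeguest) -
               offsetof(struct ctx_times, time) ==
               sizeof(struct perf_time_ctx),
               "time/timeguest must be adjacent to be indexed as an array");

/* exclude_guest time = total ctx time minus time spent in the guest. */
static inline u64 event_time(const struct perf_time_ctx *times, int exclude_guest)
{
        u64 t = times[T_TOTAL].time;

        if (exclude_guest)
                t -= times[T_GUEST].time;
        return t;
}

Either spelling ends up with the same two-element layout, so the accessors
are identical.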

> 
>> +
>>  	/*
>>  	 * These fields let us detect when two contexts have both
>>  	 * been cloned (inherited) from a common ancestor.
>> @@ -1089,6 +1094,7 @@ struct bpf_perf_event_data_kern {
>>   */
>>  struct perf_cgroup_info {
>>  	struct perf_time_ctx		time;
>> +	struct perf_time_ctx		timeguest;
>>  	int				active;
>>  };
>>  
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index e38c8b5e8086..7a2115b2c5c1 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -163,7 +163,8 @@ enum event_type_t {
>>  	/* see ctx_resched() for details */
>>  	EVENT_CPU	= 0x10,
>>  	EVENT_CGROUP	= 0x20,
>> -	EVENT_FLAGS	= EVENT_CGROUP,
>> +	EVENT_GUEST	= 0x40,
> 
> It's not clear to me if this flag is for events to include guests or
> exclude them.  Can you please add a comment?
> 

/*
 * There are guest events. The for_each_epc() iteration can
 * skip those PMUs which don't support guest events via the
 * MEDIATED_VPMU. It is also used to indicate the start/end of
 * guest events to calculate the guest running time.
 */

Thanks,
Kan

> Thanks,
> Namhyung
> 
> 
>> +	EVENT_FLAGS	= EVENT_CGROUP | EVENT_GUEST,
>>  	/* compound helpers */
>>  	EVENT_ALL         = EVENT_FLEXIBLE | EVENT_PINNED,
>>  	EVENT_TIME_FROZEN = EVENT_TIME | EVENT_FROZEN,
>> @@ -435,6 +436,7 @@ static atomic_t nr_include_guest_events __read_mostly;
>>  
>>  static atomic_t nr_mediated_pmu_vms;
>>  static DEFINE_MUTEX(perf_mediated_pmu_mutex);
>> +static DEFINE_PER_CPU(bool, perf_in_guest);
>>  
>>  /* !exclude_guest event of PMU with PERF_PMU_CAP_MEDIATED_VPMU */
>>  static inline bool is_include_guest_event(struct perf_event *event)
>> @@ -738,6 +740,9 @@ static bool perf_skip_pmu_ctx(struct perf_event_pmu_context *pmu_ctx,
>>  {
>>  	if ((event_type & EVENT_CGROUP) && !pmu_ctx->nr_cgroups)
>>  		return true;
>> +	if ((event_type & EVENT_GUEST) &&
>> +	    !(pmu_ctx->pmu->capabilities & PERF_PMU_CAP_MEDIATED_VPMU))
>> +		return true;
>>  	return false;
>>  }
>>  
>> @@ -788,6 +793,39 @@ static inline void update_perf_time_ctx(struct perf_time_ctx *time, u64 now, boo
>>  	WRITE_ONCE(time->offset, time->time - time->stamp);
>>  }
>>  
>> +static_assert(offsetof(struct perf_event_context, timeguest) -
>> +	      offsetof(struct perf_event_context, time) ==
>> +	      sizeof(struct perf_time_ctx));
>> +
>> +#define T_TOTAL		0
>> +#define T_GUEST		1
>> +
>> +static inline u64 __perf_event_time_ctx(struct perf_event *event,
>> +					struct perf_time_ctx *times)
>> +{
>> +	u64 time = times[T_TOTAL].time;
>> +
>> +	if (event->attr.exclude_guest)
>> +		time -= times[T_GUEST].time;
>> +
>> +	return time;
>> +}
>> +
>> +static inline u64 __perf_event_time_ctx_now(struct perf_event *event,
>> +					    struct perf_time_ctx *times,
>> +					    u64 now)
>> +{
>> +	if (event->attr.exclude_guest && __this_cpu_read(perf_in_guest)) {
>> +		/*
>> +		 * (now + times[total].offset) - (now + times[guest].offset) :=
>> +		 * times[total].offset - times[guest].offset
>> +		 */
>> +		return READ_ONCE(times[T_TOTAL].offset) - READ_ONCE(times[T_GUEST].offset);
>> +	}
>> +
>> +	return now + READ_ONCE(times[T_TOTAL].offset);
>> +}
>> +
>>  #ifdef CONFIG_CGROUP_PERF
>>  
>>  static inline bool
>> @@ -824,12 +862,16 @@ static inline int is_cgroup_event(struct perf_event *event)
>>  	return event->cgrp != NULL;
>>  }
>>  
>> +static_assert(offsetof(struct perf_cgroup_info, timeguest) -
>> +	      offsetof(struct perf_cgroup_info, time) ==
>> +	      sizeof(struct perf_time_ctx));
>> +
>>  static inline u64 perf_cgroup_event_time(struct perf_event *event)
>>  {
>>  	struct perf_cgroup_info *t;
>>  
>>  	t = per_cpu_ptr(event->cgrp->info, event->cpu);
>> -	return t->time.time;
>> +	return __perf_event_time_ctx(event, &t->time);
>>  }
>>  
>>  static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
>> @@ -838,9 +880,21 @@ static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
>>  
>>  	t = per_cpu_ptr(event->cgrp->info, event->cpu);
>>  	if (!__load_acquire(&t->active))
>> -		return t->time.time;
>> -	now += READ_ONCE(t->time.offset);
>> -	return now;
>> +		return __perf_event_time_ctx(event, &t->time);
>> +
>> +	return __perf_event_time_ctx_now(event, &t->time, now);
>> +}
>> +
>> +static inline void __update_cgrp_guest_time(struct perf_cgroup_info *info, u64 now, bool adv)
>> +{
>> +	update_perf_time_ctx(&info->timeguest, now, adv);
>> +}
>> +
>> +static inline void update_cgrp_time(struct perf_cgroup_info *info, u64 now)
>> +{
>> +	update_perf_time_ctx(&info->time, now, true);
>> +	if (__this_cpu_read(perf_in_guest))
>> +		__update_cgrp_guest_time(info, now, true);
>>  }
>>  
>>  static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx, bool final)
>> @@ -856,7 +910,7 @@ static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx,
>>  			cgrp = container_of(css, struct perf_cgroup, css);
>>  			info = this_cpu_ptr(cgrp->info);
>>  
>> -			update_perf_time_ctx(&info->time, now, true);
>> +			update_cgrp_time(info, now);
>>  			if (final)
>>  				__store_release(&info->active, 0);
>>  		}
>> @@ -879,11 +933,11 @@ static inline void update_cgrp_time_from_event(struct perf_event *event)
>>  	 * Do not update time when cgroup is not active
>>  	 */
>>  	if (info->active)
>> -		update_perf_time_ctx(&info->time, perf_clock(), true);
>> +		update_cgrp_time(info, perf_clock());
>>  }
>>  
>>  static inline void
>> -perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
>> +perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx, bool guest)
>>  {
>>  	struct perf_event_context *ctx = &cpuctx->ctx;
>>  	struct perf_cgroup *cgrp = cpuctx->cgrp;
>> @@ -903,8 +957,12 @@ perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
>>  	for (css = &cgrp->css; css; css = css->parent) {
>>  		cgrp = container_of(css, struct perf_cgroup, css);
>>  		info = this_cpu_ptr(cgrp->info);
>> -		update_perf_time_ctx(&info->time, ctx->time.stamp, false);
>> -		__store_release(&info->active, 1);
>> +		if (guest) {
>> +			__update_cgrp_guest_time(info, ctx->time.stamp, false);
>> +		} else {
>> +			update_perf_time_ctx(&info->time, ctx->time.stamp, false);
>> +			__store_release(&info->active, 1);
>> +		}
>>  	}
>>  }
>>  
>> @@ -1104,7 +1162,7 @@ static inline int perf_cgroup_connect(pid_t pid, struct perf_event *event,
>>  }
>>  
>>  static inline void
>> -perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
>> +perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx, bool guest)
>>  {
>>  }
>>  
>> @@ -1514,16 +1572,24 @@ static void perf_unpin_context(struct perf_event_context *ctx)
>>   */
>>  static void __update_context_time(struct perf_event_context *ctx, bool adv)
>>  {
>> -	u64 now = perf_clock();
>> +	lockdep_assert_held(&ctx->lock);
>> +
>> +	update_perf_time_ctx(&ctx->time, perf_clock(), adv);
>> +}
>>  
>> +static void __update_context_guest_time(struct perf_event_context *ctx, bool adv)
>> +{
>>  	lockdep_assert_held(&ctx->lock);
>>  
>> -	update_perf_time_ctx(&ctx->time, now, adv);
>> +	/* must be called after __update_context_time(); */
>> +	update_perf_time_ctx(&ctx->timeguest, ctx->time.stamp, adv);
>>  }
>>  
>>  static void update_context_time(struct perf_event_context *ctx)
>>  {
>>  	__update_context_time(ctx, true);
>> +	if (__this_cpu_read(perf_in_guest))
>> +		__update_context_guest_time(ctx, true);
>>  }
>>  
>>  static u64 perf_event_time(struct perf_event *event)
>> @@ -1536,7 +1602,7 @@ static u64 perf_event_time(struct perf_event *event)
>>  	if (is_cgroup_event(event))
>>  		return perf_cgroup_event_time(event);
>>  
>> -	return ctx->time.time;
>> +	return __perf_event_time_ctx(event, &ctx->time);
>>  }
>>  
>>  static u64 perf_event_time_now(struct perf_event *event, u64 now)
>> @@ -1550,10 +1616,9 @@ static u64 perf_event_time_now(struct perf_event *event, u64 now)
>>  		return perf_cgroup_event_time_now(event, now);
>>  
>>  	if (!(__load_acquire(&ctx->is_active) & EVENT_TIME))
>> -		return ctx->time.time;
>> +		return __perf_event_time_ctx(event, &ctx->time);
>>  
>> -	now += READ_ONCE(ctx->time.offset);
>> -	return now;
>> +	return __perf_event_time_ctx_now(event, &ctx->time, now);
>>  }
>>  
>>  static enum event_type_t get_event_type(struct perf_event *event)
>> @@ -2384,20 +2449,23 @@ group_sched_out(struct perf_event *group_event, struct perf_event_context *ctx)
>>  }
>>  
>>  static inline void
>> -__ctx_time_update(struct perf_cpu_context *cpuctx, struct perf_event_context *ctx, bool final)
>> +__ctx_time_update(struct perf_cpu_context *cpuctx, struct perf_event_context *ctx,
>> +		  bool final, enum event_type_t event_type)
>>  {
>>  	if (ctx->is_active & EVENT_TIME) {
>>  		if (ctx->is_active & EVENT_FROZEN)
>>  			return;
>> +
>>  		update_context_time(ctx);
>> -		update_cgrp_time_from_cpuctx(cpuctx, final);
>> +		/* vPMU should not stop time */
>> +		update_cgrp_time_from_cpuctx(cpuctx, !(event_type & EVENT_GUEST) && final);
>>  	}
>>  }
>>  
>>  static inline void
>>  ctx_time_update(struct perf_cpu_context *cpuctx, struct perf_event_context *ctx)
>>  {
>> -	__ctx_time_update(cpuctx, ctx, false);
>> +	__ctx_time_update(cpuctx, ctx, false, 0);
>>  }
>>  
>>  /*
>> @@ -3405,7 +3473,7 @@ ctx_sched_out(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t
>>  	 *
>>  	 * would only update time for the pinned events.
>>  	 */
>> -	__ctx_time_update(cpuctx, ctx, ctx == &cpuctx->ctx);
>> +	__ctx_time_update(cpuctx, ctx, ctx == &cpuctx->ctx, event_type);
>>  
>>  	/*
>>  	 * CPU-release for the below ->is_active store,
>> @@ -3431,7 +3499,18 @@ ctx_sched_out(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t
>>  			cpuctx->task_ctx = NULL;
>>  	}
>>  
>> -	is_active ^= ctx->is_active; /* changed bits */
>> +	if (event_type & EVENT_GUEST) {
>> +		/*
>> +		 * Schedule out all exclude_guest events of PMU
>> +		 * with PERF_PMU_CAP_MEDIATED_VPMU.
>> +		 */
>> +		is_active = EVENT_ALL;
>> +		__update_context_guest_time(ctx, false);
>> +		perf_cgroup_set_timestamp(cpuctx, true);
>> +		barrier();
>> +	} else {
>> +		is_active ^= ctx->is_active; /* changed bits */
>> +	}
>>  
>>  	for_each_epc(pmu_ctx, ctx, pmu, event_type)
>>  		__pmu_ctx_sched_out(pmu_ctx, is_active);
>> @@ -3926,10 +4005,15 @@ static inline void group_update_userpage(struct perf_event *group_event)
>>  		event_update_userpage(event);
>>  }
>>  
>> +struct merge_sched_data {
>> +	int can_add_hw;
>> +	enum event_type_t event_type;
>> +};
>> +
>>  static int merge_sched_in(struct perf_event *event, void *data)
>>  {
>>  	struct perf_event_context *ctx = event->ctx;
>> -	int *can_add_hw = data;
>> +	struct merge_sched_data *msd = data;
>>  
>>  	if (event->state <= PERF_EVENT_STATE_OFF)
>>  		return 0;
>> @@ -3937,13 +4021,22 @@ static int merge_sched_in(struct perf_event *event, void *data)
>>  	if (!event_filter_match(event))
>>  		return 0;
>>  
>> -	if (group_can_go_on(event, *can_add_hw)) {
>> +	/*
>> +	 * Don't schedule in any host events from PMU with
>> +	 * PERF_PMU_CAP_MEDIATED_VPMU, while a guest is running.
>> +	 */
>> +	if (__this_cpu_read(perf_in_guest) &&
>> +	    event->pmu_ctx->pmu->capabilities & PERF_PMU_CAP_MEDIATED_VPMU &&
>> +	    !(msd->event_type & EVENT_GUEST))
>> +		return 0;
>> +
>> +	if (group_can_go_on(event, msd->can_add_hw)) {
>>  		if (!group_sched_in(event, ctx))
>>  			list_add_tail(&event->active_list, get_event_list(event));
>>  	}
>>  
>>  	if (event->state == PERF_EVENT_STATE_INACTIVE) {
>> -		*can_add_hw = 0;
>> +		msd->can_add_hw = 0;
>>  		if (event->attr.pinned) {
>>  			perf_cgroup_event_disable(event, ctx);
>>  			perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
>> @@ -3962,11 +4055,15 @@ static int merge_sched_in(struct perf_event *event, void *data)
>>  
>>  static void pmu_groups_sched_in(struct perf_event_context *ctx,
>>  				struct perf_event_groups *groups,
>> -				struct pmu *pmu)
>> +				struct pmu *pmu,
>> +				enum event_type_t event_type)
>>  {
>> -	int can_add_hw = 1;
>> +	struct merge_sched_data msd = {
>> +		.can_add_hw = 1,
>> +		.event_type = event_type,
>> +	};
>>  	visit_groups_merge(ctx, groups, smp_processor_id(), pmu,
>> -			   merge_sched_in, &can_add_hw);
>> +			   merge_sched_in, &msd);
>>  }
>>  
>>  static void __pmu_ctx_sched_in(struct perf_event_pmu_context *pmu_ctx,
>> @@ -3975,9 +4072,9 @@ static void __pmu_ctx_sched_in(struct perf_event_pmu_context *pmu_ctx,
>>  	struct perf_event_context *ctx = pmu_ctx->ctx;
>>  
>>  	if (event_type & EVENT_PINNED)
>> -		pmu_groups_sched_in(ctx, &ctx->pinned_groups, pmu_ctx->pmu);
>> +		pmu_groups_sched_in(ctx, &ctx->pinned_groups, pmu_ctx->pmu, event_type);
>>  	if (event_type & EVENT_FLEXIBLE)
>> -		pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu_ctx->pmu);
>> +		pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu_ctx->pmu, event_type);
>>  }
>>  
>>  static void
>> @@ -3994,9 +4091,11 @@ ctx_sched_in(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t
>>  		return;
>>  
>>  	if (!(is_active & EVENT_TIME)) {
>> +		/* EVENT_TIME should be active while the guest runs */
>> +		WARN_ON_ONCE(event_type & EVENT_GUEST);
>>  		/* start ctx time */
>>  		__update_context_time(ctx, false);
>> -		perf_cgroup_set_timestamp(cpuctx);
>> +		perf_cgroup_set_timestamp(cpuctx, false);
>>  		/*
>>  		 * CPU-release for the below ->is_active store,
>>  		 * see __load_acquire() in perf_event_time_now()
>> @@ -4012,7 +4111,23 @@ ctx_sched_in(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t
>>  			WARN_ON_ONCE(cpuctx->task_ctx != ctx);
>>  	}
>>  
>> -	is_active ^= ctx->is_active; /* changed bits */
>> +	if (event_type & EVENT_GUEST) {
>> +		/*
>> +		 * Schedule in the required exclude_guest events of PMU
>> +		 * with PERF_PMU_CAP_MEDIATED_VPMU.
>> +		 */
>> +		is_active = event_type & EVENT_ALL;
>> +
>> +		/*
>> +		 * Update ctx time to set the new start time for
>> +		 * the exclude_guest events.
>> +		 */
>> +		update_context_time(ctx);
>> +		update_cgrp_time_from_cpuctx(cpuctx, false);
>> +		barrier();
>> +	} else {
>> +		is_active ^= ctx->is_active; /* changed bits */
>> +	}
>>  
>>  	/*
>>  	 * First go through the list and put on any pinned groups
>> @@ -4020,13 +4135,13 @@ ctx_sched_in(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t
>>  	 */
>>  	if (is_active & EVENT_PINNED) {
>>  		for_each_epc(pmu_ctx, ctx, pmu, event_type)
>> -			__pmu_ctx_sched_in(pmu_ctx, EVENT_PINNED);
>> +			__pmu_ctx_sched_in(pmu_ctx, EVENT_PINNED | (event_type & EVENT_GUEST));
>>  	}
>>  
>>  	/* Then walk through the lower prio flexible groups */
>>  	if (is_active & EVENT_FLEXIBLE) {
>>  		for_each_epc(pmu_ctx, ctx, pmu, event_type)
>> -			__pmu_ctx_sched_in(pmu_ctx, EVENT_FLEXIBLE);
>> +			__pmu_ctx_sched_in(pmu_ctx, EVENT_FLEXIBLE | (event_type & EVENT_GUEST));
>>  	}
>>  }
>>  
>> @@ -6285,23 +6400,25 @@ void perf_event_update_userpage(struct perf_event *event)
>>  	if (!rb)
>>  		goto unlock;
>>  
>> -	/*
>> -	 * compute total_time_enabled, total_time_running
>> -	 * based on snapshot values taken when the event
>> -	 * was last scheduled in.
>> -	 *
>> -	 * we cannot simply called update_context_time()
>> -	 * because of locking issue as we can be called in
>> -	 * NMI context
>> -	 */
>> -	calc_timer_values(event, &now, &enabled, &running);
>> -
>> -	userpg = rb->user_page;
>>  	/*
>>  	 * Disable preemption to guarantee consistent time stamps are stored to
>>  	 * the user page.
>>  	 */
>>  	preempt_disable();
>> +
>> +	/*
>> +	 * compute total_time_enabled, total_time_running
>> +	 * based on snapshot values taken when the event
>> +	 * was last scheduled in.
>> +	 *
>> +	 * we cannot simply called update_context_time()
>> +	 * because of locking issue as we can be called in
>> +	 * NMI context
>> +	 */
>> +	calc_timer_values(event, &now, &enabled, &running);
>> +
>> +	userpg = rb->user_page;
>> +
>>  	++userpg->lock;
>>  	barrier();
>>  	userpg->index = perf_event_index(event);
>> -- 
>> 2.49.0.395.g12beb8f557-goog
>>
> 


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 04/38] perf: Add a EVENT_GUEST flag
  2025-05-20 16:09     ` Liang, Kan
@ 2025-05-20 17:51       ` Namhyung Kim
  2025-05-20 18:50         ` Liang, Kan
  0 siblings, 1 reply; 127+ messages in thread
From: Namhyung Kim @ 2025-05-20 17:51 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Mingwei Zhang, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Sean Christopherson, Paolo Bonzini,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Yongwei Ma, Xiong Zhang,
	Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen, Eranian Stephane,
	Shukla Manali, Nikunj Dadhania

On Tue, May 20, 2025 at 12:09:02PM -0400, Liang, Kan wrote:
> 
> 
> On 2025-05-19 2:58 a.m., Namhyung Kim wrote:
> > Hello,
> > 
> > On Mon, Mar 24, 2025 at 05:30:44PM +0000, Mingwei Zhang wrote:
> >> From: Kan Liang <kan.liang@linux.intel.com>
> >>
> >> Current perf doesn't explicitly schedule out all exclude_guest events
> >> while the guest is running. There is no problem with the current
> >> emulated vPMU. Because perf owns all the PMU counters. It can mask the
> >> counter which is assigned to an exclude_guest event when a guest is
> >> running (Intel way), or set the corresponding HOSTONLY bit in evsentsel
> >> (AMD way). The counter doesn't count when a guest is running.
> >>
> >> However, either way doesn't work with the introduced passthrough vPMU.
> >> A guest owns all the PMU counters when it's running. The host should not
> >> mask any counters. The counter may be used by the guest. The evsentsel
> >> may be overwritten.
> >>
> >> Perf should explicitly schedule out all exclude_guest events to release
> >> the PMU resources when entering a guest, and resume the counting when
> >> exiting the guest.
> >>
> >> It's possible that an exclude_guest event is created when a guest is
> >> running. The new event should not be scheduled in as well.
> >>
> >> The ctx time is shared among different PMUs. The time cannot be stopped
> >> when a guest is running. It is required to calculate the time for events
> >> from other PMUs, e.g., uncore events. Add timeguest to track the guest
> >> run time. For an exclude_guest event, the elapsed time equals
> >> the ctx time - guest time.
> >> Cgroup has dedicated times. Use the same method to deduct the guest time
> >> from the cgroup time as well.
> >>
> >> Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> >> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> >> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> >> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> >> ---
> >>  include/linux/perf_event.h |   6 ++
> >>  kernel/events/core.c       | 209 +++++++++++++++++++++++++++++--------
> >>  2 files changed, 169 insertions(+), 46 deletions(-)
> >>
> >> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> >> index a2fd1bdc955c..7bda1e20be12 100644
> >> --- a/include/linux/perf_event.h
> >> +++ b/include/linux/perf_event.h
> >> @@ -999,6 +999,11 @@ struct perf_event_context {
> >>  	 */
> >>  	struct perf_time_ctx		time;
> >>  
> >> +	/*
> >> +	 * Context clock, runs when in the guest mode.
> >> +	 */
> >> +	struct perf_time_ctx		timeguest;
> > 
> > Why not make it an array as you use it later?
> 
> Do you mean
> struct perf_time_ctx	times[2]?

Yep.

> 
> I don't see a big benefit of using times[T_GUEST] vs. timeguest.

No big benefits.  But I just think it's natural to make it an array if
you want to access them as if it's an array. :)

> 
> > 
> >> +
> >>  	/*
> >>  	 * These fields let us detect when two contexts have both
> >>  	 * been cloned (inherited) from a common ancestor.
> >> @@ -1089,6 +1094,7 @@ struct bpf_perf_event_data_kern {
> >>   */
> >>  struct perf_cgroup_info {
> >>  	struct perf_time_ctx		time;
> >> +	struct perf_time_ctx		timeguest;
> >>  	int				active;
> >>  };
> >>  
> >> diff --git a/kernel/events/core.c b/kernel/events/core.c
> >> index e38c8b5e8086..7a2115b2c5c1 100644
> >> --- a/kernel/events/core.c
> >> +++ b/kernel/events/core.c
> >> @@ -163,7 +163,8 @@ enum event_type_t {
> >>  	/* see ctx_resched() for details */
> >>  	EVENT_CPU	= 0x10,
> >>  	EVENT_CGROUP	= 0x20,
> >> -	EVENT_FLAGS	= EVENT_CGROUP,
> >> +	EVENT_GUEST	= 0x40,
> > 
> > It's not clear to me if this flag is for events to include guests or
> > exclude them.  Can you please add a comment?
> > 
> 
> /*
>  * There are guest events. The for_each_epc() iteration can
>  * skip those PMUs which don't support guest events via the
>  * MEDIATED_VPMU. It is also used to indicate the start/end of
>  * guest events to calculate the guest running time.
>  */

Thanks for the explanation.  So it's for events with !exclude_guest on
host and to do some operation only for host-only events on mediated
vPMUs.

Thanks,
Namhyung

> > 
> >> +	EVENT_FLAGS	= EVENT_CGROUP | EVENT_GUEST,
> >>  	/* compound helpers */
> >>  	EVENT_ALL         = EVENT_FLEXIBLE | EVENT_PINNED,
> >>  	EVENT_TIME_FROZEN = EVENT_TIME | EVENT_FROZEN,
> >> @@ -435,6 +436,7 @@ static atomic_t nr_include_guest_events __read_mostly;
> >>  
> >>  static atomic_t nr_mediated_pmu_vms;
> >>  static DEFINE_MUTEX(perf_mediated_pmu_mutex);
> >> +static DEFINE_PER_CPU(bool, perf_in_guest);
> >>  
> >>  /* !exclude_guest event of PMU with PERF_PMU_CAP_MEDIATED_VPMU */
> >>  static inline bool is_include_guest_event(struct perf_event *event)
> >> @@ -738,6 +740,9 @@ static bool perf_skip_pmu_ctx(struct perf_event_pmu_context *pmu_ctx,
> >>  {
> >>  	if ((event_type & EVENT_CGROUP) && !pmu_ctx->nr_cgroups)
> >>  		return true;
> >> +	if ((event_type & EVENT_GUEST) &&
> >> +	    !(pmu_ctx->pmu->capabilities & PERF_PMU_CAP_MEDIATED_VPMU))
> >> +		return true;
> >>  	return false;
> >>  }
> >>  
> >> @@ -788,6 +793,39 @@ static inline void update_perf_time_ctx(struct perf_time_ctx *time, u64 now, boo
> >>  	WRITE_ONCE(time->offset, time->time - time->stamp);
> >>  }
> >>  
> >> +static_assert(offsetof(struct perf_event_context, timeguest) -
> >> +	      offsetof(struct perf_event_context, time) ==
> >> +	      sizeof(struct perf_time_ctx));
> >> +
> >> +#define T_TOTAL		0
> >> +#define T_GUEST		1
> >> +
> >> +static inline u64 __perf_event_time_ctx(struct perf_event *event,
> >> +					struct perf_time_ctx *times)
> >> +{
> >> +	u64 time = times[T_TOTAL].time;
> >> +
> >> +	if (event->attr.exclude_guest)
> >> +		time -= times[T_GUEST].time;
> >> +
> >> +	return time;
> >> +}
> >> +
> >> +static inline u64 __perf_event_time_ctx_now(struct perf_event *event,
> >> +					    struct perf_time_ctx *times,
> >> +					    u64 now)
> >> +{
> >> +	if (event->attr.exclude_guest && __this_cpu_read(perf_in_guest)) {
> >> +		/*
> >> +		 * (now + times[total].offset) - (now + times[guest].offset) :=
> >> +		 * times[total].offset - times[guest].offset
> >> +		 */
> >> +		return READ_ONCE(times[T_TOTAL].offset) - READ_ONCE(times[T_GUEST].offset);
> >> +	}
> >> +
> >> +	return now + READ_ONCE(times[T_TOTAL].offset);
> >> +}
> >> +
> >>  #ifdef CONFIG_CGROUP_PERF
> >>  
> >>  static inline bool
> >> @@ -824,12 +862,16 @@ static inline int is_cgroup_event(struct perf_event *event)
> >>  	return event->cgrp != NULL;
> >>  }
> >>  
> >> +static_assert(offsetof(struct perf_cgroup_info, timeguest) -
> >> +	      offsetof(struct perf_cgroup_info, time) ==
> >> +	      sizeof(struct perf_time_ctx));
> >> +
> >>  static inline u64 perf_cgroup_event_time(struct perf_event *event)
> >>  {
> >>  	struct perf_cgroup_info *t;
> >>  
> >>  	t = per_cpu_ptr(event->cgrp->info, event->cpu);
> >> -	return t->time.time;
> >> +	return __perf_event_time_ctx(event, &t->time);
> >>  }
> >>  
> >>  static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
> >> @@ -838,9 +880,21 @@ static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
> >>  
> >>  	t = per_cpu_ptr(event->cgrp->info, event->cpu);
> >>  	if (!__load_acquire(&t->active))
> >> -		return t->time.time;
> >> -	now += READ_ONCE(t->time.offset);
> >> -	return now;
> >> +		return __perf_event_time_ctx(event, &t->time);
> >> +
> >> +	return __perf_event_time_ctx_now(event, &t->time, now);
> >> +}
> >> +
> >> +static inline void __update_cgrp_guest_time(struct perf_cgroup_info *info, u64 now, bool adv)
> >> +{
> >> +	update_perf_time_ctx(&info->timeguest, now, adv);
> >> +}
> >> +
> >> +static inline void update_cgrp_time(struct perf_cgroup_info *info, u64 now)
> >> +{
> >> +	update_perf_time_ctx(&info->time, now, true);
> >> +	if (__this_cpu_read(perf_in_guest))
> >> +		__update_cgrp_guest_time(info, now, true);
> >>  }
> >>  
> >>  static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx, bool final)
> >> @@ -856,7 +910,7 @@ static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx,
> >>  			cgrp = container_of(css, struct perf_cgroup, css);
> >>  			info = this_cpu_ptr(cgrp->info);
> >>  
> >> -			update_perf_time_ctx(&info->time, now, true);
> >> +			update_cgrp_time(info, now);
> >>  			if (final)
> >>  				__store_release(&info->active, 0);
> >>  		}
> >> @@ -879,11 +933,11 @@ static inline void update_cgrp_time_from_event(struct perf_event *event)
> >>  	 * Do not update time when cgroup is not active
> >>  	 */
> >>  	if (info->active)
> >> -		update_perf_time_ctx(&info->time, perf_clock(), true);
> >> +		update_cgrp_time(info, perf_clock());
> >>  }
> >>  
> >>  static inline void
> >> -perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
> >> +perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx, bool guest)
> >>  {
> >>  	struct perf_event_context *ctx = &cpuctx->ctx;
> >>  	struct perf_cgroup *cgrp = cpuctx->cgrp;
> >> @@ -903,8 +957,12 @@ perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
> >>  	for (css = &cgrp->css; css; css = css->parent) {
> >>  		cgrp = container_of(css, struct perf_cgroup, css);
> >>  		info = this_cpu_ptr(cgrp->info);
> >> -		update_perf_time_ctx(&info->time, ctx->time.stamp, false);
> >> -		__store_release(&info->active, 1);
> >> +		if (guest) {
> >> +			__update_cgrp_guest_time(info, ctx->time.stamp, false);
> >> +		} else {
> >> +			update_perf_time_ctx(&info->time, ctx->time.stamp, false);
> >> +			__store_release(&info->active, 1);
> >> +		}
> >>  	}
> >>  }
> >>  
> >> @@ -1104,7 +1162,7 @@ static inline int perf_cgroup_connect(pid_t pid, struct perf_event *event,
> >>  }
> >>  
> >>  static inline void
> >> -perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
> >> +perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx, bool guest)
> >>  {
> >>  }
> >>  
> >> @@ -1514,16 +1572,24 @@ static void perf_unpin_context(struct perf_event_context *ctx)
> >>   */
> >>  static void __update_context_time(struct perf_event_context *ctx, bool adv)
> >>  {
> >> -	u64 now = perf_clock();
> >> +	lockdep_assert_held(&ctx->lock);
> >> +
> >> +	update_perf_time_ctx(&ctx->time, perf_clock(), adv);
> >> +}
> >>  
> >> +static void __update_context_guest_time(struct perf_event_context *ctx, bool adv)
> >> +{
> >>  	lockdep_assert_held(&ctx->lock);
> >>  
> >> -	update_perf_time_ctx(&ctx->time, now, adv);
> >> +	/* must be called after __update_context_time(); */
> >> +	update_perf_time_ctx(&ctx->timeguest, ctx->time.stamp, adv);
> >>  }
> >>  
> >>  static void update_context_time(struct perf_event_context *ctx)
> >>  {
> >>  	__update_context_time(ctx, true);
> >> +	if (__this_cpu_read(perf_in_guest))
> >> +		__update_context_guest_time(ctx, true);
> >>  }
> >>  
> >>  static u64 perf_event_time(struct perf_event *event)
> >> @@ -1536,7 +1602,7 @@ static u64 perf_event_time(struct perf_event *event)
> >>  	if (is_cgroup_event(event))
> >>  		return perf_cgroup_event_time(event);
> >>  
> >> -	return ctx->time.time;
> >> +	return __perf_event_time_ctx(event, &ctx->time);
> >>  }
> >>  
> >>  static u64 perf_event_time_now(struct perf_event *event, u64 now)
> >> @@ -1550,10 +1616,9 @@ static u64 perf_event_time_now(struct perf_event *event, u64 now)
> >>  		return perf_cgroup_event_time_now(event, now);
> >>  
> >>  	if (!(__load_acquire(&ctx->is_active) & EVENT_TIME))
> >> -		return ctx->time.time;
> >> +		return __perf_event_time_ctx(event, &ctx->time);
> >>  
> >> -	now += READ_ONCE(ctx->time.offset);
> >> -	return now;
> >> +	return __perf_event_time_ctx_now(event, &ctx->time, now);
> >>  }
> >>  
> >>  static enum event_type_t get_event_type(struct perf_event *event)
> >> @@ -2384,20 +2449,23 @@ group_sched_out(struct perf_event *group_event, struct perf_event_context *ctx)
> >>  }
> >>  
> >>  static inline void
> >> -__ctx_time_update(struct perf_cpu_context *cpuctx, struct perf_event_context *ctx, bool final)
> >> +__ctx_time_update(struct perf_cpu_context *cpuctx, struct perf_event_context *ctx,
> >> +		  bool final, enum event_type_t event_type)
> >>  {
> >>  	if (ctx->is_active & EVENT_TIME) {
> >>  		if (ctx->is_active & EVENT_FROZEN)
> >>  			return;
> >> +
> >>  		update_context_time(ctx);
> >> -		update_cgrp_time_from_cpuctx(cpuctx, final);
> >> +		/* vPMU should not stop time */
> >> +		update_cgrp_time_from_cpuctx(cpuctx, !(event_type & EVENT_GUEST) && final);
> >>  	}
> >>  }
> >>  
> >>  static inline void
> >>  ctx_time_update(struct perf_cpu_context *cpuctx, struct perf_event_context *ctx)
> >>  {
> >> -	__ctx_time_update(cpuctx, ctx, false);
> >> +	__ctx_time_update(cpuctx, ctx, false, 0);
> >>  }
> >>  
> >>  /*
> >> @@ -3405,7 +3473,7 @@ ctx_sched_out(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t
> >>  	 *
> >>  	 * would only update time for the pinned events.
> >>  	 */
> >> -	__ctx_time_update(cpuctx, ctx, ctx == &cpuctx->ctx);
> >> +	__ctx_time_update(cpuctx, ctx, ctx == &cpuctx->ctx, event_type);
> >>  
> >>  	/*
> >>  	 * CPU-release for the below ->is_active store,
> >> @@ -3431,7 +3499,18 @@ ctx_sched_out(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t
> >>  			cpuctx->task_ctx = NULL;
> >>  	}
> >>  
> >> -	is_active ^= ctx->is_active; /* changed bits */
> >> +	if (event_type & EVENT_GUEST) {
> >> +		/*
> >> +		 * Schedule out all exclude_guest events of PMU
> >> +		 * with PERF_PMU_CAP_MEDIATED_VPMU.
> >> +		 */
> >> +		is_active = EVENT_ALL;
> >> +		__update_context_guest_time(ctx, false);
> >> +		perf_cgroup_set_timestamp(cpuctx, true);
> >> +		barrier();
> >> +	} else {
> >> +		is_active ^= ctx->is_active; /* changed bits */
> >> +	}
> >>  
> >>  	for_each_epc(pmu_ctx, ctx, pmu, event_type)
> >>  		__pmu_ctx_sched_out(pmu_ctx, is_active);
> >> @@ -3926,10 +4005,15 @@ static inline void group_update_userpage(struct perf_event *group_event)
> >>  		event_update_userpage(event);
> >>  }
> >>  
> >> +struct merge_sched_data {
> >> +	int can_add_hw;
> >> +	enum event_type_t event_type;
> >> +};
> >> +
> >>  static int merge_sched_in(struct perf_event *event, void *data)
> >>  {
> >>  	struct perf_event_context *ctx = event->ctx;
> >> -	int *can_add_hw = data;
> >> +	struct merge_sched_data *msd = data;
> >>  
> >>  	if (event->state <= PERF_EVENT_STATE_OFF)
> >>  		return 0;
> >> @@ -3937,13 +4021,22 @@ static int merge_sched_in(struct perf_event *event, void *data)
> >>  	if (!event_filter_match(event))
> >>  		return 0;
> >>  
> >> -	if (group_can_go_on(event, *can_add_hw)) {
> >> +	/*
> >> +	 * Don't schedule in any host events from PMU with
> >> +	 * PERF_PMU_CAP_MEDIATED_VPMU, while a guest is running.
> >> +	 */
> >> +	if (__this_cpu_read(perf_in_guest) &&
> >> +	    event->pmu_ctx->pmu->capabilities & PERF_PMU_CAP_MEDIATED_VPMU &&
> >> +	    !(msd->event_type & EVENT_GUEST))
> >> +		return 0;
> >> +
> >> +	if (group_can_go_on(event, msd->can_add_hw)) {
> >>  		if (!group_sched_in(event, ctx))
> >>  			list_add_tail(&event->active_list, get_event_list(event));
> >>  	}
> >>  
> >>  	if (event->state == PERF_EVENT_STATE_INACTIVE) {
> >> -		*can_add_hw = 0;
> >> +		msd->can_add_hw = 0;
> >>  		if (event->attr.pinned) {
> >>  			perf_cgroup_event_disable(event, ctx);
> >>  			perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
> >> @@ -3962,11 +4055,15 @@ static int merge_sched_in(struct perf_event *event, void *data)
> >>  
> >>  static void pmu_groups_sched_in(struct perf_event_context *ctx,
> >>  				struct perf_event_groups *groups,
> >> -				struct pmu *pmu)
> >> +				struct pmu *pmu,
> >> +				enum event_type_t event_type)
> >>  {
> >> -	int can_add_hw = 1;
> >> +	struct merge_sched_data msd = {
> >> +		.can_add_hw = 1,
> >> +		.event_type = event_type,
> >> +	};
> >>  	visit_groups_merge(ctx, groups, smp_processor_id(), pmu,
> >> -			   merge_sched_in, &can_add_hw);
> >> +			   merge_sched_in, &msd);
> >>  }
> >>  
> >>  static void __pmu_ctx_sched_in(struct perf_event_pmu_context *pmu_ctx,
> >> @@ -3975,9 +4072,9 @@ static void __pmu_ctx_sched_in(struct perf_event_pmu_context *pmu_ctx,
> >>  	struct perf_event_context *ctx = pmu_ctx->ctx;
> >>  
> >>  	if (event_type & EVENT_PINNED)
> >> -		pmu_groups_sched_in(ctx, &ctx->pinned_groups, pmu_ctx->pmu);
> >> +		pmu_groups_sched_in(ctx, &ctx->pinned_groups, pmu_ctx->pmu, event_type);
> >>  	if (event_type & EVENT_FLEXIBLE)
> >> -		pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu_ctx->pmu);
> >> +		pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu_ctx->pmu, event_type);
> >>  }
> >>  
> >>  static void
> >> @@ -3994,9 +4091,11 @@ ctx_sched_in(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t
> >>  		return;
> >>  
> >>  	if (!(is_active & EVENT_TIME)) {
> >> +		/* EVENT_TIME should be active while the guest runs */
> >> +		WARN_ON_ONCE(event_type & EVENT_GUEST);
> >>  		/* start ctx time */
> >>  		__update_context_time(ctx, false);
> >> -		perf_cgroup_set_timestamp(cpuctx);
> >> +		perf_cgroup_set_timestamp(cpuctx, false);
> >>  		/*
> >>  		 * CPU-release for the below ->is_active store,
> >>  		 * see __load_acquire() in perf_event_time_now()
> >> @@ -4012,7 +4111,23 @@ ctx_sched_in(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t
> >>  			WARN_ON_ONCE(cpuctx->task_ctx != ctx);
> >>  	}
> >>  
> >> -	is_active ^= ctx->is_active; /* changed bits */
> >> +	if (event_type & EVENT_GUEST) {
> >> +		/*
> >> +		 * Schedule in the required exclude_guest events of PMU
> >> +		 * with PERF_PMU_CAP_MEDIATED_VPMU.
> >> +		 */
> >> +		is_active = event_type & EVENT_ALL;
> >> +
> >> +		/*
> >> +		 * Update ctx time to set the new start time for
> >> +		 * the exclude_guest events.
> >> +		 */
> >> +		update_context_time(ctx);
> >> +		update_cgrp_time_from_cpuctx(cpuctx, false);
> >> +		barrier();
> >> +	} else {
> >> +		is_active ^= ctx->is_active; /* changed bits */
> >> +	}
> >>  
> >>  	/*
> >>  	 * First go through the list and put on any pinned groups
> >> @@ -4020,13 +4135,13 @@ ctx_sched_in(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t
> >>  	 */
> >>  	if (is_active & EVENT_PINNED) {
> >>  		for_each_epc(pmu_ctx, ctx, pmu, event_type)
> >> -			__pmu_ctx_sched_in(pmu_ctx, EVENT_PINNED);
> >> +			__pmu_ctx_sched_in(pmu_ctx, EVENT_PINNED | (event_type & EVENT_GUEST));
> >>  	}
> >>  
> >>  	/* Then walk through the lower prio flexible groups */
> >>  	if (is_active & EVENT_FLEXIBLE) {
> >>  		for_each_epc(pmu_ctx, ctx, pmu, event_type)
> >> -			__pmu_ctx_sched_in(pmu_ctx, EVENT_FLEXIBLE);
> >> +			__pmu_ctx_sched_in(pmu_ctx, EVENT_FLEXIBLE | (event_type & EVENT_GUEST));
> >>  	}
> >>  }
> >>  
> >> @@ -6285,23 +6400,25 @@ void perf_event_update_userpage(struct perf_event *event)
> >>  	if (!rb)
> >>  		goto unlock;
> >>  
> >> -	/*
> >> -	 * compute total_time_enabled, total_time_running
> >> -	 * based on snapshot values taken when the event
> >> -	 * was last scheduled in.
> >> -	 *
> >> -	 * we cannot simply called update_context_time()
> >> -	 * because of locking issue as we can be called in
> >> -	 * NMI context
> >> -	 */
> >> -	calc_timer_values(event, &now, &enabled, &running);
> >> -
> >> -	userpg = rb->user_page;
> >>  	/*
> >>  	 * Disable preemption to guarantee consistent time stamps are stored to
> >>  	 * the user page.
> >>  	 */
> >>  	preempt_disable();
> >> +
> >> +	/*
> >> +	 * compute total_time_enabled, total_time_running
> >> +	 * based on snapshot values taken when the event
> >> +	 * was last scheduled in.
> >> +	 *
> >> +	 * we cannot simply called update_context_time()
> >> +	 * because of locking issue as we can be called in
> >> +	 * NMI context
> >> +	 */
> >> +	calc_timer_values(event, &now, &enabled, &running);
> >> +
> >> +	userpg = rb->user_page;
> >> +
> >>  	++userpg->lock;
> >>  	barrier();
> >>  	userpg->index = perf_event_index(event);
> >> -- 
> >> 2.49.0.395.g12beb8f557-goog
> >>
> > 
> 

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 04/38] perf: Add a EVENT_GUEST flag
  2025-05-20 17:51       ` Namhyung Kim
@ 2025-05-20 18:50         ` Liang, Kan
  0 siblings, 0 replies; 127+ messages in thread
From: Liang, Kan @ 2025-05-20 18:50 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Mingwei Zhang, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Sean Christopherson, Paolo Bonzini,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Yongwei Ma, Xiong Zhang,
	Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen, Eranian Stephane,
	Shukla Manali, Nikunj Dadhania



On 2025-05-20 1:51 p.m., Namhyung Kim wrote:
>>>> @@ -1089,6 +1094,7 @@ struct bpf_perf_event_data_kern {
>>>>   */
>>>>  struct perf_cgroup_info {
>>>>  	struct perf_time_ctx		time;
>>>> +	struct perf_time_ctx		timeguest;
>>>>  	int				active;
>>>>  };
>>>>  
>>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>>> index e38c8b5e8086..7a2115b2c5c1 100644
>>>> --- a/kernel/events/core.c
>>>> +++ b/kernel/events/core.c
>>>> @@ -163,7 +163,8 @@ enum event_type_t {
>>>>  	/* see ctx_resched() for details */
>>>>  	EVENT_CPU	= 0x10,
>>>>  	EVENT_CGROUP	= 0x20,
>>>> -	EVENT_FLAGS	= EVENT_CGROUP,
>>>> +	EVENT_GUEST	= 0x40,
>>> It's not clear to me if this flag is for events to include guests or
>>> exclude them.  Can you please add a comment?
>>>
>> /*
>>  * There are guest events. The for_each_epc() iteration can
>>  * skip those PMUs which don't support guest events via the
>>  * MEDIATED_VPMU. It is also used to indicate the start/end of
>>  * guest events to calculate the guest running time.
>>  */
> Thanks for the explanation.  So it's for events with !exclude_guest on
> host 

The above "guest events" means the events in a guest. KVM should only
invoke the interface when a guest requires the PMU.

For the host, for now, only events with exclude_guest are supported.
A !exclude_guest event on the host must fail to be created if there
is a running VM.
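
A hypothetical sketch of that creation-time rule, reusing the helpers
visible in the quoted hunks (is_include_guest_event() and
nr_mediated_pmu_vms); the function name, call site and error code are
assumptions, not the actual patch:

/*
 * Reject a !exclude_guest event on a MEDIATED_VPMU-capable PMU
 * while any mediated-PMU VM exists.
 */
static int reject_include_guest_event(struct perf_event *event)
{
	if (is_include_guest_event(event) &&
	    atomic_read(&nr_mediated_pmu_vms))
		return -EBUSY;

	return 0;
}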

> and to do some operation only for host-only events on mediated
> vPMUs.

Yes.

Update the comments as below.

/*
 * There are events in a guest enabled with the MEDIATED_VPMU.
 * The flag can be used in two places.
 * - The for_each_epc() iteration can skip those PMUs which
 *   don't support the events in a guest via the MEDIATED_VPMU.
 * - Indicate the start/end point of the events in a guest.
 *   The guest running time is deducted from the time of the
 *   host-only (exclude_guest) events.
 */

Thanks,
Kan

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 14/38] KVM: x86/pmu: Introduce enable_mediated_pmu global parameter
  2025-05-15  2:53     ` Mi, Dapeng
@ 2025-05-21 18:43       ` Sean Christopherson
  2025-05-22  1:36         ` Mi, Dapeng
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-05-21 18:43 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Mingwei Zhang, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Paolo Bonzini,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Yongwei Ma, Xiong Zhang,
	Jim Mattson, Sandipan Das, Zide Chen, Eranian Stephane,
	Shukla Manali, Nikunj Dadhania

On Thu, May 15, 2025, Dapeng Mi wrote:
> On 5/15/2025 8:09 AM, Sean Christopherson wrote:
> > On Mon, Mar 24, 2025, Mingwei Zhang wrote:
> >> +	return vcpu->kvm->arch.enable_pmu &&
> > This is superfluous, pmu->version should never be non-zero without the PMU being
> > enabled at the VM level.
> 
> Strictly speaking, "arch.enable_pmu" and pmu->version don't indicate
> exactly the same thing.  "arch.enable_pmu" indicates whether the PMU
> function is enabled in KVM, but "pmu->version" comes from user space
> configuration.
> In theory user space could configure a "0"  PMU version just like
> pmu_counters_test does. Currently I'm not sure if the check for
> "pmu->version" can be removed, let me have a double check.

Gah, sorry, my comment was vague and confusing.  What I was trying to say is that
the vcpu->kvm->arch.enable_pmu check is superfluous and can be dropped.

> >> +	kvm->arch.enable_pmu = enable_pmu && !enable_mediated_pmu;
> > So I tried to run a QEMU with this and it failed, because QEMU expected the PMU
> > to be enabled and tried to write to PMU MSRs.  I haven't dug through the QEMU
> > code, but I assume that QEMU rightly expects that passing in PMU in CPUID when
> > KVM_GET_SUPPORTED_CPUID says its supported will result in the VM having a PMU.
> 
> As long as the module parameter "enable_mediated_pmu" is enabled, qemu
> needs the extra code below to enable the mediated vPMU, otherwise the
> PMU is disabled in KVM.
> 
> https://lore.kernel.org/all/20250324123712.34096-1-dapeng1.mi@linux.intel.com/
> 
> > I.e. by trying to get cute with backwards compatibility, I think we broke backwards
> > compatibility.  At this point, I'm leaning toward making the module param off-by-default,
> > but otherwise not messing with the behavior of kvm->arch.enable_pmu.  Not sure if
> > that has implications for KVM_PMU_CAP_DISABLE though.
> 
> I'm not sure if it's really a break of backwards compatibility.  As long
> as "enable_mediated_pmu" is not enabled, qemu doesn't need any changes and
> the legacy vPMU can still be enabled by an old qemu version. But if users
> want to enable the mediated vPMU, they should use a new qemu version which
> has the capability to enable it; that sounds reasonable to me.

I agree it's reasonable to require a userspace update to take advantage of new
features; what I don't like is what happens if userspace _hasn't_ been updated.
I also don't love that forcing a userspace update in this case is more than a bit
contrived.  It's very doable to let existing userspace utilize the mediated PMU;
forcing KVM_CAP_PMU_CAPABILITY is essentially KVM punting a problem to userspace.

And the complications with the mediated PMU don't really have anything to do with
the VMM, they're more about all the other tasks and daemons running on the system,
e.g. that might be using perf.

Thinking more about this, the problem isn't so much that enabling mediated PMUs
by default is undesirable, it's that giving userspace a binary choice doesn't
provide enough flexibility.  E.g. for single-user QEMU-based use cases (including
my use of QEMU), requiring a new QEMU is painful and annoying, and so having an
on-by-default option would be nice.

But for use cases that already utilize KVM_CAP_PMU_CAPABILITY, e.g. to explicitly
disable PMUs for a subset of VMs, on-by-default is very undesirable, e.g. would
require KVM to support KVM_PMU_CAP_DISABLE, and would generate unnecessary noise
and contention in perf.

So, what if we simply make enable_mediated_pmu a tri-state of sorts?

  0   == disabled
  > 0 == enabled for all VMs (no opt-in or opt-out supported)
  < 0 == enabled, but off by default (requires opt-in)

Then use cases like my personal usage of QEMU can run with enable_mediated_pmu=1,
while use cases like Google Cloud can run with enable_mediated_pmu=-1, and everyone
is happy (hopefully), without too much added complexity in KVM.
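
To make the tri-state concrete, a hypothetical sketch of how it could be
interpreted on the KVM side (the helper names and exact opt-in plumbing
below are assumptions for illustration, not the actual implementation):

#include <linux/module.h>

/* 0: disabled, > 0: on for all VMs, < 0: supported but requires opt-in. */
static int enable_mediated_pmu;
module_param(enable_mediated_pmu, int, 0444);

/* Can this host hand out a mediated PMU at all? */
static bool mediated_pmu_supported(void)
{
	return enable_mediated_pmu != 0;
}

/* Should a VM get the mediated PMU without an explicit opt-in? */
static bool mediated_pmu_default_on(void)
{
	return enable_mediated_pmu > 0;
}

A VM would then get the mediated PMU by default when the param is > 0,
only after an explicit KVM_CAP_PMU_CAPABILITY opt-in when it is < 0, and
never when it is 0.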

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 04/38] perf: Add a EVENT_GUEST flag
  2025-03-24 17:30 ` [PATCH v4 04/38] perf: Add a EVENT_GUEST flag Mingwei Zhang
  2025-05-14 22:51   ` Sean Christopherson
  2025-05-19  6:58   ` Namhyung Kim
@ 2025-05-21 19:46   ` Namhyung Kim
  2 siblings, 0 replies; 127+ messages in thread
From: Namhyung Kim @ 2025-05-21 19:46 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Sean Christopherson, Paolo Bonzini, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Liang,
	Kan, H. Peter Anvin, linux-perf-users, linux-kernel, kvm,
	linux-kselftest, Yongwei Ma, Xiong Zhang, Dapeng Mi, Jim Mattson,
	Sandipan Das, Zide Chen, Eranian Stephane, Shukla Manali,
	Nikunj Dadhania

On Mon, Mar 24, 2025 at 05:30:44PM +0000, Mingwei Zhang wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> Currently, perf doesn't explicitly schedule out all exclude_guest events
> while the guest is running. That is not a problem with the current
> emulated vPMU, because perf owns all the PMU counters. It can mask the
> counter which is assigned to an exclude_guest event when a guest is
> running (Intel way), or set the corresponding HOSTONLY bit in eventsel
> (AMD way). The counter doesn't count when a guest is running.
> 
> However, neither way works with the introduced passthrough vPMU.
> A guest owns all the PMU counters when it's running. The host should not
> mask any counters; a counter may be in use by the guest, and its eventsel
> may be overwritten.
> 
> Perf should explicitly schedule out all exclude_guest events to release
> the PMU resources when entering a guest, and resume the counting when
> exiting the guest.
> 
> It's possible that an exclude_guest event is created while a guest is
> running. The new event should not be scheduled in either.
> 
> The ctx time is shared among different PMUs. The time cannot be stopped
> when a guest is running, because it is required to calculate the time for
> events from other PMUs, e.g., uncore events. Add timeguest to track the
> guest run time. For an exclude_guest event, the elapsed time equals
> the ctx time - guest time.
> Cgroups have dedicated times; use the same method to deduct the guest time
> from the cgroup time as well.
> 
> Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  include/linux/perf_event.h |   6 ++
>  kernel/events/core.c       | 209 +++++++++++++++++++++++++++++--------
>  2 files changed, 169 insertions(+), 46 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index a2fd1bdc955c..7bda1e20be12 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -999,6 +999,11 @@ struct perf_event_context {
>  	 */
>  	struct perf_time_ctx		time;
>  
> +	/*
> +	 * Context clock, runs when in the guest mode.
> +	 */
> +	struct perf_time_ctx		timeguest;
> +
>  	/*
>  	 * These fields let us detect when two contexts have both
>  	 * been cloned (inherited) from a common ancestor.
> @@ -1089,6 +1094,7 @@ struct bpf_perf_event_data_kern {
>   */
>  struct perf_cgroup_info {
>  	struct perf_time_ctx		time;
> +	struct perf_time_ctx		timeguest;
>  	int				active;
>  };
>  
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index e38c8b5e8086..7a2115b2c5c1 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -163,7 +163,8 @@ enum event_type_t {
>  	/* see ctx_resched() for details */
>  	EVENT_CPU	= 0x10,
>  	EVENT_CGROUP	= 0x20,
> -	EVENT_FLAGS	= EVENT_CGROUP,
> +	EVENT_GUEST	= 0x40,
> +	EVENT_FLAGS	= EVENT_CGROUP | EVENT_GUEST,
>  	/* compound helpers */
>  	EVENT_ALL         = EVENT_FLEXIBLE | EVENT_PINNED,
>  	EVENT_TIME_FROZEN = EVENT_TIME | EVENT_FROZEN,
> @@ -435,6 +436,7 @@ static atomic_t nr_include_guest_events __read_mostly;
>  
>  static atomic_t nr_mediated_pmu_vms;
>  static DEFINE_MUTEX(perf_mediated_pmu_mutex);
> +static DEFINE_PER_CPU(bool, perf_in_guest);
>  
>  /* !exclude_guest event of PMU with PERF_PMU_CAP_MEDIATED_VPMU */
>  static inline bool is_include_guest_event(struct perf_event *event)
> @@ -738,6 +740,9 @@ static bool perf_skip_pmu_ctx(struct perf_event_pmu_context *pmu_ctx,
>  {
>  	if ((event_type & EVENT_CGROUP) && !pmu_ctx->nr_cgroups)
>  		return true;
> +	if ((event_type & EVENT_GUEST) &&
> +	    !(pmu_ctx->pmu->capabilities & PERF_PMU_CAP_MEDIATED_VPMU))
> +		return true;
>  	return false;
>  }
>  
> @@ -788,6 +793,39 @@ static inline void update_perf_time_ctx(struct perf_time_ctx *time, u64 now, boo
>  	WRITE_ONCE(time->offset, time->time - time->stamp);
>  }
>  
> +static_assert(offsetof(struct perf_event_context, timeguest) -
> +	      offsetof(struct perf_event_context, time) ==
> +	      sizeof(struct perf_time_ctx));
> +
> +#define T_TOTAL		0
> +#define T_GUEST		1
> +
> +static inline u64 __perf_event_time_ctx(struct perf_event *event,
> +					struct perf_time_ctx *times)
> +{
> +	u64 time = times[T_TOTAL].time;
> +
> +	if (event->attr.exclude_guest)
> +		time -= times[T_GUEST].time;
> +
> +	return time;
> +}
> +
> +static inline u64 __perf_event_time_ctx_now(struct perf_event *event,
> +					    struct perf_time_ctx *times,
> +					    u64 now)
> +{
> +	if (event->attr.exclude_guest && __this_cpu_read(perf_in_guest)) {
> +		/*
> +		 * (now + times[total].offset) - (now + times[guest].offset) :=
> +		 * times[total].offset - times[guest].offset
> +		 */
> +		return READ_ONCE(times[T_TOTAL].offset) - READ_ONCE(times[T_GUEST].offset);

So this will remove both time_enabled and time_running.  I think it's
fine as the events are not multiplexed, but some curious users may
wonder why the time_enabled is less than expected. :)


> +	}
> +
> +	return now + READ_ONCE(times[T_TOTAL].offset);
> +}
> +

[SNIP] 
> @@ -6285,23 +6400,25 @@ void perf_event_update_userpage(struct perf_event *event)
>  	if (!rb)
>  		goto unlock;
>  
> -	/*
> -	 * compute total_time_enabled, total_time_running
> -	 * based on snapshot values taken when the event
> -	 * was last scheduled in.
> -	 *
> -	 * we cannot simply called update_context_time()
> -	 * because of locking issue as we can be called in
> -	 * NMI context
> -	 */
> -	calc_timer_values(event, &now, &enabled, &running);
> -
> -	userpg = rb->user_page;
>  	/*
>  	 * Disable preemption to guarantee consistent time stamps are stored to
>  	 * the user page.
>  	 */
>  	preempt_disable();
> +
> +	/*
> +	 * compute total_time_enabled, total_time_running
> +	 * based on snapshot values taken when the event
> +	 * was last scheduled in.
> +	 *
> +	 * we cannot simply called update_context_time()

s/called/call.  I know you just moved the code though. :)

Thanks,
Namhyung


> +	 * because of locking issue as we can be called in
> +	 * NMI context
> +	 */
> +	calc_timer_values(event, &now, &enabled, &running);
> +
> +	userpg = rb->user_page;
> +
>  	++userpg->lock;
>  	barrier();
>  	userpg->index = perf_event_index(event);
> -- 
> 2.49.0.395.g12beb8f557-goog
> 

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 05/38] perf: Add generic exclude_guest support
  2025-03-24 17:30 ` [PATCH v4 05/38] perf: Add generic exclude_guest support Mingwei Zhang
  2025-04-25 11:13   ` Peter Zijlstra
@ 2025-05-21 19:55   ` Namhyung Kim
  2025-05-21 20:12     ` Liang, Kan
  1 sibling, 1 reply; 127+ messages in thread
From: Namhyung Kim @ 2025-05-21 19:55 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Sean Christopherson, Paolo Bonzini, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Liang,
	Kan, H. Peter Anvin, linux-perf-users, linux-kernel, kvm,
	linux-kselftest, Yongwei Ma, Xiong Zhang, Dapeng Mi, Jim Mattson,
	Sandipan Das, Zide Chen, Eranian Stephane, Shukla Manali,
	Nikunj Dadhania

On Mon, Mar 24, 2025 at 05:30:45PM +0000, Mingwei Zhang wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> Only KVM knows the exact time when a guest is entering/exiting. Expose
> two interfaces to KVM to switch the ownership of the PMU resources.
> 
> All the pinned events must be scheduled in first. Extend the
> perf_event_sched_in() helper to support an extra flag, e.g., EVENT_GUEST.
> 
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  include/linux/perf_event.h |  4 ++
>  kernel/events/core.c       | 80 ++++++++++++++++++++++++++++++++++----
>  2 files changed, 77 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 7bda1e20be12..37187ee8e226 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1822,6 +1822,8 @@ extern int perf_event_period(struct perf_event *event, u64 value);
>  extern u64 perf_event_pause(struct perf_event *event, bool reset);
>  int perf_get_mediated_pmu(void);
>  void perf_put_mediated_pmu(void);
> +void perf_guest_enter(void);
> +void perf_guest_exit(void);
>  #else /* !CONFIG_PERF_EVENTS: */
>  static inline void *
>  perf_aux_output_begin(struct perf_output_handle *handle,
> @@ -1919,6 +1921,8 @@ static inline int perf_get_mediated_pmu(void)
>  }
>  
>  static inline void perf_put_mediated_pmu(void)			{ }
> +static inline void perf_guest_enter(void)			{ }
> +static inline void perf_guest_exit(void)			{ }
>  #endif
>  
>  #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 7a2115b2c5c1..d05487d465c9 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2827,14 +2827,15 @@ static void task_ctx_sched_out(struct perf_event_context *ctx,
>  
>  static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
>  				struct perf_event_context *ctx,
> -				struct pmu *pmu)
> +				struct pmu *pmu,
> +				enum event_type_t event_type)
>  {
> -	ctx_sched_in(&cpuctx->ctx, pmu, EVENT_PINNED);
> +	ctx_sched_in(&cpuctx->ctx, pmu, EVENT_PINNED | event_type);
>  	if (ctx)
> -		 ctx_sched_in(ctx, pmu, EVENT_PINNED);
> -	ctx_sched_in(&cpuctx->ctx, pmu, EVENT_FLEXIBLE);
> +		ctx_sched_in(ctx, pmu, EVENT_PINNED | event_type);
> +	ctx_sched_in(&cpuctx->ctx, pmu, EVENT_FLEXIBLE | event_type);
>  	if (ctx)
> -		 ctx_sched_in(ctx, pmu, EVENT_FLEXIBLE);
> +		ctx_sched_in(ctx, pmu, EVENT_FLEXIBLE | event_type);
>  }
>  
>  /*
> @@ -2890,7 +2891,7 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
>  	else if (event_type & EVENT_PINNED)
>  		ctx_sched_out(&cpuctx->ctx, pmu, EVENT_FLEXIBLE);
>  
> -	perf_event_sched_in(cpuctx, task_ctx, pmu);
> +	perf_event_sched_in(cpuctx, task_ctx, pmu, 0);
>  
>  	for_each_epc(epc, &cpuctx->ctx, pmu, 0)
>  		perf_pmu_enable(epc->pmu);
> @@ -4188,7 +4189,7 @@ static void perf_event_context_sched_in(struct task_struct *task)
>  		ctx_sched_out(&cpuctx->ctx, NULL, EVENT_FLEXIBLE);
>  	}
>  
> -	perf_event_sched_in(cpuctx, ctx, NULL);
> +	perf_event_sched_in(cpuctx, ctx, NULL, 0);
>  
>  	perf_ctx_sched_task_cb(cpuctx->task_ctx, true);
>  
> @@ -6040,6 +6041,71 @@ void perf_put_mediated_pmu(void)
>  }
>  EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
>  
> +static inline void perf_host_exit(struct perf_cpu_context *cpuctx)
> +{
> +	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
> +	ctx_sched_out(&cpuctx->ctx, NULL, EVENT_GUEST);
> +	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
> +	if (cpuctx->task_ctx) {
> +		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
> +		task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);
> +		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
> +	}
> +}

Cpu context and task context may have events in the same PMU.
How about this?

	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
	if (cpuctx->task_ctx)
		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);

	ctx_sched_out(&cpuctx->ctx, NULL, EVENT_GUEST);
	if (cpuctx->task_ctx)
		task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);

	if (cpuctx->task_ctx)
		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);

Thanks,
Namhyung

> +
> +/* When entering a guest, schedule out all exclude_guest events. */
> +void perf_guest_enter(void)
> +{
> +	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
> +
> +	lockdep_assert_irqs_disabled();
> +
> +	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> +
> +	if (WARN_ON_ONCE(__this_cpu_read(perf_in_guest)))
> +		goto unlock;
> +
> +	perf_host_exit(cpuctx);
> +
> +	__this_cpu_write(perf_in_guest, true);
> +
> +unlock:
> +	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> +}
> +EXPORT_SYMBOL_GPL(perf_guest_enter);
> +
> +static inline void perf_host_enter(struct perf_cpu_context *cpuctx)
> +{
> +	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
> +	if (cpuctx->task_ctx)
> +		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
> +
> +	perf_event_sched_in(cpuctx, cpuctx->task_ctx, NULL, EVENT_GUEST);
> +
> +	if (cpuctx->task_ctx)
> +		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
> +	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
> +}
> +
> +void perf_guest_exit(void)
> +{
> +	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
> +
> +	lockdep_assert_irqs_disabled();
> +
> +	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> +
> +	if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
> +		goto unlock;
> +
> +	perf_host_enter(cpuctx);
> +
> +	__this_cpu_write(perf_in_guest, false);
> +unlock:
> +	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> +}
> +EXPORT_SYMBOL_GPL(perf_guest_exit);
> +
>  /*
>   * Holding the top-level event's child_mutex means that any
>   * descendant process that has inherited this event will block
> -- 
> 2.49.0.395.g12beb8f557-goog
> 

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 34/38] perf/x86/amd: Support PERF_PMU_CAP_MEDIATED_VPMU for AMD host
  2025-03-24 17:31 ` [PATCH v4 34/38] perf/x86/amd: Support PERF_PMU_CAP_MEDIATED_VPMU for AMD host Mingwei Zhang
@ 2025-05-21 20:00   ` Namhyung Kim
  0 siblings, 0 replies; 127+ messages in thread
From: Namhyung Kim @ 2025-05-21 20:00 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Sean Christopherson, Paolo Bonzini, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Liang,
	Kan, H. Peter Anvin, linux-perf-users, linux-kernel, kvm,
	linux-kselftest, Yongwei Ma, Xiong Zhang, Dapeng Mi, Jim Mattson,
	Sandipan Das, Zide Chen, Eranian Stephane, Shukla Manali,
	Nikunj Dadhania

On Mon, Mar 24, 2025 at 05:31:14PM +0000, Mingwei Zhang wrote:
> From: Sandipan Das <sandipan.das@amd.com>
> 
> Apply the PERF_PMU_CAP_MEDIATED_VPMU flag for version 2 and later
> implementations of the core PMU. Aside from having Global Control and
> Status registers, virtualizing the PMU using the passthrough model
> requires an interface to set or clear the overflow bits in the Global
> Status MSRs while restoring or saving the PMU context of a vCPU.
> 
> PerfMonV2-capable hardware has additional MSRs for this purpose, namely
> PerfCntrGlobalStatusSet and PerfCntrGlobalStatusClr, thereby making it
> suitable for use with a mediated vPMU.

So IBS cannot be used in the guest (with MEDIATED_VPMU), and the host can
profile guests with it, right?

Thanks,
Namhyung

> 
> Signed-off-by: Sandipan Das <sandipan.das@amd.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/events/amd/core.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
> index 30d6ceb4c8ad..a8b537dd2ddb 100644
> --- a/arch/x86/events/amd/core.c
> +++ b/arch/x86/events/amd/core.c
> @@ -1433,6 +1433,8 @@ static int __init amd_core_pmu_init(void)
>  
>  		amd_pmu_global_cntr_mask = x86_pmu.cntr_mask64;
>  
> +		x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_MEDIATED_VPMU;
> +
>  		/* Update PMC handling functions */
>  		x86_pmu.enable_all = amd_pmu_v2_enable_all;
>  		x86_pmu.disable_all = amd_pmu_v2_disable_all;
> -- 
> 2.49.0.395.g12beb8f557-goog
> 

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 09/38] perf: Add switch_guest_ctx() interface
  2025-03-24 17:30 ` [PATCH v4 09/38] perf: Add switch_guest_ctx() interface Mingwei Zhang
  2025-04-25 11:12   ` Peter Zijlstra
  2025-05-14 23:30   ` Sean Christopherson
@ 2025-05-21 20:01   ` Namhyung Kim
  2 siblings, 0 replies; 127+ messages in thread
From: Namhyung Kim @ 2025-05-21 20:01 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Sean Christopherson, Paolo Bonzini, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Liang,
	Kan, H. Peter Anvin, linux-perf-users, linux-kernel, kvm,
	linux-kselftest, Yongwei Ma, Xiong Zhang, Dapeng Mi, Jim Mattson,
	Sandipan Das, Zide Chen, Eranian Stephane, Shukla Manali,
	Nikunj Dadhania

On Mon, Mar 24, 2025 at 05:30:49PM +0000, Mingwei Zhang wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> When entering/exiting a guest, some contexts for a guest have to be
> switched. For example, there is a dedicated interrupt vector for
> guests on Intel platforms.
> 
> When the PMI is switched to a new guest vector, the guest_lvtpc value
> needs to be reflected onto the HW. E.g., when the guest clears the PMI
> mask bit, the HW PMI mask bit should be cleared as well, so that PMIs
> can keep being generated for the guest. Therefore a guest_lvtpc
> parameter is added to perf_guest_enter() and switch_guest_ctx().
> 
> Add a dedicated list to track all the pmus with the PASSTHROUGH cap, which

s/PASSTHROUGH/MEDIATED_VPMU/ ?

Thanks,
Namhyung


> may require switching the guest context. This avoids going through the
> huge pmus list.
> 
> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
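
As a rough illustration of the "dedicated list" idea in the changelog
above (not the actual patch; the types and names below are made up for
illustration):

  /*
   * Sketch: only PMUs that advertise the mediated/passthrough capability
   * are tracked, so guest entry/exit walks this short list instead of the
   * full 'pmus' list.
   */
  #include <linux/list.h>
  #include <linux/types.h>

  struct med_pmu {				/* hypothetical stand-in for struct pmu */
  	struct list_head entry;
  	void (*switch_guest_ctx)(bool enter, void *data);
  };

  static LIST_HEAD(mediated_pmus);

  /* Registration side: add a PMU that has the mediated capability. */
  static void mediated_pmu_track(struct med_pmu *pmu)
  {
  	list_add(&pmu->entry, &mediated_pmus);
  }

  /* Guest entry/exit side, called with IRQs disabled. */
  static void switch_guest_ctx(bool enter, u32 guest_lvtpc)
  {
  	struct med_pmu *pmu;

  	list_for_each_entry(pmu, &mediated_pmus, entry)
  		pmu->switch_guest_ctx(enter, &guest_lvtpc);
  }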

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 05/38] perf: Add generic exclude_guest support
  2025-05-21 19:55   ` Namhyung Kim
@ 2025-05-21 20:12     ` Liang, Kan
  0 siblings, 0 replies; 127+ messages in thread
From: Liang, Kan @ 2025-05-21 20:12 UTC (permalink / raw)
  To: Namhyung Kim, Mingwei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Sean Christopherson, Paolo Bonzini, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Liang,
	H. Peter Anvin, linux-perf-users, linux-kernel, kvm,
	linux-kselftest, Yongwei Ma, Xiong Zhang, Dapeng Mi, Jim Mattson,
	Sandipan Das, Zide Chen, Eranian Stephane, Shukla Manali,
	Nikunj Dadhania



On 2025-05-21 3:55 p.m., Namhyung Kim wrote:
> On Mon, Mar 24, 2025 at 05:30:45PM +0000, Mingwei Zhang wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> Only KVM knows the exact time when a guest is entering/exiting. Expose
>> two interfaces to KVM to switch the ownership of the PMU resources.
>>
>> All the pinned events must be scheduled in first. Extend the
>> perf_event_sched_in() helper to support an extra flag, e.g., EVENT_GUEST.
>>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>  include/linux/perf_event.h |  4 ++
>>  kernel/events/core.c       | 80 ++++++++++++++++++++++++++++++++++----
>>  2 files changed, 77 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 7bda1e20be12..37187ee8e226 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -1822,6 +1822,8 @@ extern int perf_event_period(struct perf_event *event, u64 value);
>>  extern u64 perf_event_pause(struct perf_event *event, bool reset);
>>  int perf_get_mediated_pmu(void);
>>  void perf_put_mediated_pmu(void);
>> +void perf_guest_enter(void);
>> +void perf_guest_exit(void);
>>  #else /* !CONFIG_PERF_EVENTS: */
>>  static inline void *
>>  perf_aux_output_begin(struct perf_output_handle *handle,
>> @@ -1919,6 +1921,8 @@ static inline int perf_get_mediated_pmu(void)
>>  }
>>  
>>  static inline void perf_put_mediated_pmu(void)			{ }
>> +static inline void perf_guest_enter(void)			{ }
>> +static inline void perf_guest_exit(void)			{ }
>>  #endif
>>  
>>  #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 7a2115b2c5c1..d05487d465c9 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -2827,14 +2827,15 @@ static void task_ctx_sched_out(struct perf_event_context *ctx,
>>  
>>  static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
>>  				struct perf_event_context *ctx,
>> -				struct pmu *pmu)
>> +				struct pmu *pmu,
>> +				enum event_type_t event_type)
>>  {
>> -	ctx_sched_in(&cpuctx->ctx, pmu, EVENT_PINNED);
>> +	ctx_sched_in(&cpuctx->ctx, pmu, EVENT_PINNED | event_type);
>>  	if (ctx)
>> -		 ctx_sched_in(ctx, pmu, EVENT_PINNED);
>> -	ctx_sched_in(&cpuctx->ctx, pmu, EVENT_FLEXIBLE);
>> +		ctx_sched_in(ctx, pmu, EVENT_PINNED | event_type);
>> +	ctx_sched_in(&cpuctx->ctx, pmu, EVENT_FLEXIBLE | event_type);
>>  	if (ctx)
>> -		 ctx_sched_in(ctx, pmu, EVENT_FLEXIBLE);
>> +		ctx_sched_in(ctx, pmu, EVENT_FLEXIBLE | event_type);
>>  }
>>  
>>  /*
>> @@ -2890,7 +2891,7 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
>>  	else if (event_type & EVENT_PINNED)
>>  		ctx_sched_out(&cpuctx->ctx, pmu, EVENT_FLEXIBLE);
>>  
>> -	perf_event_sched_in(cpuctx, task_ctx, pmu);
>> +	perf_event_sched_in(cpuctx, task_ctx, pmu, 0);
>>  
>>  	for_each_epc(epc, &cpuctx->ctx, pmu, 0)
>>  		perf_pmu_enable(epc->pmu);
>> @@ -4188,7 +4189,7 @@ static void perf_event_context_sched_in(struct task_struct *task)
>>  		ctx_sched_out(&cpuctx->ctx, NULL, EVENT_FLEXIBLE);
>>  	}
>>  
>> -	perf_event_sched_in(cpuctx, ctx, NULL);
>> +	perf_event_sched_in(cpuctx, ctx, NULL, 0);
>>  
>>  	perf_ctx_sched_task_cb(cpuctx->task_ctx, true);
>>  
>> @@ -6040,6 +6041,71 @@ void perf_put_mediated_pmu(void)
>>  }
>>  EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
>>  
>> +static inline void perf_host_exit(struct perf_cpu_context *cpuctx)
>> +{
>> +	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>> +	ctx_sched_out(&cpuctx->ctx, NULL, EVENT_GUEST);
>> +	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
>> +	if (cpuctx->task_ctx) {
>> +		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
>> +		task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);
>> +		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>> +	}
>> +}
> 
> Cpu context and task context may have events in the same PMU.
> How about this?
> 
> 	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
> 	if (cpuctx->task_ctx)
> 		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
> 
> 	ctx_sched_out(&cpuctx->ctx, NULL, EVENT_GUEST);
> 	if (cpuctx->task_ctx)
> 		task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);
> 
> 	if (cpuctx->task_ctx)
> 		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
> 	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
> 

Yes, this is what has to be fixed in V4.

Whether loading or putting the context, I think the steps below have to
be followed:
- Disable both cpuctx->ctx and cpuctx->task_ctx
- Schedule in/out the host counters and load/put the guest context
- Enable both cpuctx->ctx and cpuctx->task_ctx

A similar proposal can be found at
https://lore.kernel.org/lkml/4aaf67ab-aa5c-41e6-bced-3cb000172c52@linux.intel.com/

Thanks,
Kan


^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 14/38] KVM: x86/pmu: Introduce enable_mediated_pmu global parameter
  2025-05-21 18:43       ` Sean Christopherson
@ 2025-05-22  1:36         ` Mi, Dapeng
  0 siblings, 0 replies; 127+ messages in thread
From: Mi, Dapeng @ 2025-05-22  1:36 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Mingwei Zhang, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Paolo Bonzini,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Yongwei Ma, Xiong Zhang,
	Jim Mattson, Sandipan Das, Zide Chen, Eranian Stephane,
	Shukla Manali, Nikunj Dadhania


On 5/22/2025 2:43 AM, Sean Christopherson wrote:
> On Thu, May 15, 2025, Dapeng Mi wrote:
>> On 5/15/2025 8:09 AM, Sean Christopherson wrote:
>>> On Mon, Mar 24, 2025, Mingwei Zhang wrote:
>>>> +	return vcpu->kvm->arch.enable_pmu &&
>>> This is superfluous, pmu->version should never be non-zero without the PMU being
>>> enabled at the VM level.
>> Strictly speaking, "arch.enable_pmu" and pmu->version don't indicate
>> exactly the same thing.  "arch.enable_pmu" indicates whether the PMU is
>> enabled in KVM, but "pmu->version" comes from the user space configuration.
>> In theory user space could configure a "0" PMU version just like
>> pmu_counters_test does. Currently I'm not sure if the check for
>> "pmu->version" can be removed; let me double check.
> Gah, sorry, my comment was vague and confusing.  What I was trying to say is that
> the vcpu->kvm->arch.enable_pmu check is superfluous and can be dropped.

Hmm, yes.  "pmu->version > 0" implies that arch.enable_pmu must be true
(kvm_pmu_refresh() checks if arch.enable_pmu is true before setting
pmu->version).


>
>>>> +	kvm->arch.enable_pmu = enable_pmu && !enable_mediated_pmu;
>>> So I tried to run a QEMU with this and it failed, because QEMU expected the PMU
>>> to be enabled and tried to write to PMU MSRs.  I haven't dug through the QEMU
>>> code, but I assume that QEMU rightly expects that passing in PMU in CPUID when
>>> KVM_GET_SUPPORTED_CPUID says its supported will result in the VM having a PMU.
>> As long as the module parameter "enable_mediated_pmu" is enabled, QEMU
>> needs the extra code below to enable the mediated vPMU; otherwise the PMU
>> is disabled in KVM.
>>
>> https://lore.kernel.org/all/20250324123712.34096-1-dapeng1.mi@linux.intel.com/
>>
>>> I.e. by trying to get cute with backwards compatibility, I think we broke backwards
>>> compatibility.  At this point, I'm leaning toward making the module param off-by-default,
>>> but otherwise not messing with the behavior of kvm->arch.enable_pmu.  Not sure if
>>> that has implications for KVM_PMU_CAP_DISABLE though.
>> I'm not sure this really breaks backwards compatibility.  As long
>> as "enable_mediated_pmu" is not enabled, QEMU doesn't need any changes and
>> the legacy vPMU can still be enabled by old QEMU versions. But if users want
>> to enable the mediated vPMU, they should use a newer QEMU that has the
>> capability to enable it, which sounds reasonable to me.
> I agree it's reasonable to require a userspace update to take advantage of new
> features; what I don't like is what happens if userspace _hasn't_ been updated.
> I also don't love that forcing a userspace update in this case is more than a bit
> contrived.  It's very doable to let existing userspace utilize the mediated PMU;
> forcing KVM_CAP_PMU_CAPABILITY is essentially KVM punting a problem to userspace.
>
> And the complications with the mediated PMU don't really have anything to do with
> the VMM, they're more about all the other tasks and daemons running on the system,
> e.g. that might be using perf.
>
> Thinking more about this, the problem isn't so much that enabling mediated PMUs
> by default is undesirable; it's that giving userspace a binary choice doesn't
> provide enough flexibility.  E.g. for single-user QEMU-based use cases (including
> my use of QEMU), requiring a new QEMU is painful and annoying, and so having an
> on-by-default option would be nice.
>
> But for use cases that already utilize KVM_CAP_PMU_CAPABILITY, e.g. to explicitly
> disable PMUs for a subset of VMs, on-by-default is very undesirable, e.g. would
> require KVM to support KVM_PMU_CAP_DISABLE, and would generate unnecessary noise
> and contention in perf.
>
> So, what if we simply make enable_mediated_pmu a tri-state of sorts?
>
>   0   == disabled
>   > 0 == enabled for all VMs (no opt-in or opt-out supported)
>   < 0 == enabled, but off by default (requires opt-in)
>
> Then use cases like my personal usage of QEMU can run with enable_mediated_pmu=1,
> while use cases like Google Cloud can run with enable_mediated_pmu=-1, and everyone
> is happy (hopefully), without too much added complexity in KVM.

Hmm, I agree. A tri-state "enable_mediated_pmu" is much more flexible, but
we need good documentation to describe it, maybe like this (see the sketch
below for how a VMM could opt in):

enable_mediated_pmu

0       ==  globally disabled for all VMs

> 0    ==  globally enabled for all VMs

< 0    ==  disabled per VM by default; the VMM must explicitly enable it
via the KVM_CAP_PMU_CAPABILITY ioctl.
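
For reference, a rough sketch of what the per-VM opt-in could look like
from the VMM side in the "< 0" mode. KVM_CAP_PMU_CAPABILITY, KVM_ENABLE_CAP
and struct kvm_enable_cap are real, but the KVM_PMU_CAP_ENABLE_MEDIATED
flag name is purely an assumption (today only KVM_PMU_CAP_DISABLE is
defined):

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  /* Hypothetical flag; only KVM_PMU_CAP_DISABLE exists today. */
  #define KVM_PMU_CAP_ENABLE_MEDIATED	(1ULL << 1)

  /* Must be called on the VM fd before any vCPU is created. */
  static int vm_enable_mediated_pmu(int vm_fd)
  {
  	struct kvm_enable_cap cap = {
  		.cap = KVM_CAP_PMU_CAPABILITY,
  		.args = { KVM_PMU_CAP_ENABLE_MEDIATED },
  	};

  	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
  }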




^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 20/38] KVM: x86/pmu: Check if mediated vPMU can intercept rdpmc
  2025-03-24 17:31 ` [PATCH v4 20/38] KVM: x86/pmu: Check if mediated vPMU can intercept rdpmc Mingwei Zhang
  2025-05-15  0:19   ` Sean Christopherson
@ 2025-05-26  6:15   ` Sandipan Das
  2025-07-09 15:53     ` Sean Christopherson
  1 sibling, 1 reply; 127+ messages in thread
From: Sandipan Das @ 2025-05-26  6:15 UTC (permalink / raw)
  To: Mingwei Zhang, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Sean Christopherson,
	Paolo Bonzini
  Cc: Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Yongwei Ma, Xiong Zhang,
	Dapeng Mi, Jim Mattson, Zide Chen, Eranian Stephane,
	Shukla Manali, Nikunj Dadhania

On 3/24/2025 11:01 PM, Mingwei Zhang wrote:
> From: Dapeng Mi <dapeng1.mi@linux.intel.com>
> 
> Check if rdpmc can be intercepted for the mediated vPMU. Simply speaking,
> if the guest owns all PMU counters in the mediated vPMU, then rdpmc
> interception should be disabled to mitigate the performance impact;
> otherwise rdpmc has to be intercepted to prevent the guest from obtaining
> host counter data via the rdpmc instruction.
> 
> Co-developed-by: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> Co-developed-by: Sandipan Das <sandipan.das@amd.com>
> Signed-off-by: Sandipan Das <sandipan.das@amd.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
>  arch/x86/include/asm/msr-index.h |  1 +
>  arch/x86/kvm/pmu.c               | 34 ++++++++++++++++++++++++++++++++
>  arch/x86/kvm/pmu.h               | 19 ++++++++++++++++++
>  arch/x86/kvm/svm/pmu.c           | 14 ++++++++++++-
>  arch/x86/kvm/vmx/pmu_intel.c     | 18 ++++++++---------
>  5 files changed, 76 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index ca70846ffd55..337f4b0a2998 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -312,6 +312,7 @@
>  #define PERF_CAP_PEBS_FORMAT		0xf00
>  #define PERF_CAP_FW_WRITES		BIT_ULL(13)
>  #define PERF_CAP_PEBS_BASELINE		BIT_ULL(14)
> +#define PERF_CAP_PERF_METRICS		BIT_ULL(15)
>  #define PERF_CAP_PEBS_MASK		(PERF_CAP_PEBS_TRAP | PERF_CAP_ARCH_REG | \
>  					 PERF_CAP_PEBS_FORMAT | PERF_CAP_PEBS_BASELINE)
>  
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 92c742ead663..6ad71752be4b 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -604,6 +604,40 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
>  	return 0;
>  }
>  
> +inline bool kvm_rdpmc_in_guest(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +
> +	if (!kvm_mediated_pmu_enabled(vcpu))
> +		return false;
> +
> +	/*
> +	 * VMware allows access to these Pseduo-PMCs even when read via RDPMC
> +	 * in Ring3 when CR4.PCE=0.
> +	 */
> +	if (enable_vmware_backdoor)
> +		return false;
> +
> +	/*
> +	 * FIXME: In theory, perf metrics is always combined with fixed
> +	 *	  counter 3. it's fair enough to compare the guest and host
> +	 *	  fixed counter number and don't need to check perf metrics
> +	 *	  explicitly. However kvm_pmu_cap.num_counters_fixed is limited
> +	 *	  KVM_MAX_NR_FIXED_COUNTERS (3) as fixed counter 3 is not
> +	 *	  supported now. perf metrics is still needed to be checked
> +	 *	  explicitly here. Once fixed counter 3 is supported, the perf
> +	 *	  metrics checking can be removed.
> +	 */
> +	return pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp &&
> +	       pmu->nr_arch_fixed_counters == kvm_pmu_cap.num_counters_fixed &&
> +	       vcpu_has_perf_metrics(vcpu) == kvm_host_has_perf_metrics() &&
> +	       pmu->counter_bitmask[KVM_PMC_GP] ==
> +				(BIT_ULL(kvm_pmu_cap.bit_width_gp) - 1) &&
> +	       pmu->counter_bitmask[KVM_PMC_FIXED] ==
> +				(BIT_ULL(kvm_pmu_cap.bit_width_fixed) - 1);
> +}
> +EXPORT_SYMBOL_GPL(kvm_rdpmc_in_guest);
> +
>  void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu)
>  {
>  	if (lapic_in_kernel(vcpu)) {
> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index e1d0096f249b..509c995b7871 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -271,6 +271,24 @@ static inline bool pmc_is_globally_enabled(struct kvm_pmc *pmc)
>  	return test_bit(pmc->idx, (unsigned long *)&pmu->global_ctrl);
>  }
>  
> +static inline u64 vcpu_get_perf_capabilities(struct kvm_vcpu *vcpu)
> +{
> +	if (!guest_cpu_cap_has(vcpu, X86_FEATURE_PDCM))
> +		return 0;
> +
> +	return vcpu->arch.perf_capabilities;
> +}
> +
> +static inline bool vcpu_has_perf_metrics(struct kvm_vcpu *vcpu)
> +{
> +	return !!(vcpu_get_perf_capabilities(vcpu) & PERF_CAP_PERF_METRICS);
> +}
> +
> +static inline bool kvm_host_has_perf_metrics(void)
> +{
> +	return !!(kvm_host.perf_capabilities & PERF_CAP_PERF_METRICS);
> +}
> +
>  void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu);
>  void kvm_pmu_handle_event(struct kvm_vcpu *vcpu);
>  int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
> @@ -287,6 +305,7 @@ void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel);
>  bool vcpu_pmu_can_enable(struct kvm_vcpu *vcpu);
>  
>  bool is_vmware_backdoor_pmc(u32 pmc_idx);
> +bool kvm_rdpmc_in_guest(struct kvm_vcpu *vcpu);
>  
>  extern struct kvm_pmu_ops intel_pmu_ops;
>  extern struct kvm_pmu_ops amd_pmu_ops;
> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> index c8b9fd9b5350..153972e944eb 100644
> --- a/arch/x86/kvm/svm/pmu.c
> +++ b/arch/x86/kvm/svm/pmu.c
> @@ -173,7 +173,7 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  	return 1;
>  }
>  
> -static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
> +static void __amd_pmu_refresh(struct kvm_vcpu *vcpu)
>  {
>  	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>  	union cpuid_0x80000022_ebx ebx;
> @@ -212,6 +212,18 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
>  	bitmap_set(pmu->all_valid_pmc_idx, 0, pmu->nr_arch_gp_counters);
>  }
>  
> +static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_svm *svm = to_svm(vcpu);
> +
> +	__amd_pmu_refresh(vcpu);
> +
> +	if (kvm_rdpmc_in_guest(vcpu))
> +		svm_clr_intercept(svm, INTERCEPT_RDPMC);
> +	else
> +		svm_set_intercept(svm, INTERCEPT_RDPMC);
> +}
> +

After putting kprobes on kvm_pmu_rdpmc(), I noticed that RDPMC instructions were
getting intercepted for the secondary vCPUs. This happens because when secondary
vCPUs come up, kvm_vcpu_reset() gets called after guest CPUID has been updated.
While RDPMC interception is initially disabled in the kvm_pmu_refresh() path, it
gets re-enabled in the kvm_vcpu_reset() path as svm_vcpu_reset() calls init_vmcb().
We should consider adding the following change to avoid that.

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 6f9142063cc4..1c9c183092f3 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1354,7 +1354,6 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
                svm_set_intercept(svm, INTERCEPT_SMI);

        svm_set_intercept(svm, INTERCEPT_SELECTIVE_CR0);
-       svm_set_intercept(svm, INTERCEPT_RDPMC);
        svm_set_intercept(svm, INTERCEPT_CPUID);
        svm_set_intercept(svm, INTERCEPT_INVD);
        svm_set_intercept(svm, INTERCEPT_INVLPG);

>  static void amd_pmu_init(struct kvm_vcpu *vcpu)
>  {
>  	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index fc017e9a6a0c..2a5f79206b02 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -108,14 +108,6 @@ static struct kvm_pmc *intel_rdpmc_ecx_to_pmc(struct kvm_vcpu *vcpu,
>  	return &counters[array_index_nospec(idx, num_counters)];
>  }
>  
> -static inline u64 vcpu_get_perf_capabilities(struct kvm_vcpu *vcpu)
> -{
> -	if (!guest_cpu_cap_has(vcpu, X86_FEATURE_PDCM))
> -		return 0;
> -
> -	return vcpu->arch.perf_capabilities;
> -}
> -
>  static inline bool fw_writes_is_enabled(struct kvm_vcpu *vcpu)
>  {
>  	return (vcpu_get_perf_capabilities(vcpu) & PERF_CAP_FW_WRITES) != 0;
> @@ -456,7 +448,7 @@ static void intel_pmu_enable_fixed_counter_bits(struct kvm_pmu *pmu, u64 bits)
>  		pmu->fixed_ctr_ctrl_rsvd &= ~intel_fixed_bits_by_idx(i, bits);
>  }
>  
> -static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
> +static void __intel_pmu_refresh(struct kvm_vcpu *vcpu)
>  {
>  	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>  	struct lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu);
> @@ -564,6 +556,14 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
>  	}
>  }
>  
> +static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
> +{
> +	__intel_pmu_refresh(vcpu);
> +
> +	exec_controls_changebit(to_vmx(vcpu), CPU_BASED_RDPMC_EXITING,
> +				!kvm_rdpmc_in_guest(vcpu));
> +}
> +
>  static void intel_pmu_init(struct kvm_vcpu *vcpu)
>  {
>  	int i;


^ permalink raw reply related	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 20/38] KVM: x86/pmu: Check if mediated vPMU can intercept rdpmc
  2025-05-26  6:15   ` Sandipan Das
@ 2025-07-09 15:53     ` Sean Christopherson
  2025-07-29  3:29       ` Mi, Dapeng
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-07-09 15:53 UTC (permalink / raw)
  To: Sandipan Das
  Cc: Mingwei Zhang, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Paolo Bonzini,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Yongwei Ma, Xiong Zhang,
	Dapeng Mi, Jim Mattson, Zide Chen, Eranian Stephane,
	Shukla Manali, Nikunj Dadhania

On Mon, May 26, 2025, Sandipan Das wrote:
> > @@ -212,6 +212,18 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
> >  	bitmap_set(pmu->all_valid_pmc_idx, 0, pmu->nr_arch_gp_counters);
> >  }
> >  
> > +static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
> > +{
> > +	struct vcpu_svm *svm = to_svm(vcpu);
> > +
> > +	__amd_pmu_refresh(vcpu);
> > +
> > +	if (kvm_rdpmc_in_guest(vcpu))
> > +		svm_clr_intercept(svm, INTERCEPT_RDPMC);
> > +	else
> > +		svm_set_intercept(svm, INTERCEPT_RDPMC);
> > +}
> > +
> 
> After putting kprobes on kvm_pmu_rdpmc(), I noticed that RDPMC instructions were
> getting intercepted for the secondary vCPUs. This happens because when secondary
> vCPUs come up, kvm_vcpu_reset() gets called after guest CPUID has been updated.
> While RDPMC interception is initially disabled in the kvm_pmu_refresh() path, it
> gets re-enabled in the kvm_vcpu_reset() path as svm_vcpu_reset() calls init_vmcb().
> We should consider adding the following change to avoid that.

Revisiting this code after the MSR interception rework, I think we should go for
a more complete, big-hammer solution.  Rather than manipulate intercepts during
kvm_pmu_refresh(), do the updates as part of the "common" recalc intercepts flow.
And then to trigger recalc on PERF_CAPABILITIES writes, turn KVM_REQ_MSR_FILTER_CHANGED
into a generic KVM_REQ_RECALC_INTERCEPTS.

That way there's one path for calculating dynamic intercepts, which should make it
much more difficult for us to screw up things like reacting to MSR filter changes.
And providing a single path avoids needing to have a series of back-and-forth calls
between common x86 code, PMU code, and vendor code.
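
To sketch the idea (this is not code from any posted series; the request
name and the vendor hook below are assumptions):

  /*
   * Hypothetical: a single generic request replaces/extends
   * KVM_REQ_MSR_FILTER_CHANGED, and anything that can change dynamic
   * intercepts just queues it.
   */
  #define KVM_REQ_RECALC_INTERCEPTS	KVM_REQ_MSR_FILTER_CHANGED

  /* E.g. on a guest write to MSR_IA32_PERF_CAPABILITIES. */
  static void kvm_queue_recalc_intercepts(struct kvm_vcpu *vcpu)
  {
  	kvm_make_request(KVM_REQ_RECALC_INTERCEPTS, vcpu);
  }

  /*
   * Serviced once per VM-Entry from the common request loop in
   * vcpu_enter_guest(), by one vendor callback that recomputes all
   * dynamic MSR/RDPMC/etc. intercepts from scratch.
   */
  static void kvm_service_recalc_intercepts(struct kvm_vcpu *vcpu)
  {
  	if (kvm_check_request(KVM_REQ_RECALC_INTERCEPTS, vcpu))
  		kvm_x86_ops.recalc_intercepts(vcpu);	/* hypothetical hook */
  }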

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 20/38] KVM: x86/pmu: Check if mediated vPMU can intercept rdpmc
  2025-07-09 15:53     ` Sean Christopherson
@ 2025-07-29  3:29       ` Mi, Dapeng
  2025-07-30  0:38         ` Sean Christopherson
  0 siblings, 1 reply; 127+ messages in thread
From: Mi, Dapeng @ 2025-07-29  3:29 UTC (permalink / raw)
  To: Sean Christopherson, Sandipan Das
  Cc: Mingwei Zhang, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Paolo Bonzini,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Yongwei Ma, Xiong Zhang,
	Jim Mattson, Zide Chen, Eranian Stephane, Shukla Manali,
	Nikunj Dadhania


On 7/9/2025 11:53 PM, Sean Christopherson wrote:
> On Mon, May 26, 2025, Sandipan Das wrote:
>>> @@ -212,6 +212,18 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
>>>  	bitmap_set(pmu->all_valid_pmc_idx, 0, pmu->nr_arch_gp_counters);
>>>  }
>>>  
>>> +static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
>>> +{
>>> +	struct vcpu_svm *svm = to_svm(vcpu);
>>> +
>>> +	__amd_pmu_refresh(vcpu);
>>> +
>>> +	if (kvm_rdpmc_in_guest(vcpu))
>>> +		svm_clr_intercept(svm, INTERCEPT_RDPMC);
>>> +	else
>>> +		svm_set_intercept(svm, INTERCEPT_RDPMC);
>>> +}
>>> +
>> After putting kprobes on kvm_pmu_rdpmc(), I noticed that RDPMC instructions were
>> getting intercepted for the secondary vCPUs. This happens because when secondary
>> vCPUs come up, kvm_vcpu_reset() gets called after guest CPUID has been updated.
>> While RDPMC interception is initially disabled in the kvm_pmu_refresh() path, it
>> gets re-enabled in the kvm_vcpu_reset() path as svm_vcpu_reset() calls init_vmcb().
>> We should consider adding the following change to avoid that.
> Revisiting this code after the MSR interception rework, I think we should go for
> a more complete, big-hammer solution.  Rather than manipulate intercepts during
> kvm_pmu_refresh(), do the updates as part of the "common" recalc intercepts flow.
> And then to trigger recalc on PERF_CAPABILITIES writes, turn KVM_REQ_MSR_FILTER_CHANGED
> into a generic KVM_REQ_RECALC_INTERCEPTS.
>
> That way there's one path for calculating dynamic intercepts, which should make it
> much more difficult for us to screw up things like reacting to MSR filter changes.
> And providing a single path avoids needing to have a series of back-and-forth calls
> between common x86 code, PMU code, and vendor code.

Sounds good to me.

BTW, Sean, may I know your plan about the mediated vPMU v5 patch set? Thanks.



^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 10/38] perf/x86: Support switch_guest_ctx interface
  2025-04-25 13:56         ` Liang, Kan
@ 2025-07-30  0:31           ` Sean Christopherson
  0 siblings, 0 replies; 127+ messages in thread
From: Sean Christopherson @ 2025-07-30  0:31 UTC (permalink / raw)
  To: Kan Liang
  Cc: Peter Zijlstra, Mingwei Zhang, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Paolo Bonzini,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Yongwei Ma, Xiong Zhang,
	Dapeng Mi, Jim Mattson, Sandipan Das, Zide Chen, Eranian Stephane,
	Shukla Manali, Nikunj Dadhania

On Fri, Apr 25, 2025, Kan Liang wrote:
> On 2025-04-25 9:43 a.m., Peter Zijlstra wrote:
> > On Fri, Apr 25, 2025 at 09:06:26AM -0400, Liang, Kan wrote:
> >>
> >>
> >> On 2025-04-25 7:15 a.m., Peter Zijlstra wrote:
> >>> On Mon, Mar 24, 2025 at 05:30:50PM +0000, Mingwei Zhang wrote:
> >>>> From: Kan Liang <kan.liang@linux.intel.com>
> >>>>
> >>>> Implement switch_guest_ctx interface for x86 PMU, switch PMI to dedicated
> >>>> KVM_GUEST_PMI_VECTOR at perf guest enter, and switch PMI back to
> >>>> NMI at perf guest exit.
> >>>>
> >>>> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> >>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> >>>> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
> >>>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> >>>> ---
> >>>>  arch/x86/events/core.c | 12 ++++++++++++
> >>>>  1 file changed, 12 insertions(+)
> >>>>
> >>>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> >>>> index 8f218ac0d445..28161d6ff26d 100644
> >>>> --- a/arch/x86/events/core.c
> >>>> +++ b/arch/x86/events/core.c
> >>>> @@ -2677,6 +2677,16 @@ static bool x86_pmu_filter(struct pmu *pmu, int cpu)
> >>>>  	return ret;
> >>>>  }
> >>>>  
> >>>> +static void x86_pmu_switch_guest_ctx(bool enter, void *data)
> >>>> +{
> >>>> +	u32 guest_lvtpc = *(u32 *)data;
> >>>> +
> >>>> +	if (enter)
> >>>> +		apic_write(APIC_LVTPC, guest_lvtpc);
> >>>> +	else
> >>>> +		apic_write(APIC_LVTPC, APIC_DM_NMI);
> >>>> +}
> >>>
> >>> This, why can't it use x86_pmu.guest_lvtpc here and call it a day? Why
> >>> is that argument passed around through the generic code only to get back
> >>> here?
> >>
> >> The vector has to be from the KVM. However, the current interfaces only
> >> support KVM read perf variables, e.g., perf_get_x86_pmu_capability and
> >> perf_get_hw_event_config.
> >> We need to add an new interface to allow the KVM write a perf variable,
> >> e.g., perf_set_guest_lvtpc.
> > 
> > But all that should remain in x86, there is no reason what so ever to
> > leak this into generic code.

Finally prepping v5, and this is one of two <knock wood> comments that isn't fully
addressed.

The vector isn't a problem; that's *always* PERF_GUEST_MEDIATED_PMI_VECTOR and
so doesn't even require anything in x86_pmu.

But whether or not the entry should be masked comes from the guest's LVTPC entry,
and I don't see a cleaner way to get that information into x86, especially since
the switch between host and guest PMI needs to happen in the "perf context disabled"
section.

I think/hope I dressed up the code so that it's not _so_ ugly, and so that it's
fully extensible in the unlikely event a non-x86 arch were to ever support a
mediated vPMU, e.g. @data could be used to pass a pointer to a struct.

  void perf_load_guest_context(unsigned long data)
  {
	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);

	lockdep_assert_irqs_disabled();

	guard(perf_ctx_lock)(cpuctx, cpuctx->task_ctx);

	if (WARN_ON_ONCE(__this_cpu_read(guest_ctx_loaded)))
		return;

	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
	ctx_sched_out(&cpuctx->ctx, NULL, EVENT_GUEST);
	if (cpuctx->task_ctx) {
		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
		task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);
	}

	arch_perf_load_guest_context(data);

	...
  }

  void arch_perf_load_guest_context(unsigned long data)
  {
	u32 masked = data & APIC_LVT_MASKED;

	apic_write(APIC_LVTPC,
		   APIC_DM_FIXED | PERF_GUEST_MEDIATED_PMI_VECTOR | masked);
	this_cpu_write(x86_guest_ctx_loaded, true);
  }

Holler if you have a better idea.  I'll plan on posting v5 in the next day or so
no matter what, so that it's not delayed for this one thing (it's already been
delayed more than I was hoping, and there are a lot of changes relative to v4).

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 20/38] KVM: x86/pmu: Check if mediated vPMU can intercept rdpmc
  2025-07-29  3:29       ` Mi, Dapeng
@ 2025-07-30  0:38         ` Sean Christopherson
  2025-07-30  2:25           ` Mi, Dapeng
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-07-30  0:38 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Sandipan Das, Mingwei Zhang, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Paolo Bonzini,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Yongwei Ma, Xiong Zhang,
	Jim Mattson, Zide Chen, Eranian Stephane, Shukla Manali,
	Nikunj Dadhania

On Tue, Jul 29, 2025, Dapeng Mi wrote:
> BTW, Sean, may I know your plan about the mediated vPMU v5 patch set? Thanks.

I'll get it out this week (hopefully tomorrow).

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 20/38] KVM: x86/pmu: Check if mediated vPMU can intercept rdpmc
  2025-07-30  0:38         ` Sean Christopherson
@ 2025-07-30  2:25           ` Mi, Dapeng
  2025-08-01 23:32             ` Sean Christopherson
  0 siblings, 1 reply; 127+ messages in thread
From: Mi, Dapeng @ 2025-07-30  2:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Sandipan Das, Mingwei Zhang, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Paolo Bonzini,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Yongwei Ma, Xiong Zhang,
	Jim Mattson, Zide Chen, Eranian Stephane, Shukla Manali,
	Nikunj Dadhania


On 7/30/2025 8:38 AM, Sean Christopherson wrote:
> On Tue, Jul 29, 2025, Dapeng Mi wrote:
>> BTW, Sean, may I know your plan about the mediated vPMU v5 patch set? Thanks.
> I'll get it out this week (hopefully tomorrow).

Thumbs up! Thanks.


>

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 20/38] KVM: x86/pmu: Check if mediated vPMU can intercept rdpmc
  2025-07-30  2:25           ` Mi, Dapeng
@ 2025-08-01 23:32             ` Sean Christopherson
  2025-08-05  0:54               ` Sean Christopherson
  0 siblings, 1 reply; 127+ messages in thread
From: Sean Christopherson @ 2025-08-01 23:32 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Sandipan Das, Mingwei Zhang, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Paolo Bonzini,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Yongwei Ma, Xiong Zhang,
	Jim Mattson, Zide Chen, Eranian Stephane, Shukla Manali,
	Nikunj Dadhania

On Wed, Jul 30, 2025, Dapeng Mi wrote:
> 
> On 7/30/2025 8:38 AM, Sean Christopherson wrote:
> > On Tue, Jul 29, 2025, Dapeng Mi wrote:
> >> BTW, Sean, may I know your plan about the mediated vPMU v5 patch set? Thanks.
> > I'll get it out this week (hopefully tomorrow).
> 
> Thumbs up! Thanks.

I lied, I'm not going to get it out until Monday.  Figuring out how to deal with
instruction emulation in the fastpath VM-Exit handlers took me longer than I was
hoping/expecting.

It's fully tested, and I have all but one changelog written, but I'm out of time
for today (I made a stupid goof (inverted a !) that cost me an ~hour today, *sigh*).

Unless I get hit by a meteor, I'll get it out Monday.

Sorry for the delay.  :-/

^ permalink raw reply	[flat|nested] 127+ messages in thread

* Re: [PATCH v4 20/38] KVM: x86/pmu: Check if mediated vPMU can intercept rdpmc
  2025-08-01 23:32             ` Sean Christopherson
@ 2025-08-05  0:54               ` Sean Christopherson
  0 siblings, 0 replies; 127+ messages in thread
From: Sean Christopherson @ 2025-08-05  0:54 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Sandipan Das, Mingwei Zhang, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Paolo Bonzini,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, Liang, Kan, H. Peter Anvin, linux-perf-users,
	linux-kernel, kvm, linux-kselftest, Yongwei Ma, Xiong Zhang,
	Jim Mattson, Zide Chen, Eranian Stephane, Shukla Manali,
	Nikunj Dadhania

On Fri, Aug 01, 2025, Sean Christopherson wrote:
> On Wed, Jul 30, 2025, Dapeng Mi wrote:
> > 
> > On 7/30/2025 8:38 AM, Sean Christopherson wrote:
> > > On Tue, Jul 29, 2025, Dapeng Mi wrote:
> > >> BTW, Sean, may I know your plan about the mediated vPMU v5 patch set? Thanks.
> > > I'll get it out this week (hopefully tomorrow).
> > 
> > Thumbs up! Thanks.
> 
> I lied, I'm not going to get it out until Monday.  Figuring out how to deal with
> instruction emulation in the fastpath VM-Exit handlers took me longer than I was
> hoping/expecting.
> 
> It's fully tested, and I have all but one changelog written, but I'm out of time
> for today (I made a stupid goof (inverted a !) that cost me an ~hour today, *sigh*).
> 
> Unless I get hit by a meteor, I'll get it out Monday.

*sigh*

Wrong again (fortunately, I didn't get hit by a meteor).  Long story short, I
revisited (yet again) how to deal with enabling the mediated PMU.  I had been
doing almost all of my testing with a hack to force-enable a mediated PMU, and
when it came time to rip that out, I just couldn't convince myself that requiring
userspace to enable KVM_CAP_PMU_CAPABILITY was the best behavior (I especially
hated that PMU support would silently disappear).

So, bad news is, v5 isn't happening today.  Good news is that I think I figured
out a not-awful solution for enabling the mediated PMU.  I'll post details (and
hopefully v5) tomorrow.

^ permalink raw reply	[flat|nested] 127+ messages in thread

end of thread, other threads:[~2025-08-05  0:54 UTC | newest]

Thread overview: 127+ messages
-- links below jump to the message on this page --
2025-03-24 17:30 [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mingwei Zhang
2025-03-24 17:30 ` [PATCH v4 01/38] perf: Support get/put mediated PMU interfaces Mingwei Zhang
2025-05-14 22:48   ` Sean Christopherson
2025-05-15  1:31     ` Mi, Dapeng
2025-03-24 17:30 ` [PATCH v4 02/38] perf: Skip pmu_ctx based on event_type Mingwei Zhang
2025-03-24 17:30 ` [PATCH v4 03/38] perf: Clean up perf ctx time Mingwei Zhang
2025-03-24 17:30 ` [PATCH v4 04/38] perf: Add a EVENT_GUEST flag Mingwei Zhang
2025-05-14 22:51   ` Sean Christopherson
2025-05-15  1:35     ` Mi, Dapeng
2025-05-19  6:58   ` Namhyung Kim
2025-05-20 16:09     ` Liang, Kan
2025-05-20 17:51       ` Namhyung Kim
2025-05-20 18:50         ` Liang, Kan
2025-05-21 19:46   ` Namhyung Kim
2025-03-24 17:30 ` [PATCH v4 05/38] perf: Add generic exclude_guest support Mingwei Zhang
2025-04-25 11:13   ` Peter Zijlstra
2025-05-14 23:19     ` Sean Christopherson
2025-05-15  1:37       ` Mi, Dapeng
2025-05-15 18:39       ` Liang, Kan
2025-05-15 19:25         ` Sean Christopherson
2025-05-15 20:18           ` Liang, Kan
2025-05-21 19:55   ` Namhyung Kim
2025-05-21 20:12     ` Liang, Kan
2025-03-24 17:30 ` [PATCH v4 06/38] x86/irq: Factor out common code for installing kvm irq handler Mingwei Zhang
2025-05-14 23:21   ` Sean Christopherson
2025-05-15  2:10     ` Mi, Dapeng
2025-03-24 17:30 ` [PATCH v4 07/38] perf: core/x86: Register a new vector for KVM GUEST PMI Mingwei Zhang
2025-05-14 23:24   ` Sean Christopherson
2025-05-15  1:40     ` Mi, Dapeng
2025-03-24 17:30 ` [PATCH v4 08/38] KVM: x86/pmu: Register KVM_GUEST_PMI_VECTOR handler Mingwei Zhang
2025-03-24 17:30 ` [PATCH v4 09/38] perf: Add switch_guest_ctx() interface Mingwei Zhang
2025-04-25 11:12   ` Peter Zijlstra
2025-05-14 23:30   ` Sean Christopherson
2025-05-15  1:45     ` Mi, Dapeng
2025-05-21 20:01   ` Namhyung Kim
2025-03-24 17:30 ` [PATCH v4 10/38] perf/x86: Support switch_guest_ctx interface Mingwei Zhang
2025-04-25 11:15   ` Peter Zijlstra
2025-04-25 13:06     ` Liang, Kan
2025-04-25 13:43       ` Peter Zijlstra
2025-04-25 13:56         ` Liang, Kan
2025-07-30  0:31           ` Sean Christopherson
2025-03-24 17:30 ` [PATCH v4 11/38] perf/x86: Forbid PMI handler when guest own PMU Mingwei Zhang
2025-05-15  0:00   ` Sean Christopherson
2025-05-15  1:52     ` Mi, Dapeng
2025-03-24 17:30 ` [PATCH v4 12/38] perf/x86/core: Do not set bit width for unavailable counters Mingwei Zhang
2025-03-24 17:30 ` [PATCH v4 13/38] perf/x86/core: Plumb mediated PMU capability from x86_pmu to x86_pmu_cap Mingwei Zhang
2025-03-24 17:30 ` [PATCH v4 14/38] KVM: x86/pmu: Introduce enable_mediated_pmu global parameter Mingwei Zhang
2025-05-15  0:09   ` Sean Christopherson
2025-05-15  2:53     ` Mi, Dapeng
2025-05-21 18:43       ` Sean Christopherson
2025-05-22  1:36         ` Mi, Dapeng
2025-03-24 17:30 ` [PATCH v4 15/38] KVM: x86/pmu: Check PMU cpuid configuration from user space Mingwei Zhang
2025-05-15  0:12   ` Sean Christopherson
2025-05-15  3:00     ` Mi, Dapeng
2025-03-24 17:30 ` [PATCH v4 16/38] KVM: x86: Rename vmx_vmentry/vmexit_ctrl() helpers Mingwei Zhang
2025-03-24 17:30 ` [PATCH v4 17/38] KVM: x86/pmu: Add perf_capabilities field in struct kvm_host_values{} Mingwei Zhang
2025-05-15  0:12   ` Sean Christopherson
2025-05-15  3:04     ` Mi, Dapeng
2025-03-24 17:30 ` [PATCH v4 18/38] KVM: x86/pmu: Move PMU_CAP_{FW_WRITES,LBR_FMT} into msr-index.h header Mingwei Zhang
2025-03-24 17:30 ` [PATCH v4 19/38] KVM: VMX: Add macros to wrap around {secondary,tertiary}_exec_controls_changebit() Mingwei Zhang
2025-03-24 17:31 ` [PATCH v4 20/38] KVM: x86/pmu: Check if mediated vPMU can intercept rdpmc Mingwei Zhang
2025-05-15  0:19   ` Sean Christopherson
2025-05-15  3:23     ` Mi, Dapeng
2025-05-26  6:15   ` Sandipan Das
2025-07-09 15:53     ` Sean Christopherson
2025-07-29  3:29       ` Mi, Dapeng
2025-07-30  0:38         ` Sean Christopherson
2025-07-30  2:25           ` Mi, Dapeng
2025-08-01 23:32             ` Sean Christopherson
2025-08-05  0:54               ` Sean Christopherson
2025-03-24 17:31 ` [PATCH v4 21/38] KVM: x86/pmu/vmx: Save/load guest IA32_PERF_GLOBAL_CTRL with vm_exit/entry_ctrl Mingwei Zhang
2025-03-26 16:51   ` Chen, Zide
2025-03-26 20:09     ` Mingwei Zhang
2025-05-15  0:33       ` Sean Christopherson
2025-05-15  3:45         ` Mi, Dapeng
2025-03-24 17:31 ` [PATCH v4 22/38] KVM: x86/pmu: Optimize intel/amd_pmu_refresh() helpers Mingwei Zhang
2025-05-15  0:37   ` Sean Christopherson
2025-05-15  5:09     ` Mi, Dapeng
2025-05-15 19:22       ` Sean Christopherson
2025-05-16  1:03         ` Mi, Dapeng
2025-03-24 17:31 ` [PATCH v4 23/38] KVM: x86/pmu: Configure the interception of PMU MSRs Mingwei Zhang
2025-05-15  0:41   ` Sean Christopherson
2025-05-15  5:37     ` Mi, Dapeng
2025-05-15 19:06       ` Sean Christopherson
2025-05-16 13:34   ` Sean Christopherson
2025-05-19  5:18     ` Mi, Dapeng
2025-03-24 17:31 ` [PATCH v4 24/38] KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot() Mingwei Zhang
2025-05-16 13:35   ` Sean Christopherson
2025-05-16 14:45     ` Sean Christopherson
2025-05-19  5:21       ` Mi, Dapeng
2025-03-24 17:31 ` [PATCH v4 25/38] KVM: x86/pmu: Add AMD PMU registers to direct access list Mingwei Zhang
2025-05-16 13:36   ` Sean Christopherson
2025-03-24 17:31 ` [PATCH v4 26/38] KVM: x86/pmu: Introduce eventsel_hw to prepare for pmu event filtering Mingwei Zhang
2025-05-15  0:42   ` Sean Christopherson
2025-05-15  5:34     ` Mi, Dapeng
2025-03-24 17:31 ` [PATCH v4 27/38] KVM: x86/pmu: Handle PMU MSRs interception and " Mingwei Zhang
2025-05-15  0:43   ` Sean Christopherson
2025-05-15  5:38     ` Mi, Dapeng
2025-05-16  1:26   ` Mi, Dapeng
2025-05-16 20:54     ` Sean Christopherson
2025-05-19  4:16       ` Mi, Dapeng
2025-03-24 17:31 ` [PATCH v4 28/38] KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest writes to event selectors Mingwei Zhang
2025-03-24 17:31 ` [PATCH v4 29/38] KVM: x86/pmu: Switch host/guest PMU context at vm-exit/vm-entry Mingwei Zhang
2025-05-15 16:29   ` Sean Christopherson
2025-05-16  2:37     ` Mi, Dapeng
2025-05-16 13:26   ` Sean Christopherson
2025-05-19  5:07     ` Mi, Dapeng
2025-03-24 17:31 ` [PATCH v4 30/38] KVM: x86/pmu: Handle emulated instruction for mediated vPMU Mingwei Zhang
2025-05-16  1:10   ` Sean Christopherson
2025-03-24 17:31 ` [PATCH v4 31/38] KVM: nVMX: Add macros to simplify nested MSR interception setting Mingwei Zhang
2025-03-24 17:31 ` [PATCH v4 32/38] KVM: nVMX: Add nested virtualization support for mediated PMU Mingwei Zhang
2025-05-16 13:33   ` Sean Christopherson
2025-05-19  5:24     ` Mi, Dapeng
2025-03-24 17:31 ` [PATCH v4 33/38] perf/x86/intel: Support PERF_PMU_CAP_MEDIATED_VPMU Mingwei Zhang
2025-03-24 17:31 ` [PATCH v4 34/38] perf/x86/amd: Support PERF_PMU_CAP_MEDIATED_VPMU for AMD host Mingwei Zhang
2025-05-21 20:00   ` Namhyung Kim
2025-03-24 17:31 ` [PATCH v4 35/38] KVM: x86/pmu: Expose enable_mediated_pmu parameter to user space Mingwei Zhang
2025-03-24 17:31 ` [PATCH v4 36/38] KVM: selftests: Add mediated vPMU supported for pmu tests Mingwei Zhang
2025-03-24 17:31 ` [PATCH v4 37/38] KVM: Selftests: Support mediated vPMU for vmx_pmu_caps_test Mingwei Zhang
2025-03-24 17:31 ` [PATCH v4 38/38] KVM: Selftests: Fix pmu_counters_test error for mediated vPMU Mingwei Zhang
2025-04-16  7:22 ` [PATCH v4 00/38] Mediated vPMU 4.0 for x86 Mi, Dapeng
2025-04-25 12:27   ` Peter Zijlstra
2025-05-06  9:57 ` Mi, Dapeng
2025-05-06 19:45   ` Sean Christopherson
2025-05-07  0:46     ` Mi, Dapeng
2025-05-15  0:49 ` Sean Christopherson
2025-05-15  5:45   ` Mi, Dapeng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).