* [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86
@ 2024-08-01  4:58 Mingwei Zhang
  2024-08-01  4:58 ` [RFC PATCH v3 01/58] sched/core: Move preempt_model_*() helpers from sched.h to preempt.h Mingwei Zhang
                   ` (60 more replies)
  0 siblings, 61 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

This series contains perf interface improvements to address Peter's
comments. In addition, it fixes several bugs found in v2. This version is
based on 6.10-rc4. The main changes are:

 - Use atomics to replace refcounts to track the nr_mediated_pmu_vms.
 - Use the generic ctx_sched_{in,out}() to switch PMU resources when a
   guest is entering and exiting.
 - Add a new EVENT_GUEST flag to indicate the context switch case of
   entering and exiting a guest. Update the generic ctx_sched_{in,out}()
   to specifically handle this case, especially for time management.
 - Switch the PMI vector in perf_guest_{enter,exit}() as well. Add a new
   driver-specific interface to facilitate the switch.
 - Remove the PMU_FL_PASSTHROUGH flag and use the PASSTHROUGH PMU
   capability instead.
 - Adjust commit sequence in PERF and KVM PMI interrupt functions.
 - Use the pmc_is_globally_enabled() check in the emulated counter
   increment [1].
 - Fix PMU context switch [2] by using rdpmc() instead of rdmsr() (see the
   brief illustration after this list).
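
To make that last point concrete, here is a rough sketch of reading a
general-purpose counter via RDPMC instead of an RDMSR on the counter MSR.
This is an illustration only, not code from the series; the example_*
names are made up, while rdpmcl()/rdmsrl() and MSR_IA32_PMC0 are existing
definitions in the 6.10 base.

#include <linux/types.h>
#include <asm/msr.h>
#include <asm/msr-index.h>

/* Illustration only: read Intel GP counter 'idx' via RDPMC (fast path). */
static u64 example_read_gp_counter(unsigned int idx)
{
	u64 val;

	rdpmcl(idx, val);	/* RDPMC with ECX = idx for a GP counter */
	return val;
}

/* The equivalent, slower MSR read that the fix above replaces. */
static u64 example_read_gp_counter_rdmsr(unsigned int idx)
{
	u64 val;

	rdmsrl(MSR_IA32_PMC0 + idx, val);
	return val;
}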

AMD fixes:
 - Add support for legacy PMU MSRs in MSR interception.
 - Make MSR usage consistent if PerfMonV2 is available.
 - Avoid enabling passthrough vPMU when local APIC is not in kernel.
 - Increment counters in emulation mode.

This series is organized in the following order:

Patches 1-3:
 - Immediate bug fixes that can be applied to Linux tip.
 - Note: immediate fixes will be placed at the front of the series in the
   future. These patches may duplicate existing posts.
 - Note: patches 1-2 are needed for AMD when the host kernel enables
   preemption. Otherwise, the guest will suffer from soft lockups.

Patches 4-17:
 - Perf-side changes: infrastructure changes in the core PMU with APIs for KVM.

Patches 18-48:
 - KVM mediated passthrough vPMU framework + Intel CPU implementation.

Patches 49-58:
 - AMD CPU implementation for vPMU.

Reference (patches in v2):
[1] [PATCH v2 42/54] KVM: x86/pmu: Implement emulated counter increment for passthrough PMU
 - https://lore.kernel.org/all/20240506053020.3911940-43-mizhang@google.com/
[2] [PATCH v2 30/54] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
 - https://lore.kernel.org/all/20240506053020.3911940-31-mizhang@google.com/

V2: https://lore.kernel.org/all/20240506053020.3911940-1-mizhang@google.com/

Dapeng Mi (3):
  x86/msr: Introduce MSR_CORE_PERF_GLOBAL_STATUS_SET
  KVM: x86/pmu: Introduce macro PMU_CAP_PERF_METRICS
  KVM: x86/pmu: Add intel_passthrough_pmu_msrs() to pass-through PMU
    MSRs

Kan Liang (8):
  perf: Support get/put passthrough PMU interfaces
  perf: Skip pmu_ctx based on event_type
  perf: Clean up perf ctx time
  perf: Add a EVENT_GUEST flag
  perf: Add generic exclude_guest support
  perf: Add switch_interrupt() interface
  perf/x86: Support switch_interrupt interface
  perf/x86/intel: Support PERF_PMU_CAP_PASSTHROUGH_VPMU

Manali Shukla (1):
  KVM: x86/pmu/svm: Wire up PMU filtering functionality for passthrough
    PMU

Mingwei Zhang (24):
  perf/x86: Forbid PMI handler when guest own PMU
  perf: core/x86: Plumb passthrough PMU capability from x86_pmu to
    x86_pmu_cap
  KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter
  KVM: x86/pmu: Plumb through pass-through PMU to vcpu for Intel CPUs
  KVM: x86/pmu: Add a helper to check if passthrough PMU is enabled
  KVM: x86/pmu: Add host_perf_cap and initialize it in
    kvm_x86_vendor_init()
  KVM: x86/pmu: Allow RDPMC pass through when all counters exposed to
    guest
  KVM: x86/pmu: Introduce PMU operator to check if rdpmc passthrough
    allowed
  KVM: x86/pmu: Create a function prototype to disable MSR interception
  KVM: x86/pmu: Avoid legacy vPMU code when accessing global_ctrl in
    passthrough vPMU
  KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot()
  KVM: x86/pmu: Add counter MSR and selector MSR index into struct
    kvm_pmc
  KVM: x86/pmu: Introduce PMU operation prototypes for save/restore PMU
    context
  KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  KVM: x86/pmu: Make check_pmu_event_filter() an exported function
  KVM: x86/pmu: Allow writing to event selector for GP counters if event
    is allowed
  KVM: x86/pmu: Allow writing to fixed counter selector if counter is
    exposed
  KVM: x86/pmu: Exclude existing vLBR logic from the passthrough PMU
  KVM: x86/pmu: Introduce PMU operator to increment counter
  KVM: x86/pmu: Introduce PMU operator for setting counter overflow
  KVM: x86/pmu: Implement emulated counter increment for passthrough PMU
  KVM: x86/pmu: Update pmc_{read,write}_counter() to disconnect perf API
  KVM: x86/pmu: Disconnect counter reprogram logic from passthrough PMU
  KVM: nVMX: Add nested virtualization support for passthrough PMU

Sandipan Das (12):
  perf/x86: Do not set bit width for unavailable counters
  x86/msr: Define PerfCntrGlobalStatusSet register
  KVM: x86/pmu: Always set global enable bits in passthrough mode
  KVM: x86/pmu/svm: Set passthrough capability for vcpus
  KVM: x86/pmu/svm: Set enable_passthrough_pmu module parameter
  KVM: x86/pmu/svm: Allow RDPMC pass through when all counters exposed
    to guest
  KVM: x86/pmu/svm: Implement callback to disable MSR interception
  KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest
    write to event selectors
  KVM: x86/pmu/svm: Add registers to direct access list
  KVM: x86/pmu/svm: Implement handlers to save and restore context
  KVM: x86/pmu/svm: Implement callback to increment counters
  perf/x86/amd: Support PERF_PMU_CAP_PASSTHROUGH_VPMU for AMD host

Sean Christopherson (2):
  sched/core: Move preempt_model_*() helpers from sched.h to preempt.h
  sched/core: Drop spinlocks on contention iff kernel is preemptible

Xiong Zhang (8):
  x86/irq: Factor out common code for installing kvm irq handler
  perf: core/x86: Register a new vector for KVM GUEST PMI
  KVM: x86/pmu: Register KVM_GUEST_PMI_VECTOR handler
  KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL
  KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary
  KVM: x86/pmu: Notify perf core at KVM context switch boundary
  KVM: x86/pmu: Grab x86 core PMU for passthrough PMU VM
  KVM: x86/pmu: Add support for PMU context switch at VM-exit/enter

 .../admin-guide/kernel-parameters.txt         |   4 +-
 arch/x86/events/amd/core.c                    |   2 +
 arch/x86/events/core.c                        |  44 +-
 arch/x86/events/intel/core.c                  |   5 +
 arch/x86/include/asm/hardirq.h                |   1 +
 arch/x86/include/asm/idtentry.h               |   1 +
 arch/x86/include/asm/irq.h                    |   2 +-
 arch/x86/include/asm/irq_vectors.h            |   5 +-
 arch/x86/include/asm/kvm-x86-pmu-ops.h        |   6 +
 arch/x86/include/asm/kvm_host.h               |   9 +
 arch/x86/include/asm/msr-index.h              |   2 +
 arch/x86/include/asm/perf_event.h             |   1 +
 arch/x86/include/asm/vmx.h                    |   1 +
 arch/x86/kernel/idt.c                         |   1 +
 arch/x86/kernel/irq.c                         |  39 +-
 arch/x86/kvm/cpuid.c                          |   3 +
 arch/x86/kvm/pmu.c                            | 154 +++++-
 arch/x86/kvm/pmu.h                            |  49 ++
 arch/x86/kvm/svm/pmu.c                        | 136 +++++-
 arch/x86/kvm/svm/svm.c                        |  31 ++
 arch/x86/kvm/svm/svm.h                        |   2 +-
 arch/x86/kvm/vmx/capabilities.h               |   1 +
 arch/x86/kvm/vmx/nested.c                     |  52 +++
 arch/x86/kvm/vmx/pmu_intel.c                  | 192 +++++++-
 arch/x86/kvm/vmx/vmx.c                        | 197 ++++++--
 arch/x86/kvm/vmx/vmx.h                        |   3 +-
 arch/x86/kvm/x86.c                            |  45 ++
 arch/x86/kvm/x86.h                            |   1 +
 include/linux/perf_event.h                    |  38 +-
 include/linux/preempt.h                       |  41 ++
 include/linux/sched.h                         |  41 --
 include/linux/spinlock.h                      |  14 +-
 kernel/events/core.c                          | 441 ++++++++++++++----
 .../beauty/arch/x86/include/asm/irq_vectors.h |   5 +-
 34 files changed, 1373 insertions(+), 196 deletions(-)


base-commit: 6ba59ff4227927d3a8530fc2973b80e94b54d58f
-- 
2.46.0.rc1.232.g9752f9e123-goog



* [RFC PATCH v3 01/58] sched/core: Move preempt_model_*() helpers from sched.h to preempt.h
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-08-01  4:58 ` [RFC PATCH v3 02/58] sched/core: Drop spinlocks on contention iff kernel is preemptible Mingwei Zhang
                   ` (59 subsequent siblings)
  60 siblings, 0 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Sean Christopherson <seanjc@google.com>

Move the declarations and inlined implementations of the preempt_model_*()
helpers to preempt.h so that they can be referenced in spinlock.h without
creating a potential circular dependency between spinlock.h and sched.h.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 include/linux/preempt.h | 41 +++++++++++++++++++++++++++++++++++++++++
 include/linux/sched.h   | 41 -----------------------------------------
 2 files changed, 41 insertions(+), 41 deletions(-)

diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 7233e9cf1bab..ce76f1a45722 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -481,4 +481,45 @@ DEFINE_LOCK_GUARD_0(preempt, preempt_disable(), preempt_enable())
 DEFINE_LOCK_GUARD_0(preempt_notrace, preempt_disable_notrace(), preempt_enable_notrace())
 DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
 
+#ifdef CONFIG_PREEMPT_DYNAMIC
+
+extern bool preempt_model_none(void);
+extern bool preempt_model_voluntary(void);
+extern bool preempt_model_full(void);
+
+#else
+
+static inline bool preempt_model_none(void)
+{
+	return IS_ENABLED(CONFIG_PREEMPT_NONE);
+}
+static inline bool preempt_model_voluntary(void)
+{
+	return IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY);
+}
+static inline bool preempt_model_full(void)
+{
+	return IS_ENABLED(CONFIG_PREEMPT);
+}
+
+#endif
+
+static inline bool preempt_model_rt(void)
+{
+	return IS_ENABLED(CONFIG_PREEMPT_RT);
+}
+
+/*
+ * Does the preemption model allow non-cooperative preemption?
+ *
+ * For !CONFIG_PREEMPT_DYNAMIC kernels this is an exact match with
+ * CONFIG_PREEMPTION; for CONFIG_PREEMPT_DYNAMIC this doesn't work as the
+ * kernel is *built* with CONFIG_PREEMPTION=y but may run with e.g. the
+ * PREEMPT_NONE model.
+ */
+static inline bool preempt_model_preemptible(void)
+{
+	return preempt_model_full() || preempt_model_rt();
+}
+
 #endif /* __LINUX_PREEMPT_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 61591ac6eab6..90691d99027e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2064,47 +2064,6 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
 	__cond_resched_rwlock_write(lock);					\
 })
 
-#ifdef CONFIG_PREEMPT_DYNAMIC
-
-extern bool preempt_model_none(void);
-extern bool preempt_model_voluntary(void);
-extern bool preempt_model_full(void);
-
-#else
-
-static inline bool preempt_model_none(void)
-{
-	return IS_ENABLED(CONFIG_PREEMPT_NONE);
-}
-static inline bool preempt_model_voluntary(void)
-{
-	return IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY);
-}
-static inline bool preempt_model_full(void)
-{
-	return IS_ENABLED(CONFIG_PREEMPT);
-}
-
-#endif
-
-static inline bool preempt_model_rt(void)
-{
-	return IS_ENABLED(CONFIG_PREEMPT_RT);
-}
-
-/*
- * Does the preemption model allow non-cooperative preemption?
- *
- * For !CONFIG_PREEMPT_DYNAMIC kernels this is an exact match with
- * CONFIG_PREEMPTION; for CONFIG_PREEMPT_DYNAMIC this doesn't work as the
- * kernel is *built* with CONFIG_PREEMPTION=y but may run with e.g. the
- * PREEMPT_NONE model.
- */
-static inline bool preempt_model_preemptible(void)
-{
-	return preempt_model_full() || preempt_model_rt();
-}
-
 static __always_inline bool need_resched(void)
 {
 	return unlikely(tif_need_resched());
-- 
2.46.0.rc1.232.g9752f9e123-goog



* [RFC PATCH v3 02/58] sched/core: Drop spinlocks on contention iff kernel is preemptible
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
  2024-08-01  4:58 ` [RFC PATCH v3 01/58] sched/core: Move preempt_model_*() helpers from sched.h to preempt.h Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-08-01  4:58 ` [RFC PATCH v3 03/58] perf/x86: Do not set bit width for unavailable counters Mingwei Zhang
                   ` (58 subsequent siblings)
  60 siblings, 0 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Sean Christopherson <seanjc@google.com>

Use preempt_model_preemptible() to detect a preemptible kernel when
deciding whether or not to reschedule in order to drop a contended
spinlock or rwlock.  Because PREEMPT_DYNAMIC selects PREEMPTION, kernels
built with PREEMPT_DYNAMIC=y will yield contended locks even if the live
preemption model is "none" or "voluntary".  In short, make kernels with
dynamically selected models behave the same as kernels with statically
selected models.

Somewhat counter-intuitively, NOT yielding a lock can provide better
latency for the relevant tasks/processes.  E.g. KVM x86's mmu_lock, a
rwlock, is often contended between an invalidation event (takes mmu_lock
for write) and a vCPU servicing a guest page fault (takes mmu_lock for
read).  For _some_ setups, letting the invalidation task complete even
if there is mmu_lock contention provides lower latency for *all* tasks,
i.e. the invalidation completes sooner *and* the vCPU services the guest
page fault sooner.

But even KVM's mmu_lock behavior isn't uniform, e.g. the "best" behavior
can vary depending on the host VMM, the guest workload, the number of
vCPUs, the number of pCPUs in the host, why there is lock contention, etc.

In other words, simply deleting the CONFIG_PREEMPTION guard (or doing the
opposite and removing contention yielding entirely) needs to come with a
big pile of data proving that changing the status quo is a net positive.

Opportunistically document this side effect of preempt=full, as yielding
contended spinlocks can have significant, user-visible impact.

Fixes: c597bfddc9e9 ("sched: Provide Kconfig support for default dynamic preempt mode")
Link: https://lore.kernel.org/kvm/ef81ff36-64bb-4cfe-ae9b-e3acf47bff24@proxmox.com
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Marco Elver <elver@google.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: David Matlack <dmatlack@google.com>
Cc: Friedrich Weber <f.weber@proxmox.com>
Cc: Ankur Arora <ankur.a.arora@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  4 +++-
 include/linux/spinlock.h                        | 14 ++++++--------
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index b600df82669d..ebb971a57d04 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4774,7 +4774,9 @@
 			none - Limited to cond_resched() calls
 			voluntary - Limited to cond_resched() and might_sleep() calls
 			full - Any section that isn't explicitly preempt disabled
-			       can be preempted anytime.
+			       can be preempted anytime.  Tasks will also yield
+			       contended spinlocks (if the critical section isn't
+			       explicitly preempt disabled beyond the lock itself).
 
 	print-fatal-signals=
 			[KNL] debug: print fatal signals
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 3fcd20de6ca8..63dd8cf3c3c2 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -462,11 +462,10 @@ static __always_inline int spin_is_contended(spinlock_t *lock)
  */
 static inline int spin_needbreak(spinlock_t *lock)
 {
-#ifdef CONFIG_PREEMPTION
+	if (!preempt_model_preemptible())
+		return 0;
+
 	return spin_is_contended(lock);
-#else
-	return 0;
-#endif
 }
 
 /*
@@ -479,11 +478,10 @@ static inline int spin_needbreak(spinlock_t *lock)
  */
 static inline int rwlock_needbreak(rwlock_t *lock)
 {
-#ifdef CONFIG_PREEMPTION
+	if (!preempt_model_preemptible())
+		return 0;
+
 	return rwlock_is_contended(lock);
-#else
-	return 0;
-#endif
 }
 
 /*
-- 
2.46.0.rc1.232.g9752f9e123-goog



* [RFC PATCH v3 03/58] perf/x86: Do not set bit width for unavailable counters
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
  2024-08-01  4:58 ` [RFC PATCH v3 01/58] sched/core: Move preempt_model_*() helpers from sched.h to preempt.h Mingwei Zhang
  2024-08-01  4:58 ` [RFC PATCH v3 02/58] sched/core: Drop spinlocks on contention iff kernel is preemptible Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-08-01  4:58 ` [RFC PATCH v3 04/58] x86/msr: Define PerfCntrGlobalStatusSet register Mingwei Zhang
                   ` (57 subsequent siblings)
  60 siblings, 0 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Sandipan Das <sandipan.das@amd.com>

Not all x86 processors have fixed counters. It may also be the case that
a processor has only fixed counters and no general-purpose counters. Set
the bit widths corresponding to each counter type only if such counters
are available.

Fixes: b3d9468a8bd2 ("perf, x86: Expose perf capability to other modules")
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/events/core.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 5b0dd07b1ef1..5bf78cd619bf 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2985,8 +2985,13 @@ void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
 	cap->version		= x86_pmu.version;
 	cap->num_counters_gp	= x86_pmu.num_counters;
 	cap->num_counters_fixed	= x86_pmu.num_counters_fixed;
-	cap->bit_width_gp	= x86_pmu.cntval_bits;
-	cap->bit_width_fixed	= x86_pmu.cntval_bits;
+
+	if (cap->num_counters_gp)
+		cap->bit_width_gp = x86_pmu.cntval_bits;
+
+	if (cap->num_counters_fixed)
+		cap->bit_width_fixed = x86_pmu.cntval_bits;
+
 	cap->events_mask	= (unsigned int)x86_pmu.events_maskl;
 	cap->events_mask_len	= x86_pmu.events_mask_len;
 	cap->pebs_ept		= x86_pmu.pebs_ept;
-- 
2.46.0.rc1.232.g9752f9e123-goog



* [RFC PATCH v3 04/58] x86/msr: Define PerfCntrGlobalStatusSet register
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (2 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 03/58] perf/x86: Do not set bit width for unavailable counters Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-08-01  4:58 ` [RFC PATCH v3 05/58] x86/msr: Introduce MSR_CORE_PERF_GLOBAL_STATUS_SET Mingwei Zhang
                   ` (56 subsequent siblings)
  60 siblings, 0 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Sandipan Das <sandipan.das@amd.com>

Define PerfCntrGlobalStatusSet (MSR 0xc0000303) as it is required by the
passthrough PMU to set the overflow bits of PerfCntrGlobalStatus
(MSR 0xc0000300).

When using the passthrough PMU, it is necessary to restore the guest state
of the overflow bits. Since PerfCntrGlobalStatus is read-only, this is
done by writing to PerfCntrGlobalStatusSet instead.

The register is available on AMD processors where the PerfMonV2 feature
bit of CPUID leaf 0x80000022 EAX is set.
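
A minimal sketch of how the new definition is meant to be used (not part
of this patch; the function name and call site are made up for
illustration), restoring previously saved guest overflow bits:

#include <linux/types.h>
#include <asm/msr.h>
#include <asm/msr-index.h>

/*
 * Illustration only: PerfCntrGlobalStatus (0xc0000300) is read-only, so
 * the saved guest overflow bits are re-established by writing them to
 * PerfCntrGlobalStatusSet (0xc0000303), which sets the corresponding
 * status bits.
 */
static void example_restore_guest_overflow(u64 saved_global_status)
{
	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET, saved_global_status);
}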

Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/include/asm/msr-index.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index e022e6eb766c..b9f8744b47e5 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -681,6 +681,7 @@
 #define MSR_AMD64_PERF_CNTR_GLOBAL_STATUS	0xc0000300
 #define MSR_AMD64_PERF_CNTR_GLOBAL_CTL		0xc0000301
 #define MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR	0xc0000302
+#define MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET	0xc0000303
 
 /* AMD Last Branch Record MSRs */
 #define MSR_AMD64_LBR_SELECT			0xc000010e
-- 
2.46.0.rc1.232.g9752f9e123-goog



* [RFC PATCH v3 05/58] x86/msr: Introduce MSR_CORE_PERF_GLOBAL_STATUS_SET
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (3 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 04/58] x86/msr: Define PerfCntrGlobalStatusSet register Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-08-01  4:58 ` [RFC PATCH v3 06/58] perf: Support get/put passthrough PMU interfaces Mingwei Zhang
                   ` (55 subsequent siblings)
  60 siblings, 0 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Dapeng Mi <dapeng1.mi@linux.intel.com>

Add the additional PMU MSR MSR_CORE_PERF_GLOBAL_STATUS_SET to allow the
passthrough PMU to operate on the read-only MSR IA32_PERF_GLOBAL_STATUS.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/include/asm/msr-index.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index b9f8744b47e5..1d7104713926 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1113,6 +1113,7 @@
 #define MSR_CORE_PERF_GLOBAL_STATUS	0x0000038e
 #define MSR_CORE_PERF_GLOBAL_CTRL	0x0000038f
 #define MSR_CORE_PERF_GLOBAL_OVF_CTRL	0x00000390
+#define MSR_CORE_PERF_GLOBAL_STATUS_SET 0x00000391
 
 #define MSR_PERF_METRICS		0x00000329
 
-- 
2.46.0.rc1.232.g9752f9e123-goog



* [RFC PATCH v3 06/58] perf: Support get/put passthrough PMU interfaces
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (4 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 05/58] x86/msr: Introduce MSR_CORE_PERF_GLOBAL_STATUS_SET Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-09-06 10:59   ` Mi, Dapeng
  2024-08-01  4:58 ` [RFC PATCH v3 07/58] perf: Skip pmu_ctx based on event_type Mingwei Zhang
                   ` (54 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Kan Liang <kan.liang@linux.intel.com>

Currently, the guest and host share the PMU resources when a guest is
running. KVM has to create an extra virtual event to simulate the
guest's event, which brings several issues, e.g., high overhead and poor
accuracy.

A new passthrough PMU method is proposed to address the issue. It requires
that the PMU resources can be fully occupied by the guest while it's
running. Two new interfaces are implemented to fulfill the requirement.
The hypervisor should invoke the interfaces when creating a guest that
wants the passthrough PMU capability.

The PMU resources should only be temporarily occupied as a whole when a
guest is running. When the guest is not running, the PMU resources are
still shared among different users.

The exclude_guest event modifier is used to guarantee the exclusive
occupation of the PMU resources. When creating a guest, the hypervisor
should check whether there are !exclude_guest events in the system.
If there are, the creation should fail, because some PMU resources have
already been occupied by other users.
If there are none, the PMU resources can be safely accessed by the guest
directly. Perf guarantees that no new !exclude_guest events are created
while a guest is running.

Only the passthrough PMU is affected, not other PMUs, e.g., uncore
and SW PMUs. The behavior of those PMUs is not changed. The guest
enter/exit interfaces should only impact the supported PMUs.
Add a new PERF_PMU_CAP_PASSTHROUGH_VPMU flag to indicate the PMUs that
support the feature.

Add nr_include_guest_events to track the !exclude_guest events of PMUs
with PERF_PMU_CAP_PASSTHROUGH_VPMU.
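
A minimal usage sketch, assuming KVM-style call sites (the example_*
wrappers are made up; only perf_get_mediated_pmu() and
perf_put_mediated_pmu() come from this patch):

#include <linux/perf_event.h>

/* Sketch: pair the interfaces around VM creation and destruction. */
static int example_vm_create(void)
{
	/*
	 * Fails with -EBUSY if a !exclude_guest event already exists on a
	 * PMU with PERF_PMU_CAP_PASSTHROUGH_VPMU; afterwards, creating such
	 * events fails until the VM is destroyed.
	 */
	return perf_get_mediated_pmu();
}

static void example_vm_destroy(void)
{
	/* Drop the reference so !exclude_guest events can be created again. */
	perf_put_mediated_pmu();
}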

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 include/linux/perf_event.h | 10 ++++++
 kernel/events/core.c       | 66 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 76 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index a5304ae8c654..45d1ea82aa21 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -291,6 +291,7 @@ struct perf_event_pmu_context;
 #define PERF_PMU_CAP_NO_EXCLUDE			0x0040
 #define PERF_PMU_CAP_AUX_OUTPUT			0x0080
 #define PERF_PMU_CAP_EXTENDED_HW_TYPE		0x0100
+#define PERF_PMU_CAP_PASSTHROUGH_VPMU		0x0200
 
 struct perf_output_handle;
 
@@ -1728,6 +1729,8 @@ extern void perf_event_task_tick(void);
 extern int perf_event_account_interrupt(struct perf_event *event);
 extern int perf_event_period(struct perf_event *event, u64 value);
 extern u64 perf_event_pause(struct perf_event *event, bool reset);
+int perf_get_mediated_pmu(void);
+void perf_put_mediated_pmu(void);
 #else /* !CONFIG_PERF_EVENTS: */
 static inline void *
 perf_aux_output_begin(struct perf_output_handle *handle,
@@ -1814,6 +1817,13 @@ static inline u64 perf_event_pause(struct perf_event *event, bool reset)
 {
 	return 0;
 }
+
+static inline int perf_get_mediated_pmu(void)
+{
+	return 0;
+}
+
+static inline void perf_put_mediated_pmu(void)			{ }
 #endif
 
 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 8f908f077935..45868d276cde 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -402,6 +402,20 @@ static atomic_t nr_bpf_events __read_mostly;
 static atomic_t nr_cgroup_events __read_mostly;
 static atomic_t nr_text_poke_events __read_mostly;
 static atomic_t nr_build_id_events __read_mostly;
+static atomic_t nr_include_guest_events __read_mostly;
+
+static atomic_t nr_mediated_pmu_vms;
+static DEFINE_MUTEX(perf_mediated_pmu_mutex);
+
+/* !exclude_guest event of PMU with PERF_PMU_CAP_PASSTHROUGH_VPMU */
+static inline bool is_include_guest_event(struct perf_event *event)
+{
+	if ((event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) &&
+	    !event->attr.exclude_guest)
+		return true;
+
+	return false;
+}
 
 static LIST_HEAD(pmus);
 static DEFINE_MUTEX(pmus_lock);
@@ -5212,6 +5226,9 @@ static void _free_event(struct perf_event *event)
 
 	unaccount_event(event);
 
+	if (is_include_guest_event(event))
+		atomic_dec(&nr_include_guest_events);
+
 	security_perf_event_free(event);
 
 	if (event->rb) {
@@ -5769,6 +5786,36 @@ u64 perf_event_pause(struct perf_event *event, bool reset)
 }
 EXPORT_SYMBOL_GPL(perf_event_pause);
 
+/*
+ * Currently invoked at VM creation to
+ * - Check whether there are existing !exclude_guest events of PMU with
+ *   PERF_PMU_CAP_PASSTHROUGH_VPMU
+ * - Set nr_mediated_pmu_vms to prevent !exclude_guest event creation on
+ *   PMUs with PERF_PMU_CAP_PASSTHROUGH_VPMU
+ *
+ * No impact for the PMU without PERF_PMU_CAP_PASSTHROUGH_VPMU. The perf
+ * still owns all the PMU resources.
+ */
+int perf_get_mediated_pmu(void)
+{
+	guard(mutex)(&perf_mediated_pmu_mutex);
+	if (atomic_inc_not_zero(&nr_mediated_pmu_vms))
+		return 0;
+
+	if (atomic_read(&nr_include_guest_events))
+		return -EBUSY;
+
+	atomic_inc(&nr_mediated_pmu_vms);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(perf_get_mediated_pmu);
+
+void perf_put_mediated_pmu(void)
+{
+	atomic_dec(&nr_mediated_pmu_vms);
+}
+EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
+
 /*
  * Holding the top-level event's child_mutex means that any
  * descendant process that has inherited this event will block
@@ -11907,6 +11954,17 @@ static void account_event(struct perf_event *event)
 	account_pmu_sb_event(event);
 }
 
+static int perf_account_include_guest_event(void)
+{
+	guard(mutex)(&perf_mediated_pmu_mutex);
+
+	if (atomic_read(&nr_mediated_pmu_vms))
+		return -EACCES;
+
+	atomic_inc(&nr_include_guest_events);
+	return 0;
+}
+
 /*
  * Allocate and initialize an event structure
  */
@@ -12114,11 +12172,19 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	if (err)
 		goto err_callchain_buffer;
 
+	if (is_include_guest_event(event)) {
+		err = perf_account_include_guest_event();
+		if (err)
+			goto err_security_alloc;
+	}
+
 	/* symmetric to unaccount_event() in _free_event() */
 	account_event(event);
 
 	return event;
 
+err_security_alloc:
+	security_perf_event_free(event);
 err_callchain_buffer:
 	if (!event->parent) {
 		if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
-- 
2.46.0.rc1.232.g9752f9e123-goog



* [RFC PATCH v3 07/58] perf: Skip pmu_ctx based on event_type
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (5 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 06/58] perf: Support get/put passthrough PMU interfaces Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-10-11 11:18   ` Peter Zijlstra
  2024-08-01  4:58 ` [RFC PATCH v3 08/58] perf: Clean up perf ctx time Mingwei Zhang
                   ` (53 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Kan Liang <kan.liang@linux.intel.com>

To optimize the cgroup context switch, the perf_event_pmu_context
iteration skips the PMUs without cgroup events. A bool 'cgroup' was
introduced to indicate the case. It works, but this approach is hard to
extend to other cases, e.g., skipping non-passthrough PMUs. It doesn't
make sense to keep adding bool variables.

Pass the event_type instead of the specific bool variable. Check both
the event_type and related pmu_ctx variables to decide whether to skip
a PMU.

Event flags, e.g., EVENT_CGROUP, should be cleared from ctx->is_active.
Add EVENT_FLAGS to indicate such event flags.

No functional change.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 kernel/events/core.c | 70 +++++++++++++++++++++++---------------------
 1 file changed, 37 insertions(+), 33 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 45868d276cde..7cb51dbf897a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -376,6 +376,7 @@ enum event_type_t {
 	/* see ctx_resched() for details */
 	EVENT_CPU = 0x8,
 	EVENT_CGROUP = 0x10,
+	EVENT_FLAGS = EVENT_CGROUP,
 	EVENT_ALL = EVENT_FLEXIBLE | EVENT_PINNED,
 };
 
@@ -699,23 +700,32 @@ do {									\
 	___p;								\
 })
 
-static void perf_ctx_disable(struct perf_event_context *ctx, bool cgroup)
+static bool perf_skip_pmu_ctx(struct perf_event_pmu_context *pmu_ctx,
+			      enum event_type_t event_type)
+{
+	if ((event_type & EVENT_CGROUP) && !pmu_ctx->nr_cgroups)
+		return true;
+
+	return false;
+}
+
+static void perf_ctx_disable(struct perf_event_context *ctx, enum event_type_t event_type)
 {
 	struct perf_event_pmu_context *pmu_ctx;
 
 	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
-		if (cgroup && !pmu_ctx->nr_cgroups)
+		if (perf_skip_pmu_ctx(pmu_ctx, event_type))
 			continue;
 		perf_pmu_disable(pmu_ctx->pmu);
 	}
 }
 
-static void perf_ctx_enable(struct perf_event_context *ctx, bool cgroup)
+static void perf_ctx_enable(struct perf_event_context *ctx, enum event_type_t event_type)
 {
 	struct perf_event_pmu_context *pmu_ctx;
 
 	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
-		if (cgroup && !pmu_ctx->nr_cgroups)
+		if (perf_skip_pmu_ctx(pmu_ctx, event_type))
 			continue;
 		perf_pmu_enable(pmu_ctx->pmu);
 	}
@@ -877,7 +887,7 @@ static void perf_cgroup_switch(struct task_struct *task)
 		return;
 
 	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
-	perf_ctx_disable(&cpuctx->ctx, true);
+	perf_ctx_disable(&cpuctx->ctx, EVENT_CGROUP);
 
 	ctx_sched_out(&cpuctx->ctx, EVENT_ALL|EVENT_CGROUP);
 	/*
@@ -893,7 +903,7 @@ static void perf_cgroup_switch(struct task_struct *task)
 	 */
 	ctx_sched_in(&cpuctx->ctx, EVENT_ALL|EVENT_CGROUP);
 
-	perf_ctx_enable(&cpuctx->ctx, true);
+	perf_ctx_enable(&cpuctx->ctx, EVENT_CGROUP);
 	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
 }
 
@@ -2732,9 +2742,9 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
 
 	event_type &= EVENT_ALL;
 
-	perf_ctx_disable(&cpuctx->ctx, false);
+	perf_ctx_disable(&cpuctx->ctx, 0);
 	if (task_ctx) {
-		perf_ctx_disable(task_ctx, false);
+		perf_ctx_disable(task_ctx, 0);
 		task_ctx_sched_out(task_ctx, event_type);
 	}
 
@@ -2752,9 +2762,9 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
 
 	perf_event_sched_in(cpuctx, task_ctx);
 
-	perf_ctx_enable(&cpuctx->ctx, false);
+	perf_ctx_enable(&cpuctx->ctx, 0);
 	if (task_ctx)
-		perf_ctx_enable(task_ctx, false);
+		perf_ctx_enable(task_ctx, 0);
 }
 
 void perf_pmu_resched(struct pmu *pmu)
@@ -3299,9 +3309,6 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
 	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
 	struct perf_event_pmu_context *pmu_ctx;
 	int is_active = ctx->is_active;
-	bool cgroup = event_type & EVENT_CGROUP;
-
-	event_type &= ~EVENT_CGROUP;
 
 	lockdep_assert_held(&ctx->lock);
 
@@ -3336,7 +3343,7 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
 		barrier();
 	}
 
-	ctx->is_active &= ~event_type;
+	ctx->is_active &= ~(event_type & ~EVENT_FLAGS);
 	if (!(ctx->is_active & EVENT_ALL))
 		ctx->is_active = 0;
 
@@ -3349,7 +3356,7 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
 	is_active ^= ctx->is_active; /* changed bits */
 
 	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
-		if (cgroup && !pmu_ctx->nr_cgroups)
+		if (perf_skip_pmu_ctx(pmu_ctx, event_type))
 			continue;
 		__pmu_ctx_sched_out(pmu_ctx, is_active);
 	}
@@ -3543,7 +3550,7 @@ perf_event_context_sched_out(struct task_struct *task, struct task_struct *next)
 		raw_spin_lock_nested(&next_ctx->lock, SINGLE_DEPTH_NESTING);
 		if (context_equiv(ctx, next_ctx)) {
 
-			perf_ctx_disable(ctx, false);
+			perf_ctx_disable(ctx, 0);
 
 			/* PMIs are disabled; ctx->nr_pending is stable. */
 			if (local_read(&ctx->nr_pending) ||
@@ -3563,7 +3570,7 @@ perf_event_context_sched_out(struct task_struct *task, struct task_struct *next)
 			perf_ctx_sched_task_cb(ctx, false);
 			perf_event_swap_task_ctx_data(ctx, next_ctx);
 
-			perf_ctx_enable(ctx, false);
+			perf_ctx_enable(ctx, 0);
 
 			/*
 			 * RCU_INIT_POINTER here is safe because we've not
@@ -3587,13 +3594,13 @@ perf_event_context_sched_out(struct task_struct *task, struct task_struct *next)
 
 	if (do_switch) {
 		raw_spin_lock(&ctx->lock);
-		perf_ctx_disable(ctx, false);
+		perf_ctx_disable(ctx, 0);
 
 inside_switch:
 		perf_ctx_sched_task_cb(ctx, false);
 		task_ctx_sched_out(ctx, EVENT_ALL);
 
-		perf_ctx_enable(ctx, false);
+		perf_ctx_enable(ctx, 0);
 		raw_spin_unlock(&ctx->lock);
 	}
 }
@@ -3890,12 +3897,12 @@ static void pmu_groups_sched_in(struct perf_event_context *ctx,
 
 static void ctx_groups_sched_in(struct perf_event_context *ctx,
 				struct perf_event_groups *groups,
-				bool cgroup)
+				enum event_type_t event_type)
 {
 	struct perf_event_pmu_context *pmu_ctx;
 
 	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
-		if (cgroup && !pmu_ctx->nr_cgroups)
+		if (perf_skip_pmu_ctx(pmu_ctx, event_type))
 			continue;
 		pmu_groups_sched_in(ctx, groups, pmu_ctx->pmu);
 	}
@@ -3912,9 +3919,6 @@ ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type)
 {
 	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
 	int is_active = ctx->is_active;
-	bool cgroup = event_type & EVENT_CGROUP;
-
-	event_type &= ~EVENT_CGROUP;
 
 	lockdep_assert_held(&ctx->lock);
 
@@ -3932,7 +3936,7 @@ ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type)
 		barrier();
 	}
 
-	ctx->is_active |= (event_type | EVENT_TIME);
+	ctx->is_active |= ((event_type & ~EVENT_FLAGS) | EVENT_TIME);
 	if (ctx->task) {
 		if (!is_active)
 			cpuctx->task_ctx = ctx;
@@ -3947,11 +3951,11 @@ ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type)
 	 * in order to give them the best chance of going on.
 	 */
 	if (is_active & EVENT_PINNED)
-		ctx_groups_sched_in(ctx, &ctx->pinned_groups, cgroup);
+		ctx_groups_sched_in(ctx, &ctx->pinned_groups, event_type);
 
 	/* Then walk through the lower prio flexible groups */
 	if (is_active & EVENT_FLEXIBLE)
-		ctx_groups_sched_in(ctx, &ctx->flexible_groups, cgroup);
+		ctx_groups_sched_in(ctx, &ctx->flexible_groups, event_type);
 }
 
 static void perf_event_context_sched_in(struct task_struct *task)
@@ -3966,11 +3970,11 @@ static void perf_event_context_sched_in(struct task_struct *task)
 
 	if (cpuctx->task_ctx == ctx) {
 		perf_ctx_lock(cpuctx, ctx);
-		perf_ctx_disable(ctx, false);
+		perf_ctx_disable(ctx, 0);
 
 		perf_ctx_sched_task_cb(ctx, true);
 
-		perf_ctx_enable(ctx, false);
+		perf_ctx_enable(ctx, 0);
 		perf_ctx_unlock(cpuctx, ctx);
 		goto rcu_unlock;
 	}
@@ -3983,7 +3987,7 @@ static void perf_event_context_sched_in(struct task_struct *task)
 	if (!ctx->nr_events)
 		goto unlock;
 
-	perf_ctx_disable(ctx, false);
+	perf_ctx_disable(ctx, 0);
 	/*
 	 * We want to keep the following priority order:
 	 * cpu pinned (that don't need to move), task pinned,
@@ -3993,7 +3997,7 @@ static void perf_event_context_sched_in(struct task_struct *task)
 	 * events, no need to flip the cpuctx's events around.
 	 */
 	if (!RB_EMPTY_ROOT(&ctx->pinned_groups.tree)) {
-		perf_ctx_disable(&cpuctx->ctx, false);
+		perf_ctx_disable(&cpuctx->ctx, 0);
 		ctx_sched_out(&cpuctx->ctx, EVENT_FLEXIBLE);
 	}
 
@@ -4002,9 +4006,9 @@ static void perf_event_context_sched_in(struct task_struct *task)
 	perf_ctx_sched_task_cb(cpuctx->task_ctx, true);
 
 	if (!RB_EMPTY_ROOT(&ctx->pinned_groups.tree))
-		perf_ctx_enable(&cpuctx->ctx, false);
+		perf_ctx_enable(&cpuctx->ctx, 0);
 
-	perf_ctx_enable(ctx, false);
+	perf_ctx_enable(ctx, 0);
 
 unlock:
 	perf_ctx_unlock(cpuctx, ctx);
-- 
2.46.0.rc1.232.g9752f9e123-goog



* [RFC PATCH v3 08/58] perf: Clean up perf ctx time
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (6 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 07/58] perf: Skip pmu_ctx based on event_type Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-10-11 11:39   ` Peter Zijlstra
  2024-08-01  4:58 ` [RFC PATCH v3 09/58] perf: Add a EVENT_GUEST flag Mingwei Zhang
                   ` (52 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Kan Liang <kan.liang@linux.intel.com>

Perf currently tracks two timestamps, one for the normal ctx and one for
cgroups. The same type of variables and similar code are used to track
them. In the following patch, a third timestamp to track the guest time
will be introduced.
To avoid code duplication, add a new struct perf_time_ctx and factor
out a generic function, update_perf_time_ctx().

No functional change.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 include/linux/perf_event.h | 13 +++++----
 kernel/events/core.c       | 59 +++++++++++++++++++-------------------
 2 files changed, 37 insertions(+), 35 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 45d1ea82aa21..e22cdb6486e6 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -906,6 +906,11 @@ struct perf_event_groups {
 	u64		index;
 };
 
+struct perf_time_ctx {
+	u64		time;
+	u64		stamp;
+	u64		offset;
+};
 
 /**
  * struct perf_event_context - event context structure
@@ -945,9 +950,7 @@ struct perf_event_context {
 	/*
 	 * Context clock, runs when context enabled.
 	 */
-	u64				time;
-	u64				timestamp;
-	u64				timeoffset;
+	struct perf_time_ctx		time;
 
 	/*
 	 * These fields let us detect when two contexts have both
@@ -1040,9 +1043,7 @@ struct bpf_perf_event_data_kern {
  * This is a per-cpu dynamically allocated data structure.
  */
 struct perf_cgroup_info {
-	u64				time;
-	u64				timestamp;
-	u64				timeoffset;
+	struct perf_time_ctx		time;
 	int				active;
 };
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7cb51dbf897a..c25e2bf27001 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -775,7 +775,7 @@ static inline u64 perf_cgroup_event_time(struct perf_event *event)
 	struct perf_cgroup_info *t;
 
 	t = per_cpu_ptr(event->cgrp->info, event->cpu);
-	return t->time;
+	return t->time.time;
 }
 
 static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
@@ -784,20 +784,16 @@ static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
 
 	t = per_cpu_ptr(event->cgrp->info, event->cpu);
 	if (!__load_acquire(&t->active))
-		return t->time;
-	now += READ_ONCE(t->timeoffset);
+		return t->time.time;
+	now += READ_ONCE(t->time.offset);
 	return now;
 }
 
+static inline void update_perf_time_ctx(struct perf_time_ctx *time, u64 now, bool adv);
+
 static inline void __update_cgrp_time(struct perf_cgroup_info *info, u64 now, bool adv)
 {
-	if (adv)
-		info->time += now - info->timestamp;
-	info->timestamp = now;
-	/*
-	 * see update_context_time()
-	 */
-	WRITE_ONCE(info->timeoffset, info->time - info->timestamp);
+	update_perf_time_ctx(&info->time, now, adv);
 }
 
 static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx, bool final)
@@ -860,7 +856,7 @@ perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
 	for (css = &cgrp->css; css; css = css->parent) {
 		cgrp = container_of(css, struct perf_cgroup, css);
 		info = this_cpu_ptr(cgrp->info);
-		__update_cgrp_time(info, ctx->timestamp, false);
+		__update_cgrp_time(info, ctx->time.stamp, false);
 		__store_release(&info->active, 1);
 	}
 }
@@ -1469,18 +1465,11 @@ static void perf_unpin_context(struct perf_event_context *ctx)
 	raw_spin_unlock_irqrestore(&ctx->lock, flags);
 }
 
-/*
- * Update the record of the current time in a context.
- */
-static void __update_context_time(struct perf_event_context *ctx, bool adv)
+static inline void update_perf_time_ctx(struct perf_time_ctx *time, u64 now, bool adv)
 {
-	u64 now = perf_clock();
-
-	lockdep_assert_held(&ctx->lock);
-
 	if (adv)
-		ctx->time += now - ctx->timestamp;
-	ctx->timestamp = now;
+		time->time += now - time->stamp;
+	time->stamp = now;
 
 	/*
 	 * The above: time' = time + (now - timestamp), can be re-arranged
@@ -1491,7 +1480,19 @@ static void __update_context_time(struct perf_event_context *ctx, bool adv)
 	 * it's (obviously) not possible to acquire ctx->lock in order to read
 	 * both the above values in a consistent manner.
 	 */
-	WRITE_ONCE(ctx->timeoffset, ctx->time - ctx->timestamp);
+	WRITE_ONCE(time->offset, time->time - time->stamp);
+}
+
+/*
+ * Update the record of the current time in a context.
+ */
+static void __update_context_time(struct perf_event_context *ctx, bool adv)
+{
+	u64 now = perf_clock();
+
+	lockdep_assert_held(&ctx->lock);
+
+	update_perf_time_ctx(&ctx->time, now, adv);
 }
 
 static void update_context_time(struct perf_event_context *ctx)
@@ -1509,7 +1510,7 @@ static u64 perf_event_time(struct perf_event *event)
 	if (is_cgroup_event(event))
 		return perf_cgroup_event_time(event);
 
-	return ctx->time;
+	return ctx->time.time;
 }
 
 static u64 perf_event_time_now(struct perf_event *event, u64 now)
@@ -1523,9 +1524,9 @@ static u64 perf_event_time_now(struct perf_event *event, u64 now)
 		return perf_cgroup_event_time_now(event, now);
 
 	if (!(__load_acquire(&ctx->is_active) & EVENT_TIME))
-		return ctx->time;
+		return ctx->time.time;
 
-	now += READ_ONCE(ctx->timeoffset);
+	now += READ_ONCE(ctx->time.offset);
 	return now;
 }
 
@@ -11302,14 +11303,14 @@ static void task_clock_event_update(struct perf_event *event, u64 now)
 
 static void task_clock_event_start(struct perf_event *event, int flags)
 {
-	local64_set(&event->hw.prev_count, event->ctx->time);
+	local64_set(&event->hw.prev_count, event->ctx->time.time);
 	perf_swevent_start_hrtimer(event);
 }
 
 static void task_clock_event_stop(struct perf_event *event, int flags)
 {
 	perf_swevent_cancel_hrtimer(event);
-	task_clock_event_update(event, event->ctx->time);
+	task_clock_event_update(event, event->ctx->time.time);
 }
 
 static int task_clock_event_add(struct perf_event *event, int flags)
@@ -11329,8 +11330,8 @@ static void task_clock_event_del(struct perf_event *event, int flags)
 static void task_clock_event_read(struct perf_event *event)
 {
 	u64 now = perf_clock();
-	u64 delta = now - event->ctx->timestamp;
-	u64 time = event->ctx->time + delta;
+	u64 delta = now - event->ctx->time.stamp;
+	u64 time = event->ctx->time.time + delta;
 
 	task_clock_event_update(event, time);
 }
-- 
2.46.0.rc1.232.g9752f9e123-goog



* [RFC PATCH v3 09/58] perf: Add a EVENT_GUEST flag
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (7 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 08/58] perf: Clean up perf ctx time Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-08-21  5:27   ` Mi, Dapeng
                     ` (3 more replies)
  2024-08-01  4:58 ` [RFC PATCH v3 10/58] perf: Add generic exclude_guest support Mingwei Zhang
                   ` (51 subsequent siblings)
  60 siblings, 4 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Kan Liang <kan.liang@linux.intel.com>

Current perf doesn't explicitly schedule out all exclude_guest events
while the guest is running. This is not a problem with the current
emulated vPMU, because perf owns all the PMU counters. It can mask the
counter which is assigned to an exclude_guest event when a guest is
running (the Intel way), or set the corresponding HostOnly bit in the
event selector (the AMD way). The counter doesn't count when a guest is
running.

However, neither way works with the introduced passthrough vPMU.
A guest owns all the PMU counters when it's running. The host should not
mask any counters: a counter may be in use by the guest, and its event
selector may be overwritten.

Perf should explicitly schedule out all exclude_guest events to release
the PMU resources when entering a guest, and resume the counting when
exiting the guest.

It's possible that an exclude_guest event is created when a guest is
running. The new event should not be scheduled in either.

The ctx time is shared among different PMUs. It cannot be stopped when a
guest is running, because it is still needed to calculate the time for
events from other PMUs, e.g., uncore events. Add timeguest to track the
guest run time. For an exclude_guest event, the elapsed time equals
the ctx time minus the guest time (illustrated briefly below).
Cgroups have dedicated times. Use the same method to deduct the guest
time from the cgroup time as well.
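
As a standalone illustration of that accounting (not kernel code; the
numbers are made up for the example):

#include <stdio.h>

/*
 * ctx time   = now + time.offset
 * guest time = now + timeguest.offset
 * exclude_guest event time = ctx time - guest time
 *                          = time.offset - timeguest.offset
 */
int main(void)
{
	unsigned long long ctx_time   = 100;	/* ms the ctx has been active    */
	unsigned long long guest_time =  30;	/* ms of that spent in the guest */

	/* An exclude_guest event must not accumulate the guest's share. */
	printf("exclude_guest time: %llu ms\n", ctx_time - guest_time); /* 70 ms */
	return 0;
}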

Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 include/linux/perf_event.h |   6 ++
 kernel/events/core.c       | 178 +++++++++++++++++++++++++++++++------
 2 files changed, 155 insertions(+), 29 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e22cdb6486e6..81a5f8399cb8 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -952,6 +952,11 @@ struct perf_event_context {
 	 */
 	struct perf_time_ctx		time;
 
+	/*
+	 * Context clock, runs when in the guest mode.
+	 */
+	struct perf_time_ctx		timeguest;
+
 	/*
 	 * These fields let us detect when two contexts have both
 	 * been cloned (inherited) from a common ancestor.
@@ -1044,6 +1049,7 @@ struct bpf_perf_event_data_kern {
  */
 struct perf_cgroup_info {
 	struct perf_time_ctx		time;
+	struct perf_time_ctx		timeguest;
 	int				active;
 };
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c25e2bf27001..57648736e43e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -376,7 +376,8 @@ enum event_type_t {
 	/* see ctx_resched() for details */
 	EVENT_CPU = 0x8,
 	EVENT_CGROUP = 0x10,
-	EVENT_FLAGS = EVENT_CGROUP,
+	EVENT_GUEST = 0x20,
+	EVENT_FLAGS = EVENT_CGROUP | EVENT_GUEST,
 	EVENT_ALL = EVENT_FLEXIBLE | EVENT_PINNED,
 };
 
@@ -407,6 +408,7 @@ static atomic_t nr_include_guest_events __read_mostly;
 
 static atomic_t nr_mediated_pmu_vms;
 static DEFINE_MUTEX(perf_mediated_pmu_mutex);
+static DEFINE_PER_CPU(bool, perf_in_guest);
 
 /* !exclude_guest event of PMU with PERF_PMU_CAP_PASSTHROUGH_VPMU */
 static inline bool is_include_guest_event(struct perf_event *event)
@@ -706,6 +708,10 @@ static bool perf_skip_pmu_ctx(struct perf_event_pmu_context *pmu_ctx,
 	if ((event_type & EVENT_CGROUP) && !pmu_ctx->nr_cgroups)
 		return true;
 
+	if ((event_type & EVENT_GUEST) &&
+	    !(pmu_ctx->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU))
+		return true;
+
 	return false;
 }
 
@@ -770,12 +776,21 @@ static inline int is_cgroup_event(struct perf_event *event)
 	return event->cgrp != NULL;
 }
 
+static inline u64 __perf_event_time_ctx(struct perf_event *event,
+					struct perf_time_ctx *time,
+					struct perf_time_ctx *timeguest);
+
+static inline u64 __perf_event_time_ctx_now(struct perf_event *event,
+					    struct perf_time_ctx *time,
+					    struct perf_time_ctx *timeguest,
+					    u64 now);
+
 static inline u64 perf_cgroup_event_time(struct perf_event *event)
 {
 	struct perf_cgroup_info *t;
 
 	t = per_cpu_ptr(event->cgrp->info, event->cpu);
-	return t->time.time;
+	return __perf_event_time_ctx(event, &t->time, &t->timeguest);
 }
 
 static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
@@ -784,9 +799,9 @@ static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
 
 	t = per_cpu_ptr(event->cgrp->info, event->cpu);
 	if (!__load_acquire(&t->active))
-		return t->time.time;
-	now += READ_ONCE(t->time.offset);
-	return now;
+		return __perf_event_time_ctx(event, &t->time, &t->timeguest);
+
+	return __perf_event_time_ctx_now(event, &t->time, &t->timeguest, now);
 }
 
 static inline void update_perf_time_ctx(struct perf_time_ctx *time, u64 now, bool adv);
@@ -796,6 +811,18 @@ static inline void __update_cgrp_time(struct perf_cgroup_info *info, u64 now, bo
 	update_perf_time_ctx(&info->time, now, adv);
 }
 
+static inline void __update_cgrp_guest_time(struct perf_cgroup_info *info, u64 now, bool adv)
+{
+	update_perf_time_ctx(&info->timeguest, now, adv);
+}
+
+static inline void update_cgrp_time(struct perf_cgroup_info *info, u64 now)
+{
+	__update_cgrp_time(info, now, true);
+	if (__this_cpu_read(perf_in_guest))
+		__update_cgrp_guest_time(info, now, true);
+}
+
 static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx, bool final)
 {
 	struct perf_cgroup *cgrp = cpuctx->cgrp;
@@ -809,7 +836,7 @@ static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx,
 			cgrp = container_of(css, struct perf_cgroup, css);
 			info = this_cpu_ptr(cgrp->info);
 
-			__update_cgrp_time(info, now, true);
+			update_cgrp_time(info, now);
 			if (final)
 				__store_release(&info->active, 0);
 		}
@@ -832,11 +859,11 @@ static inline void update_cgrp_time_from_event(struct perf_event *event)
 	 * Do not update time when cgroup is not active
 	 */
 	if (info->active)
-		__update_cgrp_time(info, perf_clock(), true);
+		update_cgrp_time(info, perf_clock());
 }
 
 static inline void
-perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
+perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx, bool guest)
 {
 	struct perf_event_context *ctx = &cpuctx->ctx;
 	struct perf_cgroup *cgrp = cpuctx->cgrp;
@@ -856,8 +883,12 @@ perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
 	for (css = &cgrp->css; css; css = css->parent) {
 		cgrp = container_of(css, struct perf_cgroup, css);
 		info = this_cpu_ptr(cgrp->info);
-		__update_cgrp_time(info, ctx->time.stamp, false);
-		__store_release(&info->active, 1);
+		if (guest) {
+			__update_cgrp_guest_time(info, ctx->time.stamp, false);
+		} else {
+			__update_cgrp_time(info, ctx->time.stamp, false);
+			__store_release(&info->active, 1);
+		}
 	}
 }
 
@@ -1061,7 +1092,7 @@ static inline int perf_cgroup_connect(pid_t pid, struct perf_event *event,
 }
 
 static inline void
-perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
+perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx, bool guest)
 {
 }
 
@@ -1488,16 +1519,34 @@ static inline void update_perf_time_ctx(struct perf_time_ctx *time, u64 now, boo
  */
 static void __update_context_time(struct perf_event_context *ctx, bool adv)
 {
-	u64 now = perf_clock();
+	lockdep_assert_held(&ctx->lock);
+
+	update_perf_time_ctx(&ctx->time, perf_clock(), adv);
+}
 
+static void __update_context_guest_time(struct perf_event_context *ctx, bool adv)
+{
 	lockdep_assert_held(&ctx->lock);
 
-	update_perf_time_ctx(&ctx->time, now, adv);
+	/* must be called after __update_context_time(); */
+	update_perf_time_ctx(&ctx->timeguest, ctx->time.stamp, adv);
 }
 
 static void update_context_time(struct perf_event_context *ctx)
 {
 	__update_context_time(ctx, true);
+	if (__this_cpu_read(perf_in_guest))
+		__update_context_guest_time(ctx, true);
+}
+
+static inline u64 __perf_event_time_ctx(struct perf_event *event,
+					struct perf_time_ctx *time,
+					struct perf_time_ctx *timeguest)
+{
+	if (event->attr.exclude_guest)
+		return time->time - timeguest->time;
+	else
+		return time->time;
 }
 
 static u64 perf_event_time(struct perf_event *event)
@@ -1510,7 +1559,26 @@ static u64 perf_event_time(struct perf_event *event)
 	if (is_cgroup_event(event))
 		return perf_cgroup_event_time(event);
 
-	return ctx->time.time;
+	return __perf_event_time_ctx(event, &ctx->time, &ctx->timeguest);
+}
+
+static inline u64 __perf_event_time_ctx_now(struct perf_event *event,
+					    struct perf_time_ctx *time,
+					    struct perf_time_ctx *timeguest,
+					    u64 now)
+{
+	/*
+	 * The exclude_guest event time should be calculated from
+	 * the ctx time -  the guest time.
+	 * The ctx time is now + READ_ONCE(time->offset).
+	 * The guest time is now + READ_ONCE(timeguest->offset).
+	 * So the exclude_guest time is
+	 * READ_ONCE(time->offset) - READ_ONCE(timeguest->offset).
+	 */
+	if (event->attr.exclude_guest && __this_cpu_read(perf_in_guest))
+		return READ_ONCE(time->offset) - READ_ONCE(timeguest->offset);
+	else
+		return now + READ_ONCE(time->offset);
 }
 
 static u64 perf_event_time_now(struct perf_event *event, u64 now)
@@ -1524,10 +1592,9 @@ static u64 perf_event_time_now(struct perf_event *event, u64 now)
 		return perf_cgroup_event_time_now(event, now);
 
 	if (!(__load_acquire(&ctx->is_active) & EVENT_TIME))
-		return ctx->time.time;
+		return __perf_event_time_ctx(event, &ctx->time, &ctx->timeguest);
 
-	now += READ_ONCE(ctx->time.offset);
-	return now;
+	return __perf_event_time_ctx_now(event, &ctx->time, &ctx->timeguest, now);
 }
 
 static enum event_type_t get_event_type(struct perf_event *event)
@@ -3334,9 +3401,15 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
 	 * would only update time for the pinned events.
 	 */
 	if (is_active & EVENT_TIME) {
+		bool stop;
+
+		/* vPMU should not stop time */
+		stop = !(event_type & EVENT_GUEST) &&
+		       ctx == &cpuctx->ctx;
+
 		/* update (and stop) ctx time */
 		update_context_time(ctx);
-		update_cgrp_time_from_cpuctx(cpuctx, ctx == &cpuctx->ctx);
+		update_cgrp_time_from_cpuctx(cpuctx, stop);
 		/*
 		 * CPU-release for the below ->is_active store,
 		 * see __load_acquire() in perf_event_time_now()
@@ -3354,7 +3427,18 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
 			cpuctx->task_ctx = NULL;
 	}
 
-	is_active ^= ctx->is_active; /* changed bits */
+	if (event_type & EVENT_GUEST) {
+		/*
+		 * Schedule out all !exclude_guest events of PMU
+		 * with PERF_PMU_CAP_PASSTHROUGH_VPMU.
+		 */
+		is_active = EVENT_ALL;
+		__update_context_guest_time(ctx, false);
+		perf_cgroup_set_timestamp(cpuctx, true);
+		barrier();
+	} else {
+		is_active ^= ctx->is_active; /* changed bits */
+	}
 
 	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
 		if (perf_skip_pmu_ctx(pmu_ctx, event_type))
@@ -3853,10 +3937,15 @@ static inline void group_update_userpage(struct perf_event *group_event)
 		event_update_userpage(event);
 }
 
+struct merge_sched_data {
+	int can_add_hw;
+	enum event_type_t event_type;
+};
+
 static int merge_sched_in(struct perf_event *event, void *data)
 {
 	struct perf_event_context *ctx = event->ctx;
-	int *can_add_hw = data;
+	struct merge_sched_data *msd = data;
 
 	if (event->state <= PERF_EVENT_STATE_OFF)
 		return 0;
@@ -3864,13 +3953,22 @@ static int merge_sched_in(struct perf_event *event, void *data)
 	if (!event_filter_match(event))
 		return 0;
 
-	if (group_can_go_on(event, *can_add_hw)) {
+	/*
+	 * Don't schedule in any exclude_guest events of PMU with
+	 * PERF_PMU_CAP_PASSTHROUGH_VPMU, while a guest is running.
+	 */
+	if (__this_cpu_read(perf_in_guest) && event->attr.exclude_guest &&
+	    event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU &&
+	    !(msd->event_type & EVENT_GUEST))
+		return 0;
+
+	if (group_can_go_on(event, msd->can_add_hw)) {
 		if (!group_sched_in(event, ctx))
 			list_add_tail(&event->active_list, get_event_list(event));
 	}
 
 	if (event->state == PERF_EVENT_STATE_INACTIVE) {
-		*can_add_hw = 0;
+		msd->can_add_hw = 0;
 		if (event->attr.pinned) {
 			perf_cgroup_event_disable(event, ctx);
 			perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
@@ -3889,11 +3987,15 @@ static int merge_sched_in(struct perf_event *event, void *data)
 
 static void pmu_groups_sched_in(struct perf_event_context *ctx,
 				struct perf_event_groups *groups,
-				struct pmu *pmu)
+				struct pmu *pmu,
+				enum event_type_t event_type)
 {
-	int can_add_hw = 1;
+	struct merge_sched_data msd = {
+		.can_add_hw = 1,
+		.event_type = event_type,
+	};
 	visit_groups_merge(ctx, groups, smp_processor_id(), pmu,
-			   merge_sched_in, &can_add_hw);
+			   merge_sched_in, &msd);
 }
 
 static void ctx_groups_sched_in(struct perf_event_context *ctx,
@@ -3905,14 +4007,14 @@ static void ctx_groups_sched_in(struct perf_event_context *ctx,
 	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
 		if (perf_skip_pmu_ctx(pmu_ctx, event_type))
 			continue;
-		pmu_groups_sched_in(ctx, groups, pmu_ctx->pmu);
+		pmu_groups_sched_in(ctx, groups, pmu_ctx->pmu, event_type);
 	}
 }
 
 static void __pmu_ctx_sched_in(struct perf_event_context *ctx,
 			       struct pmu *pmu)
 {
-	pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu);
+	pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu, 0);
 }
 
 static void
@@ -3927,9 +4029,11 @@ ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type)
 		return;
 
 	if (!(is_active & EVENT_TIME)) {
+		/* EVENT_TIME should be active while the guest runs */
+		WARN_ON_ONCE(event_type & EVENT_GUEST);
 		/* start ctx time */
 		__update_context_time(ctx, false);
-		perf_cgroup_set_timestamp(cpuctx);
+		perf_cgroup_set_timestamp(cpuctx, false);
 		/*
 		 * CPU-release for the below ->is_active store,
 		 * see __load_acquire() in perf_event_time_now()
@@ -3945,7 +4049,23 @@ ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type)
 			WARN_ON_ONCE(cpuctx->task_ctx != ctx);
 	}
 
-	is_active ^= ctx->is_active; /* changed bits */
+	if (event_type & EVENT_GUEST) {
+		/*
+		 * Schedule in all !exclude_guest events of PMU
+		 * with PERF_PMU_CAP_PASSTHROUGH_VPMU.
+		 */
+		is_active = EVENT_ALL;
+
+		/*
+		 * Update ctx time to set the new start time for
+		 * the exclude_guest events.
+		 */
+		update_context_time(ctx);
+		update_cgrp_time_from_cpuctx(cpuctx, false);
+		barrier();
+	} else {
+		is_active ^= ctx->is_active; /* changed bits */
+	}
 
 	/*
 	 * First go through the list and put on any pinned groups
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 10/58] perf: Add generic exclude_guest support
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (8 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 09/58] perf: Add a EVENT_GUEST flag Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-10-14 11:20   ` Peter Zijlstra
  2024-08-01  4:58 ` [RFC PATCH v3 11/58] x86/irq: Factor out common code for installing kvm irq handler Mingwei Zhang
                   ` (50 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Kan Liang <kan.liang@linux.intel.com>

Only KVM knows the exact time when a guest is entering/exiting. Expose
two interfaces to KVM to switch the ownership of the PMU resources.
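
As a minimal sketch of the expected usage (the function below is an
illustrative stand-in, not KVM's actual call site, which is wired up later
in this series):

  static void example_run_guest_once(void)
  {
          /*
           * Both helpers must be called with IRQs disabled on the CPU
           * that runs the vCPU; perf_guest_enter()/perf_guest_exit()
           * assert this via lockdep.
           */
          local_irq_disable();

          perf_guest_enter();     /* host exclude_guest events go idle */

          /* ... hardware VM entry, guest runs, VM exit ... */

          perf_guest_exit();      /* host exclude_guest events resume */

          local_irq_enable();
  }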

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 include/linux/perf_event.h |  4 +++
 kernel/events/core.c       | 54 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 81a5f8399cb8..75773f9890cc 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1738,6 +1738,8 @@ extern int perf_event_period(struct perf_event *event, u64 value);
 extern u64 perf_event_pause(struct perf_event *event, bool reset);
 int perf_get_mediated_pmu(void);
 void perf_put_mediated_pmu(void);
+void perf_guest_enter(void);
+void perf_guest_exit(void);
 #else /* !CONFIG_PERF_EVENTS: */
 static inline void *
 perf_aux_output_begin(struct perf_output_handle *handle,
@@ -1831,6 +1833,8 @@ static inline int perf_get_mediated_pmu(void)
 }
 
 static inline void perf_put_mediated_pmu(void)			{ }
+static inline void perf_guest_enter(void)			{ }
+static inline void perf_guest_exit(void)			{ }
 #endif
 
 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 57648736e43e..57ff737b922b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5941,6 +5941,60 @@ void perf_put_mediated_pmu(void)
 }
 EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
 
+/* When entering a guest, schedule out all exclude_guest events. */
+void perf_guest_enter(void)
+{
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
+
+	lockdep_assert_irqs_disabled();
+
+	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+	if (WARN_ON_ONCE(__this_cpu_read(perf_in_guest)))
+		goto unlock;
+
+	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
+	ctx_sched_out(&cpuctx->ctx, EVENT_GUEST);
+	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
+	if (cpuctx->task_ctx) {
+		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
+		task_ctx_sched_out(cpuctx->task_ctx, EVENT_GUEST);
+		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
+	}
+
+	__this_cpu_write(perf_in_guest, true);
+
+unlock:
+	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+}
+EXPORT_SYMBOL_GPL(perf_guest_enter);
+
+void perf_guest_exit(void)
+{
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
+
+	lockdep_assert_irqs_disabled();
+
+	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+	if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
+		goto unlock;
+
+	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
+	ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
+	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
+	if (cpuctx->task_ctx) {
+		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
+		ctx_sched_in(cpuctx->task_ctx, EVENT_GUEST);
+		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
+	}
+
+	__this_cpu_write(perf_in_guest, false);
+unlock:
+	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+}
+EXPORT_SYMBOL_GPL(perf_guest_exit);
+
 /*
  * Holding the top-level event's child_mutex means that any
  * descendant process that has inherited this event will block
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 11/58] x86/irq: Factor out common code for installing kvm irq handler
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (9 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 10/58] perf: Add generic exclude_guest support Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-08-01  4:58 ` [RFC PATCH v3 12/58] perf: core/x86: Register a new vector for KVM GUEST PMI Mingwei Zhang
                   ` (49 subsequent siblings)
  60 siblings, 0 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Xiong Zhang <xiong.y.zhang@linux.intel.com>

KVM will register irq handlers for POSTED_INTR_WAKEUP_VECTOR and
KVM_GUEST_PMI_VECTOR. Rename the existing kvm_set_posted_intr_wakeup_handler()
to x86_set_kvm_irq_handler(), and use the vector input parameter to
distinguish POSTED_INTR_WAKEUP_VECTOR from KVM_GUEST_PMI_VECTOR.

The caller should call x86_set_kvm_irq_handler() only once to register a
non-dummy handler for each vector. If a non-dummy handler has already been
registered for a vector and the caller registers the same or a different
non-dummy handler again, the second call emits a warning.
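
As a sketch of the intended pairing (the my_* names below are hypothetical,
used only for illustration):

  static void my_wakeup_handler(void) { /* handle the wakeup IPI */ }

  static void my_setup(void)
  {
          /* Register once: the first non-dummy handler for the vector wins. */
          x86_set_kvm_irq_handler(POSTED_INTR_WAKEUP_VECTOR, my_wakeup_handler);
  }

  static void my_teardown(void)
  {
          /* NULL reinstalls the dummy handler and synchronizes RCU. */
          x86_set_kvm_irq_handler(POSTED_INTR_WAKEUP_VECTOR, NULL);
  }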

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/include/asm/irq.h |  2 +-
 arch/x86/kernel/irq.c      | 18 ++++++++++++------
 arch/x86/kvm/vmx/vmx.c     |  4 ++--
 3 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
index 194dfff84cb1..050a247b69b4 100644
--- a/arch/x86/include/asm/irq.h
+++ b/arch/x86/include/asm/irq.h
@@ -30,7 +30,7 @@ struct irq_desc;
 extern void fixup_irqs(void);
 
 #if IS_ENABLED(CONFIG_KVM)
-extern void kvm_set_posted_intr_wakeup_handler(void (*handler)(void));
+void x86_set_kvm_irq_handler(u8 vector, void (*handler)(void));
 #endif
 
 extern void (*x86_platform_ipi_callback)(void);
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 385e3a5fc304..18cd418fe106 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -312,16 +312,22 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_x86_platform_ipi)
 static void dummy_handler(void) {}
 static void (*kvm_posted_intr_wakeup_handler)(void) = dummy_handler;
 
-void kvm_set_posted_intr_wakeup_handler(void (*handler)(void))
+void x86_set_kvm_irq_handler(u8 vector, void (*handler)(void))
 {
-	if (handler)
+	if (!handler)
+		handler = dummy_handler;
+
+	if (vector == POSTED_INTR_WAKEUP_VECTOR &&
+	    (handler == dummy_handler ||
+	     kvm_posted_intr_wakeup_handler == dummy_handler))
 		kvm_posted_intr_wakeup_handler = handler;
-	else {
-		kvm_posted_intr_wakeup_handler = dummy_handler;
+	else
+		WARN_ON_ONCE(1);
+
+	if (handler == dummy_handler)
 		synchronize_rcu();
-	}
 }
-EXPORT_SYMBOL_GPL(kvm_set_posted_intr_wakeup_handler);
+EXPORT_SYMBOL_GPL(x86_set_kvm_irq_handler);
 
 /*
  * Handler for POSTED_INTERRUPT_VECTOR.
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b3c83c06f826..ad465881b043 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8292,7 +8292,7 @@ void vmx_migrate_timers(struct kvm_vcpu *vcpu)
 
 void vmx_hardware_unsetup(void)
 {
-	kvm_set_posted_intr_wakeup_handler(NULL);
+	x86_set_kvm_irq_handler(POSTED_INTR_WAKEUP_VECTOR, NULL);
 
 	if (nested)
 		nested_vmx_hardware_unsetup();
@@ -8602,7 +8602,7 @@ __init int vmx_hardware_setup(void)
 	if (r && nested)
 		nested_vmx_hardware_unsetup();
 
-	kvm_set_posted_intr_wakeup_handler(pi_wakeup_handler);
+	x86_set_kvm_irq_handler(POSTED_INTR_WAKEUP_VECTOR, pi_wakeup_handler);
 
 	return r;
 }
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 12/58] perf: core/x86: Register a new vector for KVM GUEST PMI
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (10 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 11/58] x86/irq: Factor out common code for installing kvm irq handler Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-09-09 22:11   ` Colton Lewis
  2024-08-01  4:58 ` [RFC PATCH v3 13/58] KVM: x86/pmu: Register KVM_GUEST_PMI_VECTOR handler Mingwei Zhang
                   ` (48 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Xiong Zhang <xiong.y.zhang@linux.intel.com>

Create a new vector in the host IDT for KVM guest PMI handling within the
mediated passthrough vPMU. In addition, add guest PMI handler registration
to x86_set_kvm_irq_handler().

This is preparatory work that allows the mediated passthrough vPMU to
handle KVM guest PMIs without interference from the host PMU's PMI
handler.
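
For orientation, once this patch and the KVM-side registration added later
in this series are in place, the intended delivery path is roughly:

  /*
   * Guest PMI delivery with the mediated passthrough vPMU:
   *
   *   counter overflow while the LVTPC points at KVM_GUEST_PMI_VECTOR
   *     -> IDT entry asm_sysvec_kvm_guest_pmi_handler
   *     -> sysvec_kvm_guest_pmi_handler():
   *            apic_eoi();
   *            inc_irq_stat(kvm_guest_pmis);
   *            kvm_guest_pmi_handler();   (installed via x86_set_kvm_irq_handler())
   *     -> KVM's handler pends KVM_REQ_PMI on the running vCPU, and the
   *        PMI is injected into the guest before the next VM entry.
   */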

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/include/asm/hardirq.h                |  1 +
 arch/x86/include/asm/idtentry.h               |  1 +
 arch/x86/include/asm/irq_vectors.h            |  5 ++++-
 arch/x86/kernel/idt.c                         |  1 +
 arch/x86/kernel/irq.c                         | 21 +++++++++++++++++++
 .../beauty/arch/x86/include/asm/irq_vectors.h |  5 ++++-
 6 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index c67fa6ad098a..42a396763c8d 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -19,6 +19,7 @@ typedef struct {
 	unsigned int kvm_posted_intr_ipis;
 	unsigned int kvm_posted_intr_wakeup_ipis;
 	unsigned int kvm_posted_intr_nested_ipis;
+	unsigned int kvm_guest_pmis;
 #endif
 	unsigned int x86_platform_ipis;	/* arch dependent */
 	unsigned int apic_perf_irqs;
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index d4f24499b256..7b1e3e542b1d 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -745,6 +745,7 @@ DECLARE_IDTENTRY_SYSVEC(IRQ_WORK_VECTOR,		sysvec_irq_work);
 DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_VECTOR,		sysvec_kvm_posted_intr_ipi);
 DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_WAKEUP_VECTOR,	sysvec_kvm_posted_intr_wakeup_ipi);
 DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_NESTED_VECTOR,	sysvec_kvm_posted_intr_nested_ipi);
+DECLARE_IDTENTRY_SYSVEC(KVM_GUEST_PMI_VECTOR,	        sysvec_kvm_guest_pmi_handler);
 #else
 # define fred_sysvec_kvm_posted_intr_ipi		NULL
 # define fred_sysvec_kvm_posted_intr_wakeup_ipi		NULL
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 13aea8fc3d45..ada270e6f5cb 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -77,7 +77,10 @@
  */
 #define IRQ_WORK_VECTOR			0xf6
 
-/* 0xf5 - unused, was UV_BAU_MESSAGE */
+#if IS_ENABLED(CONFIG_KVM)
+#define KVM_GUEST_PMI_VECTOR		0xf5
+#endif
+
 #define DEFERRED_ERROR_VECTOR		0xf4
 
 /* Vector on which hypervisor callbacks will be delivered */
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index f445bec516a0..0bec4c7e2308 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -157,6 +157,7 @@ static const __initconst struct idt_data apic_idts[] = {
 	INTG(POSTED_INTR_VECTOR,		asm_sysvec_kvm_posted_intr_ipi),
 	INTG(POSTED_INTR_WAKEUP_VECTOR,		asm_sysvec_kvm_posted_intr_wakeup_ipi),
 	INTG(POSTED_INTR_NESTED_VECTOR,		asm_sysvec_kvm_posted_intr_nested_ipi),
+	INTG(KVM_GUEST_PMI_VECTOR,		asm_sysvec_kvm_guest_pmi_handler),
 # endif
 # ifdef CONFIG_IRQ_WORK
 	INTG(IRQ_WORK_VECTOR,			asm_sysvec_irq_work),
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 18cd418fe106..b29714e23fc4 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -183,6 +183,12 @@ int arch_show_interrupts(struct seq_file *p, int prec)
 		seq_printf(p, "%10u ",
 			   irq_stats(j)->kvm_posted_intr_wakeup_ipis);
 	seq_puts(p, "  Posted-interrupt wakeup event\n");
+
+	seq_printf(p, "%*s: ", prec, "VPMU");
+	for_each_online_cpu(j)
+		seq_printf(p, "%10u ",
+			   irq_stats(j)->kvm_guest_pmis);
+	seq_puts(p, " KVM GUEST PMI\n");
 #endif
 #ifdef CONFIG_X86_POSTED_MSI
 	seq_printf(p, "%*s: ", prec, "PMN");
@@ -311,6 +317,7 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_x86_platform_ipi)
 #if IS_ENABLED(CONFIG_KVM)
 static void dummy_handler(void) {}
 static void (*kvm_posted_intr_wakeup_handler)(void) = dummy_handler;
+static void (*kvm_guest_pmi_handler)(void) = dummy_handler;
 
 void x86_set_kvm_irq_handler(u8 vector, void (*handler)(void))
 {
@@ -321,6 +328,10 @@ void x86_set_kvm_irq_handler(u8 vector, void (*handler)(void))
 	    (handler == dummy_handler ||
 	     kvm_posted_intr_wakeup_handler == dummy_handler))
 		kvm_posted_intr_wakeup_handler = handler;
+	else if (vector == KVM_GUEST_PMI_VECTOR &&
+		 (handler == dummy_handler ||
+		  kvm_guest_pmi_handler == dummy_handler))
+		kvm_guest_pmi_handler = handler;
 	else
 		WARN_ON_ONCE(1);
 
@@ -356,6 +367,16 @@ DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_kvm_posted_intr_nested_ipi)
 	apic_eoi();
 	inc_irq_stat(kvm_posted_intr_nested_ipis);
 }
+
+/*
+ * Handler for KVM_GUEST_PMI_VECTOR.
+ */
+DEFINE_IDTENTRY_SYSVEC(sysvec_kvm_guest_pmi_handler)
+{
+	apic_eoi();
+	inc_irq_stat(kvm_guest_pmis);
+	kvm_guest_pmi_handler();
+}
 #endif
 
 #ifdef CONFIG_X86_POSTED_MSI
diff --git a/tools/perf/trace/beauty/arch/x86/include/asm/irq_vectors.h b/tools/perf/trace/beauty/arch/x86/include/asm/irq_vectors.h
index 13aea8fc3d45..670dcee46631 100644
--- a/tools/perf/trace/beauty/arch/x86/include/asm/irq_vectors.h
+++ b/tools/perf/trace/beauty/arch/x86/include/asm/irq_vectors.h
@@ -77,7 +77,10 @@
  */
 #define IRQ_WORK_VECTOR			0xf6
 
-/* 0xf5 - unused, was UV_BAU_MESSAGE */
+#if IS_ENABLED(CONFIG_KVM)
+#define KVM_GUEST_PMI_VECTOR           0xf5
+#endif
+
 #define DEFERRED_ERROR_VECTOR		0xf4
 
 /* Vector on which hypervisor callbacks will be delivered */
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 13/58] KVM: x86/pmu: Register KVM_GUEST_PMI_VECTOR handler
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (11 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 12/58] perf: core/x86: Register a new vector for KVM GUEST PMI Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-08-01  4:58 ` [RFC PATCH v3 14/58] perf: Add switch_interrupt() interface Mingwei Zhang
                   ` (47 subsequent siblings)
  60 siblings, 0 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Xiong Zhang <xiong.y.zhang@linux.intel.com>

Add functions to register/unregister the guest KVM PMI handler at KVM
module initialization and exit. This allows a host PMU with the passthrough
capability enabled to switch the PMI handler at PMU context switch.

Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/x86.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8c9e4281d978..f1d589c07068 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13946,6 +13946,16 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
 }
 EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);
 
+static void kvm_handle_guest_pmi(void)
+{
+	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
+
+	if (WARN_ON_ONCE(!vcpu))
+		return;
+
+	kvm_make_request(KVM_REQ_PMI, vcpu);
+}
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
@@ -13980,12 +13990,14 @@ static int __init kvm_x86_init(void)
 {
 	kvm_mmu_x86_module_init();
 	mitigate_smt_rsb &= boot_cpu_has_bug(X86_BUG_SMT_RSB) && cpu_smt_possible();
+	x86_set_kvm_irq_handler(KVM_GUEST_PMI_VECTOR, kvm_handle_guest_pmi);
 	return 0;
 }
 module_init(kvm_x86_init);
 
 static void __exit kvm_x86_exit(void)
 {
+	x86_set_kvm_irq_handler(KVM_GUEST_PMI_VECTOR, NULL);
 	WARN_ON_ONCE(static_branch_unlikely(&kvm_has_noapic_vcpu));
 }
 module_exit(kvm_x86_exit);
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 14/58] perf: Add switch_interrupt() interface
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (12 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 13/58] KVM: x86/pmu: Register KVM_GUEST_PMI_VECTOR handler Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-09-19  6:02   ` Manali Shukla
                     ` (3 more replies)
  2024-08-01  4:58 ` [RFC PATCH v3 15/58] perf/x86: Support switch_interrupt interface Mingwei Zhang
                   ` (46 subsequent siblings)
  60 siblings, 4 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Kan Liang <kan.liang@linux.intel.com>

There will be a dedicated interrupt vector for guests on some platforms,
e.g., Intel. Add an interface to switch the interrupt vector while
entering/exiting a guest.

When the PMI is switched to a new guest vector, the guest_lvtpc value needs
to be reflected onto the hardware, e.g., when the guest clears the PMI mask
bit, the hardware PMI mask bit should be cleared as well, so that PMIs can
continue to be generated for the guest. Therefore a guest_lvtpc parameter is
added to perf_guest_enter() and switch_interrupt().

At switch_interrupt(), the target PMU with the PASSTHROUGH capability should
be found. Since only one passthrough PMU is supported, keep the
implementation simple by tracking that PMU as a global variable.
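
The contract for the new optional callback can be sketched as follows (the
example_* names are placeholders; the real x86 implementation follows in
the next patch):

  /* Hypothetical driver-side implementation of the new callback. */
  static void example_switch_interrupt(bool enter, u32 guest_lvtpc)
  {
          if (enter) {
                  /*
                   * Redirect the PMI to the guest vector, mirroring the
                   * guest's LVTPC mask bit onto the hardware LVTPC.
                   */
          } else {
                  /* Restore the host's NMI-based PMI delivery. */
          }
  }

  static struct pmu example_pmu = {
          /* ... other callbacks ... */
          .switch_interrupt       = example_switch_interrupt,
  };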

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>

[Simplify the commit with removal of srcu lock/unlock since only one pmu is
supported.]

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 include/linux/perf_event.h |  9 +++++++--
 kernel/events/core.c       | 36 ++++++++++++++++++++++++++++++++++--
 2 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 75773f9890cc..aeb08f78f539 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -541,6 +541,11 @@ struct pmu {
 	 * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
 	 */
 	int (*check_period)		(struct perf_event *event, u64 value); /* optional */
+
+	/*
+	 * Switch the interrupt vectors, e.g., guest enter/exit.
+	 */
+	void (*switch_interrupt)	(bool enter, u32 guest_lvtpc); /* optional */
 };
 
 enum perf_addr_filter_action_t {
@@ -1738,7 +1743,7 @@ extern int perf_event_period(struct perf_event *event, u64 value);
 extern u64 perf_event_pause(struct perf_event *event, bool reset);
 int perf_get_mediated_pmu(void);
 void perf_put_mediated_pmu(void);
-void perf_guest_enter(void);
+void perf_guest_enter(u32 guest_lvtpc);
 void perf_guest_exit(void);
 #else /* !CONFIG_PERF_EVENTS: */
 static inline void *
@@ -1833,7 +1838,7 @@ static inline int perf_get_mediated_pmu(void)
 }
 
 static inline void perf_put_mediated_pmu(void)			{ }
-static inline void perf_guest_enter(void)			{ }
+static inline void perf_guest_enter(u32 guest_lvtpc)		{ }
 static inline void perf_guest_exit(void)			{ }
 #endif
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 57ff737b922b..047ca5748ee2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -422,6 +422,7 @@ static inline bool is_include_guest_event(struct perf_event *event)
 
 static LIST_HEAD(pmus);
 static DEFINE_MUTEX(pmus_lock);
+static struct pmu *passthru_pmu;
 static struct srcu_struct pmus_srcu;
 static cpumask_var_t perf_online_mask;
 static struct kmem_cache *perf_event_cache;
@@ -5941,8 +5942,21 @@ void perf_put_mediated_pmu(void)
 }
 EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
 
+static void perf_switch_interrupt(bool enter, u32 guest_lvtpc)
+{
+	/* Mediated passthrough PMU should have PASSTHROUGH_VPMU cap. */
+	if (!passthru_pmu)
+		return;
+
+	if (passthru_pmu->switch_interrupt &&
+	    try_module_get(passthru_pmu->module)) {
+		passthru_pmu->switch_interrupt(enter, guest_lvtpc);
+		module_put(passthru_pmu->module);
+	}
+}
+
 /* When entering a guest, schedule out all exclude_guest events. */
-void perf_guest_enter(void)
+void perf_guest_enter(u32 guest_lvtpc)
 {
 	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
 
@@ -5962,6 +5976,8 @@ void perf_guest_enter(void)
 		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
 	}
 
+	perf_switch_interrupt(true, guest_lvtpc);
+
 	__this_cpu_write(perf_in_guest, true);
 
 unlock:
@@ -5980,6 +5996,8 @@ void perf_guest_exit(void)
 	if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
 		goto unlock;
 
+	perf_switch_interrupt(false, 0);
+
 	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
 	ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
 	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
@@ -11842,7 +11860,21 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
 	if (!pmu->event_idx)
 		pmu->event_idx = perf_event_idx_default;
 
-	list_add_rcu(&pmu->entry, &pmus);
+	/*
+	 * Initialize passthru_pmu with the core pmu that has
+	 * PERF_PMU_CAP_PASSTHROUGH_VPMU capability.
+	 */
+	if (pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) {
+		if (!passthru_pmu)
+			passthru_pmu = pmu;
+
+		if (WARN_ONCE(passthru_pmu != pmu, "Only one passthrough PMU is supported\n")) {
+			ret = -EINVAL;
+			goto free_dev;
+		}
+	}
+
+	list_add_tail_rcu(&pmu->entry, &pmus);
 	atomic_set(&pmu->exclusive_cnt, 0);
 	ret = 0;
 unlock:
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 15/58] perf/x86: Support switch_interrupt interface
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (13 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 14/58] perf: Add switch_interrupt() interface Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-09-09 22:11   ` Colton Lewis
  2024-08-01  4:58 ` [RFC PATCH v3 16/58] perf/x86: Forbid PMI handler when guest own PMU Mingwei Zhang
                   ` (45 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Kan Liang <kan.liang@linux.intel.com>

Implement the switch_interrupt interface for the x86 PMU: switch the PMI to
the dedicated KVM_GUEST_PMI_VECTOR at perf guest enter, and switch it back
to NMI at perf guest exit.

Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/events/core.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 5bf78cd619bf..b17ef8b6c1a6 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2673,6 +2673,15 @@ static bool x86_pmu_filter(struct pmu *pmu, int cpu)
 	return ret;
 }
 
+static void x86_pmu_switch_interrupt(bool enter, u32 guest_lvtpc)
+{
+	if (enter)
+		apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_GUEST_PMI_VECTOR |
+			   (guest_lvtpc & APIC_LVT_MASKED));
+	else
+		apic_write(APIC_LVTPC, APIC_DM_NMI);
+}
+
 static struct pmu pmu = {
 	.pmu_enable		= x86_pmu_enable,
 	.pmu_disable		= x86_pmu_disable,
@@ -2702,6 +2711,8 @@ static struct pmu pmu = {
 	.aux_output_match	= x86_pmu_aux_output_match,
 
 	.filter			= x86_pmu_filter,
+
+	.switch_interrupt	= x86_pmu_switch_interrupt,
 };
 
 void arch_perf_update_userpage(struct perf_event *event,
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 16/58] perf/x86: Forbid PMI handler when guest own PMU
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (14 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 15/58] perf/x86: Support switch_interrupt interface Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-09-02  7:56   ` Mi, Dapeng
  2024-08-01  4:58 ` [RFC PATCH v3 17/58] perf: core/x86: Plumb passthrough PMU capability from x86_pmu to x86_pmu_cap Mingwei Zhang
                   ` (44 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

If a guest PMI is delivered after VM-exit, the KVM maskable interrupt will
be held pending until EFLAGS.IF is set. In the meantime, if the logical
processor receives an NMI for any reason at all, perf_event_nmi_handler()
will be invoked. If there is any active perf event anywhere on the system,
x86_pmu_handle_irq() will be invoked, and it will clear
IA32_PERF_GLOBAL_STATUS. By the time KVM's PMI handler is invoked, it will
be a mystery which counter(s) overflowed.

When the LVTPC is using the KVM PMI vector, the PMU is owned by the guest.
A host NMI lets x86_pmu_handle_irq() run, which restores the PMU vector to
NMI and clears IA32_PERF_GLOBAL_STATUS; this breaks the guest vPMU
passthrough environment.

So modify perf_event_nmi_handler() to check the perf_in_guest per-CPU
variable and, if it is set, simply return without calling
x86_pmu_handle_irq().

Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/events/core.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index b17ef8b6c1a6..cb5d8f5fd9ce 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -52,6 +52,8 @@ DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events) = {
 	.pmu = &pmu,
 };
 
+DEFINE_PER_CPU(bool, pmi_vector_is_nmi) = true;
+
 DEFINE_STATIC_KEY_FALSE(rdpmc_never_available_key);
 DEFINE_STATIC_KEY_FALSE(rdpmc_always_available_key);
 DEFINE_STATIC_KEY_FALSE(perf_is_hybrid);
@@ -1733,6 +1735,24 @@ perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
 	u64 finish_clock;
 	int ret;
 
+	/*
+	 * When guest pmu context is loaded this handler should be forbidden from
+	 * running, the reasons are:
+	 * 1. After perf_guest_enter() is called, and before cpu enter into
+	 *    non-root mode, NMI could happen, but x86_pmu_handle_irq() restore PMU
+	 *    to use NMI vector, which destroy KVM PMI vector setting.
+	 * 2. When VM is running, host NMI other than PMI causes VM exit, KVM will
+	 *    call host NMI handler (vmx_vcpu_enter_exit()) first before KVM save
+	 *    guest PMU context (kvm_pmu_save_pmu_context()), as x86_pmu_handle_irq()
+	 *    clear global_status MSR which has guest status now, then this destroy
+	 *    guest PMU status.
+	 * 3. After VM exit, but before KVM save guest PMU context, host NMI other
+	 *    than PMI could happen, x86_pmu_handle_irq() clear global_status MSR
+	 *    which has guest status now, then this destroy guest PMU status.
+	 */
+	if (!this_cpu_read(pmi_vector_is_nmi))
+		return 0;
+
 	/*
 	 * All PMUs/events that share this PMI handler should make sure to
 	 * increment active_events for their events.
@@ -2675,11 +2695,14 @@ static bool x86_pmu_filter(struct pmu *pmu, int cpu)
 
 static void x86_pmu_switch_interrupt(bool enter, u32 guest_lvtpc)
 {
-	if (enter)
+	if (enter) {
 		apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_GUEST_PMI_VECTOR |
 			   (guest_lvtpc & APIC_LVT_MASKED));
-	else
+		this_cpu_write(pmi_vector_is_nmi, false);
+	} else {
 		apic_write(APIC_LVTPC, APIC_DM_NMI);
+		this_cpu_write(pmi_vector_is_nmi, true);
+	}
 }
 
 static struct pmu pmu = {
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 17/58] perf: core/x86: Plumb passthrough PMU capability from x86_pmu to x86_pmu_cap
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (15 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 16/58] perf/x86: Forbid PMI handler when guest own PMU Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-08-01  4:58 ` [RFC PATCH v3 18/58] KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter Mingwei Zhang
                   ` (43 subsequent siblings)
  60 siblings, 0 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

Plumb the passthrough PMU capability into x86_pmu_cap in order to let any
kernel entity, such as KVM, know that the host PMU supports passthrough
mode and provides the implementation.
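
A consumer can then query the new bit via the existing helper; a minimal
sketch, with an illustrative function name:

  #include <asm/perf_event.h>

  static bool example_host_has_passthrough_pmu(void)
  {
          struct x86_pmu_capability cap;

          perf_get_x86_pmu_capability(&cap);
          return cap.passthrough;
  }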

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
---
 arch/x86/events/core.c            | 1 +
 arch/x86/include/asm/perf_event.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index cb5d8f5fd9ce..c16ceebf2d70 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -3029,6 +3029,7 @@ void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
 	cap->events_mask	= (unsigned int)x86_pmu.events_maskl;
 	cap->events_mask_len	= x86_pmu.events_mask_len;
 	cap->pebs_ept		= x86_pmu.pebs_ept;
+	cap->passthrough = !!(pmu.capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU);
 }
 EXPORT_SYMBOL_GPL(perf_get_x86_pmu_capability);
 
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 7f1e17250546..5cf37fe1f30a 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -258,6 +258,7 @@ struct x86_pmu_capability {
 	unsigned int	events_mask;
 	int		events_mask_len;
 	unsigned int	pebs_ept	:1;
+	unsigned int	passthrough	:1;
 };
 
 /*
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 18/58] KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (16 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 17/58] perf: core/x86: Plumb passthrough PMU capability from x86_pmu to x86_pmu_cap Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-11-19 14:30   ` Sean Christopherson
  2024-08-01  4:58 ` [RFC PATCH v3 19/58] KVM: x86/pmu: Plumb through pass-through PMU to vcpu for Intel CPUs Mingwei Zhang
                   ` (42 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

Introduce enable_passthrough_pmu as a read-only KVM kernel module parameter.
This variable is true only when all of the following conditions are
satisfied:
 - it is set to true when the module is loaded.
 - enable_pmu is true.
 - KVM is running on an Intel CPU.
 - the CPU supports PerfMon v4.
 - the host PMU supports passthrough mode.

The value is always read-only because the passthrough PMU currently does not
support features like LBR and PEBS, while the emulated PMU does. This would
otherwise end up with two different values for kvm_cap.supported_perf_cap,
which is initialized at module load time. Maintaining two different perf
capabilities would add complexity. Further, there is not enough motivation
to support running two types of PMU implementations at the same time,
although it is possible/feasible in reality.

Finally, always propagate enable_passthrough_pmu and perf_capabilities into
kvm->arch for each KVM instance.

Co-developed-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/pmu.h              | 14 ++++++++++++++
 arch/x86/kvm/vmx/vmx.c          |  7 +++++--
 arch/x86/kvm/x86.c              |  8 ++++++++
 arch/x86/kvm/x86.h              |  1 +
 5 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f8ca74e7678f..a15c783f20b9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1406,6 +1406,7 @@ struct kvm_arch {
 
 	bool bus_lock_detection_enabled;
 	bool enable_pmu;
+	bool enable_passthrough_pmu;
 
 	u32 notify_window;
 	u32 notify_vmexit_flags;
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 4d52b0b539ba..cf93be5e7359 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -208,6 +208,20 @@ static inline void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops)
 			enable_pmu = false;
 	}
 
+	/* Pass-through vPMU is only supported in Intel CPUs. */
+	if (!is_intel)
+		enable_passthrough_pmu = false;
+
+	/*
+	 * Pass-through vPMU requires at least PerfMon version 4 because the
+	 * implementation requires the usage of MSR_CORE_PERF_GLOBAL_STATUS_SET
+	 * for counter emulation as well as PMU context switch.  In addition, it
+	 * requires host PMU support on passthrough mode. Disable pass-through
+	 * vPMU if any condition fails.
+	 */
+	if (!enable_pmu || kvm_pmu_cap.version < 4 || !kvm_pmu_cap.passthrough)
+		enable_passthrough_pmu = false;
+
 	if (!enable_pmu) {
 		memset(&kvm_pmu_cap, 0, sizeof(kvm_pmu_cap));
 		return;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ad465881b043..2ad122995f11 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -146,6 +146,8 @@ module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
 extern bool __read_mostly allow_smaller_maxphyaddr;
 module_param(allow_smaller_maxphyaddr, bool, S_IRUGO);
 
+module_param(enable_passthrough_pmu, bool, 0444);
+
 #define KVM_VM_CR0_ALWAYS_OFF (X86_CR0_NW | X86_CR0_CD)
 #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST X86_CR0_NE
 #define KVM_VM_CR0_ALWAYS_ON				\
@@ -7924,7 +7926,8 @@ static __init u64 vmx_get_perf_capabilities(void)
 	if (boot_cpu_has(X86_FEATURE_PDCM))
 		rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_perf_cap);
 
-	if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR)) {
+	if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR) &&
+	    !enable_passthrough_pmu) {
 		x86_perf_get_lbr(&vmx_lbr_caps);
 
 		/*
@@ -7938,7 +7941,7 @@ static __init u64 vmx_get_perf_capabilities(void)
 			perf_cap |= host_perf_cap & PMU_CAP_LBR_FMT;
 	}
 
-	if (vmx_pebs_supported()) {
+	if (vmx_pebs_supported() && !enable_passthrough_pmu) {
 		perf_cap |= host_perf_cap & PERF_CAP_PEBS_MASK;
 
 		/*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f1d589c07068..0c40f551130e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -187,6 +187,10 @@ bool __read_mostly enable_pmu = true;
 EXPORT_SYMBOL_GPL(enable_pmu);
 module_param(enable_pmu, bool, 0444);
 
+/* Enable/disable mediated passthrough PMU virtualization */
+bool __read_mostly enable_passthrough_pmu;
+EXPORT_SYMBOL_GPL(enable_passthrough_pmu);
+
 bool __read_mostly eager_page_split = true;
 module_param(eager_page_split, bool, 0644);
 
@@ -6682,6 +6686,9 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		mutex_lock(&kvm->lock);
 		if (!kvm->created_vcpus) {
 			kvm->arch.enable_pmu = !(cap->args[0] & KVM_PMU_CAP_DISABLE);
+			/* Disable passthrough PMU if enable_pmu is false. */
+			if (!kvm->arch.enable_pmu)
+				kvm->arch.enable_passthrough_pmu = false;
 			r = 0;
 		}
 		mutex_unlock(&kvm->lock);
@@ -12623,6 +12630,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	kvm->arch.default_tsc_khz = max_tsc_khz ? : tsc_khz;
 	kvm->arch.guest_can_read_msr_platform_info = true;
 	kvm->arch.enable_pmu = enable_pmu;
+	kvm->arch.enable_passthrough_pmu = enable_passthrough_pmu;
 
 #if IS_ENABLED(CONFIG_HYPERV)
 	spin_lock_init(&kvm->arch.hv_root_tdp_lock);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index d80a4c6b5a38..dc45ba42bec2 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -332,6 +332,7 @@ extern u64 host_arch_capabilities;
 extern struct kvm_caps kvm_caps;
 
 extern bool enable_pmu;
+extern bool enable_passthrough_pmu;
 
 /*
  * Get a filtered version of KVM's supported XCR0 that strips out dynamic
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 19/58] KVM: x86/pmu: Plumb through pass-through PMU to vcpu for Intel CPUs
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (17 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 18/58] KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-11-19 14:54   ` Sean Christopherson
  2024-08-01  4:58 ` [RFC PATCH v3 20/58] KVM: x86/pmu: Always set global enable bits in passthrough mode Mingwei Zhang
                   ` (41 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

Plumb the pass-through PMU setting from kvm->arch into kvm_pmu for each
vCPU created. Note that enabling the PMU is decided by the VMM when it sets
the CPUID bits exposed to the guest VM, so plumb through the enabling for
each PMU in intel_pmu_refresh().

Co-developed-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/pmu.c              |  1 +
 arch/x86/kvm/vmx/pmu_intel.c    | 12 +++++++++---
 3 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a15c783f20b9..4b3ce6194bdb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -595,6 +595,8 @@ struct kvm_pmu {
 	 * redundant check before cleanup if guest don't use vPMU at all.
 	 */
 	u8 event_count;
+
+	bool passthrough;
 };
 
 struct kvm_pmu_ops;
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index a593b03c9aed..5768ea2935e9 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -797,6 +797,7 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu)
 
 	memset(pmu, 0, sizeof(*pmu));
 	static_call(kvm_x86_pmu_init)(vcpu);
+	pmu->passthrough = false;
 	kvm_pmu_refresh(vcpu);
 }
 
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index be40474de6e4..e417fd91e5fe 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -470,15 +470,21 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 		return;
 
 	entry = kvm_find_cpuid_entry(vcpu, 0xa);
-	if (!entry)
+	if (!entry || !vcpu->kvm->arch.enable_pmu) {
+		pmu->passthrough = false;
 		return;
-
+	}
 	eax.full = entry->eax;
 	edx.full = entry->edx;
 
 	pmu->version = eax.split.version_id;
-	if (!pmu->version)
+	if (!pmu->version) {
+		pmu->passthrough = false;
 		return;
+	}
+
+	pmu->passthrough = vcpu->kvm->arch.enable_passthrough_pmu &&
+			   lapic_in_kernel(vcpu);
 
 	pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters,
 					 kvm_pmu_cap.num_counters_gp);
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 20/58] KVM: x86/pmu: Always set global enable bits in passthrough mode
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (18 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 19/58] KVM: x86/pmu: Plumb through pass-through PMU to vcpu for Intel CPUs Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-11-19 15:37   ` Sean Christopherson
  2024-08-01  4:58 ` [RFC PATCH v3 21/58] KVM: x86/pmu: Add a helper to check if passthrough PMU is enabled Mingwei Zhang
                   ` (40 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Sandipan Das <sandipan.das@amd.com>

Currently, the global control bits for a vcpu are restored to the reset
state only if the guest PMU version is less than 2. This works for
emulated PMU as the MSRs are intercepted and backing events are created
for and managed by the host PMU [1].

If such a guest is run with the passthrough PMU, the counters no longer work
because the global enable bits are cleared. Hence, set the global enable
bits to their reset state if the passthrough PMU is used.

A passthrough-capable host may not necessarily support PMU version 2 and
it can choose to restore or save the global control state from struct
kvm_pmu in the PMU context save and restore helpers depending on the
availability of the global control register.

[1] 7b46b733bdb4 ("KVM: x86/pmu: Set enable bits for GP counters in PERF_GLOBAL_CTRL at "RESET"");

Reported-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
[removed the fixes tag]
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/pmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 5768ea2935e9..e656f72fdace 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -787,7 +787,7 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
 	 * in the global controls).  Emulate that behavior when refreshing the
 	 * PMU so that userspace doesn't need to manually set PERF_GLOBAL_CTRL.
 	 */
-	if (kvm_pmu_has_perf_global_ctrl(pmu) && pmu->nr_arch_gp_counters)
+	if ((pmu->passthrough || kvm_pmu_has_perf_global_ctrl(pmu)) && pmu->nr_arch_gp_counters)
 		pmu->global_ctrl = GENMASK_ULL(pmu->nr_arch_gp_counters - 1, 0);
 }
 
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 21/58] KVM: x86/pmu: Add a helper to check if passthrough PMU is enabled
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (19 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 20/58] KVM: x86/pmu: Always set global enable bits in passthrough mode Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-08-01  4:58 ` [RFC PATCH v3 22/58] KVM: x86/pmu: Add host_perf_cap and initialize it in kvm_x86_vendor_init() Mingwei Zhang
                   ` (39 subsequent siblings)
  60 siblings, 0 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

Add a helper to check whether the passthrough PMU is enabled, for
convenience, as the check is vendor neutral.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
---
 arch/x86/kvm/pmu.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index cf93be5e7359..56ba0772568c 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -48,6 +48,11 @@ struct kvm_pmu_ops {
 
 void kvm_pmu_ops_update(const struct kvm_pmu_ops *pmu_ops);
 
+static inline bool is_passthrough_pmu_enabled(struct kvm_vcpu *vcpu)
+{
+	return vcpu_to_pmu(vcpu)->passthrough;
+}
+
 static inline bool kvm_pmu_has_perf_global_ctrl(struct kvm_pmu *pmu)
 {
 	/*
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 22/58] KVM: x86/pmu: Add host_perf_cap and initialize it in kvm_x86_vendor_init()
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (20 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 21/58] KVM: x86/pmu: Add a helper to check if passthrough PMU is enabled Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-11-19 15:43   ` Sean Christopherson
  2024-08-01  4:58 ` [RFC PATCH v3 23/58] KVM: x86/pmu: Allow RDPMC pass through when all counters exposed to guest Mingwei Zhang
                   ` (38 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

Initialize host_perf_cap early in kvm_x86_vendor_init(). This helps KVM
recognize the host HW PMU capabilities when configuring its guest VMs. This
awareness directly decides the feasibility of passing through RDPMC and
indirectly affects the performance of the PMU context switch. Having the
host PMU feature set cached in host_perf_cap saves an rdmsrl() of the
IA32_PERF_CAPABILITIES MSR on each PMU context switch.

In addition, opportunistically remove the host_perf_cap initialization in
vmx_get_perf_capabilities() so that the value does not depend on the module
parameter "enable_pmu".

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
---
 arch/x86/kvm/pmu.h     | 1 +
 arch/x86/kvm/vmx/vmx.c | 4 ----
 arch/x86/kvm/x86.c     | 6 ++++++
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 56ba0772568c..e041c8a23e2f 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -295,4 +295,5 @@ bool is_vmware_backdoor_pmc(u32 pmc_idx);
 
 extern struct kvm_pmu_ops intel_pmu_ops;
 extern struct kvm_pmu_ops amd_pmu_ops;
+extern u64 __read_mostly host_perf_cap;
 #endif /* __KVM_X86_PMU_H */
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 2ad122995f11..4d60a8cf2dd1 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7918,14 +7918,10 @@ void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 static __init u64 vmx_get_perf_capabilities(void)
 {
 	u64 perf_cap = PMU_CAP_FW_WRITES;
-	u64 host_perf_cap = 0;
 
 	if (!enable_pmu)
 		return 0;
 
-	if (boot_cpu_has(X86_FEATURE_PDCM))
-		rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_perf_cap);
-
 	if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR) &&
 	    !enable_passthrough_pmu) {
 		x86_perf_get_lbr(&vmx_lbr_caps);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0c40f551130e..6db4dc496d2b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -239,6 +239,9 @@ EXPORT_SYMBOL_GPL(host_xss);
 u64 __read_mostly host_arch_capabilities;
 EXPORT_SYMBOL_GPL(host_arch_capabilities);
 
+u64 __read_mostly host_perf_cap;
+EXPORT_SYMBOL_GPL(host_perf_cap);
+
 const struct _kvm_stats_desc kvm_vm_stats_desc[] = {
 	KVM_GENERIC_VM_STATS(),
 	STATS_DESC_COUNTER(VM, mmu_shadow_zapped),
@@ -9793,6 +9796,9 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
 	if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
 		rdmsrl(MSR_IA32_ARCH_CAPABILITIES, host_arch_capabilities);
 
+	if (boot_cpu_has(X86_FEATURE_PDCM))
+		rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_perf_cap);
+
 	r = ops->hardware_setup();
 	if (r != 0)
 		goto out_mmu_exit;
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 23/58] KVM: x86/pmu: Allow RDPMC pass through when all counters exposed to guest
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (21 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 22/58] KVM: x86/pmu: Add host_perf_cap and initialize it in kvm_x86_vendor_init() Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-11-19 16:32   ` Sean Christopherson
  2024-08-01  4:58 ` [RFC PATCH v3 24/58] KVM: x86/pmu: Introduce macro PMU_CAP_PERF_METRICS Mingwei Zhang
                   ` (37 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

Clear RDPMC_EXITING in the VMCS when all host counters are exposed to the
guest VM. This improves performance for the passthrough PMU. However, when
the guest does not get all counters, keep intercepting RDPMC to prevent
access to unexposed counters. Make the decision in
vmx_vcpu_after_set_cpuid(), once the guest enables the PMU and the
passthrough PMU is enabled.
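
To make the "full ownership" condition concrete, below is a minimal,
self-contained userspace sketch of the check described above (struct
pmu_caps, the helper name, and the example numbers are illustrative
placeholders, not KVM types); e.g., 48-bit counters correspond to a
counter bitmask of (1ULL << 48) - 1.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical capability descriptor, for illustration only. */
struct pmu_caps {
	unsigned num_gp, num_fixed;
	unsigned bit_width_gp, bit_width_fixed;
};

/*
 * RDPMC may only be passed through when the guest owns every host
 * counter at full width; otherwise RDPMC must stay intercepted so the
 * guest cannot touch unexposed counters.
 */
static bool rdpmc_can_passthrough(const struct pmu_caps *host,
				  const struct pmu_caps *guest)
{
	return guest->num_gp == host->num_gp &&
	       guest->num_fixed == host->num_fixed &&
	       guest->bit_width_gp == host->bit_width_gp &&
	       guest->bit_width_fixed == host->bit_width_fixed;
}

int main(void)
{
	struct pmu_caps host  = { 8, 4, 48, 48 };
	struct pmu_caps guest = { 8, 4, 48, 48 };

	printf("RDPMC passthrough: %s\n",
	       rdpmc_can_passthrough(&host, &guest) ? "yes" : "no");
	return 0;
}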

Co-developed-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
---
 arch/x86/kvm/pmu.c     | 16 ++++++++++++++++
 arch/x86/kvm/pmu.h     |  1 +
 arch/x86/kvm/vmx/vmx.c |  5 +++++
 3 files changed, 22 insertions(+)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index e656f72fdace..19104e16a986 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -96,6 +96,22 @@ void kvm_pmu_ops_update(const struct kvm_pmu_ops *pmu_ops)
 #undef __KVM_X86_PMU_OP
 }
 
+bool kvm_pmu_check_rdpmc_passthrough(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+	if (is_passthrough_pmu_enabled(vcpu) &&
+	    !enable_vmware_backdoor &&
+	    pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp &&
+	    pmu->nr_arch_fixed_counters == kvm_pmu_cap.num_counters_fixed &&
+	    pmu->counter_bitmask[KVM_PMC_GP] == (((u64)1 << kvm_pmu_cap.bit_width_gp) - 1) &&
+	    pmu->counter_bitmask[KVM_PMC_FIXED] == (((u64)1 << kvm_pmu_cap.bit_width_fixed)  - 1))
+		return true;
+
+	return false;
+}
+EXPORT_SYMBOL_GPL(kvm_pmu_check_rdpmc_passthrough);
+
 static inline void __kvm_perf_overflow(struct kvm_pmc *pmc, bool in_pmi)
 {
 	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index e041c8a23e2f..91941a0f6e47 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -290,6 +290,7 @@ void kvm_pmu_cleanup(struct kvm_vcpu *vcpu);
 void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
 int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
 void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel);
+bool kvm_pmu_check_rdpmc_passthrough(struct kvm_vcpu *vcpu);
 
 bool is_vmware_backdoor_pmc(u32 pmc_idx);
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 4d60a8cf2dd1..339742350b7a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7911,6 +7911,11 @@ void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 		vmx->msr_ia32_feature_control_valid_bits &=
 			~FEAT_CTL_SGX_LC_ENABLED;
 
+	if (kvm_pmu_check_rdpmc_passthrough(&vmx->vcpu))
+		exec_controls_clearbit(vmx, CPU_BASED_RDPMC_EXITING);
+	else
+		exec_controls_setbit(vmx, CPU_BASED_RDPMC_EXITING);
+
 	/* Refresh #PF interception to account for MAXPHYADDR changes. */
 	vmx_update_exception_bitmap(vcpu);
 }
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 24/58] KVM: x86/pmu: Introduce macro PMU_CAP_PERF_METRICS
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (22 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 23/58] KVM: x86/pmu: Allow RDPMC pass through when all counters exposed to guest Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-11-19 17:03   ` Sean Christopherson
  2024-08-01  4:58 ` [RFC PATCH v3 25/58] KVM: x86/pmu: Introduce PMU operator to check if rdpmc passthrough allowed Mingwei Zhang
                   ` (36 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Dapeng Mi <dapeng1.mi@linux.intel.com>

Define macro PMU_CAP_PERF_METRICS to represent bit[15] of
MSR_IA32_PERF_CAPABILITIES MSR. This bit is used to represent whether
perf metrics feature is enabled.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/vmx/capabilities.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 41a4533f9989..d8317552b634 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -22,6 +22,7 @@ extern int __read_mostly pt_mode;
 #define PT_MODE_HOST_GUEST	1
 
 #define PMU_CAP_FW_WRITES	(1ULL << 13)
+#define PMU_CAP_PERF_METRICS	BIT_ULL(15)
 #define PMU_CAP_LBR_FMT		0x3f
 
 struct nested_vmx_msrs {
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 25/58] KVM: x86/pmu: Introduce PMU operator to check if rdpmc passthrough allowed
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (23 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 24/58] KVM: x86/pmu: Introduce macro PMU_CAP_PERF_METRICS Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-11-19 17:32   ` Sean Christopherson
  2024-08-01  4:58 ` [RFC PATCH v3 26/58] KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL Mingwei Zhang
                   ` (35 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

Introduce a vendor-specific API to check whether RDPMC passthrough is
allowed. RDPMC passthrough requires that the guest VM have full ownership
of all counters, i.e., the general purpose counters, the fixed counters,
and some vendor-specific MSRs such as PERF_METRICS. Since the PERF_METRICS
MSR is Intel specific, put the check into vendor-specific code.
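
As a rough illustration of why PERF_METRICS matters for this decision, the
standalone sketch below decodes the RDPMC ECX operand into the type/index
fields mentioned in the comment added by this patch; it is a userspace toy,
not KVM code.

#include <stdint.h>
#include <stdio.h>

/*
 * RDPMC ECX layout: ECX[31:16] selects the counter type and ECX[15:0]
 * the index within that type.  Type 0x2000 addresses PERF_METRICS, so
 * if RDPMC is not intercepted a guest could read PERF_METRICS data even
 * when the MSR itself is never exposed.
 */
#define RDPMC_TYPE(ecx)		((uint32_t)(ecx) >> 16)
#define RDPMC_INDEX(ecx)	((uint32_t)(ecx) & 0xffff)

int main(void)
{
	uint32_t ecx = 0x2000u << 16;	/* PERF_METRICS, index 0 */

	printf("type=0x%x index=%u\n", RDPMC_TYPE(ecx), RDPMC_INDEX(ecx));
	return 0;
}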

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
---
 arch/x86/include/asm/kvm-x86-pmu-ops.h |  1 +
 arch/x86/kvm/pmu.c                     |  1 +
 arch/x86/kvm/pmu.h                     |  1 +
 arch/x86/kvm/svm/pmu.c                 |  6 ++++++
 arch/x86/kvm/vmx/pmu_intel.c           | 16 ++++++++++++++++
 5 files changed, 25 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h
index f852b13aeefe..fd986d5146e4 100644
--- a/arch/x86/include/asm/kvm-x86-pmu-ops.h
+++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h
@@ -20,6 +20,7 @@ KVM_X86_PMU_OP(get_msr)
 KVM_X86_PMU_OP(set_msr)
 KVM_X86_PMU_OP(refresh)
 KVM_X86_PMU_OP(init)
+KVM_X86_PMU_OP(is_rdpmc_passthru_allowed)
 KVM_X86_PMU_OP_OPTIONAL(reset)
 KVM_X86_PMU_OP_OPTIONAL(deliver_pmi)
 KVM_X86_PMU_OP_OPTIONAL(cleanup)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 19104e16a986..3afefe4cf6e2 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -102,6 +102,7 @@ bool kvm_pmu_check_rdpmc_passthrough(struct kvm_vcpu *vcpu)
 
 	if (is_passthrough_pmu_enabled(vcpu) &&
 	    !enable_vmware_backdoor &&
+	    static_call(kvm_x86_pmu_is_rdpmc_passthru_allowed)(vcpu) &&
 	    pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp &&
 	    pmu->nr_arch_fixed_counters == kvm_pmu_cap.num_counters_fixed &&
 	    pmu->counter_bitmask[KVM_PMC_GP] == (((u64)1 << kvm_pmu_cap.bit_width_gp) - 1) &&
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 91941a0f6e47..e1af6d07b191 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -40,6 +40,7 @@ struct kvm_pmu_ops {
 	void (*reset)(struct kvm_vcpu *vcpu);
 	void (*deliver_pmi)(struct kvm_vcpu *vcpu);
 	void (*cleanup)(struct kvm_vcpu *vcpu);
+	bool (*is_rdpmc_passthru_allowed)(struct kvm_vcpu *vcpu);
 
 	const u64 EVENTSEL_EVENT;
 	const int MAX_NR_GP_COUNTERS;
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index dfcc38bd97d3..6b471b1ec9b8 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -228,6 +228,11 @@ static void amd_pmu_init(struct kvm_vcpu *vcpu)
 	}
 }
 
+static bool amd_is_rdpmc_passthru_allowed(struct kvm_vcpu *vcpu)
+{
+	return true;
+}
+
 struct kvm_pmu_ops amd_pmu_ops __initdata = {
 	.rdpmc_ecx_to_pmc = amd_rdpmc_ecx_to_pmc,
 	.msr_idx_to_pmc = amd_msr_idx_to_pmc,
@@ -237,6 +242,7 @@ struct kvm_pmu_ops amd_pmu_ops __initdata = {
 	.set_msr = amd_pmu_set_msr,
 	.refresh = amd_pmu_refresh,
 	.init = amd_pmu_init,
+	.is_rdpmc_passthru_allowed = amd_is_rdpmc_passthru_allowed,
 	.EVENTSEL_EVENT = AMD64_EVENTSEL_EVENT,
 	.MAX_NR_GP_COUNTERS = KVM_AMD_PMC_MAX_GENERIC,
 	.MIN_NR_GP_COUNTERS = AMD64_NUM_COUNTERS,
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index e417fd91e5fe..02c9019c6f85 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -725,6 +725,21 @@ void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu)
 	}
 }
 
+static bool intel_is_rdpmc_passthru_allowed(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * Per Intel SDM vol. 2 for RDPMC, MSR_PERF_METRICS is accessible with
+	 * type 0x2000 in ECX[31:16], while the index value in ECX[15:0] is
+	 * implementation specific. Therefore, if the host has this MSR, but
+	 * does not expose it to the guest, RDPMC has to be intercepted.
+	 */
+	if ((host_perf_cap & PMU_CAP_PERF_METRICS) &&
+	    !(vcpu_get_perf_capabilities(vcpu) & PMU_CAP_PERF_METRICS))
+		return false;
+
+	return true;
+}
+
 struct kvm_pmu_ops intel_pmu_ops __initdata = {
 	.rdpmc_ecx_to_pmc = intel_rdpmc_ecx_to_pmc,
 	.msr_idx_to_pmc = intel_msr_idx_to_pmc,
@@ -736,6 +751,7 @@ struct kvm_pmu_ops intel_pmu_ops __initdata = {
 	.reset = intel_pmu_reset,
 	.deliver_pmi = intel_pmu_deliver_pmi,
 	.cleanup = intel_pmu_cleanup,
+	.is_rdpmc_passthru_allowed = intel_is_rdpmc_passthru_allowed,
 	.EVENTSEL_EVENT = ARCH_PERFMON_EVENTSEL_EVENT,
 	.MAX_NR_GP_COUNTERS = KVM_INTEL_PMC_MAX_GENERIC,
 	.MIN_NR_GP_COUNTERS = 1,
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 26/58] KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (24 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 25/58] KVM: x86/pmu: Introduce PMU operator to check if rdpmc passthrough allowed Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-08-06  7:04   ` Mi, Dapeng
                     ` (2 more replies)
  2024-08-01  4:58 ` [RFC PATCH v3 27/58] KVM: x86/pmu: Create a function prototype to disable MSR interception Mingwei Zhang
                   ` (34 subsequent siblings)
  60 siblings, 3 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Xiong Zhang <xiong.y.zhang@linux.intel.com>

In PMU passthrough mode, there are three requirements for managing
IA32_PERF_GLOBAL_CTRL:
 - the guest IA32_PERF_GLOBAL_CTRL MSR must be saved at VM exit.
 - the IA32_PERF_GLOBAL_CTRL MSR must be cleared at VM exit so that no
   guest counter keeps running while KVM is in its run loop.
 - the guest IA32_PERF_GLOBAL_CTRL MSR must be restored at VM entry.

Introduce vmx_set_perf_global_ctrl() to automatically switch
IA32_PERF_GLOBAL_CTRL and invoke it after the VMM finishes setting up the
CPUID bits.
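
For readers unfamiliar with the VMCS MSR load/store areas this function
manipulates, here is a simplified userspace model of the find-or-add slot
logic (the struct names and the eight-slot limit are assumptions for
illustration, not the real VMX layout):

#include <stdint.h>
#include <stdio.h>

/* Simplified model of a VMCS MSR load/store area: {index, value} slots
 * plus a count that would also be mirrored into the VMCS. */
struct msr_slot { uint32_t index; uint64_t value; };
struct msr_area { struct msr_slot val[8]; unsigned nr; };

static int find_slot(const struct msr_area *m, uint32_t msr)
{
	for (unsigned i = 0; i < m->nr; i++)
		if (m->val[i].index == msr)
			return (int)i;
	return -1;
}

/* Find-or-add, mirroring what the patch does when the dedicated
 * VM-entry/VM-exit PERF_GLOBAL_CTRL controls are not available. */
static void add_autoload(struct msr_area *m, uint32_t msr, uint64_t value)
{
	int i = find_slot(m, msr);

	if (i < 0)
		i = (int)m->nr++;	/* would also rewrite the VMCS count */
	m->val[i].index = msr;
	m->val[i].value = value;
}

int main(void)
{
	struct msr_area guest_autoload = { .nr = 0 };

	/* 0x38f is MSR_CORE_PERF_GLOBAL_CTRL. */
	add_autoload(&guest_autoload, 0x38f, 0);
	add_autoload(&guest_autoload, 0x38f, 0);	/* idempotent */
	printf("slots in use: %u\n", guest_autoload.nr);
	return 0;
}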

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/include/asm/vmx.h |   1 +
 arch/x86/kvm/vmx/vmx.c     | 117 +++++++++++++++++++++++++++++++------
 arch/x86/kvm/vmx/vmx.h     |   3 +-
 3 files changed, 103 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index d77a31039f24..5ed89a099533 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -106,6 +106,7 @@
 #define VM_EXIT_CLEAR_BNDCFGS                   0x00800000
 #define VM_EXIT_PT_CONCEAL_PIP			0x01000000
 #define VM_EXIT_CLEAR_IA32_RTIT_CTL		0x02000000
+#define VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL      0x40000000
 
 #define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR	0x00036dff
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 339742350b7a..34a420fa98c5 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4394,6 +4394,97 @@ static u32 vmx_pin_based_exec_ctrl(struct vcpu_vmx *vmx)
 	return pin_based_exec_ctrl;
 }
 
+static void vmx_set_perf_global_ctrl(struct vcpu_vmx *vmx)
+{
+	u32 vmentry_ctrl = vm_entry_controls_get(vmx);
+	u32 vmexit_ctrl = vm_exit_controls_get(vmx);
+	struct vmx_msrs *m;
+	int i;
+
+	if (cpu_has_perf_global_ctrl_bug() ||
+	    !is_passthrough_pmu_enabled(&vmx->vcpu)) {
+		vmentry_ctrl &= ~VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
+		vmexit_ctrl &= ~VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
+		vmexit_ctrl &= ~VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL;
+	}
+
+	if (is_passthrough_pmu_enabled(&vmx->vcpu)) {
+		/*
+		 * Setup auto restore guest PERF_GLOBAL_CTRL MSR at vm entry.
+		 */
+		if (vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) {
+			vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, 0);
+		} else {
+			m = &vmx->msr_autoload.guest;
+			i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
+			if (i < 0) {
+				i = m->nr++;
+				vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
+			}
+			m->val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
+			m->val[i].value = 0;
+		}
+		/*
+		 * Setup auto clear host PERF_GLOBAL_CTRL msr at vm exit.
+		 */
+		if (vmexit_ctrl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL) {
+			vmcs_write64(HOST_IA32_PERF_GLOBAL_CTRL, 0);
+		} else {
+			m = &vmx->msr_autoload.host;
+			i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
+			if (i < 0) {
+				i = m->nr++;
+				vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->nr);
+			}
+			m->val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
+			m->val[i].value = 0;
+		}
+		/*
+		 * Setup auto save guest PERF_GLOBAL_CTRL msr at vm exit
+		 */
+		if (!(vmexit_ctrl & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL)) {
+			m = &vmx->msr_autostore.guest;
+			i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
+			if (i < 0) {
+				i = m->nr++;
+				vmcs_write32(VM_EXIT_MSR_STORE_COUNT, m->nr);
+			}
+			m->val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
+		}
+	} else {
+		if (!(vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL)) {
+			m = &vmx->msr_autoload.guest;
+			i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
+			if (i >= 0) {
+				m->nr--;
+				m->val[i] = m->val[m->nr];
+				vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
+			}
+		}
+		if (!(vmexit_ctrl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL)) {
+			m = &vmx->msr_autoload.host;
+			i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
+			if (i >= 0) {
+				m->nr--;
+				m->val[i] = m->val[m->nr];
+				vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->nr);
+			}
+		}
+		if (!(vmexit_ctrl & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL)) {
+			m = &vmx->msr_autostore.guest;
+			i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
+			if (i >= 0) {
+				m->nr--;
+				m->val[i] = m->val[m->nr];
+				vmcs_write32(VM_EXIT_MSR_STORE_COUNT, m->nr);
+			}
+		}
+	}
+
+	vm_entry_controls_set(vmx, vmentry_ctrl);
+	vm_exit_controls_set(vmx, vmexit_ctrl);
+}
+
 static u32 vmx_vmentry_ctrl(void)
 {
 	u32 vmentry_ctrl = vmcs_config.vmentry_ctrl;
@@ -4401,17 +4492,10 @@ static u32 vmx_vmentry_ctrl(void)
 	if (vmx_pt_mode_is_system())
 		vmentry_ctrl &= ~(VM_ENTRY_PT_CONCEAL_PIP |
 				  VM_ENTRY_LOAD_IA32_RTIT_CTL);
-	/*
-	 * IA32e mode, and loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically.
-	 */
-	vmentry_ctrl &= ~(VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL |
-			  VM_ENTRY_LOAD_IA32_EFER |
-			  VM_ENTRY_IA32E_MODE);
-
-	if (cpu_has_perf_global_ctrl_bug())
-		vmentry_ctrl &= ~VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
-
-	return vmentry_ctrl;
+	 /*
+	  * IA32e mode, and loading of EFER is toggled dynamically.
+	  */
+	return vmentry_ctrl &= ~(VM_ENTRY_LOAD_IA32_EFER | VM_ENTRY_IA32E_MODE);
 }
 
 static u32 vmx_vmexit_ctrl(void)
@@ -4429,12 +4513,8 @@ static u32 vmx_vmexit_ctrl(void)
 		vmexit_ctrl &= ~(VM_EXIT_PT_CONCEAL_PIP |
 				 VM_EXIT_CLEAR_IA32_RTIT_CTL);
 
-	if (cpu_has_perf_global_ctrl_bug())
-		vmexit_ctrl &= ~VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
-
-	/* Loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically */
-	return vmexit_ctrl &
-		~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER);
+	/* Loading of EFER is toggled dynamically */
+	return vmexit_ctrl & ~VM_EXIT_LOAD_IA32_EFER;
 }
 
 void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
@@ -4777,6 +4857,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
 		vmcs_write64(VM_FUNCTION_CONTROL, 0);
 
 	vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
+	vmcs_write64(VM_EXIT_MSR_STORE_ADDR, __pa(vmx->msr_autostore.guest.val));
 	vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);
 	vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host.val));
 	vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 0);
@@ -7916,6 +7997,8 @@ void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 	else
 		exec_controls_setbit(vmx, CPU_BASED_RDPMC_EXITING);
 
+	vmx_set_perf_global_ctrl(vmx);
+
 	/* Refresh #PF interception to account for MAXPHYADDR changes. */
 	vmx_update_exception_bitmap(vcpu);
 }
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 7b64e271a931..32e3974c1a2c 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -510,7 +510,8 @@ static inline u8 vmx_get_rvi(void)
 	       VM_EXIT_LOAD_IA32_EFER |					\
 	       VM_EXIT_CLEAR_BNDCFGS |					\
 	       VM_EXIT_PT_CONCEAL_PIP |					\
-	       VM_EXIT_CLEAR_IA32_RTIT_CTL)
+	       VM_EXIT_CLEAR_IA32_RTIT_CTL |                            \
+	       VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL)
 
 #define KVM_REQUIRED_VMX_PIN_BASED_VM_EXEC_CONTROL			\
 	(PIN_BASED_EXT_INTR_MASK |					\
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 27/58] KVM: x86/pmu: Create a function prototype to disable MSR interception
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (25 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 26/58] KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-10-24 19:58   ` Chen, Zide
  2024-11-19 18:17   ` Sean Christopherson
  2024-08-01  4:58 ` [RFC PATCH v3 28/58] KVM: x86/pmu: Add intel_passthrough_pmu_msrs() to pass-through PMU MSRs Mingwei Zhang
                   ` (33 subsequent siblings)
  60 siblings, 2 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

Add an extra PMU function prototype to kvm_pmu_ops for disabling PMU MSR
interception.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
---
 arch/x86/include/asm/kvm-x86-pmu-ops.h | 1 +
 arch/x86/kvm/cpuid.c                   | 4 ++++
 arch/x86/kvm/pmu.c                     | 5 +++++
 arch/x86/kvm/pmu.h                     | 2 ++
 4 files changed, 12 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h
index fd986d5146e4..1b7876dcb3c3 100644
--- a/arch/x86/include/asm/kvm-x86-pmu-ops.h
+++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h
@@ -24,6 +24,7 @@ KVM_X86_PMU_OP(is_rdpmc_passthru_allowed)
 KVM_X86_PMU_OP_OPTIONAL(reset)
 KVM_X86_PMU_OP_OPTIONAL(deliver_pmi)
 KVM_X86_PMU_OP_OPTIONAL(cleanup)
+KVM_X86_PMU_OP_OPTIONAL(passthrough_pmu_msrs)
 
 #undef KVM_X86_PMU_OP
 #undef KVM_X86_PMU_OP_OPTIONAL
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index f2f2be5d1141..3deb79b39847 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -381,6 +381,10 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 	vcpu->arch.reserved_gpa_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu);
 
 	kvm_pmu_refresh(vcpu);
+
+	if (is_passthrough_pmu_enabled(vcpu))
+		kvm_pmu_passthrough_pmu_msrs(vcpu);
+
 	vcpu->arch.cr4_guest_rsvd_bits =
 	    __cr4_reserved_bits(guest_cpuid_has, vcpu);
 
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 3afefe4cf6e2..bd94f2d67f5c 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -1059,3 +1059,8 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
 	kfree(filter);
 	return r;
 }
+
+void kvm_pmu_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
+{
+	static_call_cond(kvm_x86_pmu_passthrough_pmu_msrs)(vcpu);
+}
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index e1af6d07b191..63f876557716 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -41,6 +41,7 @@ struct kvm_pmu_ops {
 	void (*deliver_pmi)(struct kvm_vcpu *vcpu);
 	void (*cleanup)(struct kvm_vcpu *vcpu);
 	bool (*is_rdpmc_passthru_allowed)(struct kvm_vcpu *vcpu);
+	void (*passthrough_pmu_msrs)(struct kvm_vcpu *vcpu);
 
 	const u64 EVENTSEL_EVENT;
 	const int MAX_NR_GP_COUNTERS;
@@ -292,6 +293,7 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
 int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
 void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel);
 bool kvm_pmu_check_rdpmc_passthrough(struct kvm_vcpu *vcpu);
+void kvm_pmu_passthrough_pmu_msrs(struct kvm_vcpu *vcpu);
 
 bool is_vmware_backdoor_pmc(u32 pmc_idx);
 
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 28/58] KVM: x86/pmu: Add intel_passthrough_pmu_msrs() to pass-through PMU MSRs
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (26 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 27/58] KVM: x86/pmu: Create a function prototype to disable MSR interception Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-11-19 18:24   ` Sean Christopherson
  2024-08-01  4:58 ` [RFC PATCH v3 29/58] KVM: x86/pmu: Avoid legacy vPMU code when accessing global_ctrl in passthrough vPMU Mingwei Zhang
                   ` (32 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Dapeng Mi <dapeng1.mi@linux.intel.com>

Event selectors for the GP counters and the fixed counter control MSR are
intercepted for security, i.e., to prevent the guest from using disallowed
events to steal information or to exploit any CPU errata.

Other than the event selectors, disable interception of the counter MSRs
advertised in the guest CPUID; counter MSRs outside of the exposed range
are still intercepted.

Global registers such as global_ctrl are passed through only if the PMU
version is greater than 1.
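
A tiny standalone sketch of the per-counter interception policy described
above; the helper and the counter counts are made up for illustration and
omit the FW-writes and global-control details handled by the real code:

#include <stdbool.h>
#include <stdio.h>

/* Counter MSRs within the guest-visible range are passed through in
 * passthrough mode; everything beyond the guest range stays intercepted. */
static bool intercept_counter_msr(bool passthrough_pmu,
				  unsigned idx, unsigned nr_guest_counters)
{
	if (!passthrough_pmu)
		return true;
	return idx >= nr_guest_counters;
}

int main(void)
{
	unsigned host_counters = 8, guest_counters = 4;

	for (unsigned i = 0; i < host_counters; i++)
		printf("PERFCTR%u: %s\n", i,
		       intercept_counter_msr(true, i, guest_counters) ?
		       "intercept" : "pass through");
	return 0;
}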

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/cpuid.c         |  3 +--
 arch/x86/kvm/vmx/pmu_intel.c | 47 ++++++++++++++++++++++++++++++++++++
 2 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 3deb79b39847..f01e2f1ccce1 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -382,8 +382,7 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 
 	kvm_pmu_refresh(vcpu);
 
-	if (is_passthrough_pmu_enabled(vcpu))
-		kvm_pmu_passthrough_pmu_msrs(vcpu);
+	kvm_pmu_passthrough_pmu_msrs(vcpu);
 
 	vcpu->arch.cr4_guest_rsvd_bits =
 	    __cr4_reserved_bits(guest_cpuid_has, vcpu);
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 02c9019c6f85..737de5bf1eee 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -740,6 +740,52 @@ static bool intel_is_rdpmc_passthru_allowed(struct kvm_vcpu *vcpu)
 	return true;
 }
 
+/*
+ * Setup PMU MSR interception for both mediated passthrough vPMU and legacy
+ * emulated vPMU. Note that this function is called after each time userspace
+ * set CPUID.
+ */
+static void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
+{
+	bool msr_intercept = !is_passthrough_pmu_enabled(vcpu);
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	int i;
+
+	/*
+	 * Unexposed PMU MSRs are intercepted by default. However,
+	 * KVM_SET_CPUID{,2} may be invoked multiple times. To ensure MSR
+	 * interception is correct after each call of setting CPUID, explicitly
+	 * touch msr bitmap for each PMU MSR.
+	 */
+	for (i = 0; i < kvm_pmu_cap.num_counters_gp; i++) {
+		if (i >= pmu->nr_arch_gp_counters)
+			msr_intercept = true;
+		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PERFCTR0 + i, MSR_TYPE_RW, msr_intercept);
+		if (fw_writes_is_enabled(vcpu))
+			vmx_set_intercept_for_msr(vcpu, MSR_IA32_PMC0 + i, MSR_TYPE_RW, msr_intercept);
+		else
+			vmx_set_intercept_for_msr(vcpu, MSR_IA32_PMC0 + i, MSR_TYPE_RW, true);
+	}
+
+	msr_intercept = !is_passthrough_pmu_enabled(vcpu);
+	for (i = 0; i < kvm_pmu_cap.num_counters_fixed; i++) {
+		if (i >= pmu->nr_arch_fixed_counters)
+			msr_intercept = true;
+		vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_FIXED_CTR0 + i, MSR_TYPE_RW, msr_intercept);
+	}
+
+	if (pmu->version > 1 && is_passthrough_pmu_enabled(vcpu) &&
+	    pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp &&
+	    pmu->nr_arch_fixed_counters == kvm_pmu_cap.num_counters_fixed)
+		msr_intercept = false;
+	else
+		msr_intercept = true;
+
+	vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_GLOBAL_STATUS, MSR_TYPE_RW, msr_intercept);
+	vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL, MSR_TYPE_RW, msr_intercept);
+	vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_GLOBAL_OVF_CTRL, MSR_TYPE_RW, msr_intercept);
+}
+
 struct kvm_pmu_ops intel_pmu_ops __initdata = {
 	.rdpmc_ecx_to_pmc = intel_rdpmc_ecx_to_pmc,
 	.msr_idx_to_pmc = intel_msr_idx_to_pmc,
@@ -752,6 +798,7 @@ struct kvm_pmu_ops intel_pmu_ops __initdata = {
 	.deliver_pmi = intel_pmu_deliver_pmi,
 	.cleanup = intel_pmu_cleanup,
 	.is_rdpmc_passthru_allowed = intel_is_rdpmc_passthru_allowed,
+	.passthrough_pmu_msrs = intel_passthrough_pmu_msrs,
 	.EVENTSEL_EVENT = ARCH_PERFMON_EVENTSEL_EVENT,
 	.MAX_NR_GP_COUNTERS = KVM_INTEL_PMC_MAX_GENERIC,
 	.MIN_NR_GP_COUNTERS = 1,
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 29/58] KVM: x86/pmu: Avoid legacy vPMU code when accessing global_ctrl in passthrough vPMU
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (27 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 28/58] KVM: x86/pmu: Add intel_passthrough_pmu_msrs() to pass-through PMU MSRs Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-08-01  4:58 ` [RFC PATCH v3 30/58] KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot() Mingwei Zhang
                   ` (31 subsequent siblings)
  60 siblings, 0 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

Avoid calling into legacy/emulated vPMU logic such as reprogram_counters()
when the passthrough vPMU is enabled. Note that even when the passthrough
vPMU is enabled, global_ctrl may still be intercepted if the guest VM only
sees a subset of the counters.

Suggested-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
---
 arch/x86/kvm/pmu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index bd94f2d67f5c..e9047051489e 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -713,7 +713,8 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (pmu->global_ctrl != data) {
 			diff = pmu->global_ctrl ^ data;
 			pmu->global_ctrl = data;
-			reprogram_counters(pmu, diff);
+			if (!is_passthrough_pmu_enabled(vcpu))
+				reprogram_counters(pmu, diff);
 		}
 		break;
 	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 30/58] KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot()
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (28 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 29/58] KVM: x86/pmu: Avoid legacy vPMU code when accessing global_ctrl in passthrough vPMU Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-09-02  7:51   ` Mi, Dapeng
  2024-08-01  4:58 ` [RFC PATCH v3 31/58] KVM: x86/pmu: Add counter MSR and selector MSR index into struct kvm_pmc Mingwei Zhang
                   ` (30 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

Reject PMU MSRs explicitly in vmx_get_passthrough_msr_slot() since
interception of the PMU MSRs is handled specially in
intel_passthrough_pmu_msrs().

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
---
 arch/x86/kvm/vmx/vmx.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 34a420fa98c5..41102658ed21 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -166,7 +166,7 @@ module_param(enable_passthrough_pmu, bool, 0444);
 
 /*
  * List of MSRs that can be directly passed to the guest.
- * In addition to these x2apic, PT and LBR MSRs are handled specially.
+ * In addition to these x2apic, PMU, PT and LBR MSRs are handled specially.
  */
 static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
 	MSR_IA32_SPEC_CTRL,
@@ -695,6 +695,13 @@ static int vmx_get_passthrough_msr_slot(u32 msr)
 	case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 8:
 	case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8:
 		/* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
+	case MSR_IA32_PMC0 ... MSR_IA32_PMC0 + 7:
+	case MSR_IA32_PERFCTR0 ... MSR_IA32_PERFCTR0 + 7:
+	case MSR_CORE_PERF_FIXED_CTR0 ... MSR_CORE_PERF_FIXED_CTR0 + 2:
+	case MSR_CORE_PERF_GLOBAL_STATUS:
+	case MSR_CORE_PERF_GLOBAL_CTRL:
+	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
+		/* PMU MSRs. These are handled in intel_passthrough_pmu_msrs() */
 		return -ENOENT;
 	}
 
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 31/58] KVM: x86/pmu: Add counter MSR and selector MSR index into struct kvm_pmc
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (29 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 30/58] KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot() Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-11-19 18:58   ` Sean Christopherson
  2024-08-01  4:58 ` [RFC PATCH v3 32/58] KVM: x86/pmu: Introduce PMU operation prototypes for save/restore PMU context Mingwei Zhang
                   ` (29 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

Add the MSR indices for both the selector and the counter to each kvm_pmc.
This gives the mediated passthrough vPMU a convenient way to look up the
MSRs for a given pmc. Note that the legacy vPMU does not need this because
it never accesses PMU MSRs directly; instead, each kvm_pmc is bound to a
perf_event.

For actual Zen 4 and later hardware, it will never be the case that the
PerfMonV2 CPUID bit is set but the PerfCtrCore bit is not. However, a
guest can be booted with PerfMonV2 enabled and PerfCtrCore disabled.
KVM does not clear the PerfMonV2 bit from guest CPUID as long as the
host has the PerfCtrCore capability.

In this case, passthrough mode will use the K7 legacy MSRs to program
events but with the incorrect assumption that there are 6 such counters
instead of 4 as advertised by CPUID leaf 0x80000022 EBX. The host kernel
will also report unchecked MSR accesses for the absent counters while
saving or restoring guest PMU contexts.

Ensure that K7 legacy MSRs are not used as long as the guest CPUID has
either PerfCtrCore or PerfMonV2 set.
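
The two MSR layouts the refresh code chooses between can be illustrated
with a small standalone program. The MSR numbers below are the
architectural values; the loop bounds are just example counts:

#include <stdio.h>

#define MSR_F15H_PERF_CTL0	0xc0010200	/* ctl/ctr pairs interleaved */
#define MSR_F15H_PERF_CTR0	0xc0010201
#define MSR_K7_EVNTSEL0		0xc0010000
#define MSR_K7_PERFCTR0		0xc0010004

int main(void)
{
	/* PerfCtrCore/PerfMonV2 layout: selector/counter pairs 2 apart. */
	for (unsigned i = 0; i < 6; i++)
		printf("core   ctr%u: sel=%#x ctr=%#x\n", i,
		       MSR_F15H_PERF_CTL0 + 2 * i, MSR_F15H_PERF_CTR0 + 2 * i);

	/* Legacy K7 layout: only 4 counters, selectors and counters live
	 * in separate contiguous blocks. */
	for (unsigned i = 0; i < 4; i++)
		printf("legacy ctr%u: sel=%#x ctr=%#x\n", i,
		       MSR_K7_EVNTSEL0 + i, MSR_K7_PERFCTR0 + i);
	return 0;
}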

Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/svm/pmu.c          | 13 +++++++++++++
 arch/x86/kvm/vmx/pmu_intel.c    | 13 +++++++++++++
 3 files changed, 28 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4b3ce6194bdb..603727312f9c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -522,6 +522,8 @@ struct kvm_pmc {
 	 */
 	u64 emulated_counter;
 	u64 eventsel;
+	u64 msr_counter;
+	u64 msr_eventsel;
 	struct perf_event *perf_event;
 	struct kvm_vcpu *vcpu;
 	/*
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index 6b471b1ec9b8..64060cbd8210 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -177,6 +177,7 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
 	union cpuid_0x80000022_ebx ebx;
+	int i;
 
 	pmu->version = 1;
 	if (guest_cpuid_has(vcpu, X86_FEATURE_PERFMON_V2)) {
@@ -210,6 +211,18 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
 	pmu->counter_bitmask[KVM_PMC_FIXED] = 0;
 	pmu->nr_arch_fixed_counters = 0;
 	bitmap_set(pmu->all_valid_pmc_idx, 0, pmu->nr_arch_gp_counters);
+
+	if (pmu->version > 1 || guest_cpuid_has(vcpu, X86_FEATURE_PERFCTR_CORE)) {
+		for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
+			pmu->gp_counters[i].msr_eventsel = MSR_F15H_PERF_CTL0 + 2 * i;
+			pmu->gp_counters[i].msr_counter = MSR_F15H_PERF_CTR0 + 2 * i;
+		}
+	} else {
+		for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
+			pmu->gp_counters[i].msr_eventsel = MSR_K7_EVNTSEL0 + i;
+			pmu->gp_counters[i].msr_counter = MSR_K7_PERFCTR0 + i;
+		}
+	}
 }
 
 static void amd_pmu_init(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 737de5bf1eee..0de918dc14ea 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -562,6 +562,19 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 				~((1ull << pmu->nr_arch_gp_counters) - 1);
 		}
 	}
+
+	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
+		pmu->gp_counters[i].msr_eventsel = MSR_P6_EVNTSEL0 + i;
+		if (fw_writes_is_enabled(vcpu))
+			pmu->gp_counters[i].msr_counter = MSR_IA32_PMC0 + i;
+		else
+			pmu->gp_counters[i].msr_counter = MSR_IA32_PERFCTR0 + i;
+	}
+
+	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
+		pmu->fixed_counters[i].msr_eventsel = MSR_CORE_PERF_FIXED_CTR_CTRL;
+		pmu->fixed_counters[i].msr_counter = MSR_CORE_PERF_FIXED_CTR0 + i;
+	}
 }
 
 static void intel_pmu_init(struct kvm_vcpu *vcpu)
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 32/58] KVM: x86/pmu: Introduce PMU operation prototypes for save/restore PMU context
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (30 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 31/58] KVM: x86/pmu: Add counter MSR and selector MSR index into struct kvm_pmc Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-08-01  4:58 ` [RFC PATCH v3 33/58] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU Mingwei Zhang
                   ` (28 subsequent siblings)
  60 siblings, 0 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

Plumb two extra functions through kvm_pmu_ops to support the PMU context
switch.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
---
 arch/x86/include/asm/kvm-x86-pmu-ops.h |  2 ++
 arch/x86/kvm/pmu.c                     | 14 ++++++++++++++
 arch/x86/kvm/pmu.h                     |  4 ++++
 3 files changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h
index 1b7876dcb3c3..1a848ba6a7a7 100644
--- a/arch/x86/include/asm/kvm-x86-pmu-ops.h
+++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h
@@ -25,6 +25,8 @@ KVM_X86_PMU_OP_OPTIONAL(reset)
 KVM_X86_PMU_OP_OPTIONAL(deliver_pmi)
 KVM_X86_PMU_OP_OPTIONAL(cleanup)
 KVM_X86_PMU_OP_OPTIONAL(passthrough_pmu_msrs)
+KVM_X86_PMU_OP_OPTIONAL(save_pmu_context)
+KVM_X86_PMU_OP_OPTIONAL(restore_pmu_context)
 
 #undef KVM_X86_PMU_OP
 #undef KVM_X86_PMU_OP_OPTIONAL
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index e9047051489e..782b564bdf96 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -1065,3 +1065,17 @@ void kvm_pmu_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
 {
 	static_call_cond(kvm_x86_pmu_passthrough_pmu_msrs)(vcpu);
 }
+
+void kvm_pmu_save_pmu_context(struct kvm_vcpu *vcpu)
+{
+	lockdep_assert_irqs_disabled();
+
+	static_call_cond(kvm_x86_pmu_save_pmu_context)(vcpu);
+}
+
+void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu)
+{
+	lockdep_assert_irqs_disabled();
+
+	static_call_cond(kvm_x86_pmu_restore_pmu_context)(vcpu);
+}
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 63f876557716..8bd4b79e363f 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -42,6 +42,8 @@ struct kvm_pmu_ops {
 	void (*cleanup)(struct kvm_vcpu *vcpu);
 	bool (*is_rdpmc_passthru_allowed)(struct kvm_vcpu *vcpu);
 	void (*passthrough_pmu_msrs)(struct kvm_vcpu *vcpu);
+	void (*save_pmu_context)(struct kvm_vcpu *vcpu);
+	void (*restore_pmu_context)(struct kvm_vcpu *vcpu);
 
 	const u64 EVENTSEL_EVENT;
 	const int MAX_NR_GP_COUNTERS;
@@ -294,6 +296,8 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
 void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel);
 bool kvm_pmu_check_rdpmc_passthrough(struct kvm_vcpu *vcpu);
 void kvm_pmu_passthrough_pmu_msrs(struct kvm_vcpu *vcpu);
+void kvm_pmu_save_pmu_context(struct kvm_vcpu *vcpu);
+void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu);
 
 bool is_vmware_backdoor_pmc(u32 pmc_idx);
 
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 33/58] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (31 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 32/58] KVM: x86/pmu: Introduce PMU operation prototypes for save/restore PMU context Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-08-06  7:27   ` Mi, Dapeng
  2024-08-01  4:58 ` [RFC PATCH v3 34/58] KVM: x86/pmu: Make check_pmu_event_filter() an exported function Mingwei Zhang
                   ` (27 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

Implement the save/restore of PMU state for the passthrough PMU on Intel.
In passthrough mode, KVM exclusively owns the PMU hardware while control
flow is within the scope of the passthrough PMU. Thus, KVM needs to save
the host PMU state and take full ownership of the PMU hardware. Conversely,
the host regains ownership of the PMU hardware from KVM when control flow
leaves the scope of the passthrough PMU.

Implement the PMU context switch for Intel CPUs and opportunistically use
rdpmcl() instead of rdmsrl() when reading counters since the former has
lower latency on Intel CPUs.
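
As a side note on the rdpmcl() usage, the standalone sketch below shows the
RDPMC index encoding used by the save path: GP counter i is read with
ECX = i, while fixed counter i sets bit 30 (INTEL_PMC_FIXED_RDPMC_BASE in
the kernel). This is an illustration, not kernel code.

#include <stdint.h>
#include <stdio.h>

#define FIXED_RDPMC_BASE	(1u << 30)

static uint32_t rdpmc_ecx_gp(unsigned i)    { return i; }
static uint32_t rdpmc_ecx_fixed(unsigned i) { return FIXED_RDPMC_BASE | i; }

int main(void)
{
	printf("GP2    -> ECX=%#x\n", rdpmc_ecx_gp(2));
	printf("FIXED1 -> ECX=%#x\n", rdpmc_ecx_fixed(1));
	return 0;
}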

Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
---
 arch/x86/kvm/pmu.c           | 46 ++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/pmu_intel.c | 41 +++++++++++++++++++++++++++++++-
 2 files changed, 86 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 782b564bdf96..9bb733384069 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -1068,14 +1068,60 @@ void kvm_pmu_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
 
 void kvm_pmu_save_pmu_context(struct kvm_vcpu *vcpu)
 {
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct kvm_pmc *pmc;
+	u32 i;
+
 	lockdep_assert_irqs_disabled();
 
 	static_call_cond(kvm_x86_pmu_save_pmu_context)(vcpu);
+
+	/*
+	 * Clear the hardware selector MSRs and counters to avoid information
+	 * leakage and to prevent these guest GP counters from being
+	 * accidentally enabled when the host later enables global ctrl.
+	 */
+	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
+		pmc = &pmu->gp_counters[i];
+		rdpmcl(i, pmc->counter);
+		rdmsrl(pmc->msr_eventsel, pmc->eventsel);
+		if (pmc->counter)
+			wrmsrl(pmc->msr_counter, 0);
+		if (pmc->eventsel)
+			wrmsrl(pmc->msr_eventsel, 0);
+	}
+
+	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
+		pmc = &pmu->fixed_counters[i];
+		rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
+		if (pmc->counter)
+			wrmsrl(pmc->msr_counter, 0);
+	}
 }
 
 void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu)
 {
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct kvm_pmc *pmc;
+	int i;
+
 	lockdep_assert_irqs_disabled();
 
 	static_call_cond(kvm_x86_pmu_restore_pmu_context)(vcpu);
+
+	/*
+	 * No need to zero out unexposed GP/fixed counters/selectors since
+	 * RDPMC will be intercepted in this case. Accessing these counters
+	 * and selectors would cause a #GP in the guest.
+	 */
+	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
+		pmc = &pmu->gp_counters[i];
+		wrmsrl(pmc->msr_counter, pmc->counter);
+		wrmsrl(pmc->msr_eventsel, pmu->gp_counters[i].eventsel);
+	}
+
+	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
+		pmc = &pmu->fixed_counters[i];
+		wrmsrl(pmc->msr_counter, pmc->counter);
+	}
 }
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 0de918dc14ea..89c8f73a48c8 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -572,7 +572,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 	}
 
 	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
-		pmu->fixed_counters[i].msr_eventsel = MSR_CORE_PERF_FIXED_CTR_CTRL;
+		pmu->fixed_counters[i].msr_eventsel = 0;
 		pmu->fixed_counters[i].msr_counter = MSR_CORE_PERF_FIXED_CTR0 + i;
 	}
 }
@@ -799,6 +799,43 @@ static void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
 	vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_GLOBAL_OVF_CTRL, MSR_TYPE_RW, msr_intercept);
 }
 
+static void intel_save_guest_pmu_context(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+	/* Global ctrl register is already saved at VM-exit. */
+	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
+	/* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
+	if (pmu->global_status)
+		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
+
+	rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
+	/*
+	 * Clear the hardware FIXED_CTR_CTRL MSR to avoid information leakage
+	 * and to prevent these guest fixed counters from being accidentally
+	 * enabled when the host later enables global ctrl.
+	 */
+	if (pmu->fixed_ctr_ctrl)
+		wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
+}
+
+static void intel_restore_guest_pmu_context(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	u64 global_status, toggle;
+
+	/* Clear host global_ctrl MSR if non-zero. */
+	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
+	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
+	toggle = pmu->global_status ^ global_status;
+	if (global_status & toggle)
+		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status & toggle);
+	if (pmu->global_status & toggle)
+		wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status & toggle);
+
+	wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
+}
+
 struct kvm_pmu_ops intel_pmu_ops __initdata = {
 	.rdpmc_ecx_to_pmc = intel_rdpmc_ecx_to_pmc,
 	.msr_idx_to_pmc = intel_msr_idx_to_pmc,
@@ -812,6 +849,8 @@ struct kvm_pmu_ops intel_pmu_ops __initdata = {
 	.cleanup = intel_pmu_cleanup,
 	.is_rdpmc_passthru_allowed = intel_is_rdpmc_passthru_allowed,
 	.passthrough_pmu_msrs = intel_passthrough_pmu_msrs,
+	.save_pmu_context = intel_save_guest_pmu_context,
+	.restore_pmu_context = intel_restore_guest_pmu_context,
 	.EVENTSEL_EVENT = ARCH_PERFMON_EVENTSEL_EVENT,
 	.MAX_NR_GP_COUNTERS = KVM_INTEL_PMC_MAX_GENERIC,
 	.MIN_NR_GP_COUNTERS = 1,
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 34/58] KVM: x86/pmu: Make check_pmu_event_filter() an exported function
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (32 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 33/58] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-08-01  4:58 ` [RFC PATCH v3 35/58] KVM: x86/pmu: Allow writing to event selector for GP counters if event is allowed Mingwei Zhang
                   ` (26 subsequent siblings)
  60 siblings, 0 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

Export check_pmu_event_filter() so it is usable by vendor modules like
kvm_intel. This is needed because the passthrough PMU intercepts guest
writes to the event selectors and does the event filter check directly
inside the vendor-specific set_msr() instead of deferring to the
KVM_REQ_PMU handler.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
---
 arch/x86/kvm/pmu.c | 3 ++-
 arch/x86/kvm/pmu.h | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 9bb733384069..9aa08472b7df 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -443,7 +443,7 @@ static bool is_fixed_event_allowed(struct kvm_x86_pmu_event_filter *filter,
 	return true;
 }
 
-static bool check_pmu_event_filter(struct kvm_pmc *pmc)
+bool check_pmu_event_filter(struct kvm_pmc *pmc)
 {
 	struct kvm_x86_pmu_event_filter *filter;
 	struct kvm *kvm = pmc->vcpu->kvm;
@@ -457,6 +457,7 @@ static bool check_pmu_event_filter(struct kvm_pmc *pmc)
 
 	return is_fixed_event_allowed(filter, pmc->idx);
 }
+EXPORT_SYMBOL_GPL(check_pmu_event_filter);
 
 static bool pmc_event_is_allowed(struct kvm_pmc *pmc)
 {
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 8bd4b79e363f..9cde62f3988e 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -298,6 +298,7 @@ bool kvm_pmu_check_rdpmc_passthrough(struct kvm_vcpu *vcpu);
 void kvm_pmu_passthrough_pmu_msrs(struct kvm_vcpu *vcpu);
 void kvm_pmu_save_pmu_context(struct kvm_vcpu *vcpu);
 void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu);
+bool check_pmu_event_filter(struct kvm_pmc *pmc);
 
 bool is_vmware_backdoor_pmc(u32 pmc_idx);
 
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 35/58] KVM: x86/pmu: Allow writing to event selector for GP counters if event is allowed
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (33 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 34/58] KVM: x86/pmu: Make check_pmu_event_filter() an exported function Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-08-01  4:58 ` [RFC PATCH v3 36/58] KVM: x86/pmu: Allow writing to fixed counter selector if counter is exposed Mingwei Zhang
                   ` (25 subsequent siblings)
  60 siblings, 0 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

Only let a write to an event selector reach the hardware if the event is
allowed by the filter. Since the passthrough PMU implementation does the
PMU context switch at the VM Enter/Exit boundary, even a value that passes
the check cannot be written directly to the hardware, because the PMU
hardware is owned by the host at that moment. Because of that, introduce
eventsel_hw to cache the value; it will be written to the hardware just
before VM entry.

Note that regardless of whether an event value is allowed, the value is
cached in pmc->eventsel and the guest VM can always read the cached value
back. This behavior is consistent with the hardware design.
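
Below is a minimal userspace sketch of the caching scheme described above,
assuming a stripped-down pmc with only the relevant fields; the real KVM
path additionally performs reserved-bit checks and consults the PMU event
filter.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define EVENTSEL_ENABLE	(1ull << 22)	/* ARCH_PERFMON_EVENTSEL_ENABLE */

/* Hypothetical stand-in for a counter: the guest-visible value is always
 * cached, but only filter-approved values reach eventsel_hw, which is
 * what gets written to hardware at VM entry. */
struct pmc { uint64_t eventsel, eventsel_hw, counter; };

static void write_eventsel(struct pmc *pmc, uint64_t data, bool allowed)
{
	pmc->eventsel = data;		/* guest reads this back unchanged */
	if (!allowed) {
		pmc->eventsel_hw &= ~EVENTSEL_ENABLE;
		pmc->counter = 0;
		return;
	}
	pmc->eventsel_hw = data;
}

int main(void)
{
	struct pmc pmc = { 0 };

	write_eventsel(&pmc, 0xc0 | EVENTSEL_ENABLE, false);
	printf("eventsel=%#llx hw=%#llx\n",
	       (unsigned long long)pmc.eventsel,
	       (unsigned long long)pmc.eventsel_hw);
	return 0;
}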

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/pmu.c              |  5 ++---
 arch/x86/kvm/vmx/pmu_intel.c    | 13 ++++++++++++-
 3 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 603727312f9c..e5c288d4264f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -522,6 +522,7 @@ struct kvm_pmc {
 	 */
 	u64 emulated_counter;
 	u64 eventsel;
+	u64 eventsel_hw;
 	u64 msr_counter;
 	u64 msr_eventsel;
 	struct perf_event *perf_event;
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 9aa08472b7df..545930f743b9 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -1085,10 +1085,9 @@ void kvm_pmu_save_pmu_context(struct kvm_vcpu *vcpu)
 	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
 		pmc = &pmu->gp_counters[i];
 		rdpmcl(i, pmc->counter);
-		rdmsrl(pmc->msr_eventsel, pmc->eventsel);
 		if (pmc->counter)
 			wrmsrl(pmc->msr_counter, 0);
-		if (pmc->eventsel)
+		if (pmc->eventsel_hw)
 			wrmsrl(pmc->msr_eventsel, 0);
 	}
 
@@ -1118,7 +1117,7 @@ void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu)
 	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
 		pmc = &pmu->gp_counters[i];
 		wrmsrl(pmc->msr_counter, pmc->counter);
-		wrmsrl(pmc->msr_eventsel, pmu->gp_counters[i].eventsel);
+		wrmsrl(pmc->msr_eventsel, pmu->gp_counters[i].eventsel_hw);
 	}
 
 	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 89c8f73a48c8..0cd38c5632ee 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -399,7 +399,18 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			if (data & reserved_bits)
 				return 1;
 
-			if (data != pmc->eventsel) {
+			if (is_passthrough_pmu_enabled(vcpu)) {
+				pmc->eventsel = data;
+				if (!check_pmu_event_filter(pmc)) {
+					if (pmc->eventsel_hw &
+					    ARCH_PERFMON_EVENTSEL_ENABLE) {
+						pmc->eventsel_hw &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
+						pmc->counter = 0;
+					}
+					return 0;
+				}
+				pmc->eventsel_hw = data;
+			} else if (data != pmc->eventsel) {
 				pmc->eventsel = data;
 				kvm_pmu_request_counter_reprogram(pmc);
 			}
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 36/58] KVM: x86/pmu: Allow writing to fixed counter selector if counter is exposed
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (34 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 35/58] KVM: x86/pmu: Allow writing to event selector for GP counters if event is allowed Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-09-02  7:59   ` Mi, Dapeng
  2024-08-01  4:58 ` [RFC PATCH v3 37/58] KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary Mingwei Zhang
                   ` (24 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

Allow writing to the fixed counter selector if the counter is exposed.
If a fixed counter is filtered out, that counter will not be enabled on
HW.

Since the passthrough PMU implements the context switch at the VM
Enter/Exit boundary, the guest value cannot be written directly to HW
because the HW PMU is owned by the host. Introduce a new field
fixed_ctr_ctrl_hw in kvm_pmu to cache the guest value, which is assigned
to HW at PMU context restore.

Since the passthrough PMU intercepts writes to the fixed counter
selector, there is no need to read the value back at PMU context save,
but still clear the fixed counter ctrl MSR and counters when switching
out to the host PMU.
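
A condensed sketch of the intended split between the guest-visible value
and the HW value (illustration only; the actual per-counter filtering is
in reprogram_fixed_counters_in_passthrough_pmu() in the diff below):

  u64 hw = 0;
  int i;

  /* Guest WRMSR to MSR_CORE_PERF_FIXED_CTR_CTRL under the passthrough PMU */
  for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
          /* keep only the fields of counters that pass the event filter */
          pmc = get_fixed_pmc(pmu, MSR_CORE_PERF_FIXED_CTR0 + i);
          if (check_pmu_event_filter(pmc))
                  hw |= fixed_ctrl_field(data, i) << (i * 4);
  }
  pmu->fixed_ctr_ctrl = data;     /* guest-visible cache */
  pmu->fixed_ctr_ctrl_hw = hw;    /* written to HW at PMU context restore */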

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/vmx/pmu_intel.c    | 28 ++++++++++++++++++++++++----
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e5c288d4264f..93c17da8271d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -549,6 +549,7 @@ struct kvm_pmu {
 	unsigned nr_arch_fixed_counters;
 	unsigned available_event_types;
 	u64 fixed_ctr_ctrl;
+	u64 fixed_ctr_ctrl_hw;
 	u64 fixed_ctr_ctrl_mask;
 	u64 global_ctrl;
 	u64 global_status;
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 0cd38c5632ee..c61936266cbd 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -34,6 +34,25 @@
 
 #define MSR_PMC_FULL_WIDTH_BIT      (MSR_IA32_PMC0 - MSR_IA32_PERFCTR0)
 
+static void reprogram_fixed_counters_in_passthrough_pmu(struct kvm_pmu *pmu, u64 data)
+{
+	struct kvm_pmc *pmc;
+	u64 new_data = 0;
+	int i;
+
+	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
+		pmc = get_fixed_pmc(pmu, MSR_CORE_PERF_FIXED_CTR0 + i);
+		if (check_pmu_event_filter(pmc)) {
+			pmc->current_config = fixed_ctrl_field(data, i);
+			new_data |= (pmc->current_config << (i * 4));
+		} else {
+			pmc->counter = 0;
+		}
+	}
+	pmu->fixed_ctr_ctrl_hw = new_data;
+	pmu->fixed_ctr_ctrl = data;
+}
+
 static void reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data)
 {
 	struct kvm_pmc *pmc;
@@ -351,7 +370,9 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (data & pmu->fixed_ctr_ctrl_mask)
 			return 1;
 
-		if (pmu->fixed_ctr_ctrl != data)
+		if (is_passthrough_pmu_enabled(vcpu))
+			reprogram_fixed_counters_in_passthrough_pmu(pmu, data);
+		else if (pmu->fixed_ctr_ctrl != data)
 			reprogram_fixed_counters(pmu, data);
 		break;
 	case MSR_IA32_PEBS_ENABLE:
@@ -820,13 +841,12 @@ static void intel_save_guest_pmu_context(struct kvm_vcpu *vcpu)
 	if (pmu->global_status)
 		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
 
-	rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
 	/*
 	 * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
 	 * also avoid these guest fixed counters get accidentially enabled
 	 * during host running when host enable global ctrl.
 	 */
-	if (pmu->fixed_ctr_ctrl)
+	if (pmu->fixed_ctr_ctrl_hw)
 		wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
 }
 
@@ -844,7 +864,7 @@ static void intel_restore_guest_pmu_context(struct kvm_vcpu *vcpu)
 	if (pmu->global_status & toggle)
 		wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status & toggle);
 
-	wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
+	wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl_hw);
 }
 
 struct kvm_pmu_ops intel_pmu_ops __initdata = {
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 37/58] KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (35 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 36/58] KVM: x86/pmu: Allow writing to fixed counter selector if counter is exposed Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-10-24 20:26   ` Chen, Zide
  2024-10-31  3:14   ` Mi, Dapeng
  2024-08-01  4:58 ` [RFC PATCH v3 38/58] KVM: x86/pmu: Exclude existing vLBR logic from the passthrough PMU Mingwei Zhang
                   ` (23 subsequent siblings)
  60 siblings, 2 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Xiong Zhang <xiong.y.zhang@linux.intel.com>

In PMU passthrough mode, use the global_ctrl field in struct kvm_pmu as
the cached value. This is convenient for KVM to set and get the value
from the host side. In addition, save and load the value across the VM
enter/exit boundary in the following way:

 - At VM exit, if the processor supports
   VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL, read the guest
   IA32_PERF_GLOBAL_CTRL from the GUEST_IA32_PERF_GLOBAL_CTRL VMCS
   field, else read it from the VM-exit MSR-store array in the VMCS. The
   value is then assigned to global_ctrl.

 - At VM entry, if the processor supports
   VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL, write the cached global_ctrl to
   the GUEST_IA32_PERF_GLOBAL_CTRL VMCS field, else write it to the
   VM-entry MSR-load array in the VMCS.

Implement the above logic in two helper functions and invoke them around
the VM enter/exit boundary.
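
Condensed, the two helpers end up doing roughly the following (sketch of
the diff below):

  /* VM exit: capture the guest's IA32_PERF_GLOBAL_CTRL into the cache */
  if (vm_exit_controls_get(vmx) & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL)
          pmu->global_ctrl = vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL);
  else    /* fall back to the VM-exit MSR-store array slot */
          pmu->global_ctrl = vmx->msr_autostore.guest.val[
                                  pmu->global_ctrl_slot_in_autostore].value;

  /* VM entry: propagate the cached value back to the guest */
  if (vm_entry_controls_get(vmx) & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL)
          vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, pmu->global_ctrl);
  else    /* fall back to the VM-entry MSR-load array slot */
          vmx->msr_autoload.guest.val[
                  pmu->global_ctrl_slot_in_autoload].value = pmu->global_ctrl;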

Co-developed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/vmx/vmx.c          | 49 ++++++++++++++++++++++++++++++++-
 2 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 93c17da8271d..7bf901a53543 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -601,6 +601,8 @@ struct kvm_pmu {
 	u8 event_count;
 
 	bool passthrough;
+	int global_ctrl_slot_in_autoload;
+	int global_ctrl_slot_in_autostore;
 };
 
 struct kvm_pmu_ops;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 41102658ed21..b126de6569c8 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4430,6 +4430,7 @@ static void vmx_set_perf_global_ctrl(struct vcpu_vmx *vmx)
 			}
 			m->val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
 			m->val[i].value = 0;
+			vcpu_to_pmu(&vmx->vcpu)->global_ctrl_slot_in_autoload = i;
 		}
 		/*
 		 * Setup auto clear host PERF_GLOBAL_CTRL msr at vm exit.
@@ -4457,6 +4458,7 @@ static void vmx_set_perf_global_ctrl(struct vcpu_vmx *vmx)
 				vmcs_write32(VM_EXIT_MSR_STORE_COUNT, m->nr);
 			}
 			m->val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
+			vcpu_to_pmu(&vmx->vcpu)->global_ctrl_slot_in_autostore = i;
 		}
 	} else {
 		if (!(vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL)) {
@@ -4467,6 +4469,7 @@ static void vmx_set_perf_global_ctrl(struct vcpu_vmx *vmx)
 				m->val[i] = m->val[m->nr];
 				vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
 			}
+			vcpu_to_pmu(&vmx->vcpu)->global_ctrl_slot_in_autoload = -ENOENT;
 		}
 		if (!(vmexit_ctrl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL)) {
 			m = &vmx->msr_autoload.host;
@@ -4485,6 +4488,7 @@ static void vmx_set_perf_global_ctrl(struct vcpu_vmx *vmx)
 				m->val[i] = m->val[m->nr];
 				vmcs_write32(VM_EXIT_MSR_STORE_COUNT, m->nr);
 			}
+			vcpu_to_pmu(&vmx->vcpu)->global_ctrl_slot_in_autostore = -ENOENT;
 		}
 	}
 
@@ -7272,7 +7276,7 @@ void vmx_cancel_injection(struct kvm_vcpu *vcpu)
 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
 }
 
-static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
+static void __atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
 {
 	int i, nr_msrs;
 	struct perf_guest_switch_msr *msrs;
@@ -7295,6 +7299,46 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
 					msrs[i].host, false);
 }
 
+static void save_perf_global_ctrl_in_passthrough_pmu(struct vcpu_vmx *vmx)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
+	int i;
+
+	if (vm_exit_controls_get(vmx) & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL) {
+		pmu->global_ctrl = vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL);
+	} else {
+		i = pmu->global_ctrl_slot_in_autostore;
+		pmu->global_ctrl = vmx->msr_autostore.guest.val[i].value;
+	}
+}
+
+static void load_perf_global_ctrl_in_passthrough_pmu(struct vcpu_vmx *vmx)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
+	u64 global_ctrl = pmu->global_ctrl;
+	int i;
+
+	if (vm_entry_controls_get(vmx) & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) {
+		vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, global_ctrl);
+	} else {
+		i = pmu->global_ctrl_slot_in_autoload;
+		vmx->msr_autoload.guest.val[i].value = global_ctrl;
+	}
+}
+
+static void __atomic_switch_perf_msrs_in_passthrough_pmu(struct vcpu_vmx *vmx)
+{
+	load_perf_global_ctrl_in_passthrough_pmu(vmx);
+}
+
+static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
+{
+	if (is_passthrough_pmu_enabled(&vmx->vcpu))
+		__atomic_switch_perf_msrs_in_passthrough_pmu(vmx);
+	else
+		__atomic_switch_perf_msrs(vmx);
+}
+
 static void vmx_update_hv_timer(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -7405,6 +7449,9 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 	vcpu->arch.cr2 = native_read_cr2();
 	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
 
+	if (is_passthrough_pmu_enabled(vcpu))
+		save_perf_global_ctrl_in_passthrough_pmu(vmx);
+
 	vmx->idt_vectoring_info = 0;
 
 	vmx_enable_fb_clear(vmx);
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 38/58] KVM: x86/pmu: Exclude existing vLBR logic from the passthrough PMU
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (36 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 37/58] KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-11-20 18:42   ` Sean Christopherson
  2024-08-01  4:58 ` [RFC PATCH v3 39/58] KVM: x86/pmu: Notify perf core at KVM context switch boundary Mingwei Zhang
                   ` (22 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

Exclude the existing vLBR logic from the passthrough PMU because it does
not support LBR-related MSRs. So, to avoid any side effects, do not call
vLBR-related code in either vcpu_enter_guest() or the PMI injection
function.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
---
 arch/x86/kvm/vmx/pmu_intel.c | 13 ++++++++-----
 arch/x86/kvm/vmx/vmx.c       |  2 +-
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index c61936266cbd..40c503cd263b 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -660,13 +660,16 @@ static void intel_pmu_legacy_freezing_lbrs_on_pmi(struct kvm_vcpu *vcpu)
 
 static void intel_pmu_deliver_pmi(struct kvm_vcpu *vcpu)
 {
-	u8 version = vcpu_to_pmu(vcpu)->version;
+	u8 version;
 
-	if (!intel_pmu_lbr_is_enabled(vcpu))
-		return;
+	if (!is_passthrough_pmu_enabled(vcpu)) {
+		if (!intel_pmu_lbr_is_enabled(vcpu))
+			return;
 
-	if (version > 1 && version < 4)
-		intel_pmu_legacy_freezing_lbrs_on_pmi(vcpu);
+		version = vcpu_to_pmu(vcpu)->version;
+		if (version > 1 && version < 4)
+			intel_pmu_legacy_freezing_lbrs_on_pmi(vcpu);
+	}
 }
 
 static void vmx_update_intercept_for_lbr_msrs(struct kvm_vcpu *vcpu, bool set)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b126de6569c8..a4b2b0b69a68 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7561,7 +7561,7 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 	pt_guest_enter(vmx);
 
 	atomic_switch_perf_msrs(vmx);
-	if (intel_pmu_lbr_is_enabled(vcpu))
+	if (!is_passthrough_pmu_enabled(&vmx->vcpu) && intel_pmu_lbr_is_enabled(vcpu))
 		vmx_passthrough_lbr_msrs(vcpu);
 
 	if (enable_preemption_timer)
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 39/58] KVM: x86/pmu: Notify perf core at KVM context switch boundary
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (37 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 38/58] KVM: x86/pmu: Exclude existing vLBR logic from the passthrough PMU Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-08-01  4:58 ` [RFC PATCH v3 40/58] KVM: x86/pmu: Grab x86 core PMU for passthrough PMU VM Mingwei Zhang
                   ` (21 subsequent siblings)
  60 siblings, 0 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Xiong Zhang <xiong.y.zhang@linux.intel.com>

Before restoring the guest PMU context, call perf_guest_enter() to let
the perf core schedule out all exclude_guest perf events and switch the
PMI to the dedicated KVM_GUEST_PMI_VECTOR.

After saving the guest PMU context, call perf_guest_exit() to let the
perf core switch the PMI back to NMI and schedule the exclude_guest perf
events back in.
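
Condensed, the resulting ordering around the guest PMU context switch is
roughly (sketch of the diff below):

  /* Switching into the guest PMU context: */
  perf_guest_enter(kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVTPC));
  /* ... then kvm_pmu_restore_pmu_context() loads the guest counters/MSRs */

  /* Switching back to the host PMU context: */
  /* ... kvm_pmu_save_pmu_context() stashes the guest counters/MSRs first */
  perf_guest_exit();  /* PMI back to NMI, exclude_guest events scheduled in */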

Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/pmu.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 545930f743b9..5cc539bdcc7e 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -1097,6 +1097,8 @@ void kvm_pmu_save_pmu_context(struct kvm_vcpu *vcpu)
 		if (pmc->counter)
 			wrmsrl(pmc->msr_counter, 0);
 	}
+
+	perf_guest_exit();
 }
 
 void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu)
@@ -1107,6 +1109,8 @@ void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu)
 
 	lockdep_assert_irqs_disabled();
 
+	perf_guest_enter(kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVTPC));
+
 	static_call_cond(kvm_x86_pmu_restore_pmu_context)(vcpu);
 
 	/*
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 40/58] KVM: x86/pmu: Grab x86 core PMU for passthrough PMU VM
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (38 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 39/58] KVM: x86/pmu: Notify perf core at KVM context switch boundary Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-11-20 18:46   ` Sean Christopherson
  2024-08-01  4:58 ` [RFC PATCH v3 41/58] KVM: x86/pmu: Add support for PMU context switch at VM-exit/enter Mingwei Zhang
                   ` (20 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Xiong Zhang <xiong.y.zhang@linux.intel.com>

When the passthrough PMU is enabled by KVM and perf, KVM calls
perf_get_mediated_pmu() to take exclusive ownership of the x86 core PMU
at VM creation, and calls perf_put_mediated_pmu() to return the x86 core
PMU to host perf at VM destruction.

When perf_get_mediated_pmu() fails, the host has system-wide perf events
without exclude_guest = 1, which must be disabled before a VM with the
passthrough PMU can be created.

Once a VM with the passthrough PMU starts, perf refuses to create
system-wide perf events without exclude_guest = 1 until the VM is
closed.
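
In short, the pairing over the VM lifecycle looks like this (summary
sketch of the diff below):

  /* kvm_arch_init_vm(): claim the core PMU for mediated passthrough */
  if (kvm->arch.enable_passthrough_pmu) {
          ret = perf_get_mediated_pmu();
          if (ret < 0)  /* host has non-exclude_guest system-wide events */
                  goto out_uninit_mmu;
  }

  /* kvm_arch_destroy_vm(): hand the core PMU back to host perf */
  if (kvm->arch.enable_passthrough_pmu)
          perf_put_mediated_pmu();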

Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/x86.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6db4dc496d2b..dd6d2c334d90 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6690,8 +6690,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		if (!kvm->created_vcpus) {
 			kvm->arch.enable_pmu = !(cap->args[0] & KVM_PMU_CAP_DISABLE);
 			/* Disable passthrough PMU if enable_pmu is false. */
-			if (!kvm->arch.enable_pmu)
+			if (!kvm->arch.enable_pmu) {
+				if (kvm->arch.enable_passthrough_pmu)
+					perf_put_mediated_pmu();
 				kvm->arch.enable_passthrough_pmu = false;
+			}
 			r = 0;
 		}
 		mutex_unlock(&kvm->lock);
@@ -12637,6 +12640,14 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	kvm->arch.guest_can_read_msr_platform_info = true;
 	kvm->arch.enable_pmu = enable_pmu;
 	kvm->arch.enable_passthrough_pmu = enable_passthrough_pmu;
+	if (kvm->arch.enable_passthrough_pmu) {
+		ret = perf_get_mediated_pmu();
+		if (ret < 0) {
+			kvm_err("failed to enable mediated passthrough pmu, please disable system wide perf events\n");
+			goto out_uninit_mmu;
+		}
+	}
+
 
 #if IS_ENABLED(CONFIG_HYPERV)
 	spin_lock_init(&kvm->arch.hv_root_tdp_lock);
@@ -12785,6 +12796,8 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 		__x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, 0, 0);
 		mutex_unlock(&kvm->slots_lock);
 	}
+	if (kvm->arch.enable_passthrough_pmu)
+		perf_put_mediated_pmu();
 	kvm_unload_vcpu_mmus(kvm);
 	static_call_cond(kvm_x86_vm_destroy)(kvm);
 	kvm_free_msr_filter(srcu_dereference_check(kvm->arch.msr_filter, &kvm->srcu, 1));
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 41/58] KVM: x86/pmu: Add support for PMU context switch at VM-exit/enter
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (39 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 40/58] KVM: x86/pmu: Grab x86 core PMU for passthrough PMU VM Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-10-24 19:57   ` Chen, Zide
  2024-08-01  4:58 ` [RFC PATCH v3 42/58] KVM: x86/pmu: Introduce PMU operator to increment counter Mingwei Zhang
                   ` (19 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Xiong Zhang <xiong.y.zhang@linux.intel.com>

Add proper PMU context switching at the VM Entry/Exit boundary.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/x86.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dd6d2c334d90..70274c0da017 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11050,6 +11050,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		set_debugreg(0, 7);
 	}
 
+	if (is_passthrough_pmu_enabled(vcpu))
+		kvm_pmu_restore_pmu_context(vcpu);
+
 	guest_timing_enter_irqoff();
 
 	for (;;) {
@@ -11078,6 +11081,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		++vcpu->stat.exits;
 	}
 
+	if (is_passthrough_pmu_enabled(vcpu))
+		kvm_pmu_save_pmu_context(vcpu);
+
 	/*
 	 * Do this here before restoring debug registers on the host.  And
 	 * since we do this before handling the vmexit, a DR access vmexit
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 42/58] KVM: x86/pmu: Introduce PMU operator to increment counter
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (40 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 41/58] KVM: x86/pmu: Add support for PMU context switch at VM-exit/enter Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-08-01  4:58 ` [RFC PATCH v3 43/58] KVM: x86/pmu: Introduce PMU operator for setting counter overflow Mingwei Zhang
                   ` (18 subsequent siblings)
  60 siblings, 0 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

Introduce a PMU operator to increment a counter, because in the
passthrough PMU there is no common backend implementation such as the
host perf API. Having a PMU operator for counter increment and overflow
checking helps hide architectural differences.

Introduce the operator function to make it convenient for the
passthrough PMU to synthesize a PMI.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/include/asm/kvm-x86-pmu-ops.h |  1 +
 arch/x86/kvm/pmu.h                     |  1 +
 arch/x86/kvm/vmx/pmu_intel.c           | 12 ++++++++++++
 3 files changed, 14 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h
index 1a848ba6a7a7..72ca78df8d2b 100644
--- a/arch/x86/include/asm/kvm-x86-pmu-ops.h
+++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h
@@ -27,6 +27,7 @@ KVM_X86_PMU_OP_OPTIONAL(cleanup)
 KVM_X86_PMU_OP_OPTIONAL(passthrough_pmu_msrs)
 KVM_X86_PMU_OP_OPTIONAL(save_pmu_context)
 KVM_X86_PMU_OP_OPTIONAL(restore_pmu_context)
+KVM_X86_PMU_OP_OPTIONAL(incr_counter)
 
 #undef KVM_X86_PMU_OP
 #undef KVM_X86_PMU_OP_OPTIONAL
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 9cde62f3988e..325f17673a00 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -44,6 +44,7 @@ struct kvm_pmu_ops {
 	void (*passthrough_pmu_msrs)(struct kvm_vcpu *vcpu);
 	void (*save_pmu_context)(struct kvm_vcpu *vcpu);
 	void (*restore_pmu_context)(struct kvm_vcpu *vcpu);
+	bool (*incr_counter)(struct kvm_pmc *pmc);
 
 	const u64 EVENTSEL_EVENT;
 	const int MAX_NR_GP_COUNTERS;
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 40c503cd263b..42af2404bdb9 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -74,6 +74,17 @@ static void reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data)
 	}
 }
 
+static bool intel_incr_counter(struct kvm_pmc *pmc)
+{
+	pmc->counter += 1;
+	pmc->counter &= pmc_bitmask(pmc);
+
+	if (!pmc->counter)
+		return true;
+
+	return false;
+}
+
 static struct kvm_pmc *intel_rdpmc_ecx_to_pmc(struct kvm_vcpu *vcpu,
 					    unsigned int idx, u64 *mask)
 {
@@ -885,6 +896,7 @@ struct kvm_pmu_ops intel_pmu_ops __initdata = {
 	.passthrough_pmu_msrs = intel_passthrough_pmu_msrs,
 	.save_pmu_context = intel_save_guest_pmu_context,
 	.restore_pmu_context = intel_restore_guest_pmu_context,
+	.incr_counter = intel_incr_counter,
 	.EVENTSEL_EVENT = ARCH_PERFMON_EVENTSEL_EVENT,
 	.MAX_NR_GP_COUNTERS = KVM_INTEL_PMC_MAX_GENERIC,
 	.MIN_NR_GP_COUNTERS = 1,
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 43/58] KVM: x86/pmu: Introduce PMU operator for setting counter overflow
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (41 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 42/58] KVM: x86/pmu: Introduce PMU operator to increment counter Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-10-25 16:16   ` Chen, Zide
  2024-08-01  4:58 ` [RFC PATCH v3 44/58] KVM: x86/pmu: Implement emulated counter increment for passthrough PMU Mingwei Zhang
                   ` (17 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

Introduce a PMU operator for setting counter overflow. When emulating a
counter increment, multiple counters could overflow at the same time,
i.e., during the execution of the same instruction. In the passthrough
PMU, having a PMU operator makes it convenient to update the PMU global
status in one shot, with the details hidden behind the vendor-specific
implementation.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/include/asm/kvm-x86-pmu-ops.h | 1 +
 arch/x86/kvm/pmu.h                     | 1 +
 arch/x86/kvm/vmx/pmu_intel.c           | 5 +++++
 3 files changed, 7 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h
index 72ca78df8d2b..bd5b118a5ce5 100644
--- a/arch/x86/include/asm/kvm-x86-pmu-ops.h
+++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h
@@ -28,6 +28,7 @@ KVM_X86_PMU_OP_OPTIONAL(passthrough_pmu_msrs)
 KVM_X86_PMU_OP_OPTIONAL(save_pmu_context)
 KVM_X86_PMU_OP_OPTIONAL(restore_pmu_context)
 KVM_X86_PMU_OP_OPTIONAL(incr_counter)
+KVM_X86_PMU_OP_OPTIONAL(set_overflow)
 
 #undef KVM_X86_PMU_OP
 #undef KVM_X86_PMU_OP_OPTIONAL
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 325f17673a00..78a7f0c5f3ba 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -45,6 +45,7 @@ struct kvm_pmu_ops {
 	void (*save_pmu_context)(struct kvm_vcpu *vcpu);
 	void (*restore_pmu_context)(struct kvm_vcpu *vcpu);
 	bool (*incr_counter)(struct kvm_pmc *pmc);
+	void (*set_overflow)(struct kvm_vcpu *vcpu);
 
 	const u64 EVENTSEL_EVENT;
 	const int MAX_NR_GP_COUNTERS;
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 42af2404bdb9..2d46c911f0b7 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -881,6 +881,10 @@ static void intel_restore_guest_pmu_context(struct kvm_vcpu *vcpu)
 	wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl_hw);
 }
 
+static void intel_set_overflow(struct kvm_vcpu *vcpu)
+{
+}
+
 struct kvm_pmu_ops intel_pmu_ops __initdata = {
 	.rdpmc_ecx_to_pmc = intel_rdpmc_ecx_to_pmc,
 	.msr_idx_to_pmc = intel_msr_idx_to_pmc,
@@ -897,6 +901,7 @@ struct kvm_pmu_ops intel_pmu_ops __initdata = {
 	.save_pmu_context = intel_save_guest_pmu_context,
 	.restore_pmu_context = intel_restore_guest_pmu_context,
 	.incr_counter = intel_incr_counter,
+	.set_overflow = intel_set_overflow,
 	.EVENTSEL_EVENT = ARCH_PERFMON_EVENTSEL_EVENT,
 	.MAX_NR_GP_COUNTERS = KVM_INTEL_PMC_MAX_GENERIC,
 	.MIN_NR_GP_COUNTERS = 1,
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 44/58] KVM: x86/pmu: Implement emulated counter increment for passthrough PMU
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (42 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 43/58] KVM: x86/pmu: Introduce PMU operator for setting counter overflow Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-11-20 20:13   ` Sean Christopherson
  2024-08-01  4:58 ` [RFC PATCH v3 45/58] KVM: x86/pmu: Update pmc_{read,write}_counter() to disconnect perf API Mingwei Zhang
                   ` (16 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

Implement the emulated counter increment for the passthrough PMU under
KVM_REQ_PMU. Defer the counter increment to the KVM_REQ_PMU handler
because counter increment requests come from kvm_pmu_trigger_event(),
which can be triggered within the KVM_RUN inner loop or outside of it.
This means the counter increment could happen before or after the PMU
context switch.

Processing the counter increment in one place keeps the implementation
simple.
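
Condensed, the two halves of the emulated increment path look roughly
like this (sketch of the diff below, with pmu == pmc_to_pmu(pmc)):

  /* kvm_pmu_trigger_event() (instruction emulation, passthrough case): */
  if (static_call(kvm_x86_pmu_incr_counter)(pmc)) {      /* counter wrapped */
          __set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
          kvm_make_request(KVM_REQ_PMU, vcpu);
          if (pmc->eventsel & ARCH_PERFMON_EVENTSEL_INT)
                  set_bit(pmc->idx, (unsigned long *)&pmu->reprogram_pmi);
  }

  /* KVM_REQ_PMU handler (single place, regardless of whether the increment
   * happened before or after the PMU context switch):
   */
  static_call_cond(kvm_x86_pmu_set_overflow)(vcpu);
  if (atomic64_read(&pmu->__reprogram_pmi)) {
          kvm_make_request(KVM_REQ_PMI, vcpu);            /* inject the PMI */
          atomic64_set(&pmu->__reprogram_pmi, 0ull);
  }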

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/kvm/pmu.c | 41 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 39 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 5cc539bdcc7e..41057d0122bd 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -510,6 +510,18 @@ static int reprogram_counter(struct kvm_pmc *pmc)
 				     eventsel & ARCH_PERFMON_EVENTSEL_INT);
 }
 
+static void kvm_pmu_handle_event_in_passthrough_pmu(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+	static_call_cond(kvm_x86_pmu_set_overflow)(vcpu);
+
+	if (atomic64_read(&pmu->__reprogram_pmi)) {
+		kvm_make_request(KVM_REQ_PMI, vcpu);
+		atomic64_set(&pmu->__reprogram_pmi, 0ull);
+	}
+}
+
 void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
 {
 	DECLARE_BITMAP(bitmap, X86_PMC_IDX_MAX);
@@ -517,6 +529,9 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
 	struct kvm_pmc *pmc;
 	int bit;
 
+	if (is_passthrough_pmu_enabled(vcpu))
+		return kvm_pmu_handle_event_in_passthrough_pmu(vcpu);
+
 	bitmap_copy(bitmap, pmu->reprogram_pmi, X86_PMC_IDX_MAX);
 
 	/*
@@ -848,6 +863,17 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
 	kvm_pmu_reset(vcpu);
 }
 
+static void kvm_passthrough_pmu_incr_counter(struct kvm_vcpu *vcpu, struct kvm_pmc *pmc)
+{
+	if (static_call(kvm_x86_pmu_incr_counter)(pmc)) {
+		__set_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->global_status);
+		kvm_make_request(KVM_REQ_PMU, vcpu);
+
+		if (pmc->eventsel & ARCH_PERFMON_EVENTSEL_INT)
+			set_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->reprogram_pmi);
+	}
+}
+
 static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
 {
 	pmc->emulated_counter++;
@@ -880,7 +906,8 @@ static inline bool cpl_is_matched(struct kvm_pmc *pmc)
 	return (static_call(kvm_x86_get_cpl)(pmc->vcpu) == 0) ? select_os : select_user;
 }
 
-void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel)
+static void __kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel,
+				    bool is_passthrough)
 {
 	DECLARE_BITMAP(bitmap, X86_PMC_IDX_MAX);
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -914,9 +941,19 @@ void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel)
 		    !pmc_event_is_allowed(pmc) || !cpl_is_matched(pmc))
 			continue;
 
-		kvm_pmu_incr_counter(pmc);
+		if (is_passthrough)
+			kvm_passthrough_pmu_incr_counter(vcpu, pmc);
+		else
+			kvm_pmu_incr_counter(pmc);
 	}
 }
+
+void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel)
+{
+	bool is_passthrough = is_passthrough_pmu_enabled(vcpu);
+
+	__kvm_pmu_trigger_event(vcpu, eventsel, is_passthrough);
+}
 EXPORT_SYMBOL_GPL(kvm_pmu_trigger_event);
 
 static bool is_masked_filter_valid(const struct kvm_x86_pmu_event_filter *filter)
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 45/58] KVM: x86/pmu: Update pmc_{read,write}_counter() to disconnect perf API
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (43 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 44/58] KVM: x86/pmu: Implement emulated counter increment for passthrough PMU Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-11-20 20:19   ` Sean Christopherson
  2024-08-01  4:58 ` [RFC PATCH v3 46/58] KVM: x86/pmu: Disconnect counter reprogram logic from passthrough PMU Mingwei Zhang
                   ` (15 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

Update pmc_{read,write}_counter() to disconnect from the perf API
because the passthrough PMU does not use the host PMU as its backend.
Because of that, pmc->counter directly contains the actual guest value
when it is set by the host (VMM) side.
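
For example, the read path boils down to (sketch of the diff below):

  /* Passthrough: the cached value *is* the guest counter value */
  if (pmc_to_pmu(pmc)->passthrough)
          return pmc->counter & pmc_bitmask(pmc);

  /* Legacy vPMU (unchanged code below this point): combine pmc->counter,
   * emulated_counter and the value read back from the backing perf event.
   */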

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/kvm/pmu.c | 5 +++++
 arch/x86/kvm/pmu.h | 4 ++++
 2 files changed, 9 insertions(+)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 41057d0122bd..3604cf467b34 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -322,6 +322,11 @@ static void pmc_update_sample_period(struct kvm_pmc *pmc)
 
 void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
 {
+	if (pmc_to_pmu(pmc)->passthrough) {
+		pmc->counter = val;
+		return;
+	}
+
 	/*
 	 * Drop any unconsumed accumulated counts, the WRMSR is a write, not a
 	 * read-modify-write.  Adjust the counter value so that its value is
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 78a7f0c5f3ba..7e006cb61296 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -116,6 +116,10 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
 {
 	u64 counter, enabled, running;
 
+	counter = pmc->counter;
+	if (pmc_to_pmu(pmc)->passthrough)
+		return counter & pmc_bitmask(pmc);
+
 	counter = pmc->counter + pmc->emulated_counter;
 
 	if (pmc->perf_event && !pmc->is_paused)
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 46/58] KVM: x86/pmu: Disconnect counter reprogram logic from passthrough PMU
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (44 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 45/58] KVM: x86/pmu: Update pmc_{read,write}_counter() to disconnect perf API Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-11-20 20:40   ` Sean Christopherson
  2024-08-01  4:58 ` [RFC PATCH v3 47/58] KVM: nVMX: Add nested virtualization support for " Mingwei Zhang
                   ` (14 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

Disconnect the counter reprogramming logic, because the passthrough PMU
never uses the host PMU nor the perf API for anything. Instead, when the
passthrough PMU is enabled, reaching the counter reprogramming path
should be treated as an error.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 arch/x86/kvm/pmu.c | 3 +++
 arch/x86/kvm/pmu.h | 8 ++++++++
 2 files changed, 11 insertions(+)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 3604cf467b34..fcd188cc389a 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -478,6 +478,9 @@ static int reprogram_counter(struct kvm_pmc *pmc)
 	bool emulate_overflow;
 	u8 fixed_ctr_ctrl;
 
+	if (WARN_ONCE(pmu->passthrough, "Passthrough PMU never reprogram counter\n"))
+		return 0;
+
 	emulate_overflow = pmc_pause_counter(pmc);
 
 	if (!pmc_event_is_allowed(pmc))
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 7e006cb61296..10553bc1ae1d 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -256,6 +256,10 @@ static inline void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops)
 
 static inline void kvm_pmu_request_counter_reprogram(struct kvm_pmc *pmc)
 {
+	/* Passthrough PMU never reprogram counters via KVM_REQ_PMU. */
+	if (pmc_to_pmu(pmc)->passthrough)
+		return;
+
 	set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi);
 	kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
 }
@@ -264,6 +268,10 @@ static inline void reprogram_counters(struct kvm_pmu *pmu, u64 diff)
 {
 	int bit;
 
+	/* Passthrough PMU never reprogram counters via KVM_REQ_PMU. */
+	if (pmu->passthrough)
+		return;
+
 	if (!diff)
 		return;
 
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 47/58] KVM: nVMX: Add nested virtualization support for passthrough PMU
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (45 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 46/58] KVM: x86/pmu: Disconnect counter reprogram logic from passthrough PMU Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-11-20 20:52   ` Sean Christopherson
  2024-08-01  4:58 ` [RFC PATCH v3 48/58] perf/x86/intel: Support PERF_PMU_CAP_PASSTHROUGH_VPMU Mingwei Zhang
                   ` (13 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

Add nested virtualization support for the passthrough PMU by combining
the MSR interception bitmaps of vmcs01 and vmcs12. Readers may argue
that even without this patch, nested virtualization works for the
passthrough PMU because L1 will see PerfMon v2 and will have to fall
back to the legacy vPMU implementation if it is Linux. However, any
assumption made about L1 may be invalid, e.g., L1 may not even be Linux.

If both L0 and L1 pass through the PMU MSRs, the correct behavior is to
let MSR accesses from L2 directly touch the HW MSRs, since both L0 and
L1 pass through the access.

However, in the current implementation, without adding anything for
nested, KVM always sets the MSR interception bits in vmcs02. As a
result, L0 emulates all MSR reads/writes for L2, leading to errors,
since the current passthrough vPMU never implements set_msr() and
get_msr() for any counter access except accesses from the VMM side.

So fix the issue by setting up the correct MSR interception for the PMU
MSRs.
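
The merge reuses the existing nested_vmx_set_intercept_for_msr() helper;
for every PMU MSR owned by the guest, the intent is roughly (sketch of
the diff below, shown for one MSR):

  /*
   * vmcs02 disables interception for a PMU MSR only when L0 already
   * passes it through (the vmcs01 setting carried into the vmcs02
   * bitmap) and L1 does not intercept it in vmcs12; otherwise L0 keeps
   * intercepting on behalf of L2.
   */
  nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
                                   MSR_CORE_PERF_GLOBAL_CTRL, MSR_TYPE_RW);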

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/vmx/nested.c | 52 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 643935a0f70a..ef385f9e7513 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -612,6 +612,55 @@ static inline void nested_vmx_set_intercept_for_msr(struct vcpu_vmx *vmx,
 						   msr_bitmap_l0, msr);
 }
 
+/* Pass PMU MSRs to nested VM if L0 and L1 are set to passthrough. */
+static void nested_vmx_set_passthru_pmu_intercept_for_msr(struct kvm_vcpu *vcpu,
+							  unsigned long *msr_bitmap_l1,
+							  unsigned long *msr_bitmap_l0)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	int i;
+
+	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
+		nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
+						 msr_bitmap_l0,
+						 MSR_ARCH_PERFMON_EVENTSEL0 + i,
+						 MSR_TYPE_RW);
+		nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
+						 msr_bitmap_l0,
+						 MSR_IA32_PERFCTR0 + i,
+						 MSR_TYPE_RW);
+		nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
+						 msr_bitmap_l0,
+						 MSR_IA32_PMC0 + i,
+						 MSR_TYPE_RW);
+	}
+
+	for (i = 0; i < vcpu_to_pmu(vcpu)->nr_arch_fixed_counters; i++) {
+		nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
+						 msr_bitmap_l0,
+						 MSR_CORE_PERF_FIXED_CTR0 + i,
+						 MSR_TYPE_RW);
+	}
+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
+					 msr_bitmap_l0,
+					 MSR_CORE_PERF_FIXED_CTR_CTRL,
+					 MSR_TYPE_RW);
+
+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
+					 msr_bitmap_l0,
+					 MSR_CORE_PERF_GLOBAL_STATUS,
+					 MSR_TYPE_RW);
+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
+					 msr_bitmap_l0,
+					 MSR_CORE_PERF_GLOBAL_CTRL,
+					 MSR_TYPE_RW);
+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
+					 msr_bitmap_l0,
+					 MSR_CORE_PERF_GLOBAL_OVF_CTRL,
+					 MSR_TYPE_RW);
+}
+
 /*
  * Merge L0's and L1's MSR bitmap, return false to indicate that
  * we do not use the hardware.
@@ -713,6 +762,9 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
 	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
 					 MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
 
+	if (is_passthrough_pmu_enabled(vcpu))
+		nested_vmx_set_passthru_pmu_intercept_for_msr(vcpu, msr_bitmap_l1, msr_bitmap_l0);
+
 	kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
 
 	vmx->nested.force_msr_bitmap_recalc = false;
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 48/58] perf/x86/intel: Support PERF_PMU_CAP_PASSTHROUGH_VPMU
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (46 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 47/58] KVM: nVMX: Add nested virtualization support for " Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-08-02 17:50   ` Liang, Kan
  2024-08-01  4:58 ` [RFC PATCH v3 49/58] KVM: x86/pmu/svm: Set passthrough capability for vcpus Mingwei Zhang
                   ` (12 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Kan Liang <kan.liang@linux.intel.com>

Apply the PERF_PMU_CAP_PASSTHROUGH_VPMU capability to the Intel core
PMU. It only indicates that the perf side of the core PMU is ready to
support the passthrough vPMU. Besides this capability, the hypervisor
still needs to check the PMU version and other capabilities to decide
whether to enable the passthrough vPMU.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/events/intel/core.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 38c1b1f1deaa..d5bb7d4ed062 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4743,6 +4743,8 @@ static void intel_pmu_check_hybrid_pmus(struct x86_hybrid_pmu *pmu)
 	else
 		pmu->pmu.capabilities &= ~PERF_PMU_CAP_AUX_OUTPUT;
 
+	pmu->pmu.capabilities |= PERF_PMU_CAP_PASSTHROUGH_VPMU;
+
 	intel_pmu_check_event_constraints(pmu->event_constraints,
 					  pmu->num_counters,
 					  pmu->num_counters_fixed,
@@ -6235,6 +6237,9 @@ __init int intel_pmu_init(void)
 			pr_cont(" AnyThread deprecated, ");
 	}
 
+	/* The perf side of core PMU is ready to support the passthrough vPMU. */
+	x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_PASSTHROUGH_VPMU;
+
 	/*
 	 * Install the hw-cache-events table:
 	 */
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 49/58] KVM: x86/pmu/svm: Set passthrough capability for vcpus
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (47 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 48/58] perf/x86/intel: Support PERF_PMU_CAP_PASSTHROUGH_VPMU Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-08-01  4:58 ` [RFC PATCH v3 50/58] KVM: x86/pmu/svm: Set enable_passthrough_pmu module parameter Mingwei Zhang
                   ` (11 subsequent siblings)
  60 siblings, 0 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Sandipan Das <sandipan.das@amd.com>

Pass on the passthrough PMU setting from kvm->arch into kvm_pmu for each
vcpu. As long as the host supports PerfMonV2, the guest PMU version does
not matter.

Note that guest vCPUs without a local APIC do not allocate an instance
of struct kvm_lapic, so reading the guest LVTPC before switching over to
the PMI vector would result in a NULL pointer dereference. Such vCPUs
also cannot receive PMIs. Hence, disable passthrough mode in such cases.

Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/svm/pmu.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index 64060cbd8210..0a16f0eb2511 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -211,6 +211,8 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
 	pmu->counter_bitmask[KVM_PMC_FIXED] = 0;
 	pmu->nr_arch_fixed_counters = 0;
 	bitmap_set(pmu->all_valid_pmc_idx, 0, pmu->nr_arch_gp_counters);
+	pmu->passthrough = vcpu->kvm->arch.enable_passthrough_pmu &&
+			   lapic_in_kernel(vcpu);
 
 	if (pmu->version > 1 || guest_cpuid_has(vcpu, X86_FEATURE_PERFCTR_CORE)) {
 		for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 50/58] KVM: x86/pmu/svm: Set enable_passthrough_pmu module parameter
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (48 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 49/58] KVM: x86/pmu/svm: Set passthrough capability for vcpus Mingwei Zhang
@ 2024-08-01  4:58 ` Mingwei Zhang
  2024-08-01  4:59 ` [RFC PATCH v3 51/58] KVM: x86/pmu/svm: Allow RDPMC pass through when all counters exposed to guest Mingwei Zhang
                   ` (10 subsequent siblings)
  60 siblings, 0 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Sandipan Das <sandipan.das@amd.com>

Since the passthrough PMU can also be used on some AMD platforms, set
the "enable_passthrough_pmu" KVM kernel module parameter to true when
the following conditions are met (see the condensed check sketched after
this list):
 - the parameter is set to true when the module is loaded
 - enable_pmu is true
 - the host is running on an AMD CPU
 - the CPU supports PerfMonV2
 - the host PMU supports passthrough mode
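
A condensed form of the resulting check (sketch only; the actual logic
lives in kvm_init_pmu_capability() in the diff below, with is_intel and
is_amd derived from boot_cpu_data.x86_vendor):

  if (!enable_pmu || !kvm_pmu_cap.passthrough ||
      (is_intel && kvm_pmu_cap.version < 4) ||
      (is_amd && kvm_pmu_cap.version < 2))
          enable_passthrough_pmu = false;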

Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/pmu.h     | 22 ++++++++++++++--------
 arch/x86/kvm/svm/svm.c |  2 ++
 2 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 10553bc1ae1d..9fb3ddfd3a10 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -196,6 +196,7 @@ extern struct kvm_pmu_emulated_event_selectors kvm_pmu_eventsel;
 static inline void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops)
 {
 	bool is_intel = boot_cpu_data.x86_vendor == X86_VENDOR_INTEL;
+	bool is_amd = boot_cpu_data.x86_vendor == X86_VENDOR_AMD;
 	int min_nr_gp_ctrs = pmu_ops->MIN_NR_GP_COUNTERS;
 
 	/*
@@ -223,18 +224,23 @@ static inline void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops)
 			enable_pmu = false;
 	}
 
-	/* Pass-through vPMU is only supported in Intel CPUs. */
-	if (!is_intel)
+	/* Pass-through vPMU is only supported in Intel and AMD CPUs. */
+	if (!is_intel && !is_amd)
 		enable_passthrough_pmu = false;
 
 	/*
-	 * Pass-through vPMU requires at least PerfMon version 4 because the
-	 * implementation requires the usage of MSR_CORE_PERF_GLOBAL_STATUS_SET
-	 * for counter emulation as well as PMU context switch.  In addition, it
-	 * requires host PMU support on passthrough mode. Disable pass-through
-	 * vPMU if any condition fails.
+	 * On Intel platforms, pass-through vPMU requires at least PerfMon
+	 * version 4 because the implementation requires the usage of
+	 * MSR_CORE_PERF_GLOBAL_STATUS_SET for counter emulation as well as
+	 * PMU context switch.  In addition, it requires host PMU support on
+	 * passthrough mode. Disable pass-through vPMU if any condition fails.
+	 *
+	 * On AMD platforms, pass-through vPMU requires at least PerfMonV2
+	 * because MSR_PERF_CNTR_GLOBAL_STATUS_SET is required.
 	 */
-	if (!enable_pmu || kvm_pmu_cap.version < 4 || !kvm_pmu_cap.passthrough)
+	if (!enable_pmu || !kvm_pmu_cap.passthrough ||
+	    (is_intel && kvm_pmu_cap.version < 4) ||
+	    (is_amd && kvm_pmu_cap.version < 2))
 		enable_passthrough_pmu = false;
 
 	if (!enable_pmu) {
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 296c524988f9..12868b7e6f51 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -239,6 +239,8 @@ module_param(intercept_smi, bool, 0444);
 bool vnmi = true;
 module_param(vnmi, bool, 0444);
 
+module_param(enable_passthrough_pmu, bool, 0444);
+
 static bool svm_gp_erratum_intercept = true;
 
 static u8 rsm_ins_bytes[] = "\x0f\xaa";
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 51/58] KVM: x86/pmu/svm: Allow RDPMC pass through when all counters exposed to guest
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (49 preceding siblings ...)
  2024-08-01  4:58 ` [RFC PATCH v3 50/58] KVM: x86/pmu/svm: Set enable_passthrough_pmu module parameter Mingwei Zhang
@ 2024-08-01  4:59 ` Mingwei Zhang
  2024-08-01  4:59 ` [RFC PATCH v3 52/58] KVM: x86/pmu/svm: Implement callback to disable MSR interception Mingwei Zhang
                   ` (9 subsequent siblings)
  60 siblings, 0 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:59 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Sandipan Das <sandipan.das@amd.com>

If the passthrough PMU is enabled and all counters are exposed to the
guest, clear the RDPMC interception bit in the VMCB Control Area (byte
offset 0xc, bit 15) to let RDPMC instructions proceed without VM-Exits.
This improves guest PMU performance in passthrough mode. If either
condition is not satisfied, intercept RDPMC to prevent the guest from
accessing unexposed counters.

Note that on AMD platforms, passing through RDPMC only allows guests to
read the general-purpose counters. Details about the RDPMC interception
bit can be found in Appendix B, "Layout of VMCB", of the AMD64
Architecture Programmer's Manual Volume 2.

Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/svm/svm.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 12868b7e6f51..fc78f34832ca 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1229,6 +1229,11 @@ static inline void init_vmcb_after_set_cpuid(struct kvm_vcpu *vcpu)
 		/* No need to intercept these MSRs */
 		set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SYSENTER_EIP, 1, 1);
 		set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SYSENTER_ESP, 1, 1);
+
+		if (kvm_pmu_check_rdpmc_passthrough(vcpu))
+			svm_clr_intercept(svm, INTERCEPT_RDPMC);
+		else
+			svm_set_intercept(svm, INTERCEPT_RDPMC);
 	}
 }
 
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 52/58] KVM: x86/pmu/svm: Implement callback to disable MSR interception
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (50 preceding siblings ...)
  2024-08-01  4:59 ` [RFC PATCH v3 51/58] KVM: x86/pmu/svm: Allow RDPMC pass through when all counters exposed to guest Mingwei Zhang
@ 2024-08-01  4:59 ` Mingwei Zhang
  2024-11-20 21:02   ` Sean Christopherson
  2024-08-01  4:59 ` [RFC PATCH v3 53/58] KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest write to event selectors Mingwei Zhang
                   ` (8 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:59 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Sandipan Das <sandipan.das@amd.com>

Implement the AMD-specific callback for the passthrough PMU that disables
interception of PMU-related MSRs when the guest PMU counters meet the
requirements for passthrough. The affected PMU registers are the following:
 - PerfCntrGlobalStatus (MSR 0xc0000300)
 - PerfCntrGlobalCtl (MSR 0xc0000301)
 - PerfCntrGlobalStatusClr (MSR 0xc0000302)
 - PerfCntrGlobalStatusSet (MSR 0xc0000303)
 - PERF_CTLx and PERF_CTRx pairs (MSRs 0xc0010200..0xc001020b)

Note that the passthrough/interception setup is invoked after each CPUID
update. Since CPUID can be set multiple times, explicitly intercept or clear
the bitmap for each counter as well as for the global registers on every
invocation.

Note that even if the host is PerfCtrCore or PerfMonV2 capable, a guest
should still be able to use the four K7 legacy counters. Disable
interception of these MSRs in passthrough mode.
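
For reference, a purely illustrative sketch of the per-counter policy this
patch applies to the legacy MSRs. The function name below is made up;
set_msr_interception() is the existing svm.c helper whose last two arguments
are read/write flags, where 1 means "do not intercept" (direct guest access)
and 0 means "intercept", as in the "No need to intercept these MSRs" calls
already in svm.c:

	static void sketch_legacy_counter_policy(struct kvm_vcpu *vcpu, struct vcpu_svm *svm)
	{
		/* The event selector stays intercepted so KVM can rewrite it. */
		set_msr_interception(vcpu, svm->msrpm, MSR_K7_EVNTSEL0, 0, 0);
		/* The counter value is read and written directly by the guest. */
		set_msr_interception(vcpu, svm->msrpm, MSR_K7_PERFCTR0, 1, 1);
	}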

Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/svm/pmu.c | 55 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index 0a16f0eb2511..cc03c3e9941f 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -248,6 +248,60 @@ static bool amd_is_rdpmc_passthru_allowed(struct kvm_vcpu *vcpu)
 	return true;
 }
 
+static void amd_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct vcpu_svm *svm = to_svm(vcpu);
+	int msr_clear = !!(is_passthrough_pmu_enabled(vcpu));
+	int i;
+
+	for (i = 0; i < min(pmu->nr_arch_gp_counters, AMD64_NUM_COUNTERS); i++) {
+		/*
+		 * Legacy counters are always available irrespective of any
+		 * CPUID feature bits and when X86_FEATURE_PERFCTR_CORE is set,
+		 * PERF_LEGACY_CTLx and PERF_LEGACY_CTRx registers are mirrored
+		 * with PERF_CTLx and PERF_CTRx respectively.
+		 */
+		set_msr_interception(vcpu, svm->msrpm, MSR_K7_EVNTSEL0 + i, 0, 0);
+		set_msr_interception(vcpu, svm->msrpm, MSR_K7_PERFCTR0 + i, msr_clear, msr_clear);
+	}
+
+	for (i = 0; i < kvm_pmu_cap.num_counters_gp; i++) {
+		/*
+		 * PERF_CTLx registers require interception in order to clear
+		 * HostOnly bit and set GuestOnly bit. This is to prevent the
+		 * PERF_CTRx registers from counting before VM entry and after
+		 * VM exit.
+		 */
+		set_msr_interception(vcpu, svm->msrpm, MSR_F15H_PERF_CTL + 2 * i, 0, 0);
+
+		/*
+		 * Pass through counters exposed to the guest and intercept
+		 * counters that are unexposed. Do this explicitly since this
+		 * function may be set multiple times before vcpu runs.
+		 */
+		if (i >= pmu->nr_arch_gp_counters)
+			msr_clear = 0;
+		set_msr_interception(vcpu, svm->msrpm, MSR_F15H_PERF_CTR + 2 * i, msr_clear, msr_clear);
+	}
+
+	/*
+	 * In mediated passthrough vPMU, intercept global PMU MSRs when guest
+	 * PMU only owns a subset of counters provided in HW or its version is
+	 * less than 2.
+	 */
+	if (is_passthrough_pmu_enabled(vcpu) && pmu->version > 1 &&
+	    pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp)
+		msr_clear = 1;
+	else
+		msr_clear = 0;
+
+	set_msr_interception(vcpu, svm->msrpm, MSR_AMD64_PERF_CNTR_GLOBAL_CTL, msr_clear, msr_clear);
+	set_msr_interception(vcpu, svm->msrpm, MSR_AMD64_PERF_CNTR_GLOBAL_STATUS, msr_clear, msr_clear);
+	set_msr_interception(vcpu, svm->msrpm, MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR, msr_clear, msr_clear);
+	set_msr_interception(vcpu, svm->msrpm, MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET, msr_clear, msr_clear);
+}
+
 struct kvm_pmu_ops amd_pmu_ops __initdata = {
 	.rdpmc_ecx_to_pmc = amd_rdpmc_ecx_to_pmc,
 	.msr_idx_to_pmc = amd_msr_idx_to_pmc,
@@ -258,6 +312,7 @@ struct kvm_pmu_ops amd_pmu_ops __initdata = {
 	.refresh = amd_pmu_refresh,
 	.init = amd_pmu_init,
 	.is_rdpmc_passthru_allowed = amd_is_rdpmc_passthru_allowed,
+	.passthrough_pmu_msrs = amd_passthrough_pmu_msrs,
 	.EVENTSEL_EVENT = AMD64_EVENTSEL_EVENT,
 	.MAX_NR_GP_COUNTERS = KVM_AMD_PMC_MAX_GENERIC,
 	.MIN_NR_GP_COUNTERS = AMD64_NUM_COUNTERS,
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 53/58] KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest write to event selectors
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (51 preceding siblings ...)
  2024-08-01  4:59 ` [RFC PATCH v3 52/58] KVM: x86/pmu/svm: Implement callback to disable MSR interception Mingwei Zhang
@ 2024-08-01  4:59 ` Mingwei Zhang
  2024-11-20 21:38   ` Sean Christopherson
  2024-08-01  4:59 ` [RFC PATCH v3 54/58] KVM: x86/pmu/svm: Add registers to direct access list Mingwei Zhang
                   ` (7 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:59 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Sandipan Das <sandipan.das@amd.com>

On AMD platforms, there is no way to restore PerfCntrGlobalCtl at
VM-Entry or clear it at VM-Exit. Since the register state is restored
before entering and saved after exiting guest context, the counters can
keep ticking, and even overflow, leading to chaos while still in host
context.

To avoid this, the PERF_CTLx MSRs (event selectors) are always
intercepted. KVM will always set the GuestOnly bit and clear the
HostOnly bit so that the counters run only in guest context even if
their enable bits are set. Intercepting these MSRs is also necessary
for guest event filtering.
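
For clarity, a minimal sketch of the translation applied below. The helper
name is illustrative; AMD64_EVENTSEL_GUESTONLY and AMD64_EVENTSEL_HOSTONLY
are the existing definitions (bits 40 and 41 of PERF_CTLx) from
arch/x86/include/asm/perf_event.h:

	/*
	 * Translate a guest-written event selector into the value KVM
	 * actually programs: HostOnly forced clear, GuestOnly forced set,
	 * so the counter only ticks while the vCPU runs in guest mode.
	 */
	static u64 sketch_guest_eventsel_to_hw(u64 data)
	{
		data &= ~AMD64_EVENTSEL_HOSTONLY;
		return data | AMD64_EVENTSEL_GUESTONLY;
	}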

Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/svm/pmu.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index cc03c3e9941f..2b7cc7616162 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -165,7 +165,12 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		data &= ~pmu->reserved_bits;
 		if (data != pmc->eventsel) {
 			pmc->eventsel = data;
-			kvm_pmu_request_counter_reprogram(pmc);
+			if (is_passthrough_pmu_enabled(vcpu)) {
+				data &= ~AMD64_EVENTSEL_HOSTONLY;
+				pmc->eventsel_hw = data | AMD64_EVENTSEL_GUESTONLY;
+			} else {
+				kvm_pmu_request_counter_reprogram(pmc);
+			}
 		}
 		return 0;
 	}
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 54/58] KVM: x86/pmu/svm: Add registers to direct access list
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (52 preceding siblings ...)
  2024-08-01  4:59 ` [RFC PATCH v3 53/58] KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest write to event selectors Mingwei Zhang
@ 2024-08-01  4:59 ` Mingwei Zhang
  2024-08-01  4:59 ` [RFC PATCH v3 55/58] KVM: x86/pmu/svm: Implement handlers to save and restore context Mingwei Zhang
                   ` (6 subsequent siblings)
  60 siblings, 0 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:59 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Sandipan Das <sandipan.das@amd.com>

Add all PMU-related MSRs (including legacy K7 MSRs) to the list of
possible direct access MSRs.  Most of them will not be intercepted when
using passthrough PMU.

Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/svm/svm.c | 24 ++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.h |  2 +-
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index fc78f34832ca..ff07f6ee867a 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -141,6 +141,30 @@ static const struct svm_direct_access_msrs {
 	{ .index = X2APIC_MSR(APIC_TMICT),		.always = false },
 	{ .index = X2APIC_MSR(APIC_TMCCT),		.always = false },
 	{ .index = X2APIC_MSR(APIC_TDCR),		.always = false },
+	{ .index = MSR_K7_EVNTSEL0,			.always = false },
+	{ .index = MSR_K7_PERFCTR0,			.always = false },
+	{ .index = MSR_K7_EVNTSEL1,			.always = false },
+	{ .index = MSR_K7_PERFCTR1,			.always = false },
+	{ .index = MSR_K7_EVNTSEL2,			.always = false },
+	{ .index = MSR_K7_PERFCTR2,			.always = false },
+	{ .index = MSR_K7_EVNTSEL3,			.always = false },
+	{ .index = MSR_K7_PERFCTR3,			.always = false },
+	{ .index = MSR_F15H_PERF_CTL0,			.always = false },
+	{ .index = MSR_F15H_PERF_CTR0,			.always = false },
+	{ .index = MSR_F15H_PERF_CTL1,			.always = false },
+	{ .index = MSR_F15H_PERF_CTR1,			.always = false },
+	{ .index = MSR_F15H_PERF_CTL2,			.always = false },
+	{ .index = MSR_F15H_PERF_CTR2,			.always = false },
+	{ .index = MSR_F15H_PERF_CTL3,			.always = false },
+	{ .index = MSR_F15H_PERF_CTR3,			.always = false },
+	{ .index = MSR_F15H_PERF_CTL4,			.always = false },
+	{ .index = MSR_F15H_PERF_CTR4,			.always = false },
+	{ .index = MSR_F15H_PERF_CTL5,			.always = false },
+	{ .index = MSR_F15H_PERF_CTR5,			.always = false },
+	{ .index = MSR_AMD64_PERF_CNTR_GLOBAL_CTL,	.always = false },
+	{ .index = MSR_AMD64_PERF_CNTR_GLOBAL_STATUS,	.always = false },
+	{ .index = MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR,	.always = false },
+	{ .index = MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET,	.always = false },
 	{ .index = MSR_INVALID,				.always = false },
 };
 
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 0f1472690b59..d096b405c9f3 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -30,7 +30,7 @@
 #define	IOPM_SIZE PAGE_SIZE * 3
 #define	MSRPM_SIZE PAGE_SIZE * 2
 
-#define MAX_DIRECT_ACCESS_MSRS	48
+#define MAX_DIRECT_ACCESS_MSRS	72
 #define MSRPM_OFFSETS	32
 extern u32 msrpm_offsets[MSRPM_OFFSETS] __read_mostly;
 extern bool npt_enabled;
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 55/58] KVM: x86/pmu/svm: Implement handlers to save and restore context
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (53 preceding siblings ...)
  2024-08-01  4:59 ` [RFC PATCH v3 54/58] KVM: x86/pmu/svm: Add registers to direct access list Mingwei Zhang
@ 2024-08-01  4:59 ` Mingwei Zhang
  2024-08-01  4:59 ` [RFC PATCH v3 56/58] KVM: x86/pmu/svm: Wire up PMU filtering functionality for passthrough PMU Mingwei Zhang
                   ` (5 subsequent siblings)
  60 siblings, 0 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:59 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Sandipan Das <sandipan.das@amd.com>

Implement the AMD-specific handlers to save and restore the state of
PMU-related MSRs when using passthrough PMU.

Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/svm/pmu.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index 2b7cc7616162..86818da66bbe 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -307,6 +307,36 @@ static void amd_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
 	set_msr_interception(vcpu, svm->msrpm, MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET, msr_clear, msr_clear);
 }
 
+static void amd_save_pmu_context(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+	rdmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, pmu->global_ctrl);
+	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, 0);
+	rdmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS, pmu->global_status);
+
+	/* Clear global status bits if non-zero */
+	if (pmu->global_status)
+		wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR, pmu->global_status);
+}
+
+static void amd_restore_pmu_context(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	u64 global_status;
+
+	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, 0);
+	rdmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS, global_status);
+
+	/* Clear host global_status MSR if non-zero. */
+	if (global_status)
+		wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR, global_status);
+
+	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET, pmu->global_status);
+
+	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, pmu->global_ctrl);
+}
+
 struct kvm_pmu_ops amd_pmu_ops __initdata = {
 	.rdpmc_ecx_to_pmc = amd_rdpmc_ecx_to_pmc,
 	.msr_idx_to_pmc = amd_msr_idx_to_pmc,
@@ -318,6 +348,8 @@ struct kvm_pmu_ops amd_pmu_ops __initdata = {
 	.init = amd_pmu_init,
 	.is_rdpmc_passthru_allowed = amd_is_rdpmc_passthru_allowed,
 	.passthrough_pmu_msrs = amd_passthrough_pmu_msrs,
+	.save_pmu_context = amd_save_pmu_context,
+	.restore_pmu_context = amd_restore_pmu_context,
 	.EVENTSEL_EVENT = AMD64_EVENTSEL_EVENT,
 	.MAX_NR_GP_COUNTERS = KVM_AMD_PMC_MAX_GENERIC,
 	.MIN_NR_GP_COUNTERS = AMD64_NUM_COUNTERS,
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 56/58] KVM: x86/pmu/svm: Wire up PMU filtering functionality for passthrough PMU
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (54 preceding siblings ...)
  2024-08-01  4:59 ` [RFC PATCH v3 55/58] KVM: x86/pmu/svm: Implement handlers to save and restore context Mingwei Zhang
@ 2024-08-01  4:59 ` Mingwei Zhang
  2024-11-20 21:39   ` Sean Christopherson
  2024-08-01  4:59 ` [RFC PATCH v3 57/58] KVM: x86/pmu/svm: Implement callback to increment counters Mingwei Zhang
                   ` (4 subsequent siblings)
  60 siblings, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:59 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Manali Shukla <manali.shukla@amd.com>

With the passthrough PMU enabled, the PERF_CTLx MSRs (event selectors) are
always intercepted, so the event filter check can be done directly inside
amd_pmu_set_msr().

Add a check that allows a write to a GP counter's event selector if and
only if the event is allowed by the PMU event filter.
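
For context only (not part of this patch): the filter consulted by
check_pmu_event_filter() is the one userspace installs through the existing
KVM_SET_PMU_EVENT_FILTER VM ioctl. A hypothetical VMM-side sketch, assuming
the usual <linux/kvm.h> definitions and an already-created vm_fd:

	struct kvm_pmu_event_filter *filter;
	size_t sz = sizeof(*filter) + sizeof(__u64);

	/* Allow-list with a single raw event, e.g. retired instructions on AMD. */
	filter = calloc(1, sz);
	filter->action = KVM_PMU_EVENT_ALLOW;
	filter->nevents = 1;
	filter->events[0] = 0xc0;
	if (ioctl(vm_fd, KVM_SET_PMU_EVENT_FILTER, filter))
		perror("KVM_SET_PMU_EVENT_FILTER");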

Signed-off-by: Manali Shukla <manali.shukla@amd.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/svm/pmu.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index 86818da66bbe..9f3e910ee453 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -166,6 +166,15 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (data != pmc->eventsel) {
 			pmc->eventsel = data;
 			if (is_passthrough_pmu_enabled(vcpu)) {
+				if (!check_pmu_event_filter(pmc)) {
+					/*
+					 * When the guest requests an invalid event,
+					 * stop the counter by clearing the
+					 * event selector MSR.
+					 */
+					pmc->eventsel_hw = 0;
+					return 0;
+				}
 				data &= ~AMD64_EVENTSEL_HOSTONLY;
 				pmc->eventsel_hw = data | AMD64_EVENTSEL_GUESTONLY;
 			} else {
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 57/58] KVM: x86/pmu/svm: Implement callback to increment counters
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (55 preceding siblings ...)
  2024-08-01  4:59 ` [RFC PATCH v3 56/58] KVM: x86/pmu/svm: Wire up PMU filtering functionality for passthrough PMU Mingwei Zhang
@ 2024-08-01  4:59 ` Mingwei Zhang
  2024-08-01  4:59 ` [RFC PATCH v3 58/58] perf/x86/amd: Support PERF_PMU_CAP_PASSTHROUGH_VPMU for AMD host Mingwei Zhang
                   ` (3 subsequent siblings)
  60 siblings, 0 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:59 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Sandipan Das <sandipan.das@amd.com>

Implement the AMD-specific callback for passthrough PMU that increments
counters for cases such as instruction emulation. A PMI will also be
injected if the increment results in an overflow.

Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/svm/pmu.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index 9f3e910ee453..70465903ef1e 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -346,6 +346,17 @@ static void amd_restore_pmu_context(struct kvm_vcpu *vcpu)
 	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, pmu->global_ctrl);
 }
 
+static bool amd_incr_counter(struct kvm_pmc *pmc)
+{
+	pmc->counter += 1;
+	pmc->counter &= pmc_bitmask(pmc);
+
+	if (!pmc->counter)
+		return true;
+
+	return false;
+}
+
 struct kvm_pmu_ops amd_pmu_ops __initdata = {
 	.rdpmc_ecx_to_pmc = amd_rdpmc_ecx_to_pmc,
 	.msr_idx_to_pmc = amd_msr_idx_to_pmc,
@@ -359,6 +370,7 @@ struct kvm_pmu_ops amd_pmu_ops __initdata = {
 	.passthrough_pmu_msrs = amd_passthrough_pmu_msrs,
 	.save_pmu_context = amd_save_pmu_context,
 	.restore_pmu_context = amd_restore_pmu_context,
+	.incr_counter = amd_incr_counter,
 	.EVENTSEL_EVENT = AMD64_EVENTSEL_EVENT,
 	.MAX_NR_GP_COUNTERS = KVM_AMD_PMC_MAX_GENERIC,
 	.MIN_NR_GP_COUNTERS = AMD64_NUM_COUNTERS,
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [RFC PATCH v3 58/58] perf/x86/amd: Support PERF_PMU_CAP_PASSTHROUGH_VPMU for AMD host
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (56 preceding siblings ...)
  2024-08-01  4:59 ` [RFC PATCH v3 57/58] KVM: x86/pmu/svm: Implement callback to increment counters Mingwei Zhang
@ 2024-08-01  4:59 ` Mingwei Zhang
  2024-09-11 10:45 ` [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Ma, Yongwei
                   ` (2 subsequent siblings)
  60 siblings, 0 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-08-01  4:59 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

From: Sandipan Das <sandipan.das@amd.com>

Apply the PERF_PMU_CAP_PASSTHROUGH_VPMU flag for version 2 and later
implementations of the core PMU. Aside from having Global Control and
Status registers, virtualizing the PMU using the passthrough model
requires an interface to set or clear the overflow bits in the Global
Status MSRs while restoring or saving the PMU context of a vCPU.

PerfMonV2-capable hardware has additional MSRs for this purpose, namely
PerfCntrGlobalStatusSet and PerfCntrGlobalStatusClr, making it suitable
for use with the passthrough PMU.
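
A rough sketch of how these MSRs would be used around the vCPU context
switch; this mirrors the AMD save/restore handlers added earlier in this
series, and the function name and guest_status parameter are illustrative
only:

	static void sketch_switch_in_guest_overflow_state(u64 guest_status)
	{
		u64 host_status;

		/* Snapshot and clear any host-side pending overflow bits. */
		rdmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS, host_status);
		if (host_status)
			wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR, host_status);

		/* Re-arm the overflow bits the guest had pending. */
		if (guest_status)
			wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET, guest_status);
	}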

Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/events/amd/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index 1fc4ce44e743..09f61821029f 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -1426,6 +1426,8 @@ static int __init amd_core_pmu_init(void)
 
 		amd_pmu_global_cntr_mask = (1ULL << x86_pmu.num_counters) - 1;
 
+		x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_PASSTHROUGH_VPMU;
+
 		/* Update PMC handling functions */
 		x86_pmu.enable_all = amd_pmu_v2_enable_all;
 		x86_pmu.disable_all = amd_pmu_v2_disable_all;
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 48/58] perf/x86/intel: Support PERF_PMU_CAP_PASSTHROUGH_VPMU
  2024-08-01  4:58 ` [RFC PATCH v3 48/58] perf/x86/intel: Support PERF_PMU_CAP_PASSTHROUGH_VPMU Mingwei Zhang
@ 2024-08-02 17:50   ` Liang, Kan
  0 siblings, 0 replies; 183+ messages in thread
From: Liang, Kan @ 2024-08-02 17:50 UTC (permalink / raw)
  To: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
	Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users



On 2024-08-01 12:58 a.m., Mingwei Zhang wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> Apply the PERF_PMU_CAP_PASSTHROUGH_VPMU for Intel core PMU. It only
> indicates that the perf side of core PMU is ready to support the
> passthrough vPMU.  Besides the capability, the hypervisor still needs to
> check the PMU version and other capabilities to decide whether to enable
> the passthrough vPMU.
> 
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/events/intel/core.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 38c1b1f1deaa..d5bb7d4ed062 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -4743,6 +4743,8 @@ static void intel_pmu_check_hybrid_pmus(struct x86_hybrid_pmu *pmu)
>  	else
>  		pmu->pmu.capabilities &= ~PERF_PMU_CAP_AUX_OUTPUT;
>  
> +	pmu->pmu.capabilities |= PERF_PMU_CAP_PASSTHROUGH_VPMU;
> +

Because of the simplification in patch 14, only one PASSTHROUGH vPMU is
supported on a machine. The flag therefore has to be moved out of the
hybrid-PMU path here. Otherwise, the hybrid PMUs would not be registered
even on bare metal.

>  	intel_pmu_check_event_constraints(pmu->event_constraints,
>  					  pmu->num_counters,
>  					  pmu->num_counters_fixed,
> @@ -6235,6 +6237,9 @@ __init int intel_pmu_init(void)
>  			pr_cont(" AnyThread deprecated, ");
>  	}
>  
> +	/* The perf side of core PMU is ready to support the passthrough vPMU. */
> +	x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_PASSTHROUGH_VPMU;
> +

It's good enough to only set the flag here.

Thanks,
Kan

>  	/*
>  	 * Install the hw-cache-events table:
>  	 */

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 26/58] KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL
  2024-08-01  4:58 ` [RFC PATCH v3 26/58] KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL Mingwei Zhang
@ 2024-08-06  7:04   ` Mi, Dapeng
  2024-10-24 20:26   ` Chen, Zide
  2024-11-19 18:16   ` Sean Christopherson
  2 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-08-06  7:04 UTC (permalink / raw)
  To: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 8/1/2024 12:58 PM, Mingwei Zhang wrote:
> From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>
> In PMU passthrough mode, there are three requirements to manage
> IA32_PERF_GLOBAL_CTRL:
>  - guest IA32_PERF_GLOBAL_CTRL MSR must be saved at vm exit.
>  - IA32_PERF_GLOBAL_CTRL MSR must be cleared at vm exit to prevent any
>    counter from running within the KVM run loop.
>  - guest IA32_PERF_GLOBAL_CTRL MSR must be restored at vm entry.
>
> Introduce a vmx_set_perf_global_ctrl() function to automatically switch
> IA32_PERF_GLOBAL_CTRL and invoke it after the VMM finishes setting up the
> CPUID bits.
>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/include/asm/vmx.h |   1 +
>  arch/x86/kvm/vmx/vmx.c     | 117 +++++++++++++++++++++++++++++++------
>  arch/x86/kvm/vmx/vmx.h     |   3 +-
>  3 files changed, 103 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index d77a31039f24..5ed89a099533 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -106,6 +106,7 @@
>  #define VM_EXIT_CLEAR_BNDCFGS                   0x00800000
>  #define VM_EXIT_PT_CONCEAL_PIP			0x01000000
>  #define VM_EXIT_CLEAR_IA32_RTIT_CTL		0x02000000
> +#define VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL      0x40000000
>  
>  #define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR	0x00036dff
>  
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 339742350b7a..34a420fa98c5 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4394,6 +4394,97 @@ static u32 vmx_pin_based_exec_ctrl(struct vcpu_vmx *vmx)
>  	return pin_based_exec_ctrl;
>  }
>  
> +static void vmx_set_perf_global_ctrl(struct vcpu_vmx *vmx)
> +{
> +	u32 vmentry_ctrl = vm_entry_controls_get(vmx);
> +	u32 vmexit_ctrl = vm_exit_controls_get(vmx);
> +	struct vmx_msrs *m;
> +	int i;
> +
> +	if (cpu_has_perf_global_ctrl_bug() ||
> +	    !is_passthrough_pmu_enabled(&vmx->vcpu)) {
> +		vmentry_ctrl &= ~VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
> +		vmexit_ctrl &= ~VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
> +		vmexit_ctrl &= ~VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL;
> +	}
> +
> +	if (is_passthrough_pmu_enabled(&vmx->vcpu)) {
> +		/*
> +		 * Setup auto restore guest PERF_GLOBAL_CTRL MSR at vm entry.
> +		 */
> +		if (vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) {
> +			vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, 0);
> +		} else {
> +			m = &vmx->msr_autoload.guest;
> +			i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
> +			if (i < 0) {
> +				i = m->nr++;
> +				vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
> +			}
> +			m->val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
> +			m->val[i].value = 0;

This function has a lot of duplicated code to initialize/clear the MSR
autoload/autostore regions; we may create two simple helpers to avoid the
duplication:

static inline void vmx_init_loadstore_msr(struct vmx_msrs *m, int idx, bool load);

static inline void vmx_clear_loadstore_msr(struct vmx_msrs *m, int idx, bool load);
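
A rough sketch of what such helpers could look like, purely illustrative
(the add/del names and the extra VMCS count-field parameter are assumptions,
not from the posted series; vmx_find_loadstore_msr_slot(), struct vmx_msrs
and vmcs_write32() are the existing VMX code used in the quoted patch):

static inline void vmx_add_loadstore_msr(struct vmx_msrs *m,
					 unsigned long count_field,
					 u32 msr, u64 val)
{
	int i = vmx_find_loadstore_msr_slot(m, msr);

	/* Grow the area and update the VMCS count on first use. */
	if (i < 0) {
		i = m->nr++;
		vmcs_write32(count_field, m->nr);
	}
	m->val[i].index = msr;
	m->val[i].value = val;
}

static inline void vmx_del_loadstore_msr(struct vmx_msrs *m,
					 unsigned long count_field,
					 u32 msr)
{
	int i = vmx_find_loadstore_msr_slot(m, msr);

	if (i < 0)
		return;

	/* Swap the last entry into the freed slot and shrink the count. */
	m->nr--;
	m->val[i] = m->val[m->nr];
	vmcs_write32(count_field, m->nr);
}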


> +		}
> +		/*
> +		 * Setup auto clear host PERF_GLOBAL_CTRL msr at vm exit.
> +		 */
> +		if (vmexit_ctrl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL) {
> +			vmcs_write64(HOST_IA32_PERF_GLOBAL_CTRL, 0);
> +		} else {
> +			m = &vmx->msr_autoload.host;
> +			i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
> +			if (i < 0) {
> +				i = m->nr++;
> +				vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->nr);
> +			}
> +			m->val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
> +			m->val[i].value = 0;
> +		}
> +		/*
> +		 * Setup auto save guest PERF_GLOBAL_CTRL msr at vm exit
> +		 */
> +		if (!(vmexit_ctrl & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL)) {
> +			m = &vmx->msr_autostore.guest;
> +			i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
> +			if (i < 0) {
> +				i = m->nr++;
> +				vmcs_write32(VM_EXIT_MSR_STORE_COUNT, m->nr);
> +			}
> +			m->val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
> +		}
> +	} else {
> +		if (!(vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL)) {
> +			m = &vmx->msr_autoload.guest;
> +			i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
> +			if (i >= 0) {
> +				m->nr--;
> +				m->val[i] = m->val[m->nr];
> +				vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
> +			}
> +		}
> +		if (!(vmexit_ctrl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL)) {
> +			m = &vmx->msr_autoload.host;
> +			i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
> +			if (i >= 0) {
> +				m->nr--;
> +				m->val[i] = m->val[m->nr];
> +				vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->nr);
> +			}
> +		}
> +		if (!(vmexit_ctrl & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL)) {
> +			m = &vmx->msr_autostore.guest;
> +			i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
> +			if (i >= 0) {
> +				m->nr--;
> +				m->val[i] = m->val[m->nr];
> +				vmcs_write32(VM_EXIT_MSR_STORE_COUNT, m->nr);
> +			}
> +		}
> +	}
> +
> +	vm_entry_controls_set(vmx, vmentry_ctrl);
> +	vm_exit_controls_set(vmx, vmexit_ctrl);
> +}
> +
>  static u32 vmx_vmentry_ctrl(void)
>  {
>  	u32 vmentry_ctrl = vmcs_config.vmentry_ctrl;
> @@ -4401,17 +4492,10 @@ static u32 vmx_vmentry_ctrl(void)
>  	if (vmx_pt_mode_is_system())
>  		vmentry_ctrl &= ~(VM_ENTRY_PT_CONCEAL_PIP |
>  				  VM_ENTRY_LOAD_IA32_RTIT_CTL);
> -	/*
> -	 * IA32e mode, and loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically.
> -	 */
> -	vmentry_ctrl &= ~(VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL |
> -			  VM_ENTRY_LOAD_IA32_EFER |
> -			  VM_ENTRY_IA32E_MODE);
> -
> -	if (cpu_has_perf_global_ctrl_bug())
> -		vmentry_ctrl &= ~VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
> -
> -	return vmentry_ctrl;
> +	 /*
> +	  * IA32e mode, and loading of EFER is toggled dynamically.
> +	  */
> +	return vmentry_ctrl &= ~(VM_ENTRY_LOAD_IA32_EFER | VM_ENTRY_IA32E_MODE);
>  }
>  
>  static u32 vmx_vmexit_ctrl(void)
> @@ -4429,12 +4513,8 @@ static u32 vmx_vmexit_ctrl(void)
>  		vmexit_ctrl &= ~(VM_EXIT_PT_CONCEAL_PIP |
>  				 VM_EXIT_CLEAR_IA32_RTIT_CTL);
>  
> -	if (cpu_has_perf_global_ctrl_bug())
> -		vmexit_ctrl &= ~VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
> -
> -	/* Loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically */
> -	return vmexit_ctrl &
> -		~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER);
> +	/* Loading of EFER is toggled dynamically */
> +	return vmexit_ctrl & ~VM_EXIT_LOAD_IA32_EFER;
>  }
>  
>  void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
> @@ -4777,6 +4857,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>  		vmcs_write64(VM_FUNCTION_CONTROL, 0);
>  
>  	vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
> +	vmcs_write64(VM_EXIT_MSR_STORE_ADDR, __pa(vmx->msr_autostore.guest.val));
>  	vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);
>  	vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host.val));
>  	vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 0);
> @@ -7916,6 +7997,8 @@ void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>  	else
>  		exec_controls_setbit(vmx, CPU_BASED_RDPMC_EXITING);
>  
> +	vmx_set_perf_global_ctrl(vmx);
> +
>  	/* Refresh #PF interception to account for MAXPHYADDR changes. */
>  	vmx_update_exception_bitmap(vcpu);
>  }
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 7b64e271a931..32e3974c1a2c 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -510,7 +510,8 @@ static inline u8 vmx_get_rvi(void)
>  	       VM_EXIT_LOAD_IA32_EFER |					\
>  	       VM_EXIT_CLEAR_BNDCFGS |					\
>  	       VM_EXIT_PT_CONCEAL_PIP |					\
> -	       VM_EXIT_CLEAR_IA32_RTIT_CTL)
> +	       VM_EXIT_CLEAR_IA32_RTIT_CTL |                            \
> +	       VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL)
>  
>  #define KVM_REQUIRED_VMX_PIN_BASED_VM_EXEC_CONTROL			\
>  	(PIN_BASED_EXT_INTR_MASK |					\

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 33/58] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  2024-08-01  4:58 ` [RFC PATCH v3 33/58] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU Mingwei Zhang
@ 2024-08-06  7:27   ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-08-06  7:27 UTC (permalink / raw)
  To: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 8/1/2024 12:58 PM, Mingwei Zhang wrote:
> Implement the save/restore of PMU state for the passthrough PMU on Intel.
> In passthrough mode, KVM exclusively owns the PMU HW while control flow is
> in the scope of the passthrough PMU. Thus, KVM needs to save the host PMU
> state and take full HW PMU ownership. Conversely, the host regains
> ownership of the PMU HW from KVM when control flow leaves the scope of the
> passthrough PMU.
>
> Implement PMU context switches for Intel CPUs and opportunistically use
> rdpmcl() instead of rdmsrl() when reading counters, since the former has
> lower latency on Intel CPUs.
>
> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
> ---
>  arch/x86/kvm/pmu.c           | 46 ++++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/vmx/pmu_intel.c | 41 +++++++++++++++++++++++++++++++-
>  2 files changed, 86 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 782b564bdf96..9bb733384069 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -1068,14 +1068,60 @@ void kvm_pmu_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
>  
>  void kvm_pmu_save_pmu_context(struct kvm_vcpu *vcpu)
>  {
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +	struct kvm_pmc *pmc;
> +	u32 i;
> +
>  	lockdep_assert_irqs_disabled();
>  
>  	static_call_cond(kvm_x86_pmu_save_pmu_context)(vcpu);
> +
> +	/*
> +	 * Clear the hardware selector MSR content and its counter to avoid
> +	 * leakage and also to prevent this guest GP counter from getting
> +	 * accidentally enabled while the host is running and enables global ctrl.
> +	 */
> +	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> +		pmc = &pmu->gp_counters[i];
> +		rdpmcl(i, pmc->counter);
> +		rdmsrl(pmc->msr_eventsel, pmc->eventsel);
> +		if (pmc->counter)
> +			wrmsrl(pmc->msr_counter, 0);
> +		if (pmc->eventsel)
> +			wrmsrl(pmc->msr_eventsel, 0);
> +	}
> +
> +	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> +		pmc = &pmu->fixed_counters[i];
> +		rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
> +		if (pmc->counter)
> +			wrmsrl(pmc->msr_counter, 0);
> +	}
>  }
>  
>  void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu)
>  {
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +	struct kvm_pmc *pmc;
> +	int i;
> +
>  	lockdep_assert_irqs_disabled();
>  
>  	static_call_cond(kvm_x86_pmu_restore_pmu_context)(vcpu);
> +
> +	/*
> +	 * No need to zero out unexposed GP/fixed counters/selectors since RDPMC
> +	 * in this case will be intercepted. Accessing to these counters and
> +	 * selectors will cause #GP in the guest.
> +	 */
> +	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> +		pmc = &pmu->gp_counters[i];
> +		wrmsrl(pmc->msr_counter, pmc->counter);
> +		wrmsrl(pmc->msr_eventsel, pmu->gp_counters[i].eventsel);

pmu->gp_counters[i].eventsel  -> pmc->eventsel


> +	}
> +
> +	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> +		pmc = &pmu->fixed_counters[i];
> +		wrmsrl(pmc->msr_counter, pmc->counter);
> +	}
>  }
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 0de918dc14ea..89c8f73a48c8 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -572,7 +572,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
>  	}
>  
>  	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> -		pmu->fixed_counters[i].msr_eventsel = MSR_CORE_PERF_FIXED_CTR_CTRL;
> +		pmu->fixed_counters[i].msr_eventsel = 0;

Seems unnecessary to clear msr_eventsel for fixed counters. Why?


>  		pmu->fixed_counters[i].msr_counter = MSR_CORE_PERF_FIXED_CTR0 + i;
>  	}
>  }
> @@ -799,6 +799,43 @@ static void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
>  	vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_GLOBAL_OVF_CTRL, MSR_TYPE_RW, msr_intercept);
>  }
>  
> +static void intel_save_guest_pmu_context(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +
> +	/* Global ctrl register is already saved at VM-exit. */
> +	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
> +	/* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
> +	if (pmu->global_status)
> +		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
> +
> +	rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> +	/*
> +	 * Clear the hardware FIXED_CTR_CTRL MSR to avoid information leakage
> +	 * and also to prevent these guest fixed counters from getting
> +	 * accidentally enabled while the host is running and enables global ctrl.
> +	 */
> +	if (pmu->fixed_ctr_ctrl)
> +		wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
> +}
> +
> +static void intel_restore_guest_pmu_context(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +	u64 global_status, toggle;
> +
> +	/* Clear host global_ctrl MSR if non-zero. */
> +	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
> +	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
> +	toggle = pmu->global_status ^ global_status;
> +	if (global_status & toggle)
> +		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status & toggle);
> +	if (pmu->global_status & toggle)
> +		wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status & toggle);
> +
> +	wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> +}
> +
>  struct kvm_pmu_ops intel_pmu_ops __initdata = {
>  	.rdpmc_ecx_to_pmc = intel_rdpmc_ecx_to_pmc,
>  	.msr_idx_to_pmc = intel_msr_idx_to_pmc,
> @@ -812,6 +849,8 @@ struct kvm_pmu_ops intel_pmu_ops __initdata = {
>  	.cleanup = intel_pmu_cleanup,
>  	.is_rdpmc_passthru_allowed = intel_is_rdpmc_passthru_allowed,
>  	.passthrough_pmu_msrs = intel_passthrough_pmu_msrs,
> +	.save_pmu_context = intel_save_guest_pmu_context,
> +	.restore_pmu_context = intel_restore_guest_pmu_context,
>  	.EVENTSEL_EVENT = ARCH_PERFMON_EVENTSEL_EVENT,
>  	.MAX_NR_GP_COUNTERS = KVM_INTEL_PMC_MAX_GENERIC,
>  	.MIN_NR_GP_COUNTERS = 1,

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 09/58] perf: Add a EVENT_GUEST flag
  2024-08-01  4:58 ` [RFC PATCH v3 09/58] perf: Add a EVENT_GUEST flag Mingwei Zhang
@ 2024-08-21  5:27   ` Mi, Dapeng
  2024-08-21 13:16     ` Liang, Kan
  2024-10-11 11:41     ` Peter Zijlstra
  2024-10-11 18:42   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-08-21  5:27 UTC (permalink / raw)
  To: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 8/1/2024 12:58 PM, Mingwei Zhang wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
>
> Current perf doesn't explicitly schedule out all exclude_guest events
> while the guest is running. There is no problem with the current
> emulated vPMU, because perf owns all the PMU counters. It can mask the
> counter which is assigned to an exclude_guest event when a guest is
> running (the Intel way), or set the corresponding HOSTONLY bit in the
> eventsel (the AMD way). The counter doesn't count when a guest is running.
>
> However, neither way works with the introduced passthrough vPMU.
> A guest owns all the PMU counters when it's running. The host should not
> mask any counters. The counter may be used by the guest. The eventsel
> may be overwritten.
>
> Perf should explicitly schedule out all exclude_guest events to release
> the PMU resources when entering a guest, and resume the counting when
> exiting the guest.
>
> It's possible that an exclude_guest event is created when a guest is
> running. The new event should not be scheduled in as well.
>
> The ctx time is shared among different PMUs. The time cannot be stopped
> when a guest is running. It is required to calculate the time for events
> from other PMUs, e.g., uncore events. Add timeguest to track the guest
> run time. For an exclude_guest event, the elapsed time equals
> the ctx time - guest time.
> Cgroup has dedicated times. Use the same method to deduct the guest time
> from the cgroup time as well.
>
> Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  include/linux/perf_event.h |   6 ++
>  kernel/events/core.c       | 178 +++++++++++++++++++++++++++++++------
>  2 files changed, 155 insertions(+), 29 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index e22cdb6486e6..81a5f8399cb8 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -952,6 +952,11 @@ struct perf_event_context {
>  	 */
>  	struct perf_time_ctx		time;
>  
> +	/*
> +	 * Context clock, runs when in the guest mode.
> +	 */
> +	struct perf_time_ctx		timeguest;
> +
>  	/*
>  	 * These fields let us detect when two contexts have both
>  	 * been cloned (inherited) from a common ancestor.
> @@ -1044,6 +1049,7 @@ struct bpf_perf_event_data_kern {
>   */
>  struct perf_cgroup_info {
>  	struct perf_time_ctx		time;
> +	struct perf_time_ctx		timeguest;
>  	int				active;
>  };
>  
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index c25e2bf27001..57648736e43e 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -376,7 +376,8 @@ enum event_type_t {
>  	/* see ctx_resched() for details */
>  	EVENT_CPU = 0x8,
>  	EVENT_CGROUP = 0x10,
> -	EVENT_FLAGS = EVENT_CGROUP,
> +	EVENT_GUEST = 0x20,
> +	EVENT_FLAGS = EVENT_CGROUP | EVENT_GUEST,
>  	EVENT_ALL = EVENT_FLEXIBLE | EVENT_PINNED,
>  };
>  
> @@ -407,6 +408,7 @@ static atomic_t nr_include_guest_events __read_mostly;
>  
>  static atomic_t nr_mediated_pmu_vms;
>  static DEFINE_MUTEX(perf_mediated_pmu_mutex);
> +static DEFINE_PER_CPU(bool, perf_in_guest);
>  
>  /* !exclude_guest event of PMU with PERF_PMU_CAP_PASSTHROUGH_VPMU */
>  static inline bool is_include_guest_event(struct perf_event *event)
> @@ -706,6 +708,10 @@ static bool perf_skip_pmu_ctx(struct perf_event_pmu_context *pmu_ctx,
>  	if ((event_type & EVENT_CGROUP) && !pmu_ctx->nr_cgroups)
>  		return true;
>  
> +	if ((event_type & EVENT_GUEST) &&
> +	    !(pmu_ctx->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU))
> +		return true;
> +
>  	return false;
>  }
>  
> @@ -770,12 +776,21 @@ static inline int is_cgroup_event(struct perf_event *event)
>  	return event->cgrp != NULL;
>  }
>  
> +static inline u64 __perf_event_time_ctx(struct perf_event *event,
> +					struct perf_time_ctx *time,
> +					struct perf_time_ctx *timeguest);
> +
> +static inline u64 __perf_event_time_ctx_now(struct perf_event *event,
> +					    struct perf_time_ctx *time,
> +					    struct perf_time_ctx *timeguest,
> +					    u64 now);
> +
>  static inline u64 perf_cgroup_event_time(struct perf_event *event)
>  {
>  	struct perf_cgroup_info *t;
>  
>  	t = per_cpu_ptr(event->cgrp->info, event->cpu);
> -	return t->time.time;
> +	return __perf_event_time_ctx(event, &t->time, &t->timeguest);
>  }
>  
>  static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
> @@ -784,9 +799,9 @@ static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
>  
>  	t = per_cpu_ptr(event->cgrp->info, event->cpu);
>  	if (!__load_acquire(&t->active))
> -		return t->time.time;
> -	now += READ_ONCE(t->time.offset);
> -	return now;
> +		return __perf_event_time_ctx(event, &t->time, &t->timeguest);
> +
> +	return __perf_event_time_ctx_now(event, &t->time, &t->timeguest, now);
>  }
>  
>  static inline void update_perf_time_ctx(struct perf_time_ctx *time, u64 now, bool adv);
> @@ -796,6 +811,18 @@ static inline void __update_cgrp_time(struct perf_cgroup_info *info, u64 now, bo
>  	update_perf_time_ctx(&info->time, now, adv);
>  }
>  
> +static inline void __update_cgrp_guest_time(struct perf_cgroup_info *info, u64 now, bool adv)
> +{
> +	update_perf_time_ctx(&info->timeguest, now, adv);
> +}
> +
> +static inline void update_cgrp_time(struct perf_cgroup_info *info, u64 now)
> +{
> +	__update_cgrp_time(info, now, true);
> +	if (__this_cpu_read(perf_in_guest))
> +		__update_cgrp_guest_time(info, now, true);
> +}
> +
>  static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx, bool final)
>  {
>  	struct perf_cgroup *cgrp = cpuctx->cgrp;
> @@ -809,7 +836,7 @@ static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx,
>  			cgrp = container_of(css, struct perf_cgroup, css);
>  			info = this_cpu_ptr(cgrp->info);
>  
> -			__update_cgrp_time(info, now, true);
> +			update_cgrp_time(info, now);
>  			if (final)
>  				__store_release(&info->active, 0);
>  		}
> @@ -832,11 +859,11 @@ static inline void update_cgrp_time_from_event(struct perf_event *event)
>  	 * Do not update time when cgroup is not active
>  	 */
>  	if (info->active)
> -		__update_cgrp_time(info, perf_clock(), true);
> +		update_cgrp_time(info, perf_clock());
>  }
>  
>  static inline void
> -perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
> +perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx, bool guest)
>  {
>  	struct perf_event_context *ctx = &cpuctx->ctx;
>  	struct perf_cgroup *cgrp = cpuctx->cgrp;
> @@ -856,8 +883,12 @@ perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
>  	for (css = &cgrp->css; css; css = css->parent) {
>  		cgrp = container_of(css, struct perf_cgroup, css);
>  		info = this_cpu_ptr(cgrp->info);
> -		__update_cgrp_time(info, ctx->time.stamp, false);
> -		__store_release(&info->active, 1);
> +		if (guest) {
> +			__update_cgrp_guest_time(info, ctx->time.stamp, false);
> +		} else {
> +			__update_cgrp_time(info, ctx->time.stamp, false);
> +			__store_release(&info->active, 1);
> +		}
>  	}
>  }
>  
> @@ -1061,7 +1092,7 @@ static inline int perf_cgroup_connect(pid_t pid, struct perf_event *event,
>  }
>  
>  static inline void
> -perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
> +perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx, bool guest)
>  {
>  }
>  
> @@ -1488,16 +1519,34 @@ static inline void update_perf_time_ctx(struct perf_time_ctx *time, u64 now, boo
>   */
>  static void __update_context_time(struct perf_event_context *ctx, bool adv)
>  {
> -	u64 now = perf_clock();
> +	lockdep_assert_held(&ctx->lock);
> +
> +	update_perf_time_ctx(&ctx->time, perf_clock(), adv);
> +}
>  
> +static void __update_context_guest_time(struct perf_event_context *ctx, bool adv)
> +{
>  	lockdep_assert_held(&ctx->lock);
>  
> -	update_perf_time_ctx(&ctx->time, now, adv);
> +	/* must be called after __update_context_time(); */
> +	update_perf_time_ctx(&ctx->timeguest, ctx->time.stamp, adv);
>  }
>  
>  static void update_context_time(struct perf_event_context *ctx)
>  {
>  	__update_context_time(ctx, true);
> +	if (__this_cpu_read(perf_in_guest))
> +		__update_context_guest_time(ctx, true);
> +}
> +
> +static inline u64 __perf_event_time_ctx(struct perf_event *event,
> +					struct perf_time_ctx *time,
> +					struct perf_time_ctx *timeguest)
> +{
> +	if (event->attr.exclude_guest)
> +		return time->time - timeguest->time;
> +	else
> +		return time->time;
>  }
>  
>  static u64 perf_event_time(struct perf_event *event)
> @@ -1510,7 +1559,26 @@ static u64 perf_event_time(struct perf_event *event)
>  	if (is_cgroup_event(event))
>  		return perf_cgroup_event_time(event);
>  
> -	return ctx->time.time;
> +	return __perf_event_time_ctx(event, &ctx->time, &ctx->timeguest);
> +}
> +
> +static inline u64 __perf_event_time_ctx_now(struct perf_event *event,
> +					    struct perf_time_ctx *time,
> +					    struct perf_time_ctx *timeguest,
> +					    u64 now)
> +{
> +	/*
> +	 * The exclude_guest event time should be calculated from
> +	 * the ctx time -  the guest time.
> +	 * The ctx time is now + READ_ONCE(time->offset).
> +	 * The guest time is now + READ_ONCE(timeguest->offset).
> +	 * So the exclude_guest time is
> +	 * READ_ONCE(time->offset) - READ_ONCE(timeguest->offset).
> +	 */
> +	if (event->attr.exclude_guest && __this_cpu_read(perf_in_guest))

Hi Kan,

we see the following warning when running the perf record command after
enabling the "CONFIG_DEBUG_PREEMPT" config item.

[  166.779208] BUG: using __this_cpu_read() in preemptible [00000000] code:
perf/9494
[  166.779234] caller is __this_cpu_preempt_check+0x13/0x20
[  166.779241] CPU: 56 UID: 0 PID: 9494 Comm: perf Not tainted
6.11.0-rc4-perf-next-mediated-vpmu-v3+ #80
[  166.779245] Hardware name: Quanta Cloud Technology Inc. QuantaGrid
D54Q-2U/S6Q-MB-MPS, BIOS 3A11.uh 12/02/2022
[  166.779248] Call Trace:
[  166.779250]  <TASK>
[  166.779252]  dump_stack_lvl+0x76/0xa0
[  166.779260]  dump_stack+0x10/0x20
[  166.779267]  check_preemption_disabled+0xd7/0xf0
[  166.779273]  __this_cpu_preempt_check+0x13/0x20
[  166.779279]  calc_timer_values+0x193/0x200
[  166.779287]  perf_event_update_userpage+0x4b/0x170
[  166.779294]  ? ring_buffer_attach+0x14c/0x200
[  166.779301]  perf_mmap+0x533/0x5d0
[  166.779309]  mmap_region+0x243/0xaa0
[  166.779322]  do_mmap+0x35b/0x640
[  166.779333]  vm_mmap_pgoff+0xf0/0x1c0
[  166.779345]  ksys_mmap_pgoff+0x17a/0x250
[  166.779354]  __x64_sys_mmap+0x33/0x70
[  166.779362]  x64_sys_call+0x1fa4/0x25f0
[  166.779369]  do_syscall_64+0x70/0x130

The reason the kernel complains is that __perf_event_time_ctx_now() calls
__this_cpu_read() in a preemption-enabled context.

To eliminate the warning, we may need to replace __this_cpu_read() with
this_cpu_read():

diff --git a/kernel/events/core.c b/kernel/events/core.c
index ccd61fd06e8d..1eb628f8b3a0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1581,7 +1581,7 @@ static inline u64 __perf_event_time_ctx_now(struct
perf_event *event,
         * So the exclude_guest time is
         * READ_ONCE(time->offset) - READ_ONCE(timeguest->offset).
         */
-       if (event->attr.exclude_guest && __this_cpu_read(perf_in_guest))
+       if (event->attr.exclude_guest && this_cpu_read(perf_in_guest))
                return READ_ONCE(time->offset) - READ_ONCE(timeguest->offset);
        else
                return now + READ_ONCE(time->offset);

> +		return READ_ONCE(time->offset) - READ_ONCE(timeguest->offset);
> +	else
> +		return now + READ_ONCE(time->offset);
>  }
>  
>  static u64 perf_event_time_now(struct perf_event *event, u64 now)
> @@ -1524,10 +1592,9 @@ static u64 perf_event_time_now(struct perf_event *event, u64 now)
>  		return perf_cgroup_event_time_now(event, now);
>  
>  	if (!(__load_acquire(&ctx->is_active) & EVENT_TIME))
> -		return ctx->time.time;
> +		return __perf_event_time_ctx(event, &ctx->time, &ctx->timeguest);
>  
> -	now += READ_ONCE(ctx->time.offset);
> -	return now;
> +	return __perf_event_time_ctx_now(event, &ctx->time, &ctx->timeguest, now);
>  }
>  
>  static enum event_type_t get_event_type(struct perf_event *event)
> @@ -3334,9 +3401,15 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
>  	 * would only update time for the pinned events.
>  	 */
>  	if (is_active & EVENT_TIME) {
> +		bool stop;
> +
> +		/* vPMU should not stop time */
> +		stop = !(event_type & EVENT_GUEST) &&
> +		       ctx == &cpuctx->ctx;
> +
>  		/* update (and stop) ctx time */
>  		update_context_time(ctx);
> -		update_cgrp_time_from_cpuctx(cpuctx, ctx == &cpuctx->ctx);
> +		update_cgrp_time_from_cpuctx(cpuctx, stop);
>  		/*
>  		 * CPU-release for the below ->is_active store,
>  		 * see __load_acquire() in perf_event_time_now()
> @@ -3354,7 +3427,18 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
>  			cpuctx->task_ctx = NULL;
>  	}
>  
> -	is_active ^= ctx->is_active; /* changed bits */
> +	if (event_type & EVENT_GUEST) {
> +		/*
> +		 * Schedule out all !exclude_guest events of PMU
> +		 * with PERF_PMU_CAP_PASSTHROUGH_VPMU.
> +		 */
> +		is_active = EVENT_ALL;
> +		__update_context_guest_time(ctx, false);
> +		perf_cgroup_set_timestamp(cpuctx, true);
> +		barrier();
> +	} else {
> +		is_active ^= ctx->is_active; /* changed bits */
> +	}
>  
>  	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
>  		if (perf_skip_pmu_ctx(pmu_ctx, event_type))
> @@ -3853,10 +3937,15 @@ static inline void group_update_userpage(struct perf_event *group_event)
>  		event_update_userpage(event);
>  }
>  
> +struct merge_sched_data {
> +	int can_add_hw;
> +	enum event_type_t event_type;
> +};
> +
>  static int merge_sched_in(struct perf_event *event, void *data)
>  {
>  	struct perf_event_context *ctx = event->ctx;
> -	int *can_add_hw = data;
> +	struct merge_sched_data *msd = data;
>  
>  	if (event->state <= PERF_EVENT_STATE_OFF)
>  		return 0;
> @@ -3864,13 +3953,22 @@ static int merge_sched_in(struct perf_event *event, void *data)
>  	if (!event_filter_match(event))
>  		return 0;
>  
> -	if (group_can_go_on(event, *can_add_hw)) {
> +	/*
> +	 * Don't schedule in any exclude_guest events of PMU with
> +	 * PERF_PMU_CAP_PASSTHROUGH_VPMU, while a guest is running.
> +	 */
> +	if (__this_cpu_read(perf_in_guest) && event->attr.exclude_guest &&
> +	    event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU &&
> +	    !(msd->event_type & EVENT_GUEST))
> +		return 0;
> +
> +	if (group_can_go_on(event, msd->can_add_hw)) {
>  		if (!group_sched_in(event, ctx))
>  			list_add_tail(&event->active_list, get_event_list(event));
>  	}
>  
>  	if (event->state == PERF_EVENT_STATE_INACTIVE) {
> -		*can_add_hw = 0;
> +		msd->can_add_hw = 0;
>  		if (event->attr.pinned) {
>  			perf_cgroup_event_disable(event, ctx);
>  			perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
> @@ -3889,11 +3987,15 @@ static int merge_sched_in(struct perf_event *event, void *data)
>  
>  static void pmu_groups_sched_in(struct perf_event_context *ctx,
>  				struct perf_event_groups *groups,
> -				struct pmu *pmu)
> +				struct pmu *pmu,
> +				enum event_type_t event_type)
>  {
> -	int can_add_hw = 1;
> +	struct merge_sched_data msd = {
> +		.can_add_hw = 1,
> +		.event_type = event_type,
> +	};
>  	visit_groups_merge(ctx, groups, smp_processor_id(), pmu,
> -			   merge_sched_in, &can_add_hw);
> +			   merge_sched_in, &msd);
>  }
>  
>  static void ctx_groups_sched_in(struct perf_event_context *ctx,
> @@ -3905,14 +4007,14 @@ static void ctx_groups_sched_in(struct perf_event_context *ctx,
>  	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
>  		if (perf_skip_pmu_ctx(pmu_ctx, event_type))
>  			continue;
> -		pmu_groups_sched_in(ctx, groups, pmu_ctx->pmu);
> +		pmu_groups_sched_in(ctx, groups, pmu_ctx->pmu, event_type);
>  	}
>  }
>  
>  static void __pmu_ctx_sched_in(struct perf_event_context *ctx,
>  			       struct pmu *pmu)
>  {
> -	pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu);
> +	pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu, 0);
>  }
>  
>  static void
> @@ -3927,9 +4029,11 @@ ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type)
>  		return;
>  
>  	if (!(is_active & EVENT_TIME)) {
> +		/* EVENT_TIME should be active while the guest runs */
> +		WARN_ON_ONCE(event_type & EVENT_GUEST);
>  		/* start ctx time */
>  		__update_context_time(ctx, false);
> -		perf_cgroup_set_timestamp(cpuctx);
> +		perf_cgroup_set_timestamp(cpuctx, false);
>  		/*
>  		 * CPU-release for the below ->is_active store,
>  		 * see __load_acquire() in perf_event_time_now()
> @@ -3945,7 +4049,23 @@ ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type)
>  			WARN_ON_ONCE(cpuctx->task_ctx != ctx);
>  	}
>  
> -	is_active ^= ctx->is_active; /* changed bits */
> +	if (event_type & EVENT_GUEST) {
> +		/*
> +		 * Schedule in all !exclude_guest events of PMU
> +		 * with PERF_PMU_CAP_PASSTHROUGH_VPMU.
> +		 */
> +		is_active = EVENT_ALL;
> +
> +		/*
> +		 * Update ctx time to set the new start time for
> +		 * the exclude_guest events.
> +		 */
> +		update_context_time(ctx);
> +		update_cgrp_time_from_cpuctx(cpuctx, false);
> +		barrier();
> +	} else {
> +		is_active ^= ctx->is_active; /* changed bits */
> +	}
>  
>  	/*
>  	 * First go through the list and put on any pinned groups

^ permalink raw reply related	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 09/58] perf: Add a EVENT_GUEST flag
  2024-08-21  5:27   ` Mi, Dapeng
@ 2024-08-21 13:16     ` Liang, Kan
  2024-10-11 11:41     ` Peter Zijlstra
  1 sibling, 0 replies; 183+ messages in thread
From: Liang, Kan @ 2024-08-21 13:16 UTC (permalink / raw)
  To: Mi, Dapeng, Mingwei Zhang, Sean Christopherson, Paolo Bonzini,
	Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users



On 2024-08-21 1:27 a.m., Mi, Dapeng wrote:
> 
> On 8/1/2024 12:58 PM, Mingwei Zhang wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> Current perf doesn't explicitly schedule out all exclude_guest events
>> while the guest is running. There is no problem with the current
>> emulated vPMU, because perf owns all the PMU counters. It can mask the
>> counter which is assigned to an exclude_guest event when a guest is
>> running (Intel way), or set the corresponding HOSTONLY bit in the eventsel
>> (AMD way). The counter doesn't count when a guest is running.
>>
>> However, either way doesn't work with the introduced passthrough vPMU.
>> A guest owns all the PMU counters when it's running. The host should not
>> mask any counters. The counter may be used by the guest. The eventsel
>> may be overwritten.
>>
>> Perf should explicitly schedule out all exclude_guest events to release
>> the PMU resources when entering a guest, and resume the counting when
>> exiting the guest.
>>
>> It's possible that an exclude_guest event is created when a guest is
>> running. The new event should not be scheduled in, either.
>>
>> The ctx time is shared among different PMUs. The time cannot be stopped
>> when a guest is running, because it is still required to calculate the time
>> for events from other PMUs, e.g., uncore events. Add timeguest to track the
>> guest run time. For an exclude_guest event, the elapsed time equals
>> the ctx time - the guest time.
>> Cgroup has dedicated times. Use the same method to deduct the guest time
>> from the cgroup time as well.
>>
>> Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>  include/linux/perf_event.h |   6 ++
>>  kernel/events/core.c       | 178 +++++++++++++++++++++++++++++++------
>>  2 files changed, 155 insertions(+), 29 deletions(-)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index e22cdb6486e6..81a5f8399cb8 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -952,6 +952,11 @@ struct perf_event_context {
>>  	 */
>>  	struct perf_time_ctx		time;
>>  
>> +	/*
>> +	 * Context clock, runs when in the guest mode.
>> +	 */
>> +	struct perf_time_ctx		timeguest;
>> +
>>  	/*
>>  	 * These fields let us detect when two contexts have both
>>  	 * been cloned (inherited) from a common ancestor.
>> @@ -1044,6 +1049,7 @@ struct bpf_perf_event_data_kern {
>>   */
>>  struct perf_cgroup_info {
>>  	struct perf_time_ctx		time;
>> +	struct perf_time_ctx		timeguest;
>>  	int				active;
>>  };
>>  
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index c25e2bf27001..57648736e43e 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -376,7 +376,8 @@ enum event_type_t {
>>  	/* see ctx_resched() for details */
>>  	EVENT_CPU = 0x8,
>>  	EVENT_CGROUP = 0x10,
>> -	EVENT_FLAGS = EVENT_CGROUP,
>> +	EVENT_GUEST = 0x20,
>> +	EVENT_FLAGS = EVENT_CGROUP | EVENT_GUEST,
>>  	EVENT_ALL = EVENT_FLEXIBLE | EVENT_PINNED,
>>  };
>>  
>> @@ -407,6 +408,7 @@ static atomic_t nr_include_guest_events __read_mostly;
>>  
>>  static atomic_t nr_mediated_pmu_vms;
>>  static DEFINE_MUTEX(perf_mediated_pmu_mutex);
>> +static DEFINE_PER_CPU(bool, perf_in_guest);
>>  
>>  /* !exclude_guest event of PMU with PERF_PMU_CAP_PASSTHROUGH_VPMU */
>>  static inline bool is_include_guest_event(struct perf_event *event)
>> @@ -706,6 +708,10 @@ static bool perf_skip_pmu_ctx(struct perf_event_pmu_context *pmu_ctx,
>>  	if ((event_type & EVENT_CGROUP) && !pmu_ctx->nr_cgroups)
>>  		return true;
>>  
>> +	if ((event_type & EVENT_GUEST) &&
>> +	    !(pmu_ctx->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU))
>> +		return true;
>> +
>>  	return false;
>>  }
>>  
>> @@ -770,12 +776,21 @@ static inline int is_cgroup_event(struct perf_event *event)
>>  	return event->cgrp != NULL;
>>  }
>>  
>> +static inline u64 __perf_event_time_ctx(struct perf_event *event,
>> +					struct perf_time_ctx *time,
>> +					struct perf_time_ctx *timeguest);
>> +
>> +static inline u64 __perf_event_time_ctx_now(struct perf_event *event,
>> +					    struct perf_time_ctx *time,
>> +					    struct perf_time_ctx *timeguest,
>> +					    u64 now);
>> +
>>  static inline u64 perf_cgroup_event_time(struct perf_event *event)
>>  {
>>  	struct perf_cgroup_info *t;
>>  
>>  	t = per_cpu_ptr(event->cgrp->info, event->cpu);
>> -	return t->time.time;
>> +	return __perf_event_time_ctx(event, &t->time, &t->timeguest);
>>  }
>>  
>>  static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
>> @@ -784,9 +799,9 @@ static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
>>  
>>  	t = per_cpu_ptr(event->cgrp->info, event->cpu);
>>  	if (!__load_acquire(&t->active))
>> -		return t->time.time;
>> -	now += READ_ONCE(t->time.offset);
>> -	return now;
>> +		return __perf_event_time_ctx(event, &t->time, &t->timeguest);
>> +
>> +	return __perf_event_time_ctx_now(event, &t->time, &t->timeguest, now);
>>  }
>>  
>>  static inline void update_perf_time_ctx(struct perf_time_ctx *time, u64 now, bool adv);
>> @@ -796,6 +811,18 @@ static inline void __update_cgrp_time(struct perf_cgroup_info *info, u64 now, bo
>>  	update_perf_time_ctx(&info->time, now, adv);
>>  }
>>  
>> +static inline void __update_cgrp_guest_time(struct perf_cgroup_info *info, u64 now, bool adv)
>> +{
>> +	update_perf_time_ctx(&info->timeguest, now, adv);
>> +}
>> +
>> +static inline void update_cgrp_time(struct perf_cgroup_info *info, u64 now)
>> +{
>> +	__update_cgrp_time(info, now, true);
>> +	if (__this_cpu_read(perf_in_guest))
>> +		__update_cgrp_guest_time(info, now, true);
>> +}
>> +
>>  static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx, bool final)
>>  {
>>  	struct perf_cgroup *cgrp = cpuctx->cgrp;
>> @@ -809,7 +836,7 @@ static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx,
>>  			cgrp = container_of(css, struct perf_cgroup, css);
>>  			info = this_cpu_ptr(cgrp->info);
>>  
>> -			__update_cgrp_time(info, now, true);
>> +			update_cgrp_time(info, now);
>>  			if (final)
>>  				__store_release(&info->active, 0);
>>  		}
>> @@ -832,11 +859,11 @@ static inline void update_cgrp_time_from_event(struct perf_event *event)
>>  	 * Do not update time when cgroup is not active
>>  	 */
>>  	if (info->active)
>> -		__update_cgrp_time(info, perf_clock(), true);
>> +		update_cgrp_time(info, perf_clock());
>>  }
>>  
>>  static inline void
>> -perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
>> +perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx, bool guest)
>>  {
>>  	struct perf_event_context *ctx = &cpuctx->ctx;
>>  	struct perf_cgroup *cgrp = cpuctx->cgrp;
>> @@ -856,8 +883,12 @@ perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
>>  	for (css = &cgrp->css; css; css = css->parent) {
>>  		cgrp = container_of(css, struct perf_cgroup, css);
>>  		info = this_cpu_ptr(cgrp->info);
>> -		__update_cgrp_time(info, ctx->time.stamp, false);
>> -		__store_release(&info->active, 1);
>> +		if (guest) {
>> +			__update_cgrp_guest_time(info, ctx->time.stamp, false);
>> +		} else {
>> +			__update_cgrp_time(info, ctx->time.stamp, false);
>> +			__store_release(&info->active, 1);
>> +		}
>>  	}
>>  }
>>  
>> @@ -1061,7 +1092,7 @@ static inline int perf_cgroup_connect(pid_t pid, struct perf_event *event,
>>  }
>>  
>>  static inline void
>> -perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
>> +perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx, bool guest)
>>  {
>>  }
>>  
>> @@ -1488,16 +1519,34 @@ static inline void update_perf_time_ctx(struct perf_time_ctx *time, u64 now, boo
>>   */
>>  static void __update_context_time(struct perf_event_context *ctx, bool adv)
>>  {
>> -	u64 now = perf_clock();
>> +	lockdep_assert_held(&ctx->lock);
>> +
>> +	update_perf_time_ctx(&ctx->time, perf_clock(), adv);
>> +}
>>  
>> +static void __update_context_guest_time(struct perf_event_context *ctx, bool adv)
>> +{
>>  	lockdep_assert_held(&ctx->lock);
>>  
>> -	update_perf_time_ctx(&ctx->time, now, adv);
>> +	/* must be called after __update_context_time(); */
>> +	update_perf_time_ctx(&ctx->timeguest, ctx->time.stamp, adv);
>>  }
>>  
>>  static void update_context_time(struct perf_event_context *ctx)
>>  {
>>  	__update_context_time(ctx, true);
>> +	if (__this_cpu_read(perf_in_guest))
>> +		__update_context_guest_time(ctx, true);
>> +}
>> +
>> +static inline u64 __perf_event_time_ctx(struct perf_event *event,
>> +					struct perf_time_ctx *time,
>> +					struct perf_time_ctx *timeguest)
>> +{
>> +	if (event->attr.exclude_guest)
>> +		return time->time - timeguest->time;
>> +	else
>> +		return time->time;
>>  }
>>  
>>  static u64 perf_event_time(struct perf_event *event)
>> @@ -1510,7 +1559,26 @@ static u64 perf_event_time(struct perf_event *event)
>>  	if (is_cgroup_event(event))
>>  		return perf_cgroup_event_time(event);
>>  
>> -	return ctx->time.time;
>> +	return __perf_event_time_ctx(event, &ctx->time, &ctx->timeguest);
>> +}
>> +
>> +static inline u64 __perf_event_time_ctx_now(struct perf_event *event,
>> +					    struct perf_time_ctx *time,
>> +					    struct perf_time_ctx *timeguest,
>> +					    u64 now)
>> +{
>> +	/*
>> +	 * The exclude_guest event time should be calculated from
>> +	 * the ctx time -  the guest time.
>> +	 * The ctx time is now + READ_ONCE(time->offset).
>> +	 * The guest time is now + READ_ONCE(timeguest->offset).
>> +	 * So the exclude_guest time is
>> +	 * READ_ONCE(time->offset) - READ_ONCE(timeguest->offset).
>> +	 */
>> +	if (event->attr.exclude_guest && __this_cpu_read(perf_in_guest))
> 
> Hi Kan,
> 
> we see the following the warning when run perf record command after
> enabling "CONFIG_DEBUG_PREEMPT" config item.
> 
> [  166.779208] BUG: using __this_cpu_read() in preemptible [00000000] code:
> perf/9494
> [  166.779234] caller is __this_cpu_preempt_check+0x13/0x20
> [  166.779241] CPU: 56 UID: 0 PID: 9494 Comm: perf Not tainted
> 6.11.0-rc4-perf-next-mediated-vpmu-v3+ #80
> [  166.779245] Hardware name: Quanta Cloud Technology Inc. QuantaGrid
> D54Q-2U/S6Q-MB-MPS, BIOS 3A11.uh 12/02/2022
> [  166.779248] Call Trace:
> [  166.779250]  <TASK>
> [  166.779252]  dump_stack_lvl+0x76/0xa0
> [  166.779260]  dump_stack+0x10/0x20
> [  166.779267]  check_preemption_disabled+0xd7/0xf0
> [  166.779273]  __this_cpu_preempt_check+0x13/0x20
> [  166.779279]  calc_timer_values+0x193/0x200
> [  166.779287]  perf_event_update_userpage+0x4b/0x170
> [  166.779294]  ? ring_buffer_attach+0x14c/0x200
> [  166.779301]  perf_mmap+0x533/0x5d0
> [  166.779309]  mmap_region+0x243/0xaa0
> [  166.779322]  do_mmap+0x35b/0x640
> [  166.779333]  vm_mmap_pgoff+0xf0/0x1c0
> [  166.779345]  ksys_mmap_pgoff+0x17a/0x250
> [  166.779354]  __x64_sys_mmap+0x33/0x70
> [  166.779362]  x64_sys_call+0x1fa4/0x25f0
> [  166.779369]  do_syscall_64+0x70/0x130
> 
> The reason the kernel complains is that __perf_event_time_ctx_now() calls
> __this_cpu_read() in a preemption-enabled context.
> 
> To eliminate the warning, we may need to replace __this_cpu_read() with
> this_cpu_read().

Sure.

Besides this, we recently updated the time-related code in perf:
https://lore.kernel.org/lkml/172311312757.2215.323044538405607858.tip-bot2@tip-bot2/

This patch probably has to be rebased on top of it.

Thanks,
Kan

> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index ccd61fd06e8d..1eb628f8b3a0 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -1581,7 +1581,7 @@ static inline u64 __perf_event_time_ctx_now(struct
> perf_event *event,
>          * So the exclude_guest time is
>          * READ_ONCE(time->offset) - READ_ONCE(timeguest->offset).
>          */
> -       if (event->attr.exclude_guest && __this_cpu_read(perf_in_guest))
> +       if (event->attr.exclude_guest && this_cpu_read(perf_in_guest))
>                 return READ_ONCE(time->offset) - READ_ONCE(timeguest->offset);
>         else
>                 return now + READ_ONCE(time->offset);
> 
>> +		return READ_ONCE(time->offset) - READ_ONCE(timeguest->offset);
>> +	else
>> +		return now + READ_ONCE(time->offset);
>>  }
>>  
>>  static u64 perf_event_time_now(struct perf_event *event, u64 now)
>> @@ -1524,10 +1592,9 @@ static u64 perf_event_time_now(struct perf_event *event, u64 now)
>>  		return perf_cgroup_event_time_now(event, now);
>>  
>>  	if (!(__load_acquire(&ctx->is_active) & EVENT_TIME))
>> -		return ctx->time.time;
>> +		return __perf_event_time_ctx(event, &ctx->time, &ctx->timeguest);
>>  
>> -	now += READ_ONCE(ctx->time.offset);
>> -	return now;
>> +	return __perf_event_time_ctx_now(event, &ctx->time, &ctx->timeguest, now);
>>  }
>>  
>>  static enum event_type_t get_event_type(struct perf_event *event)
>> @@ -3334,9 +3401,15 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
>>  	 * would only update time for the pinned events.
>>  	 */
>>  	if (is_active & EVENT_TIME) {
>> +		bool stop;
>> +
>> +		/* vPMU should not stop time */
>> +		stop = !(event_type & EVENT_GUEST) &&
>> +		       ctx == &cpuctx->ctx;
>> +
>>  		/* update (and stop) ctx time */
>>  		update_context_time(ctx);
>> -		update_cgrp_time_from_cpuctx(cpuctx, ctx == &cpuctx->ctx);
>> +		update_cgrp_time_from_cpuctx(cpuctx, stop);
>>  		/*
>>  		 * CPU-release for the below ->is_active store,
>>  		 * see __load_acquire() in perf_event_time_now()
>> @@ -3354,7 +3427,18 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
>>  			cpuctx->task_ctx = NULL;
>>  	}
>>  
>> -	is_active ^= ctx->is_active; /* changed bits */
>> +	if (event_type & EVENT_GUEST) {
>> +		/*
>> +		 * Schedule out all !exclude_guest events of PMU
>> +		 * with PERF_PMU_CAP_PASSTHROUGH_VPMU.
>> +		 */
>> +		is_active = EVENT_ALL;
>> +		__update_context_guest_time(ctx, false);
>> +		perf_cgroup_set_timestamp(cpuctx, true);
>> +		barrier();
>> +	} else {
>> +		is_active ^= ctx->is_active; /* changed bits */
>> +	}
>>  
>>  	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
>>  		if (perf_skip_pmu_ctx(pmu_ctx, event_type))
>> @@ -3853,10 +3937,15 @@ static inline void group_update_userpage(struct perf_event *group_event)
>>  		event_update_userpage(event);
>>  }
>>  
>> +struct merge_sched_data {
>> +	int can_add_hw;
>> +	enum event_type_t event_type;
>> +};
>> +
>>  static int merge_sched_in(struct perf_event *event, void *data)
>>  {
>>  	struct perf_event_context *ctx = event->ctx;
>> -	int *can_add_hw = data;
>> +	struct merge_sched_data *msd = data;
>>  
>>  	if (event->state <= PERF_EVENT_STATE_OFF)
>>  		return 0;
>> @@ -3864,13 +3953,22 @@ static int merge_sched_in(struct perf_event *event, void *data)
>>  	if (!event_filter_match(event))
>>  		return 0;
>>  
>> -	if (group_can_go_on(event, *can_add_hw)) {
>> +	/*
>> +	 * Don't schedule in any exclude_guest events of PMU with
>> +	 * PERF_PMU_CAP_PASSTHROUGH_VPMU, while a guest is running.
>> +	 */
>> +	if (__this_cpu_read(perf_in_guest) && event->attr.exclude_guest &&
>> +	    event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU &&
>> +	    !(msd->event_type & EVENT_GUEST))
>> +		return 0;
>> +
>> +	if (group_can_go_on(event, msd->can_add_hw)) {
>>  		if (!group_sched_in(event, ctx))
>>  			list_add_tail(&event->active_list, get_event_list(event));
>>  	}
>>  
>>  	if (event->state == PERF_EVENT_STATE_INACTIVE) {
>> -		*can_add_hw = 0;
>> +		msd->can_add_hw = 0;
>>  		if (event->attr.pinned) {
>>  			perf_cgroup_event_disable(event, ctx);
>>  			perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
>> @@ -3889,11 +3987,15 @@ static int merge_sched_in(struct perf_event *event, void *data)
>>  
>>  static void pmu_groups_sched_in(struct perf_event_context *ctx,
>>  				struct perf_event_groups *groups,
>> -				struct pmu *pmu)
>> +				struct pmu *pmu,
>> +				enum event_type_t event_type)
>>  {
>> -	int can_add_hw = 1;
>> +	struct merge_sched_data msd = {
>> +		.can_add_hw = 1,
>> +		.event_type = event_type,
>> +	};
>>  	visit_groups_merge(ctx, groups, smp_processor_id(), pmu,
>> -			   merge_sched_in, &can_add_hw);
>> +			   merge_sched_in, &msd);
>>  }
>>  
>>  static void ctx_groups_sched_in(struct perf_event_context *ctx,
>> @@ -3905,14 +4007,14 @@ static void ctx_groups_sched_in(struct perf_event_context *ctx,
>>  	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
>>  		if (perf_skip_pmu_ctx(pmu_ctx, event_type))
>>  			continue;
>> -		pmu_groups_sched_in(ctx, groups, pmu_ctx->pmu);
>> +		pmu_groups_sched_in(ctx, groups, pmu_ctx->pmu, event_type);
>>  	}
>>  }
>>  
>>  static void __pmu_ctx_sched_in(struct perf_event_context *ctx,
>>  			       struct pmu *pmu)
>>  {
>> -	pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu);
>> +	pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu, 0);
>>  }
>>  
>>  static void
>> @@ -3927,9 +4029,11 @@ ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type)
>>  		return;
>>  
>>  	if (!(is_active & EVENT_TIME)) {
>> +		/* EVENT_TIME should be active while the guest runs */
>> +		WARN_ON_ONCE(event_type & EVENT_GUEST);
>>  		/* start ctx time */
>>  		__update_context_time(ctx, false);
>> -		perf_cgroup_set_timestamp(cpuctx);
>> +		perf_cgroup_set_timestamp(cpuctx, false);
>>  		/*
>>  		 * CPU-release for the below ->is_active store,
>>  		 * see __load_acquire() in perf_event_time_now()
>> @@ -3945,7 +4049,23 @@ ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type)
>>  			WARN_ON_ONCE(cpuctx->task_ctx != ctx);
>>  	}
>>  
>> -	is_active ^= ctx->is_active; /* changed bits */
>> +	if (event_type & EVENT_GUEST) {
>> +		/*
>> +		 * Schedule in all !exclude_guest events of PMU
>> +		 * with PERF_PMU_CAP_PASSTHROUGH_VPMU.
>> +		 */
>> +		is_active = EVENT_ALL;
>> +
>> +		/*
>> +		 * Update ctx time to set the new start time for
>> +		 * the exclude_guest events.
>> +		 */
>> +		update_context_time(ctx);
>> +		update_cgrp_time_from_cpuctx(cpuctx, false);
>> +		barrier();
>> +	} else {
>> +		is_active ^= ctx->is_active; /* changed bits */
>> +	}
>>  
>>  	/*
>>  	 * First go through the list and put on any pinned groups
> 

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 30/58] KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot()
  2024-08-01  4:58 ` [RFC PATCH v3 30/58] KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot() Mingwei Zhang
@ 2024-09-02  7:51   ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-09-02  7:51 UTC (permalink / raw)
  To: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 8/1/2024 12:58 PM, Mingwei Zhang wrote:
> Reject PMU MSR interception explicitly in
> vmx_get_passthrough_msr_slot() since interception of PMU MSRs is
> handled specially in intel_passthrough_pmu_msrs().
>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
> ---
>  arch/x86/kvm/vmx/vmx.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 34a420fa98c5..41102658ed21 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -166,7 +166,7 @@ module_param(enable_passthrough_pmu, bool, 0444);
>  
>  /*
>   * List of MSRs that can be directly passed to the guest.
> - * In addition to these x2apic, PT and LBR MSRs are handled specially.
> + * In addition to these x2apic, PMU, PT and LBR MSRs are handled specially.
>   */
>  static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
>  	MSR_IA32_SPEC_CTRL,
> @@ -695,6 +695,13 @@ static int vmx_get_passthrough_msr_slot(u32 msr)
>  	case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 8:
>  	case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8:
>  		/* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
> +	case MSR_IA32_PMC0 ... MSR_IA32_PMC0 + 7:
> +	case MSR_IA32_PERFCTR0 ... MSR_IA32_PERFCTR0 + 7:
> +	case MSR_CORE_PERF_FIXED_CTR0 ... MSR_CORE_PERF_FIXED_CTR0 + 2:

We'd better use the macros
KVM_MAX_NR_GP_COUNTERS/KVM_MAX_NR_FIXED_COUNTERS to replace these magic numbers.

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6d9ccac839b4..68d9c5e7f91e 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -696,9 +696,9 @@ static int vmx_get_passthrough_msr_slot(u32 msr)
        case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 8:
        case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8:
                /* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
-       case MSR_IA32_PMC0 ... MSR_IA32_PMC0 + 7:
-       case MSR_IA32_PERFCTR0 ... MSR_IA32_PERFCTR0 + 7:
-       case MSR_CORE_PERF_FIXED_CTR0 ... MSR_CORE_PERF_FIXED_CTR0 + 2:
+       case MSR_IA32_PMC0 ... MSR_IA32_PMC0 + KVM_MAX_NR_GP_COUNTERS - 1:
+       case MSR_IA32_PERFCTR0 ... MSR_IA32_PERFCTR0 + KVM_MAX_NR_GP_COUNTERS - 1:
+       case MSR_CORE_PERF_FIXED_CTR0 ... MSR_CORE_PERF_FIXED_CTR0 + KVM_MAX_NR_FIXED_COUNTERS - 1:
        case MSR_CORE_PERF_GLOBAL_STATUS:
        case MSR_CORE_PERF_GLOBAL_CTRL:
        case MSR_CORE_PERF_GLOBAL_OVF_CTRL:


> +	case MSR_CORE_PERF_GLOBAL_STATUS:
> +	case MSR_CORE_PERF_GLOBAL_CTRL:
> +	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
> +		/* PMU MSRs. These are handled in intel_passthrough_pmu_msrs() */
>  		return -ENOENT;
>  	}
>  

^ permalink raw reply related	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 16/58] perf/x86: Forbid PMI handler when guest own PMU
  2024-08-01  4:58 ` [RFC PATCH v3 16/58] perf/x86: Forbid PMI handler when guest own PMU Mingwei Zhang
@ 2024-09-02  7:56   ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-09-02  7:56 UTC (permalink / raw)
  To: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 8/1/2024 12:58 PM, Mingwei Zhang wrote:
> If a guest PMI is delivered after VM-exit, the KVM maskable interrupt will
> be held pending until EFLAGS.IF is set. In the meantime, if the logical
> processor receives an NMI for any reason at all, perf_event_nmi_handler()
> will be invoked. If there is any active perf event anywhere on the system,
> x86_pmu_handle_irq() will be invoked, and it will clear
> IA32_PERF_GLOBAL_STATUS. By the time KVM's PMI handler is invoked, it will
> be a mystery which counter(s) overflowed.
>
> When the LVTPC is using the KVM PMI vector, the PMU is owned by the guest.
> A host NMI lets x86_pmu_handle_irq() run, which restores the PMU vector to
> NMI and clears IA32_PERF_GLOBAL_STATUS; this breaks the guest vPMU
> passthrough environment.
>
> So modify perf_event_nmi_handler() to check perf_in_guest per cpu variable,
> and if so, to simply return without calling x86_pmu_handle_irq().
>
> Suggested-by: Jim Mattson <jmattson@google.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
>  arch/x86/events/core.c | 27 +++++++++++++++++++++++++--
>  1 file changed, 25 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index b17ef8b6c1a6..cb5d8f5fd9ce 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -52,6 +52,8 @@ DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events) = {
>  	.pmu = &pmu,
>  };
>  
> +DEFINE_PER_CPU(bool, pmi_vector_is_nmi) = true;
> +
>  DEFINE_STATIC_KEY_FALSE(rdpmc_never_available_key);
>  DEFINE_STATIC_KEY_FALSE(rdpmc_always_available_key);
>  DEFINE_STATIC_KEY_FALSE(perf_is_hybrid);
> @@ -1733,6 +1735,24 @@ perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
>  	u64 finish_clock;
>  	int ret;
>  
> +	/*
> +	 * When guest pmu context is loaded this handler should be forbidden from
> +	 * running, the reasons are:
> +	 * 1. After perf_guest_enter() is called, and before cpu enter into
> +	 *    non-root mode, NMI could happen, but x86_pmu_handle_irq() restore PMU
> +	 *    to use NMI vector, which destroy KVM PMI vector setting.
> +	 * 2. When VM is running, host NMI other than PMI causes VM exit, KVM will
> +	 *    call host NMI handler (vmx_vcpu_enter_exit()) first before KVM save
> +	 *    guest PMU context (kvm_pmu_save_pmu_context()), as x86_pmu_handle_irq()
> +	 *    clear global_status MSR which has guest status now, then this destroy
> +	 *    guest PMU status.
> +	 * 3. After VM exit, but before KVM save guest PMU context, host NMI other
> +	 *    than PMI could happen, x86_pmu_handle_irq() clear global_status MSR
> +	 *    which has guest status now, then this destroy guest PMU status.
> +	 */
> +	if (!this_cpu_read(pmi_vector_is_nmi))
> +		return 0;

0 -> NMI_DONE
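
i.e., something like this (just a sketch of the suggested change; the
rest of the handler stays the same):

	/* Guest owns the PMU; don't touch its state from the host NMI path. */
	if (!this_cpu_read(pmi_vector_is_nmi))
		return NMI_DONE;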


> +
>  	/*
>  	 * All PMUs/events that share this PMI handler should make sure to
>  	 * increment active_events for their events.
> @@ -2675,11 +2695,14 @@ static bool x86_pmu_filter(struct pmu *pmu, int cpu)
>  
>  static void x86_pmu_switch_interrupt(bool enter, u32 guest_lvtpc)
>  {
> -	if (enter)
> +	if (enter) {
>  		apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_GUEST_PMI_VECTOR |
>  			   (guest_lvtpc & APIC_LVT_MASKED));
> -	else
> +		this_cpu_write(pmi_vector_is_nmi, false);
> +	} else {
>  		apic_write(APIC_LVTPC, APIC_DM_NMI);
> +		this_cpu_write(pmi_vector_is_nmi, true);
> +	}
>  }
>  
>  static struct pmu pmu = {

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 36/58] KVM: x86/pmu: Allow writing to fixed counter selector if counter is exposed
  2024-08-01  4:58 ` [RFC PATCH v3 36/58] KVM: x86/pmu: Allow writing to fixed counter selector if counter is exposed Mingwei Zhang
@ 2024-09-02  7:59   ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-09-02  7:59 UTC (permalink / raw)
  To: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 8/1/2024 12:58 PM, Mingwei Zhang wrote:
> Allow writing to fixed counter selector if counter is exposed. If this
> fixed counter is filtered out, this counter won't be enabled on HW.
>
> Passthrough PMU implements the context switch at the VM Enter/Exit boundary,
> so the guest value cannot be directly written to HW since the HW PMU is owned
> by the host. Introduce a new field fixed_ctr_ctrl_hw in kvm_pmu to cache the
> guest value, which will be assigned to HW at PMU context restore.
>
> Since the passthrough PMU intercepts writes to the fixed counter selector,
> there is no need to read the value at PMU context save, but still clear the
> fixed counter ctrl MSR and counters when switching out to the host PMU.
>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/vmx/pmu_intel.c    | 28 ++++++++++++++++++++++++----
>  2 files changed, 25 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index e5c288d4264f..93c17da8271d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -549,6 +549,7 @@ struct kvm_pmu {
>  	unsigned nr_arch_fixed_counters;
>  	unsigned available_event_types;
>  	u64 fixed_ctr_ctrl;
> +	u64 fixed_ctr_ctrl_hw;
>  	u64 fixed_ctr_ctrl_mask;
>  	u64 global_ctrl;
>  	u64 global_status;
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 0cd38c5632ee..c61936266cbd 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -34,6 +34,25 @@
>  
>  #define MSR_PMC_FULL_WIDTH_BIT      (MSR_IA32_PMC0 - MSR_IA32_PERFCTR0)
>  
> +static void reprogram_fixed_counters_in_passthrough_pmu(struct kvm_pmu *pmu, u64 data)
> +{
> +	struct kvm_pmc *pmc;
> +	u64 new_data = 0;
> +	int i;
> +
> +	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> +		pmc = get_fixed_pmc(pmu, MSR_CORE_PERF_FIXED_CTR0 + i);
> +		if (check_pmu_event_filter(pmc)) {
> +			pmc->current_config = fixed_ctrl_field(data, i);
> +			new_data |= (pmc->current_config << (i * 4));

Since we already have the macro intel_fixed_bits_by_idx() to manipulate
fixed_cntr_ctrl, we'd better use it.

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 3dbeb41b85ab..0aa58bffb99d 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -44,7 +44,7 @@ static void reprogram_fixed_counters_in_passthrough_pmu(struct kvm_pmu *pmu, u64
                pmc = get_fixed_pmc(pmu, MSR_CORE_PERF_FIXED_CTR0 + i);
                if (check_pmu_event_filter(pmc)) {
                        pmc->current_config = fixed_ctrl_field(data, i);
-                       new_data |= (pmc->current_config << (i * 4));
+                       new_data |= intel_fixed_bits_by_idx(i, pmc->current_config);
                } else {
                        pmc->counter = 0;
                }


> +		} else {
> +			pmc->counter = 0;
> +		}
> +	}
> +	pmu->fixed_ctr_ctrl_hw = new_data;
> +	pmu->fixed_ctr_ctrl = data;
> +}
> +
>  static void reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data)
>  {
>  	struct kvm_pmc *pmc;
> @@ -351,7 +370,9 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  		if (data & pmu->fixed_ctr_ctrl_mask)
>  			return 1;
>  
> -		if (pmu->fixed_ctr_ctrl != data)
> +		if (is_passthrough_pmu_enabled(vcpu))
> +			reprogram_fixed_counters_in_passthrough_pmu(pmu, data);
> +		else if (pmu->fixed_ctr_ctrl != data)
>  			reprogram_fixed_counters(pmu, data);
>  		break;
>  	case MSR_IA32_PEBS_ENABLE:
> @@ -820,13 +841,12 @@ static void intel_save_guest_pmu_context(struct kvm_vcpu *vcpu)
>  	if (pmu->global_status)
>  		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
>  
> -	rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
>  	/*
>  	 * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
>  	 * also avoid these guest fixed counters get accidentially enabled
>  	 * during host running when host enable global ctrl.
>  	 */
> -	if (pmu->fixed_ctr_ctrl)
> +	if (pmu->fixed_ctr_ctrl_hw)
>  		wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>  }
>  
> @@ -844,7 +864,7 @@ static void intel_restore_guest_pmu_context(struct kvm_vcpu *vcpu)
>  	if (pmu->global_status & toggle)
>  		wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status & toggle);
>  
> -	wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> +	wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl_hw);
>  }
>  
>  struct kvm_pmu_ops intel_pmu_ops __initdata = {

^ permalink raw reply related	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 06/58] perf: Support get/put passthrough PMU interfaces
  2024-08-01  4:58 ` [RFC PATCH v3 06/58] perf: Support get/put passthrough PMU interfaces Mingwei Zhang
@ 2024-09-06 10:59   ` Mi, Dapeng
  2024-09-06 15:40     ` Liang, Kan
  0 siblings, 1 reply; 183+ messages in thread
From: Mi, Dapeng @ 2024-09-06 10:59 UTC (permalink / raw)
  To: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 8/1/2024 12:58 PM, Mingwei Zhang wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
>
> Currently, the guest and host share the PMU resources when a guest is
> running. KVM has to create an extra virtual event to simulate the
> guest's event, which brings several issues, e.g., high overhead and
> inaccuracy.
>
> A new passthrough PMU method is proposed to address the issue. It requires
> that the PMU resources can be fully occupied by the guest while it's
> running. Two new interfaces are implemented to fulfill the requirement.
> The hypervisor should invoke the interface while creating a guest which
> wants the passthrough PMU capability.
>
> The PMU resources should only be temporarily occupied as a whole when a
> guest is running. When the guest is out, the PMU resources are still
> shared among different users.
>
> The exclude_guest event modifier is used to guarantee the exclusive
> occupation of the PMU resources. When creating a guest, the hypervisor
> should check whether there are !exclude_guest events in the system.
> If yes, the creation should fail. Because some PMU resources have been
> occupied by other users.
> If no, the PMU resources can be safely accessed by the guest directly.
> Perf guarantees that no new !exclude_guest events are created when a
> guest is running.
>
> Only the passthrough PMU is affected, but not for other PMU e.g., uncore
> and SW PMU. The behavior of those PMUs are not changed. The guest
> enter/exit interfaces should only impact the supported PMUs.
> Add a new PERF_PMU_CAP_PASSTHROUGH_VPMU flag to indicate the PMUs that
> support the feature.
>
> Add nr_include_guest_events to track the !exclude_guest events of PMU
> with PERF_PMU_CAP_PASSTHROUGH_VPMU.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  include/linux/perf_event.h | 10 ++++++
>  kernel/events/core.c       | 66 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 76 insertions(+)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index a5304ae8c654..45d1ea82aa21 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -291,6 +291,7 @@ struct perf_event_pmu_context;
>  #define PERF_PMU_CAP_NO_EXCLUDE			0x0040
>  #define PERF_PMU_CAP_AUX_OUTPUT			0x0080
>  #define PERF_PMU_CAP_EXTENDED_HW_TYPE		0x0100
> +#define PERF_PMU_CAP_PASSTHROUGH_VPMU		0x0200
>  
>  struct perf_output_handle;
>  
> @@ -1728,6 +1729,8 @@ extern void perf_event_task_tick(void);
>  extern int perf_event_account_interrupt(struct perf_event *event);
>  extern int perf_event_period(struct perf_event *event, u64 value);
>  extern u64 perf_event_pause(struct perf_event *event, bool reset);
> +int perf_get_mediated_pmu(void);
> +void perf_put_mediated_pmu(void);
>  #else /* !CONFIG_PERF_EVENTS: */
>  static inline void *
>  perf_aux_output_begin(struct perf_output_handle *handle,
> @@ -1814,6 +1817,13 @@ static inline u64 perf_event_pause(struct perf_event *event, bool reset)
>  {
>  	return 0;
>  }
> +
> +static inline int perf_get_mediated_pmu(void)
> +{
> +	return 0;
> +}
> +
> +static inline void perf_put_mediated_pmu(void)			{ }
>  #endif
>  
>  #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 8f908f077935..45868d276cde 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -402,6 +402,20 @@ static atomic_t nr_bpf_events __read_mostly;
>  static atomic_t nr_cgroup_events __read_mostly;
>  static atomic_t nr_text_poke_events __read_mostly;
>  static atomic_t nr_build_id_events __read_mostly;
> +static atomic_t nr_include_guest_events __read_mostly;
> +
> +static atomic_t nr_mediated_pmu_vms;
> +static DEFINE_MUTEX(perf_mediated_pmu_mutex);
> +
> +/* !exclude_guest event of PMU with PERF_PMU_CAP_PASSTHROUGH_VPMU */
> +static inline bool is_include_guest_event(struct perf_event *event)
> +{
> +	if ((event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) &&
> +	    !event->attr.exclude_guest)
> +		return true;
> +
> +	return false;
> +}
>  
>  static LIST_HEAD(pmus);
>  static DEFINE_MUTEX(pmus_lock);
> @@ -5212,6 +5226,9 @@ static void _free_event(struct perf_event *event)
>  
>  	unaccount_event(event);
>  
> +	if (is_include_guest_event(event))
> +		atomic_dec(&nr_include_guest_events);
> +
>  	security_perf_event_free(event);
>  
>  	if (event->rb) {
> @@ -5769,6 +5786,36 @@ u64 perf_event_pause(struct perf_event *event, bool reset)
>  }
>  EXPORT_SYMBOL_GPL(perf_event_pause);
>  
> +/*
> + * Currently invoked at VM creation to
> + * - Check whether there are existing !exclude_guest events of PMU with
> + *   PERF_PMU_CAP_PASSTHROUGH_VPMU
> + * - Set nr_mediated_pmu_vms to prevent !exclude_guest event creation on
> + *   PMUs with PERF_PMU_CAP_PASSTHROUGH_VPMU
> + *
> + * No impact for the PMU without PERF_PMU_CAP_PASSTHROUGH_VPMU. The perf
> + * still owns all the PMU resources.
> + */
> +int perf_get_mediated_pmu(void)
> +{
> +	guard(mutex)(&perf_mediated_pmu_mutex);
> +	if (atomic_inc_not_zero(&nr_mediated_pmu_vms))
> +		return 0;
> +
> +	if (atomic_read(&nr_include_guest_events))
> +		return -EBUSY;
> +
> +	atomic_inc(&nr_mediated_pmu_vms);
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(perf_get_mediated_pmu);
> +
> +void perf_put_mediated_pmu(void)
> +{
> +	atomic_dec(&nr_mediated_pmu_vms);
> +}
> +EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
> +
>  /*
>   * Holding the top-level event's child_mutex means that any
>   * descendant process that has inherited this event will block
> @@ -11907,6 +11954,17 @@ static void account_event(struct perf_event *event)
>  	account_pmu_sb_event(event);
>  }
>  
> +static int perf_account_include_guest_event(void)
> +{
> +	guard(mutex)(&perf_mediated_pmu_mutex);
> +
> +	if (atomic_read(&nr_mediated_pmu_vms))
> +		return -EACCES;

Kan, Namhyung posted a patchset
https://lore.kernel.org/all/20240904064131.2377873-1-namhyung@kernel.org/
which would stop setting the exclude_guest flag in perf tools by default.
This may impact the current mediated vPMU solution, but fortunately the
patchset provides a fallback mechanism to add the exclude_guest flag if the
kernel returns "EOPNOTSUPP".

So we'd better return "EOPNOTSUPP" instead of "EACCES" here. BTW, returning
"EOPNOTSUPP" here looks more reasonable than "EACCES".


> +
> +	atomic_inc(&nr_include_guest_events);
> +	return 0;
> +}
> +
>  /*
>   * Allocate and initialize an event structure
>   */
> @@ -12114,11 +12172,19 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>  	if (err)
>  		goto err_callchain_buffer;
>  
> +	if (is_include_guest_event(event)) {
> +		err = perf_account_include_guest_event();
> +		if (err)
> +			goto err_security_alloc;
> +	}
> +
>  	/* symmetric to unaccount_event() in _free_event() */
>  	account_event(event);
>  
>  	return event;
>  
> +err_security_alloc:
> +	security_perf_event_free(event);
>  err_callchain_buffer:
>  	if (!event->parent) {
>  		if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 06/58] perf: Support get/put passthrough PMU interfaces
  2024-09-06 10:59   ` Mi, Dapeng
@ 2024-09-06 15:40     ` Liang, Kan
  2024-09-09 22:17       ` Namhyung Kim
  0 siblings, 1 reply; 183+ messages in thread
From: Liang, Kan @ 2024-09-06 15:40 UTC (permalink / raw)
  To: Mi, Dapeng, Mingwei Zhang, Sean Christopherson, Paolo Bonzini,
	Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users



On 2024-09-06 6:59 a.m., Mi, Dapeng wrote:
> 
> On 8/1/2024 12:58 PM, Mingwei Zhang wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> Currently, the guest and host share the PMU resources when a guest is
>> running. KVM has to create an extra virtual event to simulate the
>> guest's event, which brings several issues, e.g., high overhead and
>> inaccuracy.
>>
>> A new passthrough PMU method is proposed to address the issue. It requires
>> that the PMU resources can be fully occupied by the guest while it's
>> running. Two new interfaces are implemented to fulfill the requirement.
>> The hypervisor should invoke the interface while creating a guest which
>> wants the passthrough PMU capability.
>>
>> The PMU resources should only be temporarily occupied as a whole when a
>> guest is running. When the guest is out, the PMU resources are still
>> shared among different users.
>>
>> The exclude_guest event modifier is used to guarantee the exclusive
>> occupation of the PMU resources. When creating a guest, the hypervisor
>> should check whether there are !exclude_guest events in the system.
>> If yes, the creation should fail. Because some PMU resources have been
>> occupied by other users.
>> If no, the PMU resources can be safely accessed by the guest directly.
>> Perf guarantees that no new !exclude_guest events are created when a
>> guest is running.
>>
>> Only the passthrough PMU is affected, not other PMUs, e.g., uncore
>> and SW PMUs. The behavior of those PMUs is not changed. The guest
>> enter/exit interfaces should only impact the supported PMUs.
>> Add a new PERF_PMU_CAP_PASSTHROUGH_VPMU flag to indicate the PMUs that
>> support the feature.
>>
>> Add nr_include_guest_events to track the !exclude_guest events of PMU
>> with PERF_PMU_CAP_PASSTHROUGH_VPMU.
>>
>> Suggested-by: Sean Christopherson <seanjc@google.com>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>  include/linux/perf_event.h | 10 ++++++
>>  kernel/events/core.c       | 66 ++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 76 insertions(+)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index a5304ae8c654..45d1ea82aa21 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -291,6 +291,7 @@ struct perf_event_pmu_context;
>>  #define PERF_PMU_CAP_NO_EXCLUDE			0x0040
>>  #define PERF_PMU_CAP_AUX_OUTPUT			0x0080
>>  #define PERF_PMU_CAP_EXTENDED_HW_TYPE		0x0100
>> +#define PERF_PMU_CAP_PASSTHROUGH_VPMU		0x0200
>>  
>>  struct perf_output_handle;
>>  
>> @@ -1728,6 +1729,8 @@ extern void perf_event_task_tick(void);
>>  extern int perf_event_account_interrupt(struct perf_event *event);
>>  extern int perf_event_period(struct perf_event *event, u64 value);
>>  extern u64 perf_event_pause(struct perf_event *event, bool reset);
>> +int perf_get_mediated_pmu(void);
>> +void perf_put_mediated_pmu(void);
>>  #else /* !CONFIG_PERF_EVENTS: */
>>  static inline void *
>>  perf_aux_output_begin(struct perf_output_handle *handle,
>> @@ -1814,6 +1817,13 @@ static inline u64 perf_event_pause(struct perf_event *event, bool reset)
>>  {
>>  	return 0;
>>  }
>> +
>> +static inline int perf_get_mediated_pmu(void)
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline void perf_put_mediated_pmu(void)			{ }
>>  #endif
>>  
>>  #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 8f908f077935..45868d276cde 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -402,6 +402,20 @@ static atomic_t nr_bpf_events __read_mostly;
>>  static atomic_t nr_cgroup_events __read_mostly;
>>  static atomic_t nr_text_poke_events __read_mostly;
>>  static atomic_t nr_build_id_events __read_mostly;
>> +static atomic_t nr_include_guest_events __read_mostly;
>> +
>> +static atomic_t nr_mediated_pmu_vms;
>> +static DEFINE_MUTEX(perf_mediated_pmu_mutex);
>> +
>> +/* !exclude_guest event of PMU with PERF_PMU_CAP_PASSTHROUGH_VPMU */
>> +static inline bool is_include_guest_event(struct perf_event *event)
>> +{
>> +	if ((event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) &&
>> +	    !event->attr.exclude_guest)
>> +		return true;
>> +
>> +	return false;
>> +}
>>  
>>  static LIST_HEAD(pmus);
>>  static DEFINE_MUTEX(pmus_lock);
>> @@ -5212,6 +5226,9 @@ static void _free_event(struct perf_event *event)
>>  
>>  	unaccount_event(event);
>>  
>> +	if (is_include_guest_event(event))
>> +		atomic_dec(&nr_include_guest_events);
>> +
>>  	security_perf_event_free(event);
>>  
>>  	if (event->rb) {
>> @@ -5769,6 +5786,36 @@ u64 perf_event_pause(struct perf_event *event, bool reset)
>>  }
>>  EXPORT_SYMBOL_GPL(perf_event_pause);
>>  
>> +/*
>> + * Currently invoked at VM creation to
>> + * - Check whether there are existing !exclude_guest events of PMU with
>> + *   PERF_PMU_CAP_PASSTHROUGH_VPMU
>> + * - Set nr_mediated_pmu_vms to prevent !exclude_guest event creation on
>> + *   PMUs with PERF_PMU_CAP_PASSTHROUGH_VPMU
>> + *
>> + * No impact for the PMU without PERF_PMU_CAP_PASSTHROUGH_VPMU. The perf
>> + * still owns all the PMU resources.
>> + */
>> +int perf_get_mediated_pmu(void)
>> +{
>> +	guard(mutex)(&perf_mediated_pmu_mutex);
>> +	if (atomic_inc_not_zero(&nr_mediated_pmu_vms))
>> +		return 0;
>> +
>> +	if (atomic_read(&nr_include_guest_events))
>> +		return -EBUSY;
>> +
>> +	atomic_inc(&nr_mediated_pmu_vms);
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(perf_get_mediated_pmu);
>> +
>> +void perf_put_mediated_pmu(void)
>> +{
>> +	atomic_dec(&nr_mediated_pmu_vms);
>> +}
>> +EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
>> +
>>  /*
>>   * Holding the top-level event's child_mutex means that any
>>   * descendant process that has inherited this event will block
>> @@ -11907,6 +11954,17 @@ static void account_event(struct perf_event *event)
>>  	account_pmu_sb_event(event);
>>  }
>>  
>> +static int perf_account_include_guest_event(void)
>> +{
>> +	guard(mutex)(&perf_mediated_pmu_mutex);
>> +
>> +	if (atomic_read(&nr_mediated_pmu_vms))
>> +		return -EACCES;
> 
> Kan, Namhyung posted a patchset
> https://lore.kernel.org/all/20240904064131.2377873-1-namhyung@kernel.org/
> which would remove to set exclude_guest flag from perf tools by default.
> This may impact current mediated vPMU solution, but fortunately the
> patchset provides a fallback mechanism to add exclude_guest flag if kernel
> returns "EOPNOTSUPP".
> 
> So we'd better return "EOPNOTSUPP" instead of "EACCES" here. BTW, returning
> "EOPNOTSUPP" here looks more reasonable than "EACCES".

It seems the existing Apple M1 PMU already returns "EOPNOTSUPP" for
!exclude_guest events. Yes, we should use the same error code.

Thanks,
Kan
> 
> 
>> +
>> +	atomic_inc(&nr_include_guest_events);
>> +	return 0;
>> +}
>> +
>>  /*
>>   * Allocate and initialize an event structure
>>   */
>> @@ -12114,11 +12172,19 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>>  	if (err)
>>  		goto err_callchain_buffer;
>>  
>> +	if (is_include_guest_event(event)) {
>> +		err = perf_account_include_guest_event();
>> +		if (err)
>> +			goto err_security_alloc;
>> +	}
>> +
>>  	/* symmetric to unaccount_event() in _free_event() */
>>  	account_event(event);
>>  
>>  	return event;
>>  
>> +err_security_alloc:
>> +	security_perf_event_free(event);
>>  err_callchain_buffer:
>>  	if (!event->parent) {
>>  		if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
> 

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 12/58] perf: core/x86: Register a new vector for KVM GUEST PMI
  2024-08-01  4:58 ` [RFC PATCH v3 12/58] perf: core/x86: Register a new vector for KVM GUEST PMI Mingwei Zhang
@ 2024-09-09 22:11   ` Colton Lewis
  2024-09-10  4:59     ` Mi, Dapeng
  0 siblings, 1 reply; 183+ messages in thread
From: Colton Lewis @ 2024-09-09 22:11 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: seanjc, pbonzini, xiong.y.zhang, dapeng1.mi, kan.liang, zhenyuw,
	manali.shukla, sandipan.das, jmattson, eranian, irogers, namhyung,
	mizhang, gce-passthrou-pmu-dev, samantha.alt, zhiyuan.lv,
	yanfei.xu, like.xu.linux, peterz, rananta, kvm, linux-perf-users

Hello,

Mingwei Zhang <mizhang@google.com> writes:

> From: Xiong Zhang <xiong.y.zhang@linux.intel.com>

> Create a new vector in the host IDT for kvm guest PMI handling within
> mediated passthrough vPMU. In addition, guest PMI handler registration
> is added into x86_set_kvm_irq_handler().

> This is the preparation work to support mediated passthrough vPMU to
> handle kvm guest PMIs without interference from PMI handler of the host
> PMU.

> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>   arch/x86/include/asm/hardirq.h                |  1 +
>   arch/x86/include/asm/idtentry.h               |  1 +
>   arch/x86/include/asm/irq_vectors.h            |  5 ++++-
>   arch/x86/kernel/idt.c                         |  1 +
>   arch/x86/kernel/irq.c                         | 21 +++++++++++++++++++
>   .../beauty/arch/x86/include/asm/irq_vectors.h |  5 ++++-
>   6 files changed, 32 insertions(+), 2 deletions(-)

> diff --git a/arch/x86/include/asm/hardirq.h  
> b/arch/x86/include/asm/hardirq.h
> index c67fa6ad098a..42a396763c8d 100644
> --- a/arch/x86/include/asm/hardirq.h
> +++ b/arch/x86/include/asm/hardirq.h
> @@ -19,6 +19,7 @@ typedef struct {
>   	unsigned int kvm_posted_intr_ipis;
>   	unsigned int kvm_posted_intr_wakeup_ipis;
>   	unsigned int kvm_posted_intr_nested_ipis;
> +	unsigned int kvm_guest_pmis;
>   #endif
>   	unsigned int x86_platform_ipis;	/* arch dependent */
>   	unsigned int apic_perf_irqs;
> diff --git a/arch/x86/include/asm/idtentry.h  
> b/arch/x86/include/asm/idtentry.h
> index d4f24499b256..7b1e3e542b1d 100644
> --- a/arch/x86/include/asm/idtentry.h
> +++ b/arch/x86/include/asm/idtentry.h
> @@ -745,6 +745,7 @@ DECLARE_IDTENTRY_SYSVEC(IRQ_WORK_VECTOR,		 
> sysvec_irq_work);
>   DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_VECTOR,		sysvec_kvm_posted_intr_ipi);
>   DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_WAKEUP_VECTOR,	 
> sysvec_kvm_posted_intr_wakeup_ipi);
>   DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_NESTED_VECTOR,	 
> sysvec_kvm_posted_intr_nested_ipi);
> +DECLARE_IDTENTRY_SYSVEC(KVM_GUEST_PMI_VECTOR,	         
> sysvec_kvm_guest_pmi_handler);
>   #else
>   # define fred_sysvec_kvm_posted_intr_ipi		NULL
>   # define fred_sysvec_kvm_posted_intr_wakeup_ipi		NULL
> diff --git a/arch/x86/include/asm/irq_vectors.h  
> b/arch/x86/include/asm/irq_vectors.h
> index 13aea8fc3d45..ada270e6f5cb 100644
> --- a/arch/x86/include/asm/irq_vectors.h
> +++ b/arch/x86/include/asm/irq_vectors.h
> @@ -77,7 +77,10 @@
>    */
>   #define IRQ_WORK_VECTOR			0xf6

> -/* 0xf5 - unused, was UV_BAU_MESSAGE */
> +#if IS_ENABLED(CONFIG_KVM)
> +#define KVM_GUEST_PMI_VECTOR		0xf5
> +#endif
> +
>   #define DEFERRED_ERROR_VECTOR		0xf4

>   /* Vector on which hypervisor callbacks will be delivered */
> diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
> index f445bec516a0..0bec4c7e2308 100644
> --- a/arch/x86/kernel/idt.c
> +++ b/arch/x86/kernel/idt.c
> @@ -157,6 +157,7 @@ static const __initconst struct idt_data apic_idts[]  
> = {
>   	INTG(POSTED_INTR_VECTOR,		asm_sysvec_kvm_posted_intr_ipi),
>   	INTG(POSTED_INTR_WAKEUP_VECTOR,		asm_sysvec_kvm_posted_intr_wakeup_ipi),
>   	INTG(POSTED_INTR_NESTED_VECTOR,		asm_sysvec_kvm_posted_intr_nested_ipi),
> +	INTG(KVM_GUEST_PMI_VECTOR,		asm_sysvec_kvm_guest_pmi_handler),


There is a subtle inconsistency in that this code is guarded by
CONFIG_HAVE_KVM while the #define KVM_GUEST_PMI_VECTOR is guarded by
CONFIG_KVM. Beats me why there are two different flags that are so
similar, but it is possible they have different values, leading to
compilation errors.

I happened to hit this compilation error because make defconfig will
define CONFIG_HAVE_KVM but not CONFIG_KVM. CONFIG_KVM is the more strict
of the two because it depends on CONFIG_HAVE_KVM.

This code should also be guarded on CONFIG_KVM.
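
One way to do that might be to wrap the entry itself (untested sketch,
assuming the entry stays in apic_idts[]):

#if IS_ENABLED(CONFIG_KVM)
	INTG(KVM_GUEST_PMI_VECTOR,		asm_sysvec_kvm_guest_pmi_handler),
#endif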

>   # endif
>   # ifdef CONFIG_IRQ_WORK
>   	INTG(IRQ_WORK_VECTOR,			asm_sysvec_irq_work),
> diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
> index 18cd418fe106..b29714e23fc4 100644
> --- a/arch/x86/kernel/irq.c
> +++ b/arch/x86/kernel/irq.c
> @@ -183,6 +183,12 @@ int arch_show_interrupts(struct seq_file *p, int  
> prec)
>   		seq_printf(p, "%10u ",
>   			   irq_stats(j)->kvm_posted_intr_wakeup_ipis);
>   	seq_puts(p, "  Posted-interrupt wakeup event\n");
> +
> +	seq_printf(p, "%*s: ", prec, "VPMU");
> +	for_each_online_cpu(j)
> +		seq_printf(p, "%10u ",
> +			   irq_stats(j)->kvm_guest_pmis);
> +	seq_puts(p, " KVM GUEST PMI\n");
>   #endif
>   #ifdef CONFIG_X86_POSTED_MSI
>   	seq_printf(p, "%*s: ", prec, "PMN");
> @@ -311,6 +317,7 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_x86_platform_ipi)
>   #if IS_ENABLED(CONFIG_KVM)
>   static void dummy_handler(void) {}
>   static void (*kvm_posted_intr_wakeup_handler)(void) = dummy_handler;
> +static void (*kvm_guest_pmi_handler)(void) = dummy_handler;

>   void x86_set_kvm_irq_handler(u8 vector, void (*handler)(void))
>   {
> @@ -321,6 +328,10 @@ void x86_set_kvm_irq_handler(u8 vector, void  
> (*handler)(void))
>   	    (handler == dummy_handler ||
>   	     kvm_posted_intr_wakeup_handler == dummy_handler))
>   		kvm_posted_intr_wakeup_handler = handler;
> +	else if (vector == KVM_GUEST_PMI_VECTOR &&
> +		 (handler == dummy_handler ||
> +		  kvm_guest_pmi_handler == dummy_handler))
> +		kvm_guest_pmi_handler = handler;
>   	else
>   		WARN_ON_ONCE(1);

> @@ -356,6 +367,16 @@  
> DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_kvm_posted_intr_nested_ipi)
>   	apic_eoi();
>   	inc_irq_stat(kvm_posted_intr_nested_ipis);
>   }
> +
> +/*
> + * Handler for KVM_GUEST_PMI_VECTOR.
> + */
> +DEFINE_IDTENTRY_SYSVEC(sysvec_kvm_guest_pmi_handler)
> +{
> +	apic_eoi();
> +	inc_irq_stat(kvm_guest_pmis);
> +	kvm_guest_pmi_handler();
> +}
>   #endif

>   #ifdef CONFIG_X86_POSTED_MSI
> diff --git a/tools/perf/trace/beauty/arch/x86/include/asm/irq_vectors.h  
> b/tools/perf/trace/beauty/arch/x86/include/asm/irq_vectors.h
> index 13aea8fc3d45..670dcee46631 100644
> --- a/tools/perf/trace/beauty/arch/x86/include/asm/irq_vectors.h
> +++ b/tools/perf/trace/beauty/arch/x86/include/asm/irq_vectors.h
> @@ -77,7 +77,10 @@
>    */
>   #define IRQ_WORK_VECTOR			0xf6

> -/* 0xf5 - unused, was UV_BAU_MESSAGE */
> +#if IS_ENABLED(CONFIG_KVM)
> +#define KVM_GUEST_PMI_VECTOR           0xf5
> +#endif
> +
>   #define DEFERRED_ERROR_VECTOR		0xf4

>   /* Vector on which hypervisor callbacks will be delivered */

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 15/58] perf/x86: Support switch_interrupt interface
  2024-08-01  4:58 ` [RFC PATCH v3 15/58] perf/x86: Support switch_interrupt interface Mingwei Zhang
@ 2024-09-09 22:11   ` Colton Lewis
  2024-09-10  5:00     ` Mi, Dapeng
  2024-10-24 19:45     ` Chen, Zide
  0 siblings, 2 replies; 183+ messages in thread
From: Colton Lewis @ 2024-09-09 22:11 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: seanjc, pbonzini, xiong.y.zhang, dapeng1.mi, kan.liang, zhenyuw,
	manali.shukla, sandipan.das, jmattson, eranian, irogers, namhyung,
	mizhang, gce-passthrou-pmu-dev, samantha.alt, zhiyuan.lv,
	yanfei.xu, like.xu.linux, peterz, rananta, kvm, linux-perf-users

Mingwei Zhang <mizhang@google.com> writes:

> From: Kan Liang <kan.liang@linux.intel.com>

> Implement switch_interrupt interface for x86 PMU, switch PMI to dedicated
> KVM_GUEST_PMI_VECTOR at perf guest enter, and switch PMI back to
> NMI at perf guest exit.

> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>   arch/x86/events/core.c | 11 +++++++++++
>   1 file changed, 11 insertions(+)

> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 5bf78cd619bf..b17ef8b6c1a6 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2673,6 +2673,15 @@ static bool x86_pmu_filter(struct pmu *pmu, int  
> cpu)
>   	return ret;
>   }

> +static void x86_pmu_switch_interrupt(bool enter, u32 guest_lvtpc)
> +{
> +	if (enter)
> +		apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_GUEST_PMI_VECTOR |
> +			   (guest_lvtpc & APIC_LVT_MASKED));
> +	else
> +		apic_write(APIC_LVTPC, APIC_DM_NMI);
> +}
> +

This is similar to the issue I pointed out in an earlier patch: #define
KVM_GUEST_PMI_VECTOR is guarded by CONFIG_KVM but this code is not,
which can result in compile errors.
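
One way to avoid that, as a rough and untested sketch only, would be to
compile the helper out when KVM is not enabled, e.g.:

#if IS_ENABLED(CONFIG_KVM)
static void x86_pmu_switch_interrupt(bool enter, u32 guest_lvtpc)
{
	if (enter)
		apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_GUEST_PMI_VECTOR |
			   (guest_lvtpc & APIC_LVT_MASKED));
	else
		apic_write(APIC_LVTPC, APIC_DM_NMI);
}
#else
/* KVM_GUEST_PMI_VECTOR does not exist without CONFIG_KVM. */
static void x86_pmu_switch_interrupt(bool enter, u32 guest_lvtpc) { }
#endif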

>   static struct pmu pmu = {
>   	.pmu_enable		= x86_pmu_enable,
>   	.pmu_disable		= x86_pmu_disable,
> @@ -2702,6 +2711,8 @@ static struct pmu pmu = {
>   	.aux_output_match	= x86_pmu_aux_output_match,

>   	.filter			= x86_pmu_filter,
> +
> +	.switch_interrupt	= x86_pmu_switch_interrupt,
>   };

>   void arch_perf_update_userpage(struct perf_event *event,

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 06/58] perf: Support get/put passthrough PMU interfaces
  2024-09-06 15:40     ` Liang, Kan
@ 2024-09-09 22:17       ` Namhyung Kim
  0 siblings, 0 replies; 183+ messages in thread
From: Namhyung Kim @ 2024-09-09 22:17 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Mi, Dapeng, Mingwei Zhang, Sean Christopherson, Paolo Bonzini,
	Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das,
	Jim Mattson, Stephane Eranian, Ian Rogers, gce-passthrou-pmu-dev,
	Samantha Alt, Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

Hello,

On Fri, Sep 06, 2024 at 11:40:51AM -0400, Liang, Kan wrote:
> 
> 
> On 2024-09-06 6:59 a.m., Mi, Dapeng wrote:
> > 
> > On 8/1/2024 12:58 PM, Mingwei Zhang wrote:
> >> From: Kan Liang <kan.liang@linux.intel.com>
> >>
> >> Currently, the guest and host share the PMU resources when a guest is
> >> running. KVM has to create an extra virtual event to simulate the
> >> guest's event, which brings several issues, e.g., high overhead and
> >> inaccuracy.
> >>
> >> A new passthrough PMU method is proposed to address the issue. It requires
> >> that the PMU resources can be fully occupied by the guest while it's
> >> running. Two new interfaces are implemented to fulfill the requirement.
> >> The hypervisor should invoke the interface while creating a guest which
> >> wants the passthrough PMU capability.
> >>
> >> The PMU resources should only be temporarily occupied as a whole when a
> >> guest is running. When the guest is out, the PMU resources are still
> >> shared among different users.
> >>
> >> The exclude_guest event modifier is used to guarantee the exclusive
> >> occupation of the PMU resources. When creating a guest, the hypervisor
> >> should check whether there are !exclude_guest events in the system.
> >> If yes, the creation should fail, because some PMU resources have been
> >> occupied by other users.
> >> If no, the PMU resources can be safely accessed by the guest directly.
> >> Perf guarantees that no new !exclude_guest events are created when a
> >> guest is running.
> >>
> >> Only the passthrough PMU is affected, not other PMUs, e.g., uncore
> >> and SW PMUs. The behavior of those PMUs is not changed. The guest
> >> enter/exit interfaces should only impact the supported PMUs.
> >> Add a new PERF_PMU_CAP_PASSTHROUGH_VPMU flag to indicate the PMUs that
> >> support the feature.
> >>
> >> Add nr_include_guest_events to track the !exclude_guest events of PMU
> >> with PERF_PMU_CAP_PASSTHROUGH_VPMU.
> >>
> >> Suggested-by: Sean Christopherson <seanjc@google.com>
> >> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> >> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
> >> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> >> ---
> >>  include/linux/perf_event.h | 10 ++++++
> >>  kernel/events/core.c       | 66 ++++++++++++++++++++++++++++++++++++++
> >>  2 files changed, 76 insertions(+)
> >>
> >> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> >> index a5304ae8c654..45d1ea82aa21 100644
> >> --- a/include/linux/perf_event.h
> >> +++ b/include/linux/perf_event.h
> >> @@ -291,6 +291,7 @@ struct perf_event_pmu_context;
> >>  #define PERF_PMU_CAP_NO_EXCLUDE			0x0040
> >>  #define PERF_PMU_CAP_AUX_OUTPUT			0x0080
> >>  #define PERF_PMU_CAP_EXTENDED_HW_TYPE		0x0100
> >> +#define PERF_PMU_CAP_PASSTHROUGH_VPMU		0x0200
> >>  
> >>  struct perf_output_handle;
> >>  
> >> @@ -1728,6 +1729,8 @@ extern void perf_event_task_tick(void);
> >>  extern int perf_event_account_interrupt(struct perf_event *event);
> >>  extern int perf_event_period(struct perf_event *event, u64 value);
> >>  extern u64 perf_event_pause(struct perf_event *event, bool reset);
> >> +int perf_get_mediated_pmu(void);
> >> +void perf_put_mediated_pmu(void);
> >>  #else /* !CONFIG_PERF_EVENTS: */
> >>  static inline void *
> >>  perf_aux_output_begin(struct perf_output_handle *handle,
> >> @@ -1814,6 +1817,13 @@ static inline u64 perf_event_pause(struct perf_event *event, bool reset)
> >>  {
> >>  	return 0;
> >>  }
> >> +
> >> +static inline int perf_get_mediated_pmu(void)
> >> +{
> >> +	return 0;
> >> +}
> >> +
> >> +static inline void perf_put_mediated_pmu(void)			{ }
> >>  #endif
> >>  
> >>  #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
> >> diff --git a/kernel/events/core.c b/kernel/events/core.c
> >> index 8f908f077935..45868d276cde 100644
> >> --- a/kernel/events/core.c
> >> +++ b/kernel/events/core.c
> >> @@ -402,6 +402,20 @@ static atomic_t nr_bpf_events __read_mostly;
> >>  static atomic_t nr_cgroup_events __read_mostly;
> >>  static atomic_t nr_text_poke_events __read_mostly;
> >>  static atomic_t nr_build_id_events __read_mostly;
> >> +static atomic_t nr_include_guest_events __read_mostly;
> >> +
> >> +static atomic_t nr_mediated_pmu_vms;
> >> +static DEFINE_MUTEX(perf_mediated_pmu_mutex);
> >> +
> >> +/* !exclude_guest event of PMU with PERF_PMU_CAP_PASSTHROUGH_VPMU */
> >> +static inline bool is_include_guest_event(struct perf_event *event)
> >> +{
> >> +	if ((event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) &&
> >> +	    !event->attr.exclude_guest)
> >> +		return true;
> >> +
> >> +	return false;
> >> +}
> >>  
> >>  static LIST_HEAD(pmus);
> >>  static DEFINE_MUTEX(pmus_lock);
> >> @@ -5212,6 +5226,9 @@ static void _free_event(struct perf_event *event)
> >>  
> >>  	unaccount_event(event);
> >>  
> >> +	if (is_include_guest_event(event))
> >> +		atomic_dec(&nr_include_guest_events);
> >> +
> >>  	security_perf_event_free(event);
> >>  
> >>  	if (event->rb) {
> >> @@ -5769,6 +5786,36 @@ u64 perf_event_pause(struct perf_event *event, bool reset)
> >>  }
> >>  EXPORT_SYMBOL_GPL(perf_event_pause);
> >>  
> >> +/*
> >> + * Currently invoked at VM creation to
> >> + * - Check whether there are existing !exclude_guest events of PMU with
> >> + *   PERF_PMU_CAP_PASSTHROUGH_VPMU
> >> + * - Set nr_mediated_pmu_vms to prevent !exclude_guest event creation on
> >> + *   PMUs with PERF_PMU_CAP_PASSTHROUGH_VPMU
> >> + *
> >> + * No impact for the PMU without PERF_PMU_CAP_PASSTHROUGH_VPMU. The perf
> >> + * still owns all the PMU resources.
> >> + */
> >> +int perf_get_mediated_pmu(void)
> >> +{
> >> +	guard(mutex)(&perf_mediated_pmu_mutex);
> >> +	if (atomic_inc_not_zero(&nr_mediated_pmu_vms))
> >> +		return 0;
> >> +
> >> +	if (atomic_read(&nr_include_guest_events))
> >> +		return -EBUSY;
> >> +
> >> +	atomic_inc(&nr_mediated_pmu_vms);
> >> +	return 0;
> >> +}
> >> +EXPORT_SYMBOL_GPL(perf_get_mediated_pmu);
> >> +
> >> +void perf_put_mediated_pmu(void)
> >> +{
> >> +	atomic_dec(&nr_mediated_pmu_vms);
> >> +}
> >> +EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
> >> +
> >>  /*
> >>   * Holding the top-level event's child_mutex means that any
> >>   * descendant process that has inherited this event will block
> >> @@ -11907,6 +11954,17 @@ static void account_event(struct perf_event *event)
> >>  	account_pmu_sb_event(event);
> >>  }
> >>  
> >> +static int perf_account_include_guest_event(void)
> >> +{
> >> +	guard(mutex)(&perf_mediated_pmu_mutex);
> >> +
> >> +	if (atomic_read(&nr_mediated_pmu_vms))
> >> +		return -EACCES;
> > 
> > Kan, Namhyung posted a patchset
> > https://lore.kernel.org/all/20240904064131.2377873-1-namhyung@kernel.org/
> > which would stop setting the exclude_guest flag by default in perf tools.
> > This may impact the current mediated vPMU solution, but fortunately the
> > patchset provides a fallback mechanism that adds the exclude_guest flag
> > if the kernel returns "EOPNOTSUPP".
> > 
> > So we'd better return "EOPNOTSUPP" instead of "EACCES" here. BTW, returning
> > "EOPNOTSUPP" here looks more reasonable than "EACCES".
> 
> It seems the existing Apple M1 PMU already returns "EOPNOTSUPP" for
> !exclude_guest events. Yes, we should use the same error code.

Yep, it'd be much easier to handle if it returns the same error code.
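
Just to spell the suggestion out (a sketch of the discussed change, not a
patch), the check in perf_account_include_guest_event() would become:

	guard(mutex)(&perf_mediated_pmu_mutex);

	if (atomic_read(&nr_mediated_pmu_vms))
		return -EOPNOTSUPP;

and the perf tools fallback mentioned above can then retry opening the event
with attr.exclude_guest set when it sees EOPNOTSUPP.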

Thanks,
Namhyung


^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 12/58] perf: core/x86: Register a new vector for KVM GUEST PMI
  2024-09-09 22:11   ` Colton Lewis
@ 2024-09-10  4:59     ` Mi, Dapeng
  2024-09-10 16:45       ` Colton Lewis
  0 siblings, 1 reply; 183+ messages in thread
From: Mi, Dapeng @ 2024-09-10  4:59 UTC (permalink / raw)
  To: Colton Lewis, Mingwei Zhang
  Cc: seanjc, pbonzini, xiong.y.zhang, kan.liang, zhenyuw,
	manali.shukla, sandipan.das, jmattson, eranian, irogers, namhyung,
	gce-passthrou-pmu-dev, samantha.alt, zhiyuan.lv, yanfei.xu,
	like.xu.linux, peterz, rananta, kvm, linux-perf-users


On 9/10/2024 6:11 AM, Colton Lewis wrote:
> Hello,
>
> Mingwei Zhang <mizhang@google.com> writes:
>
>> From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>> Create a new vector in the host IDT for kvm guest PMI handling within
>> mediated passthrough vPMU. In addition, guest PMI handler registration
>> is added into x86_set_kvm_irq_handler().
>> This is the preparation work to support mediated passthrough vPMU to
>> handle kvm guest PMIs without interference from PMI handler of the host
>> PMU.
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>   arch/x86/include/asm/hardirq.h                |  1 +
>>   arch/x86/include/asm/idtentry.h               |  1 +
>>   arch/x86/include/asm/irq_vectors.h            |  5 ++++-
>>   arch/x86/kernel/idt.c                         |  1 +
>>   arch/x86/kernel/irq.c                         | 21 +++++++++++++++++++
>>   .../beauty/arch/x86/include/asm/irq_vectors.h |  5 ++++-
>>   6 files changed, 32 insertions(+), 2 deletions(-)
>> diff --git a/arch/x86/include/asm/hardirq.h  
>> b/arch/x86/include/asm/hardirq.h
>> index c67fa6ad098a..42a396763c8d 100644
>> --- a/arch/x86/include/asm/hardirq.h
>> +++ b/arch/x86/include/asm/hardirq.h
>> @@ -19,6 +19,7 @@ typedef struct {
>>   	unsigned int kvm_posted_intr_ipis;
>>   	unsigned int kvm_posted_intr_wakeup_ipis;
>>   	unsigned int kvm_posted_intr_nested_ipis;
>> +	unsigned int kvm_guest_pmis;
>>   #endif
>>   	unsigned int x86_platform_ipis;	/* arch dependent */
>>   	unsigned int apic_perf_irqs;
>> diff --git a/arch/x86/include/asm/idtentry.h  
>> b/arch/x86/include/asm/idtentry.h
>> index d4f24499b256..7b1e3e542b1d 100644
>> --- a/arch/x86/include/asm/idtentry.h
>> +++ b/arch/x86/include/asm/idtentry.h
>> @@ -745,6 +745,7 @@ DECLARE_IDTENTRY_SYSVEC(IRQ_WORK_VECTOR,		 
>> sysvec_irq_work);
>>   DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_VECTOR,		sysvec_kvm_posted_intr_ipi);
>>   DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_WAKEUP_VECTOR,	 
>> sysvec_kvm_posted_intr_wakeup_ipi);
>>   DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_NESTED_VECTOR,	 
>> sysvec_kvm_posted_intr_nested_ipi);
>> +DECLARE_IDTENTRY_SYSVEC(KVM_GUEST_PMI_VECTOR,	         
>> sysvec_kvm_guest_pmi_handler);
>>   #else
>>   # define fred_sysvec_kvm_posted_intr_ipi		NULL
>>   # define fred_sysvec_kvm_posted_intr_wakeup_ipi		NULL
>> diff --git a/arch/x86/include/asm/irq_vectors.h  
>> b/arch/x86/include/asm/irq_vectors.h
>> index 13aea8fc3d45..ada270e6f5cb 100644
>> --- a/arch/x86/include/asm/irq_vectors.h
>> +++ b/arch/x86/include/asm/irq_vectors.h
>> @@ -77,7 +77,10 @@
>>    */
>>   #define IRQ_WORK_VECTOR			0xf6
>> -/* 0xf5 - unused, was UV_BAU_MESSAGE */
>> +#if IS_ENABLED(CONFIG_KVM)
>> +#define KVM_GUEST_PMI_VECTOR		0xf5
>> +#endif
>> +
>>   #define DEFERRED_ERROR_VECTOR		0xf4
>>   /* Vector on which hypervisor callbacks will be delivered */
>> diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
>> index f445bec516a0..0bec4c7e2308 100644
>> --- a/arch/x86/kernel/idt.c
>> +++ b/arch/x86/kernel/idt.c
>> @@ -157,6 +157,7 @@ static const __initconst struct idt_data apic_idts[]  
>> = {
>>   	INTG(POSTED_INTR_VECTOR,		asm_sysvec_kvm_posted_intr_ipi),
>>   	INTG(POSTED_INTR_WAKEUP_VECTOR,		asm_sysvec_kvm_posted_intr_wakeup_ipi),
>>   	INTG(POSTED_INTR_NESTED_VECTOR,		asm_sysvec_kvm_posted_intr_nested_ipi),
>> +	INTG(KVM_GUEST_PMI_VECTOR,		asm_sysvec_kvm_guest_pmi_handler),
>
> There is a subtle inconsistency in that this code is guarded on
> CONFIG_HAVE_KVM when the #define KVM_GUEST_PMI_VECTOR is guarded on
> CONFIG_KVM. Beats me why there are two different flags that are so
> similar, but it is possible they have different values leading to
> compilation errors.
>
> I happened to hit this compilation error because make defconfig will
> define CONFIG_HAVE_KVM but not CONFIG_KVM. CONFIG_KVM is the more strict
> of the two because it depends on CONFIG_HAVE_KVM.
>
> This code should also be guarded on CONFIG_KVM.

Hi Colton,

 what's your kernel base? In the latest code base, CONFIG_HAVE_KVM has been
dropped and this part of the code is guarded by CONFIG_KVM.

Please see commit dcf0926e9b89 "x86: replace CONFIG_HAVE_KVM with
IS_ENABLED(CONFIG_KVM)". Thanks.
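
If I read that commit right, the guards changed roughly like this
(illustrative only, quoted from memory rather than from the commit itself):

-#ifdef CONFIG_HAVE_KVM
+#if IS_ENABLED(CONFIG_KVM)
 	INTG(POSTED_INTR_VECTOR,		asm_sysvec_kvm_posted_intr_ipi),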


>
>>   # endif
>>   # ifdef CONFIG_IRQ_WORK
>>   	INTG(IRQ_WORK_VECTOR,			asm_sysvec_irq_work),
>> diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
>> index 18cd418fe106..b29714e23fc4 100644
>> --- a/arch/x86/kernel/irq.c
>> +++ b/arch/x86/kernel/irq.c
>> @@ -183,6 +183,12 @@ int arch_show_interrupts(struct seq_file *p, int  
>> prec)
>>   		seq_printf(p, "%10u ",
>>   			   irq_stats(j)->kvm_posted_intr_wakeup_ipis);
>>   	seq_puts(p, "  Posted-interrupt wakeup event\n");
>> +
>> +	seq_printf(p, "%*s: ", prec, "VPMU");
>> +	for_each_online_cpu(j)
>> +		seq_printf(p, "%10u ",
>> +			   irq_stats(j)->kvm_guest_pmis);
>> +	seq_puts(p, " KVM GUEST PMI\n");
>>   #endif
>>   #ifdef CONFIG_X86_POSTED_MSI
>>   	seq_printf(p, "%*s: ", prec, "PMN");
>> @@ -311,6 +317,7 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_x86_platform_ipi)
>>   #if IS_ENABLED(CONFIG_KVM)
>>   static void dummy_handler(void) {}
>>   static void (*kvm_posted_intr_wakeup_handler)(void) = dummy_handler;
>> +static void (*kvm_guest_pmi_handler)(void) = dummy_handler;
>>   void x86_set_kvm_irq_handler(u8 vector, void (*handler)(void))
>>   {
>> @@ -321,6 +328,10 @@ void x86_set_kvm_irq_handler(u8 vector, void  
>> (*handler)(void))
>>   	    (handler == dummy_handler ||
>>   	     kvm_posted_intr_wakeup_handler == dummy_handler))
>>   		kvm_posted_intr_wakeup_handler = handler;
>> +	else if (vector == KVM_GUEST_PMI_VECTOR &&
>> +		 (handler == dummy_handler ||
>> +		  kvm_guest_pmi_handler == dummy_handler))
>> +		kvm_guest_pmi_handler = handler;
>>   	else
>>   		WARN_ON_ONCE(1);
>> @@ -356,6 +367,16 @@  
>> DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_kvm_posted_intr_nested_ipi)
>>   	apic_eoi();
>>   	inc_irq_stat(kvm_posted_intr_nested_ipis);
>>   }
>> +
>> +/*
>> + * Handler for KVM_GUEST_PMI_VECTOR.
>> + */
>> +DEFINE_IDTENTRY_SYSVEC(sysvec_kvm_guest_pmi_handler)
>> +{
>> +	apic_eoi();
>> +	inc_irq_stat(kvm_guest_pmis);
>> +	kvm_guest_pmi_handler();
>> +}
>>   #endif
>>   #ifdef CONFIG_X86_POSTED_MSI
>> diff --git a/tools/perf/trace/beauty/arch/x86/include/asm/irq_vectors.h  
>> b/tools/perf/trace/beauty/arch/x86/include/asm/irq_vectors.h
>> index 13aea8fc3d45..670dcee46631 100644
>> --- a/tools/perf/trace/beauty/arch/x86/include/asm/irq_vectors.h
>> +++ b/tools/perf/trace/beauty/arch/x86/include/asm/irq_vectors.h
>> @@ -77,7 +77,10 @@
>>    */
>>   #define IRQ_WORK_VECTOR			0xf6
>> -/* 0xf5 - unused, was UV_BAU_MESSAGE */
>> +#if IS_ENABLED(CONFIG_KVM)
>> +#define KVM_GUEST_PMI_VECTOR           0xf5
>> +#endif
>> +
>>   #define DEFERRED_ERROR_VECTOR		0xf4
>>   /* Vector on which hypervisor callbacks will be delivered */

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 15/58] perf/x86: Support switch_interrupt interface
  2024-09-09 22:11   ` Colton Lewis
@ 2024-09-10  5:00     ` Mi, Dapeng
  2024-10-24 19:45     ` Chen, Zide
  1 sibling, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-09-10  5:00 UTC (permalink / raw)
  To: Colton Lewis, Mingwei Zhang
  Cc: seanjc, pbonzini, xiong.y.zhang, kan.liang, zhenyuw,
	manali.shukla, sandipan.das, jmattson, eranian, irogers, namhyung,
	gce-passthrou-pmu-dev, samantha.alt, zhiyuan.lv, yanfei.xu,
	like.xu.linux, peterz, rananta, kvm, linux-perf-users


On 9/10/2024 6:11 AM, Colton Lewis wrote:
> Mingwei Zhang <mizhang@google.com> writes:
>
>> From: Kan Liang <kan.liang@linux.intel.com>
>> Implement switch_interrupt interface for x86 PMU, switch PMI to dedicated
>> KVM_GUEST_PMI_VECTOR at perf guest enter, and switch PMI back to
>> NMI at perf guest exit.
>> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>   arch/x86/events/core.c | 11 +++++++++++
>>   1 file changed, 11 insertions(+)
>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> index 5bf78cd619bf..b17ef8b6c1a6 100644
>> --- a/arch/x86/events/core.c
>> +++ b/arch/x86/events/core.c
>> @@ -2673,6 +2673,15 @@ static bool x86_pmu_filter(struct pmu *pmu, int  
>> cpu)
>>   	return ret;
>>   }
>> +static void x86_pmu_switch_interrupt(bool enter, u32 guest_lvtpc)
>> +{
>> +	if (enter)
>> +		apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_GUEST_PMI_VECTOR |
>> +			   (guest_lvtpc & APIC_LVT_MASKED));
>> +	else
>> +		apic_write(APIC_LVTPC, APIC_DM_NMI);
>> +}
>> +
> This is similar to the issue I pointed out in an earlier patch: #define
> KVM_GUEST_PMI_VECTOR is guarded by CONFIG_KVM but this code is not,
> which can result in compile errors.

Yes, thanks for pointing this.


>
>>   static struct pmu pmu = {
>>   	.pmu_enable		= x86_pmu_enable,
>>   	.pmu_disable		= x86_pmu_disable,
>> @@ -2702,6 +2711,8 @@ static struct pmu pmu = {
>>   	.aux_output_match	= x86_pmu_aux_output_match,
>>   	.filter			= x86_pmu_filter,
>> +
>> +	.switch_interrupt	= x86_pmu_switch_interrupt,
>>   };
>>   void arch_perf_update_userpage(struct perf_event *event,

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 12/58] perf: core/x86: Register a new vector for KVM GUEST PMI
  2024-09-10  4:59     ` Mi, Dapeng
@ 2024-09-10 16:45       ` Colton Lewis
  0 siblings, 0 replies; 183+ messages in thread
From: Colton Lewis @ 2024-09-10 16:45 UTC (permalink / raw)
  To: Mi Dapeng
  Cc: mizhang, seanjc, pbonzini, xiong.y.zhang, kan.liang, zhenyuw,
	manali.shukla, sandipan.das, jmattson, eranian, irogers, namhyung,
	gce-passthrou-pmu-dev, samantha.alt, zhiyuan.lv, yanfei.xu,
	like.xu.linux, peterz, rananta, kvm, linux-perf-users

"Mi, Dapeng" <dapeng1.mi@linux.intel.com> writes:

> On 9/10/2024 6:11 AM, Colton Lewis wrote:
>> Hello,

>> Mingwei Zhang <mizhang@google.com> writes:

>>> From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>>> Create a new vector in the host IDT for kvm guest PMI handling within
>>> mediated passthrough vPMU. In addition, guest PMI handler registration
>>> is added into x86_set_kvm_irq_handler().
>>> This is the preparation work to support mediated passthrough vPMU to
>>> handle kvm guest PMIs without interference from PMI handler of the host
>>> PMU.
>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>>> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
>>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>>> ---
>>>    arch/x86/include/asm/hardirq.h                |  1 +
>>>    arch/x86/include/asm/idtentry.h               |  1 +
>>>    arch/x86/include/asm/irq_vectors.h            |  5 ++++-
>>>    arch/x86/kernel/idt.c                         |  1 +
>>>    arch/x86/kernel/irq.c                         | 21 +++++++++++++++++++
>>>    .../beauty/arch/x86/include/asm/irq_vectors.h |  5 ++++-
>>>    6 files changed, 32 insertions(+), 2 deletions(-)
>>> diff --git a/arch/x86/include/asm/hardirq.h
>>> b/arch/x86/include/asm/hardirq.h
>>> index c67fa6ad098a..42a396763c8d 100644
>>> --- a/arch/x86/include/asm/hardirq.h
>>> +++ b/arch/x86/include/asm/hardirq.h
>>> @@ -19,6 +19,7 @@ typedef struct {
>>>    	unsigned int kvm_posted_intr_ipis;
>>>    	unsigned int kvm_posted_intr_wakeup_ipis;
>>>    	unsigned int kvm_posted_intr_nested_ipis;
>>> +	unsigned int kvm_guest_pmis;
>>>    #endif
>>>    	unsigned int x86_platform_ipis;	/* arch dependent */
>>>    	unsigned int apic_perf_irqs;
>>> diff --git a/arch/x86/include/asm/idtentry.h
>>> b/arch/x86/include/asm/idtentry.h
>>> index d4f24499b256..7b1e3e542b1d 100644
>>> --- a/arch/x86/include/asm/idtentry.h
>>> +++ b/arch/x86/include/asm/idtentry.h
>>> @@ -745,6 +745,7 @@ DECLARE_IDTENTRY_SYSVEC(IRQ_WORK_VECTOR,
>>> sysvec_irq_work);
>>>    DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_VECTOR,		 
>>> sysvec_kvm_posted_intr_ipi);
>>>    DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_WAKEUP_VECTOR,
>>> sysvec_kvm_posted_intr_wakeup_ipi);
>>>    DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_NESTED_VECTOR,
>>> sysvec_kvm_posted_intr_nested_ipi);
>>> +DECLARE_IDTENTRY_SYSVEC(KVM_GUEST_PMI_VECTOR,
>>> sysvec_kvm_guest_pmi_handler);
>>>    #else
>>>    # define fred_sysvec_kvm_posted_intr_ipi		NULL
>>>    # define fred_sysvec_kvm_posted_intr_wakeup_ipi		NULL
>>> diff --git a/arch/x86/include/asm/irq_vectors.h
>>> b/arch/x86/include/asm/irq_vectors.h
>>> index 13aea8fc3d45..ada270e6f5cb 100644
>>> --- a/arch/x86/include/asm/irq_vectors.h
>>> +++ b/arch/x86/include/asm/irq_vectors.h
>>> @@ -77,7 +77,10 @@
>>>     */
>>>    #define IRQ_WORK_VECTOR			0xf6
>>> -/* 0xf5 - unused, was UV_BAU_MESSAGE */
>>> +#if IS_ENABLED(CONFIG_KVM)
>>> +#define KVM_GUEST_PMI_VECTOR		0xf5
>>> +#endif
>>> +
>>>    #define DEFERRED_ERROR_VECTOR		0xf4
>>>    /* Vector on which hypervisor callbacks will be delivered */
>>> diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
>>> index f445bec516a0..0bec4c7e2308 100644
>>> --- a/arch/x86/kernel/idt.c
>>> +++ b/arch/x86/kernel/idt.c
>>> @@ -157,6 +157,7 @@ static const __initconst struct idt_data apic_idts[]
>>> = {
>>>    	INTG(POSTED_INTR_VECTOR,		asm_sysvec_kvm_posted_intr_ipi),
>>>    	INTG(POSTED_INTR_WAKEUP_VECTOR,		 
>>> asm_sysvec_kvm_posted_intr_wakeup_ipi),
>>>    	INTG(POSTED_INTR_NESTED_VECTOR,		 
>>> asm_sysvec_kvm_posted_intr_nested_ipi),
>>> +	INTG(KVM_GUEST_PMI_VECTOR,		asm_sysvec_kvm_guest_pmi_handler),

>> There is a subtle inconsistency in that this code is guarded on
>> CONFIG_HAVE_KVM when the #define KVM_GUEST_PMI_VECTOR is guarded on
>> CONFIG_KVM. Beats me why there are two different flags that are so
>> similar, but it is possible they have different values leading to
>> compilation errors.

>> I happened to hit this compilation error because make defconfig will
>> define CONFIG_HAVE_KVM but not CONFIG_KVM. CONFIG_KVM is the more strict
>> of the two because it depends on CONFIG_HAVE_KVM.

>> This code should also be guarded on CONFIG_KVM.

> Hi Colton,

>  what's your kernel base? In the latest code base, CONFIG_HAVE_KVM has been
> dropped and this part of the code is guarded by CONFIG_KVM.

> Please see commit dcf0926e9b89 "x86: replace CONFIG_HAVE_KVM with
> IS_ENABLED(CONFIG_KVM)". Thanks.

My kernel base was older than that. Glad to know it is no longer an
issue.


>>>    # endif
>>>    # ifdef CONFIG_IRQ_WORK
>>>    	INTG(IRQ_WORK_VECTOR,			asm_sysvec_irq_work),
>>> diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
>>> index 18cd418fe106..b29714e23fc4 100644
>>> --- a/arch/x86/kernel/irq.c
>>> +++ b/arch/x86/kernel/irq.c
>>> @@ -183,6 +183,12 @@ int arch_show_interrupts(struct seq_file *p, int
>>> prec)
>>>    		seq_printf(p, "%10u ",
>>>    			   irq_stats(j)->kvm_posted_intr_wakeup_ipis);
>>>    	seq_puts(p, "  Posted-interrupt wakeup event\n");
>>> +
>>> +	seq_printf(p, "%*s: ", prec, "VPMU");
>>> +	for_each_online_cpu(j)
>>> +		seq_printf(p, "%10u ",
>>> +			   irq_stats(j)->kvm_guest_pmis);
>>> +	seq_puts(p, " KVM GUEST PMI\n");
>>>    #endif
>>>    #ifdef CONFIG_X86_POSTED_MSI
>>>    	seq_printf(p, "%*s: ", prec, "PMN");
>>> @@ -311,6 +317,7 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_x86_platform_ipi)
>>>    #if IS_ENABLED(CONFIG_KVM)
>>>    static void dummy_handler(void) {}
>>>    static void (*kvm_posted_intr_wakeup_handler)(void) = dummy_handler;
>>> +static void (*kvm_guest_pmi_handler)(void) = dummy_handler;
>>>    void x86_set_kvm_irq_handler(u8 vector, void (*handler)(void))
>>>    {
>>> @@ -321,6 +328,10 @@ void x86_set_kvm_irq_handler(u8 vector, void
>>> (*handler)(void))
>>>    	    (handler == dummy_handler ||
>>>    	     kvm_posted_intr_wakeup_handler == dummy_handler))
>>>    		kvm_posted_intr_wakeup_handler = handler;
>>> +	else if (vector == KVM_GUEST_PMI_VECTOR &&
>>> +		 (handler == dummy_handler ||
>>> +		  kvm_guest_pmi_handler == dummy_handler))
>>> +		kvm_guest_pmi_handler = handler;
>>>    	else
>>>    		WARN_ON_ONCE(1);
>>> @@ -356,6 +367,16 @@
>>> DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_kvm_posted_intr_nested_ipi)
>>>    	apic_eoi();
>>>    	inc_irq_stat(kvm_posted_intr_nested_ipis);
>>>    }
>>> +
>>> +/*
>>> + * Handler for KVM_GUEST_PMI_VECTOR.
>>> + */
>>> +DEFINE_IDTENTRY_SYSVEC(sysvec_kvm_guest_pmi_handler)
>>> +{
>>> +	apic_eoi();
>>> +	inc_irq_stat(kvm_guest_pmis);
>>> +	kvm_guest_pmi_handler();
>>> +}
>>>    #endif
>>>    #ifdef CONFIG_X86_POSTED_MSI
>>> diff --git a/tools/perf/trace/beauty/arch/x86/include/asm/irq_vectors.h
>>> b/tools/perf/trace/beauty/arch/x86/include/asm/irq_vectors.h
>>> index 13aea8fc3d45..670dcee46631 100644
>>> --- a/tools/perf/trace/beauty/arch/x86/include/asm/irq_vectors.h
>>> +++ b/tools/perf/trace/beauty/arch/x86/include/asm/irq_vectors.h
>>> @@ -77,7 +77,10 @@
>>>     */
>>>    #define IRQ_WORK_VECTOR			0xf6
>>> -/* 0xf5 - unused, was UV_BAU_MESSAGE */
>>> +#if IS_ENABLED(CONFIG_KVM)
>>> +#define KVM_GUEST_PMI_VECTOR           0xf5
>>> +#endif
>>> +
>>>    #define DEFERRED_ERROR_VECTOR		0xf4
>>>    /* Vector on which hypervisor callbacks will be delivered */

^ permalink raw reply	[flat|nested] 183+ messages in thread

* RE: [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (57 preceding siblings ...)
  2024-08-01  4:59 ` [RFC PATCH v3 58/58] perf/x86/amd: Support PERF_PMU_CAP_PASSTHROUGH_VPMU for AMD host Mingwei Zhang
@ 2024-09-11 10:45 ` Ma, Yongwei
  2024-11-19 14:00 ` Sean Christopherson
  2024-11-20 11:55 ` Mi, Dapeng
  60 siblings, 0 replies; 183+ messages in thread
From: Ma, Yongwei @ 2024-09-11 10:45 UTC (permalink / raw)
  To: Zhang, Mingwei, Sean Christopherson, Paolo Bonzini,
	Zhang, Xiong Y, Dapeng Mi, Liang, Kan, Zhenyu Wang, Manali Shukla,
	Sandipan Das
  Cc: Jim Mattson, Eranian, Stephane, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev@google.com, Alt, Samantha, Lv, Zhiyuan,
	Xu, Yanfei, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta,
	kvm@vger.kernel.org, linux-perf-users@vger.kernel.org

On 2024/9/11 16:16, Yongwei Ma wrote:
> This series contains perf interface improvements to address Peter's
> comments. In addition, fix several bugs for v2. This version is based on 6.10-
> rc4. The main changes are:
> 
>  - Use atomics to replace refcounts to track the nr_mediated_pmu_vms.
>  - Use the generic ctx_sched_{in,out}() to switch PMU resources when a
>    guest is entering and exiting.
>  - Add a new EVENT_GUEST flag to indicate the context switch case of
>    entering and exiting a guest. Updates the generic ctx_sched_{in,out}
>    to specifically handle this case, especially for time management.
>  - Switch PMI vector in perf_guest_{enter,exit}() as well. Add a new
>    driver-specific interface to facilitate the switch.
>  - Remove the PMU_FL_PASSTHROUGH flag and uses the PASSTHROUGH pmu
>    capability instead.
>  - Adjust commit sequence in PERF and KVM PMI interrupt functions.
>  - Use pmc_is_globally_enabled() check in emulated counter increment [1]
>  - Fix PMU context switch [2] by using rdpmc() instead of rdmsr().
> 
> AMD fixes:
>  - Add support for legacy PMU MSRs in MSR interception.
>  - Make MSR usage consistent if PerfMonV2 is available.
>  - Avoid enabling passthrough vPMU when local APIC is not in kernel.
>  - increment counters in emulation mode.
> 
> This series is organized in the following order:
> 
> Patches 1-3:
>  - Immediate bug fixes that can be applied to Linux tip.
>  - Note: will put immediate fixes ahead in the future. These patches
>    might be duplicated with existing posts.
>  - Note: patches 1-2 are needed for AMD when host kernel enables
>    preemption. Otherwise, guest will suffer from softlockup.
> 
> Patches 4-17:
>  - Perf side changes, infra changes in core pmu with API for KVM.
> 
> Patches 18-48:
>  - KVM mediated passthrough vPMU framework + Intel CPU implementation.
> 
> Patches 49-58:
>  - AMD CPU implementation for vPMU.
> 
> Reference (patches in v2):
> [1] [PATCH v2 42/54] KVM: x86/pmu: Implement emulated counter
> increment for passthrough PMU
>  - https://lore.kernel.org/all/20240506053020.3911940-43-
> mizhang@google.com/
> [2] [PATCH v2 30/54] KVM: x86/pmu: Implement the save/restore of PMU
> state for Intel CPU
>  - https://lore.kernel.org/all/20240506053020.3911940-31-
> mizhang@google.com/
> 
> V2: https://lore.kernel.org/all/20240506053020.3911940-1-
> mizhang@google.com/
> 
> Dapeng Mi (3):
>   x86/msr: Introduce MSR_CORE_PERF_GLOBAL_STATUS_SET
>   KVM: x86/pmu: Introduce macro PMU_CAP_PERF_METRICS
>   KVM: x86/pmu: Add intel_passthrough_pmu_msrs() to pass-through PMU
>     MSRs
> 
> Kan Liang (8):
>   perf: Support get/put passthrough PMU interfaces
>   perf: Skip pmu_ctx based on event_type
>   perf: Clean up perf ctx time
>   perf: Add a EVENT_GUEST flag
>   perf: Add generic exclude_guest support
>   perf: Add switch_interrupt() interface
>   perf/x86: Support switch_interrupt interface
>   perf/x86/intel: Support PERF_PMU_CAP_PASSTHROUGH_VPMU
> 
> Manali Shukla (1):
>   KVM: x86/pmu/svm: Wire up PMU filtering functionality for passthrough
>     PMU
> 
> Mingwei Zhang (24):
>   perf/x86: Forbid PMI handler when guest own PMU
>   perf: core/x86: Plumb passthrough PMU capability from x86_pmu to
>     x86_pmu_cap
>   KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter
>   KVM: x86/pmu: Plumb through pass-through PMU to vcpu for Intel CPUs
>   KVM: x86/pmu: Add a helper to check if passthrough PMU is enabled
>   KVM: x86/pmu: Add host_perf_cap and initialize it in
>     kvm_x86_vendor_init()
>   KVM: x86/pmu: Allow RDPMC pass through when all counters exposed to
>     guest
>   KVM: x86/pmu: Introduce PMU operator to check if rdpmc passthrough
>     allowed
>   KVM: x86/pmu: Create a function prototype to disable MSR interception
>   KVM: x86/pmu: Avoid legacy vPMU code when accessing global_ctrl in
>     passthrough vPMU
>   KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot()
>   KVM: x86/pmu: Add counter MSR and selector MSR index into struct
>     kvm_pmc
>   KVM: x86/pmu: Introduce PMU operation prototypes for save/restore PMU
>     context
>   KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
>   KVM: x86/pmu: Make check_pmu_event_filter() an exported function
>   KVM: x86/pmu: Allow writing to event selector for GP counters if event
>     is allowed
>   KVM: x86/pmu: Allow writing to fixed counter selector if counter is
>     exposed
>   KVM: x86/pmu: Exclude existing vLBR logic from the passthrough PMU
>   KVM: x86/pmu: Introduce PMU operator to increment counter
>   KVM: x86/pmu: Introduce PMU operator for setting counter overflow
>   KVM: x86/pmu: Implement emulated counter increment for passthrough
> PMU
>   KVM: x86/pmu: Update pmc_{read,write}_counter() to disconnect perf API
>   KVM: x86/pmu: Disconnect counter reprogram logic from passthrough PMU
>   KVM: nVMX: Add nested virtualization support for passthrough PMU
> 
> Sandipan Das (12):
>   perf/x86: Do not set bit width for unavailable counters
>   x86/msr: Define PerfCntrGlobalStatusSet register
>   KVM: x86/pmu: Always set global enable bits in passthrough mode
>   KVM: x86/pmu/svm: Set passthrough capability for vcpus
>   KVM: x86/pmu/svm: Set enable_passthrough_pmu module parameter
>   KVM: x86/pmu/svm: Allow RDPMC pass through when all counters exposed
>     to guest
>   KVM: x86/pmu/svm: Implement callback to disable MSR interception
>   KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest
>     write to event selectors
>   KVM: x86/pmu/svm: Add registers to direct access list
>   KVM: x86/pmu/svm: Implement handlers to save and restore context
>   KVM: x86/pmu/svm: Implement callback to increment counters
>   perf/x86/amd: Support PERF_PMU_CAP_PASSTHROUGH_VPMU for AMD
> host
> 
> Sean Christopherson (2):
>   sched/core: Move preempt_model_*() helpers from sched.h to preempt.h
>   sched/core: Drop spinlocks on contention iff kernel is preemptible
> 
> Xiong Zhang (8):
>   x86/irq: Factor out common code for installing kvm irq handler
>   perf: core/x86: Register a new vector for KVM GUEST PMI
>   KVM: x86/pmu: Register KVM_GUEST_PMI_VECTOR handler
>   KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL
>   KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary
>   KVM: x86/pmu: Notify perf core at KVM context switch boundary
>   KVM: x86/pmu: Grab x86 core PMU for passthrough PMU VM
>   KVM: x86/pmu: Add support for PMU context switch at VM-exit/enter
> 
>  .../admin-guide/kernel-parameters.txt         |   4 +-
>  arch/x86/events/amd/core.c                    |   2 +
>  arch/x86/events/core.c                        |  44 +-
>  arch/x86/events/intel/core.c                  |   5 +
>  arch/x86/include/asm/hardirq.h                |   1 +
>  arch/x86/include/asm/idtentry.h               |   1 +
>  arch/x86/include/asm/irq.h                    |   2 +-
>  arch/x86/include/asm/irq_vectors.h            |   5 +-
>  arch/x86/include/asm/kvm-x86-pmu-ops.h        |   6 +
>  arch/x86/include/asm/kvm_host.h               |   9 +
>  arch/x86/include/asm/msr-index.h              |   2 +
>  arch/x86/include/asm/perf_event.h             |   1 +
>  arch/x86/include/asm/vmx.h                    |   1 +
>  arch/x86/kernel/idt.c                         |   1 +
>  arch/x86/kernel/irq.c                         |  39 +-
>  arch/x86/kvm/cpuid.c                          |   3 +
>  arch/x86/kvm/pmu.c                            | 154 +++++-
>  arch/x86/kvm/pmu.h                            |  49 ++
>  arch/x86/kvm/svm/pmu.c                        | 136 +++++-
>  arch/x86/kvm/svm/svm.c                        |  31 ++
>  arch/x86/kvm/svm/svm.h                        |   2 +-
>  arch/x86/kvm/vmx/capabilities.h               |   1 +
>  arch/x86/kvm/vmx/nested.c                     |  52 +++
>  arch/x86/kvm/vmx/pmu_intel.c                  | 192 +++++++-
>  arch/x86/kvm/vmx/vmx.c                        | 197 ++++++--
>  arch/x86/kvm/vmx/vmx.h                        |   3 +-
>  arch/x86/kvm/x86.c                            |  45 ++
>  arch/x86/kvm/x86.h                            |   1 +
>  include/linux/perf_event.h                    |  38 +-
>  include/linux/preempt.h                       |  41 ++
>  include/linux/sched.h                         |  41 --
>  include/linux/spinlock.h                      |  14 +-
>  kernel/events/core.c                          | 441 ++++++++++++++----
>  .../beauty/arch/x86/include/asm/irq_vectors.h |   5 +-
>  34 files changed, 1373 insertions(+), 196 deletions(-)
> 
> 
> base-commit: 6ba59ff4227927d3a8530fc2973b80e94b54d58f
> --
> 2.46.0.rc1.232.g9752f9e123-goog
> 
Tested on Intel Ice Lake and Sky Lake platforms.
Ran kvm-unit-tests, kselftests and perf fuzz tests.
Compared with the non-mediated vPMU solution, I did not see any obvious issues.
Tested-by: Yongwei Ma <yongwei.ma@intel.com>

Best Regard,
Yongwei Ma

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 14/58] perf: Add switch_interrupt() interface
  2024-08-01  4:58 ` [RFC PATCH v3 14/58] perf: Add switch_interrupt() interface Mingwei Zhang
@ 2024-09-19  6:02   ` Manali Shukla
  2024-09-19 13:00     ` Liang, Kan
  2024-10-14 11:56   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 183+ messages in thread
From: Manali Shukla @ 2024-09-19  6:02 UTC (permalink / raw)
  To: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
	Dapeng Mi, Kan Liang, Zhenyu Wang, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users, Manali Shukla

On 8/1/2024 10:28 AM, Mingwei Zhang wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> There will be a dedicated interrupt vector for guests on some platforms,
> e.g., Intel. Add an interface to switch the interrupt vector while
> entering/exiting a guest.
> 
> When the PMI is switched to the new guest vector, the guest_lvtpc value
> needs to be reflected onto the HW; e.g., when the guest clears the PMI
> mask bit, the HW PMI mask bit should be cleared as well, so that PMIs can
> continue to be generated for the guest. So a guest_lvtpc parameter is
> added to perf_guest_enter() and switch_interrupt().
> 
> At switch_interrupt(), the target pmu with the PASSTHROUGH cap should
> be found. Since only one passthrough pmu is supported, we keep the
> implementation simple by tracking the pmu as a global variable.
> 
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> 
> [Simplify the commit with removal of srcu lock/unlock since only one pmu is
> supported.]
> 
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  include/linux/perf_event.h |  9 +++++++--
>  kernel/events/core.c       | 36 ++++++++++++++++++++++++++++++++++--
>  2 files changed, 41 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 75773f9890cc..aeb08f78f539 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -541,6 +541,11 @@ struct pmu {
>  	 * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
>  	 */
>  	int (*check_period)		(struct perf_event *event, u64 value); /* optional */
> +
> +	/*
> +	 * Switch the interrupt vectors, e.g., guest enter/exit.
> +	 */
> +	void (*switch_interrupt)	(bool enter, u32 guest_lvtpc); /* optional */
>  };
>  
>  enum perf_addr_filter_action_t {
> @@ -1738,7 +1743,7 @@ extern int perf_event_period(struct perf_event *event, u64 value);
>  extern u64 perf_event_pause(struct perf_event *event, bool reset);
>  int perf_get_mediated_pmu(void);
>  void perf_put_mediated_pmu(void);
> -void perf_guest_enter(void);
> +void perf_guest_enter(u32 guest_lvtpc);
>  void perf_guest_exit(void);
>  #else /* !CONFIG_PERF_EVENTS: */
>  static inline void *
> @@ -1833,7 +1838,7 @@ static inline int perf_get_mediated_pmu(void)
>  }
>  
>  static inline void perf_put_mediated_pmu(void)			{ }
> -static inline void perf_guest_enter(void)			{ }
> +static inline void perf_guest_enter(u32 guest_lvtpc)		{ }
>  static inline void perf_guest_exit(void)			{ }
>  #endif
>  
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 57ff737b922b..047ca5748ee2 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -422,6 +422,7 @@ static inline bool is_include_guest_event(struct perf_event *event)
>  
>  static LIST_HEAD(pmus);
>  static DEFINE_MUTEX(pmus_lock);
> +static struct pmu *passthru_pmu;
>  static struct srcu_struct pmus_srcu;
>  static cpumask_var_t perf_online_mask;
>  static struct kmem_cache *perf_event_cache;
> @@ -5941,8 +5942,21 @@ void perf_put_mediated_pmu(void)
>  }
>  EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
>  
> +static void perf_switch_interrupt(bool enter, u32 guest_lvtpc)
> +{
> +	/* Mediated passthrough PMU should have PASSTHROUGH_VPMU cap. */
> +	if (!passthru_pmu)
> +		return;
> +
> +	if (passthru_pmu->switch_interrupt &&
> +	    try_module_get(passthru_pmu->module)) {
> +		passthru_pmu->switch_interrupt(enter, guest_lvtpc);
> +		module_put(passthru_pmu->module);
> +	}
> +}
> +
>  /* When entering a guest, schedule out all exclude_guest events. */
> -void perf_guest_enter(void)
> +void perf_guest_enter(u32 guest_lvtpc)
>  {
>  	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>  
> @@ -5962,6 +5976,8 @@ void perf_guest_enter(void)
>  		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>  	}
>  
> +	perf_switch_interrupt(true, guest_lvtpc);
> +
>  	__this_cpu_write(perf_in_guest, true);
>  
>  unlock:
> @@ -5980,6 +5996,8 @@ void perf_guest_exit(void)
>  	if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
>  		goto unlock;
>  
> +	perf_switch_interrupt(false, 0);
> +
>  	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>  	ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
>  	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
> @@ -11842,7 +11860,21 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
>  	if (!pmu->event_idx)
>  		pmu->event_idx = perf_event_idx_default;
>  
> -	list_add_rcu(&pmu->entry, &pmus);
> +	/*
> +	 * Initialize passthru_pmu with the core pmu that has
> +	 * PERF_PMU_CAP_PASSTHROUGH_VPMU capability.
> +	 */
> +	if (pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) {
> +		if (!passthru_pmu)
> +			passthru_pmu = pmu;
> +
> +		if (WARN_ONCE(passthru_pmu != pmu, "Only one passthrough PMU is supported\n")) {
> +			ret = -EINVAL;
> +			goto free_dev;
> +		}
> +	}


Our intention is to virtualize IBS PMUs (Op and Fetch) using the same framework. However, 
if IBS PMUs are also using the PERF_PMU_CAP_PASSTHROUGH_VPMU capability, IBS PMU registration
fails at this point because the Core PMU is already registered with PERF_PMU_CAP_PASSTHROUGH_VPMU.

> +
> +	list_add_tail_rcu(&pmu->entry, &pmus);
>  	atomic_set(&pmu->exclusive_cnt, 0);
>  	ret = 0;
>  unlock:


^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 14/58] perf: Add switch_interrupt() interface
  2024-09-19  6:02   ` Manali Shukla
@ 2024-09-19 13:00     ` Liang, Kan
  2024-09-20  5:09       ` Manali Shukla
  0 siblings, 1 reply; 183+ messages in thread
From: Liang, Kan @ 2024-09-19 13:00 UTC (permalink / raw)
  To: Manali Shukla, Mingwei Zhang, Sean Christopherson, Paolo Bonzini,
	Xiong Zhang, Dapeng Mi, Kan Liang, Zhenyu Wang, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users



On 2024-09-19 2:02 a.m., Manali Shukla wrote:
> On 8/1/2024 10:28 AM, Mingwei Zhang wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> There will be a dedicated interrupt vector for guests on some platforms,
>> e.g., Intel. Add an interface to switch the interrupt vector while
>> entering/exiting a guest.
>>
>> When PMI switch into a new guest vector, guest_lvtpc value need to be
>> reflected onto HW, e,g., guest clear PMI mask bit, the HW PMI mask
>> bit should be cleared also, then PMI can be generated continuously
>> for guest. So guest_lvtpc parameter is added into perf_guest_enter()
>> and switch_interrupt().
>>
>> At switch_interrupt(), the target pmu with PASSTHROUGH cap should
>> be found. Since only one passthrough pmu is supported, we keep the
>> implementation simply by tracking the pmu as a global variable.
>>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>
>> [Simplify the commit with removal of srcu lock/unlock since only one pmu is
>> supported.]
>>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>  include/linux/perf_event.h |  9 +++++++--
>>  kernel/events/core.c       | 36 ++++++++++++++++++++++++++++++++++--
>>  2 files changed, 41 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 75773f9890cc..aeb08f78f539 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -541,6 +541,11 @@ struct pmu {
>>  	 * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
>>  	 */
>>  	int (*check_period)		(struct perf_event *event, u64 value); /* optional */
>> +
>> +	/*
>> +	 * Switch the interrupt vectors, e.g., guest enter/exit.
>> +	 */
>> +	void (*switch_interrupt)	(bool enter, u32 guest_lvtpc); /* optional */
>>  };
>>  
>>  enum perf_addr_filter_action_t {
>> @@ -1738,7 +1743,7 @@ extern int perf_event_period(struct perf_event *event, u64 value);
>>  extern u64 perf_event_pause(struct perf_event *event, bool reset);
>>  int perf_get_mediated_pmu(void);
>>  void perf_put_mediated_pmu(void);
>> -void perf_guest_enter(void);
>> +void perf_guest_enter(u32 guest_lvtpc);
>>  void perf_guest_exit(void);
>>  #else /* !CONFIG_PERF_EVENTS: */
>>  static inline void *
>> @@ -1833,7 +1838,7 @@ static inline int perf_get_mediated_pmu(void)
>>  }
>>  
>>  static inline void perf_put_mediated_pmu(void)			{ }
>> -static inline void perf_guest_enter(void)			{ }
>> +static inline void perf_guest_enter(u32 guest_lvtpc)		{ }
>>  static inline void perf_guest_exit(void)			{ }
>>  #endif
>>  
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 57ff737b922b..047ca5748ee2 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -422,6 +422,7 @@ static inline bool is_include_guest_event(struct perf_event *event)
>>  
>>  static LIST_HEAD(pmus);
>>  static DEFINE_MUTEX(pmus_lock);
>> +static struct pmu *passthru_pmu;
>>  static struct srcu_struct pmus_srcu;
>>  static cpumask_var_t perf_online_mask;
>>  static struct kmem_cache *perf_event_cache;
>> @@ -5941,8 +5942,21 @@ void perf_put_mediated_pmu(void)
>>  }
>>  EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
>>  
>> +static void perf_switch_interrupt(bool enter, u32 guest_lvtpc)
>> +{
>> +	/* Mediated passthrough PMU should have PASSTHROUGH_VPMU cap. */
>> +	if (!passthru_pmu)
>> +		return;
>> +
>> +	if (passthru_pmu->switch_interrupt &&
>> +	    try_module_get(passthru_pmu->module)) {
>> +		passthru_pmu->switch_interrupt(enter, guest_lvtpc);
>> +		module_put(passthru_pmu->module);
>> +	}
>> +}
>> +
>>  /* When entering a guest, schedule out all exclude_guest events. */
>> -void perf_guest_enter(void)
>> +void perf_guest_enter(u32 guest_lvtpc)
>>  {
>>  	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>>  
>> @@ -5962,6 +5976,8 @@ void perf_guest_enter(void)
>>  		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>>  	}
>>  
>> +	perf_switch_interrupt(true, guest_lvtpc);
>> +
>>  	__this_cpu_write(perf_in_guest, true);
>>  
>>  unlock:
>> @@ -5980,6 +5996,8 @@ void perf_guest_exit(void)
>>  	if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
>>  		goto unlock;
>>  
>> +	perf_switch_interrupt(false, 0);
>> +
>>  	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>>  	ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
>>  	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
>> @@ -11842,7 +11860,21 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
>>  	if (!pmu->event_idx)
>>  		pmu->event_idx = perf_event_idx_default;
>>  
>> -	list_add_rcu(&pmu->entry, &pmus);
>> +	/*
>> +	 * Initialize passthru_pmu with the core pmu that has
>> +	 * PERF_PMU_CAP_PASSTHROUGH_VPMU capability.
>> +	 */
>> +	if (pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) {
>> +		if (!passthru_pmu)
>> +			passthru_pmu = pmu;
>> +
>> +		if (WARN_ONCE(passthru_pmu != pmu, "Only one passthrough PMU is supported\n")) {
>> +			ret = -EINVAL;
>> +			goto free_dev;
>> +		}
>> +	}
> 
> 
> Our intention is to virtualize IBS PMUs (Op and Fetch) using the same framework. However, 
> if IBS PMUs are also using the PERF_PMU_CAP_PASSTHROUGH_VPMU capability, IBS PMU registration
> fails at this point because the Core PMU is already registered with PERF_PMU_CAP_PASSTHROUGH_VPMU.
>

The original implementation doesn't limit the number of PMUs with
PERF_PMU_CAP_PASSTHROUGH_VPMU. But at that time, we could not find a
case of more than one PMU with the flag. After several debates, the
patch was simplified to support only one PMU with the flag.
It should not be hard to change it back.
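
For example, going back to iterating every PMU with the capability could look
roughly like the sketch below (untested; it simply restores the srcu-protected
walk that the commit message says was dropped):

static void perf_switch_interrupt(bool enter, u32 guest_lvtpc)
{
	struct pmu *pmu;
	int idx;

	idx = srcu_read_lock(&pmus_srcu);
	/* Walk all PMUs that advertise PASSTHROUGH_VPMU, not just one. */
	list_for_each_entry_rcu(pmu, &pmus, entry) {
		if (!(pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU))
			continue;
		if (pmu->switch_interrupt && try_module_get(pmu->module)) {
			pmu->switch_interrupt(enter, guest_lvtpc);
			module_put(pmu->module);
		}
	}
	srcu_read_unlock(&pmus_srcu, idx);
}

perf_pmu_register() would then also drop the single-PMU WARN_ONCE() and simply
accept each PMU that sets PERF_PMU_CAP_PASSTHROUGH_VPMU.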

Thanks,
Kan

>> +
>> +	list_add_tail_rcu(&pmu->entry, &pmus);
>>  	atomic_set(&pmu->exclusive_cnt, 0);
>>  	ret = 0;
>>  unlock:
> 
> 

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 14/58] perf: Add switch_interrupt() interface
  2024-09-19 13:00     ` Liang, Kan
@ 2024-09-20  5:09       ` Manali Shukla
  2024-09-23 18:49         ` Mingwei Zhang
  0 siblings, 1 reply; 183+ messages in thread
From: Manali Shukla @ 2024-09-20  5:09 UTC (permalink / raw)
  To: Liang, Kan, Mingwei Zhang, Sean Christopherson, Paolo Bonzini,
	Xiong Zhang, Dapeng Mi, Kan Liang, Zhenyu Wang, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users, Manali Shukla

On 9/19/2024 6:30 PM, Liang, Kan wrote:
> 
> 
> On 2024-09-19 2:02 a.m., Manali Shukla wrote:
>> On 8/1/2024 10:28 AM, Mingwei Zhang wrote:
>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>
>>> There will be a dedicated interrupt vector for guests on some platforms,
>>> e.g., Intel. Add an interface to switch the interrupt vector while
>>> entering/exiting a guest.
>>>
>>> When PMI switch into a new guest vector, guest_lvtpc value need to be
>>> reflected onto HW, e,g., guest clear PMI mask bit, the HW PMI mask
>>> bit should be cleared also, then PMI can be generated continuously
>>> for guest. So guest_lvtpc parameter is added into perf_guest_enter()
>>> and switch_interrupt().
>>>
>>> At switch_interrupt(), the target pmu with PASSTHROUGH cap should
>>> be found. Since only one passthrough pmu is supported, we keep the
>>> implementation simply by tracking the pmu as a global variable.
>>>
>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>>
>>> [Simplify the commit with removal of srcu lock/unlock since only one pmu is
>>> supported.]
>>>
>>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>>> ---
>>>  include/linux/perf_event.h |  9 +++++++--
>>>  kernel/events/core.c       | 36 ++++++++++++++++++++++++++++++++++--
>>>  2 files changed, 41 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>>> index 75773f9890cc..aeb08f78f539 100644
>>> --- a/include/linux/perf_event.h
>>> +++ b/include/linux/perf_event.h
>>> @@ -541,6 +541,11 @@ struct pmu {
>>>  	 * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
>>>  	 */
>>>  	int (*check_period)		(struct perf_event *event, u64 value); /* optional */
>>> +
>>> +	/*
>>> +	 * Switch the interrupt vectors, e.g., guest enter/exit.
>>> +	 */
>>> +	void (*switch_interrupt)	(bool enter, u32 guest_lvtpc); /* optional */
>>>  };
>>>  
>>>  enum perf_addr_filter_action_t {
>>> @@ -1738,7 +1743,7 @@ extern int perf_event_period(struct perf_event *event, u64 value);
>>>  extern u64 perf_event_pause(struct perf_event *event, bool reset);
>>>  int perf_get_mediated_pmu(void);
>>>  void perf_put_mediated_pmu(void);
>>> -void perf_guest_enter(void);
>>> +void perf_guest_enter(u32 guest_lvtpc);
>>>  void perf_guest_exit(void);
>>>  #else /* !CONFIG_PERF_EVENTS: */
>>>  static inline void *
>>> @@ -1833,7 +1838,7 @@ static inline int perf_get_mediated_pmu(void)
>>>  }
>>>  
>>>  static inline void perf_put_mediated_pmu(void)			{ }
>>> -static inline void perf_guest_enter(void)			{ }
>>> +static inline void perf_guest_enter(u32 guest_lvtpc)		{ }
>>>  static inline void perf_guest_exit(void)			{ }
>>>  #endif
>>>  
>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>> index 57ff737b922b..047ca5748ee2 100644
>>> --- a/kernel/events/core.c
>>> +++ b/kernel/events/core.c
>>> @@ -422,6 +422,7 @@ static inline bool is_include_guest_event(struct perf_event *event)
>>>  
>>>  static LIST_HEAD(pmus);
>>>  static DEFINE_MUTEX(pmus_lock);
>>> +static struct pmu *passthru_pmu;
>>>  static struct srcu_struct pmus_srcu;
>>>  static cpumask_var_t perf_online_mask;
>>>  static struct kmem_cache *perf_event_cache;
>>> @@ -5941,8 +5942,21 @@ void perf_put_mediated_pmu(void)
>>>  }
>>>  EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
>>>  
>>> +static void perf_switch_interrupt(bool enter, u32 guest_lvtpc)
>>> +{
>>> +	/* Mediated passthrough PMU should have PASSTHROUGH_VPMU cap. */
>>> +	if (!passthru_pmu)
>>> +		return;
>>> +
>>> +	if (passthru_pmu->switch_interrupt &&
>>> +	    try_module_get(passthru_pmu->module)) {
>>> +		passthru_pmu->switch_interrupt(enter, guest_lvtpc);
>>> +		module_put(passthru_pmu->module);
>>> +	}
>>> +}
>>> +
>>>  /* When entering a guest, schedule out all exclude_guest events. */
>>> -void perf_guest_enter(void)
>>> +void perf_guest_enter(u32 guest_lvtpc)
>>>  {
>>>  	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>>>  
>>> @@ -5962,6 +5976,8 @@ void perf_guest_enter(void)
>>>  		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>>>  	}
>>>  
>>> +	perf_switch_interrupt(true, guest_lvtpc);
>>> +
>>>  	__this_cpu_write(perf_in_guest, true);
>>>  
>>>  unlock:
>>> @@ -5980,6 +5996,8 @@ void perf_guest_exit(void)
>>>  	if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
>>>  		goto unlock;
>>>  
>>> +	perf_switch_interrupt(false, 0);
>>> +
>>>  	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>>>  	ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
>>>  	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
>>> @@ -11842,7 +11860,21 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
>>>  	if (!pmu->event_idx)
>>>  		pmu->event_idx = perf_event_idx_default;
>>>  
>>> -	list_add_rcu(&pmu->entry, &pmus);
>>> +	/*
>>> +	 * Initialize passthru_pmu with the core pmu that has
>>> +	 * PERF_PMU_CAP_PASSTHROUGH_VPMU capability.
>>> +	 */
>>> +	if (pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) {
>>> +		if (!passthru_pmu)
>>> +			passthru_pmu = pmu;
>>> +
>>> +		if (WARN_ONCE(passthru_pmu != pmu, "Only one passthrough PMU is supported\n")) {
>>> +			ret = -EINVAL;
>>> +			goto free_dev;
>>> +		}
>>> +	}
>>
>>
>> Our intention is to virtualize IBS PMUs (Op and Fetch) using the same framework. However, 
>> if IBS PMUs are also using the PERF_PMU_CAP_PASSTHROUGH_VPMU capability, IBS PMU registration
>> fails at this point because the Core PMU is already registered with PERF_PMU_CAP_PASSTHROUGH_VPMU.
>>
> 
> The original implementation doesn't limit the number of PMUs with
> PERF_PMU_CAP_PASSTHROUGH_VPMU. But at that time, we could not find a
> case of more than one PMU with the flag. After several debates, the
> patch was simplified to support only one PMU with the flag.
> It should not be hard to change it back.
> 
> Thanks,
> Kan
> 

Yes, we have a use case for using the mediated passthrough vPMU framework for
IBS virtualization, so we will need it.

- Manali

>>> +
>>> +	list_add_tail_rcu(&pmu->entry, &pmus);
>>>  	atomic_set(&pmu->exclusive_cnt, 0);
>>>  	ret = 0;
>>>  unlock:
>>
>>


^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 14/58] perf: Add switch_interrupt() interface
  2024-09-20  5:09       ` Manali Shukla
@ 2024-09-23 18:49         ` Mingwei Zhang
  2024-09-24 16:55           ` Manali Shukla
  2024-10-14 11:59           ` Peter Zijlstra
  0 siblings, 2 replies; 183+ messages in thread
From: Mingwei Zhang @ 2024-09-23 18:49 UTC (permalink / raw)
  To: Manali Shukla
  Cc: Liang, Kan, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
	Dapeng Mi, Kan Liang, Zhenyu Wang, Sandipan Das, Jim Mattson,
	Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
	Samantha Alt, Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Fri, Sep 20, 2024 at 7:09 AM Manali Shukla <manali.shukla@amd.com> wrote:
>
> On 9/19/2024 6:30 PM, Liang, Kan wrote:
> >
> >
> > On 2024-09-19 2:02 a.m., Manali Shukla wrote:
> >> On 8/1/2024 10:28 AM, Mingwei Zhang wrote:
> >>> From: Kan Liang <kan.liang@linux.intel.com>
> >>>
> >>> There will be a dedicated interrupt vector for guests on some platforms,
> >>> e.g., Intel. Add an interface to switch the interrupt vector while
> >>> entering/exiting a guest.
> >>>
> >>> When the PMI is switched to a new guest vector, the guest_lvtpc value
> >>> needs to be reflected onto the HW. E.g., when the guest clears the PMI
> >>> mask bit, the HW PMI mask bit should be cleared as well, so that PMIs
> >>> can keep being generated for the guest. Therefore, a guest_lvtpc
> >>> parameter is added to perf_guest_enter() and switch_interrupt().
> >>>
> >>> At switch_interrupt(), the target pmu with the PASSTHROUGH cap should
> >>> be found. Since only one passthrough pmu is supported, keep the
> >>> implementation simple by tracking the pmu as a global variable.
> >>>
> >>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> >>>
> >>> [Simplify the commit with removal of srcu lock/unlock since only one pmu is
> >>> supported.]
> >>>
> >>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> >>> ---
> >>>  include/linux/perf_event.h |  9 +++++++--
> >>>  kernel/events/core.c       | 36 ++++++++++++++++++++++++++++++++++--
> >>>  2 files changed, 41 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> >>> index 75773f9890cc..aeb08f78f539 100644
> >>> --- a/include/linux/perf_event.h
> >>> +++ b/include/linux/perf_event.h
> >>> @@ -541,6 +541,11 @@ struct pmu {
> >>>      * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
> >>>      */
> >>>     int (*check_period)             (struct perf_event *event, u64 value); /* optional */
> >>> +
> >>> +   /*
> >>> +    * Switch the interrupt vectors, e.g., guest enter/exit.
> >>> +    */
> >>> +   void (*switch_interrupt)        (bool enter, u32 guest_lvtpc); /* optional */
> >>>  };
> >>>
> >>>  enum perf_addr_filter_action_t {
> >>> @@ -1738,7 +1743,7 @@ extern int perf_event_period(struct perf_event *event, u64 value);
> >>>  extern u64 perf_event_pause(struct perf_event *event, bool reset);
> >>>  int perf_get_mediated_pmu(void);
> >>>  void perf_put_mediated_pmu(void);
> >>> -void perf_guest_enter(void);
> >>> +void perf_guest_enter(u32 guest_lvtpc);
> >>>  void perf_guest_exit(void);
> >>>  #else /* !CONFIG_PERF_EVENTS: */
> >>>  static inline void *
> >>> @@ -1833,7 +1838,7 @@ static inline int perf_get_mediated_pmu(void)
> >>>  }
> >>>
> >>>  static inline void perf_put_mediated_pmu(void)                     { }
> >>> -static inline void perf_guest_enter(void)                  { }
> >>> +static inline void perf_guest_enter(u32 guest_lvtpc)               { }
> >>>  static inline void perf_guest_exit(void)                   { }
> >>>  #endif
> >>>
> >>> diff --git a/kernel/events/core.c b/kernel/events/core.c
> >>> index 57ff737b922b..047ca5748ee2 100644
> >>> --- a/kernel/events/core.c
> >>> +++ b/kernel/events/core.c
> >>> @@ -422,6 +422,7 @@ static inline bool is_include_guest_event(struct perf_event *event)
> >>>
> >>>  static LIST_HEAD(pmus);
> >>>  static DEFINE_MUTEX(pmus_lock);
> >>> +static struct pmu *passthru_pmu;
> >>>  static struct srcu_struct pmus_srcu;
> >>>  static cpumask_var_t perf_online_mask;
> >>>  static struct kmem_cache *perf_event_cache;
> >>> @@ -5941,8 +5942,21 @@ void perf_put_mediated_pmu(void)
> >>>  }
> >>>  EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
> >>>
> >>> +static void perf_switch_interrupt(bool enter, u32 guest_lvtpc)
> >>> +{
> >>> +   /* Mediated passthrough PMU should have PASSTHROUGH_VPMU cap. */
> >>> +   if (!passthru_pmu)
> >>> +           return;
> >>> +
> >>> +   if (passthru_pmu->switch_interrupt &&
> >>> +       try_module_get(passthru_pmu->module)) {
> >>> +           passthru_pmu->switch_interrupt(enter, guest_lvtpc);
> >>> +           module_put(passthru_pmu->module);
> >>> +   }
> >>> +}
> >>> +
> >>>  /* When entering a guest, schedule out all exclude_guest events. */
> >>> -void perf_guest_enter(void)
> >>> +void perf_guest_enter(u32 guest_lvtpc)
> >>>  {
> >>>     struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
> >>>
> >>> @@ -5962,6 +5976,8 @@ void perf_guest_enter(void)
> >>>             perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
> >>>     }
> >>>
> >>> +   perf_switch_interrupt(true, guest_lvtpc);
> >>> +
> >>>     __this_cpu_write(perf_in_guest, true);
> >>>
> >>>  unlock:
> >>> @@ -5980,6 +5996,8 @@ void perf_guest_exit(void)
> >>>     if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
> >>>             goto unlock;
> >>>
> >>> +   perf_switch_interrupt(false, 0);
> >>> +
> >>>     perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
> >>>     ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
> >>>     perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
> >>> @@ -11842,7 +11860,21 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
> >>>     if (!pmu->event_idx)
> >>>             pmu->event_idx = perf_event_idx_default;
> >>>
> >>> -   list_add_rcu(&pmu->entry, &pmus);
> >>> +   /*
> >>> +    * Initialize passthru_pmu with the core pmu that has
> >>> +    * PERF_PMU_CAP_PASSTHROUGH_VPMU capability.
> >>> +    */
> >>> +   if (pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) {
> >>> +           if (!passthru_pmu)
> >>> +                   passthru_pmu = pmu;
> >>> +
> >>> +           if (WARN_ONCE(passthru_pmu != pmu, "Only one passthrough PMU is supported\n")) {
> >>> +                   ret = -EINVAL;
> >>> +                   goto free_dev;
> >>> +           }
> >>> +   }
> >>
> >>
> >> Our intention is to virtualize IBS PMUs (Op and Fetch) using the same framework. However,
> >> if IBS PMUs are also using the PERF_PMU_CAP_PASSTHROUGH_VPMU capability, IBS PMU registration
> >> fails at this point because the Core PMU is already registered with PERF_PMU_CAP_PASSTHROUGH_VPMU.
> >>
> >
> > The original implementation doesn't limit the number of PMUs with
> > PERF_PMU_CAP_PASSTHROUGH_VPMU. But at that time, we could not find a
> > case of more than one PMU with the flag. After several debates, the
> > patch was simplified only to support one PMU with the flag.
> > It should not be hard to change it back.

The original implementation has a terrible performance overhead by
design, i.e., every PMU context switch at runtime requires an SRCU
lock pair and a pmu list traversal. To reduce the overhead, we put
the "passthrough" pmus at the front of the list and exit the
traversal as soon as we have passed the last "passthrough" pmu.

I honestly do not like the idea because it is simply a hacky solution
with a high maintenance cost, but I am not the right person to make
the call. So, if the perf (kernel) subsystem maintainers are happy
with it, I can withdraw from this one.

My second point is: if you look at the details, the only reason we
traverse the pmu list is to check whether a pmu has implemented the
"switch_interrupt()" API. If so, we invoke it, which changes the PMI
vector from NMI to a maskable interrupt for KVM. In the AMD case, I
bet that even if there are multiple passthrough pmus, only one
requires switching the interrupt vector. Let me know if this is wrong.
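
For reference, this is roughly the shape of traversal I am describing
(my reconstruction of the idea, not the actual v2 code):

static void perf_switch_interrupt(bool enter, u32 guest_lvtpc)
{
	struct pmu *pmu;
	int idx;

	idx = srcu_read_lock(&pmus_srcu);
	list_for_each_entry_rcu(pmu, &pmus, entry) {
		/* "passthrough" pmus are kept at the head of the list */
		if (!(pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU))
			break;
		/* only pmus that implement the callback need the switch */
		if (pmu->switch_interrupt)
			pmu->switch_interrupt(enter, guest_lvtpc);
	}
	srcu_read_unlock(&pmus_srcu, idx);
}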

Thanks.
-Mingwei

> >
> > Thanks,
> > Kan
> >
>
> Yes, we have a use case to use the mediated passthrough vPMU framework for IBS virtualization.
> So, we will need it.
>
> - Manali
>
> >>> +
> >>> +   list_add_tail_rcu(&pmu->entry, &pmus);
> >>>     atomic_set(&pmu->exclusive_cnt, 0);
> >>>     ret = 0;
> >>>  unlock:
> >>
> >>
>

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 14/58] perf: Add switch_interrupt() interface
  2024-09-23 18:49         ` Mingwei Zhang
@ 2024-09-24 16:55           ` Manali Shukla
  2024-10-14 11:59           ` Peter Zijlstra
  1 sibling, 0 replies; 183+ messages in thread
From: Manali Shukla @ 2024-09-24 16:55 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Liang, Kan, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
	Dapeng Mi, Kan Liang, Zhenyu Wang, Sandipan Das, Jim Mattson,
	Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
	Samantha Alt, Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users, Shukla, Santosh,
	Manali Shukla, nikunj

On 9/24/2024 12:19 AM, Mingwei Zhang wrote:
> On Fri, Sep 20, 2024 at 7:09 AM Manali Shukla <manali.shukla@amd.com> wrote:
>>
>> On 9/19/2024 6:30 PM, Liang, Kan wrote:
>>>
>>>
>>> On 2024-09-19 2:02 a.m., Manali Shukla wrote:
>>>> On 8/1/2024 10:28 AM, Mingwei Zhang wrote:
>>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>>
>>>>> There will be a dedicated interrupt vector for guests on some platforms,
>>>>> e.g., Intel. Add an interface to switch the interrupt vector while
>>>>> entering/exiting a guest.
>>>>>
> >>>>> When the PMI is switched to a new guest vector, the guest_lvtpc value
> >>>>> needs to be reflected onto the HW. E.g., when the guest clears the PMI
> >>>>> mask bit, the HW PMI mask bit should be cleared as well, so that PMIs
> >>>>> can keep being generated for the guest. Therefore, a guest_lvtpc
> >>>>> parameter is added to perf_guest_enter() and switch_interrupt().
>>>>>
> >>>>> At switch_interrupt(), the target pmu with the PASSTHROUGH cap should
> >>>>> be found. Since only one passthrough pmu is supported, keep the
> >>>>> implementation simple by tracking the pmu as a global variable.
>>>>>
>>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>>>>
>>>>> [Simplify the commit with removal of srcu lock/unlock since only one pmu is
>>>>> supported.]
>>>>>
>>>>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>>>>> ---
>>>>>  include/linux/perf_event.h |  9 +++++++--
>>>>>  kernel/events/core.c       | 36 ++++++++++++++++++++++++++++++++++--
>>>>>  2 files changed, 41 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>>>>> index 75773f9890cc..aeb08f78f539 100644
>>>>> --- a/include/linux/perf_event.h
>>>>> +++ b/include/linux/perf_event.h
>>>>> @@ -541,6 +541,11 @@ struct pmu {
>>>>>      * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
>>>>>      */
>>>>>     int (*check_period)             (struct perf_event *event, u64 value); /* optional */
>>>>> +
>>>>> +   /*
>>>>> +    * Switch the interrupt vectors, e.g., guest enter/exit.
>>>>> +    */
>>>>> +   void (*switch_interrupt)        (bool enter, u32 guest_lvtpc); /* optional */
>>>>>  };
>>>>>
>>>>>  enum perf_addr_filter_action_t {
>>>>> @@ -1738,7 +1743,7 @@ extern int perf_event_period(struct perf_event *event, u64 value);
>>>>>  extern u64 perf_event_pause(struct perf_event *event, bool reset);
>>>>>  int perf_get_mediated_pmu(void);
>>>>>  void perf_put_mediated_pmu(void);
>>>>> -void perf_guest_enter(void);
>>>>> +void perf_guest_enter(u32 guest_lvtpc);
>>>>>  void perf_guest_exit(void);
>>>>>  #else /* !CONFIG_PERF_EVENTS: */
>>>>>  static inline void *
>>>>> @@ -1833,7 +1838,7 @@ static inline int perf_get_mediated_pmu(void)
>>>>>  }
>>>>>
>>>>>  static inline void perf_put_mediated_pmu(void)                     { }
>>>>> -static inline void perf_guest_enter(void)                  { }
>>>>> +static inline void perf_guest_enter(u32 guest_lvtpc)               { }
>>>>>  static inline void perf_guest_exit(void)                   { }
>>>>>  #endif
>>>>>
>>>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>>>> index 57ff737b922b..047ca5748ee2 100644
>>>>> --- a/kernel/events/core.c
>>>>> +++ b/kernel/events/core.c
>>>>> @@ -422,6 +422,7 @@ static inline bool is_include_guest_event(struct perf_event *event)
>>>>>
>>>>>  static LIST_HEAD(pmus);
>>>>>  static DEFINE_MUTEX(pmus_lock);
>>>>> +static struct pmu *passthru_pmu;
>>>>>  static struct srcu_struct pmus_srcu;
>>>>>  static cpumask_var_t perf_online_mask;
>>>>>  static struct kmem_cache *perf_event_cache;
>>>>> @@ -5941,8 +5942,21 @@ void perf_put_mediated_pmu(void)
>>>>>  }
>>>>>  EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
>>>>>
>>>>> +static void perf_switch_interrupt(bool enter, u32 guest_lvtpc)
>>>>> +{
>>>>> +   /* Mediated passthrough PMU should have PASSTHROUGH_VPMU cap. */
>>>>> +   if (!passthru_pmu)
>>>>> +           return;
>>>>> +
>>>>> +   if (passthru_pmu->switch_interrupt &&
>>>>> +       try_module_get(passthru_pmu->module)) {
>>>>> +           passthru_pmu->switch_interrupt(enter, guest_lvtpc);
>>>>> +           module_put(passthru_pmu->module);
>>>>> +   }
>>>>> +}
>>>>> +
>>>>>  /* When entering a guest, schedule out all exclude_guest events. */
>>>>> -void perf_guest_enter(void)
>>>>> +void perf_guest_enter(u32 guest_lvtpc)
>>>>>  {
>>>>>     struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>>>>>
>>>>> @@ -5962,6 +5976,8 @@ void perf_guest_enter(void)
>>>>>             perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>>>>>     }
>>>>>
>>>>> +   perf_switch_interrupt(true, guest_lvtpc);
>>>>> +
>>>>>     __this_cpu_write(perf_in_guest, true);
>>>>>
>>>>>  unlock:
>>>>> @@ -5980,6 +5996,8 @@ void perf_guest_exit(void)
>>>>>     if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
>>>>>             goto unlock;
>>>>>
>>>>> +   perf_switch_interrupt(false, 0);
>>>>> +
>>>>>     perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>>>>>     ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
>>>>>     perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
>>>>> @@ -11842,7 +11860,21 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
>>>>>     if (!pmu->event_idx)
>>>>>             pmu->event_idx = perf_event_idx_default;
>>>>>
>>>>> -   list_add_rcu(&pmu->entry, &pmus);
>>>>> +   /*
>>>>> +    * Initialize passthru_pmu with the core pmu that has
>>>>> +    * PERF_PMU_CAP_PASSTHROUGH_VPMU capability.
>>>>> +    */
>>>>> +   if (pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) {
>>>>> +           if (!passthru_pmu)
>>>>> +                   passthru_pmu = pmu;
>>>>> +
>>>>> +           if (WARN_ONCE(passthru_pmu != pmu, "Only one passthrough PMU is supported\n")) {
>>>>> +                   ret = -EINVAL;
>>>>> +                   goto free_dev;
>>>>> +           }
>>>>> +   }
>>>>
>>>>
>>>> Our intention is to virtualize IBS PMUs (Op and Fetch) using the same framework. However,
>>>> if IBS PMUs are also using the PERF_PMU_CAP_PASSTHROUGH_VPMU capability, IBS PMU registration
>>>> fails at this point because the Core PMU is already registered with PERF_PMU_CAP_PASSTHROUGH_VPMU.
>>>>
>>>
>>> The original implementation doesn't limit the number of PMUs with
>>> PERF_PMU_CAP_PASSTHROUGH_VPMU. But at that time, we could not find a
>>> case of more than one PMU with the flag. After several debates, the
>>> patch was simplified only to support one PMU with the flag.
>>> It should not be hard to change it back.
> 
> The original implementation has a terrible performance overhead by
> design, i.e., every PMU context switch at runtime requires an SRCU
> lock pair and a pmu list traversal. To reduce the overhead, we put
> the "passthrough" pmus at the front of the list and exit the
> traversal as soon as we have passed the last "passthrough" pmu.
> 
> I honestly do not like the idea because it is simply a hacky solution
> with a high maintenance cost, but I am not the right person to make
> the call. So, if the perf (kernel) subsystem maintainers are happy
> with it, I can withdraw from this one.
> 
> My second point is: if you look at the details, the only reason we
> traverse the pmu list is to check whether a pmu has implemented the
> "switch_interrupt()" API. If so, we invoke it, which changes the PMI
> vector from NMI to a maskable interrupt for KVM. In the AMD case, I
> bet that even if there are multiple passthrough pmus, only one
> requires switching the interrupt vector. Let me know if this is wrong.
> 
> Thanks.
> -Mingwei
> 

Yeah, that is correct. Currently, switching the interrupt vector is
needed only by one passthrough PMU on AMD. It is not required for IBS
virtualization or PMC virtualization because a VNMI is used to deliver
the interrupt in both cases.

With the mediated passthrough vPMU enabled, all exclude_guest events
for PMUs with the PERF_PMU_CAP_PASSTHROUGH_VPMU capability are
scheduled out when entering a guest and scheduled back in upon exit.
We need this functionality for IBS virtualization and PMC
virtualization.

However, the current design allows only one passthrough PMU, which
prevents the IBS driver from loading with the
PERF_PMU_CAP_PASSTHROUGH_VPMU capability.

As a result, we are unable to use the scheduling out and scheduling in
of exclude_guest events within the mediated passthrough vPMU framework
for IBS virtualization.

- Manali


>>>
>>> Thanks,
>>> Kan
>>>
>>
>> Yes, we have a use case to use the mediated passthrough vPMU framework for IBS virtualization.
>> So, we will need it.
>>
>> - Manali
>>
>>>>> +
>>>>> +   list_add_tail_rcu(&pmu->entry, &pmus);
>>>>>     atomic_set(&pmu->exclusive_cnt, 0);
>>>>>     ret = 0;
>>>>>  unlock:
>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 07/58] perf: Skip pmu_ctx based on event_type
  2024-08-01  4:58 ` [RFC PATCH v3 07/58] perf: Skip pmu_ctx based on event_type Mingwei Zhang
@ 2024-10-11 11:18   ` Peter Zijlstra
  0 siblings, 0 replies; 183+ messages in thread
From: Peter Zijlstra @ 2024-10-11 11:18 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
	Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
	Samantha Alt, Zhiyuan Lv, Yanfei Xu, Like Xu,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Thu, Aug 01, 2024 at 04:58:16AM +0000, Mingwei Zhang wrote:
> @@ -3299,9 +3309,6 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
>  	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>  	struct perf_event_pmu_context *pmu_ctx;
>  	int is_active = ctx->is_active;
> -	bool cgroup = event_type & EVENT_CGROUP;
> -
> -	event_type &= ~EVENT_CGROUP;
>  
>  	lockdep_assert_held(&ctx->lock);
>  
> @@ -3336,7 +3343,7 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
>  		barrier();
>  	}
>  
> -	ctx->is_active &= ~event_type;
> +	ctx->is_active &= ~(event_type & ~EVENT_FLAGS);
>  	if (!(ctx->is_active & EVENT_ALL))
>  		ctx->is_active = 0;
>  

> @@ -3912,9 +3919,6 @@ ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type)
>  {
>  	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>  	int is_active = ctx->is_active;
> -	bool cgroup = event_type & EVENT_CGROUP;
> -
> -	event_type &= ~EVENT_CGROUP;
>  
>  	lockdep_assert_held(&ctx->lock);
>  
> @@ -3932,7 +3936,7 @@ ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type)
>  		barrier();
>  	}
>  
> -	ctx->is_active |= (event_type | EVENT_TIME);
> +	ctx->is_active |= ((event_type & ~EVENT_FLAGS) | EVENT_TIME);
>  	if (ctx->task) {
>  		if (!is_active)
>  			cpuctx->task_ctx = ctx;

Would it make sense to do something like:

	enum event_type_t active_type = event_type & ~EVENT_FLAGS;

	ctx->is_active &= ~active_type;
and
	ctx->is_active |= (active_type | EVENT_TIME);

Or something along those lines; those expressions become a little
unwieldy otherwise.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 08/58] perf: Clean up perf ctx time
  2024-08-01  4:58 ` [RFC PATCH v3 08/58] perf: Clean up perf ctx time Mingwei Zhang
@ 2024-10-11 11:39   ` Peter Zijlstra
  0 siblings, 0 replies; 183+ messages in thread
From: Peter Zijlstra @ 2024-10-11 11:39 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
	Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
	Samantha Alt, Zhiyuan Lv, Yanfei Xu, Like Xu,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Thu, Aug 01, 2024 at 04:58:17AM +0000, Mingwei Zhang wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> The current perf tracks two timestamps for the normal ctx and cgroup.
> The same type of variables and similar codes are used to track the
> timestamps. In the following patch, the third timestamp to track the
> guest time will be introduced.
> To avoid the code duplication, add a new struct perf_time_ctx and factor
> out a generic function update_perf_time_ctx().
> 
> No functional change.
> 
> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>


--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -801,11 +801,22 @@ static inline u64 perf_cgroup_event_time
 	return now;
 }
 
-static inline void update_perf_time_ctx(struct perf_time_ctx *time, u64 now, bool adv);
-
-static inline void __update_cgrp_time(struct perf_cgroup_info *info, u64 now, bool adv)
+static inline void update_perf_time_ctx(struct perf_time_ctx *time, u64 now, bool adv)
 {
-	update_perf_time_ctx(&info->time, now, adv);
+	if (adv)
+		time->time += now - time->stamp;
+	time->stamp = now;
+
+	/*
+	 * The above: time' = time + (now - timestamp), can be re-arranged
+	 * into: time` = now + (time - timestamp), which gives a single value
+	 * offset to compute future time without locks on.
+	 *
+	 * See perf_event_time_now(), which can be used from NMI context where
+	 * it's (obviously) not possible to acquire ctx->lock in order to read
+	 * both the above values in a consistent manner.
+	 */
+	WRITE_ONCE(time->offset, time->time - time->stamp);
 }
 
 static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx, bool final)
@@ -821,7 +832,7 @@ static inline void update_cgrp_time_from
 			cgrp = container_of(css, struct perf_cgroup, css);
 			info = this_cpu_ptr(cgrp->info);
 
-			__update_cgrp_time(info, now, true);
+			update_perf_time_ctx(&info->time, now, true);
 			if (final)
 				__store_release(&info->active, 0);
 		}
@@ -844,7 +855,7 @@ static inline void update_cgrp_time_from
 	 * Do not update time when cgroup is not active
 	 */
 	if (info->active)
-		__update_cgrp_time(info, perf_clock(), true);
+		update_perf_time_ctx(&info->time, perf_clock(), true);
 }
 
 static inline void
@@ -868,7 +879,7 @@ perf_cgroup_set_timestamp(struct perf_cp
 	for (css = &cgrp->css; css; css = css->parent) {
 		cgrp = container_of(css, struct perf_cgroup, css);
 		info = this_cpu_ptr(cgrp->info);
-		__update_cgrp_time(info, ctx->time.stamp, false);
+		update_perf_time_ctx(&info->time, ctx->time.stamp, false);
 		__store_release(&info->active, 1);
 	}
 }
@@ -1478,24 +1489,6 @@ static void perf_unpin_context(struct pe
 	raw_spin_unlock_irqrestore(&ctx->lock, flags);
 }
 
-static inline void update_perf_time_ctx(struct perf_time_ctx *time, u64 now, bool adv)
-{
-	if (adv)
-		time->time += now - time->stamp;
-	time->stamp = now;
-
-	/*
-	 * The above: time' = time + (now - timestamp), can be re-arranged
-	 * into: time` = now + (time - timestamp), which gives a single value
-	 * offset to compute future time without locks on.
-	 *
-	 * See perf_event_time_now(), which can be used from NMI context where
-	 * it's (obviously) not possible to acquire ctx->lock in order to read
-	 * both the above values in a consistent manner.
-	 */
-	WRITE_ONCE(time->offset, time->time - time->stamp);
-}
-
 /*
  * Update the record of the current time in a context.
  */

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 09/58] perf: Add a EVENT_GUEST flag
  2024-08-21  5:27   ` Mi, Dapeng
  2024-08-21 13:16     ` Liang, Kan
@ 2024-10-11 11:41     ` Peter Zijlstra
  2024-10-11 13:16       ` Liang, Kan
  1 sibling, 1 reply; 183+ messages in thread
From: Peter Zijlstra @ 2024-10-11 11:41 UTC (permalink / raw)
  To: Mi, Dapeng
  Cc: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
	Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
	Samantha Alt, Zhiyuan Lv, Yanfei Xu, Like Xu,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Wed, Aug 21, 2024 at 01:27:17PM +0800, Mi, Dapeng wrote:
> On 8/1/2024 12:58 PM, Mingwei Zhang wrote:

> > +static inline u64 __perf_event_time_ctx_now(struct perf_event *event,
> > +					    struct perf_time_ctx *time,
> > +					    struct perf_time_ctx *timeguest,
> > +					    u64 now)
> > +{
> > +	/*
> > +	 * The exclude_guest event time should be calculated from
> > +	 * the ctx time -  the guest time.
> > +	 * The ctx time is now + READ_ONCE(time->offset).
> > +	 * The guest time is now + READ_ONCE(timeguest->offset).
> > +	 * So the exclude_guest time is
> > +	 * READ_ONCE(time->offset) - READ_ONCE(timeguest->offset).
> > +	 */
> > +	if (event->attr.exclude_guest && __this_cpu_read(perf_in_guest))
> 
> Hi Kan,
> 
> we see the following the warning when run perf record command after
> enabling "CONFIG_DEBUG_PREEMPT" config item.
> 
> [  166.779208] BUG: using __this_cpu_read() in preemptible [00000000] code:
> perf/9494
> [  166.779234] caller is __this_cpu_preempt_check+0x13/0x20
> [  166.779241] CPU: 56 UID: 0 PID: 9494 Comm: perf Not tainted
> 6.11.0-rc4-perf-next-mediated-vpmu-v3+ #80
> [  166.779245] Hardware name: Quanta Cloud Technology Inc. QuantaGrid
> D54Q-2U/S6Q-MB-MPS, BIOS 3A11.uh 12/02/2022
> [  166.779248] Call Trace:
> [  166.779250]  <TASK>
> [  166.779252]  dump_stack_lvl+0x76/0xa0
> [  166.779260]  dump_stack+0x10/0x20
> [  166.779267]  check_preemption_disabled+0xd7/0xf0
> [  166.779273]  __this_cpu_preempt_check+0x13/0x20
> [  166.779279]  calc_timer_values+0x193/0x200
> [  166.779287]  perf_event_update_userpage+0x4b/0x170
> [  166.779294]  ? ring_buffer_attach+0x14c/0x200
> [  166.779301]  perf_mmap+0x533/0x5d0
> [  166.779309]  mmap_region+0x243/0xaa0
> [  166.779322]  do_mmap+0x35b/0x640
> [  166.779333]  vm_mmap_pgoff+0xf0/0x1c0
> [  166.779345]  ksys_mmap_pgoff+0x17a/0x250
> [  166.779354]  __x64_sys_mmap+0x33/0x70
> [  166.779362]  x64_sys_call+0x1fa4/0x25f0
> [  166.779369]  do_syscall_64+0x70/0x130
> 
> The reason the kernel complains is that __perf_event_time_ctx_now() calls
> __this_cpu_read() in a preemption-enabled context.
> 
> To eliminate the warning, we may need to use this_cpu_read() to replace
> __this_cpu_read().
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index ccd61fd06e8d..1eb628f8b3a0 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -1581,7 +1581,7 @@ static inline u64 __perf_event_time_ctx_now(struct
> perf_event *event,
>          * So the exclude_guest time is
>          * READ_ONCE(time->offset) - READ_ONCE(timeguest->offset).
>          */
> -       if (event->attr.exclude_guest && __this_cpu_read(perf_in_guest))
> +       if (event->attr.exclude_guest && this_cpu_read(perf_in_guest))
>                 return READ_ONCE(time->offset) - READ_ONCE(timeguest->offset);
>         else
>                 return now + READ_ONCE(time->offset);

The saner fix is moving the preempt_disable() in
perf_event_update_userpage() up a few lines, no?
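
Roughly (hand-waving, simplified from perf_event_update_userpage(), not
an actual patch):

	/* today: the time is computed with preemption still enabled */
	calc_timer_values(event, &now, &enabled, &running);
	userpg = rb->user_page;
	preempt_disable();
	++userpg->lock;

	/* moving preempt_disable() up covers the time calculation too */
	preempt_disable();
	calc_timer_values(event, &now, &enabled, &running);
	userpg = rb->user_page;
	++userpg->lock;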

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 09/58] perf: Add a EVENT_GUEST flag
  2024-10-11 11:41     ` Peter Zijlstra
@ 2024-10-11 13:16       ` Liang, Kan
  0 siblings, 0 replies; 183+ messages in thread
From: Liang, Kan @ 2024-10-11 13:16 UTC (permalink / raw)
  To: Peter Zijlstra, Mi, Dapeng
  Cc: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
	Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
	Samantha Alt, Zhiyuan Lv, Yanfei Xu, Like Xu,
	Raghavendra Rao Ananta, kvm, linux-perf-users



On 2024-10-11 7:41 a.m., Peter Zijlstra wrote:
> On Wed, Aug 21, 2024 at 01:27:17PM +0800, Mi, Dapeng wrote:
>> On 8/1/2024 12:58 PM, Mingwei Zhang wrote:
> 
>>> +static inline u64 __perf_event_time_ctx_now(struct perf_event *event,
>>> +					    struct perf_time_ctx *time,
>>> +					    struct perf_time_ctx *timeguest,
>>> +					    u64 now)
>>> +{
>>> +	/*
>>> +	 * The exclude_guest event time should be calculated from
>>> +	 * the ctx time -  the guest time.
>>> +	 * The ctx time is now + READ_ONCE(time->offset).
>>> +	 * The guest time is now + READ_ONCE(timeguest->offset).
>>> +	 * So the exclude_guest time is
>>> +	 * READ_ONCE(time->offset) - READ_ONCE(timeguest->offset).
>>> +	 */
>>> +	if (event->attr.exclude_guest && __this_cpu_read(perf_in_guest))
>>
>> Hi Kan,
>>
>> we see the following the warning when run perf record command after
>> enabling "CONFIG_DEBUG_PREEMPT" config item.
>>
>> [  166.779208] BUG: using __this_cpu_read() in preemptible [00000000] code:
>> perf/9494
>> [  166.779234] caller is __this_cpu_preempt_check+0x13/0x20
>> [  166.779241] CPU: 56 UID: 0 PID: 9494 Comm: perf Not tainted
>> 6.11.0-rc4-perf-next-mediated-vpmu-v3+ #80
>> [  166.779245] Hardware name: Quanta Cloud Technology Inc. QuantaGrid
>> D54Q-2U/S6Q-MB-MPS, BIOS 3A11.uh 12/02/2022
>> [  166.779248] Call Trace:
>> [  166.779250]  <TASK>
>> [  166.779252]  dump_stack_lvl+0x76/0xa0
>> [  166.779260]  dump_stack+0x10/0x20
>> [  166.779267]  check_preemption_disabled+0xd7/0xf0
>> [  166.779273]  __this_cpu_preempt_check+0x13/0x20
>> [  166.779279]  calc_timer_values+0x193/0x200
>> [  166.779287]  perf_event_update_userpage+0x4b/0x170
>> [  166.779294]  ? ring_buffer_attach+0x14c/0x200
>> [  166.779301]  perf_mmap+0x533/0x5d0
>> [  166.779309]  mmap_region+0x243/0xaa0
>> [  166.779322]  do_mmap+0x35b/0x640
>> [  166.779333]  vm_mmap_pgoff+0xf0/0x1c0
>> [  166.779345]  ksys_mmap_pgoff+0x17a/0x250
>> [  166.779354]  __x64_sys_mmap+0x33/0x70
>> [  166.779362]  x64_sys_call+0x1fa4/0x25f0
>> [  166.779369]  do_syscall_64+0x70/0x130
>>
>> The reason the kernel complains is that __perf_event_time_ctx_now() calls
>> __this_cpu_read() in a preemption-enabled context.
>>
>> To eliminate the warning, we may need to use this_cpu_read() to replace
>> __this_cpu_read().
>>
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index ccd61fd06e8d..1eb628f8b3a0 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -1581,7 +1581,7 @@ static inline u64 __perf_event_time_ctx_now(struct
>> perf_event *event,
>>          * So the exclude_guest time is
>>          * READ_ONCE(time->offset) - READ_ONCE(timeguest->offset).
>>          */
>> -       if (event->attr.exclude_guest && __this_cpu_read(perf_in_guest))
>> +       if (event->attr.exclude_guest && this_cpu_read(perf_in_guest))
>>                 return READ_ONCE(time->offset) - READ_ONCE(timeguest->offset);
>>         else
>>                 return now + READ_ONCE(time->offset);
> 
> The saner fix is moving the preempt_disable() in
> perf_event_update_userpage() up a few lines, no?

Yes, the preempt_disable() is there to guarantee consistent time stamps.
It makes sense to include the time calculation within the
preempt_disable() region.

Since the time-related code was updated recently, I will fold all the
suggestions into the newly rebased code.

Thanks,
Kan


^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 09/58] perf: Add a EVENT_GUEST flag
  2024-08-01  4:58 ` [RFC PATCH v3 09/58] perf: Add a EVENT_GUEST flag Mingwei Zhang
  2024-08-21  5:27   ` Mi, Dapeng
@ 2024-10-11 18:42   ` Peter Zijlstra
  2024-10-11 19:49     ` Liang, Kan
  2024-10-14 11:14   ` Peter Zijlstra
  2024-12-13  9:37   ` Sandipan Das
  3 siblings, 1 reply; 183+ messages in thread
From: Peter Zijlstra @ 2024-10-11 18:42 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
	Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
	Samantha Alt, Zhiyuan Lv, Yanfei Xu, Like Xu,
	Raghavendra Rao Ananta, kvm, linux-perf-users


Can you rework this one along these lines?

---
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -975,6 +975,7 @@ struct perf_event_context {
 	 * Context clock, runs when context enabled.
 	 */
 	struct perf_time_ctx		time;
+	struct perf_time_ctx		timeguest;
 
 	/*
 	 * These fields let us detect when two contexts have both
@@ -1066,6 +1067,7 @@ struct bpf_perf_event_data_kern {
  */
 struct perf_cgroup_info {
 	struct perf_time_ctx		time;
+	struct perf_time_ctx		timeguest;
 	int				active;
 };
 
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -782,12 +782,44 @@ static inline int is_cgroup_event(struct
 	return event->cgrp != NULL;
 }
 
+static_assert(offsetof(struct perf_event_context, timeguest) -
+	      offsetof(struct perf_event_context, time) ==
+	      sizeof(struct perf_time_ctx));
+
+static_assert(offsetof(struct perf_cgroup_info, timeguest) -
+	      offsetof(struct perf_cgroup_info, time) ==
+	      sizeof(struct perf_time_ctx));
+
+static inline u64 __perf_event_time_ctx(struct perf_event *event,
+					struct perf_time_ctx *times)
+{
+	u64 time = times[0].time;
+	if (event->attr.exclude_guest)
+		time -= times[1].time;
+	return time;
+}
+
+static inline u64 __perf_event_time_ctx_now(struct perf_event *event,
+					    struct perf_time_ctx *times,
+					    u64 now)
+{
+	if (event->attr.exclude_guest) {
+		/*
+		 * (now + times[0].offset) - (now + times[1].offset) :=
+		 * times[0].offset - times[1].offset
+		 */
+		return READ_ONCE(times[0].offset) - READ_ONCE(times[1].offset);
+	}
+
+	return now + READ_ONCE(times[0].offset);
+}
+
 static inline u64 perf_cgroup_event_time(struct perf_event *event)
 {
 	struct perf_cgroup_info *t;
 
 	t = per_cpu_ptr(event->cgrp->info, event->cpu);
-	return t->time.time;
+	return __perf_event_time_ctx(event, &t->time);
 }
 
 static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
@@ -796,12 +828,12 @@ static inline u64 perf_cgroup_event_time
 
 	t = per_cpu_ptr(event->cgrp->info, event->cpu);
 	if (!__load_acquire(&t->active))
-		return t->time.time;
+		return __perf_event_time_ctx(event, &t->time);
 	now += READ_ONCE(t->time.offset);
-	return now;
+	return __perf_event_time_ctx_now(event, &t->time, now);
 }
 
-static inline void update_perf_time_ctx(struct perf_time_ctx *time, u64 now, bool adv)
+static inline void __update_perf_time_ctx(struct perf_time_ctx *time, u64 now, bool adv)
 {
 	if (adv)
 		time->time += now - time->stamp;
@@ -819,6 +851,13 @@ static inline void update_perf_time_ctx(
 	WRITE_ONCE(time->offset, time->time - time->stamp);
 }
 
+static inline void update_perf_time_ctx(struct perf_time_ctx *time, u64 now, bool adv)
+{
+	__update_perf_time_ctx(time + 0, now, adv);
+	if (__this_cpu_read(perf_in_guest))
+		__update_perf_time_ctx(time + 1, now, adv);
+}
+
 static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx, bool final)
 {
 	struct perf_cgroup *cgrp = cpuctx->cgrp;

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 09/58] perf: Add a EVENT_GUEST flag
  2024-10-11 18:42   ` Peter Zijlstra
@ 2024-10-11 19:49     ` Liang, Kan
  2024-10-14 10:55       ` Peter Zijlstra
  0 siblings, 1 reply; 183+ messages in thread
From: Liang, Kan @ 2024-10-11 19:49 UTC (permalink / raw)
  To: Peter Zijlstra, Mingwei Zhang
  Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
	Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
	Samantha Alt, Zhiyuan Lv, Yanfei Xu, Like Xu,
	Raghavendra Rao Ananta, kvm, linux-perf-users



On 2024-10-11 2:42 p.m., Peter Zijlstra wrote:
> 
> Can you rework this one along these lines?

Sure.

I will probably also add macros to replace the magic numbers 0 and 1. For example:

#define T_TOTAL		0
#define T_GUEST		1
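
The helper above would then read something like (sketch only):

static inline u64 __perf_event_time_ctx(struct perf_event *event,
					struct perf_time_ctx *times)
{
	u64 time = times[T_TOTAL].time;

	if (event->attr.exclude_guest)
		time -= times[T_GUEST].time;

	return time;
}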

Thanks,
Kan
> 
> ---
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -975,6 +975,7 @@ struct perf_event_context {
>  	 * Context clock, runs when context enabled.
>  	 */
>  	struct perf_time_ctx		time;
> +	struct perf_time_ctx		timeguest;
>  
>  	/*
>  	 * These fields let us detect when two contexts have both
> @@ -1066,6 +1067,7 @@ struct bpf_perf_event_data_kern {
>   */
>  struct perf_cgroup_info {
>  	struct perf_time_ctx		time;
> +	struct perf_time_ctx		timeguest;
>  	int				active;
>  };
>  
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -782,12 +782,44 @@ static inline int is_cgroup_event(struct
>  	return event->cgrp != NULL;
>  }
>  
> +static_assert(offsetof(struct perf_event_context, timeguest) -
> +	      offsetof(struct perf_event_context, time) ==
> +	      sizeof(struct perf_time_ctx));
> +
> +static_assert(offsetof(struct perf_cgroup_info, timeguest) -
> +	      offsetof(struct perf_cgroup_info, time) ==
> +	      sizeof(struct perf_time_ctx));
> +
> +static inline u64 __perf_event_time_ctx(struct perf_event *event,
> +					struct perf_time_ctx *times)
> +{
> +	u64 time = times[0].time;
> +	if (event->attr.exclude_guest)
> +		time -= times[1].time;
> +	return time;
> +}
> +
> +static inline u64 __perf_event_time_ctx_now(struct perf_event *event,
> +					    struct perf_time_ctx *times,
> +					    u64 now)
> +{
> +	if (event->attr.exclude_guest) {
> +		/*
> +		 * (now + times[0].offset) - (now + times[1].offset) :=
> +		 * times[0].offset - times[1].offset
> +		 */
> +		return READ_ONCE(times[0].offset) - READ_ONCE(times[1].offset);
> +	}
> +
> +	return now + READ_ONCE(times[0].offset);
> +}
> +
>  static inline u64 perf_cgroup_event_time(struct perf_event *event)
>  {
>  	struct perf_cgroup_info *t;
>  
>  	t = per_cpu_ptr(event->cgrp->info, event->cpu);
> -	return t->time.time;
> +	return __perf_event_time_ctx(event, &t->time);
>  }
>  
>  static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
> @@ -796,12 +828,12 @@ static inline u64 perf_cgroup_event_time
>  
>  	t = per_cpu_ptr(event->cgrp->info, event->cpu);
>  	if (!__load_acquire(&t->active))
> -		return t->time.time;
> +		return __perf_event_time_ctx(event, &t->time);
>  	now += READ_ONCE(t->time.offset);
> -	return now;
> +	return __perf_event_time_ctx_now(event, &t->time, now);
>  }
>  
> -static inline void update_perf_time_ctx(struct perf_time_ctx *time, u64 now, bool adv)
> +static inline void __update_perf_time_ctx(struct perf_time_ctx *time, u64 now, bool adv)
>  {
>  	if (adv)
>  		time->time += now - time->stamp;
> @@ -819,6 +851,13 @@ static inline void update_perf_time_ctx(
>  	WRITE_ONCE(time->offset, time->time - time->stamp);
>  }
>  
> +static inline void update_perf_time_ctx(struct perf_time_ctx *time, u64 now, bool adv)
> +{
> +	__update_perf_time_ctx(time + 0, now, adv);
> +	if (__this_cpu_read(perf_in_guest))
> +		__update_perf_time_ctx(time + 1, now, adv);
> +}
> +
>  static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx, bool final)
>  {
>  	struct perf_cgroup *cgrp = cpuctx->cgrp;
> 


^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 09/58] perf: Add a EVENT_GUEST flag
  2024-10-11 19:49     ` Liang, Kan
@ 2024-10-14 10:55       ` Peter Zijlstra
  0 siblings, 0 replies; 183+ messages in thread
From: Peter Zijlstra @ 2024-10-14 10:55 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
	Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das,
	Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Raghavendra Rao Ananta, kvm, linux-perf-users

On Fri, Oct 11, 2024 at 03:49:44PM -0400, Liang, Kan wrote:
> 
> 
> On 2024-10-11 2:42 p.m., Peter Zijlstra wrote:
> > 
> > Can you rework this one along these lines?
> 
> Sure.

Thanks!

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 09/58] perf: Add a EVENT_GUEST flag
  2024-08-01  4:58 ` [RFC PATCH v3 09/58] perf: Add a EVENT_GUEST flag Mingwei Zhang
  2024-08-21  5:27   ` Mi, Dapeng
  2024-10-11 18:42   ` Peter Zijlstra
@ 2024-10-14 11:14   ` Peter Zijlstra
  2024-10-14 15:06     ` Liang, Kan
  2024-12-13  9:37   ` Sandipan Das
  3 siblings, 1 reply; 183+ messages in thread
From: Peter Zijlstra @ 2024-10-14 11:14 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
	Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
	Samantha Alt, Zhiyuan Lv, Yanfei Xu, Like Xu,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Thu, Aug 01, 2024 at 04:58:18AM +0000, Mingwei Zhang wrote:

> @@ -3334,9 +3401,15 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
>  	 * would only update time for the pinned events.
>  	 */
>  	if (is_active & EVENT_TIME) {
> +		bool stop;
> +
> +		/* vPMU should not stop time */
> +		stop = !(event_type & EVENT_GUEST) &&
> +		       ctx == &cpuctx->ctx;
> +
>  		/* update (and stop) ctx time */
>  		update_context_time(ctx);
> -		update_cgrp_time_from_cpuctx(cpuctx, ctx == &cpuctx->ctx);
> +		update_cgrp_time_from_cpuctx(cpuctx, stop);
>  		/*
>  		 * CPU-release for the below ->is_active store,
>  		 * see __load_acquire() in perf_event_time_now()
> @@ -3354,7 +3427,18 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
>  			cpuctx->task_ctx = NULL;
>  	}
>  
> -	is_active ^= ctx->is_active; /* changed bits */
> +	if (event_type & EVENT_GUEST) {
> +		/*
> +		 * Schedule out all !exclude_guest events of PMU
> +		 * with PERF_PMU_CAP_PASSTHROUGH_VPMU.
> +		 */

I thought the premise was that if we have !exclude_guest events, we'll
fail the creation of vPMU, and if we have vPMU we'll fail the creation
of !exclude_guest events.

As such, they're mutually exclusive, and the above comment doesn't make
sense; if we have a vPMU, there are no !exclude_guest events, IOW
schedule out the entire context.

> +		is_active = EVENT_ALL;
> +		__update_context_guest_time(ctx, false);
> +		perf_cgroup_set_timestamp(cpuctx, true);
> +		barrier();
> +	} else {
> +		is_active ^= ctx->is_active; /* changed bits */
> +	}
>  
>  	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
>  		if (perf_skip_pmu_ctx(pmu_ctx, event_type))

> @@ -3864,13 +3953,22 @@ static int merge_sched_in(struct perf_event *event, void *data)
>  	if (!event_filter_match(event))
>  		return 0;
>  
> -	if (group_can_go_on(event, *can_add_hw)) {
> +	/*
> +	 * Don't schedule in any exclude_guest events of PMU with
> +	 * PERF_PMU_CAP_PASSTHROUGH_VPMU, while a guest is running.
> +	 */

More confusion; if we have vPMU there should not be !exclude_guest
events, right? So the above then becomes 'Don't schedule in any events'.

> +	if (__this_cpu_read(perf_in_guest) && event->attr.exclude_guest &&
> +	    event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU &&
> +	    !(msd->event_type & EVENT_GUEST))
> +		return 0;
> +
> +	if (group_can_go_on(event, msd->can_add_hw)) {
>  		if (!group_sched_in(event, ctx))
>  			list_add_tail(&event->active_list, get_event_list(event));

> @@ -3945,7 +4049,23 @@ ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type)
>  			WARN_ON_ONCE(cpuctx->task_ctx != ctx);
>  	}
>  
> -	is_active ^= ctx->is_active; /* changed bits */
> +	if (event_type & EVENT_GUEST) {
> +		/*
> +		 * Schedule in all !exclude_guest events of PMU
> +		 * with PERF_PMU_CAP_PASSTHROUGH_VPMU.
> +		 */

Idem.

> +		is_active = EVENT_ALL;
> +
> +		/*
> +		 * Update ctx time to set the new start time for
> +		 * the exclude_guest events.
> +		 */
> +		update_context_time(ctx);
> +		update_cgrp_time_from_cpuctx(cpuctx, false);
> +		barrier();
> +	} else {
> +		is_active ^= ctx->is_active; /* changed bits */
> +	}

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 10/58] perf: Add generic exclude_guest support
  2024-08-01  4:58 ` [RFC PATCH v3 10/58] perf: Add generic exclude_guest support Mingwei Zhang
@ 2024-10-14 11:20   ` Peter Zijlstra
  2024-10-14 15:27     ` Liang, Kan
  0 siblings, 1 reply; 183+ messages in thread
From: Peter Zijlstra @ 2024-10-14 11:20 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
	Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
	Samantha Alt, Zhiyuan Lv, Yanfei Xu, Like Xu,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Thu, Aug 01, 2024 at 04:58:19AM +0000, Mingwei Zhang wrote:
> +void perf_guest_exit(void)
> +{
> +	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
> +
> +	lockdep_assert_irqs_disabled();
> +
> +	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> +
> +	if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
> +		goto unlock;
> +
> +	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
> +	ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
> +	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
> +	if (cpuctx->task_ctx) {
> +		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
> +		ctx_sched_in(cpuctx->task_ctx, EVENT_GUEST);
> +		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
> +	}

Does this not violate the scheduling order of events? AFAICT this will
do:

  cpu pinned
  cpu flexible
  task pinned
  task flexible

as opposed to:

  cpu pinned
  task pinned
  cpu flexible
  task flexible

We have the perf_event_sched_in() helper for this.

> +
> +	__this_cpu_write(perf_in_guest, false);
> +unlock:
> +	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> +}
> +EXPORT_SYMBOL_GPL(perf_guest_exit);

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 14/58] perf: Add switch_interrupt() interface
  2024-08-01  4:58 ` [RFC PATCH v3 14/58] perf: Add switch_interrupt() interface Mingwei Zhang
  2024-09-19  6:02   ` Manali Shukla
@ 2024-10-14 11:56   ` Peter Zijlstra
  2024-10-14 15:40     ` Liang, Kan
  2024-10-14 12:03   ` Peter Zijlstra
  2024-10-14 13:52   ` Peter Zijlstra
  3 siblings, 1 reply; 183+ messages in thread
From: Peter Zijlstra @ 2024-10-14 11:56 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
	Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
	Samantha Alt, Zhiyuan Lv, Yanfei Xu, Like Xu,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Thu, Aug 01, 2024 at 04:58:23AM +0000, Mingwei Zhang wrote:

> @@ -5941,8 +5942,21 @@ void perf_put_mediated_pmu(void)
>  }
>  EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
>  
> +static void perf_switch_interrupt(bool enter, u32 guest_lvtpc)
> +{
> +	/* Mediated passthrough PMU should have PASSTHROUGH_VPMU cap. */
> +	if (!passthru_pmu)
> +		return;
> +
> +	if (passthru_pmu->switch_interrupt &&
> +	    try_module_get(passthru_pmu->module)) {
> +		passthru_pmu->switch_interrupt(enter, guest_lvtpc);
> +		module_put(passthru_pmu->module);
> +	}
> +}

Should we move the whole module reference to perf_pmu_{,un}register()?

> @@ -11842,7 +11860,21 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
>  	if (!pmu->event_idx)
>  		pmu->event_idx = perf_event_idx_default;
>  
> -	list_add_rcu(&pmu->entry, &pmus);
> +	/*
> +	 * Initialize passthru_pmu with the core pmu that has
> +	 * PERF_PMU_CAP_PASSTHROUGH_VPMU capability.
> +	 */
> +	if (pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) {
> +		if (!passthru_pmu)
> +			passthru_pmu = pmu;
> +
> +		if (WARN_ONCE(passthru_pmu != pmu, "Only one passthrough PMU is supported\n")) {
> +			ret = -EINVAL;
> +			goto free_dev;

Why impose this limit? Changelog also fails to explain this.

> +		}
> +	}
> +
> +	list_add_tail_rcu(&pmu->entry, &pmus);
>  	atomic_set(&pmu->exclusive_cnt, 0);
>  	ret = 0;
>  unlock:
> -- 
> 2.46.0.rc1.232.g9752f9e123-goog
> 

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 14/58] perf: Add switch_interrupt() interface
  2024-09-23 18:49         ` Mingwei Zhang
  2024-09-24 16:55           ` Manali Shukla
@ 2024-10-14 11:59           ` Peter Zijlstra
  2024-10-14 16:15             ` Liang, Kan
  1 sibling, 1 reply; 183+ messages in thread
From: Peter Zijlstra @ 2024-10-14 11:59 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Manali Shukla, Liang, Kan, Sean Christopherson, Paolo Bonzini,
	Xiong Zhang, Dapeng Mi, Kan Liang, Zhenyu Wang, Sandipan Das,
	Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Raghavendra Rao Ananta, kvm, linux-perf-users

On Mon, Sep 23, 2024 at 08:49:17PM +0200, Mingwei Zhang wrote:

> The original implementation is by design having a terrible performance
> overhead, ie., every PMU context switch at runtime requires a SRCU
> lock pair and pmu list traversal. To reduce the overhead, we put
> "passthrough" pmus in the front of the list and quickly exit the pmu
> traversal when we just pass the last "passthrough" pmu.

What was the expensive bit? The SRCU memory barrier or the list
iteration? How long is that list really?


^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 14/58] perf: Add switch_interrupt() interface
  2024-08-01  4:58 ` [RFC PATCH v3 14/58] perf: Add switch_interrupt() interface Mingwei Zhang
  2024-09-19  6:02   ` Manali Shukla
  2024-10-14 11:56   ` Peter Zijlstra
@ 2024-10-14 12:03   ` Peter Zijlstra
  2024-10-14 15:51     ` Liang, Kan
  2024-10-14 13:52   ` Peter Zijlstra
  3 siblings, 1 reply; 183+ messages in thread
From: Peter Zijlstra @ 2024-10-14 12:03 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
	Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
	Samantha Alt, Zhiyuan Lv, Yanfei Xu, Like Xu,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Thu, Aug 01, 2024 at 04:58:23AM +0000, Mingwei Zhang wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> There will be a dedicated interrupt vector for guests on some platforms,
> e.g., Intel. Add an interface to switch the interrupt vector while
> entering/exiting a guest.
> 
> When the PMI is switched to a new guest vector, the guest_lvtpc value
> needs to be reflected onto the HW. E.g., when the guest clears the PMI
> mask bit, the HW PMI mask bit should be cleared as well, so that PMIs
> can keep being generated for the guest. Therefore, a guest_lvtpc
> parameter is added to perf_guest_enter() and switch_interrupt().
> 
> At switch_interrupt(), the target pmu with the PASSTHROUGH cap should
> be found. Since only one passthrough pmu is supported, keep the
> implementation simple by tracking the pmu as a global variable.
> 
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> 
> [Simplify the commit with removal of srcu lock/unlock since only one pmu is
> supported.]
> 
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  include/linux/perf_event.h |  9 +++++++--
>  kernel/events/core.c       | 36 ++++++++++++++++++++++++++++++++++--
>  2 files changed, 41 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 75773f9890cc..aeb08f78f539 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -541,6 +541,11 @@ struct pmu {
>  	 * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
>  	 */
>  	int (*check_period)		(struct perf_event *event, u64 value); /* optional */
> +
> +	/*
> +	 * Switch the interrupt vectors, e.g., guest enter/exit.
> +	 */
> +	void (*switch_interrupt)	(bool enter, u32 guest_lvtpc); /* optional */
>  };

I'm thinking the guest_lvtpc argument shouldn't be part of the
interface. That should be PMU implementation data, accessed by the
method implementation.
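
I.e., something like the below, where KVM hands the value to the driver
out of band (entirely hand-wavy; the per-CPU variable and the x86 hook
are made up for illustration):

	/* struct pmu: the method itself carries no guest state */
	void (*switch_interrupt)	(bool enter); /* optional */

	/* x86 side: KVM stashes the value before the guest enter */
	static DEFINE_PER_CPU(u32, guest_lvtpc_val);

	static void x86_pmu_switch_interrupt(bool enter)
	{
		if (enter)
			apic_write(APIC_LVTPC, __this_cpu_read(guest_lvtpc_val));
		else
			apic_write(APIC_LVTPC, APIC_DM_NMI);
	}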

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 14/58] perf: Add switch_interrupt() interface
  2024-08-01  4:58 ` [RFC PATCH v3 14/58] perf: Add switch_interrupt() interface Mingwei Zhang
                     ` (2 preceding siblings ...)
  2024-10-14 12:03   ` Peter Zijlstra
@ 2024-10-14 13:52   ` Peter Zijlstra
  2024-10-14 15:57     ` Liang, Kan
  3 siblings, 1 reply; 183+ messages in thread
From: Peter Zijlstra @ 2024-10-14 13:52 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
	Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
	Samantha Alt, Zhiyuan Lv, Yanfei Xu, Like Xu,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Thu, Aug 01, 2024 at 04:58:23AM +0000, Mingwei Zhang wrote:
> @@ -5962,6 +5976,8 @@ void perf_guest_enter(void)
>  		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>  	}
>  
> +	perf_switch_interrupt(true, guest_lvtpc);
> +
>  	__this_cpu_write(perf_in_guest, true);
>  
>  unlock:
> @@ -5980,6 +5996,8 @@ void perf_guest_exit(void)
>  	if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
>  		goto unlock;
>  
> +	perf_switch_interrupt(false, 0);
> +
>  	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>  	ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
>  	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);

This seems to suggest the method is named wrong; it should probably be
guest_enter() or some such.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 09/58] perf: Add a EVENT_GUEST flag
  2024-10-14 11:14   ` Peter Zijlstra
@ 2024-10-14 15:06     ` Liang, Kan
  0 siblings, 0 replies; 183+ messages in thread
From: Liang, Kan @ 2024-10-14 15:06 UTC (permalink / raw)
  To: Peter Zijlstra, Mingwei Zhang
  Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
	Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
	Samantha Alt, Zhiyuan Lv, Yanfei Xu, Like Xu,
	Raghavendra Rao Ananta, kvm, linux-perf-users



On 2024-10-14 7:14 a.m., Peter Zijlstra wrote:
> On Thu, Aug 01, 2024 at 04:58:18AM +0000, Mingwei Zhang wrote:
> 
>> @@ -3334,9 +3401,15 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
>>  	 * would only update time for the pinned events.
>>  	 */
>>  	if (is_active & EVENT_TIME) {
>> +		bool stop;
>> +
>> +		/* vPMU should not stop time */
>> +		stop = !(event_type & EVENT_GUEST) &&
>> +		       ctx == &cpuctx->ctx;
>> +
>>  		/* update (and stop) ctx time */
>>  		update_context_time(ctx);
>> -		update_cgrp_time_from_cpuctx(cpuctx, ctx == &cpuctx->ctx);
>> +		update_cgrp_time_from_cpuctx(cpuctx, stop);
>>  		/*
>>  		 * CPU-release for the below ->is_active store,
>>  		 * see __load_acquire() in perf_event_time_now()
>> @@ -3354,7 +3427,18 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
>>  			cpuctx->task_ctx = NULL;
>>  	}
>>  
>> -	is_active ^= ctx->is_active; /* changed bits */
>> +	if (event_type & EVENT_GUEST) {
>> +		/*
>> +		 * Schedule out all !exclude_guest events of PMU
>> +		 * with PERF_PMU_CAP_PASSTHROUGH_VPMU.
>> +		 */
> 
> I thought the premise was that if we have !exclude_guest events, we'll
> fail the creation of vPMU, and if we have vPMU we'll fail the creation
> of !exclude_guest events.
> 
> As such, they're mutually exclusive, and the above comment doesn't make
> sense, if we have a vPMU, there are no !exclude_guest events, IOW
> schedule out the entire context.

It's a typo. Right, it should schedule out all exclude_guest events of
the passthrough PMUs.

> 
>> +		is_active = EVENT_ALL;
>> +		__update_context_guest_time(ctx, false);
>> +		perf_cgroup_set_timestamp(cpuctx, true);
>> +		barrier();
>> +	} else {
>> +		is_active ^= ctx->is_active; /* changed bits */
>> +	}
>>  
>>  	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
>>  		if (perf_skip_pmu_ctx(pmu_ctx, event_type))
> 
>> @@ -3864,13 +3953,22 @@ static int merge_sched_in(struct perf_event *event, void *data)
>>  	if (!event_filter_match(event))
>>  		return 0;
>>  
>> -	if (group_can_go_on(event, *can_add_hw)) {
>> +	/*
>> +	 * Don't schedule in any exclude_guest events of PMU with
>> +	 * PERF_PMU_CAP_PASSTHROUGH_VPMU, while a guest is running.
>> +	 */
> 
> More confusion; if we have vPMU there should not be !exclude_guest
> events, right? So the above then becomes 'Don't schedule in any events'.

Right. I will correct the three comments.

Thanks,
Kan

> 
>> +	if (__this_cpu_read(perf_in_guest) && event->attr.exclude_guest &&
>> +	    event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU &&
>> +	    !(msd->event_type & EVENT_GUEST))
>> +		return 0;
>> +
>> +	if (group_can_go_on(event, msd->can_add_hw)) {
>>  		if (!group_sched_in(event, ctx))
>>  			list_add_tail(&event->active_list, get_event_list(event));
> 
>> @@ -3945,7 +4049,23 @@ ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type)
>>  			WARN_ON_ONCE(cpuctx->task_ctx != ctx);
>>  	}
>>  
>> -	is_active ^= ctx->is_active; /* changed bits */
>> +	if (event_type & EVENT_GUEST) {
>> +		/*
>> +		 * Schedule in all !exclude_guest events of PMU
>> +		 * with PERF_PMU_CAP_PASSTHROUGH_VPMU.
>> +		 */
> 
> Idem.
> 
>> +		is_active = EVENT_ALL;
>> +
>> +		/*
>> +		 * Update ctx time to set the new start time for
>> +		 * the exclude_guest events.
>> +		 */
>> +		update_context_time(ctx);
>> +		update_cgrp_time_from_cpuctx(cpuctx, false);
>> +		barrier();
>> +	} else {
>> +		is_active ^= ctx->is_active; /* changed bits */
>> +	}
> 


^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 10/58] perf: Add generic exclude_guest support
  2024-10-14 11:20   ` Peter Zijlstra
@ 2024-10-14 15:27     ` Liang, Kan
  0 siblings, 0 replies; 183+ messages in thread
From: Liang, Kan @ 2024-10-14 15:27 UTC (permalink / raw)
  To: Peter Zijlstra, Mingwei Zhang
  Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
	Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
	Samantha Alt, Zhiyuan Lv, Yanfei Xu, Like Xu,
	Raghavendra Rao Ananta, kvm, linux-perf-users



On 2024-10-14 7:20 a.m., Peter Zijlstra wrote:
> On Thu, Aug 01, 2024 at 04:58:19AM +0000, Mingwei Zhang wrote:
>> +void perf_guest_exit(void)
>> +{
>> +	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>> +
>> +	lockdep_assert_irqs_disabled();
>> +
>> +	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>> +
>> +	if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
>> +		goto unlock;
>> +
>> +	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>> +	ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
>> +	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
>> +	if (cpuctx->task_ctx) {
>> +		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
>> +		ctx_sched_in(cpuctx->task_ctx, EVENT_GUEST);
>> +		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>> +	}
> 
> Does this not violate the scheduling order of events? AFAICT this will
> do:
> 
>   cpu pinned
>   cpu flexible
>   task pinned
>   task flexible
> 
> as opposed to:
> 
>   cpu pinned
>   task pinned
>   cpu flexible
>   task flexible
> 
> We have the perf_event_sched_in() helper for this.

Yes, we can avoid the direct sched_in() calls with the EVENT_GUEST flag
and instead invoke the perf_event_sched_in() helper to do the real
scheduling. I will do more tests to double check.
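
For illustration, a rough sketch of that rework (assuming the existing
perf_event_sched_in() helper keeps its current two-argument form; how the
EVENT_GUEST filtering is threaded through the helper is deliberately left
open here):

void perf_guest_exit(void)
{
	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);

	lockdep_assert_irqs_disabled();

	perf_ctx_lock(cpuctx, cpuctx->task_ctx);

	if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
		goto unlock;

	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
	if (cpuctx->task_ctx)
		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);

	/* cpu pinned -> task pinned -> cpu flexible -> task flexible */
	perf_event_sched_in(cpuctx, cpuctx->task_ctx);

	if (cpuctx->task_ctx)
		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);

	__this_cpu_write(perf_in_guest, false);
unlock:
	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
}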

Thanks,
Kan
> 
>> +
>> +	__this_cpu_write(perf_in_guest, false);
>> +unlock:
>> +	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>> +}
>> +EXPORT_SYMBOL_GPL(perf_guest_exit);
> 


^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 14/58] perf: Add switch_interrupt() interface
  2024-10-14 11:56   ` Peter Zijlstra
@ 2024-10-14 15:40     ` Liang, Kan
  2024-10-14 17:47       ` Peter Zijlstra
  2024-10-14 17:51       ` Peter Zijlstra
  0 siblings, 2 replies; 183+ messages in thread
From: Liang, Kan @ 2024-10-14 15:40 UTC (permalink / raw)
  To: Peter Zijlstra, Mingwei Zhang
  Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
	Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
	Samantha Alt, Zhiyuan Lv, Yanfei Xu, Like Xu,
	Raghavendra Rao Ananta, kvm, linux-perf-users



On 2024-10-14 7:56 a.m., Peter Zijlstra wrote:
> On Thu, Aug 01, 2024 at 04:58:23AM +0000, Mingwei Zhang wrote:
> 
>> @@ -5941,8 +5942,21 @@ void perf_put_mediated_pmu(void)
>>  }
>>  EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
>>  
>> +static void perf_switch_interrupt(bool enter, u32 guest_lvtpc)
>> +{
>> +	/* Mediated passthrough PMU should have PASSTHROUGH_VPMU cap. */
>> +	if (!passthru_pmu)
>> +		return;
>> +
>> +	if (passthru_pmu->switch_interrupt &&
>> +	    try_module_get(passthru_pmu->module)) {
>> +		passthru_pmu->switch_interrupt(enter, guest_lvtpc);
>> +		module_put(passthru_pmu->module);
>> +	}
>> +}
> 
> Should we move the whole module reference to perf_pmu_(,un}register() ?

A PMU module can be loaded/unloaded at any time. How would we know
whether the PMU module is available if the reference check is moved to
perf_pmu_{,un}register()?

> 
>> @@ -11842,7 +11860,21 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
>>  	if (!pmu->event_idx)
>>  		pmu->event_idx = perf_event_idx_default;
>>  
>> -	list_add_rcu(&pmu->entry, &pmus);
>> +	/*
>> +	 * Initialize passthru_pmu with the core pmu that has
>> +	 * PERF_PMU_CAP_PASSTHROUGH_VPMU capability.
>> +	 */
>> +	if (pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) {
>> +		if (!passthru_pmu)
>> +			passthru_pmu = pmu;
>> +
>> +		if (WARN_ONCE(passthru_pmu != pmu, "Only one passthrough PMU is supported\n")) {
>> +			ret = -EINVAL;
>> +			goto free_dev;
> 
> Why impose this limit? Changelog also fails to explain this.

Because passthru_pmu is a global variable. If there are two or more
PMUs with PERF_PMU_CAP_PASSTHROUGH_VPMU, the earlier one would be
implicitly overwritten without this check.

Thanks
Kan
> 
>> +		}
>> +	}
>> +
>> +	list_add_tail_rcu(&pmu->entry, &pmus);
>>  	atomic_set(&pmu->exclusive_cnt, 0);
>>  	ret = 0;
>>  unlock:
>> -- 
>> 2.46.0.rc1.232.g9752f9e123-goog
>>
> 


^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 14/58] perf: Add switch_interrupt() interface
  2024-10-14 12:03   ` Peter Zijlstra
@ 2024-10-14 15:51     ` Liang, Kan
  2024-10-14 17:49       ` Peter Zijlstra
  0 siblings, 1 reply; 183+ messages in thread
From: Liang, Kan @ 2024-10-14 15:51 UTC (permalink / raw)
  To: Peter Zijlstra, Mingwei Zhang
  Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
	Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
	Samantha Alt, Zhiyuan Lv, Yanfei Xu, Like Xu,
	Raghavendra Rao Ananta, kvm, linux-perf-users



On 2024-10-14 8:03 a.m., Peter Zijlstra wrote:
> On Thu, Aug 01, 2024 at 04:58:23AM +0000, Mingwei Zhang wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> There will be a dedicated interrupt vector for guests on some platforms,
>> e.g., Intel. Add an interface to switch the interrupt vector while
>> entering/exiting a guest.
>>
>> When PMI switch into a new guest vector, guest_lvtpc value need to be
>> reflected onto HW, e,g., guest clear PMI mask bit, the HW PMI mask
>> bit should be cleared also, then PMI can be generated continuously
>> for guest. So guest_lvtpc parameter is added into perf_guest_enter()
>> and switch_interrupt().
>>
>> At switch_interrupt(), the target pmu with PASSTHROUGH cap should
>> be found. Since only one passthrough pmu is supported, we keep the
>> implementation simply by tracking the pmu as a global variable.
>>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>
>> [Simplify the commit with removal of srcu lock/unlock since only one pmu is
>> supported.]
>>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>  include/linux/perf_event.h |  9 +++++++--
>>  kernel/events/core.c       | 36 ++++++++++++++++++++++++++++++++++--
>>  2 files changed, 41 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 75773f9890cc..aeb08f78f539 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -541,6 +541,11 @@ struct pmu {
>>  	 * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
>>  	 */
>>  	int (*check_period)		(struct perf_event *event, u64 value); /* optional */
>> +
>> +	/*
>> +	 * Switch the interrupt vectors, e.g., guest enter/exit.
>> +	 */
>> +	void (*switch_interrupt)	(bool enter, u32 guest_lvtpc); /* optional */
>>  };
> 
> I'm thinking the guets_lvtpc argument shouldn't be part of the
> interface. That should be PMU implementation data and accessed by the
> method implementation.

The name perf_switch_interrupt() is too specific. It should really
switch the guest context; the interrupt is just one part of that context.
Maybe an interface as below:

void (*switch_guest_ctx)	(bool enter, void *data); /* optional */
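
For illustration, a minimal sketch of how the core wrapper quoted above
could be renamed to match (passthru_pmu and the module handling stay as
in the patch; 'data' is whatever KVM passes in):

static void perf_switch_guest_ctx(bool enter, void *data)
{
	/* Only the single registered passthrough PMU is switched. */
	if (!passthru_pmu)
		return;

	if (passthru_pmu->switch_guest_ctx &&
	    try_module_get(passthru_pmu->module)) {
		passthru_pmu->switch_guest_ctx(enter, data);
		module_put(passthru_pmu->module);
	}
}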

Thanks,
Kan

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 14/58] perf: Add switch_interrupt() interface
  2024-10-14 13:52   ` Peter Zijlstra
@ 2024-10-14 15:57     ` Liang, Kan
  0 siblings, 0 replies; 183+ messages in thread
From: Liang, Kan @ 2024-10-14 15:57 UTC (permalink / raw)
  To: Peter Zijlstra, Mingwei Zhang
  Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
	Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
	Samantha Alt, Zhiyuan Lv, Yanfei Xu, Like Xu,
	Raghavendra Rao Ananta, kvm, linux-perf-users



On 2024-10-14 9:52 a.m., Peter Zijlstra wrote:
> On Thu, Aug 01, 2024 at 04:58:23AM +0000, Mingwei Zhang wrote:
>> @@ -5962,6 +5976,8 @@ void perf_guest_enter(void)
>>  		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>>  	}
>>  
>> +	perf_switch_interrupt(true, guest_lvtpc);
>> +
>>  	__this_cpu_write(perf_in_guest, true);
>>  
>>  unlock:
>> @@ -5980,6 +5996,8 @@ void perf_guest_exit(void)
>>  	if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
>>  		goto unlock;
>>  
>> +	perf_switch_interrupt(false, 0);
>> +
>>  	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>>  	ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
>>  	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
> 
> This seems to suggest the method is named wrong, it should probably be
> guest_enter() or somsuch.
>

The ctx_sched_in() here schedules the host context back in after the
guest exit. The EVENT_GUEST flag indicates the guest context switch.

The name may bring some confusion. Maybe I can add a wrapper function
perf_host_enter() to contain the code above.
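
For illustration, a minimal sketch of such a wrapper, simply pulling the
code quoted above out of perf_guest_exit() (the name perf_host_enter()
is only the one proposed here; the scheduling-order concern raised on
patch 10 would still need to be addressed inside it):

/* Schedule the host context back in after a guest exit. */
static void perf_host_enter(struct perf_cpu_context *cpuctx)
{
	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
	ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);

	if (cpuctx->task_ctx) {
		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
		ctx_sched_in(cpuctx->task_ctx, EVENT_GUEST);
		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
	}
}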

Thanks,
Kan

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 14/58] perf: Add switch_interrupt() interface
  2024-10-14 11:59           ` Peter Zijlstra
@ 2024-10-14 16:15             ` Liang, Kan
  2024-10-14 17:45               ` Peter Zijlstra
  0 siblings, 1 reply; 183+ messages in thread
From: Liang, Kan @ 2024-10-14 16:15 UTC (permalink / raw)
  To: Peter Zijlstra, Mingwei Zhang
  Cc: Manali Shukla, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
	Dapeng Mi, Kan Liang, Zhenyu Wang, Sandipan Das, Jim Mattson,
	Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
	Samantha Alt, Zhiyuan Lv, Yanfei Xu, Like Xu,
	Raghavendra Rao Ananta, kvm, linux-perf-users



On 2024-10-14 7:59 a.m., Peter Zijlstra wrote:
> On Mon, Sep 23, 2024 at 08:49:17PM +0200, Mingwei Zhang wrote:
> 
>> The original implementation is by design having a terrible performance
>> overhead, ie., every PMU context switch at runtime requires a SRCU
>> lock pair and pmu list traversal. To reduce the overhead, we put
>> "passthrough" pmus in the front of the list and quickly exit the pmu
>> traversal when we just pass the last "passthrough" pmu.
> 
> What was the expensive bit? The SRCU memory barrier or the list
> iteration? How long is that list really?

Both. But I don't think there is any performance data.

The length of the list can vary across platforms. On a modern server
there could be hundreds of PMUs: uncore PMUs, CXL PMUs, IOMMU PMUs,
PMUs of accelerator devices, and PMUs from all kinds of other devices.
The number could keep increasing as more and more devices support the
PMU capability.

Two methods were considered.
- One is to add a global variable to track the "passthrough" pmu. The
idea assumes that there is only one "passthrough" pmu that requires the
switch, and that this will not change in the near future.
Both the SRCU memory barrier and the list iteration can then be avoided.
This is what the patch implements.

- The other is to always put the "passthrough" pmus at the front of the
list, so the unnecessary list iteration can be avoided. It does nothing
for the SRCU lock pair.
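
For illustration, a minimal sketch of the second option inside
perf_pmu_register(), showing only the list ordering (the early exit in
the guest switch path is not shown):

	/*
	 * Keep passthrough-capable PMUs at the head of 'pmus' so the
	 * guest enter/exit path can stop walking at the first PMU
	 * without PERF_PMU_CAP_PASSTHROUGH_VPMU.
	 */
	if (pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU)
		list_add_rcu(&pmu->entry, &pmus);
	else
		list_add_tail_rcu(&pmu->entry, &pmus);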

Thanks,
Kan


^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 14/58] perf: Add switch_interrupt() interface
  2024-10-14 16:15             ` Liang, Kan
@ 2024-10-14 17:45               ` Peter Zijlstra
  2024-10-15 15:59                 ` Liang, Kan
  0 siblings, 1 reply; 183+ messages in thread
From: Peter Zijlstra @ 2024-10-14 17:45 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Mingwei Zhang, Manali Shukla, Sean Christopherson, Paolo Bonzini,
	Xiong Zhang, Dapeng Mi, Kan Liang, Zhenyu Wang, Sandipan Das,
	Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Raghavendra Rao Ananta, kvm, linux-perf-users

On Mon, Oct 14, 2024 at 12:15:11PM -0400, Liang, Kan wrote:
> 
> 
> On 2024-10-14 7:59 a.m., Peter Zijlstra wrote:
> > On Mon, Sep 23, 2024 at 08:49:17PM +0200, Mingwei Zhang wrote:
> > 
> >> The original implementation is by design having a terrible performance
> >> overhead, ie., every PMU context switch at runtime requires a SRCU
> >> lock pair and pmu list traversal. To reduce the overhead, we put
> >> "passthrough" pmus in the front of the list and quickly exit the pmu
> >> traversal when we just pass the last "passthrough" pmu.
> > 
> > What was the expensive bit? The SRCU memory barrier or the list
> > iteration? How long is that list really?
> 
> Both. But I don't think there is any performance data.
> 
> The length of the list could vary on different platforms. For a modern
> server, there could be hundreds of PMUs from uncore PMUs, CXL PMUs,
> IOMMU PMUs, PMUs of accelerator devices and PMUs from all kinds of
> devices. The number could keep increasing with more and more devices
> supporting the PMU capability.
> 
> Two methods were considered.
> - One is to add a global variable to track the "passthrough" pmu. The
> idea assumes that there is only one "passthrough" pmu that requires the
> switch, and the situation will not be changed in the near feature.
> So the SRCU memory barrier and the list iteration can be avoided.
> It's implemented in the patch
> 
> - The other one is always put the "passthrough" pmus in the front of the
> list. So the unnecessary list iteration can be avoided. It does nothing
> for the SRCU lock pair.

PaulMck has patches that introduce srcu_read_lock_lite(), which would
avoid the smp_mb() in most cases.

  https://lkml.kernel.org/r/20241011173931.2050422-6-paulmck@kernel.org

We can also keep a second list, just for the passthrough pmus. A bit
like sched_cb_list.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 14/58] perf: Add switch_interrupt() interface
  2024-10-14 15:40     ` Liang, Kan
@ 2024-10-14 17:47       ` Peter Zijlstra
  2024-10-14 17:51       ` Peter Zijlstra
  1 sibling, 0 replies; 183+ messages in thread
From: Peter Zijlstra @ 2024-10-14 17:47 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
	Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das,
	Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Raghavendra Rao Ananta, kvm, linux-perf-users

On Mon, Oct 14, 2024 at 11:40:21AM -0400, Liang, Kan wrote:

> >> @@ -11842,7 +11860,21 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
> >>  	if (!pmu->event_idx)
> >>  		pmu->event_idx = perf_event_idx_default;
> >>  
> >> -	list_add_rcu(&pmu->entry, &pmus);
> >> +	/*
> >> +	 * Initialize passthru_pmu with the core pmu that has
> >> +	 * PERF_PMU_CAP_PASSTHROUGH_VPMU capability.
> >> +	 */
> >> +	if (pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) {
> >> +		if (!passthru_pmu)
> >> +			passthru_pmu = pmu;
> >> +
> >> +		if (WARN_ONCE(passthru_pmu != pmu, "Only one passthrough PMU is supported\n")) {
> >> +			ret = -EINVAL;
> >> +			goto free_dev;
> > 
> > Why impose this limit? Changelog also fails to explain this.
> 
> Because the passthru_pmu is global variable. If there are two or more
> PMUs with the PERF_PMU_CAP_PASSTHROUGH_VPMU, the former one will be
> implicitly overwritten if without the check.

That is not the question; but it has been answered elsewhere. For some
reason you thought the srcu+list thing was expensive.


^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 14/58] perf: Add switch_interrupt() interface
  2024-10-14 15:51     ` Liang, Kan
@ 2024-10-14 17:49       ` Peter Zijlstra
  2024-10-15 13:23         ` Liang, Kan
  0 siblings, 1 reply; 183+ messages in thread
From: Peter Zijlstra @ 2024-10-14 17:49 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
	Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das,
	Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Raghavendra Rao Ananta, kvm, linux-perf-users

On Mon, Oct 14, 2024 at 11:51:06AM -0400, Liang, Kan wrote:
> On 2024-10-14 8:03 a.m., Peter Zijlstra wrote:
> > On Thu, Aug 01, 2024 at 04:58:23AM +0000, Mingwei Zhang wrote:
> >> From: Kan Liang <kan.liang@linux.intel.com>
> >>
> >> There will be a dedicated interrupt vector for guests on some platforms,
> >> e.g., Intel. Add an interface to switch the interrupt vector while
> >> entering/exiting a guest.
> >>
> >> When PMI switch into a new guest vector, guest_lvtpc value need to be
> >> reflected onto HW, e,g., guest clear PMI mask bit, the HW PMI mask
> >> bit should be cleared also, then PMI can be generated continuously
> >> for guest. So guest_lvtpc parameter is added into perf_guest_enter()
> >> and switch_interrupt().
> >>
> >> At switch_interrupt(), the target pmu with PASSTHROUGH cap should
> >> be found. Since only one passthrough pmu is supported, we keep the
> >> implementation simply by tracking the pmu as a global variable.
> >>
> >> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> >>
> >> [Simplify the commit with removal of srcu lock/unlock since only one pmu is
> >> supported.]
> >>
> >> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> >> ---
> >>  include/linux/perf_event.h |  9 +++++++--
> >>  kernel/events/core.c       | 36 ++++++++++++++++++++++++++++++++++--
> >>  2 files changed, 41 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> >> index 75773f9890cc..aeb08f78f539 100644
> >> --- a/include/linux/perf_event.h
> >> +++ b/include/linux/perf_event.h
> >> @@ -541,6 +541,11 @@ struct pmu {
> >>  	 * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
> >>  	 */
> >>  	int (*check_period)		(struct perf_event *event, u64 value); /* optional */
> >> +
> >> +	/*
> >> +	 * Switch the interrupt vectors, e.g., guest enter/exit.
> >> +	 */
> >> +	void (*switch_interrupt)	(bool enter, u32 guest_lvtpc); /* optional */
> >>  };
> > 
> > I'm thinking the guets_lvtpc argument shouldn't be part of the
> > interface. That should be PMU implementation data and accessed by the
> > method implementation.
> 
> I think the name of the perf_switch_interrupt() is too specific.
> Here should be to switch the guest context. The interrupt should be just
> part of the context. Maybe a interface as below
> 
> void (*switch_guest_ctx)	(bool enter, void *data); /* optional */

I don't think you even need the data thing. For example, the x86/intel
implementation can just look at a x86_pmu data field to find the magic
value.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 14/58] perf: Add switch_interrupt() interface
  2024-10-14 15:40     ` Liang, Kan
  2024-10-14 17:47       ` Peter Zijlstra
@ 2024-10-14 17:51       ` Peter Zijlstra
  1 sibling, 0 replies; 183+ messages in thread
From: Peter Zijlstra @ 2024-10-14 17:51 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
	Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das,
	Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Raghavendra Rao Ananta, kvm, linux-perf-users

On Mon, Oct 14, 2024 at 11:40:21AM -0400, Liang, Kan wrote:
> 
> 
> On 2024-10-14 7:56 a.m., Peter Zijlstra wrote:
> > On Thu, Aug 01, 2024 at 04:58:23AM +0000, Mingwei Zhang wrote:
> > 
> >> @@ -5941,8 +5942,21 @@ void perf_put_mediated_pmu(void)
> >>  }
> >>  EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
> >>  
> >> +static void perf_switch_interrupt(bool enter, u32 guest_lvtpc)
> >> +{
> >> +	/* Mediated passthrough PMU should have PASSTHROUGH_VPMU cap. */
> >> +	if (!passthru_pmu)
> >> +		return;
> >> +
> >> +	if (passthru_pmu->switch_interrupt &&
> >> +	    try_module_get(passthru_pmu->module)) {
> >> +		passthru_pmu->switch_interrupt(enter, guest_lvtpc);
> >> +		module_put(passthru_pmu->module);
> >> +	}
> >> +}
> > 
> > Should we move the whole module reference to perf_pmu_(,un}register() ?
> 
> A PMU module can be load/unload anytime. How should we know if the PMU
> module is available when the reference check is moved to
> perf_pmu_(,un}register()?

Feh, dunno. I never really use modules. I just think the above is naf --
doubly so because you're saying the SRCU smp_mb() are expensive; but
this module reference crap is more expensive than that.

(IOW, the SRCU smp_mb() cannot have been the problem)

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 14/58] perf: Add switch_interrupt() interface
  2024-10-14 17:49       ` Peter Zijlstra
@ 2024-10-15 13:23         ` Liang, Kan
  0 siblings, 0 replies; 183+ messages in thread
From: Liang, Kan @ 2024-10-15 13:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
	Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das,
	Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Raghavendra Rao Ananta, kvm, linux-perf-users



On 2024-10-14 1:49 p.m., Peter Zijlstra wrote:
> On Mon, Oct 14, 2024 at 11:51:06AM -0400, Liang, Kan wrote:
>> On 2024-10-14 8:03 a.m., Peter Zijlstra wrote:
>>> On Thu, Aug 01, 2024 at 04:58:23AM +0000, Mingwei Zhang wrote:
>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>
>>>> There will be a dedicated interrupt vector for guests on some platforms,
>>>> e.g., Intel. Add an interface to switch the interrupt vector while
>>>> entering/exiting a guest.
>>>>
>>>> When PMI switch into a new guest vector, guest_lvtpc value need to be
>>>> reflected onto HW, e,g., guest clear PMI mask bit, the HW PMI mask
>>>> bit should be cleared also, then PMI can be generated continuously
>>>> for guest. So guest_lvtpc parameter is added into perf_guest_enter()
>>>> and switch_interrupt().
>>>>
>>>> At switch_interrupt(), the target pmu with PASSTHROUGH cap should
>>>> be found. Since only one passthrough pmu is supported, we keep the
>>>> implementation simply by tracking the pmu as a global variable.
>>>>
>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>>>
>>>> [Simplify the commit with removal of srcu lock/unlock since only one pmu is
>>>> supported.]
>>>>
>>>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>>>> ---
>>>>  include/linux/perf_event.h |  9 +++++++--
>>>>  kernel/events/core.c       | 36 ++++++++++++++++++++++++++++++++++--
>>>>  2 files changed, 41 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>>>> index 75773f9890cc..aeb08f78f539 100644
>>>> --- a/include/linux/perf_event.h
>>>> +++ b/include/linux/perf_event.h
>>>> @@ -541,6 +541,11 @@ struct pmu {
>>>>  	 * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
>>>>  	 */
>>>>  	int (*check_period)		(struct perf_event *event, u64 value); /* optional */
>>>> +
>>>> +	/*
>>>> +	 * Switch the interrupt vectors, e.g., guest enter/exit.
>>>> +	 */
>>>> +	void (*switch_interrupt)	(bool enter, u32 guest_lvtpc); /* optional */
>>>>  };
>>>
>>> I'm thinking the guets_lvtpc argument shouldn't be part of the
>>> interface. That should be PMU implementation data and accessed by the
>>> method implementation.
>>
>> I think the name of the perf_switch_interrupt() is too specific.
>> Here should be to switch the guest context. The interrupt should be just
>> part of the context. Maybe a interface as below
>>
>> void (*switch_guest_ctx)	(bool enter, void *data); /* optional */
> 
> I don't think you even need the data thing. For example, the x86/intel
> implementation can just look at a x86_pmu data field to find the magic
> value.

The new vector is created by KVM, not perf. So it cannot be found in the
x86_pmu data field. Perf needs it to update the interrupt vector so the
guest PMI can be handled by KVM directly.

Thanks,
Kan


^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 14/58] perf: Add switch_interrupt() interface
  2024-10-14 17:45               ` Peter Zijlstra
@ 2024-10-15 15:59                 ` Liang, Kan
  0 siblings, 0 replies; 183+ messages in thread
From: Liang, Kan @ 2024-10-15 15:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mingwei Zhang, Manali Shukla, Sean Christopherson, Paolo Bonzini,
	Xiong Zhang, Dapeng Mi, Kan Liang, Zhenyu Wang, Sandipan Das,
	Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Raghavendra Rao Ananta, kvm, linux-perf-users



On 2024-10-14 1:45 p.m., Peter Zijlstra wrote:
> On Mon, Oct 14, 2024 at 12:15:11PM -0400, Liang, Kan wrote:
>>
>>
>> On 2024-10-14 7:59 a.m., Peter Zijlstra wrote:
>>> On Mon, Sep 23, 2024 at 08:49:17PM +0200, Mingwei Zhang wrote:
>>>
>>>> The original implementation is by design having a terrible performance
>>>> overhead, ie., every PMU context switch at runtime requires a SRCU
>>>> lock pair and pmu list traversal. To reduce the overhead, we put
>>>> "passthrough" pmus in the front of the list and quickly exit the pmu
>>>> traversal when we just pass the last "passthrough" pmu.
>>>
>>> What was the expensive bit? The SRCU memory barrier or the list
>>> iteration? How long is that list really?
>>
>> Both. But I don't think there is any performance data.
>>
>> The length of the list could vary on different platforms. For a modern
>> server, there could be hundreds of PMUs from uncore PMUs, CXL PMUs,
>> IOMMU PMUs, PMUs of accelerator devices and PMUs from all kinds of
>> devices. The number could keep increasing with more and more devices
>> supporting the PMU capability.
>>
>> Two methods were considered.
>> - One is to add a global variable to track the "passthrough" pmu. The
>> idea assumes that there is only one "passthrough" pmu that requires the
>> switch, and the situation will not be changed in the near feature.
>> So the SRCU memory barrier and the list iteration can be avoided.
>> It's implemented in the patch
>>
>> - The other one is always put the "passthrough" pmus in the front of the
>> list. So the unnecessary list iteration can be avoided. It does nothing
>> for the SRCU lock pair.
> 
> PaulMck has patches that introduce srcu_read_lock_lite(), which would
> avoid the smp_mb() in most cases.
> 
>   https://lkml.kernel.org/r/20241011173931.2050422-6-paulmck@kernel.org
> 
> We can also keep a second list, just for the passthrough pmus. A bit
> like sched_cb_list.

Maybe we can do something like pmu_event_list, which has its own lock.
Then we would not need the SRCU lock or the module reference.
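
For illustration, a rough sketch of that direction (all of the names
below are hypothetical: a dedicated list guarded by its own lock instead
of SRCU and module references, and 'passthrough_entry' would be a new
list_head in struct pmu):

static LIST_HEAD(passthrough_pmus);
static DEFINE_RAW_SPINLOCK(passthrough_pmus_lock);

static void perf_switch_guest_ctx(bool enter, void *data)
{
	struct pmu *pmu;

	/* Called with IRQs disabled from perf_guest_{enter,exit}(). */
	raw_spin_lock(&passthrough_pmus_lock);
	list_for_each_entry(pmu, &passthrough_pmus, passthrough_entry)
		pmu->switch_guest_ctx(enter, data);
	raw_spin_unlock(&passthrough_pmus_lock);
}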

Thanks,
Kan



^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 15/58] perf/x86: Support switch_interrupt interface
  2024-09-09 22:11   ` Colton Lewis
  2024-09-10  5:00     ` Mi, Dapeng
@ 2024-10-24 19:45     ` Chen, Zide
  2024-10-25  0:52       ` Mi, Dapeng
  1 sibling, 1 reply; 183+ messages in thread
From: Chen, Zide @ 2024-10-24 19:45 UTC (permalink / raw)
  To: Colton Lewis, Mingwei Zhang
  Cc: seanjc, pbonzini, xiong.y.zhang, dapeng1.mi, kan.liang, zhenyuw,
	manali.shukla, sandipan.das, jmattson, eranian, irogers, namhyung,
	gce-passthrou-pmu-dev, samantha.alt, zhiyuan.lv, yanfei.xu,
	like.xu.linux, peterz, rananta, kvm, linux-perf-users



On 9/9/2024 3:11 PM, Colton Lewis wrote:
> Mingwei Zhang <mizhang@google.com> writes:
> 
>> From: Kan Liang <kan.liang@linux.intel.com>
> 
>> Implement switch_interrupt interface for x86 PMU, switch PMI to dedicated
>> KVM_GUEST_PMI_VECTOR at perf guest enter, and switch PMI back to
>> NMI at perf guest exit.
> 
>> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>   arch/x86/events/core.c | 11 +++++++++++
>>   1 file changed, 11 insertions(+)
> 
>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> index 5bf78cd619bf..b17ef8b6c1a6 100644
>> --- a/arch/x86/events/core.c
>> +++ b/arch/x86/events/core.c
>> @@ -2673,6 +2673,15 @@ static bool x86_pmu_filter(struct pmu *pmu, int
>> cpu)
>>       return ret;
>>   }
> 
>> +static void x86_pmu_switch_interrupt(bool enter, u32 guest_lvtpc)
>> +{
>> +    if (enter)
>> +        apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_GUEST_PMI_VECTOR |
>> +               (guest_lvtpc & APIC_LVT_MASKED));
>> +    else
>> +        apic_write(APIC_LVTPC, APIC_DM_NMI);
>> +}
>> +
> 
> Similar issue I point out in an earlier patch. #define
> KVM_GUEST_PMI_VECTOR is guarded by CONFIG_KVM but this code is not,
> which can result in compile errors.

Since KVM_GUEST_PMI_VECTOR and the interrupt handler are owned by KVM,
how about simplifying it to:

static void x86_pmu_switch_guest_ctx(bool enter, void *data)
{
	if (enter)
		apic_write(APIC_LVTPC, *(u32 *)data);
        ...
}

On the KVM side:
perf_guest_enter(whatever_lvtpc_value_it_decides);


>>   static struct pmu pmu = {
>>       .pmu_enable        = x86_pmu_enable,
>>       .pmu_disable        = x86_pmu_disable,
>> @@ -2702,6 +2711,8 @@ static struct pmu pmu = {
>>       .aux_output_match    = x86_pmu_aux_output_match,
> 
>>       .filter            = x86_pmu_filter,
>> +
>> +    .switch_interrupt    = x86_pmu_switch_interrupt,
>>   };
> 
>>   void arch_perf_update_userpage(struct perf_event *event,
> 


^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 41/58] KVM: x86/pmu: Add support for PMU context switch at VM-exit/enter
  2024-08-01  4:58 ` [RFC PATCH v3 41/58] KVM: x86/pmu: Add support for PMU context switch at VM-exit/enter Mingwei Zhang
@ 2024-10-24 19:57   ` Chen, Zide
  2024-10-25  2:55     ` Mi, Dapeng
  0 siblings, 1 reply; 183+ messages in thread
From: Chen, Zide @ 2024-10-24 19:57 UTC (permalink / raw)
  To: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
	Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users



On 7/31/2024 9:58 PM, Mingwei Zhang wrote:
> From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> 
> Add correct PMU context switch at VM_entry/exit boundary.
> 
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/x86.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index dd6d2c334d90..70274c0da017 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11050,6 +11050,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  		set_debugreg(0, 7);
>  	}
>  
> +	if (is_passthrough_pmu_enabled(vcpu))
> +		kvm_pmu_restore_pmu_context(vcpu);

Suggest moving the is_passthrough_pmu_enabled() check into the PMU
restore API to keep x86.c clean. It's up to the PMU to decide in which
scenarios it needs to do the context switch.
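
For illustration, a minimal sketch of that shape (the wrapper name is
taken from the patch; the static_call op name is only assumed from the
series' naming convention):

void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu)
{
	/* The PMU decides for itself whether a context switch is needed. */
	if (!is_passthrough_pmu_enabled(vcpu))
		return;

	static_call_cond(kvm_x86_pmu_restore_pmu_context)(vcpu);
}

The call sites in vcpu_enter_guest() then become plain, unconditional
kvm_pmu_restore_pmu_context()/kvm_pmu_save_pmu_context() calls.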

> +
>  	guest_timing_enter_irqoff();
>  
>  	for (;;) {
> @@ -11078,6 +11081,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  		++vcpu->stat.exits;
>  	}
>  
> +	if (is_passthrough_pmu_enabled(vcpu))
> +		kvm_pmu_save_pmu_context(vcpu);

ditto.

>  	/*
>  	 * Do this here before restoring debug registers on the host.  And
>  	 * since we do this before handling the vmexit, a DR access vmexit


^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 27/58] KVM: x86/pmu: Create a function prototype to disable MSR interception
  2024-08-01  4:58 ` [RFC PATCH v3 27/58] KVM: x86/pmu: Create a function prototype to disable MSR interception Mingwei Zhang
@ 2024-10-24 19:58   ` Chen, Zide
  2024-10-25  2:50     ` Mi, Dapeng
  2024-11-19 18:17   ` Sean Christopherson
  1 sibling, 1 reply; 183+ messages in thread
From: Chen, Zide @ 2024-10-24 19:58 UTC (permalink / raw)
  To: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
	Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users



On 7/31/2024 9:58 PM, Mingwei Zhang wrote:
> Add one extra pmu function prototype in kvm_pmu_ops to disable PMU MSR
> interception.
> 
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
> ---
>  arch/x86/include/asm/kvm-x86-pmu-ops.h | 1 +
>  arch/x86/kvm/cpuid.c                   | 4 ++++
>  arch/x86/kvm/pmu.c                     | 5 +++++
>  arch/x86/kvm/pmu.h                     | 2 ++
>  4 files changed, 12 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h
> index fd986d5146e4..1b7876dcb3c3 100644
> --- a/arch/x86/include/asm/kvm-x86-pmu-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h
> @@ -24,6 +24,7 @@ KVM_X86_PMU_OP(is_rdpmc_passthru_allowed)
>  KVM_X86_PMU_OP_OPTIONAL(reset)
>  KVM_X86_PMU_OP_OPTIONAL(deliver_pmi)
>  KVM_X86_PMU_OP_OPTIONAL(cleanup)
> +KVM_X86_PMU_OP_OPTIONAL(passthrough_pmu_msrs)
>  
>  #undef KVM_X86_PMU_OP
>  #undef KVM_X86_PMU_OP_OPTIONAL
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index f2f2be5d1141..3deb79b39847 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -381,6 +381,10 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>  	vcpu->arch.reserved_gpa_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu);
>  
>  	kvm_pmu_refresh(vcpu);
> +
> +	if (is_passthrough_pmu_enabled(vcpu))
> +		kvm_pmu_passthrough_pmu_msrs(vcpu);
> +
>  	vcpu->arch.cr4_guest_rsvd_bits =
>  	    __cr4_reserved_bits(guest_cpuid_has, vcpu);
>  
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 3afefe4cf6e2..bd94f2d67f5c 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -1059,3 +1059,8 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
>  	kfree(filter);
>  	return r;
>  }
> +
> +void kvm_pmu_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
> +{
> +	static_call_cond(kvm_x86_pmu_passthrough_pmu_msrs)(vcpu);
> +}
> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index e1af6d07b191..63f876557716 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -41,6 +41,7 @@ struct kvm_pmu_ops {
>  	void (*deliver_pmi)(struct kvm_vcpu *vcpu);
>  	void (*cleanup)(struct kvm_vcpu *vcpu);
>  	bool (*is_rdpmc_passthru_allowed)(struct kvm_vcpu *vcpu);
> +	void (*passthrough_pmu_msrs)(struct kvm_vcpu *vcpu);

Seems after_set_cpuid() would be a better name. It's more generic and
reflects the fact that the PMU needs to do something after userspace sets
CPUID. Currently the PMU needs to update the MSR interception policy, but
it may want to do more in the future.

Also, it's more consistent with the other APIs called from
kvm_vcpu_after_set_cpuid().

>  
>  	const u64 EVENTSEL_EVENT;
>  	const int MAX_NR_GP_COUNTERS;
> @@ -292,6 +293,7 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
>  int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
>  void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel);
>  bool kvm_pmu_check_rdpmc_passthrough(struct kvm_vcpu *vcpu);
> +void kvm_pmu_passthrough_pmu_msrs(struct kvm_vcpu *vcpu);
>  
>  bool is_vmware_backdoor_pmc(u32 pmc_idx);
>  


^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 26/58] KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL
  2024-08-01  4:58 ` [RFC PATCH v3 26/58] KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL Mingwei Zhang
  2024-08-06  7:04   ` Mi, Dapeng
@ 2024-10-24 20:26   ` Chen, Zide
  2024-10-25  2:36     ` Mi, Dapeng
  2024-11-19 18:16   ` Sean Christopherson
  2 siblings, 1 reply; 183+ messages in thread
From: Chen, Zide @ 2024-10-24 20:26 UTC (permalink / raw)
  To: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
	Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users



On 7/31/2024 9:58 PM, Mingwei Zhang wrote:

> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 339742350b7a..34a420fa98c5 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4394,6 +4394,97 @@ static u32 vmx_pin_based_exec_ctrl(struct vcpu_vmx *vmx)
>  	return pin_based_exec_ctrl;
>  }
>  
> +static void vmx_set_perf_global_ctrl(struct vcpu_vmx *vmx)
> +{
> +	u32 vmentry_ctrl = vm_entry_controls_get(vmx);
> +	u32 vmexit_ctrl = vm_exit_controls_get(vmx);
> +	struct vmx_msrs *m;
> +	int i;
> +
> +	if (cpu_has_perf_global_ctrl_bug() ||
> +	    !is_passthrough_pmu_enabled(&vmx->vcpu)) {
> +		vmentry_ctrl &= ~VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
> +		vmexit_ctrl &= ~VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
> +		vmexit_ctrl &= ~VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL;
> +	}
> +
> +	if (is_passthrough_pmu_enabled(&vmx->vcpu)) {
> +		/*
> +		 * Setup auto restore guest PERF_GLOBAL_CTRL MSR at vm entry.
> +		 */
> +		if (vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) {
> +			vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, 0);

To save and restore the Global Ctrl MSR at VMX transitions, I'm wondering
whether there are particular reasons why we prefer the VMCS exec controls
over the VMX-transition MSR areas. If not, I'd suggest using only the MSR
area approach, for two reasons:

1. Simpler code. In this patch set, in total it takes ~100 LOC to handle
the switch of this MSR.
2. With the exec-ctrl approach, it takes one expensive VMCS read
instruction to save the guest Global Ctrl on every VM exit and one VMCS
write on VM entry. (covered in patch 37)


> +		} else {
> +			m = &vmx->msr_autoload.guest;
> +			i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
> +			if (i < 0) {
> +				i = m->nr++;
> +				vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
> +			}
> +			m->val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
> +			m->val[i].value = 0;
> +		}
> +		/*
> +		 * Setup auto clear host PERF_GLOBAL_CTRL msr at vm exit.
> +		 */
> +		if (vmexit_ctrl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL) {
> +			vmcs_write64(HOST_IA32_PERF_GLOBAL_CTRL, 0);

ditto.

> +		} else {
> +			m = &vmx->msr_autoload.host;
> +			i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
> +			if (i < 0) {
> +				i = m->nr++;
> +				vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->nr);
> +			}
> +			m->val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
> +			m->val[i].value = 0;
> +		}
> +		/*
> +		 * Setup auto save guest PERF_GLOBAL_CTRL msr at vm exit
> +		 */
> +		if (!(vmexit_ctrl & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL)) {
> +			m = &vmx->msr_autostore.guest;
> +			i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
> +			if (i < 0) {
> +				i = m->nr++;
> +				vmcs_write32(VM_EXIT_MSR_STORE_COUNT, m->nr);
> +			}
> +			m->val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
> +		}
> +	} else {
> +		if (!(vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL)) {
> +			m = &vmx->msr_autoload.guest;
> +			i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
> +			if (i >= 0) {
> +				m->nr--;
> +				m->val[i] = m->val[m->nr];
> +				vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
> +			}
> +		}
> +		if (!(vmexit_ctrl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL)) {
> +			m = &vmx->msr_autoload.host;
> +			i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
> +			if (i >= 0) {
> +				m->nr--;
> +				m->val[i] = m->val[m->nr];
> +				vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->nr);
> +			}
> +		}
> +		if (!(vmexit_ctrl & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL)) {
> +			m = &vmx->msr_autostore.guest;
> +			i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
> +			if (i >= 0) {
> +				m->nr--;
> +				m->val[i] = m->val[m->nr];
> +				vmcs_write32(VM_EXIT_MSR_STORE_COUNT, m->nr);
> +			}
> +		}
> +	}
> +
> +	vm_entry_controls_set(vmx, vmentry_ctrl);
> +	vm_exit_controls_set(vmx, vmexit_ctrl);
> +}
> +
>  static u32 vmx_vmentry_ctrl(void)
>  {
>  	u32 vmentry_ctrl = vmcs_config.vmentry_ctrl;



^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 37/58] KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary
  2024-08-01  4:58 ` [RFC PATCH v3 37/58] KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary Mingwei Zhang
@ 2024-10-24 20:26   ` Chen, Zide
  2024-10-25  2:51     ` Mi, Dapeng
  2024-10-31  3:14   ` Mi, Dapeng
  1 sibling, 1 reply; 183+ messages in thread
From: Chen, Zide @ 2024-10-24 20:26 UTC (permalink / raw)
  To: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
	Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users



On 7/31/2024 9:58 PM, Mingwei Zhang wrote:

> @@ -7295,6 +7299,46 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
>  					msrs[i].host, false);
>  }
>  
> +static void save_perf_global_ctrl_in_passthrough_pmu(struct vcpu_vmx *vmx)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
> +	int i;
> +
> +	if (vm_exit_controls_get(vmx) & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL) {
> +		pmu->global_ctrl = vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL);

As commented on patch 26, compared with the MSR auto save/store area
approach, the exec-control way needs one relatively expensive VMCS read
on every VM exit.

> +	} else {
> +		i = pmu->global_ctrl_slot_in_autostore;
> +		pmu->global_ctrl = vmx->msr_autostore.guest.val[i].value;
> +	}
> +}
> +
> +static void load_perf_global_ctrl_in_passthrough_pmu(struct vcpu_vmx *vmx)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
> +	u64 global_ctrl = pmu->global_ctrl;
> +	int i;
> +
> +	if (vm_entry_controls_get(vmx) & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) {
> +		vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, global_ctrl);

ditto.

We may optimize it by introducing a new flag, pmu->global_ctrl_dirty, and
updating GUEST_IA32_PERF_GLOBAL_CTRL only when needed. But this makes the
code even more complicated.
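
For illustration, a minimal sketch of that optimization
(pmu->global_ctrl_dirty is hypothetical and would have to be set wherever
pmu->global_ctrl is updated):

static void load_perf_global_ctrl_in_passthrough_pmu(struct vcpu_vmx *vmx)
{
	struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);

	/* Skip the VMCS write / autoload update when nothing changed. */
	if (!pmu->global_ctrl_dirty)
		return;

	if (vm_entry_controls_get(vmx) & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL)
		vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, pmu->global_ctrl);
	else
		vmx->msr_autoload.guest.val[pmu->global_ctrl_slot_in_autoload].value =
			pmu->global_ctrl;

	pmu->global_ctrl_dirty = false;
}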


> +	} else {
> +		i = pmu->global_ctrl_slot_in_autoload;
> +		vmx->msr_autoload.guest.val[i].value = global_ctrl;
> +	}
> +}
> +
> +static void __atomic_switch_perf_msrs_in_passthrough_pmu(struct vcpu_vmx *vmx)
> +{
> +	load_perf_global_ctrl_in_passthrough_pmu(vmx);
> +}
> +
> +static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
> +{
> +	if (is_passthrough_pmu_enabled(&vmx->vcpu))
> +		__atomic_switch_perf_msrs_in_passthrough_pmu(vmx);
> +	else
> +		__atomic_switch_perf_msrs(vmx);
> +}
> +
>  static void vmx_update_hv_timer(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);



^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 15/58] perf/x86: Support switch_interrupt interface
  2024-10-24 19:45     ` Chen, Zide
@ 2024-10-25  0:52       ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-10-25  0:52 UTC (permalink / raw)
  To: Chen, Zide, Colton Lewis, Mingwei Zhang
  Cc: seanjc, pbonzini, xiong.y.zhang, kan.liang, zhenyuw,
	manali.shukla, sandipan.das, jmattson, eranian, irogers, namhyung,
	gce-passthrou-pmu-dev, samantha.alt, zhiyuan.lv, yanfei.xu,
	like.xu.linux, peterz, rananta, kvm, linux-perf-users


On 10/25/2024 3:45 AM, Chen, Zide wrote:
>
> On 9/9/2024 3:11 PM, Colton Lewis wrote:
>> Mingwei Zhang <mizhang@google.com> writes:
>>
>>> From: Kan Liang <kan.liang@linux.intel.com>
>>> Implement switch_interrupt interface for x86 PMU, switch PMI to dedicated
>>> KVM_GUEST_PMI_VECTOR at perf guest enter, and switch PMI back to
>>> NMI at perf guest exit.
>>> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
>>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>>> ---
>>>   arch/x86/events/core.c | 11 +++++++++++
>>>   1 file changed, 11 insertions(+)
>>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>>> index 5bf78cd619bf..b17ef8b6c1a6 100644
>>> --- a/arch/x86/events/core.c
>>> +++ b/arch/x86/events/core.c
>>> @@ -2673,6 +2673,15 @@ static bool x86_pmu_filter(struct pmu *pmu, int
>>> cpu)
>>>       return ret;
>>>   }
>>> +static void x86_pmu_switch_interrupt(bool enter, u32 guest_lvtpc)
>>> +{
>>> +    if (enter)
>>> +        apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_GUEST_PMI_VECTOR |
>>> +               (guest_lvtpc & APIC_LVT_MASKED));
>>> +    else
>>> +        apic_write(APIC_LVTPC, APIC_DM_NMI);
>>> +}
>>> +
>> Similar issue I point out in an earlier patch. #define
>> KVM_GUEST_PMI_VECTOR is guarded by CONFIG_KVM but this code is not,
>> which can result in compile errors.
> Since KVM_GUEST_PMI_VECTOR and the interrupt handler are owned by KVM,
> how about to simplify it to:
>
> static void x86_pmu_switch_guest_ctx(bool enter, void *data)
> {
> 	if (enter)
> 		apic_write(APIC_LVTPC, *(u32 *)data);
>         ...
> }
>
> In KVM side:
> perf_guest_enter(whatever_lvtpc_value_it_decides);

Good point. Will address it in v4.


>
>
>>>   static struct pmu pmu = {
>>>       .pmu_enable        = x86_pmu_enable,
>>>       .pmu_disable        = x86_pmu_disable,
>>> @@ -2702,6 +2711,8 @@ static struct pmu pmu = {
>>>       .aux_output_match    = x86_pmu_aux_output_match,
>>>       .filter            = x86_pmu_filter,
>>> +
>>> +    .switch_interrupt    = x86_pmu_switch_interrupt,
>>>   };
>>>   void arch_perf_update_userpage(struct perf_event *event,

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 26/58] KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL
  2024-10-24 20:26   ` Chen, Zide
@ 2024-10-25  2:36     ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-10-25  2:36 UTC (permalink / raw)
  To: Chen, Zide, Mingwei Zhang, Sean Christopherson, Paolo Bonzini,
	Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 10/25/2024 4:26 AM, Chen, Zide wrote:
>
> On 7/31/2024 9:58 PM, Mingwei Zhang wrote:
>
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 339742350b7a..34a420fa98c5 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -4394,6 +4394,97 @@ static u32 vmx_pin_based_exec_ctrl(struct vcpu_vmx *vmx)
>>  	return pin_based_exec_ctrl;
>>  }
>>  
>> +static void vmx_set_perf_global_ctrl(struct vcpu_vmx *vmx)
>> +{
>> +	u32 vmentry_ctrl = vm_entry_controls_get(vmx);
>> +	u32 vmexit_ctrl = vm_exit_controls_get(vmx);
>> +	struct vmx_msrs *m;
>> +	int i;
>> +
>> +	if (cpu_has_perf_global_ctrl_bug() ||
>> +	    !is_passthrough_pmu_enabled(&vmx->vcpu)) {
>> +		vmentry_ctrl &= ~VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
>> +		vmexit_ctrl &= ~VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
>> +		vmexit_ctrl &= ~VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL;
>> +	}
>> +
>> +	if (is_passthrough_pmu_enabled(&vmx->vcpu)) {
>> +		/*
>> +		 * Setup auto restore guest PERF_GLOBAL_CTRL MSR at vm entry.
>> +		 */
>> +		if (vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) {
>> +			vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, 0);
> To save and restore Global Ctrl MSR at VMX transitions, I'm wondering if
> there are particular reasons why we prefer VMCS exec control over
> VMX-transition MSR areas? If no, I'd suggest to use the MSR area
> approach only for two reasons:
>
> 1. Simpler code. In this patch set, in total it takes ~100 LOC to handle
> the switch of this MSR.
> 2. With exec ctr approach, it needs one expensive VMCS read instruction
> to save guest Global Ctrl on every VM exit and one VMCS write in VM
> entry. (covered in patch 37)

In my opinion, there are two reasons why we prefer to use the VMCS exec
controls to save/restore the GLOBAL_CTRL MSR.

1. The VMCS exec-control save/load happens before the MSR-area save/load
in the VM-Exit/VM-Entry process.

    Take VM-Exit as an example; the sequence (copied from SDM Chapter 28, "VM
EXITS") is

    "

    2. Processor state is saved in the guest-state area (Section 28.3).

    3. MSRs may be saved in the VM-exit MSR-store area (Section 28.4).

    4. The following may be performed in parallel and in any order (Section
28.5):

        — Processor state is loaded based in part on the host-state area
and some VM-exit controls.

        — Address-range monitoring is cleared.

    5. MSRs may be loaded from the VM-exit MSR-load area (Section 28.6).

    "

    In our mediated vPMU implementation, we want the guest counters to be
disabled (load 0 into the host global_ctrl) as soon as possible. That helps
avoid some race conditions in theory, such as a guest overflow PMI leaking
into the host during the VM-Exit process, although we haven't actually
observed that on Intel platforms.


2. Currently VMX saves/restores the MSRs in the MSR auto-load/restore area
by MSR index. Per the SDM's recommendation, the GLOBAL_CTRL MSR should be
enabled last among all PMU MSRs. If there are multiple PMU MSRs in the
MSR auto-load/restore area, the restore sequence may not meet this
requirement, since GLOBAL_CTRL doesn't always have the largest MSR index.
Of course, we don't have this issue right now, since the current
implementation only saves/restores a single MSR (global_ctrl) via the MSR
auto-load/store area.


But yes, frequent vmcs_read()/vmcs_write() indeed brings some kind of
performance hit. We may need to do a performance evaluation and look at how
big the performance impact is.



>
>
>> +		} else {
>> +			m = &vmx->msr_autoload.guest;
>> +			i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
>> +			if (i < 0) {
>> +				i = m->nr++;
>> +				vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
>> +			}
>> +			m->val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
>> +			m->val[i].value = 0;
>> +		}
>> +		/*
>> +		 * Setup auto clear host PERF_GLOBAL_CTRL msr at vm exit.
>> +		 */
>> +		if (vmexit_ctrl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL) {
>> +			vmcs_write64(HOST_IA32_PERF_GLOBAL_CTRL, 0);
> ditto.
>
>> +		} else {
>> +			m = &vmx->msr_autoload.host;
>> +			i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
>> +			if (i < 0) {
>> +				i = m->nr++;
>> +				vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->nr);
>> +			}
>> +			m->val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
>> +			m->val[i].value = 0;
>> +		}
>> +		/*
>> +		 * Setup auto save guest PERF_GLOBAL_CTRL msr at vm exit
>> +		 */
>> +		if (!(vmexit_ctrl & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL)) {
>> +			m = &vmx->msr_autostore.guest;
>> +			i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
>> +			if (i < 0) {
>> +				i = m->nr++;
>> +				vmcs_write32(VM_EXIT_MSR_STORE_COUNT, m->nr);
>> +			}
>> +			m->val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
>> +		}
>> +	} else {
>> +		if (!(vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL)) {
>> +			m = &vmx->msr_autoload.guest;
>> +			i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
>> +			if (i >= 0) {
>> +				m->nr--;
>> +				m->val[i] = m->val[m->nr];
>> +				vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
>> +			}
>> +		}
>> +		if (!(vmexit_ctrl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL)) {
>> +			m = &vmx->msr_autoload.host;
>> +			i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
>> +			if (i >= 0) {
>> +				m->nr--;
>> +				m->val[i] = m->val[m->nr];
>> +				vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->nr);
>> +			}
>> +		}
>> +		if (!(vmexit_ctrl & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL)) {
>> +			m = &vmx->msr_autostore.guest;
>> +			i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
>> +			if (i >= 0) {
>> +				m->nr--;
>> +				m->val[i] = m->val[m->nr];
>> +				vmcs_write32(VM_EXIT_MSR_STORE_COUNT, m->nr);
>> +			}
>> +		}
>> +	}
>> +
>> +	vm_entry_controls_set(vmx, vmentry_ctrl);
>> +	vm_exit_controls_set(vmx, vmexit_ctrl);
>> +}
>> +
>>  static u32 vmx_vmentry_ctrl(void)
>>  {
>>  	u32 vmentry_ctrl = vmcs_config.vmentry_ctrl;
>

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 27/58] KVM: x86/pmu: Create a function prototype to disable MSR interception
  2024-10-24 19:58   ` Chen, Zide
@ 2024-10-25  2:50     ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-10-25  2:50 UTC (permalink / raw)
  To: Chen, Zide, Mingwei Zhang, Sean Christopherson, Paolo Bonzini,
	Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 10/25/2024 3:58 AM, Chen, Zide wrote:
>
> On 7/31/2024 9:58 PM, Mingwei Zhang wrote:
>> Add one extra pmu function prototype in kvm_pmu_ops to disable PMU MSR
>> interception.
>>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
>> ---
>>  arch/x86/include/asm/kvm-x86-pmu-ops.h | 1 +
>>  arch/x86/kvm/cpuid.c                   | 4 ++++
>>  arch/x86/kvm/pmu.c                     | 5 +++++
>>  arch/x86/kvm/pmu.h                     | 2 ++
>>  4 files changed, 12 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h
>> index fd986d5146e4..1b7876dcb3c3 100644
>> --- a/arch/x86/include/asm/kvm-x86-pmu-ops.h
>> +++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h
>> @@ -24,6 +24,7 @@ KVM_X86_PMU_OP(is_rdpmc_passthru_allowed)
>>  KVM_X86_PMU_OP_OPTIONAL(reset)
>>  KVM_X86_PMU_OP_OPTIONAL(deliver_pmi)
>>  KVM_X86_PMU_OP_OPTIONAL(cleanup)
>> +KVM_X86_PMU_OP_OPTIONAL(passthrough_pmu_msrs)
>>  
>>  #undef KVM_X86_PMU_OP
>>  #undef KVM_X86_PMU_OP_OPTIONAL
>> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
>> index f2f2be5d1141..3deb79b39847 100644
>> --- a/arch/x86/kvm/cpuid.c
>> +++ b/arch/x86/kvm/cpuid.c
>> @@ -381,6 +381,10 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>>  	vcpu->arch.reserved_gpa_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu);
>>  
>>  	kvm_pmu_refresh(vcpu);
>> +
>> +	if (is_passthrough_pmu_enabled(vcpu))
>> +		kvm_pmu_passthrough_pmu_msrs(vcpu);
>> +
>>  	vcpu->arch.cr4_guest_rsvd_bits =
>>  	    __cr4_reserved_bits(guest_cpuid_has, vcpu);
>>  
>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>> index 3afefe4cf6e2..bd94f2d67f5c 100644
>> --- a/arch/x86/kvm/pmu.c
>> +++ b/arch/x86/kvm/pmu.c
>> @@ -1059,3 +1059,8 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
>>  	kfree(filter);
>>  	return r;
>>  }
>> +
>> +void kvm_pmu_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
>> +{
>> +	static_call_cond(kvm_x86_pmu_passthrough_pmu_msrs)(vcpu);
>> +}
>> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
>> index e1af6d07b191..63f876557716 100644
>> --- a/arch/x86/kvm/pmu.h
>> +++ b/arch/x86/kvm/pmu.h
>> @@ -41,6 +41,7 @@ struct kvm_pmu_ops {
>>  	void (*deliver_pmi)(struct kvm_vcpu *vcpu);
>>  	void (*cleanup)(struct kvm_vcpu *vcpu);
>>  	bool (*is_rdpmc_passthru_allowed)(struct kvm_vcpu *vcpu);
>> +	void (*passthrough_pmu_msrs)(struct kvm_vcpu *vcpu);
> Seems after_set_cpuid() is a better name. It's more generic to reflect
> the fact that PMU needs to do something after userspace sets CPUID.
> Currently PMU needs to update the MSR interception policy, but it may
> want to do more in the future.
>
> Also, it's more consistent to other APIs called in
> kvm_vcpu_after_set_cpuid().

Looks reasonable.


>
>>  
>>  	const u64 EVENTSEL_EVENT;
>>  	const int MAX_NR_GP_COUNTERS;
>> @@ -292,6 +293,7 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
>>  int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
>>  void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel);
>>  bool kvm_pmu_check_rdpmc_passthrough(struct kvm_vcpu *vcpu);
>> +void kvm_pmu_passthrough_pmu_msrs(struct kvm_vcpu *vcpu);
>>  
>>  bool is_vmware_backdoor_pmc(u32 pmc_idx);
>>  

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 37/58] KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary
  2024-10-24 20:26   ` Chen, Zide
@ 2024-10-25  2:51     ` Mi, Dapeng
  2024-11-19  1:46       ` Sean Christopherson
  0 siblings, 1 reply; 183+ messages in thread
From: Mi, Dapeng @ 2024-10-25  2:51 UTC (permalink / raw)
  To: Chen, Zide, Mingwei Zhang, Sean Christopherson, Paolo Bonzini,
	Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 10/25/2024 4:26 AM, Chen, Zide wrote:
>
> On 7/31/2024 9:58 PM, Mingwei Zhang wrote:
>
>> @@ -7295,6 +7299,46 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
>>  					msrs[i].host, false);
>>  }
>>  
>> +static void save_perf_global_ctrl_in_passthrough_pmu(struct vcpu_vmx *vmx)
>> +{
>> +	struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
>> +	int i;
>> +
>> +	if (vm_exit_controls_get(vmx) & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL) {
>> +		pmu->global_ctrl = vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL);
> As commented in patch 26, compared with MSR auto save/store area
> approach, the exec control way needs one relatively expensive VMCS read
> on every VM exit.

Anyway, let us have an evaluation and let the data speak.


>
>> +	} else {
>> +		i = pmu->global_ctrl_slot_in_autostore;
>> +		pmu->global_ctrl = vmx->msr_autostore.guest.val[i].value;
>> +	}
>> +}
>> +
>> +static void load_perf_global_ctrl_in_passthrough_pmu(struct vcpu_vmx *vmx)
>> +{
>> +	struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
>> +	u64 global_ctrl = pmu->global_ctrl;
>> +	int i;
>> +
>> +	if (vm_entry_controls_get(vmx) & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) {
>> +		vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, global_ctrl);
> ditto.
>
> We may optimize it by introducing a new flag pmu->global_ctrl_dirty and
> update GUEST_IA32_PERF_GLOBAL_CTRL only when it's needed.  But this
> makes the code even more complicated.
>
>
>> +	} else {
>> +		i = pmu->global_ctrl_slot_in_autoload;
>> +		vmx->msr_autoload.guest.val[i].value = global_ctrl;
>> +	}
>> +}
>> +
>> +static void __atomic_switch_perf_msrs_in_passthrough_pmu(struct vcpu_vmx *vmx)
>> +{
>> +	load_perf_global_ctrl_in_passthrough_pmu(vmx);
>> +}
>> +
>> +static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
>> +{
>> +	if (is_passthrough_pmu_enabled(&vmx->vcpu))
>> +		__atomic_switch_perf_msrs_in_passthrough_pmu(vmx);
>> +	else
>> +		__atomic_switch_perf_msrs(vmx);
>> +}
>> +
>>  static void vmx_update_hv_timer(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>>  {
>>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 41/58] KVM: x86/pmu: Add support for PMU context switch at VM-exit/enter
  2024-10-24 19:57   ` Chen, Zide
@ 2024-10-25  2:55     ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-10-25  2:55 UTC (permalink / raw)
  To: Chen, Zide, Mingwei Zhang, Sean Christopherson, Paolo Bonzini,
	Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 10/25/2024 3:57 AM, Chen, Zide wrote:
>
> On 7/31/2024 9:58 PM, Mingwei Zhang wrote:
>> From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>>
>> Add the correct PMU context switch at the VM-entry/exit boundary.
>>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>  arch/x86/kvm/x86.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index dd6d2c334d90..70274c0da017 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -11050,6 +11050,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>>  		set_debugreg(0, 7);
>>  	}
>>  
>> +	if (is_passthrough_pmu_enabled(vcpu))
>> +		kvm_pmu_restore_pmu_context(vcpu);
> Suggest moving is_passthrough_pmu_enabled() into the PMU restore API to
> keep x86.c clean. It's up to the PMU to decide in which scenarios it needs
> to do a context switch.

Agree.
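
A rough sketch of how the check could move into the common wrapper (assuming
the kvm_pmu_save/restore_pmu_context() wrappers introduced earlier in this
series):

	void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu)
	{
		/* The mediated/passthrough PMU owns the context switch. */
		if (!is_passthrough_pmu_enabled(vcpu))
			return;

		static_call_cond(kvm_x86_pmu_restore_pmu_context)(vcpu);
	}

so that vcpu_enter_guest() can call kvm_pmu_restore_pmu_context(vcpu)
unconditionally, and likewise for the save path on exit.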


>
>> +
>>  	guest_timing_enter_irqoff();
>>  
>>  	for (;;) {
>> @@ -11078,6 +11081,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>>  		++vcpu->stat.exits;
>>  	}
>>  
>> +	if (is_passthrough_pmu_enabled(vcpu))
>> +		kvm_pmu_save_pmu_context(vcpu);
> ditto.
>
>>  	/*
>>  	 * Do this here before restoring debug registers on the host.  And
>>  	 * since we do this before handling the vmexit, a DR access vmexit

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 43/58] KVM: x86/pmu: Introduce PMU operator for setting counter overflow
  2024-08-01  4:58 ` [RFC PATCH v3 43/58] KVM: x86/pmu: Introduce PMU operator for setting counter overflow Mingwei Zhang
@ 2024-10-25 16:16   ` Chen, Zide
  2024-10-27 12:06     ` Mi, Dapeng
  2024-11-20 18:48     ` Sean Christopherson
  0 siblings, 2 replies; 183+ messages in thread
From: Chen, Zide @ 2024-10-25 16:16 UTC (permalink / raw)
  To: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
	Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users



On 7/31/2024 9:58 PM, Mingwei Zhang wrote:
> Introduce a PMU operator for setting counter overflow. When emulating counter
> increments, multiple counters could overflow at the same time, i.e., during
> the execution of the same instruction. In the passthrough PMU, having a PMU
> operator makes it convenient to update the PMU global status in one shot,
> with the details hidden behind the vendor-specific implementation.

Since neither Intel nor AMD implements this API, this patch should be
dropped.

> 
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
>  arch/x86/include/asm/kvm-x86-pmu-ops.h | 1 +
>  arch/x86/kvm/pmu.h                     | 1 +
>  arch/x86/kvm/vmx/pmu_intel.c           | 5 +++++
>  3 files changed, 7 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h
> index 72ca78df8d2b..bd5b118a5ce5 100644
> --- a/arch/x86/include/asm/kvm-x86-pmu-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h
> @@ -28,6 +28,7 @@ KVM_X86_PMU_OP_OPTIONAL(passthrough_pmu_msrs)
>  KVM_X86_PMU_OP_OPTIONAL(save_pmu_context)
>  KVM_X86_PMU_OP_OPTIONAL(restore_pmu_context)
>  KVM_X86_PMU_OP_OPTIONAL(incr_counter)
> +KVM_X86_PMU_OP_OPTIONAL(set_overflow)
>  
>  #undef KVM_X86_PMU_OP
>  #undef KVM_X86_PMU_OP_OPTIONAL
> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index 325f17673a00..78a7f0c5f3ba 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -45,6 +45,7 @@ struct kvm_pmu_ops {
>  	void (*save_pmu_context)(struct kvm_vcpu *vcpu);
>  	void (*restore_pmu_context)(struct kvm_vcpu *vcpu);
>  	bool (*incr_counter)(struct kvm_pmc *pmc);
> +	void (*set_overflow)(struct kvm_vcpu *vcpu);
>  
>  	const u64 EVENTSEL_EVENT;
>  	const int MAX_NR_GP_COUNTERS;
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 42af2404bdb9..2d46c911f0b7 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -881,6 +881,10 @@ static void intel_restore_guest_pmu_context(struct kvm_vcpu *vcpu)
>  	wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl_hw);
>  }
>  
> +static void intel_set_overflow(struct kvm_vcpu *vcpu)
> +{
> +}
> +
>  struct kvm_pmu_ops intel_pmu_ops __initdata = {
>  	.rdpmc_ecx_to_pmc = intel_rdpmc_ecx_to_pmc,
>  	.msr_idx_to_pmc = intel_msr_idx_to_pmc,
> @@ -897,6 +901,7 @@ struct kvm_pmu_ops intel_pmu_ops __initdata = {
>  	.save_pmu_context = intel_save_guest_pmu_context,
>  	.restore_pmu_context = intel_restore_guest_pmu_context,
>  	.incr_counter = intel_incr_counter,
> +	.set_overflow = intel_set_overflow,
>  	.EVENTSEL_EVENT = ARCH_PERFMON_EVENTSEL_EVENT,
>  	.MAX_NR_GP_COUNTERS = KVM_INTEL_PMC_MAX_GENERIC,
>  	.MIN_NR_GP_COUNTERS = 1,


^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 43/58] KVM: x86/pmu: Introduce PMU operator for setting counter overflow
  2024-10-25 16:16   ` Chen, Zide
@ 2024-10-27 12:06     ` Mi, Dapeng
  2024-11-20 18:48     ` Sean Christopherson
  1 sibling, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-10-27 12:06 UTC (permalink / raw)
  To: Chen, Zide, Mingwei Zhang, Sean Christopherson, Paolo Bonzini,
	Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 10/26/2024 12:16 AM, Chen, Zide wrote:
>
> On 7/31/2024 9:58 PM, Mingwei Zhang wrote:
>> Introduce a PMU operator for setting counter overflow. When emulating counter
>> increments, multiple counters could overflow at the same time, i.e., during
>> the execution of the same instruction. In the passthrough PMU, having a PMU
>> operator makes it convenient to update the PMU global status in one shot,
>> with the details hidden behind the vendor-specific implementation.
> Since neither Intel nor AMD implements this API, this patch should be
> dropped.

oh, yes.


>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> ---
>>  arch/x86/include/asm/kvm-x86-pmu-ops.h | 1 +
>>  arch/x86/kvm/pmu.h                     | 1 +
>>  arch/x86/kvm/vmx/pmu_intel.c           | 5 +++++
>>  3 files changed, 7 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h
>> index 72ca78df8d2b..bd5b118a5ce5 100644
>> --- a/arch/x86/include/asm/kvm-x86-pmu-ops.h
>> +++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h
>> @@ -28,6 +28,7 @@ KVM_X86_PMU_OP_OPTIONAL(passthrough_pmu_msrs)
>>  KVM_X86_PMU_OP_OPTIONAL(save_pmu_context)
>>  KVM_X86_PMU_OP_OPTIONAL(restore_pmu_context)
>>  KVM_X86_PMU_OP_OPTIONAL(incr_counter)
>> +KVM_X86_PMU_OP_OPTIONAL(set_overflow)
>>  
>>  #undef KVM_X86_PMU_OP
>>  #undef KVM_X86_PMU_OP_OPTIONAL
>> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
>> index 325f17673a00..78a7f0c5f3ba 100644
>> --- a/arch/x86/kvm/pmu.h
>> +++ b/arch/x86/kvm/pmu.h
>> @@ -45,6 +45,7 @@ struct kvm_pmu_ops {
>>  	void (*save_pmu_context)(struct kvm_vcpu *vcpu);
>>  	void (*restore_pmu_context)(struct kvm_vcpu *vcpu);
>>  	bool (*incr_counter)(struct kvm_pmc *pmc);
>> +	void (*set_overflow)(struct kvm_vcpu *vcpu);
>>  
>>  	const u64 EVENTSEL_EVENT;
>>  	const int MAX_NR_GP_COUNTERS;
>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
>> index 42af2404bdb9..2d46c911f0b7 100644
>> --- a/arch/x86/kvm/vmx/pmu_intel.c
>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
>> @@ -881,6 +881,10 @@ static void intel_restore_guest_pmu_context(struct kvm_vcpu *vcpu)
>>  	wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl_hw);
>>  }
>>  
>> +static void intel_set_overflow(struct kvm_vcpu *vcpu)
>> +{
>> +}
>> +
>>  struct kvm_pmu_ops intel_pmu_ops __initdata = {
>>  	.rdpmc_ecx_to_pmc = intel_rdpmc_ecx_to_pmc,
>>  	.msr_idx_to_pmc = intel_msr_idx_to_pmc,
>> @@ -897,6 +901,7 @@ struct kvm_pmu_ops intel_pmu_ops __initdata = {
>>  	.save_pmu_context = intel_save_guest_pmu_context,
>>  	.restore_pmu_context = intel_restore_guest_pmu_context,
>>  	.incr_counter = intel_incr_counter,
>> +	.set_overflow = intel_set_overflow,
>>  	.EVENTSEL_EVENT = ARCH_PERFMON_EVENTSEL_EVENT,
>>  	.MAX_NR_GP_COUNTERS = KVM_INTEL_PMC_MAX_GENERIC,
>>  	.MIN_NR_GP_COUNTERS = 1,

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 37/58] KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary
  2024-08-01  4:58 ` [RFC PATCH v3 37/58] KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary Mingwei Zhang
  2024-10-24 20:26   ` Chen, Zide
@ 2024-10-31  3:14   ` Mi, Dapeng
  1 sibling, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-10-31  3:14 UTC (permalink / raw)
  To: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 8/1/2024 12:58 PM, Mingwei Zhang wrote:
> From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>
> In PMU passthrough mode, use global_ctrl field in struct kvm_pmu as the
> cached value. This is convenient for KVM to set and get the value from the
> host side. In addition, load and save the value across VM enter/exit
> boundary in the following way:
>
>  - At VM exit, if the processor supports
>    VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL, read the guest
>    IA32_PERF_GLOBAL_CTRL from the GUEST_IA32_PERF_GLOBAL_CTRL VMCS field,
>    else read it from the VM-exit MSR-store array in the VMCS. The value is
>    then assigned to global_ctrl.
>
>  - At VM entry, if the processor supports
>    VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL, write global_ctrl to the
>    GUEST_IA32_PERF_GLOBAL_CTRL VMCS field, else write it to the VM-entry
>    MSR-load array in the VMCS, so the guest value is restored from
>    global_ctrl.
>
> Implement the above logic into two helper functions and invoke them around
> VM Enter/exit boundary.
>
> Co-developed-by: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  2 ++
>  arch/x86/kvm/vmx/vmx.c          | 49 ++++++++++++++++++++++++++++++++-
>  2 files changed, 50 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 93c17da8271d..7bf901a53543 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -601,6 +601,8 @@ struct kvm_pmu {
>  	u8 event_count;
>  
>  	bool passthrough;
> +	int global_ctrl_slot_in_autoload;
> +	int global_ctrl_slot_in_autostore;
>  };
>  
>  struct kvm_pmu_ops;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 41102658ed21..b126de6569c8 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4430,6 +4430,7 @@ static void vmx_set_perf_global_ctrl(struct vcpu_vmx *vmx)
>  			}
>  			m->val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
>  			m->val[i].value = 0;
> +			vcpu_to_pmu(&vmx->vcpu)->global_ctrl_slot_in_autoload = i;
>  		}
>  		/*
>  		 * Setup auto clear host PERF_GLOBAL_CTRL msr at vm exit.
> @@ -4457,6 +4458,7 @@ static void vmx_set_perf_global_ctrl(struct vcpu_vmx *vmx)
>  				vmcs_write32(VM_EXIT_MSR_STORE_COUNT, m->nr);
>  			}
>  			m->val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
> +			vcpu_to_pmu(&vmx->vcpu)->global_ctrl_slot_in_autostore = i;
>  		}
>  	} else {
>  		if (!(vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL)) {
> @@ -4467,6 +4469,7 @@ static void vmx_set_perf_global_ctrl(struct vcpu_vmx *vmx)
>  				m->val[i] = m->val[m->nr];
>  				vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
>  			}
> +			vcpu_to_pmu(&vmx->vcpu)->global_ctrl_slot_in_autoload = -ENOENT;
>  		}
>  		if (!(vmexit_ctrl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL)) {
>  			m = &vmx->msr_autoload.host;
> @@ -4485,6 +4488,7 @@ static void vmx_set_perf_global_ctrl(struct vcpu_vmx *vmx)
>  				m->val[i] = m->val[m->nr];
>  				vmcs_write32(VM_EXIT_MSR_STORE_COUNT, m->nr);
>  			}
> +			vcpu_to_pmu(&vmx->vcpu)->global_ctrl_slot_in_autostore = -ENOENT;
>  		}
>  	}
>  
> @@ -7272,7 +7276,7 @@ void vmx_cancel_injection(struct kvm_vcpu *vcpu)
>  	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
>  }
>  
> -static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
> +static void __atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
>  {
>  	int i, nr_msrs;
>  	struct perf_guest_switch_msr *msrs;
> @@ -7295,6 +7299,46 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
>  					msrs[i].host, false);
>  }
>  
> +static void save_perf_global_ctrl_in_passthrough_pmu(struct vcpu_vmx *vmx)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
> +	int i;
> +
> +	if (vm_exit_controls_get(vmx) & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL) {
> +		pmu->global_ctrl = vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL);
> +	} else {
> +		i = pmu->global_ctrl_slot_in_autostore;
> +		pmu->global_ctrl = vmx->msr_autostore.guest.val[i].value;
> +	}

When the GLOBAL_CTRL MSR is in interception mode, if the guest global_ctrl
contains some invalid bits, there may be an issue here. The saved global_ctrl
would be restored at VM-entry, and it would contain the invalid bits as well.

It looks like we need to save only the valid bits of guest global_ctrl here.
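
Something like the following could do it (the name of the reserved-bit mask is
an assumption here; it is whichever kvm_pmu field tracks the bits that are
invalid for the guest, e.g. global_ctrl_mask/global_ctrl_rsvd depending on the
tree):

	if (vm_exit_controls_get(vmx) & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL)
		/* Drop bits the guest is not allowed to set. */
		pmu->global_ctrl = vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL) &
				   ~pmu->global_ctrl_mask;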


> +}
> +
> +static void load_perf_global_ctrl_in_passthrough_pmu(struct vcpu_vmx *vmx)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
> +	u64 global_ctrl = pmu->global_ctrl;
> +	int i;
> +
> +	if (vm_entry_controls_get(vmx) & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) {
> +		vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, global_ctrl);
> +	} else {
> +		i = pmu->global_ctrl_slot_in_autoload;
> +		vmx->msr_autoload.guest.val[i].value = global_ctrl;
> +	}
> +}
> +
> +static void __atomic_switch_perf_msrs_in_passthrough_pmu(struct vcpu_vmx *vmx)
> +{
> +	load_perf_global_ctrl_in_passthrough_pmu(vmx);
> +}
> +
> +static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
> +{
> +	if (is_passthrough_pmu_enabled(&vmx->vcpu))
> +		__atomic_switch_perf_msrs_in_passthrough_pmu(vmx);
> +	else
> +		__atomic_switch_perf_msrs(vmx);
> +}
> +
>  static void vmx_update_hv_timer(struct kvm_vcpu *vcpu, bool force_immediate_exit)
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> @@ -7405,6 +7449,9 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
>  	vcpu->arch.cr2 = native_read_cr2();
>  	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
>  
> +	if (is_passthrough_pmu_enabled(vcpu))
> +		save_perf_global_ctrl_in_passthrough_pmu(vmx);
> +
>  	vmx->idt_vectoring_info = 0;
>  
>  	vmx_enable_fb_clear(vmx);

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 37/58] KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary
  2024-10-25  2:51     ` Mi, Dapeng
@ 2024-11-19  1:46       ` Sean Christopherson
  2024-11-19  5:20         ` Mi, Dapeng
  0 siblings, 1 reply; 183+ messages in thread
From: Sean Christopherson @ 2024-11-19  1:46 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Zide Chen, Mingwei Zhang, Paolo Bonzini, Xiong Zhang, Kan Liang,
	Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
	Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
	Samantha Alt, Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Fri, Oct 25, 2024, Dapeng Mi wrote:
> 
> On 10/25/2024 4:26 AM, Chen, Zide wrote:
> >
> > On 7/31/2024 9:58 PM, Mingwei Zhang wrote:
> >
> >> @@ -7295,6 +7299,46 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
> >>  					msrs[i].host, false);
> >>  }
> >>  
> >> +static void save_perf_global_ctrl_in_passthrough_pmu(struct vcpu_vmx *vmx)
> >> +{
> >> +	struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
> >> +	int i;
> >> +
> >> +	if (vm_exit_controls_get(vmx) & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL) {
> >> +		pmu->global_ctrl = vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL);
> > As commented in patch 26, compared with MSR auto save/store area
> > approach, the exec control way needs one relatively expensive VMCS read
> > on every VM exit.
> 
> Anyway, let us have an evaluation and let the data speak.

No, drop the unconditional VMREAD and VMWRITE, one way or another.  No benchmark
will notice ~50 extra cycles, but if we write poor code for every feature, those
50 cycles per feature add up.

Furthermore, checking to see if the CPU supports the load/save VMCS controls at
runtime is beyond ridiculous.  The mediated PMU requires ***VERSION 4***; if a CPU
supports PMU version 4 and doesn't support the VMCS controls, KVM should yell and
disable the passthrough PMU.  The amount of complexity added here to support a
CPU that should never exist is silly.

> >> +static void load_perf_global_ctrl_in_passthrough_pmu(struct vcpu_vmx *vmx)
> >> +{
> >> +	struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
> >> +	u64 global_ctrl = pmu->global_ctrl;
> >> +	int i;
> >> +
> >> +	if (vm_entry_controls_get(vmx) & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) {
> >> +		vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, global_ctrl);
> > ditto.
> >
> > We may optimize it by introducing a new flag pmu->global_ctrl_dirty and
> > update GUEST_IA32_PERF_GLOBAL_CTRL only when it's needed.  But this
> > makes the code even more complicated.

I haven't looked at surrounding code too much, but I guarantee there's _zero_
reason to eat a VMWRITE+VMREAD on every transition.  If, emphasis on *if*, KVM
accesses PERF_GLOBAL_CTRL frequently, e.g. on most exits, then add a VCPU_EXREG_XXX
and let KVM's caching infrastructure do the heavy lifting.  Don't reinvent the
wheel.  But first, convince the world that KVM actually accesses the MSR somewhat
frequently.

I'll do a more thorough review of this series in the coming weeks (days?).  I
singled out this one because I happened to stumble across the code when digging
into something else.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 37/58] KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary
  2024-11-19  1:46       ` Sean Christopherson
@ 2024-11-19  5:20         ` Mi, Dapeng
  2024-11-19 13:44           ` Sean Christopherson
  0 siblings, 1 reply; 183+ messages in thread
From: Mi, Dapeng @ 2024-11-19  5:20 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Zide Chen, Mingwei Zhang, Paolo Bonzini, Xiong Zhang, Kan Liang,
	Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
	Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
	Samantha Alt, Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users


On 11/19/2024 9:46 AM, Sean Christopherson wrote:
> On Fri, Oct 25, 2024, Dapeng Mi wrote:
>> On 10/25/2024 4:26 AM, Chen, Zide wrote:
>>> On 7/31/2024 9:58 PM, Mingwei Zhang wrote:
>>>
>>>> @@ -7295,6 +7299,46 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
>>>>  					msrs[i].host, false);
>>>>  }
>>>>  
>>>> +static void save_perf_global_ctrl_in_passthrough_pmu(struct vcpu_vmx *vmx)
>>>> +{
>>>> +	struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
>>>> +	int i;
>>>> +
>>>> +	if (vm_exit_controls_get(vmx) & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL) {
>>>> +		pmu->global_ctrl = vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL);
>>> As commented in patch 26, compared with MSR auto save/store area
>>> approach, the exec control way needs one relatively expensive VMCS read
>>> on every VM exit.
>> Anyway, let us have an evaluation and let the data speak.
> No, drop the unconditional VMREAD and VMWRITE, one way or another.  No benchmark
> will notice ~50 extra cycles, but if we write poor code for every feature, those
> 50 cycles per feature add up.
>
> Furthermore, checking to see if the CPU supports the load/save VMCS controls at
> runtime is beyond ridiculous.  The mediated PMU requires ***VERSION 4***; if a CPU
> supports PMU version 4 and doesn't support the VMCS controls, KVM should yell and
> disable the passthrough PMU.  The amount of complexity added here to support a
> CPU that should never exist is silly.
>
>>>> +static void load_perf_global_ctrl_in_passthrough_pmu(struct vcpu_vmx *vmx)
>>>> +{
>>>> +	struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
>>>> +	u64 global_ctrl = pmu->global_ctrl;
>>>> +	int i;
>>>> +
>>>> +	if (vm_entry_controls_get(vmx) & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) {
>>>> +		vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, global_ctrl);
>>> ditto.
>>>
>>> We may optimize it by introducing a new flag pmu->global_ctrl_dirty and
>>> update GUEST_IA32_PERF_GLOBAL_CTRL only when it's needed.  But this
>>> makes the code even more complicated.
> I haven't looked at surrounding code too much, but I guarantee there's _zero_
> reason to eat a VMWRITE+VMREAD on every transition.  If, emphasis on *if*, KVM
> accesses PERF_GLOBAL_CTRL frequently, e.g. on most exits, then add a VCPU_EXREG_XXX
> and let KVM's caching infrastructure do the heavy lifting.  Don't reinvent the
> wheel.  But first, convince the world that KVM actually accesses the MSR somewhat
> frequently.

Sean, let me give more background here.

VMX supports two ways to save/restore the PERF_GLOBAL_CTRL MSR: one is to
leverage VMCS_EXIT_CTRL/VMCS_ENTRY_CTRL to save/restore the guest
PERF_GLOBAL_CTRL value to/from the VMCS guest state; the other is to use the
VMCS MSR auto-load/restore bitmap to save/restore guest PERF_GLOBAL_CTRL.

Currently we prefer the former way to save/restore guest PERF_GLOBAL_CTRL
as long as HW supports it. There is a limitation on the MSR
auto-load/restore feature: when there are multiple MSRs, the MSRs are
saved/restored in the order of MSR index. As suggested by the SDM,
PERF_GLOBAL_CTRL should always be written last, after all other PMU MSRs
have been manipulated. So if there are PMU MSRs whose index is larger than
PERF_GLOBAL_CTRL (which would be true for archPerfmon v6+, where all PMU
MSRs in the new MSR range have a larger index than PERF_GLOBAL_CTRL), those
PMU MSRs would be restored after PERF_GLOBAL_CTRL, which would break the
rule. Of course, it's fine to save/restore PERF_GLOBAL_CTRL right now with
the VMCS MSR auto-load/restore bitmap feature since only one PMU MSR,
PERF_GLOBAL_CTRL, is saved/restored in the current implementation.

The PERF_GLOBAL_CTRL MSR could be frequently accessed by the perf/PMU
driver, e.g. on each task switch, so in the mediated vPMU proposal the
PERF_GLOBAL_CTRL MSR is configured as pass-through to reduce the performance
impact if the guest owns all of the PMU HW resources. But if the guest only
owns part of the PMU HW resources, PERF_GLOBAL_CTRL is set to interception
mode.

I suppose KVM doesn't need to access PERF_GLOBAL_CTRL in passthrough mode.
This piece of code is intended just for PERF_GLOBAL_CTRL interception mode,
but thinking about it twice, it looks unnecessary to save/restore
PERF_GLOBAL_CTRL via the VMCS since KVM would always maintain the guest
PERF_GLOBAL_CTRL value. Anyway, this part of the code can be optimized.


>
> I'll do a more thorough review of this series in the coming weeks (days?).  I
> singled out this one because I happened to stumble across the code when digging
> into something else.
Thanks, look forward to your more comments.



^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 37/58] KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary
  2024-11-19  5:20         ` Mi, Dapeng
@ 2024-11-19 13:44           ` Sean Christopherson
  2024-11-20  2:08             ` Mi, Dapeng
  0 siblings, 1 reply; 183+ messages in thread
From: Sean Christopherson @ 2024-11-19 13:44 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Zide Chen, Mingwei Zhang, Paolo Bonzini, Xiong Zhang, Kan Liang,
	Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
	Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
	Samantha Alt, Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Tue, Nov 19, 2024, Dapeng Mi wrote:
> 
> On 11/19/2024 9:46 AM, Sean Christopherson wrote:
> > On Fri, Oct 25, 2024, Dapeng Mi wrote:
> >> On 10/25/2024 4:26 AM, Chen, Zide wrote:
> >>> On 7/31/2024 9:58 PM, Mingwei Zhang wrote:
> >>>
> >>>> @@ -7295,6 +7299,46 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
> >>>>  					msrs[i].host, false);
> >>>>  }
> >>>>  
> >>>> +static void save_perf_global_ctrl_in_passthrough_pmu(struct vcpu_vmx *vmx)
> >>>> +{
> >>>> +	struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
> >>>> +	int i;
> >>>> +
> >>>> +	if (vm_exit_controls_get(vmx) & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL) {
> >>>> +		pmu->global_ctrl = vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL);
> >>> As commented in patch 26, compared with MSR auto save/store area
> >>> approach, the exec control way needs one relatively expensive VMCS read
> >>> on every VM exit.
> >> Anyway, let us have an evaluation and let the data speak.
> > No, drop the unconditional VMREAD and VMWRITE, one way or another.  No benchmark
> > will notice ~50 extra cycles, but if we write poor code for every feature, those
> > 50 cycles per feature add up.
> >
> > Furthermore, checking to see if the CPU supports the load/save VMCS controls at
> > runtime is beyond ridiculous.  The mediated PMU requires ***VERSION 4***; if a CPU
> > supports PMU version 4 and doesn't support the VMCS controls, KVM should yell and
> > disable the passthrough PMU.  The amount of complexity added here to support a
> > CPU that should never exist is silly.
> >
> >>>> +static void load_perf_global_ctrl_in_passthrough_pmu(struct vcpu_vmx *vmx)
> >>>> +{
> >>>> +	struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
> >>>> +	u64 global_ctrl = pmu->global_ctrl;
> >>>> +	int i;
> >>>> +
> >>>> +	if (vm_entry_controls_get(vmx) & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) {
> >>>> +		vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, global_ctrl);
> >>> ditto.
> >>>
> >>> We may optimize it by introducing a new flag pmu->global_ctrl_dirty and
> >>> update GUEST_IA32_PERF_GLOBAL_CTRL only when it's needed.  But this
> >>> makes the code even more complicated.
> > I haven't looked at surrounding code too much, but I guarantee there's _zero_
> > reason to eat a VMWRITE+VMREAD on every transition.  If, emphasis on *if*, KVM
> > accesses PERF_GLOBAL_CTRL frequently, e.g. on most exits, then add a VCPU_EXREG_XXX
> > and let KVM's caching infrastructure do the heavy lifting.  Don't reinvent the
> > wheel.  But first, convince the world that KVM actually accesses the MSR somewhat
> > frequently.
> 
> Sean, let me give more background here.
> 
> VMX supports two ways to save/restore PERF_GLOBAL_CTRL MSR, one is to
> leverage VMCS_EXIT_CTRL/VMCS_ENTRY_CTRL to save/restore guest
> PERF_GLOBAL_CTRL value to/from VMCS guest state. The other is to use the
> VMCS MSR auto-load/restore bitmap to save/restore guest PERF_GLOBAL_CTRL. 

I know.

> Currently we prefer to use the former way to save/restore guest
> PERF_GLOBAL_CTRL as long as HW supports it. There is a limitation on the
> MSR auto-load/restore feature. When there are multiple MSRs, the MSRs are
> saved/restored in the order of MSR index. As the suggestion of SDM,
> PERF_GLOBAL_CTRL should always be written at last after all other PMU MSRs
> are manipulated. So if there are some PMU MSRs whose index is larger than
> PERF_GLOBAL_CTRL (It would be true in archPerfmon v6+, all PMU MSRs in the
> new MSR range have larger index than PERF_GLOBAL_CTRL),

No, the entries in the load/store lists are processed in sequential order as they
appear in the lists.  Ordering them based on their MSR index would be insane and
would make the lists useless.

  VM entries may load MSRs from the VM-entry MSR-load area (see Section 25.8.2).
  Specifically each entry in that area (up to the number specified in the VM-entry
  MSR-load count) is processed in order by loading the MSR indexed by bits 31:0
  with the contents of bits 127:64 as they would be written by WRMSR.1

> these PMU MSRs would be restored after PERF_GLOBAL_CTRL. That would break the
> rule. Of course, it's good to save/restore PERF_GLOBAL_CTRL right now with
> the VMCS MSR auto-load/restore bitmap feature since only one PMU MSR
> PERF_GLOBAL_CTRL is saved/restored in current implementation.

No, it's never good to use the load/store lists.  They're slow as mud, because
they're essentially just wrappers to the standard WRMSR/RDMSR ucode.  Whereas
dedicated VMCS fields have dedicated, streamlined ucode to make loads and stores
as fast as possible.

I haven't measured PERF_GLOBAL_CTRL specifically, at least not in recent memory,
but generally speaking using a load/store entry is 100+ cycles, whereas using a
dedicated VMCS field is <20 cycles (often far less).

So what I am saying is that the mediated PMU should _require_ support for loading
and saving PERF_GLOBAL_CTRL via dedicated fields, and WARN if a CPU with a v4+
PMU doesn't support said fields.  E.g.

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a4b2b0b69a68..cab8305e7bf0 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8620,6 +8620,15 @@ __init int vmx_hardware_setup(void)
                enable_sgx = false;
 #endif
 
+       /*
+        * All CPUs that support a mediated PMU are expected to support loading
+        * and saving PERF_GLOBAL_CTRL via dedicated VMCS fields.
+        */
+       if (enable_passthrough_pmu &&
+           (WARN_ON_ONCE(!cpu_has_load_perf_global_ctrl() ||
+                         !cpu_has_save_perf_global_ctrl())))
+               enable_passthrough_pmu = false;
+
        /*
         * set_apic_access_page_addr() is used to reload apic access
         * page upon invalidation.  No need to do anything if not

That will provide better, more consistent performance, and will eliminate a big
pile of non-trivial code.

> PERF_GLOBAL_CTRL MSR could be frequently accessed by perf/pmu driver, e.g.
> on each task switch, so PERF_GLOBAL_CTRL MSR is configured to passthrough
> to reduce the performance impact in mediated vPMU proposal if guest own all
> PMU HW resource. But if guest only owns part of PMU HW resource,
> PERF_GLOBAL_CTRL would be set to interception mode.

Again, I know.  What I am saying is that propagating PERF_GLOBAL_CTRL to/from the
VMCS on every entry and exit is extremely wasteful and completely unnecessary.

> I suppose KVM doesn't need access PERF_GLOBAL_CTRL in passthrough mode.
> This piece of code is intently just for PERF_GLOBAL_CTRL interception mode,

No, it's even more useless if PERF_GLOBAL_CTRL is intercepted, because in that
case the _only_ time KVM needs move the guest's value to/from the VMCS is when
the guest (or host userspace) is explicitly accessing the field.

> but think twice it looks unnecessary to save/restore PERF_GLOBAL_CTRL via
> VMCS as KVM would always maintain the guest PERF_GLOBAL_CTRL value? Anyway,
> this part of code can be optimized.

^ permalink raw reply related	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (58 preceding siblings ...)
  2024-09-11 10:45 ` [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Ma, Yongwei
@ 2024-11-19 14:00 ` Sean Christopherson
  2024-11-20  2:31   ` Mi, Dapeng
  2024-11-20 11:55 ` Mi, Dapeng
  60 siblings, 1 reply; 183+ messages in thread
From: Sean Christopherson @ 2024-11-19 14:00 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Dapeng Mi, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Thu, Aug 01, 2024, Mingwei Zhang wrote:
> This series contains perf interface improvements to address Peter's
> comments. In addition, fix several bugs for v2. This version is based on
> 6.10-rc4. The main changes are:
> 
>  - Use atomics to replace refcounts to track the nr_mediated_pmu_vms.
>  - Use the generic ctx_sched_{in,out}() to switch PMU resources when a
>    guest is entering and exiting.
>  - Add a new EVENT_GUEST flag to indicate the context switch case of
>    entering and exiting a guest. Updates the generic ctx_sched_{in,out}
>    to specifically handle this case, especially for time management.
>  - Switch PMI vector in perf_guest_{enter,exit}() as well. Add a new
>    driver-specific interface to facilitate the switch.
>  - Remove the PMU_FL_PASSTHROUGH flag and uses the PASSTHROUGH pmu
>    capability instead.
>  - Adjust commit sequence in PERF and KVM PMI interrupt functions.
>  - Use pmc_is_globally_enabled() check in emulated counter increment [1]
>  - Fix PMU context switch [2] by using rdpmc() instead of rdmsr().
> 
> AMD fixes:
>  - Add support for legacy PMU MSRs in MSR interception.
>  - Make MSR usage consistent if PerfMonV2 is available.
>  - Avoid enabling passthrough vPMU when local APIC is not in kernel.
>  - increment counters in emulation mode.
> 
> This series is organized in the following order:
> 
> Patches 1-3:
>  - Immediate bug fixes that can be applied to Linux tip.
>  - Note: will put immediate fixes ahead in the future. These patches
>    might be duplicated with existing posts.
>  - Note: patches 1-2 are needed for AMD when host kernel enables
>    preemption. Otherwise, guest will suffer from softlockup.
> 
> Patches 4-17:
>  - Perf side changes, infra changes in core pmu with API for KVM.
> 
> Patches 18-48:
>  - KVM mediated passthrough vPMU framework + Intel CPU implementation.
> 
> Patches 49-58:
>  - AMD CPU implementation for vPMU.

Please rename everything in KVM to drop "passthrough" and simply use "mediated"
for the overall concept.  This is not a passthrough setup by any stretch of the
word.  I realize it's a ton of renaming, but calling this "passthrough" is very
misleading and actively harmful for unsuspecting readers.

For helpers and/or comments that deal with intercepting (or not) MSRs, use
"intercept" and appropriate variations.  E.g. intel_pmu_update_msr_intercepts().

And for RDPMC, maybe kvm_rdpmc_in_guest() to follow kvm_{hlt,mwait,pause,cstate}_in_guest()?
I don't love the terminology, but there's a lot of value in being consistent
throughout KVM.

I am not willing to budge on this, at all.

I'm ok with the perf side of things using "passthrough" if "mediated" feels weird
in that context and we can't come up with a better option, but for the KVM side,
"passthrough" is simply wrong.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 18/58] KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter
  2024-08-01  4:58 ` [RFC PATCH v3 18/58] KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter Mingwei Zhang
@ 2024-11-19 14:30   ` Sean Christopherson
  2024-11-20  3:21     ` Mi, Dapeng
  0 siblings, 1 reply; 183+ messages in thread
From: Sean Christopherson @ 2024-11-19 14:30 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Dapeng Mi, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

As per my feedback in the initial RFC[*]:

 2. The module param absolutely must not be exposed to userspace until all patches
    are in place.  The easiest way to do that without creating dependency hell is
    to simply not create the module param.

[*] https://lore.kernel.org/all/ZhhQBHQ6V7Zcb8Ve@google.com

On Thu, Aug 01, 2024, Mingwei Zhang wrote:
> Introduce enable_passthrough_pmu as a read-only KVM kernel module parameter.
> This variable is true only when all of the following conditions are satisfied:
>  - it is set to true when the module is loaded.
>  - enable_pmu is true.
>  - KVM is running on an Intel CPU.
>  - the CPU supports PerfMon v4.
>  - the host PMU supports passthrough mode.
> 
> The value is always read-only because the passthrough PMU currently does not
> support features like LBR and PEBS, while the emulated PMU does. This will end
> up with two different values for kvm_cap.supported_perf_cap, which is
> initialized at module load time. Maintaining two different perf
> capabilities will add complexity. Further, there is not enough motivation
> to support running two types of PMU implementations at the same time,
> although it is possible/feasible in reality.
> 
> Finally, always propagate enable_passthrough_pmu and perf_capabilities into
> kvm->arch for each KVM instance.
> 
> Co-developed-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/pmu.h              | 14 ++++++++++++++
>  arch/x86/kvm/vmx/vmx.c          |  7 +++++--
>  arch/x86/kvm/x86.c              |  8 ++++++++
>  arch/x86/kvm/x86.h              |  1 +
>  5 files changed, 29 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index f8ca74e7678f..a15c783f20b9 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1406,6 +1406,7 @@ struct kvm_arch {
>  
>  	bool bus_lock_detection_enabled;
>  	bool enable_pmu;
> +	bool enable_passthrough_pmu;

Again, as I suggested/requested in the initial RFC[*], drop the per-VM flag as well
as kvm_pmu.passthrough.  There is zero reason to cache the module param.  KVM
should always query kvm->arch.enable_pmu prior to checking if the mediated PMU
is enabled, so I doubt we even need a helper to check both.

[*] https://lore.kernel.org/all/ZhhOEDAl6k-NzOkM@google.com

>  
>  	u32 notify_window;
>  	u32 notify_vmexit_flags;
> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index 4d52b0b539ba..cf93be5e7359 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -208,6 +208,20 @@ static inline void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops)
>  			enable_pmu = false;
>  	}
>  
> +	/* Pass-through vPMU is only supported in Intel CPUs. */
> +	if (!is_intel)
> +		enable_passthrough_pmu = false;
> +
> +	/*
> +	 * Pass-through vPMU requires at least PerfMon version 4 because the
> +	 * implementation requires the usage of MSR_CORE_PERF_GLOBAL_STATUS_SET
> +	 * for counter emulation as well as PMU context switch.  In addition, it
> +	 * requires host PMU support on passthrough mode. Disable pass-through
> +	 * vPMU if any condition fails.
> +	 */
> +	if (!enable_pmu || kvm_pmu_cap.version < 4 || !kvm_pmu_cap.passthrough)

As is quite obvious by the end of the series, the v4 requirement is specific to
Intel.

	if (!enable_pmu || !kvm_pmu_cap.passthrough ||
	    (is_intel && kvm_pmu_cap.version < 4) ||
	    (is_amd && kvm_pmu_cap.version < 2))
		enable_passthrough_pmu = false;

Furthermore, there is zero reason to explicitly and manually check the vendor,
kvm_init_pmu_capability() takes kvm_pmu_ops.  Adding a callback is somewhat
undesirable as it would lead to duplicate code, but we can still provide separation
of concerns by adding const variables to kvm_pmu_ops, a la MAX_NR_GP_COUNTERS.

E.g.

	if (enable_pmu) {
		perf_get_x86_pmu_capability(&kvm_pmu_cap);

		/*
		 * WARN if perf did NOT disable hardware PMU if the number of
		 * architecturally required GP counters aren't present, i.e. if
		 * there are a non-zero number of counters, but fewer than what
		 * is architecturally required.
		 */
		if (!kvm_pmu_cap.num_counters_gp ||
		    WARN_ON_ONCE(kvm_pmu_cap.num_counters_gp < min_nr_gp_ctrs))
			enable_pmu = false;
		else if (pmu_ops->MIN_PMU_VERSION > kvm_pmu_cap.version)
			enable_pmu = false;
	}

	if (!enable_pmu || !kvm_pmu_cap.passthrough ||
	    pmu_ops->MIN_MEDIATED_PMU_VERSION > kvm_pmu_cap.version)
		enable_mediated_pmu = false;

> +		enable_passthrough_pmu = false;
> +
>  	if (!enable_pmu) {
>  		memset(&kvm_pmu_cap, 0, sizeof(kvm_pmu_cap));
>  		return;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index ad465881b043..2ad122995f11 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -146,6 +146,8 @@ module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
>  extern bool __read_mostly allow_smaller_maxphyaddr;
>  module_param(allow_smaller_maxphyaddr, bool, S_IRUGO);
>  
> +module_param(enable_passthrough_pmu, bool, 0444);

Hmm, we either need to put this param in kvm.ko, or move enable_pmu to vendor
modules (or duplicate it there if we need to for backwards compatibility?).

There are advantages to putting params in vendor modules, when it's safe to do so,
e.g. it allows toggling the param when (re)loading a vendor module, so I think I'm
supportive of having the param live in vendor code.  I just don't want to split
the two PMU knobs.

>  #define KVM_VM_CR0_ALWAYS_OFF (X86_CR0_NW | X86_CR0_CD)
>  #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST X86_CR0_NE
>  #define KVM_VM_CR0_ALWAYS_ON				\
> @@ -7924,7 +7926,8 @@ static __init u64 vmx_get_perf_capabilities(void)
>  	if (boot_cpu_has(X86_FEATURE_PDCM))
>  		rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_perf_cap);
>  
> -	if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR)) {
> +	if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR) &&
> +	    !enable_passthrough_pmu) {
>  		x86_perf_get_lbr(&vmx_lbr_caps);
>  
>  		/*
> @@ -7938,7 +7941,7 @@ static __init u64 vmx_get_perf_capabilities(void)
>  			perf_cap |= host_perf_cap & PMU_CAP_LBR_FMT;
>  	}
>  
> -	if (vmx_pebs_supported()) {
> +	if (vmx_pebs_supported() && !enable_passthrough_pmu) {

Checking enable_mediated_pmu belongs in vmx_pebs_supported(), not in here,
otherwise KVM will incorrectly advertise support to userspace:

	if (vmx_pebs_supported()) {
		kvm_cpu_cap_check_and_set(X86_FEATURE_DS);
		kvm_cpu_cap_check_and_set(X86_FEATURE_DTES64);
	}

>  		perf_cap |= host_perf_cap & PERF_CAP_PEBS_MASK;
>  		/*
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index f1d589c07068..0c40f551130e 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -187,6 +187,10 @@ bool __read_mostly enable_pmu = true;
>  EXPORT_SYMBOL_GPL(enable_pmu);
>  module_param(enable_pmu, bool, 0444);
>  
> +/* Enable/disable mediated passthrough PMU virtualization */
> +bool __read_mostly enable_passthrough_pmu;
> +EXPORT_SYMBOL_GPL(enable_passthrough_pmu);
> +
>  bool __read_mostly eager_page_split = true;
>  module_param(eager_page_split, bool, 0644);
>  
> @@ -6682,6 +6686,9 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>  		mutex_lock(&kvm->lock);
>  		if (!kvm->created_vcpus) {
>  			kvm->arch.enable_pmu = !(cap->args[0] & KVM_PMU_CAP_DISABLE);
> +			/* Disable passthrough PMU if enable_pmu is false. */
> +			if (!kvm->arch.enable_pmu)
> +				kvm->arch.enable_passthrough_pmu = false;

And this code obviously goes away if the per-VM snapshot is removed.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 19/58] KVM: x86/pmu: Plumb through pass-through PMU to vcpu for Intel CPUs
  2024-08-01  4:58 ` [RFC PATCH v3 19/58] KVM: x86/pmu: Plumb through pass-through PMU to vcpu for Intel CPUs Mingwei Zhang
@ 2024-11-19 14:54   ` Sean Christopherson
  2024-11-20  3:47     ` Mi, Dapeng
  0 siblings, 1 reply; 183+ messages in thread
From: Sean Christopherson @ 2024-11-19 14:54 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Dapeng Mi, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Thu, Aug 01, 2024, Mingwei Zhang wrote:
> Plumb the pass-through PMU setting from kvm->arch into kvm_pmu for each
> vcpu created. Note that enabling the PMU is decided by the VMM when it sets
> the CPUID bits exposed to the guest VM. So plumb through the enabling for
> each pmu in intel_pmu_refresh().

Why?  As with the per-VM snapshot, I see zero reason for this to exist, it's
simply:

  kvm->arch.enable_pmu && enable_mediated_pmu && pmu->version;

And in literally every correct usage of pmu->passthrough, kvm->arch.enable_pmu
and pmu->version have been checked (though implicitly), i.e. KVM can check
enable_mediated_pmu and nothing else.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 20/58] KVM: x86/pmu: Always set global enable bits in passthrough mode
  2024-08-01  4:58 ` [RFC PATCH v3 20/58] KVM: x86/pmu: Always set global enable bits in passthrough mode Mingwei Zhang
@ 2024-11-19 15:37   ` Sean Christopherson
  2024-11-20  5:19     ` Mi, Dapeng
  0 siblings, 1 reply; 183+ messages in thread
From: Sean Christopherson @ 2024-11-19 15:37 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Dapeng Mi, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Thu, Aug 01, 2024, Mingwei Zhang wrote:
> From: Sandipan Das <sandipan.das@amd.com>
> 
> Currently, the global control bits for a vcpu are restored to the reset
> state only if the guest PMU version is at least 2. This works for the
> emulated PMU as the MSRs are intercepted and backing events are created
> for and managed by the host PMU [1].
> 
> If a guest with a lower PMU version is run with the passthrough PMU, the
> counters no longer work because the global enable bits are cleared. Hence,
> set the global enable bits to their reset state if the passthrough PMU is used.
> 
> A passthrough-capable host may not necessarily support PMU version 2 and
> it can choose to restore or save the global control state from struct
> kvm_pmu in the PMU context save and restore helpers depending on the
> availability of the global control register.
> 
> [1] 7b46b733bdb4 ("KVM: x86/pmu: Set enable bits for GP counters in PERF_GLOBAL_CTRL at "RESET"");
> 
> Reported-by: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Sandipan Das <sandipan.das@amd.com>
> [removed the fixes tag]
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/pmu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 5768ea2935e9..e656f72fdace 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -787,7 +787,7 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
>  	 * in the global controls).  Emulate that behavior when refreshing the
>  	 * PMU so that userspace doesn't need to manually set PERF_GLOBAL_CTRL.
>  	 */
> -	if (kvm_pmu_has_perf_global_ctrl(pmu) && pmu->nr_arch_gp_counters)
> +	if ((pmu->passthrough || kvm_pmu_has_perf_global_ctrl(pmu)) && pmu->nr_arch_gp_counters)
>  		pmu->global_ctrl = GENMASK_ULL(pmu->nr_arch_gp_counters - 1, 0);

This is wrong and confusing.  From the guest's perspective, and therefore from
host userspace's perspective, PERF_GLOBAL_CTRL does not exist.  Therefore, the
value that is tracked for the guest must be '0'.

I see that intel_passthrough_pmu_msrs() and amd_passthrough_pmu_msrs() intercept
accesses to PERF_GLOBAL_CTRL if "pmu->version > 1" (which, by the by, needs to be
kvm_pmu_has_perf_global_ctrl()), so there's no weirdness with the guest being able
to access MSRs that shouldn't exist.

But KVM shouldn't stuff pmu->global_ctrl, and doing so is a symptom of another
flaw.  Unless I'm missing something, KVM stuffs pmu->global_ctrl so that the
correct value is loaded on VM-Enter, but loading and saving PERF_GLOBAL_CTRL on
entry/exit is unnecessary and confusing, as is loading the associated MSRs when
restoring (loading) the guest context.

For PERF_GLOBAL_CTRL on Intel, KVM needs to ensure all GP counters are enabled in
VMCS.GUEST_IA32_PERF_GLOBAL_CTRL, but that's a "set once and forget" operation,
not something that needs to be done on every entry and exit.  Of course, loading
and saving PERF_GLOBAL_CTRL on every entry/exit is unnecessary for other reasons,
but that's largely orthogonal.
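
For example, a set-once approach on Intel could look roughly like this (a
sketch only; placing it in the vendor PMU refresh path, and relying on
VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL being enabled, are assumptions):

	/*
	 * Mediated PMU with a v1 guest PMU: the guest has no PERF_GLOBAL_CTRL
	 * of its own, so enable all GP counters in the VMCS once at refresh
	 * time instead of rewriting the field on every VM-Enter.
	 */
	if (pmu->nr_arch_gp_counters)
		vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL,
			     GENMASK_ULL(pmu->nr_arch_gp_counters - 1, 0));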

On AMD, amd_restore_pmu_context()[*] needs to enable a maximal value for
PERF_GLOBAL_CTRL, but I don't think there's any need to load the other MSRs,
and the maximal value should come from the above logic, not pmu->global_ctrl.

[*] Side topic, in case I forget later, that API should be "load", not "restore".
    There is no assumption or guarantee that KVM is exactly restoring anything,
    e.g. if PERF_GLOBAL_CTRL doesn't exist in the guest PMU and on the first load.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 22/58] KVM: x86/pmu: Add host_perf_cap and initialize it in kvm_x86_vendor_init()
  2024-08-01  4:58 ` [RFC PATCH v3 22/58] KVM: x86/pmu: Add host_perf_cap and initialize it in kvm_x86_vendor_init() Mingwei Zhang
@ 2024-11-19 15:43   ` Sean Christopherson
  2024-11-20  5:21     ` Mi, Dapeng
  0 siblings, 1 reply; 183+ messages in thread
From: Sean Christopherson @ 2024-11-19 15:43 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Dapeng Mi, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Thu, Aug 01, 2024, Mingwei Zhang wrote:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0c40f551130e..6db4dc496d2b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -239,6 +239,9 @@ EXPORT_SYMBOL_GPL(host_xss);
>  u64 __read_mostly host_arch_capabilities;
>  EXPORT_SYMBOL_GPL(host_arch_capabilities);
>  
> +u64 __read_mostly host_perf_cap;
> +EXPORT_SYMBOL_GPL(host_perf_cap);

In case you don't get a conflict on rebase, this should go in "struct kvm_host_values"
as "perf_capabilities".

>  const struct _kvm_stats_desc kvm_vm_stats_desc[] = {
>  	KVM_GENERIC_VM_STATS(),
>  	STATS_DESC_COUNTER(VM, mmu_shadow_zapped),
> @@ -9793,6 +9796,9 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
>  	if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
>  		rdmsrl(MSR_IA32_ARCH_CAPABILITIES, host_arch_capabilities);
>  
> +	if (boot_cpu_has(X86_FEATURE_PDCM))
> +		rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_perf_cap);
> +
>  	r = ops->hardware_setup();
>  	if (r != 0)
>  		goto out_mmu_exit;
> -- 
> 2.46.0.rc1.232.g9752f9e123-goog
> 

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 23/58] KVM: x86/pmu: Allow RDPMC pass through when all counters exposed to guest
  2024-08-01  4:58 ` [RFC PATCH v3 23/58] KVM: x86/pmu: Allow RDPMC pass through when all counters exposed to guest Mingwei Zhang
@ 2024-11-19 16:32   ` Sean Christopherson
  2024-11-20  5:31     ` Mi, Dapeng
  0 siblings, 1 reply; 183+ messages in thread
From: Sean Christopherson @ 2024-11-19 16:32 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Dapeng Mi, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Thu, Aug 01, 2024, Mingwei Zhang wrote:
> Clear RDPMC_EXITING in the VMCS when all counters on the host side are
> exposed to the guest VM. This improves performance for the passthrough PMU.
> However, when the guest does not get all counters, intercept RDPMC to
> prevent access to unexposed counters. Make the decision in
> vmx_vcpu_after_set_cpuid() when the guest enables the PMU and the
> passthrough PMU is enabled.
> 
> Co-developed-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
> ---
>  arch/x86/kvm/pmu.c     | 16 ++++++++++++++++
>  arch/x86/kvm/pmu.h     |  1 +
>  arch/x86/kvm/vmx/vmx.c |  5 +++++
>  3 files changed, 22 insertions(+)
> 
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index e656f72fdace..19104e16a986 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -96,6 +96,22 @@ void kvm_pmu_ops_update(const struct kvm_pmu_ops *pmu_ops)
>  #undef __KVM_X86_PMU_OP
>  }
>  
> +bool kvm_pmu_check_rdpmc_passthrough(struct kvm_vcpu *vcpu)

As suggested earlier, kvm_rdpmc_in_guest().

> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +
> +	if (is_passthrough_pmu_enabled(vcpu) &&
> +	    !enable_vmware_backdoor &&

Please add a comment about the VMware backdoor, I doubt most folks know about
VMware's tweaks to RDPMC behavior.  It's somewhat obvious from the code and
comment in check_rdpmc(), but I think it's worth calling out here too.
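
E.g. something along these lines (exact wording is just a suggestion):

	/*
	 * Intercept RDPMC if the VMware backdoor is enabled, as KVM emulates
	 * VMware's pseudo-PMCs (0x10000 - 0x10002) and relaxes the CPL/CR4.PCE
	 * checks on RDPMC, i.e. RDPMC can't simply be passed through.
	 */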

> +	    pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp &&
> +	    pmu->nr_arch_fixed_counters == kvm_pmu_cap.num_counters_fixed &&
> +	    pmu->counter_bitmask[KVM_PMC_GP] == (((u64)1 << kvm_pmu_cap.bit_width_gp) - 1) &&
> +	    pmu->counter_bitmask[KVM_PMC_FIXED] == (((u64)1 << kvm_pmu_cap.bit_width_fixed)  - 1))

BIT_ULL?  GENMASK_ULL?

> +		return true;
> +
> +	return false;

Do this:


	return <true>;

not:

	if (<true>)
		return true;

	return false;

Short-circuiting on certain cases is fine, and I would probably vote for that so
it's easier to add comments, but that's obviously not what's done here.  E.g. either

	if (!enable_mediated_pmu)
		return false;

	/* comment goes here */
	if (enable_vmware_backdoor)
		return false;

	return <counters checks>;

or

	return <massive combined check>;

> +}
> +EXPORT_SYMBOL_GPL(kvm_pmu_check_rdpmc_passthrough);

Maybe just make this an inline in a header?  enable_vmware_backdoor is exported,
and presumably enable_mediated_pmu will be too.
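
E.g. a rough sketch of the header version, assuming enable_mediated_pmu does end
up exported (counter checks as above, converted to GENMASK_ULL per the earlier
comment):

	static inline bool kvm_rdpmc_in_guest(struct kvm_vcpu *vcpu)
	{
		struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);

		if (!enable_mediated_pmu || enable_vmware_backdoor)
			return false;

		return pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp &&
		       pmu->nr_arch_fixed_counters == kvm_pmu_cap.num_counters_fixed &&
		       pmu->counter_bitmask[KVM_PMC_GP] == GENMASK_ULL(kvm_pmu_cap.bit_width_gp - 1, 0) &&
		       pmu->counter_bitmask[KVM_PMC_FIXED] == GENMASK_ULL(kvm_pmu_cap.bit_width_fixed - 1, 0);
	}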

> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 4d60a8cf2dd1..339742350b7a 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7911,6 +7911,11 @@ void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>  		vmx->msr_ia32_feature_control_valid_bits &=
>  			~FEAT_CTL_SGX_LC_ENABLED;
>  
> +	if (kvm_pmu_check_rdpmc_passthrough(&vmx->vcpu))

No need to follow vmx->vcpu, @vcpu is readily available.

> +		exec_controls_clearbit(vmx, CPU_BASED_RDPMC_EXITING);
> +	else
> +		exec_controls_setbit(vmx, CPU_BASED_RDPMC_EXITING);

I wonder if it makes sense to add a helper to change a bit.  IIRC, the only reason
I didn't add one along with the set/clear helpers was because there weren't many
users and I couldn't think of good alternative to "set".

I still don't have a good name, but I think we're reaching the point where it's
worth forcing the issue to avoid common goofs, e.g. handling only the "clear"
case and not the "set" case.

Maybe changebit?  E.g.

static __always_inline void lname##_controls_changebit(struct vcpu_vmx *vmx, u##bits val,	\
						       bool set)				\
{												\
	if (set)										\
		lname##_controls_setbit(vmx, val);						\
	else											\
		lname##_controls_clearbit(vmx, val);						\
}


and then vmx_refresh_apicv_exec_ctrl() can be:

	secondary_exec_controls_changebit(vmx,
					  SECONDARY_EXEC_APIC_REGISTER_VIRT |
					  SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY,
					  kvm_vcpu_apicv_active(vcpu));
	tertiary_exec_controls_changebit(vmx, TERTIARY_EXEC_IPI_VIRT,
					 kvm_vcpu_apicv_active(vcpu) && enable_ipiv);

and this can be:

	exec_controls_changebit(vmx, CPU_BASED_RDPMC_EXITING,
				!kvm_rdpmc_in_guest(vcpu));

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 24/58] KVM: x86/pmu: Introduce macro PMU_CAP_PERF_METRICS
  2024-08-01  4:58 ` [RFC PATCH v3 24/58] KVM: x86/pmu: Introduce macro PMU_CAP_PERF_METRICS Mingwei Zhang
@ 2024-11-19 17:03   ` Sean Christopherson
  2024-11-20  5:44     ` Mi, Dapeng
  0 siblings, 1 reply; 183+ messages in thread
From: Sean Christopherson @ 2024-11-19 17:03 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Dapeng Mi, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Thu, Aug 01, 2024, Mingwei Zhang wrote:
> From: Dapeng Mi <dapeng1.mi@linux.intel.com>
> 
> Define macro PMU_CAP_PERF_METRICS to represent bit[15] of
> MSR_IA32_PERF_CAPABILITIES MSR. This bit is used to represent whether
> perf metrics feature is enabled.
> 
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/vmx/capabilities.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
> index 41a4533f9989..d8317552b634 100644
> --- a/arch/x86/kvm/vmx/capabilities.h
> +++ b/arch/x86/kvm/vmx/capabilities.h
> @@ -22,6 +22,7 @@ extern int __read_mostly pt_mode;
>  #define PT_MODE_HOST_GUEST	1
>  
>  #define PMU_CAP_FW_WRITES	(1ULL << 13)
> +#define PMU_CAP_PERF_METRICS	BIT_ULL(15)

BIT() should suffice.  The 1ULL used for FW_WRITES is unnecessary.  Speaking of
which, can you update the other #defines while you're at it?  The mix of styles
annoys me :-)

#define PMU_CAP_FW_WRITES	BIT(13)
#define PMU_CAP_PERF_METRICS	BIT(15)
#define PMU_CAP_LBR_FMT		GENMASK(5, 0)

>  #define PMU_CAP_LBR_FMT		0x3f
>  
>  struct nested_vmx_msrs {
> -- 
> 2.46.0.rc1.232.g9752f9e123-goog
> 

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 25/58] KVM: x86/pmu: Introduce PMU operator to check if rdpmc passthrough allowed
  2024-08-01  4:58 ` [RFC PATCH v3 25/58] KVM: x86/pmu: Introduce PMU operator to check if rdpmc passthrough allowed Mingwei Zhang
@ 2024-11-19 17:32   ` Sean Christopherson
  2024-11-20  6:22     ` Mi, Dapeng
  0 siblings, 1 reply; 183+ messages in thread
From: Sean Christopherson @ 2024-11-19 17:32 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Dapeng Mi, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Thu, Aug 01, 2024, Mingwei Zhang wrote:
> Introduce a vendor specific API to check if rdpmc passthrough allowed.
> RDPMC passthrough requires guest VM have the full ownership of all
> counters. These include general purpose counters and fixed counters and
> some vendor specific MSRs such as PERF_METRICS. Since PERF_METRICS MSR is
> Intel specific, putting the check into vendor specific code.
> 
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
> ---
>  arch/x86/include/asm/kvm-x86-pmu-ops.h |  1 +
>  arch/x86/kvm/pmu.c                     |  1 +
>  arch/x86/kvm/pmu.h                     |  1 +
>  arch/x86/kvm/svm/pmu.c                 |  6 ++++++
>  arch/x86/kvm/vmx/pmu_intel.c           | 16 ++++++++++++++++
>  5 files changed, 25 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h
> index f852b13aeefe..fd986d5146e4 100644
> --- a/arch/x86/include/asm/kvm-x86-pmu-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h
> @@ -20,6 +20,7 @@ KVM_X86_PMU_OP(get_msr)
>  KVM_X86_PMU_OP(set_msr)
>  KVM_X86_PMU_OP(refresh)
>  KVM_X86_PMU_OP(init)
> +KVM_X86_PMU_OP(is_rdpmc_passthru_allowed)
>  KVM_X86_PMU_OP_OPTIONAL(reset)
>  KVM_X86_PMU_OP_OPTIONAL(deliver_pmi)
>  KVM_X86_PMU_OP_OPTIONAL(cleanup)
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 19104e16a986..3afefe4cf6e2 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -102,6 +102,7 @@ bool kvm_pmu_check_rdpmc_passthrough(struct kvm_vcpu *vcpu)
>  
>  	if (is_passthrough_pmu_enabled(vcpu) &&
>  	    !enable_vmware_backdoor &&
> +	    static_call(kvm_x86_pmu_is_rdpmc_passthru_allowed)(vcpu) &&

If the polarity is inverted, the callback can be OPTIONAL_RET0 on AMD.  E.g.

	if (kvm_pmu_call(rdpmc_needs_intercept)(vcpu))
		return false;

> +static bool intel_is_rdpmc_passthru_allowed(struct kvm_vcpu *vcpu)
> +{
> +	/*
> +	 * Per Intel SDM vol. 2 for RDPMC, 


Please don't reference specific sections in the comments.  For changelogs it's
ok, because changelogs are a snapshot in time.  But comments are living things
and will become stale in almost every case.  And I don't see any reason to reference
the SDM, just state the behavior; it's implied that that's the architectural
behavior, otherwise KVM is buggy.

> MSR_PERF_METRICS is accessible by

This is technically wrong, the SDM states that the RDPMC behavior is implementation
specific.  That matters to some extent, because if it was _just_ one MSR and was
guaranteed to always be that one MSR, it might be worth creating a virtualization
hole.

	/*
	 * Intercept RDPMC if the host supports PERF_METRICS, but the guest
	 * does not, as RDPMC with type 0x2000 accesses implementation specific
	 * metrics.
	 */


All that said, isn't this redundant with the number of fixed counters?  I'm having
a hell of a time finding anything concrete in the SDM, but IIUC fixed counter 3
is tightly coupled to perf metrics.  E.g. rather than add a vendor hook just for
this, rely on the fixed counters and refuse to enable the mediated PMU if the
underlying CPU model is nonsensical, i.e. perf metrics exists without ctr3.

And I kinda think we have to go that route, because enabling RDPMC interception
based on future features is doomed from the start.  E.g. if this code had been
written prior to PERF_METRICS, older KVMs would have zero clue that RDPMC needs
to be intercepted on newer hardware.
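
E.g. something like this at setup time (just a sketch; the exact home for the
check is TBD, and the "< 4" assumes fixed counters are contiguous, i.e. that
fixed counter 3 exists iff at least four fixed counters are reported):

	/*
	 * PERF_METRICS is tightly coupled to fixed counter 3; a CPU model
	 * that advertises one without the other is nonsensical.
	 */
	if (enable_mediated_pmu &&
	    (host_perf_cap & PMU_CAP_PERF_METRICS) &&
	    kvm_pmu_cap.num_counters_fixed < 4)
		enable_mediated_pmu = false;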

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 26/58] KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL
  2024-08-01  4:58 ` [RFC PATCH v3 26/58] KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL Mingwei Zhang
  2024-08-06  7:04   ` Mi, Dapeng
  2024-10-24 20:26   ` Chen, Zide
@ 2024-11-19 18:16   ` Sean Christopherson
  2024-11-20  7:56     ` Mi, Dapeng
  2 siblings, 1 reply; 183+ messages in thread
From: Sean Christopherson @ 2024-11-19 18:16 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Dapeng Mi, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Thu, Aug 01, 2024, Mingwei Zhang wrote:
> ---
>  arch/x86/include/asm/vmx.h |   1 +
>  arch/x86/kvm/vmx/vmx.c     | 117 +++++++++++++++++++++++++++++++------
>  arch/x86/kvm/vmx/vmx.h     |   3 +-
>  3 files changed, 103 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index d77a31039f24..5ed89a099533 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -106,6 +106,7 @@
>  #define VM_EXIT_CLEAR_BNDCFGS                   0x00800000
>  #define VM_EXIT_PT_CONCEAL_PIP			0x01000000
>  #define VM_EXIT_CLEAR_IA32_RTIT_CTL		0x02000000
> +#define VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL      0x40000000

Please add a helper in capabilities.h:

static inline bool cpu_has_save_perf_global_ctrl(void)
{
	return vmcs_config.vmexit_ctrl & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL;
}


>  #define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR	0x00036dff
>  
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 339742350b7a..34a420fa98c5 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4394,6 +4394,97 @@ static u32 vmx_pin_based_exec_ctrl(struct vcpu_vmx *vmx)
>  	return pin_based_exec_ctrl;
>  }
>  
> +static void vmx_set_perf_global_ctrl(struct vcpu_vmx *vmx)

This is a misleading and inaccurate name.  It does far more than "set" PERF_GLOBAL_CTRL,
it arguably doesn't ever "set" the MSR, and it gets the VMWRITE for the guest field
wrong too.

> +{
> +	u32 vmentry_ctrl = vm_entry_controls_get(vmx);
> +	u32 vmexit_ctrl = vm_exit_controls_get(vmx);
> +	struct vmx_msrs *m;
> +	int i;
> +
> +	if (cpu_has_perf_global_ctrl_bug() ||

Note, cpu_has_perf_global_ctrl_bug() is broken and needs to be purged:
https://lore.kernel.org/all/20241119011433.1797921-1-seanjc@google.com

Note #2, as mentioned earlier, the mediated PMU should take a hard dependency on
the load/save controls.

On to this code, it fails to enable the load/save controls, e.g. if userspace
does KVM_SET_CPUID2 without a PMU, then KVM_SET_CPUID2 with a PMU.  In that case,
KVM will fail to set the control bits, and will fallback to the slow MSR load/save
lists.

With all of the above and other ideas combined, something like so:

	bool set = enable_mediated_pmu && kvm_pmu_has_perf_global_ctrl();

	vm_entry_controls_changebit(vmx, VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL, set);
	vm_exit_controls_changebit(vmx,
				   VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL |
				   VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL, set);


And I vote to put this in intel_pmu_refresh(); that avoids needing to figure out
a name for the helper, while giving more flexibility on the local variable name.

Oh!  Definitely put it in intel_pmu_refresh(), because the RDPMC and MSR
interception logic needs to be there.  E.g. toggling CPU_BASED_RDPMC_EXITING
based solely on CPUID won't do the right thing if KVM ends up making the behavior
depend on PERF_CAPABILITIES.

Ditto for MSRs.  Though until my patch/series that drops kvm_pmu_refresh() from
kvm_pmu_init() lands[*], trying to update MSR intercepts during refresh() will hit
a NULL pointer deref as it's currently called before vmcs01 is allocated :-/

I expect to land that series before mediated PMU, but I don't think it makes sense
to take an explicit dependency for this series.  To fudge around the issue, maybe
do this for the next version?

static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
{
	__intel_pmu_refresh(vcpu);

	/*
	 * FIXME: Drop the MSR bitmap check if/when kvm_pmu_init() no longer
	 *        calls kvm_pmu_refresh(), i.e. when KVM refreshes the PMU only
	 *        after vmcs01 is allocated.
	 */
	if (to_vmx(vcpu)->vmcs01.msr_bitmap)
		intel_update_msr_intercepts(vcpu);

	vm_entry_controls_changebit(vmx, VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL,
				    enable_mediated_pmu && kvm_pmu_has_perf_global_ctrl());

	vm_exit_controls_changebit(vmx,
				   VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL |
				   VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL,
				   enable_mediated_pmu && kvm_pmu_has_perf_global_ctrl());
}

or with a local variable for "enable_mediated_pmu && kvm_pmu_has_perf_global_ctrl()".
I can't come up with a decent name. :-)

[*] https://lore.kernel.org/all/20240517173926.965351-10-seanjc@google.com

> +	    !is_passthrough_pmu_enabled(&vmx->vcpu)) {
> +		vmentry_ctrl &= ~VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
> +		vmexit_ctrl &= ~VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
> +		vmexit_ctrl &= ~VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL;
> +	}
> +
> +	if (is_passthrough_pmu_enabled(&vmx->vcpu)) {
> +		/*
> +		 * Setup auto restore guest PERF_GLOBAL_CTRL MSR at vm entry.
> +		 */
> +		if (vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) {
> +			vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, 0);

This incorrectly clobbers the guest's value.  A simple way to handle this is to
always propagate writes to PERF_GLOBAL_CTRL to the VMCS, if the write is allowed
and enable_mediated_pmu.  I.e. ensure GUEST_IA32_PERF_GLOBAL_CTRL is up-to-date
regardless of whether or not it's configured to be loaded.  Then there's no need
to write it here.
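
E.g. on the WRMSR path (sketch only; the VMCS write obviously needs to live in
VMX code, e.g. behind a vendor hook or in intel_pmu_set_msr()):

	pmu->global_ctrl = data;
	if (enable_mediated_pmu)
		vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, data);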

> +		} else {
> +			m = &vmx->msr_autoload.guest;
> +			i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
> +			if (i < 0) {
> +				i = m->nr++;
> +				vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
> +			}
> +			m->val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
> +			m->val[i].value = 0;
> +		}
> +		/*
> +		 * Setup auto clear host PERF_GLOBAL_CTRL msr at vm exit.
> +		 */
> +		if (vmexit_ctrl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL) {
> +			vmcs_write64(HOST_IA32_PERF_GLOBAL_CTRL, 0);

This should be unnecessary.  KVM should clear HOST_IA32_PERF_GLOBAL_CTRL in
vmx_set_constant_host_state() if enable_mediated_pmu is true.  Arguably, it might
make sense to clear it unconditionally, but with a comment explaining that it's
only actually constant for the mediated PMU.

And if the mediated PMU requires the VMCS knobs, then all of the load/store list
complexity goes away.
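
E.g. in vmx_set_constant_host_state() (sketch):

	/*
	 * Clear PERF_GLOBAL_CTRL on VM-Exit; the value is only truly constant
	 * when the mediated PMU is enabled.
	 */
	if (enable_mediated_pmu)
		vmcs_write64(HOST_IA32_PERF_GLOBAL_CTRL, 0);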

>  static u32 vmx_vmentry_ctrl(void)
>  {
>  	u32 vmentry_ctrl = vmcs_config.vmentry_ctrl;
> @@ -4401,17 +4492,10 @@ static u32 vmx_vmentry_ctrl(void)
>  	if (vmx_pt_mode_is_system())
>  		vmentry_ctrl &= ~(VM_ENTRY_PT_CONCEAL_PIP |
>  				  VM_ENTRY_LOAD_IA32_RTIT_CTL);
> -	/*
> -	 * IA32e mode, and loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically.
> -	 */
> -	vmentry_ctrl &= ~(VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL |
> -			  VM_ENTRY_LOAD_IA32_EFER |
> -			  VM_ENTRY_IA32E_MODE);
> -
> -	if (cpu_has_perf_global_ctrl_bug())
> -		vmentry_ctrl &= ~VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
> -
> -	return vmentry_ctrl;
> +	 /*
> +	  * IA32e mode, and loading of EFER is toggled dynamically.
> +	  */
> +	return vmentry_ctrl &= ~(VM_ENTRY_LOAD_IA32_EFER | VM_ENTRY_IA32E_MODE);

With my above suggestion, these changes are unnecessary.  If enable_mediated_pmu
is false, or the vCPU doesn't have a PMU, clearing the controls is correct.  And
when the vCPU is gifted a PMU, KVM will explicitly enable the controls.

To discourage incorrect usage of these helpers maybe rename them to
vmx_get_initial_{vmentry,vmexit}_ctrl()?

>  }
>  
>  static u32 vmx_vmexit_ctrl(void)
> @@ -4429,12 +4513,8 @@ static u32 vmx_vmexit_ctrl(void)
>  		vmexit_ctrl &= ~(VM_EXIT_PT_CONCEAL_PIP |
>  				 VM_EXIT_CLEAR_IA32_RTIT_CTL);
>  
> -	if (cpu_has_perf_global_ctrl_bug())
> -		vmexit_ctrl &= ~VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
> -
> -	/* Loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically */
> -	return vmexit_ctrl &
> -		~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER);

But this code needs to *add* clearing of VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 27/58] KVM: x86/pmu: Create a function prototype to disable MSR interception
  2024-08-01  4:58 ` [RFC PATCH v3 27/58] KVM: x86/pmu: Create a function prototype to disable MSR interception Mingwei Zhang
  2024-10-24 19:58   ` Chen, Zide
@ 2024-11-19 18:17   ` Sean Christopherson
  2024-11-20  7:57     ` Mi, Dapeng
  1 sibling, 1 reply; 183+ messages in thread
From: Sean Christopherson @ 2024-11-19 18:17 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Dapeng Mi, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Thu, Aug 01, 2024, Mingwei Zhang wrote:
> Add one extra pmu function prototype in kvm_pmu_ops to disable PMU MSR
> interception.

This hook/patch should be unnecessary, the logic belongs in the vendor refresh()
callback.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 28/58] KVM: x86/pmu: Add intel_passthrough_pmu_msrs() to pass-through PMU MSRs
  2024-08-01  4:58 ` [RFC PATCH v3 28/58] KVM: x86/pmu: Add intel_passthrough_pmu_msrs() to pass-through PMU MSRs Mingwei Zhang
@ 2024-11-19 18:24   ` Sean Christopherson
  2024-11-20 10:12     ` Mi, Dapeng
  0 siblings, 1 reply; 183+ messages in thread
From: Sean Christopherson @ 2024-11-19 18:24 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Dapeng Mi, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Thu, Aug 01, 2024, Mingwei Zhang wrote:
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 02c9019c6f85..737de5bf1eee 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -740,6 +740,52 @@ static bool intel_is_rdpmc_passthru_allowed(struct kvm_vcpu *vcpu)
>  	return true;
>  }
>  
> +/*
> + * Setup PMU MSR interception for both mediated passthrough vPMU and legacy
> + * emulated vPMU. Note that this function is called after each time userspace
> + * set CPUID.
> + */
> +static void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)

Function verb is misleading.  This doesn't always "passthrough" MSRs, it's also
responsible for enabling interception as needed.  intel_pmu_update_msr_intercepts()?

> +{
> +	bool msr_intercept = !is_passthrough_pmu_enabled(vcpu);
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +	int i;
> +
> +	/*
> +	 * Unexposed PMU MSRs are intercepted by default. However,
> +	 * KVM_SET_CPUID{,2} may be invoked multiple times. To ensure MSR
> +	 * interception is correct after each call of setting CPUID, explicitly
> +	 * touch msr bitmap for each PMU MSR.
> +	 */
> +	for (i = 0; i < kvm_pmu_cap.num_counters_gp; i++) {
> +		if (i >= pmu->nr_arch_gp_counters)
> +			msr_intercept = true;

Hmm, I like the idea and that y'all remembered to intercept unsupported MSRs, but
it's way, way too easy to clobber msr_intercept and fail to re-initialize across
for-loops.

Rather than update the variable mid-loop, how about this?

	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PERFCTR0 + i, MSR_TYPE_RW, intercept);
		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PMC0 + i, MSR_TYPE_RW,
					  intercept || !fw_writes_is_enabled(vcpu));
	}
	for ( ; i < kvm_pmu_cap.num_counters_gp; i++) {
		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PERFCTR0 + i, MSR_TYPE_RW, true);
		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PMC0 + i, MSR_TYPE_RW, true);
	}

	for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
		vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_FIXED_CTR0 + i, MSR_TYPE_RW, intercept);
	for ( ; i < kvm_pmu_cap.num_counters_fixed; i++)
		vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_FIXED_CTR0 + i, MSR_TYPE_RW, true);


> +		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PERFCTR0 + i, MSR_TYPE_RW, msr_intercept);
> +		if (fw_writes_is_enabled(vcpu))
> +			vmx_set_intercept_for_msr(vcpu, MSR_IA32_PMC0 + i, MSR_TYPE_RW, msr_intercept);
> +		else
> +			vmx_set_intercept_for_msr(vcpu, MSR_IA32_PMC0 + i, MSR_TYPE_RW, true);
> +	}
> +
> +	msr_intercept = !is_passthrough_pmu_enabled(vcpu);
> +	for (i = 0; i < kvm_pmu_cap.num_counters_fixed; i++) {
> +		if (i >= pmu->nr_arch_fixed_counters)
> +			msr_intercept = true;
> +		vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_FIXED_CTR0 + i, MSR_TYPE_RW, msr_intercept);
> +	}
> +
> +	if (pmu->version > 1 && is_passthrough_pmu_enabled(vcpu) &&

Don't open code kvm_pmu_has_perf_global_ctrl().

> +	    pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp &&
> +	    pmu->nr_arch_fixed_counters == kvm_pmu_cap.num_counters_fixed)
> +		msr_intercept = false;
> +	else
> +		msr_intercept = true;

This reinforces that checking PERF_CAPABILITIES for PERF_METRICS is likely doomed
to fail, because doesn't PERF_GLOBAL_CTRL need to be intercepted, strictly speaking,
to prevent setting EN_PERF_METRICS?

> +	vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_GLOBAL_STATUS, MSR_TYPE_RW, msr_intercept);
> +	vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL, MSR_TYPE_RW, msr_intercept);
> +	vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_GLOBAL_OVF_CTRL, MSR_TYPE_RW, msr_intercept);
> +}
> +
>  struct kvm_pmu_ops intel_pmu_ops __initdata = {
>  	.rdpmc_ecx_to_pmc = intel_rdpmc_ecx_to_pmc,
>  	.msr_idx_to_pmc = intel_msr_idx_to_pmc,
> @@ -752,6 +798,7 @@ struct kvm_pmu_ops intel_pmu_ops __initdata = {
>  	.deliver_pmi = intel_pmu_deliver_pmi,
>  	.cleanup = intel_pmu_cleanup,
>  	.is_rdpmc_passthru_allowed = intel_is_rdpmc_passthru_allowed,
> +	.passthrough_pmu_msrs = intel_passthrough_pmu_msrs,
>  	.EVENTSEL_EVENT = ARCH_PERFMON_EVENTSEL_EVENT,
>  	.MAX_NR_GP_COUNTERS = KVM_INTEL_PMC_MAX_GENERIC,
>  	.MIN_NR_GP_COUNTERS = 1,
> -- 
> 2.46.0.rc1.232.g9752f9e123-goog
> 

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 31/58] KVM: x86/pmu: Add counter MSR and selector MSR index into struct kvm_pmc
  2024-08-01  4:58 ` [RFC PATCH v3 31/58] KVM: x86/pmu: Add counter MSR and selector MSR index into struct kvm_pmc Mingwei Zhang
@ 2024-11-19 18:58   ` Sean Christopherson
  2024-11-20 11:50     ` Mi, Dapeng
  0 siblings, 1 reply; 183+ messages in thread
From: Sean Christopherson @ 2024-11-19 18:58 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Dapeng Mi, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

Please squash this with the patch that does the actual save/load.  Hmm, maybe it
should be put/load, now that I think about it more?  That's more consistent with
existing KVM terminology.

Anyways, please squash them together, it's very difficult to review them separately.

On Thu, Aug 01, 2024, Mingwei Zhang wrote:
> Add the MSR indices for both selector and counter in each kvm_pmc. Giving
> convenience to mediated passthrough vPMU in scenarios of querying MSR from
> a given pmc. Note that legacy vPMU does not need this because it never
> directly accesses PMU MSRs, instead each kvm_pmc is bound to a perf_event.
> 
> For actual Zen 4 and later hardware, it will never be the case that the
> PerfMonV2 CPUID bit is set but the PerfCtrCore bit is not. However, a
> guest can be booted with PerfMonV2 enabled and PerfCtrCore disabled.
> KVM does not clear the PerfMonV2 bit from guest CPUID as long as the
> host has the PerfCtrCore capability.
> 
> In this case, passthrough mode will use the K7 legacy MSRs to program
> events but with the incorrect assumption that there are 6 such counters
> instead of 4 as advertised by CPUID leaf 0x80000022 EBX. The host kernel
> will also report unchecked MSR accesses for the absent counters while
> saving or restoring guest PMU contexts.
> 
> Ensure that K7 legacy MSRs are not used as long as the guest CPUID has
> either PerfCtrCore or PerfMonV2 set.
> 
> Signed-off-by: Sandipan Das <sandipan.das@amd.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  2 ++
>  arch/x86/kvm/svm/pmu.c          | 13 +++++++++++++
>  arch/x86/kvm/vmx/pmu_intel.c    | 13 +++++++++++++
>  3 files changed, 28 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 4b3ce6194bdb..603727312f9c 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -522,6 +522,8 @@ struct kvm_pmc {
>  	 */
>  	u64 emulated_counter;
>  	u64 eventsel;
> +	u64 msr_counter;
> +	u64 msr_eventsel;

There's no need to track these per PMC, the tracking can be per PMU, e.g.

	u64 gp_eventsel_base;
	u64 gp_counter_base;
	u64 gp_shift;
	u64 fixed_base;

Actually, there's no need for a per-PMU fixed base, as that can be shoved into
kvm_pmu_ops.  LOL, and the upcoming patch hardcodes INTEL_PMC_FIXED_RDPMC_BASE.
Naughty, naughty ;-)

It's not pretty, but 16 bytes per PMC isn't trivial. 

Hmm, actually, scratch all that.  A better alternative would be to provide a
helper to put/load counter/selector MSRs, and call that from vendor code.  Ooh,
I think there's a bug here.  On AMD, the guest event selector MSRs need to be
loaded _before_ PERF_GLOBAL_CTRL, no?  I.e. enable the guest's counters only
after all selectors have been switched to AMD64_EVENTSEL_GUESTONLY.  Otherwise there
would be a brief window where KVM could incorrectly enable counters in the host.
And do the reverse of that for put().

But Intel has the opposite ordering, because MSR_CORE_PERF_GLOBAL_CTRL needs to
be cleared before changing event selectors.

And so trying to handle this entirely in common code, while noble, is at best
fragile and at worst buggy.

The common helper can take the bases and shift, and if we want to have it handle
fixed counters, the base for that too.
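
E.g. a rough sketch of the load side of the common helper (names and signature
are illustrative only; vendor code owns the ordering around PERF_GLOBAL_CTRL and
simply passes in its MSR bases and stride):

	void kvm_pmu_load_guest_gp_counters(struct kvm_vcpu *vcpu,
					    u32 eventsel_base, u32 counter_base,
					    u32 stride)
	{
		struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
		int i;

		for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
			struct kvm_pmc *pmc = &pmu->gp_counters[i];

			wrmsrl(eventsel_base + i * stride, pmc->eventsel);
			wrmsrl(counter_base + i * stride, pmc->counter);
		}
	}

with put() doing the mirror image via rdpmc()/rdmsrl().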

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 37/58] KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary
  2024-11-19 13:44           ` Sean Christopherson
@ 2024-11-20  2:08             ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-11-20  2:08 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Zide Chen, Mingwei Zhang, Paolo Bonzini, Xiong Zhang, Kan Liang,
	Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
	Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
	Samantha Alt, Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users


On 11/19/2024 9:44 PM, Sean Christopherson wrote:
> On Tue, Nov 19, 2024, Dapeng Mi wrote:
>> On 11/19/2024 9:46 AM, Sean Christopherson wrote:
>>> On Fri, Oct 25, 2024, Dapeng Mi wrote:
>>>> On 10/25/2024 4:26 AM, Chen, Zide wrote:
>>>>> On 7/31/2024 9:58 PM, Mingwei Zhang wrote:
>>>>>
>>>>>> @@ -7295,6 +7299,46 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
>>>>>>  					msrs[i].host, false);
>>>>>>  }
>>>>>>  
>>>>>> +static void save_perf_global_ctrl_in_passthrough_pmu(struct vcpu_vmx *vmx)
>>>>>> +{
>>>>>> +	struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
>>>>>> +	int i;
>>>>>> +
>>>>>> +	if (vm_exit_controls_get(vmx) & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL) {
>>>>>> +		pmu->global_ctrl = vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL);
>>>>> As commented in patch 26, compared with MSR auto save/store area
>>>>> approach, the exec control way needs one relatively expensive VMCS read
>>>>> on every VM exit.
>>>> Anyway, let us have a evaluation and data speaks.
>>> No, drop the unconditional VMREAD and VMWRITE, one way or another.  No benchmark
>>> will notice ~50 extra cycles, but if we write poor code for every feature, those
>>> 50 cycles per feature add up.
>>>
>>> Furthermore, checking to see if the CPU supports the load/save VMCS controls at
>>> runtime is beyond ridiculous.  The mediated PMU requires ***VERSION 4***; if a CPU
>>> supports PMU version 4 and doesn't support the VMCS controls, KVM should yell and
>>> disable the passthrough PMU.  The amount of complexity added here to support a
>>> CPU that should never exist is silly.
>>>
>>>>>> +static void load_perf_global_ctrl_in_passthrough_pmu(struct vcpu_vmx *vmx)
>>>>>> +{
>>>>>> +	struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
>>>>>> +	u64 global_ctrl = pmu->global_ctrl;
>>>>>> +	int i;
>>>>>> +
>>>>>> +	if (vm_entry_controls_get(vmx) & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) {
>>>>>> +		vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, global_ctrl);
>>>>> ditto.
>>>>>
>>>>> We may optimize it by introducing a new flag pmu->global_ctrl_dirty and
>>>>> update GUEST_IA32_PERF_GLOBAL_CTRL only when it's needed.  But this
>>>>> makes the code even more complicated.
>>> I haven't looked at surrounding code too much, but I guarantee there's _zero_
>>> reason to eat a VMWRITE+VMREAD on every transition.  If, emphasis on *if*, KVM
>>> accesses PERF_GLOBAL_CTRL frequently, e.g. on most exits, then add a VCPU_EXREG_XXX
>>> and let KVM's caching infrastructure do the heavy lifting.  Don't reinvent the
>>> wheel.  But first, convince the world that KVM actually accesses the MSR somewhat
>>> frequently.
>> Sean, let me give more background here.
>>
>> VMX supports two ways to save/restore PERF_GLOBAL_CTRL MSR, one is to
>> leverage VMCS_EXIT_CTRL/VMCS_ENTRY_CTRL to save/restore guest
>> PERF_GLOBAL_CTRL value to/from VMCS guest state. The other is to use the
>> VMCS MSR auto-load/restore bitmap to save/restore guest PERF_GLOBAL_CTRL. 
> I know.
>
>> Currently we prefer to use the former way to save/restore guest
>> PERF_GLOBAL_CTRL as long as HW supports it. There is a limitation on the
>> MSR auto-load/restore feature. When there are multiple MSRs, the MSRs are
>> saved/restored in the order of MSR index. As the suggestion of SDM,
>> PERF_GLOBAL_CTRL should always be written at last after all other PMU MSRs
>> are manipulated. So if there are some PMU MSRs whose index is larger than
>> PERF_GLOBAL_CTRL (It would be true in archPerfmon v6+, all PMU MSRs in the
>> new MSR range have larger index than PERF_GLOBAL_CTRL),
> No, the entries in the load/store lists are processed in sequential order as they
> appear in the lists.  Ordering them based on their MSR index would be insane and
> would make the lists useless.
>
>   VM entries may load MSRs from the VM-entry MSR-load area (see Section 25.8.2).
>   Specifically each entry in that area (up to the number specified in the VM-entry
>   MSR-load count) is processed in order by loading the MSR indexed by bits 31:0
>   with the contents of bits 127:64 as they would be written by WRMSR.1

Thanks, I just looked at the SDM again. It seems I misunderstood the wording. :(


>
>> these PMU MSRs would be restored after PERF_GLOBAL_CTRL. That would break the
>> rule. Of course, it's good to save/restore PERF_GLOBAL_CTRL right now with
>> the VMCS VMCS MSR auto-load/restore bitmap feature since only one PMU MSR
>> PERF_GLOBAL_CTRL is saved/restored in current implementation.
> No, it's never good to use the load/store lists.  They're slow as mud, because
> they're essentially just wrappers to the standard WRMSR/RDMSR ucode.  Whereas
> dedicated VMCS fields have dedicated, streamlined ucode to make loads and stores
> as fast as possible.
>
> I haven't measured PERF_GLOBAL_CTRL specifically, at least not in recent memory,
> but generally speaking using a load/store entry is 100+ cycles, whereas using a
> dedicated VMCS field is <20 cyles (often far less).
>
> So what I am saying is that the mediated PMU should _require_ support for loading
> and saving PERF_GLOBAL_CTRL via dedicated fields, and WARN if a CPU with a v4+
> PMU doesn't support said fields.  E.g.
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index a4b2b0b69a68..cab8305e7bf0 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -8620,6 +8620,15 @@ __init int vmx_hardware_setup(void)
>                 enable_sgx = false;
>  #endif
>  
> +       /*
> +        * All CPUs that support a mediated PMU are expected to support loading
> +        * and saving PERF_GLOBAL_CTRL via dedicated VMCS fields.
> +        */
> +       if (enable_passthrough_pmu &&
> +           (WARN_ON_ONCE(!cpu_has_load_perf_global_ctrl() ||
> +                         !cpu_has_save_perf_global_ctrl())))
> +               enable_passthrough_pmu = false;
> +
>         /*
>          * set_apic_access_page_addr() is used to reload apic access
>          * page upon invalidation.  No need to do anything if not
>
> That will provide better, more consistent performance, and will eliminate a big
> pile of non-trivial code.

Sure. When the mediated vPMU was first proposed, we considered supporting
all kinds of HWs, including some older HWs that may not be broadly used
nowadays, so we had to cover all kinds of corner cases. But on second
thought, it may not be worth it...


>
>> PERF_GLOBAL_CTRL MSR could be frequently accessed by perf/pmu driver, e.g.
>> on each task switch, so PERF_GLOBAL_CTRL MSR is configured to passthrough
>> to reduce the performance impact in mediated vPMU proposal if guest own all
>> PMU HW resource. But if guest only owns part of PMU HW resource,
>> PERF_GLOBAL_CTRL would be set to interception mode.
> Again, I know.  What I am saying is that propagating PERF_GLOBAL_CTRL to/from the
> VMCS on every entry and exit is extremely wasteful and completely unnecessary.

Yes, I understood your meaning.  For passthrough mode, KVM should not need
to access the guest PERF_GLOBAL_CTRL value. For interception mode, KVM
would maintain an emulated guest PERF_GLOBAL_CTRL value and doesn't need to
read/write it from/to the VMCS guest fields.

Anyway, it looks like these two helpers are useless; they can be removed.

>
>> I suppose KVM doesn't need access PERF_GLOBAL_CTRL in passthrough mode.
>> This piece of code is intently just for PERF_GLOBAL_CTRL interception mode,
> No, it's even more useless if PERF_GLOBAL_CTRL is intercepted, because in that
> case the _only_ time KVM needs move the guest's value to/from the VMCS is when
> the guest (or host userspace) is explicitly accessing the field.
>
>> but think twice it looks unnecessary to save/restore PERF_GLOBAL_CTRL via
>> VMCS as KVM would always maintain the guest PERF_GLOBAL_CTRL value? Anyway,
>> this part of code can be optimized.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86
  2024-11-19 14:00 ` Sean Christopherson
@ 2024-11-20  2:31   ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-11-20  2:31 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla,
	Sandipan Das, Jim Mattson, Stephane Eranian, Ian Rogers,
	Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 11/19/2024 10:00 PM, Sean Christopherson wrote:
> On Thu, Aug 01, 2024, Mingwei Zhang wrote:
>> This series contains perf interface improvements to address Peter's
>> comments. In addition, fix several bugs for v2. This version is based on
>> 6.10-rc4. The main changes are:
>>
>>  - Use atomics to replace refcounts to track the nr_mediated_pmu_vms.
>>  - Use the generic ctx_sched_{in,out}() to switch PMU resources when a
>>    guest is entering and exiting.
>>  - Add a new EVENT_GUEST flag to indicate the context switch case of
>>    entering and exiting a guest. Updates the generic ctx_sched_{in,out}
>>    to specifically handle this case, especially for time management.
>>  - Switch PMI vector in perf_guest_{enter,exit}() as well. Add a new
>>    driver-specific interface to facilitate the switch.
>>  - Remove the PMU_FL_PASSTHROUGH flag and uses the PASSTHROUGH pmu
>>    capability instead.
>>  - Adjust commit sequence in PERF and KVM PMI interrupt functions.
>>  - Use pmc_is_globally_enabled() check in emulated counter increment [1]
>>  - Fix PMU context switch [2] by using rdpmc() instead of rdmsr().
>>
>> AMD fixes:
>>  - Add support for legacy PMU MSRs in MSR interception.
>>  - Make MSR usage consistent if PerfMonV2 is available.
>>  - Avoid enabling passthrough vPMU when local APIC is not in kernel.
>>  - increment counters in emulation mode.
>>
>> This series is organized in the following order:
>>
>> Patches 1-3:
>>  - Immediate bug fixes that can be applied to Linux tip.
>>  - Note: will put immediate fixes ahead in the future. These patches
>>    might be duplicated with existing posts.
>>  - Note: patches 1-2 are needed for AMD when host kernel enables
>>    preemption. Otherwise, guest will suffer from softlockup.
>>
>> Patches 4-17:
>>  - Perf side changes, infra changes in core pmu with API for KVM.
>>
>> Patches 18-48:
>>  - KVM mediated passthrough vPMU framework + Intel CPU implementation.
>>
>> Patches 49-58:
>>  - AMD CPU implementation for vPMU.
> Please rename everything in KVM to drop "passthrough" and simply use "mediated"
> for the overall concept.  This is not a passthrough setup by any stretch of the
> word.  I realize it's a ton of renaming, but calling this "passthrough" is very
> misleading and actively harmful for unsuspecting readers.

Sure.


>
> For helpers and/or comments that deal with intercepting (or not) MSRs, use
> "intercept" and appropriate variations.  E.g. intel_pmu_update_msr_intercepts().

Sure.


>
> And for RDPMC, maybe kvm_rdpmc_in_guest() to follow kvm_{hlt,mwait,pause,cstate_in_guest()?
> I don't love the terminology, but there's a lot of value in being consistent
> throughout KVM.

Sure.


>
> I am not willing to budge on this, at all.
>
> I'm ok with the perf side of things using "passthrough" if "mediated" feels weird
> in that context and we can't come up with a better option, but for the KVM side,
> "passthrough" is simply wrong.

Kan, what's your idea on the perf side's naming? I prefer unified naming
across the whole patchset.



^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 18/58] KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter
  2024-11-19 14:30   ` Sean Christopherson
@ 2024-11-20  3:21     ` Mi, Dapeng
  2024-11-20 17:06       ` Sean Christopherson
  2025-01-15  0:17       ` Mingwei Zhang
  0 siblings, 2 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-11-20  3:21 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla,
	Sandipan Das, Jim Mattson, Stephane Eranian, Ian Rogers,
	Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 11/19/2024 10:30 PM, Sean Christopherson wrote:
> As per my feedback in the initial RFC[*]:
>
>  2. The module param absolutely must not be exposed to userspace until all patches
>     are in place.  The easiest way to do that without creating dependency hell is
>     to simply not create the module param.
>
> [*] https://lore.kernel.org/all/ZhhQBHQ6V7Zcb8Ve@google.com

Sure. It looks like we missed this comment. Will address it.


>
> On Thu, Aug 01, 2024, Mingwei Zhang wrote:
>> Introduce enable_passthrough_pmu as a RO KVM kernel module parameter. This
>> variable is true only when the following conditions satisfies:
>>  - set to true when module loaded.
>>  - enable_pmu is true.
>>  - is running on Intel CPU.
>>  - supports PerfMon v4.
>>  - host PMU supports passthrough mode.
>>
>> The value is always read-only because passthrough PMU currently does not
>> support features like LBR and PEBS, while emulated PMU does. This will end
>> up with two different values for kvm_cap.supported_perf_cap, which is
>> initialized at module load time. Maintaining two different perf
>> capabilities will add complexity. Further, there is not enough motivation
>> to support running two types of PMU implementations at the same time,
>> although it is possible/feasible in reality.
>>
>> Finally, always propagate enable_passthrough_pmu and perf_capabilities into
>> kvm->arch for each KVM instance.
>>
>> Co-developed-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
>> ---
>>  arch/x86/include/asm/kvm_host.h |  1 +
>>  arch/x86/kvm/pmu.h              | 14 ++++++++++++++
>>  arch/x86/kvm/vmx/vmx.c          |  7 +++++--
>>  arch/x86/kvm/x86.c              |  8 ++++++++
>>  arch/x86/kvm/x86.h              |  1 +
>>  5 files changed, 29 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index f8ca74e7678f..a15c783f20b9 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -1406,6 +1406,7 @@ struct kvm_arch {
>>  
>>  	bool bus_lock_detection_enabled;
>>  	bool enable_pmu;
>> +	bool enable_passthrough_pmu;
> Again, as I suggested/requested in the initial RFC[*], drop the per-VM flag as well
> as kvm_pmu.passthrough.  There is zero reason to cache the module param.  KVM
> should always query kvm->arch.enable_pmu prior to checking if the mediated PMU
> is enabled, so I doubt we even need a helper to check both.
>
> [*] https://lore.kernel.org/all/ZhhOEDAl6k-NzOkM@google.com

Sure.


>
>>  
>>  	u32 notify_window;
>>  	u32 notify_vmexit_flags;
>> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
>> index 4d52b0b539ba..cf93be5e7359 100644
>> --- a/arch/x86/kvm/pmu.h
>> +++ b/arch/x86/kvm/pmu.h
>> @@ -208,6 +208,20 @@ static inline void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops)
>>  			enable_pmu = false;
>>  	}
>>  
>> +	/* Pass-through vPMU is only supported in Intel CPUs. */
>> +	if (!is_intel)
>> +		enable_passthrough_pmu = false;
>> +
>> +	/*
>> +	 * Pass-through vPMU requires at least PerfMon version 4 because the
>> +	 * implementation requires the usage of MSR_CORE_PERF_GLOBAL_STATUS_SET
>> +	 * for counter emulation as well as PMU context switch.  In addition, it
>> +	 * requires host PMU support on passthrough mode. Disable pass-through
>> +	 * vPMU if any condition fails.
>> +	 */
>> +	if (!enable_pmu || kvm_pmu_cap.version < 4 || !kvm_pmu_cap.passthrough)
> As is quite obvious by the end of the series, the v4 requirement is specific to
> Intel.
>
> 	if (!enable_pmu || !kvm_pmu_cap.passthrough ||
> 	    (is_intel && kvm_pmu_cap.version < 4) ||
> 	    (is_amd && kvm_pmu_cap.version < 2))
> 		enable_passthrough_pmu = false;
>
> Furthermore, there is zero reason to explicitly and manually check the vendor,
> kvm_init_pmu_capability() takes kvm_pmu_ops.  Adding a callback is somewhat
> undesirable as it would lead to duplicate code, but we can still provide separation
> of concerns by adding const variables to kvm_pmu_ops, a la MAX_NR_GP_COUNTERS.
>
> E.g.
>
> 	if (enable_pmu) {
> 		perf_get_x86_pmu_capability(&kvm_pmu_cap);
>
> 		/*
> 		 * WARN if perf did NOT disable hardware PMU if the number of
> 		 * architecturally required GP counters aren't present, i.e. if
> 		 * there are a non-zero number of counters, but fewer than what
> 		 * is architecturally required.
> 		 */
> 		if (!kvm_pmu_cap.num_counters_gp ||
> 		    WARN_ON_ONCE(kvm_pmu_cap.num_counters_gp < min_nr_gp_ctrs))
> 			enable_pmu = false;
> 		else if (pmu_ops->MIN_PMU_VERSION > kvm_pmu_cap.version)
> 			enable_pmu = false;
> 	}
>
> 	if (!enable_pmu || !kvm_pmu_cap.passthrough ||
> 	    pmu_ops->MIN_MEDIATED_PMU_VERSION > kvm_pmu_cap.version)
> 		enable_mediated_pmu = false;

Sure, will do.


>> +		enable_passthrough_pmu = false;
>> +
>>  	if (!enable_pmu) {
>>  		memset(&kvm_pmu_cap, 0, sizeof(kvm_pmu_cap));
>>  		return;
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index ad465881b043..2ad122995f11 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -146,6 +146,8 @@ module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
>>  extern bool __read_mostly allow_smaller_maxphyaddr;
>>  module_param(allow_smaller_maxphyaddr, bool, S_IRUGO);
>>  
>> +module_param(enable_passthrough_pmu, bool, 0444);
> Hmm, we either need to put this param in kvm.ko, or move enable_pmu to vendor
> modules (or duplicate it there if we need to for backwards compatibility?).
>
> There are advantages to putting params in vendor modules, when it's safe to do so,
> e.g. it allows toggling the param when (re)loading a vendor module, so I think I'm
> supportive of having the param live in vendor code.  I just don't want to split
> the two PMU knobs.

Since enable_passthrough_pmu is already in the vendor modules, we'd better
duplicate the enable_pmu module parameter in the vendor modules as the first step.


>
>>  #define KVM_VM_CR0_ALWAYS_OFF (X86_CR0_NW | X86_CR0_CD)
>>  #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST X86_CR0_NE
>>  #define KVM_VM_CR0_ALWAYS_ON				\
>> @@ -7924,7 +7926,8 @@ static __init u64 vmx_get_perf_capabilities(void)
>>  	if (boot_cpu_has(X86_FEATURE_PDCM))
>>  		rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_perf_cap);
>>  
>> -	if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR)) {
>> +	if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR) &&
>> +	    !enable_passthrough_pmu) {
>>  		x86_perf_get_lbr(&vmx_lbr_caps);
>>  
>>  		/*
>> @@ -7938,7 +7941,7 @@ static __init u64 vmx_get_perf_capabilities(void)
>>  			perf_cap |= host_perf_cap & PMU_CAP_LBR_FMT;
>>  	}
>>  
>> -	if (vmx_pebs_supported()) {
>> +	if (vmx_pebs_supported() && !enable_passthrough_pmu) {
> Checking enable_mediated_pmu belongs in vmx_pebs_supported(), not in here,
> otherwise KVM will incorrectly advertise support to userspace:
>
> 	if (vmx_pebs_supported()) {
> 		kvm_cpu_cap_check_and_set(X86_FEATURE_DS);
> 		kvm_cpu_cap_check_and_set(X86_FEATURE_DTES64);
> 	}

Sure. Thanks for pointing this out.
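
Something like this, I suppose (just a sketch, assuming the helper's existing
body is only the PEBS + pebs_ept check and simply appending the new condition):

	static inline bool vmx_pebs_supported(void)
	{
		return boot_cpu_has(X86_FEATURE_PEBS) && kvm_pmu_cap.pebs_ept &&
		       !enable_mediated_pmu;
	}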


>
>>  		perf_cap |= host_perf_cap & PERF_CAP_PEBS_MASK;
>>  		/*
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index f1d589c07068..0c40f551130e 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -187,6 +187,10 @@ bool __read_mostly enable_pmu = true;
>>  EXPORT_SYMBOL_GPL(enable_pmu);
>>  module_param(enable_pmu, bool, 0444);
>>  
>> +/* Enable/disable mediated passthrough PMU virtualization */
>> +bool __read_mostly enable_passthrough_pmu;
>> +EXPORT_SYMBOL_GPL(enable_passthrough_pmu);
>> +
>>  bool __read_mostly eager_page_split = true;
>>  module_param(eager_page_split, bool, 0644);
>>  
>> @@ -6682,6 +6686,9 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>>  		mutex_lock(&kvm->lock);
>>  		if (!kvm->created_vcpus) {
>>  			kvm->arch.enable_pmu = !(cap->args[0] & KVM_PMU_CAP_DISABLE);
>> +			/* Disable passthrough PMU if enable_pmu is false. */
>> +			if (!kvm->arch.enable_pmu)
>> +				kvm->arch.enable_passthrough_pmu = false;
> And this code obviously goes away if the per-VM snapshot is removed.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 19/58] KVM: x86/pmu: Plumb through pass-through PMU to vcpu for Intel CPUs
  2024-11-19 14:54   ` Sean Christopherson
@ 2024-11-20  3:47     ` Mi, Dapeng
  2024-11-20 16:45       ` Sean Christopherson
  0 siblings, 1 reply; 183+ messages in thread
From: Mi, Dapeng @ 2024-11-20  3:47 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla,
	Sandipan Das, Jim Mattson, Stephane Eranian, Ian Rogers,
	Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 11/19/2024 10:54 PM, Sean Christopherson wrote:
> On Thu, Aug 01, 2024, Mingwei Zhang wrote:
>> Plumb through pass-through PMU setting from kvm->arch into kvm_pmu on each
>> vcpu created. Note that enabling PMU is decided by VMM when it sets the
>> CPUID bits exposed to guest VM. So plumb through the enabling for each pmu
>> in intel_pmu_refresh().
> Why?  As with the per-VM snapshot, I see zero reason for this to exist, it's
> simply:
>
>   kvm->arch.enable_pmu && enable_mediated_pmu && pmu->version;
>
> And in literally every correct usage of pmu->passthrough, kvm->arch.enable_pmu
> and pmu->version have been checked (though implicitly), i.e. KVM can check
> enable_mediated_pmu and nothing else.

Ok, too many passthrough_pmu flags indeed confuse readers. Besides these
dependencies, mediated vPMU also depends on lapic_in_kernel(). We need to
set enable_mediated_pmu to false as well if lapic_in_kernel() returns false.




^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 20/58] KVM: x86/pmu: Always set global enable bits in passthrough mode
  2024-11-19 15:37   ` Sean Christopherson
@ 2024-11-20  5:19     ` Mi, Dapeng
  2024-11-20 17:09       ` Sean Christopherson
  0 siblings, 1 reply; 183+ messages in thread
From: Mi, Dapeng @ 2024-11-20  5:19 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla,
	Sandipan Das, Jim Mattson, Stephane Eranian, Ian Rogers,
	Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 11/19/2024 11:37 PM, Sean Christopherson wrote:
> On Thu, Aug 01, 2024, Mingwei Zhang wrote:
>> From: Sandipan Das <sandipan.das@amd.com>
>>
>> Currently, the global control bits for a vcpu are restored to the reset
>> state only if the guest PMU version is less than 2. This works for
>> emulated PMU as the MSRs are intercepted and backing events are created
>> for and managed by the host PMU [1].
>>
>> If such a guest is run with passthrough PMU, the counters no longer work
>> because the global enable bits are cleared. Hence, set the global enable
>> bits to their reset state if passthrough PMU is used.
>>
>> A passthrough-capable host may not necessarily support PMU version 2 and
>> it can choose to restore or save the global control state from struct
>> kvm_pmu in the PMU context save and restore helpers depending on the
>> availability of the global control register.
>>
>> [1] 7b46b733bdb4 ("KVM: x86/pmu: Set enable bits for GP counters in PERF_GLOBAL_CTRL at "RESET"");
>>
>> Reported-by: Mingwei Zhang <mizhang@google.com>
>> Signed-off-by: Sandipan Das <sandipan.das@amd.com>
>> [removed the fixes tag]
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>  arch/x86/kvm/pmu.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>> index 5768ea2935e9..e656f72fdace 100644
>> --- a/arch/x86/kvm/pmu.c
>> +++ b/arch/x86/kvm/pmu.c
>> @@ -787,7 +787,7 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
>>  	 * in the global controls).  Emulate that behavior when refreshing the
>>  	 * PMU so that userspace doesn't need to manually set PERF_GLOBAL_CTRL.
>>  	 */
>> -	if (kvm_pmu_has_perf_global_ctrl(pmu) && pmu->nr_arch_gp_counters)
>> +	if ((pmu->passthrough || kvm_pmu_has_perf_global_ctrl(pmu)) && pmu->nr_arch_gp_counters)
>>  		pmu->global_ctrl = GENMASK_ULL(pmu->nr_arch_gp_counters - 1, 0);
> This is wrong and confusing.  From the guest's perspective, and therefore from
> host userspace's perspective, PERF_GLOBAL_CTRL does not exist.  Therefore, the
> value that is tracked for the guest must be '0'.
>
> I see that intel_passthrough_pmu_msrs() and amd_passthrough_pmu_msrs() intercept
> accesses to PERF_GLOBAL_CTRL if "pmu->version > 1" (which, by the by, needs to be
> kvm_pmu_has_perf_global_ctrl()), so there's no weirdness with the guest being able
> to access MSRs that shouldn't exist.
>
> But KVM shouldn't stuff pmu->global_ctrl, and doing so is a symptom of another
> flaw.  Unless I'm missing something, KVM stuffs pmu->global_ctrl so that the
> correct value is loaded on VM-Enter, but loading and saving PERF_GLOBAL_CTRL on
> entry/exit is unnecessary and confusing, as is loading the associated MSRs when
> restoring (loading) the guest context.
>
> For PERF_GLOBAL_CTRL on Intel, KVM needs to ensure all GP counters are enabled in
> VMCS.GUEST_IA32_PERF_GLOBAL_CTRL, but that's a "set once and forget" operation,
> not something that needs to be done on every entry and exit.  Of course, loading
> and saving PERF_GLOBAL_CTRL on every entry/exit is unnecessary for other reasons,
> but that's largely orthogonal.
>
> On AMD, amd_restore_pmu_context()[*] needs to enable a maximal value for
> PERF_GLOBAL_CTRL, but I don't think there's any need to load the other MSRs,
> and the maximal value should come from the above logic, not pmu->global_ctrl.

Sean, just to double-confirm: you are suggesting doing a one-shot initialization
of guest PERF_GLOBAL_CTRL (VMCS.GUEST_IA32_PERF_GLOBAL_CTRL for Intel)
after vCPU reset, right?



>
> [*] Side topic, in case I forget later, that API should be "load", not "restore".
>     There is no assumption or guarantee that KVM is exactly restoring anything,
>     e.g. if PERF_GLOBAL_CTRL doesn't exist in the guest PMU and on the first load.

Sure.



^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 22/58] KVM: x86/pmu: Add host_perf_cap and initialize it in kvm_x86_vendor_init()
  2024-11-19 15:43   ` Sean Christopherson
@ 2024-11-20  5:21     ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-11-20  5:21 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla,
	Sandipan Das, Jim Mattson, Stephane Eranian, Ian Rogers,
	Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 11/19/2024 11:43 PM, Sean Christopherson wrote:
> On Thu, Aug 01, 2024, Mingwei Zhang wrote:
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 0c40f551130e..6db4dc496d2b 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -239,6 +239,9 @@ EXPORT_SYMBOL_GPL(host_xss);
>>  u64 __read_mostly host_arch_capabilities;
>>  EXPORT_SYMBOL_GPL(host_arch_capabilities);
>>  
>> +u64 __read_mostly host_perf_cap;
>> +EXPORT_SYMBOL_GPL(host_perf_cap);
> In case you don't get a conflict on rebase, this should go in "struct kvm_host_values"
> as "perf_capabilities".

Sure. Thanks.


>
>>  const struct _kvm_stats_desc kvm_vm_stats_desc[] = {
>>  	KVM_GENERIC_VM_STATS(),
>>  	STATS_DESC_COUNTER(VM, mmu_shadow_zapped),
>> @@ -9793,6 +9796,9 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
>>  	if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
>>  		rdmsrl(MSR_IA32_ARCH_CAPABILITIES, host_arch_capabilities);
>>  
>> +	if (boot_cpu_has(X86_FEATURE_PDCM))
>> +		rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_perf_cap);
>> +
>>  	r = ops->hardware_setup();
>>  	if (r != 0)
>>  		goto out_mmu_exit;
>> -- 
>> 2.46.0.rc1.232.g9752f9e123-goog
>>

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 23/58] KVM: x86/pmu: Allow RDPMC pass through when all counters exposed to guest
  2024-11-19 16:32   ` Sean Christopherson
@ 2024-11-20  5:31     ` Mi, Dapeng
  2025-01-22  5:08       ` Mi, Dapeng
  0 siblings, 1 reply; 183+ messages in thread
From: Mi, Dapeng @ 2024-11-20  5:31 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla,
	Sandipan Das, Jim Mattson, Stephane Eranian, Ian Rogers,
	Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 11/20/2024 12:32 AM, Sean Christopherson wrote:
> On Thu, Aug 01, 2024, Mingwei Zhang wrote:
>> Clear RDPMC_EXITING in vmcs when all counters on the host side are exposed
>> to guest VM. This gives performance to passthrough PMU. However, when guest
>> does not get all counters, intercept RDPMC to prevent access to unexposed
>> counters. Make decision in vmx_vcpu_after_set_cpuid() when guest enables
>> PMU and passthrough PMU is enabled.
>>
>> Co-developed-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
>> ---
>>  arch/x86/kvm/pmu.c     | 16 ++++++++++++++++
>>  arch/x86/kvm/pmu.h     |  1 +
>>  arch/x86/kvm/vmx/vmx.c |  5 +++++
>>  3 files changed, 22 insertions(+)
>>
>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>> index e656f72fdace..19104e16a986 100644
>> --- a/arch/x86/kvm/pmu.c
>> +++ b/arch/x86/kvm/pmu.c
>> @@ -96,6 +96,22 @@ void kvm_pmu_ops_update(const struct kvm_pmu_ops *pmu_ops)
>>  #undef __KVM_X86_PMU_OP
>>  }
>>  
>> +bool kvm_pmu_check_rdpmc_passthrough(struct kvm_vcpu *vcpu)
> As suggested earlier, kvm_rdpmc_in_guest().

Sure.


>
>> +{
>> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> +
>> +	if (is_passthrough_pmu_enabled(vcpu) &&
>> +	    !enable_vmware_backdoor &&
> Please add a comment about the VMware backdoor, I doubt most folks know about
> VMware's tweaks to RDPMC behavior.  It's somewhat obvious from the code and
> comment in check_rdpmc(), but I think it's worth calling out here too.

Sure.


>
>> +	    pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp &&
>> +	    pmu->nr_arch_fixed_counters == kvm_pmu_cap.num_counters_fixed &&
>> +	    pmu->counter_bitmask[KVM_PMC_GP] == (((u64)1 << kvm_pmu_cap.bit_width_gp) - 1) &&
>> +	    pmu->counter_bitmask[KVM_PMC_FIXED] == (((u64)1 << kvm_pmu_cap.bit_width_fixed)  - 1))
> BIT_ULL?  GENMASK_ULL?

Sure.


>
>> +		return true;
>> +
>> +	return false;
> Do this:
>
>
> 	return <true>;
>
> not:
>
> 	if (<true>)
> 		return true;
>
> 	return false;
>
> Short-circuiting on certain cases is fine, and I would probably vote for that so
> it's easier to add comments, but that's obviously not what's done here.  E.g. either
>
> 	if (!enable_mediated_pmu)
> 		return false;
>
> 	/* comment goes here */
> 	if (enable_vmware_backdoor)
> 		return false;
>
> 	return <counters checks>;
>
> or
>
> 	return <massive combined check>;

Nice suggestion. Thanks.
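
For what it's worth, a minimal sketch of the combined-check form, assuming
the kvm_rdpmc_in_guest() rename and a knob named enable_mediated_pmu (both
names come from elsewhere in this thread, not from the posted patch):

static inline bool kvm_rdpmc_in_guest(struct kvm_vcpu *vcpu)
{
	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);

	if (!enable_mediated_pmu)
		return false;

	/*
	 * The VMware backdoor overloads RDPMC to read pseudo-counters, so
	 * RDPMC must always be intercepted in that configuration.
	 */
	if (enable_vmware_backdoor)
		return false;

	return pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp &&
	       pmu->nr_arch_fixed_counters == kvm_pmu_cap.num_counters_fixed &&
	       pmu->counter_bitmask[KVM_PMC_GP] == GENMASK_ULL(kvm_pmu_cap.bit_width_gp - 1, 0) &&
	       pmu->counter_bitmask[KVM_PMC_FIXED] == GENMASK_ULL(kvm_pmu_cap.bit_width_fixed - 1, 0);
}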


>
>> +}
>> +EXPORT_SYMBOL_GPL(kvm_pmu_check_rdpmc_passthrough);
> Maybe just make this an inline in a header?  enable_vmware_backdoor is exported,
> and presumably enable_mediated_pmu will be too.

Sure.


>
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 4d60a8cf2dd1..339742350b7a 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -7911,6 +7911,11 @@ void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>>  		vmx->msr_ia32_feature_control_valid_bits &=
>>  			~FEAT_CTL_SGX_LC_ENABLED;
>>  
>> +	if (kvm_pmu_check_rdpmc_passthrough(&vmx->vcpu))
> No need to follow vmx->vcpu, @vcpu is readily available.

Yes.


>
>> +		exec_controls_clearbit(vmx, CPU_BASED_RDPMC_EXITING);
>> +	else
>> +		exec_controls_setbit(vmx, CPU_BASED_RDPMC_EXITING);
> I wonder if it makes sense to add a helper to change a bit.  IIRC, the only reason
> I didn't add one along with the set/clear helpers was because there weren't many
> users and I couldn't think of good alternative to "set".
>
> I still don't have a good name, but I think we're reaching the point where it's
> worth forcing the issue to avoid common goofs, e.g. handling only the "clear"
> case and no the "set" case.
>
> Maybe changebit?  E.g.
>
> static __always_inline void lname##_controls_changebit(struct vcpu_vmx *vmx, u##bits val,	\
> 						       bool set)				\
> {												\
> 	if (set)										\
> 		lname##_controls_setbit(vmx, val);						\
> 	else											\
> 		lname##_controls_clearbit(vmx, val);						\
> }
>
>
> and then vmx_refresh_apicv_exec_ctrl() can be:
>
> 	secondary_exec_controls_changebit(vmx,
> 					  SECONDARY_EXEC_APIC_REGISTER_VIRT |
> 					  SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY,
> 					  kvm_vcpu_apicv_active(vcpu));
> 	tertiary_exec_controls_changebit(vmx, TERTIARY_EXEC_IPI_VIRT,
> 					 kvm_vcpu_apicv_active(vcpu) && enable_ipiv);
>
> and this can be:
>
> 	exec_controls_changebit(vmx, CPU_BASED_RDPMC_EXITING,
> 				!kvm_rdpmc_in_guest(vcpu));

Sure, will add a separate patch to add these helpers.



^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 24/58] KVM: x86/pmu: Introduce macro PMU_CAP_PERF_METRICS
  2024-11-19 17:03   ` Sean Christopherson
@ 2024-11-20  5:44     ` Mi, Dapeng
  2024-11-20 17:21       ` Sean Christopherson
  0 siblings, 1 reply; 183+ messages in thread
From: Mi, Dapeng @ 2024-11-20  5:44 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla,
	Sandipan Das, Jim Mattson, Stephane Eranian, Ian Rogers,
	Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 11/20/2024 1:03 AM, Sean Christopherson wrote:
> On Thu, Aug 01, 2024, Mingwei Zhang wrote:
>> From: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>
>> Define macro PMU_CAP_PERF_METRICS to represent bit[15] of
>> MSR_IA32_PERF_CAPABILITIES MSR. This bit is used to represent whether
>> perf metrics feature is enabled.
>>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>  arch/x86/kvm/vmx/capabilities.h | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
>> index 41a4533f9989..d8317552b634 100644
>> --- a/arch/x86/kvm/vmx/capabilities.h
>> +++ b/arch/x86/kvm/vmx/capabilities.h
>> @@ -22,6 +22,7 @@ extern int __read_mostly pt_mode;
>>  #define PT_MODE_HOST_GUEST	1
>>  
>>  #define PMU_CAP_FW_WRITES	(1ULL << 13)
>> +#define PMU_CAP_PERF_METRICS	BIT_ULL(15)
> BIT() should suffice.  The 1ULL used for FW_WRITES is unnecessary.  Speaking of
> which, can you update the other #defines while you're at it?  The mix of styles
> annoys me :-)
>
> #define PMU_CAP_FW_WRITES	BIT(13)
> #define PMU_CAP_PERF_METRICS	BIT(15)
> #define PMU_CAP_LBR_FMT		GENMASK(5, 0)

Sure. Could we further move all these PERF_CAPABILITIES macros into
arch/x86/include/asm/msr-index.h? I just found there are already some
PERF_CAPABILITIES macros defined in this file and it looks like a better
place to define these macros.

#define PERF_CAP_PEBS_TRAP             BIT_ULL(6)
#define PERF_CAP_ARCH_REG              BIT_ULL(7)
#define PERF_CAP_PEBS_FORMAT           0xf00
#define PERF_CAP_PEBS_BASELINE         BIT_ULL(14)
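
If everything moves there, the consolidated set might look roughly like this
(the names for the relocated LBR/FW-writes/metrics bits are assumptions,
picked to match the existing PERF_CAP_* style):

/* arch/x86/include/asm/msr-index.h, MSR_IA32_PERF_CAPABILITIES bits (sketch) */
#define PERF_CAP_LBR_FMT		GENMASK_ULL(5, 0)
#define PERF_CAP_PEBS_TRAP		BIT_ULL(6)
#define PERF_CAP_ARCH_REG		BIT_ULL(7)
#define PERF_CAP_PEBS_FORMAT		GENMASK_ULL(11, 8)
#define PERF_CAP_FW_WRITES		BIT_ULL(13)
#define PERF_CAP_PEBS_BASELINE		BIT_ULL(14)
#define PERF_CAP_PERF_METRICS		BIT_ULL(15)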


>>  #define PMU_CAP_LBR_FMT		0x3f
>>  
>>  struct nested_vmx_msrs {
>> -- 
>> 2.46.0.rc1.232.g9752f9e123-goog
>>

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 25/58] KVM: x86/pmu: Introduce PMU operator to check if rdpmc passthrough allowed
  2024-11-19 17:32   ` Sean Christopherson
@ 2024-11-20  6:22     ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-11-20  6:22 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla,
	Sandipan Das, Jim Mattson, Stephane Eranian, Ian Rogers,
	Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 11/20/2024 1:32 AM, Sean Christopherson wrote:
> On Thu, Aug 01, 2024, Mingwei Zhang wrote:
>> Introduce a vendor specific API to check if rdpmc passthrough allowed.
>> RDPMC passthrough requires guest VM have the full ownership of all
>> counters. These include general purpose counters and fixed counters and
>> some vendor specific MSRs such as PERF_METRICS. Since PERF_METRICS MSR is
>> Intel specific, putting the check into vendor specific code.
>>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
>> ---
>>  arch/x86/include/asm/kvm-x86-pmu-ops.h |  1 +
>>  arch/x86/kvm/pmu.c                     |  1 +
>>  arch/x86/kvm/pmu.h                     |  1 +
>>  arch/x86/kvm/svm/pmu.c                 |  6 ++++++
>>  arch/x86/kvm/vmx/pmu_intel.c           | 16 ++++++++++++++++
>>  5 files changed, 25 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h
>> index f852b13aeefe..fd986d5146e4 100644
>> --- a/arch/x86/include/asm/kvm-x86-pmu-ops.h
>> +++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h
>> @@ -20,6 +20,7 @@ KVM_X86_PMU_OP(get_msr)
>>  KVM_X86_PMU_OP(set_msr)
>>  KVM_X86_PMU_OP(refresh)
>>  KVM_X86_PMU_OP(init)
>> +KVM_X86_PMU_OP(is_rdpmc_passthru_allowed)
>>  KVM_X86_PMU_OP_OPTIONAL(reset)
>>  KVM_X86_PMU_OP_OPTIONAL(deliver_pmi)
>>  KVM_X86_PMU_OP_OPTIONAL(cleanup)
>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>> index 19104e16a986..3afefe4cf6e2 100644
>> --- a/arch/x86/kvm/pmu.c
>> +++ b/arch/x86/kvm/pmu.c
>> @@ -102,6 +102,7 @@ bool kvm_pmu_check_rdpmc_passthrough(struct kvm_vcpu *vcpu)
>>  
>>  	if (is_passthrough_pmu_enabled(vcpu) &&
>>  	    !enable_vmware_backdoor &&
>> +	    static_call(kvm_x86_pmu_is_rdpmc_passthru_allowed)(vcpu) &&
> If the polarity is inverted, the callback can be OPTIONAL_RET0 on AMD.  E.g.
>
> 	if (kvm_pmu_call(rdpmc_needs_intercept(vcpu)))
> 		return false;
>
>> +static bool intel_is_rdpmc_passthru_allowed(struct kvm_vcpu *vcpu)
>> +{
>> +	/*
>> +	 * Per Intel SDM vol. 2 for RDPMC, 
>
> Please don't reference specific sections in the comments.  For changelogs it's
> ok, because changelogs are a snapshot in time.  But comments are living things
> and will become stale in almost every case.  And I don't see any reason to reference
> the SDM, just state the behavior; it's implied that that's the architectural
> behavior, otherwise KVM is buggy.
>
>> MSR_PERF_METRICS is accessible by
> This is technically wrong, the SDM states that the RDPMC behavior is implementation
> specific.  That matters to some extent, because if it was _just_ one MSR and was
> guaranteed to always be that one MSR, it might be worth creating a virtualization
> hole.
>
> 	/*
> 	 * Intercept RDPMC if the host supports PERF_METRICS, but the guest
> 	 * does not, as RDPMC with type 0x2000 accesses implementation specific
> 	 * metrics.
> 	 */
>
>
> All that said, isn't this redundant with the number of fixed counters?  I'm having
> a hell of a time finding anything concrete in the SDM, but IIUC fixed counter 3
> is tightly coupled to perf metrics.  E.g. rather than add a vendor hook just for
> this, rely on the fixed counters and refuse to enable the mediated PMU if the
> underlying CPU model is nonsensical, i.e. perf metrics exists without ctr3.
>
> And I kinda think we have to go that route, because enabling RDPMC interception
> based on future features is doomed from the start.  E.g. if this code had been
> written prior to PERF_METRICS, older KVMs would have zero clue that RDPMC needs
> to be intercepted on newer hardware.

Yeah, this makes sense. Fixed counter 3 and PERF_METRICS are always coupled
as a whole, and the previous code already checks the fixed counter bitmap.
I think we can drop this patch and just add a comment to explain the
reason.
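
Something like the following comment next to the fixed-counter check could
capture that (wording is only a suggestion):

	/*
	 * RDPMC with type 0x2000 reads implementation-specific metrics, and
	 * PERF_METRICS is coupled with fixed counter 3, so requiring that
	 * the guest own the full set of fixed counters (checked above)
	 * already forces RDPMC interception when the guest can't see
	 * PERF_METRICS; no vendor-specific hook is needed.
	 */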



^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 26/58] KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL
  2024-11-19 18:16   ` Sean Christopherson
@ 2024-11-20  7:56     ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-11-20  7:56 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla,
	Sandipan Das, Jim Mattson, Stephane Eranian, Ian Rogers,
	Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 11/20/2024 2:16 AM, Sean Christopherson wrote:
> On Thu, Aug 01, 2024, Mingwei Zhang wrote:
>> ---
>>  arch/x86/include/asm/vmx.h |   1 +
>>  arch/x86/kvm/vmx/vmx.c     | 117 +++++++++++++++++++++++++++++++------
>>  arch/x86/kvm/vmx/vmx.h     |   3 +-
>>  3 files changed, 103 insertions(+), 18 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
>> index d77a31039f24..5ed89a099533 100644
>> --- a/arch/x86/include/asm/vmx.h
>> +++ b/arch/x86/include/asm/vmx.h
>> @@ -106,6 +106,7 @@
>>  #define VM_EXIT_CLEAR_BNDCFGS                   0x00800000
>>  #define VM_EXIT_PT_CONCEAL_PIP			0x01000000
>>  #define VM_EXIT_CLEAR_IA32_RTIT_CTL		0x02000000
>> +#define VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL      0x40000000
> Please add a helper in capabilities.h:
>
> static inline bool cpu_has_save_perf_global_ctrl(void)
> {
> 	return vmcs_config.vmexit_ctrl & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL;
> }

Sure.


>
>>  #define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR	0x00036dff
>>  
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 339742350b7a..34a420fa98c5 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -4394,6 +4394,97 @@ static u32 vmx_pin_based_exec_ctrl(struct vcpu_vmx *vmx)
>>  	return pin_based_exec_ctrl;
>>  }
>>  
>> +static void vmx_set_perf_global_ctrl(struct vcpu_vmx *vmx)
> This is a misleading and inaccurate name.  It does far more than "set" PERF_GLOBAL_CTRL,
> it arguably doesn't ever "set" the MSR, and it gets the VMWRITE for the guest field
> wrong too.
>
>> +{
>> +	u32 vmentry_ctrl = vm_entry_controls_get(vmx);
>> +	u32 vmexit_ctrl = vm_exit_controls_get(vmx);
>> +	struct vmx_msrs *m;
>> +	int i;
>> +
>> +	if (cpu_has_perf_global_ctrl_bug() ||
> Note, cpu_has_perf_global_ctrl_bug() broken and needs to be purged:
> https://lore.kernel.org/all/20241119011433.1797921-1-seanjc@google.com
>
> Note #2, as mentioned earlier, the mediated PMU should take a hard depenency on
> the load/save controls.
>
> On to this code, it fails to enable the load/save controls, e.g. if userspace
> does KVM_SET_CPUID2 without a PMU, then KVM_SET_CPUID2 with a PMU.  In that case,
> KVM will fail to set the control bits, and will fallback to the slow MSR load/save
> lists.
>
> With all of the above and other ideas combined, something like so:
>
> 	bool set = enable_mediated_pmu && kvm_pmu_has_perf_global_ctrl();
>
> 	vm_entry_controls_changebit(vmx, VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL, set);
> 	vm_exit_controls_changebit(vmx,
> 				   VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL |
> 				   VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL, set);
>
>
> And I vote to put this in intel_pmu_refresh(); that avoids needing to figure out
> a name for the helper, while giving more flexibililty on the local variable name.
>
> Oh!  Definitely put it in intel_pmu_refresh(), because the RDPMC and MSR
> interception logic needs to be there.  E.g. toggling CPU_BASED_RDPMC_EXITING
> based solely on CPUID won't do the right thing if KVM ends up making the behavior
> depend on PERF_CAPABILITIES.
>
> Ditto for MSRs.  Though until my patch/series that drops kvm_pmu_refresh() from
> kvm_pmu_init() lands[*], trying to update MSR intercepts during refresh() will hit
> a NULL pointer deref as it's currently called before vmcs01 is allocated :-/
>
> I expect to land that series before mediated PMU, but I don't think it makes sense
> to take an explicit dependency for this series.  To fudge around the issue, maybe
> do this for the next version?
>
> static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
> {
> 	__intel_pmu_refresh(vcpu);
>
> 	/*
> 	 * FIXME: Drop the MSR bitmap check if/when kvm_pmu_init() no longer
> 	 *        calls kvm_pmu_refresh(), i.e. when KVM refreshes the PMU only
> 	 *        after vmcs01 is allocated.
> 	 */
> 	if (to_vmx(vcpu)->vmcs01.msr_bitmap)
> 		intel_update_msr_intercepts(vcpu);
>
> 	vm_entry_controls_changebit(vmx, VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL,
> 				    enable_mediated_pmu && kvm_pmu_has_perf_global_ctrl());
>
> 	vm_exit_controls_changebit(vmx,
> 				   VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL |
> 				   VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL,
> 				   enable_mediated_pmu && kvm_pmu_has_perf_global_ctrl());
> }
>
> or with a local variable for "enable_mediated_pmu && kvm_pmu_has_perf_global_ctrl()".
> I can't come up with a decent name. :-)
>
> [*] https://lore.kernel.org/all/20240517173926.965351-10-seanjc@google.com

Sure. This looks better.


>
>> +	    !is_passthrough_pmu_enabled(&vmx->vcpu)) {
>> +		vmentry_ctrl &= ~VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
>> +		vmexit_ctrl &= ~VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
>> +		vmexit_ctrl &= ~VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL;
>> +	}
>> +
>> +	if (is_passthrough_pmu_enabled(&vmx->vcpu)) {
>> +		/*
>> +		 * Setup auto restore guest PERF_GLOBAL_CTRL MSR at vm entry.
>> +		 */
>> +		if (vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) {
>> +			vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, 0);
> This incorrectly clobbers the guest's value.  A simple way to handle this is to
> always propagate writes to PERF_GLOBAL_CTRL to the VMCS, if the write is allowed
> and enable_mediated_pmu.  I.e. ensure GUEST_IA32_PERF_GLOBAL_CTRL is up-to-date
> regardless of whether or not it's configured to be loaded.  Then there's no need
> to write it here.

For the mediated vPMU, I think we can move this into intel_pmu_refresh() as
well, just after the vm_entry_controls_changebit() helpers. The value should
be the mask of all supported GP counters, which matches the reset behavior
of PERF_GLOBAL_CTRL.
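
i.e., something along these lines in intel_pmu_refresh(), with the same
vmcs01-not-yet-allocated caveat Sean raised above (sketch only,
enable_mediated_pmu being the knob name floated in this thread):

	/*
	 * Enable all exposed GP counters in the guest PERF_GLOBAL_CTRL
	 * field; the field is only consumed when the VM-Entry load control
	 * is enabled for the mediated PMU.
	 */
	if (enable_mediated_pmu && pmu->nr_arch_gp_counters)
		vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL,
			     GENMASK_ULL(pmu->nr_arch_gp_counters - 1, 0));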


>
>> +		} else {
>> +			m = &vmx->msr_autoload.guest;
>> +			i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
>> +			if (i < 0) {
>> +				i = m->nr++;
>> +				vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
>> +			}
>> +			m->val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
>> +			m->val[i].value = 0;
>> +		}
>> +		/*
>> +		 * Setup auto clear host PERF_GLOBAL_CTRL msr at vm exit.
>> +		 */
>> +		if (vmexit_ctrl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL) {
>> +			vmcs_write64(HOST_IA32_PERF_GLOBAL_CTRL, 0);
> This should be unnecessary.  KVM should clear HOST_IA32_PERF_GLOBAL_CTRL in
> vmx_set_constant_host_state() if enable_mediated_pmu is true.  Arguably, it might
> make sense to clear it unconditionally, but with a comment explaining that it's
> only actually constant for the mediated PMU.

Sure, would move this clearing into vmx_set_constant_host_state(). Yeah, I
suppose it can be cleared unconditionally since the
VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL bit is never set for the legacy emulated
vPMU.
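
i.e., roughly (sketch, assuming the write is harmless when the VM-Exit load
control is clear):

static void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
{
	/* ... existing constant host state setup ... */

	/*
	 * The host value is constant (0) regardless of whether the VM-Exit
	 * load control is used, so clear it unconditionally; it only takes
	 * effect for the mediated PMU.
	 */
	vmcs_write64(HOST_IA32_PERF_GLOBAL_CTRL, 0);
}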


>
> And if the mediated PMU requires the VMCS knobs, then all of the load/store list
> complexity goes away.
>
>>  static u32 vmx_vmentry_ctrl(void)
>>  {
>>  	u32 vmentry_ctrl = vmcs_config.vmentry_ctrl;
>> @@ -4401,17 +4492,10 @@ static u32 vmx_vmentry_ctrl(void)
>>  	if (vmx_pt_mode_is_system())
>>  		vmentry_ctrl &= ~(VM_ENTRY_PT_CONCEAL_PIP |
>>  				  VM_ENTRY_LOAD_IA32_RTIT_CTL);
>> -	/*
>> -	 * IA32e mode, and loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically.
>> -	 */
>> -	vmentry_ctrl &= ~(VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL |
>> -			  VM_ENTRY_LOAD_IA32_EFER |
>> -			  VM_ENTRY_IA32E_MODE);
>> -
>> -	if (cpu_has_perf_global_ctrl_bug())
>> -		vmentry_ctrl &= ~VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
>> -
>> -	return vmentry_ctrl;
>> +	 /*
>> +	  * IA32e mode, and loading of EFER is toggled dynamically.
>> +	  */
>> +	return vmentry_ctrl &= ~(VM_ENTRY_LOAD_IA32_EFER | VM_ENTRY_IA32E_MODE);
> With my above suggestion, these changes are unnecessary.  If enable_mediated_pmu
> is false, or the vCPU doesn't have a PMU, clearing the controls is correct.  And
> when the vCPU is gifted a PMU, KVM will explicitly enabled the controls.
Yes.

>
> To discourage incorrect usage of these helpers maybe rename them to
> vmx_get_initial_{vmentry,vmexit}_ctrl()?

Sure.


>
>>  }
>>  
>>  static u32 vmx_vmexit_ctrl(void)
>> @@ -4429,12 +4513,8 @@ static u32 vmx_vmexit_ctrl(void)
>>  		vmexit_ctrl &= ~(VM_EXIT_PT_CONCEAL_PIP |
>>  				 VM_EXIT_CLEAR_IA32_RTIT_CTL);
>>  
>> -	if (cpu_has_perf_global_ctrl_bug())
>> -		vmexit_ctrl &= ~VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
>> -
>> -	/* Loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically */
>> -	return vmexit_ctrl &
>> -		~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER);
> But this code needs to *add* clearing of VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL.

Sure.



^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 27/58] KVM: x86/pmu: Create a function prototype to disable MSR interception
  2024-11-19 18:17   ` Sean Christopherson
@ 2024-11-20  7:57     ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-11-20  7:57 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla,
	Sandipan Das, Jim Mattson, Stephane Eranian, Ian Rogers,
	Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 11/20/2024 2:17 AM, Sean Christopherson wrote:
> On Thu, Aug 01, 2024, Mingwei Zhang wrote:
>> Add one extra pmu function prototype in kvm_pmu_ops to disable PMU MSR
>> interception.
> This hook/patch should be unnecessary, the logic belongs in the vendor refresh()
> callback.

Yes.



^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 28/58] KVM: x86/pmu: Add intel_passthrough_pmu_msrs() to pass-through PMU MSRs
  2024-11-19 18:24   ` Sean Christopherson
@ 2024-11-20 10:12     ` Mi, Dapeng
  2024-11-20 18:32       ` Sean Christopherson
  0 siblings, 1 reply; 183+ messages in thread
From: Mi, Dapeng @ 2024-11-20 10:12 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla,
	Sandipan Das, Jim Mattson, Stephane Eranian, Ian Rogers,
	Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 11/20/2024 2:24 AM, Sean Christopherson wrote:
> On Thu, Aug 01, 2024, Mingwei Zhang wrote:
>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
>> index 02c9019c6f85..737de5bf1eee 100644
>> --- a/arch/x86/kvm/vmx/pmu_intel.c
>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
>> @@ -740,6 +740,52 @@ static bool intel_is_rdpmc_passthru_allowed(struct kvm_vcpu *vcpu)
>>  	return true;
>>  }
>>  
>> +/*
>> + * Setup PMU MSR interception for both mediated passthrough vPMU and legacy
>> + * emulated vPMU. Note that this function is called after each time userspace
>> + * set CPUID.
>> + */
>> +static void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
> Function verb is misleading.  This doesn't always "passthrough" MSRs, it's also
> responsible for enabling interception as needed.  intel_pmu_update_msr_intercepts()?

Yes, it's better. Thanks.


>
>> +{
>> +	bool msr_intercept = !is_passthrough_pmu_enabled(vcpu);
>> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> +	int i;
>> +
>> +	/*
>> +	 * Unexposed PMU MSRs are intercepted by default. However,
>> +	 * KVM_SET_CPUID{,2} may be invoked multiple times. To ensure MSR
>> +	 * interception is correct after each call of setting CPUID, explicitly
>> +	 * touch msr bitmap for each PMU MSR.
>> +	 */
>> +	for (i = 0; i < kvm_pmu_cap.num_counters_gp; i++) {
>> +		if (i >= pmu->nr_arch_gp_counters)
>> +			msr_intercept = true;
> Hmm, I like the idea and that y'all remembered to intercept unsupported MSRs, but
> it's way, way too easy to clobber msr_intercept and fail to re-initialize across
> for-loops.
>
> Rather than update the variable mid-loop, how about this?
>
> 	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> 		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PERFCTR0 + i, MSR_TYPE_RW, intercept);
> 		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PMC0 + i, MSR_TYPE_RW,
> 					  intercept || !fw_writes_is_enabled(vcpu));
> 	}
> 	for ( ; i < kvm_pmu_cap.num_counters_gp; i++) {
> 		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PERFCTR0 + i, MSR_TYPE_RW, true);
> 		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PMC0 + i, MSR_TYPE_RW, true);
> 	}
>
> 	for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
> 		vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_FIXED_CTR0 + i, MSR_TYPE_RW, intercept);
> 	for ( ; i < kvm_pmu_cap.num_counters_fixed; i++)
> 		vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_FIXED_CTR0 + i, MSR_TYPE_RW, true);

Yeah, it's indeed better.


>
>
>> +		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PERFCTR0 + i, MSR_TYPE_RW, msr_intercept);
>> +		if (fw_writes_is_enabled(vcpu))
>> +			vmx_set_intercept_for_msr(vcpu, MSR_IA32_PMC0 + i, MSR_TYPE_RW, msr_intercept);
>> +		else
>> +			vmx_set_intercept_for_msr(vcpu, MSR_IA32_PMC0 + i, MSR_TYPE_RW, true);
>> +	}
>> +
>> +	msr_intercept = !is_passthrough_pmu_enabled(vcpu);
>> +	for (i = 0; i < kvm_pmu_cap.num_counters_fixed; i++) {
>> +		if (i >= pmu->nr_arch_fixed_counters)
>> +			msr_intercept = true;
>> +		vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_FIXED_CTR0 + i, MSR_TYPE_RW, msr_intercept);
>> +	}
>> +
>> +	if (pmu->version > 1 && is_passthrough_pmu_enabled(vcpu) &&
> Don't open code kvm_pmu_has_perf_global_ctrl().

Oh, yes.


>
>> +	    pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp &&
>> +	    pmu->nr_arch_fixed_counters == kvm_pmu_cap.num_counters_fixed)
>> +		msr_intercept = false;
>> +	else
>> +		msr_intercept = true;
> This reinforces that checking PERF_CAPABILITIES for PERF_METRICS is likely doomed
> to fail, because doesn't PERF_GLOBAL_CTRL need to be intercepted, strictly speaking,
> to prevent setting EN_PERF_METRICS?

Sean, do you mean we need to check whether the guest supports PERF_METRICS
here? If it doesn't, we need to keep the global MSRs intercepted so the
guest can't enable PERF_METRICS, right?



>
>> +	vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_GLOBAL_STATUS, MSR_TYPE_RW, msr_intercept);
>> +	vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL, MSR_TYPE_RW, msr_intercept);
>> +	vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_GLOBAL_OVF_CTRL, MSR_TYPE_RW, msr_intercept);
>> +}
>> +
>>  struct kvm_pmu_ops intel_pmu_ops __initdata = {
>>  	.rdpmc_ecx_to_pmc = intel_rdpmc_ecx_to_pmc,
>>  	.msr_idx_to_pmc = intel_msr_idx_to_pmc,
>> @@ -752,6 +798,7 @@ struct kvm_pmu_ops intel_pmu_ops __initdata = {
>>  	.deliver_pmi = intel_pmu_deliver_pmi,
>>  	.cleanup = intel_pmu_cleanup,
>>  	.is_rdpmc_passthru_allowed = intel_is_rdpmc_passthru_allowed,
>> +	.passthrough_pmu_msrs = intel_passthrough_pmu_msrs,
>>  	.EVENTSEL_EVENT = ARCH_PERFMON_EVENTSEL_EVENT,
>>  	.MAX_NR_GP_COUNTERS = KVM_INTEL_PMC_MAX_GENERIC,
>>  	.MIN_NR_GP_COUNTERS = 1,
>> -- 
>> 2.46.0.rc1.232.g9752f9e123-goog
>>

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 31/58] KVM: x86/pmu: Add counter MSR and selector MSR index into struct kvm_pmc
  2024-11-19 18:58   ` Sean Christopherson
@ 2024-11-20 11:50     ` Mi, Dapeng
  2024-11-20 17:30       ` Sean Christopherson
  0 siblings, 1 reply; 183+ messages in thread
From: Mi, Dapeng @ 2024-11-20 11:50 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla,
	Sandipan Das, Jim Mattson, Stephane Eranian, Ian Rogers,
	Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 11/20/2024 2:58 AM, Sean Christopherson wrote:
> Please squash this with the patch that does the actual save/load.  Hmm, maybe it
> should be put/load, now that I think about it more?  That's more consitent with
> existing KVM terminology.

Sure. I noticed this inconsistency before, but "put" seemed less intuitive
than "save", so I didn't change it.


>
> Anyways, please squash them together, it's very difficult to review them separately.
>
> On Thu, Aug 01, 2024, Mingwei Zhang wrote:
>> Add the MSR indices for both selector and counter in each kvm_pmc. Giving
>> convenience to mediated passthrough vPMU in scenarios of querying MSR from
>> a given pmc. Note that legacy vPMU does not need this because it never
>> directly accesses PMU MSRs, instead each kvm_pmc is bound to a perf_event.
>>
>> For actual Zen 4 and later hardware, it will never be the case that the
>> PerfMonV2 CPUID bit is set but the PerfCtrCore bit is not. However, a
>> guest can be booted with PerfMonV2 enabled and PerfCtrCore disabled.
>> KVM does not clear the PerfMonV2 bit from guest CPUID as long as the
>> host has the PerfCtrCore capability.
>>
>> In this case, passthrough mode will use the K7 legacy MSRs to program
>> events but with the incorrect assumption that there are 6 such counters
>> instead of 4 as advertised by CPUID leaf 0x80000022 EBX. The host kernel
>> will also report unchecked MSR accesses for the absent counters while
>> saving or restoring guest PMU contexts.
>>
>> Ensure that K7 legacy MSRs are not used as long as the guest CPUID has
>> either PerfCtrCore or PerfMonV2 set.
>>
>> Signed-off-by: Sandipan Das <sandipan.das@amd.com>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>  arch/x86/include/asm/kvm_host.h |  2 ++
>>  arch/x86/kvm/svm/pmu.c          | 13 +++++++++++++
>>  arch/x86/kvm/vmx/pmu_intel.c    | 13 +++++++++++++
>>  3 files changed, 28 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 4b3ce6194bdb..603727312f9c 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -522,6 +522,8 @@ struct kvm_pmc {
>>  	 */
>>  	u64 emulated_counter;
>>  	u64 eventsel;
>> +	u64 msr_counter;
>> +	u64 msr_eventsel;
> There's no need to track these per PMC, the tracking can be per PMU, e.g.
>
> 	u64 gp_eventsel_base;
> 	u64 gp_counter_base;
> 	u64 gp_shift;
> 	u64 fixed_base;
>
> Actually, there's no need for a per-PMU fixed base, as that can be shoved into
> kvm_pmu_ops.  LOL, and the upcoming patch hardcodes INTEL_PMC_FIXED_RDPMC_BASE.
> Naughty, naughty ;-)
>
> It's not pretty, but 16 bytes per PMC isn't trivial. 
>
> Hmm, actually, scratch all that.  A better alternative would be to provide a
> helper to put/load counter/selector MSRs, and call that from vendor code.  Ooh,
> I think there's a bug here.  On AMD, the guest event selector MSRs need to be
> loaded _before_ PERF_GLOBAL_CTRL, no?  I.e. enable the guest's counters only
> after all selectors have been switched AMD64_EVENTSEL_GUESTONLY.  Otherwise there
> would be a brief window where KVM could incorrectly enable counters in the host.
> And the reverse that for put().
>
> But Intel has the opposite ordering, because MSR_CORE_PERF_GLOBAL_CTRL needs to
> be cleared before changing event selectors.

Not quite sure about AMD platforms, but it seems both Intel and AMD
platforms follow the sequence below to manipulate PMU MSRs:

disable PERF_GLOBAL_CTRL MSR

manipulate counter-level PMU MSR

enable PERF_GLOBAL_CTRL MSR

It seems there are no issues?


>
> And so trying to handle this entirely in common code, while noble, is at best
> fragile and at worst buggy.
>
> The common helper can take the bases and shift, and if we want to have it handle
> fixed counters, the base for that too.

Sure, would add a callback, and the vendor-specific code would fill in the
base and shift fields via that callback.
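
A rough sketch of what that common helper could look like (names and
parameters are assumptions; vendor code passes in its MSR bases and stride,
and calls it at whatever point in its own sequence is safe with respect to
PERF_GLOBAL_CTRL):

/* Sketch: common load of guest GP counters/selectors, called by vendor code. */
static void kvm_pmu_load_gp_counters(struct kvm_vcpu *vcpu, u32 eventsel_base,
				     u32 counter_base, u32 stride)
{
	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
	int i;

	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
		wrmsrl(counter_base + i * stride, pmu->gp_counters[i].counter);
		wrmsrl(eventsel_base + i * stride, pmu->gp_counters[i].eventsel_hw);
	}
}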


>

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86
  2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
                   ` (59 preceding siblings ...)
  2024-11-19 14:00 ` Sean Christopherson
@ 2024-11-20 11:55 ` Mi, Dapeng
  2024-11-20 18:34   ` Sean Christopherson
  60 siblings, 1 reply; 183+ messages in thread
From: Mi, Dapeng @ 2024-11-20 11:55 UTC (permalink / raw)
  To: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
	Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
  Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
	gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
	Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users

Hi Sean,

Do you think we can remove the "RFC" tag in the next version of the
patchset? You and Peter have reviewed the proposal thoroughly (thanks for
your review efforts) and it seems there are no hard obstacles besides
implementation details. Thanks.


On 8/1/2024 12:58 PM, Mingwei Zhang wrote:
> This series contains perf interface improvements to address Peter's
> comments. In addition, fix several bugs for v2. This version is based on
> 6.10-rc4. The main changes are:
>
>  - Use atomics to replace refcounts to track the nr_mediated_pmu_vms.
>  - Use the generic ctx_sched_{in,out}() to switch PMU resources when a
>    guest is entering and exiting.
>  - Add a new EVENT_GUEST flag to indicate the context switch case of
>    entering and exiting a guest. Updates the generic ctx_sched_{in,out}
>    to specifically handle this case, especially for time management.
>  - Switch PMI vector in perf_guest_{enter,exit}() as well. Add a new
>    driver-specific interface to facilitate the switch.
>  - Remove the PMU_FL_PASSTHROUGH flag and uses the PASSTHROUGH pmu
>    capability instead.
>  - Adjust commit sequence in PERF and KVM PMI interrupt functions.
>  - Use pmc_is_globally_enabled() check in emulated counter increment [1]
>  - Fix PMU context switch [2] by using rdpmc() instead of rdmsr().
>
> AMD fixes:
>  - Add support for legacy PMU MSRs in MSR interception.
>  - Make MSR usage consistent if PerfMonV2 is available.
>  - Avoid enabling passthrough vPMU when local APIC is not in kernel.
>  - increment counters in emulation mode.
>
> This series is organized in the following order:
>
> Patches 1-3:
>  - Immediate bug fixes that can be applied to Linux tip.
>  - Note: will put immediate fixes ahead in the future. These patches
>    might be duplicated with existing posts.
>  - Note: patches 1-2 are needed for AMD when host kernel enables
>    preemption. Otherwise, guest will suffer from softlockup.
>
> Patches 4-17:
>  - Perf side changes, infra changes in core pmu with API for KVM.
>
> Patches 18-48:
>  - KVM mediated passthrough vPMU framework + Intel CPU implementation.
>
> Patches 49-58:
>  - AMD CPU implementation for vPMU.
>
> Reference (patches in v2):
> [1] [PATCH v2 42/54] KVM: x86/pmu: Implement emulated counter increment for passthrough PMU
>  - https://lore.kernel.org/all/20240506053020.3911940-43-mizhang@google.com/
> [2] [PATCH v2 30/54] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
>  - https://lore.kernel.org/all/20240506053020.3911940-31-mizhang@google.com/
>
> V2: https://lore.kernel.org/all/20240506053020.3911940-1-mizhang@google.com/
>
> Dapeng Mi (3):
>   x86/msr: Introduce MSR_CORE_PERF_GLOBAL_STATUS_SET
>   KVM: x86/pmu: Introduce macro PMU_CAP_PERF_METRICS
>   KVM: x86/pmu: Add intel_passthrough_pmu_msrs() to pass-through PMU
>     MSRs
>
> Kan Liang (8):
>   perf: Support get/put passthrough PMU interfaces
>   perf: Skip pmu_ctx based on event_type
>   perf: Clean up perf ctx time
>   perf: Add a EVENT_GUEST flag
>   perf: Add generic exclude_guest support
>   perf: Add switch_interrupt() interface
>   perf/x86: Support switch_interrupt interface
>   perf/x86/intel: Support PERF_PMU_CAP_PASSTHROUGH_VPMU
>
> Manali Shukla (1):
>   KVM: x86/pmu/svm: Wire up PMU filtering functionality for passthrough
>     PMU
>
> Mingwei Zhang (24):
>   perf/x86: Forbid PMI handler when guest own PMU
>   perf: core/x86: Plumb passthrough PMU capability from x86_pmu to
>     x86_pmu_cap
>   KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter
>   KVM: x86/pmu: Plumb through pass-through PMU to vcpu for Intel CPUs
>   KVM: x86/pmu: Add a helper to check if passthrough PMU is enabled
>   KVM: x86/pmu: Add host_perf_cap and initialize it in
>     kvm_x86_vendor_init()
>   KVM: x86/pmu: Allow RDPMC pass through when all counters exposed to
>     guest
>   KVM: x86/pmu: Introduce PMU operator to check if rdpmc passthrough
>     allowed
>   KVM: x86/pmu: Create a function prototype to disable MSR interception
>   KVM: x86/pmu: Avoid legacy vPMU code when accessing global_ctrl in
>     passthrough vPMU
>   KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot()
>   KVM: x86/pmu: Add counter MSR and selector MSR index into struct
>     kvm_pmc
>   KVM: x86/pmu: Introduce PMU operation prototypes for save/restore PMU
>     context
>   KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
>   KVM: x86/pmu: Make check_pmu_event_filter() an exported function
>   KVM: x86/pmu: Allow writing to event selector for GP counters if event
>     is allowed
>   KVM: x86/pmu: Allow writing to fixed counter selector if counter is
>     exposed
>   KVM: x86/pmu: Exclude existing vLBR logic from the passthrough PMU
>   KVM: x86/pmu: Introduce PMU operator to increment counter
>   KVM: x86/pmu: Introduce PMU operator for setting counter overflow
>   KVM: x86/pmu: Implement emulated counter increment for passthrough PMU
>   KVM: x86/pmu: Update pmc_{read,write}_counter() to disconnect perf API
>   KVM: x86/pmu: Disconnect counter reprogram logic from passthrough PMU
>   KVM: nVMX: Add nested virtualization support for passthrough PMU
>
> Sandipan Das (12):
>   perf/x86: Do not set bit width for unavailable counters
>   x86/msr: Define PerfCntrGlobalStatusSet register
>   KVM: x86/pmu: Always set global enable bits in passthrough mode
>   KVM: x86/pmu/svm: Set passthrough capability for vcpus
>   KVM: x86/pmu/svm: Set enable_passthrough_pmu module parameter
>   KVM: x86/pmu/svm: Allow RDPMC pass through when all counters exposed
>     to guest
>   KVM: x86/pmu/svm: Implement callback to disable MSR interception
>   KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest
>     write to event selectors
>   KVM: x86/pmu/svm: Add registers to direct access list
>   KVM: x86/pmu/svm: Implement handlers to save and restore context
>   KVM: x86/pmu/svm: Implement callback to increment counters
>   perf/x86/amd: Support PERF_PMU_CAP_PASSTHROUGH_VPMU for AMD host
>
> Sean Christopherson (2):
>   sched/core: Move preempt_model_*() helpers from sched.h to preempt.h
>   sched/core: Drop spinlocks on contention iff kernel is preemptible
>
> Xiong Zhang (8):
>   x86/irq: Factor out common code for installing kvm irq handler
>   perf: core/x86: Register a new vector for KVM GUEST PMI
>   KVM: x86/pmu: Register KVM_GUEST_PMI_VECTOR handler
>   KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL
>   KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary
>   KVM: x86/pmu: Notify perf core at KVM context switch boundary
>   KVM: x86/pmu: Grab x86 core PMU for passthrough PMU VM
>   KVM: x86/pmu: Add support for PMU context switch at VM-exit/enter
>
>  .../admin-guide/kernel-parameters.txt         |   4 +-
>  arch/x86/events/amd/core.c                    |   2 +
>  arch/x86/events/core.c                        |  44 +-
>  arch/x86/events/intel/core.c                  |   5 +
>  arch/x86/include/asm/hardirq.h                |   1 +
>  arch/x86/include/asm/idtentry.h               |   1 +
>  arch/x86/include/asm/irq.h                    |   2 +-
>  arch/x86/include/asm/irq_vectors.h            |   5 +-
>  arch/x86/include/asm/kvm-x86-pmu-ops.h        |   6 +
>  arch/x86/include/asm/kvm_host.h               |   9 +
>  arch/x86/include/asm/msr-index.h              |   2 +
>  arch/x86/include/asm/perf_event.h             |   1 +
>  arch/x86/include/asm/vmx.h                    |   1 +
>  arch/x86/kernel/idt.c                         |   1 +
>  arch/x86/kernel/irq.c                         |  39 +-
>  arch/x86/kvm/cpuid.c                          |   3 +
>  arch/x86/kvm/pmu.c                            | 154 +++++-
>  arch/x86/kvm/pmu.h                            |  49 ++
>  arch/x86/kvm/svm/pmu.c                        | 136 +++++-
>  arch/x86/kvm/svm/svm.c                        |  31 ++
>  arch/x86/kvm/svm/svm.h                        |   2 +-
>  arch/x86/kvm/vmx/capabilities.h               |   1 +
>  arch/x86/kvm/vmx/nested.c                     |  52 +++
>  arch/x86/kvm/vmx/pmu_intel.c                  | 192 +++++++-
>  arch/x86/kvm/vmx/vmx.c                        | 197 ++++++--
>  arch/x86/kvm/vmx/vmx.h                        |   3 +-
>  arch/x86/kvm/x86.c                            |  45 ++
>  arch/x86/kvm/x86.h                            |   1 +
>  include/linux/perf_event.h                    |  38 +-
>  include/linux/preempt.h                       |  41 ++
>  include/linux/sched.h                         |  41 --
>  include/linux/spinlock.h                      |  14 +-
>  kernel/events/core.c                          | 441 ++++++++++++++----
>  .../beauty/arch/x86/include/asm/irq_vectors.h |   5 +-
>  34 files changed, 1373 insertions(+), 196 deletions(-)
>
>
> base-commit: 6ba59ff4227927d3a8530fc2973b80e94b54d58f

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 19/58] KVM: x86/pmu: Plumb through pass-through PMU to vcpu for Intel CPUs
  2024-11-20  3:47     ` Mi, Dapeng
@ 2024-11-20 16:45       ` Sean Christopherson
  2024-11-21  0:29         ` Mi, Dapeng
  0 siblings, 1 reply; 183+ messages in thread
From: Sean Christopherson @ 2024-11-20 16:45 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Mingwei Zhang, Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Wed, Nov 20, 2024, Dapeng Mi wrote:
> 
> On 11/19/2024 10:54 PM, Sean Christopherson wrote:
> > On Thu, Aug 01, 2024, Mingwei Zhang wrote:
> >> Plumb through pass-through PMU setting from kvm->arch into kvm_pmu on each
> >> vcpu created. Note that enabling PMU is decided by VMM when it sets the
> >> CPUID bits exposed to guest VM. So plumb through the enabling for each pmu
> >> in intel_pmu_refresh().
> > Why?  As with the per-VM snapshot, I see zero reason for this to exist, it's
> > simply:
> >
> >   kvm->arch.enable_pmu && enable_mediated_pmu && pmu->version;
> >
> > And in literally every correct usage of pmu->passthrough, kvm->arch.enable_pmu
> > and pmu->version have been checked (though implicitly), i.e. KVM can check
> > enable_mediated_pmu and nothing else.
> 
> Ok, too many passthrough_pmu flags indeed confuse readers. Besides these
> dependencies, mediated vPMU also depends on lapic_in_kernel(). We need to
> set enable_mediated_pmu to false as well if lapic_in_kernel() returns false.

No, just kill the entire vPMU.

Also, the need for an in-kernel APIC isn't unique to the mediated PMU.  KVM simply
drops PMIs if there's no APIC.

If we're feeling lucky, we could try a breaking change like so:

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index fcd188cc389a..bb08155f6198 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -817,7 +817,7 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
        pmu->pebs_data_cfg_mask = ~0ull;
        bitmap_zero(pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX);
 
-       if (!vcpu->kvm->arch.enable_pmu)
+       if (!vcpu->kvm->arch.enable_pmu || !lapic_in_kernel(vcpu))
                return;
 
        static_call(kvm_x86_pmu_refresh)(vcpu);


If we don't want to risk breaking weird setups, we could restrict the behavior
to the mediated PMU being enabled:

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index fcd188cc389a..bc9673190574 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -817,7 +817,8 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
        pmu->pebs_data_cfg_mask = ~0ull;
        bitmap_zero(pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX);
 
-       if (!vcpu->kvm->arch.enable_pmu)
+       if (!vcpu->kvm->arch.enable_pmu ||
+           (!lapic_in_kernel(vcpu) && enable_mediated_pmu))
                return;
 
        static_call(kvm_x86_pmu_refresh)(vcpu);

^ permalink raw reply related	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 18/58] KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter
  2024-11-20  3:21     ` Mi, Dapeng
@ 2024-11-20 17:06       ` Sean Christopherson
  2025-01-15  0:17       ` Mingwei Zhang
  1 sibling, 0 replies; 183+ messages in thread
From: Sean Christopherson @ 2024-11-20 17:06 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Mingwei Zhang, Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Wed, Nov 20, 2024, Dapeng Mi wrote:
> On 11/19/2024 10:30 PM, Sean Christopherson wrote:
> >> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> >> index ad465881b043..2ad122995f11 100644
> >> --- a/arch/x86/kvm/vmx/vmx.c
> >> +++ b/arch/x86/kvm/vmx/vmx.c
> >> @@ -146,6 +146,8 @@ module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
> >>  extern bool __read_mostly allow_smaller_maxphyaddr;
> >>  module_param(allow_smaller_maxphyaddr, bool, S_IRUGO);
> >>  
> >> +module_param(enable_passthrough_pmu, bool, 0444);
> > Hmm, we either need to put this param in kvm.ko, or move enable_pmu to vendor
> > modules (or duplicate it there if we need to for backwards compatibility?).
> >
> > There are advantages to putting params in vendor modules, when it's safe to do so,
> > e.g. it allows toggling the param when (re)loading a vendor module, so I think I'm
> > supportive of having the param live in vendor code.  I just don't want to split
> > the two PMU knobs.
> 
> Since enable_passthrough_pmu is already in the vendor modules, we'd better
> duplicate the enable_pmu module parameter in the vendor modules as the first step.

Paolo,

Any thoughts on having enable_pmu in vendor modules, vs. putting enable_mediated_pmu
in kvm.ko?  I don't have a strong opinion, there are pros and cons to both.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 20/58] KVM: x86/pmu: Always set global enable bits in passthrough mode
  2024-11-20  5:19     ` Mi, Dapeng
@ 2024-11-20 17:09       ` Sean Christopherson
  2024-11-21  0:37         ` Mi, Dapeng
  0 siblings, 1 reply; 183+ messages in thread
From: Sean Christopherson @ 2024-11-20 17:09 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Mingwei Zhang, Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Wed, Nov 20, 2024, Dapeng Mi wrote:
> On 11/19/2024 11:37 PM, Sean Christopherson wrote:
> >> ---
> >>  arch/x86/kvm/pmu.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> >> index 5768ea2935e9..e656f72fdace 100644
> >> --- a/arch/x86/kvm/pmu.c
> >> +++ b/arch/x86/kvm/pmu.c
> >> @@ -787,7 +787,7 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
> >>  	 * in the global controls).  Emulate that behavior when refreshing the
> >>  	 * PMU so that userspace doesn't need to manually set PERF_GLOBAL_CTRL.
> >>  	 */
> >> -	if (kvm_pmu_has_perf_global_ctrl(pmu) && pmu->nr_arch_gp_counters)
> >> +	if ((pmu->passthrough || kvm_pmu_has_perf_global_ctrl(pmu)) && pmu->nr_arch_gp_counters)
> >>  		pmu->global_ctrl = GENMASK_ULL(pmu->nr_arch_gp_counters - 1, 0);
> > This is wrong and confusing.  From the guest's perspective, and therefore from
> > host userspace's perspective, PERF_GLOBAL_CTRL does not exist.  Therefore, the
> > value that is tracked for the guest must be '0'.
> >
> > I see that intel_passthrough_pmu_msrs() and amd_passthrough_pmu_msrs() intercept
> > accesses to PERF_GLOBAL_CTRL if "pmu->version > 1" (which, by the by, needs to be
> > kvm_pmu_has_perf_global_ctrl()), so there's no weirdness with the guest being able
> > to access MSRs that shouldn't exist.
> >
> > But KVM shouldn't stuff pmu->global_ctrl, and doing so is a symptom of another
> > flaw.  Unless I'm missing something, KVM stuffs pmu->global_ctrl so that the
> > correct value is loaded on VM-Enter, but loading and saving PERF_GLOBAL_CTRL on
> > entry/exit is unnecessary and confusing, as is loading the associated MSRs when
> > restoring (loading) the guest context.
> >
> > For PERF_GLOBAL_CTRL on Intel, KVM needs to ensure all GP counters are enabled in
> > VMCS.GUEST_IA32_PERF_GLOBAL_CTRL, but that's a "set once and forget" operation,
> > not something that needs to be done on every entry and exit.  Of course, loading
> > and saving PERF_GLOBAL_CTRL on every entry/exit is unnecessary for other reasons,
> > but that's largely orthogonal.
> >
> > On AMD, amd_restore_pmu_context()[*] needs to enable a maximal value for
> > PERF_GLOBAL_CTRL, but I don't think there's any need to load the other MSRs,
> > and the maximal value should come from the above logic, not pmu->global_ctrl.
> 
> Sean, just to double-confirm: you are suggesting doing a one-shot
> initialization of guest PERF_GLOBAL_CTRL (VMCS.GUEST_IA32_PERF_GLOBAL_CTRL
> for Intel) after vCPU reset, right?

No, it would need to be written during refresh().  VMCS.GUEST_IA32_PERF_GLOBAL_CTRL
is only static (because it's unreachable) if the guest does NOT have version > 1.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 24/58] KVM: x86/pmu: Introduce macro PMU_CAP_PERF_METRICS
  2024-11-20  5:44     ` Mi, Dapeng
@ 2024-11-20 17:21       ` Sean Christopherson
  0 siblings, 0 replies; 183+ messages in thread
From: Sean Christopherson @ 2024-11-20 17:21 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Mingwei Zhang, Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Wed, Nov 20, 2024, Dapeng Mi wrote:
> 
> On 11/20/2024 1:03 AM, Sean Christopherson wrote:
> > On Thu, Aug 01, 2024, Mingwei Zhang wrote:
> >> From: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >>
> >> Define macro PMU_CAP_PERF_METRICS to represent bit[15] of
> >> MSR_IA32_PERF_CAPABILITIES MSR. This bit is used to represent whether
> >> perf metrics feature is enabled.
> >>
> >> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> >> ---
> >>  arch/x86/kvm/vmx/capabilities.h | 1 +
> >>  1 file changed, 1 insertion(+)
> >>
> >> diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
> >> index 41a4533f9989..d8317552b634 100644
> >> --- a/arch/x86/kvm/vmx/capabilities.h
> >> +++ b/arch/x86/kvm/vmx/capabilities.h
> >> @@ -22,6 +22,7 @@ extern int __read_mostly pt_mode;
> >>  #define PT_MODE_HOST_GUEST	1
> >>  
> >>  #define PMU_CAP_FW_WRITES	(1ULL << 13)
> >> +#define PMU_CAP_PERF_METRICS	BIT_ULL(15)
> > BIT() should suffice.  The 1ULL used for FW_WRITES is unnecessary.  Speaking of
> > which, can you update the other #defines while you're at it?  The mix of styles
> > annoys me :-)
> >
> > #define PMU_CAP_FW_WRITES	BIT(13)
> > #define PMU_CAP_PERF_METRICS	BIT(15)
> > #define PMU_CAP_LBR_FMT		GENMASK(5, 0)
> 
> Sure.  Could we further move all these  PERF_CAPBILITIES macros into
> arch/x86/include/asm/msr-index.h?

Yes, definitely.  I didn't even realize this was KVM code.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 31/58] KVM: x86/pmu: Add counter MSR and selector MSR index into struct kvm_pmc
  2024-11-20 11:50     ` Mi, Dapeng
@ 2024-11-20 17:30       ` Sean Christopherson
  2024-11-21  0:56         ` Mi, Dapeng
  0 siblings, 1 reply; 183+ messages in thread
From: Sean Christopherson @ 2024-11-20 17:30 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Mingwei Zhang, Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Wed, Nov 20, 2024, Dapeng Mi wrote:
> 
> On 11/20/2024 2:58 AM, Sean Christopherson wrote:
> > Please squash this with the patch that does the actual save/load.  Hmm, maybe it
> > should be put/load, now that I think about it more?  That's more consitent with
> > existing KVM terminology.
> 
> Sure. I noticed this inconsistency before, but "put" seemed less intuitive
> than "save", so I didn't change it.

Yeah, "put" isn't perfect, but neither is "save", because the save/put path also
purges hardware state.

> >> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> >> index 4b3ce6194bdb..603727312f9c 100644
> >> --- a/arch/x86/include/asm/kvm_host.h
> >> +++ b/arch/x86/include/asm/kvm_host.h
> >> @@ -522,6 +522,8 @@ struct kvm_pmc {
> >>  	 */
> >>  	u64 emulated_counter;
> >>  	u64 eventsel;
> >> +	u64 msr_counter;
> >> +	u64 msr_eventsel;
> > There's no need to track these per PMC, the tracking can be per PMU, e.g.
> >
> > 	u64 gp_eventsel_base;
> > 	u64 gp_counter_base;
> > 	u64 gp_shift;
> > 	u64 fixed_base;
> >
> > Actually, there's no need for a per-PMU fixed base, as that can be shoved into
> > kvm_pmu_ops.  LOL, and the upcoming patch hardcodes INTEL_PMC_FIXED_RDPMC_BASE.
> > Naughty, naughty ;-)
> >
> > It's not pretty, but 16 bytes per PMC isn't trivial. 
> >
> > Hmm, actually, scratch all that.  A better alternative would be to provide a
> > helper to put/load counter/selector MSRs, and call that from vendor code.  Ooh,
> > I think there's a bug here.  On AMD, the guest event selector MSRs need to be
> > loaded _before_ PERF_GLOBAL_CTRL, no?  I.e. enable the guest's counters only
> > after all selectors have been switched AMD64_EVENTSEL_GUESTONLY.  Otherwise there
> > would be a brief window where KVM could incorrectly enable counters in the host.
> > And the reverse that for put().
> >
> > But Intel has the opposite ordering, because MSR_CORE_PERF_GLOBAL_CTRL needs to
> > be cleared before changing event selectors.
> 
> Not quite sure about AMD platforms, but it seems both Intel and AMD
> platforms follow the sequence below to manipulate PMU MSRs:
> 
> disable PERF_GLOBAL_CTRL MSR
> 
> manipulate counter-level PMU MSR
> 
> enable PERF_GLOBAL_CTRL MSR

Nope.  kvm_pmu_restore_pmu_context() does:

	static_call_cond(kvm_x86_pmu_restore_pmu_context)(vcpu);


	/*
	 * No need to zero out unexposed GP/fixed counters/selectors since RDPMC
	 * in this case will be intercepted. Accessing to these counters and
	 * selectors will cause #GP in the guest.
	 */
	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
		pmc = &pmu->gp_counters[i];
		wrmsrl(pmc->msr_counter, pmc->counter);
		wrmsrl(pmc->msr_eventsel, pmu->gp_counters[i].eventsel_hw);
	}

	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
		pmc = &pmu->fixed_counters[i];
		wrmsrl(pmc->msr_counter, pmc->counter);
	}

And amd_restore_pmu_context() does:

	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, 0);
	rdmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS, global_status);

	/* Clear host global_status MSR if non-zero. */
	if (global_status)
		wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR, global_status);

	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET, pmu->global_status);

	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, pmu->global_ctrl);

So the sequence on AMD is currently:

  disable PERF_GLOBAL_CTRL

  save host PERF_GLOBAL_STATUS 

  load guest PERF_GLOBAL_STATUS (clear+set)

  load guest PERF_GLOBAL_CTRL

  load guest per-counter MSRs
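
A rough sketch of the put/load helper idea (helper name made up, field names
taken from the snippet above), so each vendor can order the call around
PERF_GLOBAL_CTRL as its architecture requires, i.e. Intel after clearing
PERF_GLOBAL_CTRL, AMD before loading the guest's PERF_GLOBAL_CTRL:

static void kvm_pmu_load_guest_pmcs(struct kvm_vcpu *vcpu)
{
	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
	struct kvm_pmc *pmc;
	int i;

	/*
	 * Load guest values for the per-counter MSRs only; the global
	 * control MSRs are handled by the vendor callback around this helper.
	 */
	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
		pmc = &pmu->gp_counters[i];
		wrmsrl(pmc->msr_counter, pmc->counter);
		wrmsrl(pmc->msr_eventsel, pmc->eventsel_hw);
	}

	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
		pmc = &pmu->fixed_counters[i];
		wrmsrl(pmc->msr_counter, pmc->counter);
	}
}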

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 28/58] KVM: x86/pmu: Add intel_passthrough_pmu_msrs() to pass-through PMU MSRs
  2024-11-20 10:12     ` Mi, Dapeng
@ 2024-11-20 18:32       ` Sean Christopherson
  0 siblings, 0 replies; 183+ messages in thread
From: Sean Christopherson @ 2024-11-20 18:32 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Mingwei Zhang, Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Wed, Nov 20, 2024, Dapeng Mi wrote:
> On 11/20/2024 2:24 AM, Sean Christopherson wrote:
> >> +	    pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp &&
> >> +	    pmu->nr_arch_fixed_counters == kvm_pmu_cap.num_counters_fixed)
> >> +		msr_intercept = false;
> >> +	else
> >> +		msr_intercept = true;
> > This reinforces that checking PERF_CAPABILITIES for PERF_METRICS is likely doomed
> > to fail, because doesn't PERF_GLOBAL_CTRL need to be intercepted, strictly speaking,
> > to prevent setting EN_PERF_METRICS?
> 
> Sean, do you mean we need to check whether the guest supports PERF_METRICS
> here? If not supported, we need to keep the global MSRs intercepted and
> prevent the guest from trying to enable PERF_METRICS, right?

Yep, exactly.  And the "nr_arch_fixed_counters == num_counters_fixed" check
takes care of things here as well.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86
  2024-11-20 11:55 ` Mi, Dapeng
@ 2024-11-20 18:34   ` Sean Christopherson
  0 siblings, 0 replies; 183+ messages in thread
From: Sean Christopherson @ 2024-11-20 18:34 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Mingwei Zhang, Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Wed, Nov 20, 2024, Dapeng Mi wrote:
> Hi Sean,
> 
> Do you think we can remove the "RFC" tag in the next version of the patchset?
> You and Peter have reviewed the proposal thoroughly (thanks for your review
> efforts) and it seems there are no hard obstacles besides the implementation
> details. Thanks.

No objection on my end.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 38/58] KVM: x86/pmu: Exclude existing vLBR logic from the passthrough PMU
  2024-08-01  4:58 ` [RFC PATCH v3 38/58] KVM: x86/pmu: Exclude existing vLBR logic from the passthrough PMU Mingwei Zhang
@ 2024-11-20 18:42   ` Sean Christopherson
  2024-11-21  1:13     ` Mi, Dapeng
  0 siblings, 1 reply; 183+ messages in thread
From: Sean Christopherson @ 2024-11-20 18:42 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Dapeng Mi, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Thu, Aug 01, 2024, Mingwei Zhang wrote:
> Exclude the existing vLBR logic from the passthrough PMU because it does not
> support the LBR-related MSRs. To avoid any side effects, do not call the
> vLBR-related code in either vcpu_enter_guest() or the PMI injection function.

This is unnecessary.  PMU_CAP_LBR_FMT will be cleared in kvm_caps.supported_perf_cap
when the mediated PMU is enabled, which will prevent relevant bits from being set
in the vCPU's PERF_CAPABILITIES, and that in turn will ensure the number of LBR
records is always zero.

If we wanted a sanity check, then it should go in intel_pmu_refresh().  But I don't
think that's justified.  E.g. legacy LBRs are incompatible with arch LBRs.  At some
point we have to rely on us not screwing up.

A selftest for this though, that's a different story, but we already have coverage
thanks to vmx_pmu_caps_test.c.  If we wanted to be paranoid, that test could be
extended to assert that LBRs are unsupported if the mediated PMU is enabled.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 40/58] KVM: x86/pmu: Grab x86 core PMU for passthrough PMU VM
  2024-08-01  4:58 ` [RFC PATCH v3 40/58] KVM: x86/pmu: Grab x86 core PMU for passthrough PMU VM Mingwei Zhang
@ 2024-11-20 18:46   ` Sean Christopherson
  2024-11-21  2:04     ` Mi, Dapeng
  0 siblings, 1 reply; 183+ messages in thread
From: Sean Christopherson @ 2024-11-20 18:46 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Dapeng Mi, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Thu, Aug 01, 2024, Mingwei Zhang wrote:
> From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> 
> When the passthrough PMU is enabled by KVM and perf, KVM calls
> perf_get_mediated_pmu() to take exclusive ownership of the x86 core PMU at VM
> creation, and calls perf_put_mediated_pmu() to return the x86 core PMU to
> host perf at VM destruction.
> 
> When perf_get_mediated_pmu() fails, the host has system-wide perf events
> without exclude_guest = 1, which must be disabled to enable a VM with the
> passthrough PMU.

I still much prefer my idea of making the mediated PMU opt-in.  I haven't seen
any argument against that approach.

https://lore.kernel.org/all/ZiFGRFb45oZrmqnJ@google.com

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 43/58] KVM: x86/pmu: Introduce PMU operator for setting counter overflow
  2024-10-25 16:16   ` Chen, Zide
  2024-10-27 12:06     ` Mi, Dapeng
@ 2024-11-20 18:48     ` Sean Christopherson
  2024-11-21  2:05       ` Mi, Dapeng
  1 sibling, 1 reply; 183+ messages in thread
From: Sean Christopherson @ 2024-11-20 18:48 UTC (permalink / raw)
  To: Zide Chen
  Cc: Mingwei Zhang, Paolo Bonzini, Xiong Zhang, Dapeng Mi, Kan Liang,
	Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
	Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
	Samantha Alt, Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Fri, Oct 25, 2024, Zide Chen wrote:
> 
> 
> On 7/31/2024 9:58 PM, Mingwei Zhang wrote:
> > Introduce a PMU operator for setting counter overflow. When emulating counter
> > increments, multiple counters could overflow at the same time, i.e., during
> > the execution of the same instruction. In the passthrough PMU, having a PMU
> > operator makes it convenient to update the PMU global status in one shot,
> > with the details hidden behind the vendor-specific implementation.
> 
> Since neither Intel nor AMD does implement this API, this patch should
> be dropped.

For all of these small APIs, please introduce and use the API in the same patch.
That avoids goofs like this, where something is never used, and makes the patches
far easier to review.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 44/58] KVM: x86/pmu: Implement emulated counter increment for passthrough PMU
  2024-08-01  4:58 ` [RFC PATCH v3 44/58] KVM: x86/pmu: Implement emulated counter increment for passthrough PMU Mingwei Zhang
@ 2024-11-20 20:13   ` Sean Christopherson
  2024-11-21  2:27     ` Mi, Dapeng
  0 siblings, 1 reply; 183+ messages in thread
From: Sean Christopherson @ 2024-11-20 20:13 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Dapeng Mi, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Thu, Aug 01, 2024, Mingwei Zhang wrote:
> Implement emulated counter increment for the passthrough PMU under KVM_REQ_PMU.
> Defer the counter increment to the KVM_REQ_PMU handler because counter
> increment requests come from kvm_pmu_trigger_event(), which can be triggered
> within the KVM_RUN inner loop or outside of the inner loop. This means the
> counter increment could happen before or after the PMU context switch.
> 
> So processing the counter increment in one place keeps the implementation
> simple.
> 
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
>  arch/x86/kvm/pmu.c | 41 +++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 39 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 5cc539bdcc7e..41057d0122bd 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -510,6 +510,18 @@ static int reprogram_counter(struct kvm_pmc *pmc)
>  				     eventsel & ARCH_PERFMON_EVENTSEL_INT);
>  }
>  
> +static void kvm_pmu_handle_event_in_passthrough_pmu(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +
> +	static_call_cond(kvm_x86_pmu_set_overflow)(vcpu);
> +
> +	if (atomic64_read(&pmu->__reprogram_pmi)) {
> +		kvm_make_request(KVM_REQ_PMI, vcpu);
> +		atomic64_set(&pmu->__reprogram_pmi, 0ull);
> +	}
> +}
> +
>  void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
>  {
>  	DECLARE_BITMAP(bitmap, X86_PMC_IDX_MAX);
> @@ -517,6 +529,9 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
>  	struct kvm_pmc *pmc;
>  	int bit;
>  
> +	if (is_passthrough_pmu_enabled(vcpu))
> +		return kvm_pmu_handle_event_in_passthrough_pmu(vcpu);
> +
>  	bitmap_copy(bitmap, pmu->reprogram_pmi, X86_PMC_IDX_MAX);
>  
>  	/*
> @@ -848,6 +863,17 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
>  	kvm_pmu_reset(vcpu);
>  }
>  
> +static void kvm_passthrough_pmu_incr_counter(struct kvm_vcpu *vcpu, struct kvm_pmc *pmc)
> +{
> +	if (static_call(kvm_x86_pmu_incr_counter)(pmc)) {

This is absurd.  It's the same ugly code in both Intel and AMD.

static bool intel_incr_counter(struct kvm_pmc *pmc)
{
	pmc->counter += 1;
	pmc->counter &= pmc_bitmask(pmc);

	if (!pmc->counter)
		return true;

	return false;
}

static bool amd_incr_counter(struct kvm_pmc *pmc)
{
	pmc->counter += 1;
	pmc->counter &= pmc_bitmask(pmc);

	if (!pmc->counter)
		return true;

	return false;
}

> +		__set_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->global_status);

Using __set_bit() is unnecessary, ugly, and dangerous.  KVM uses set_bit(), no
underscores, for things like reprogram_pmi because the updates need to be atomic.

The downside of __set_bit() and friends is that if pmc->idx is garbage, KVM will
clobber memory, whereas BIT_ULL(pmc->idx) is "just" undefined behavior.  But
dropping the update is far better than clobbering memory, and can be detected by
UBSAN (though I doubt anyone is hitting this code with UBSAN).

For this code, a regular ol' bitwise-OR will suffice.  
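
I.e. just:

	pmc_to_pmu(pmc)->global_status |= BIT_ULL(pmc->idx);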

> +		kvm_make_request(KVM_REQ_PMU, vcpu);
> +
> +		if (pmc->eventsel & ARCH_PERFMON_EVENTSEL_INT)
> +			set_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->reprogram_pmi);

This is badly in need of a comment, and the ordering is unnecessarily weird.
Set bits in reprogram_pmi *before* making the request.  It doesn't matter here
since this is all on the same vCPU, but it's good practice since KVM_REQ_XXX
provides the necessary barriers to allow for safe, correct cross-CPU updates.

That said, why on earth is the mediated PMU using KVM_REQ_PMU?  Set global_status
and KVM_REQ_PMI, done.

> +	}
> +}
> +
>  static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
>  {
>  	pmc->emulated_counter++;
> @@ -880,7 +906,8 @@ static inline bool cpl_is_matched(struct kvm_pmc *pmc)
>  	return (static_call(kvm_x86_get_cpl)(pmc->vcpu) == 0) ? select_os : select_user;
>  }
>  
> -void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel)
> +static void __kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel,
> +				    bool is_passthrough)
>  {
>  	DECLARE_BITMAP(bitmap, X86_PMC_IDX_MAX);
>  	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> @@ -914,9 +941,19 @@ void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel)
>  		    !pmc_event_is_allowed(pmc) || !cpl_is_matched(pmc))
>  			continue;
>  
> -		kvm_pmu_incr_counter(pmc);
> +		if (is_passthrough)
> +			kvm_passthrough_pmu_incr_counter(vcpu, pmc);
> +		else
> +			kvm_pmu_incr_counter(pmc);
>  	}
>  }
> +
> +void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel)
> +{
> +	bool is_passthrough = is_passthrough_pmu_enabled(vcpu);
> +
> +	__kvm_pmu_trigger_event(vcpu, eventsel, is_passthrough);

Using an inner helper for this is silly, even if the mediated information were
snapshot per-vCPU.  Just grab the snapshot in a local variable.  Using a param
adds no value and unnecessarily obfuscates the code.

That's all a moot point though, because (a) KVM can check enable_mediated_pmu
directly and (b) pivoting on behavior belongs in kvm_pmu_incr_counter(), not here.

And I am leaning towards having the mediated vs. perf-based code live in the same
function, unless one or both is "huge", so that it's easier to understand and
appreciate the differences in the implementations.

Not an action item for y'all, but this is also a great time to add comments, which
are sorely lacking in the code.  I am more than happy to do that, as it helps me
understand (and thus review) the code.  I'll throw in suggestions here and there
as I review.

Anyways, this?

static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
{
	/*
	 * For perf-based PMUs, accumulate software-emulated events separately
	 * from pmc->counter, as pmc->counter is offset by the count of the
	 * associated perf event.  Request reprogramming, which will consult
	 * both emulated and hardware-generated events to detect overflow.
	 */
	if (!enable_mediated_pmu) {
		pmc->emulated_counter++;
		kvm_pmu_request_counter_reprogram(pmc);
		return;
	}

	/*
	 * For mediated PMUs, pmc->counter is updated when the vCPU's PMU is
	 * put, and will be loaded into hardware when the PMU is loaded.  Simply
	 * increment the counter and signal overflow if it wraps to zero.
	 */
	pmc->counter = (pmc->counter + 1) & pmc_bitmask(pmc);
	if (!pmc->counter) {
		pmc_to_pmu(pmc)->global_status |= BIT_ULL(pmc->idx);
		kvm_make_request(KVM_REQ_PMI, pmc->vcpu);
	}
}

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 45/58] KVM: x86/pmu: Update pmc_{read,write}_counter() to disconnect perf API
  2024-08-01  4:58 ` [RFC PATCH v3 45/58] KVM: x86/pmu: Update pmc_{read,write}_counter() to disconnect perf API Mingwei Zhang
@ 2024-11-20 20:19   ` Sean Christopherson
  2024-11-21  2:52     ` Mi, Dapeng
  0 siblings, 1 reply; 183+ messages in thread
From: Sean Christopherson @ 2024-11-20 20:19 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Dapeng Mi, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Thu, Aug 01, 2024, Mingwei Zhang wrote:
> Update pmc_{read,write}_counter() to disconnect from the perf API because the
> passthrough PMU does not use the host PMU as a backend. Because of that,
> pmc->counter directly contains the actual guest value when it is set by the
> host (VMM) side.
> 
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
>  arch/x86/kvm/pmu.c | 5 +++++
>  arch/x86/kvm/pmu.h | 4 ++++
>  2 files changed, 9 insertions(+)
> 
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 41057d0122bd..3604cf467b34 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -322,6 +322,11 @@ static void pmc_update_sample_period(struct kvm_pmc *pmc)
>  
>  void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
>  {
> +	if (pmc_to_pmu(pmc)->passthrough) {
> +		pmc->counter = val;

This needs to mask the value with pmc_bitmask(pmc), otherwise emulated events
will operate on a bad value, and loading the PMU state into hardware will #GP
if the PMC is written through the sign-extended MSRs, i.e. if val = -1 and the
CPU supports full-width writes.
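
E.g. something like:

	if (pmc_to_pmu(pmc)->passthrough) {
		pmc->counter = val & pmc_bitmask(pmc);
		return;
	}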

> +		return;
> +	}
> +
>  	/*
>  	 * Drop any unconsumed accumulated counts, the WRMSR is a write, not a
>  	 * read-modify-write.  Adjust the counter value so that its value is
> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index 78a7f0c5f3ba..7e006cb61296 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -116,6 +116,10 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
>  {
>  	u64 counter, enabled, running;
>  
> +	counter = pmc->counter;

Using a local variable is pointless, the perf-based path immediately clobbers it.

> +	if (pmc_to_pmu(pmc)->passthrough)
> +		return counter & pmc_bitmask(pmc);

And then this can simply return pmc->counter.  We _could_ add a WARN on pmc->counter
overlapping with pmc_bitmask(), but IMO that's unnecessary.  If anything, WARN and
mask pmc->counter when loading state into hardware.
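
I.e. just:

	if (pmc_to_pmu(pmc)->passthrough)
		return pmc->counter;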

> +
>  	counter = pmc->counter + pmc->emulated_counter;
>  
>  	if (pmc->perf_event && !pmc->is_paused)
> -- 
> 2.46.0.rc1.232.g9752f9e123-goog
> 

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 46/58] KVM: x86/pmu: Disconnect counter reprogram logic from passthrough PMU
  2024-08-01  4:58 ` [RFC PATCH v3 46/58] KVM: x86/pmu: Disconnect counter reprogram logic from passthrough PMU Mingwei Zhang
@ 2024-11-20 20:40   ` Sean Christopherson
  2024-11-21  3:02     ` Mi, Dapeng
  0 siblings, 1 reply; 183+ messages in thread
From: Sean Christopherson @ 2024-11-20 20:40 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Dapeng Mi, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Thu, Aug 01, 2024, Mingwei Zhang wrote:
> Disconnect the counter reprogram logic because the passthrough PMU never uses
> the host PMU, nor does it use the perf API for anything. Instead, when the
> passthrough PMU is enabled, touching anything around the counter reprogram
> path should be an error.
> 
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
>  arch/x86/kvm/pmu.c | 3 +++
>  arch/x86/kvm/pmu.h | 8 ++++++++
>  2 files changed, 11 insertions(+)
> 
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 3604cf467b34..fcd188cc389a 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -478,6 +478,9 @@ static int reprogram_counter(struct kvm_pmc *pmc)
>  	bool emulate_overflow;
>  	u8 fixed_ctr_ctrl;
>  
> +	if (WARN_ONCE(pmu->passthrough, "Passthrough PMU never reprogram counter\n"))

Eh, a WARN_ON_ONCE() is enough.

That said, this isn't entirely correct.  The mediated PMU needs to "reprogram"
event selectors if the event filter is changed. KVM currently handles this by
setting  all __reprogram_pmi bits and blasting KVM_REQ_PMU.

LOL, and there's even a reprogram_fixed_counters_in_passthrough_pmu(), so the
above message is a lie.

There is also far too much duplicate code in things like reprogram_fixed_counters()
versus reprogram_fixed_counters_in_passthrough_pmu().

Reprogramming on each write is also technically suboptimal, as _very_ theoretically
KVM could emulate multiple WRMSRs without re-entering the guest.

So, I think the mediated PMU should use the reprogramming infrastructure, and
handle the bulk of the different behavior in reprogram_counter(), not in a half
dozen different paths.

> +		return 0;
> +
>  	emulate_overflow = pmc_pause_counter(pmc);
>  
>  	if (!pmc_event_is_allowed(pmc))
> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index 7e006cb61296..10553bc1ae1d 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -256,6 +256,10 @@ static inline void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops)
>  
>  static inline void kvm_pmu_request_counter_reprogram(struct kvm_pmc *pmc)
>  {
> +	/* Passthrough PMU never reprogram counters via KVM_REQ_PMU. */
> +	if (pmc_to_pmu(pmc)->passthrough)
> +		return;
> +
>  	set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi);
>  	kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
>  }
> @@ -264,6 +268,10 @@ static inline void reprogram_counters(struct kvm_pmu *pmu, u64 diff)
>  {
>  	int bit;
>  
> +	/* Passthrough PMU never reprogram counters via KVM_REQ_PMU. */
> +	if (pmu->passthrough)
> +		return;

Make up your mind :-)  Either handle the mediated PMU here, or handle it in the
callers.  Don't do both.

I vote to handle it here, i.e. drop the check in kvm_pmu_set_msr() when handling
MSR_CORE_PERF_GLOBAL_CTRL, and then add a comment that reprogramming GP counters
doesn't need to happen on control updates because KVM enforces the event filters
when emulating writes to event selectors (and because the guest can write
PERF_GLOBAL_CTRL directly).
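
E.g. something like this for the comment (wording is only a suggestion):

	/*
	 * Mediated PMU: GP counters don't need to be reprogrammed on control
	 * updates, as KVM enforces the event filters when emulating writes to
	 * the event selectors, and the guest writes PERF_GLOBAL_CTRL directly.
	 */
	if (pmu->passthrough)
		return;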

> +
>  	if (!diff)
>  		return;
>  
> -- 
> 2.46.0.rc1.232.g9752f9e123-goog
> 

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 47/58] KVM: nVMX: Add nested virtualization support for passthrough PMU
  2024-08-01  4:58 ` [RFC PATCH v3 47/58] KVM: nVMX: Add nested virtualization support for " Mingwei Zhang
@ 2024-11-20 20:52   ` Sean Christopherson
  2024-11-21  3:14     ` Mi, Dapeng
  0 siblings, 1 reply; 183+ messages in thread
From: Sean Christopherson @ 2024-11-20 20:52 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Dapeng Mi, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Thu, Aug 01, 2024, Mingwei Zhang wrote:
> Add nested virtualization support for the passthrough PMU by combining the
> MSR interception bitmaps of vmcs01 and vmcs12. Readers may argue that even
> without this patch, nested virtualization works for the passthrough PMU
> because L1 will see Perfmon v2 and will have to use the legacy vPMU
> implementation if it is Linux. However, any assumption made about L1 may be
> invalid, e.g., L1 may not even be Linux.
> 
> If both L0 and L1 pass through PMU MSRs, the correct behavior is to allow MSR
> accesses from L2 to directly touch the HW MSRs, since both L0 and L1 pass
> through the access.
> 
> However, in the current implementation, without adding anything for nested,
> KVM always sets the MSR interception bits in vmcs02. As a result, L0 will
> emulate all MSR reads/writes for L2, leading to errors, since the current
> passthrough vPMU never implements set_msr() and get_msr() for any counter
> access except accesses from the VMM side.
> 
> So fix the issue by setting up the correct MSR interception for PMU MSRs.
> 
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 52 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 52 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 643935a0f70a..ef385f9e7513 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -612,6 +612,55 @@ static inline void nested_vmx_set_intercept_for_msr(struct vcpu_vmx *vmx,
>  						   msr_bitmap_l0, msr);
>  }
>  
> +/* Pass PMU MSRs to nested VM if L0 and L1 are set to passthrough. */
> +static void nested_vmx_set_passthru_pmu_intercept_for_msr(struct kvm_vcpu *vcpu,
> +							  unsigned long *msr_bitmap_l1,
> +							  unsigned long *msr_bitmap_l0)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	int i;
> +
> +	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> +		nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
> +						 msr_bitmap_l0,
> +						 MSR_ARCH_PERFMON_EVENTSEL0 + i,
> +						 MSR_TYPE_RW);
> +		nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
> +						 msr_bitmap_l0,
> +						 MSR_IA32_PERFCTR0 + i,
> +						 MSR_TYPE_RW);
> +		nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
> +						 msr_bitmap_l0,
> +						 MSR_IA32_PMC0 + i,
> +						 MSR_TYPE_RW);

I think we should add (gross) macros to dedup the bulk of this boilerplate, by
referencing the local variables in the macros.  Like I said, gross.  But I think
it'd be less error prone and easier to read than the copy+paste mess we have today.
E.g. it's easy to miss that only writes are allowed for MSR_IA32_FLUSH_CMD and
MSR_IA32_PRED_CMD, because there's so much boilerplate.

Something like:

#define nested_vmx_merge_msr_bitmaps(msr, type)	\
	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0, msr, type)

#define nested_vmx_merge_msr_bitmaps_read(msr)	\
	nested_vmx_merge_msr_bitmaps(msr, MSR_TYPE_R);

#define nested_vmx_merge_msr_bitmaps_write(msr)	\
	nested_vmx_merge_msr_bitmaps(msr, MSR_TYPE_W);

#define nested_vmx_merge_msr_bitmaps_rw(msr)	\
	nested_vmx_merge_msr_bitmaps(msr, MSR_TYPE_RW);


	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
		nested_vmx_merge_msr_bitmaps_rw(MSR_ARCH_PERFMON_EVENTSEL0 + i);
		nested_vmx_merge_msr_bitmaps_rw(MSR_IA32_PERFCTR0 + i);
		nested_vmx_merge_msr_bitmaps_rw(MSR_IA32_PMC0 + i);
	}

	for (i = 0; i < vcpu_to_pmu(vcpu)->nr_arch_fixed_counters; i++)
		nested_vmx_merge_msr_bitmaps_rw(MSR_CORE_PERF_FIXED_CTR0 + i);

	blah blah blah

> +	}
> +
> +	for (i = 0; i < vcpu_to_pmu(vcpu)->nr_arch_fixed_counters; i++) {

Curly braces are unnecessary.

> +		nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
> +						 msr_bitmap_l0,
> +						 MSR_CORE_PERF_FIXED_CTR0 + i,
> +						 MSR_TYPE_RW);
> +	}
> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
> +					 msr_bitmap_l0,
> +					 MSR_CORE_PERF_FIXED_CTR_CTRL,
> +					 MSR_TYPE_RW);
> +
> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
> +					 msr_bitmap_l0,
> +					 MSR_CORE_PERF_GLOBAL_STATUS,
> +					 MSR_TYPE_RW);
> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
> +					 msr_bitmap_l0,
> +					 MSR_CORE_PERF_GLOBAL_CTRL,
> +					 MSR_TYPE_RW);
> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
> +					 msr_bitmap_l0,
> +					 MSR_CORE_PERF_GLOBAL_OVF_CTRL,
> +					 MSR_TYPE_RW);
> +}
> +
>  /*
>   * Merge L0's and L1's MSR bitmap, return false to indicate that
>   * we do not use the hardware.
> @@ -713,6 +762,9 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
>  	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>  					 MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
>  
> +	if (is_passthrough_pmu_enabled(vcpu))
> +		nested_vmx_set_passthru_pmu_intercept_for_msr(vcpu, msr_bitmap_l1, msr_bitmap_l0);

Please wrap.  Or better yet:

	nested_vmx_merge_pmu_msr_bitmaps(vmx, msr_bitmap_1, msr_bitmap_l0);

and handle the enable_mediated_pmu check in the helper.
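
I.e. have the helper itself bail early, e.g.:

	if (!is_passthrough_pmu_enabled(vcpu))
		return;

(or whatever the enable_mediated_pmu check ends up looking like), so the call
site in nested_vmx_prepare_msr_bitmap() stays unconditional.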

> +
>  	kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
>  
>  	vmx->nested.force_msr_bitmap_recalc = false;
> -- 
> 2.46.0.rc1.232.g9752f9e123-goog
> 

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 52/58] KVM: x86/pmu/svm: Implement callback to disable MSR interception
  2024-08-01  4:59 ` [RFC PATCH v3 52/58] KVM: x86/pmu/svm: Implement callback to disable MSR interception Mingwei Zhang
@ 2024-11-20 21:02   ` Sean Christopherson
  2024-11-21  3:24     ` Mi, Dapeng
  0 siblings, 1 reply; 183+ messages in thread
From: Sean Christopherson @ 2024-11-20 21:02 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Dapeng Mi, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users, Aaron Lewis

+Aaron

On Thu, Aug 01, 2024, Mingwei Zhang wrote:
> +static void amd_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +	struct vcpu_svm *svm = to_svm(vcpu);
> +	int msr_clear = !!(is_passthrough_pmu_enabled(vcpu));
> +	int i;
> +
> +	for (i = 0; i < min(pmu->nr_arch_gp_counters, AMD64_NUM_COUNTERS); i++) {
> +		/*
> +		 * Legacy counters are always available irrespective of any
> +		 * CPUID feature bits and when X86_FEATURE_PERFCTR_CORE is set,
> +		 * PERF_LEGACY_CTLx and PERF_LEGACY_CTRx registers are mirrored
> +		 * with PERF_CTLx and PERF_CTRx respectively.
> +		 */
> +		set_msr_interception(vcpu, svm->msrpm, MSR_K7_EVNTSEL0 + i, 0, 0);
> +		set_msr_interception(vcpu, svm->msrpm, MSR_K7_PERFCTR0 + i, msr_clear, msr_clear);
> +	}
> +
> +	for (i = 0; i < kvm_pmu_cap.num_counters_gp; i++) {
> +		/*
> +		 * PERF_CTLx registers require interception in order to clear
> +		 * HostOnly bit and set GuestOnly bit. This is to prevent the
> +		 * PERF_CTRx registers from counting before VM entry and after
> +		 * VM exit.
> +		 */
> +		set_msr_interception(vcpu, svm->msrpm, MSR_F15H_PERF_CTL + 2 * i, 0, 0);
> +
> +		/*
> +		 * Pass through counters exposed to the guest and intercept
> +		 * counters that are unexposed. Do this explicitly since this
> +		 * function may be set multiple times before vcpu runs.
> +		 */
> +		if (i >= pmu->nr_arch_gp_counters)
> +			msr_clear = 0;

Similar to my comments on the Intel side, explicitly enable interception for
MSRs that don't exist in the guest model in a separate for-loop, i.e. don't
toggle msr_clear in the middle of a loop.
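
E.g. for the PERF_CTRx side, something like (sketch only, reusing the existing
set_msr_interception() and its inverted polarity):

	/* Counters exposed to the guest: pass through iff the mediated PMU is on. */
	for (i = 0; i < pmu->nr_arch_gp_counters; i++)
		set_msr_interception(vcpu, svm->msrpm, MSR_F15H_PERF_CTR + 2 * i,
				     msr_clear, msr_clear);

	/* Counters that don't exist in the guest model: always intercept. */
	for ( ; i < kvm_pmu_cap.num_counters_gp; i++)
		set_msr_interception(vcpu, svm->msrpm, MSR_F15H_PERF_CTR + 2 * i, 0, 0);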

I would also love to de-dup the bulk of this code, which is very doable since
the base+shift for the MSRs is going to be stashed in kvm_pmu.  All that's needed
on top is unified MSR interception logic, which is something that's been on my
wish list for some time.  SVM's inverted polarity needs to die a horrible death.

Lucky for me, Aaron is picking up that torch.

Aaron, what's your ETA on the MSR unification?  No rush, but if you think it'll
be ready in the next month or so, I'll plan on merging that first and landing
this code on top.

> +		set_msr_interception(vcpu, svm->msrpm, MSR_F15H_PERF_CTR + 2 * i, msr_clear, msr_clear);
> +	}
> +
> +	/*
> +	 * In mediated passthrough vPMU, intercept global PMU MSRs when guest
> +	 * PMU only owns a subset of counters provided in HW or its version is
> +	 * less than 2.
> +	 */
> +	if (is_passthrough_pmu_enabled(vcpu) && pmu->version > 1 &&

kvm_pmu_has_perf_global_ctrl(), no?

> +	    pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp)
> +		msr_clear = 1;
> +	else
> +		msr_clear = 0;
> +
> +	set_msr_interception(vcpu, svm->msrpm, MSR_AMD64_PERF_CNTR_GLOBAL_CTL, msr_clear, msr_clear);
> +	set_msr_interception(vcpu, svm->msrpm, MSR_AMD64_PERF_CNTR_GLOBAL_STATUS, msr_clear, msr_clear);
> +	set_msr_interception(vcpu, svm->msrpm, MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR, msr_clear, msr_clear);
> +	set_msr_interception(vcpu, svm->msrpm, MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET, msr_clear, msr_clear);
> +}
> +
>  struct kvm_pmu_ops amd_pmu_ops __initdata = {
>  	.rdpmc_ecx_to_pmc = amd_rdpmc_ecx_to_pmc,
>  	.msr_idx_to_pmc = amd_msr_idx_to_pmc,
> @@ -258,6 +312,7 @@ struct kvm_pmu_ops amd_pmu_ops __initdata = {
>  	.refresh = amd_pmu_refresh,
>  	.init = amd_pmu_init,
>  	.is_rdpmc_passthru_allowed = amd_is_rdpmc_passthru_allowed,
> +	.passthrough_pmu_msrs = amd_passthrough_pmu_msrs,
>  	.EVENTSEL_EVENT = AMD64_EVENTSEL_EVENT,
>  	.MAX_NR_GP_COUNTERS = KVM_AMD_PMC_MAX_GENERIC,
>  	.MIN_NR_GP_COUNTERS = AMD64_NUM_COUNTERS,
> -- 
> 2.46.0.rc1.232.g9752f9e123-goog
> 

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 53/58] KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest write to event selectors
  2024-08-01  4:59 ` [RFC PATCH v3 53/58] KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest write to event selectors Mingwei Zhang
@ 2024-11-20 21:38   ` Sean Christopherson
  2024-11-21  3:26     ` Mi, Dapeng
  0 siblings, 1 reply; 183+ messages in thread
From: Sean Christopherson @ 2024-11-20 21:38 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Dapeng Mi, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Thu, Aug 01, 2024, Mingwei Zhang wrote:
> From: Sandipan Das <sandipan.das@amd.com>
> 
> On AMD platforms, there is no way to restore PerfCntrGlobalCtl at
> VM-Entry or clear it at VM-Exit. Since the register states will be
> restored before entering and saved after exiting guest context, the
> counters can keep ticking and even overflow leading to chaos while
> still in host context.
> 
> To avoid this, the PERF_CTLx MSRs (event selectors) are always
> intercepted. KVM will always set the GuestOnly bit and clear the
> HostOnly bit so that the counters run only in guest context even if
> their enable bits are set. Intercepting these MSRs is also necessary
> for guest event filtering.
> 
> Signed-off-by: Sandipan Das <sandipan.das@amd.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/svm/pmu.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> index cc03c3e9941f..2b7cc7616162 100644
> --- a/arch/x86/kvm/svm/pmu.c
> +++ b/arch/x86/kvm/svm/pmu.c
> @@ -165,7 +165,12 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  		data &= ~pmu->reserved_bits;
>  		if (data != pmc->eventsel) {
>  			pmc->eventsel = data;
> -			kvm_pmu_request_counter_reprogram(pmc);
> +			if (is_passthrough_pmu_enabled(vcpu)) {
> +				data &= ~AMD64_EVENTSEL_HOSTONLY;
> +				pmc->eventsel_hw = data | AMD64_EVENTSEL_GUESTONLY;

Do both in a single statment, i.e.

				pmc->eventsel_hw = (data & ~AMD64_EVENTSEL_HOSTONLY) |
						   AMD64_EVENTSEL_GUESTONLY;

Though per my earlier comments, this likely needs to end up in reprogram_counter().

> +			} else {
> +				kvm_pmu_request_counter_reprogram(pmc);
> +			}
>  		}
>  		return 0;
>  	}
> -- 
> 2.46.0.rc1.232.g9752f9e123-goog
> 

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 56/58] KVM: x86/pmu/svm: Wire up PMU filtering functionality for passthrough PMU
  2024-08-01  4:59 ` [RFC PATCH v3 56/58] KVM: x86/pmu/svm: Wire up PMU filtering functionality for passthrough PMU Mingwei Zhang
@ 2024-11-20 21:39   ` Sean Christopherson
  2024-11-21  3:29     ` Mi, Dapeng
  0 siblings, 1 reply; 183+ messages in thread
From: Sean Christopherson @ 2024-11-20 21:39 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Dapeng Mi, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Thu, Aug 01, 2024, Mingwei Zhang wrote:
> From: Manali Shukla <manali.shukla@amd.com>
> 
> With the passthrough PMU enabled, the PERF_CTLx MSRs (event selectors) are
> always intercepted and the event filter check can be done directly inside
> amd_pmu_set_msr().
> 
> Add a check to allow writes to the event selectors for GP counters if and
> only if the event is allowed by the filter.

This belongs in the patch that adds AMD support for setting pmc->eventsel_hw.
E.g. reverting just this patch would leave KVM in a very broken state.  And it's
unnecessarily difficult to review.

> Signed-off-by: Manali Shukla <manali.shukla@amd.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/svm/pmu.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> index 86818da66bbe..9f3e910ee453 100644
> --- a/arch/x86/kvm/svm/pmu.c
> +++ b/arch/x86/kvm/svm/pmu.c
> @@ -166,6 +166,15 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  		if (data != pmc->eventsel) {
>  			pmc->eventsel = data;
>  			if (is_passthrough_pmu_enabled(vcpu)) {
> +				if (!check_pmu_event_filter(pmc)) {
> +					/*
> +					 * When the guest requests an invalid event,
> +					 * stop the counter by clearing the
> +					 * event selector MSR.
> +					 */
> +					pmc->eventsel_hw = 0;
> +					return 0;
> +				}
>  				data &= ~AMD64_EVENTSEL_HOSTONLY;
>  				pmc->eventsel_hw = data | AMD64_EVENTSEL_GUESTONLY;
>  			} else {
> -- 
> 2.46.0.rc1.232.g9752f9e123-goog
> 

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 19/58] KVM: x86/pmu: Plumb through pass-through PMU to vcpu for Intel CPUs
  2024-11-20 16:45       ` Sean Christopherson
@ 2024-11-21  0:29         ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-11-21  0:29 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Mingwei Zhang, Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users


On 11/21/2024 12:45 AM, Sean Christopherson wrote:
> On Wed, Nov 20, 2024, Dapeng Mi wrote:
>> On 11/19/2024 10:54 PM, Sean Christopherson wrote:
>>> On Thu, Aug 01, 2024, Mingwei Zhang wrote:
>>>> Plumb the pass-through PMU setting from kvm->arch into kvm_pmu on each vcpu
>>>> created. Note that enabling the PMU is decided by the VMM when it sets the
>>>> CPUID bits exposed to the guest VM. So plumb through the enabling for each
>>>> pmu in intel_pmu_refresh().
>>> Why?  As with the per-VM snapshot, I see zero reason for this to exist, it's
>>> simply:
>>>
>>>   kvm->arch.enable_pmu && enable_mediated_pmu && pmu->version;
>>>
>>> And in literally every correct usage of pmu->passthrough, kvm->arch.enable_pmu
>>> and pmu->version have been checked (though implicitly), i.e. KVM can check
>>> enable_mediated_pmu and nothing else.
>> Ok, too many passthrough_pmu flags indeed confuse readers. Besides these
>> dependencies, mediated vPMU also depends on lapic_in_kernel(). We need to
>> set enable_mediated_pmu to false as well if lapic_in_kernel() returns false.
> No, just kill the entire vPMU.
>
> Also, the need for an in-kernel APIC isn't unique to the mediated PMU.  KVM simply
> drops PMIs if there's no APIC.
>
> If we're feeling lucky, we could try a breaking change like so:
>
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index fcd188cc389a..bb08155f6198 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -817,7 +817,7 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
>         pmu->pebs_data_cfg_mask = ~0ull;
>         bitmap_zero(pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX);
>  
> -       if (!vcpu->kvm->arch.enable_pmu)
> +       if (!vcpu->kvm->arch.enable_pmu || !lapic_in_kernel(vcpu))
>                 return;
>  
>         static_call(kvm_x86_pmu_refresh)(vcpu);
>
>
> If we don't want to risk breaking weird setups, we could restrict the behavior
> to the mediated PMU being enabled:
>
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index fcd188cc389a..bc9673190574 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -817,7 +817,8 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
>         pmu->pebs_data_cfg_mask = ~0ull;
>         bitmap_zero(pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX);
>  
> -       if (!vcpu->kvm->arch.enable_pmu)
> +       if (!vcpu->kvm->arch.enable_pmu ||
> +           (!lapic_in_kernel(vcpu) && enable_mediated_pmu))
>                 return;
>  
>         static_call(kvm_x86_pmu_refresh)(vcpu);

Sure, I would adopt the latter one to be safe. :)



^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 20/58] KVM: x86/pmu: Always set global enable bits in passthrough mode
  2024-11-20 17:09       ` Sean Christopherson
@ 2024-11-21  0:37         ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-11-21  0:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Mingwei Zhang, Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users


On 11/21/2024 1:09 AM, Sean Christopherson wrote:
> On Wed, Nov 20, 2024, Dapeng Mi wrote:
>> On 11/19/2024 11:37 PM, Sean Christopherson wrote:
>>>> ---
>>>>  arch/x86/kvm/pmu.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>>>> index 5768ea2935e9..e656f72fdace 100644
>>>> --- a/arch/x86/kvm/pmu.c
>>>> +++ b/arch/x86/kvm/pmu.c
>>>> @@ -787,7 +787,7 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
>>>>  	 * in the global controls).  Emulate that behavior when refreshing the
>>>>  	 * PMU so that userspace doesn't need to manually set PERF_GLOBAL_CTRL.
>>>>  	 */
>>>> -	if (kvm_pmu_has_perf_global_ctrl(pmu) && pmu->nr_arch_gp_counters)
>>>> +	if ((pmu->passthrough || kvm_pmu_has_perf_global_ctrl(pmu)) && pmu->nr_arch_gp_counters)
>>>>  		pmu->global_ctrl = GENMASK_ULL(pmu->nr_arch_gp_counters - 1, 0);
>>> This is wrong and confusing.  From the guest's perspective, and therefore from
>>> host userspace's perspective, PERF_GLOBAL_CTRL does not exist.  Therefore, the
>>> value that is tracked for the guest must be '0'.
>>>
>>> I see that intel_passthrough_pmu_msrs() and amd_passthrough_pmu_msrs() intercept
>>> accesses to PERF_GLOBAL_CTRL if "pmu->version > 1" (which, by the by, needs to be
>>> kvm_pmu_has_perf_global_ctrl()), so there's no weirdness with the guest being able
>>> to access MSRs that shouldn't exist.
>>>
>>> But KVM shouldn't stuff pmu->global_ctrl, and doing so is a symptom of another
>>> flaw.  Unless I'm missing something, KVM stuffs pmu->global_ctrl so that the
>>> correct value is loaded on VM-Enter, but loading and saving PERF_GLOBAL_CTRL on
>>> entry/exit is unnecessary and confusing, as is loading the associated MSRs when
>>> restoring (loading) the guest context.
>>>
>>> For PERF_GLOBAL_CTRL on Intel, KVM needs to ensure all GP counters are enabled in
>>> VMCS.GUEST_IA32_PERF_GLOBAL_CTRL, but that's a "set once and forget" operation,
>>> not something that needs to be done on every entry and exit.  Of course, loading
>>> and saving PERF_GLOBAL_CTRL on every entry/exit is unnecessary for other reasons,
>>> but that's largely orthogonal.
>>>
>>> On AMD, amd_restore_pmu_context()[*] needs to enable a maximal value for
>>> PERF_GLOBAL_CTRL, but I don't think there's any need to load the other MSRs,
>>> and the maximal value should come from the above logic, not pmu->global_ctrl.
>> Sean, just to double-confirm, you are suggesting doing a one-shot
>> initialization of guest PERF_GLOBAL_CTRL (VMCS.GUEST_IA32_PERF_GLOBAL_CTRL
>> for Intel) after vCPU reset, right?
> No, it would need to be written during refresh().  VMCS.GUEST_IA32_PERF_GLOBAL_CTRL
> is only static (because it's unreachable) if the guest does NOT have version > 1.
oh, yeah, refresh() instead of reset() to be exact.
>

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 31/58] KVM: x86/pmu: Add counter MSR and selector MSR index into struct kvm_pmc
  2024-11-20 17:30       ` Sean Christopherson
@ 2024-11-21  0:56         ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-11-21  0:56 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Mingwei Zhang, Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users


On 11/21/2024 1:30 AM, Sean Christopherson wrote:
> On Wed, Nov 20, 2024, Dapeng Mi wrote:
>> On 11/20/2024 2:58 AM, Sean Christopherson wrote:
>>> Please squash this with the patch that does the actual save/load.  Hmm, maybe it
>>> should be put/load, now that I think about it more?  That's more consistent with
>>> existing KVM terminology.
>> Sure. I had noticed this inconsistency, but "put" didn't seem as intuitive
>> as "save", so I didn't change it.
> Yeah, "put" isn't perfect, but neither is "save", because the save/put path also
> purges hardware state.
>
>>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>>> index 4b3ce6194bdb..603727312f9c 100644
>>>> --- a/arch/x86/include/asm/kvm_host.h
>>>> +++ b/arch/x86/include/asm/kvm_host.h
>>>> @@ -522,6 +522,8 @@ struct kvm_pmc {
>>>>  	 */
>>>>  	u64 emulated_counter;
>>>>  	u64 eventsel;
>>>> +	u64 msr_counter;
>>>> +	u64 msr_eventsel;
>>> There's no need to track these per PMC, the tracking can be per PMU, e.g.
>>>
>>> 	u64 gp_eventsel_base;
>>> 	u64 gp_counter_base;
>>> 	u64 gp_shift;
>>> 	u64 fixed_base;
>>>
>>> Actually, there's no need for a per-PMU fixed base, as that can be shoved into
>>> kvm_pmu_ops.  LOL, and the upcoming patch hardcodes INTEL_PMC_FIXED_RDPMC_BASE.
>>> Naughty, naughty ;-)
>>>
>>> It's not pretty, but 16 bytes per PMC isn't trivial. 
>>>
>>> Hmm, actually, scratch all that.  A better alternative would be to provide a
>>> helper to put/load counter/selector MSRs, and call that from vendor code.  Ooh,
>>> I think there's a bug here.  On AMD, the guest event selector MSRs need to be
>>> loaded _before_ PERF_GLOBAL_CTRL, no?  I.e. enable the guest's counters only
>>> after all selectors have been switched to AMD64_EVENTSEL_GUESTONLY.  Otherwise there
>>> would be a brief window where KVM could incorrectly enable counters in the host.
>>> And the reverse of that for put().
>>>
>>> But Intel has the opposite ordering, because MSR_CORE_PERF_GLOBAL_CTRL needs to
>>> be cleared before changing event selectors.
>> Not quite sure about AMD platforms, but it seems both Intel and AMD
>> platforms follow the sequence below to manipulate PMU MSRs:
>>
>> disable PERF_GLOBAL_CTRL MSR
>>
>> manipulate counter-level PMU MSRs
>>
>> enable PERF_GLOBAL_CTRL MSR
> Nope.  kvm_pmu_restore_pmu_context() does:
>
> 	static_call_cond(kvm_x86_pmu_restore_pmu_context)(vcpu);
>
>
> 	/*
> 	 * No need to zero out unexposed GP/fixed counters/selectors since RDPMC
> 	 * in this case will be intercepted. Accesses to these counters and
> 	 * selectors will cause #GP in the guest.
> 	 */
> 	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> 		pmc = &pmu->gp_counters[i];
> 		wrmsrl(pmc->msr_counter, pmc->counter);
> 		wrmsrl(pmc->msr_eventsel, pmu->gp_counters[i].eventsel_hw);
> 	}
>
> 	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> 		pmc = &pmu->fixed_counters[i];
> 		wrmsrl(pmc->msr_counter, pmc->counter);
> 	}
>
> And amd_restore_pmu_context() does:
>
> 	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, 0);
> 	rdmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS, global_status);
>
> 	/* Clear host global_status MSR if non-zero. */
> 	if (global_status)
> 		wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR, global_status);
>
> 	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET, pmu->global_status);
>
> 	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, pmu->global_ctrl);
>
> So the sequence on AMD is currently:
>
>   disable PERF_GLOBAL_CTRL
>
>   save host PERF_GLOBAL_STATUS 
>
>   load guest PERF_GLOBAL_STATUS (clear+set)
>
>   load guest PERF_GLOBAL_CTRL
>
>   load guest per-counter MSRs

Checked again, yes, indeed. So the better way to handle this is to define a
common helper to manipulate the counter MSRs in common code, and call this
common helper from the vendor-specific callback.




^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 38/58] KVM: x86/pmu: Exclude existing vLBR logic from the passthrough PMU
  2024-11-20 18:42   ` Sean Christopherson
@ 2024-11-21  1:13     ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-11-21  1:13 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla,
	Sandipan Das, Jim Mattson, Stephane Eranian, Ian Rogers,
	Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 11/21/2024 2:42 AM, Sean Christopherson wrote:
> On Thu, Aug 01, 2024, Mingwei Zhang wrote:
>> Exclude the existing vLBR logic from the passthrough PMU because it does not
>> support the LBR-related MSRs. To avoid any side effects, do not call the
>> vLBR-related code in either vcpu_enter_guest() or the PMI injection function.
> This is unnecessary.  PMU_CAP_LBR_FMT will be cleared in kvm_caps.supported_perf_cap
> when the mediated PMU is enabled, which will prevent relevant bits from being set
> in the vCPU's PERF_CAPABILITIES, and that in turn will ensure the number of LBR
> records is always zero.
Yes, that's true.

>
> If we wanted a sanity check, then it should go in intel_pmu_refresh().  But I don't
> think that's justified.  E.g. legacy LBRs are incompatible with arch LBRs.  At some
> point we have to rely on us not screwing up.
>
> A selftest for this though, that's a different story, but we already have coverage
> thanks to vmx_pmu_caps_test.c.  If we wanted to be paranoid, that test could be
> extended to assert that LBRs are unsupported if the mediated PMU is enabled.

We have already added arch-LBR support internally (not yet sent upstream due
to the dependency) based on this mediated vPMU framework, and we would add an
arch-LBR selftest.  It looks unnecessary to add a temporary test case and then
remove it shortly after.

Would drop this patch.



^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 40/58] KVM: x86/pmu: Grab x86 core PMU for passthrough PMU VM
  2024-11-20 18:46   ` Sean Christopherson
@ 2024-11-21  2:04     ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-11-21  2:04 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla,
	Sandipan Das, Jim Mattson, Stephane Eranian, Ian Rogers,
	Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 11/21/2024 2:46 AM, Sean Christopherson wrote:
> On Thu, Aug 01, 2024, Mingwei Zhang wrote:
>> From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>>
>> When the passthrough PMU is enabled by KVM and perf, KVM calls
>> perf_get_mediated_pmu() to take exclusive ownership of the x86 core PMU at VM
>> creation, and calls perf_put_mediated_pmu() to return the x86 core PMU to
>> host perf at VM destruction.
>>
>> When perf_get_mediated_pmu() fails, the host has system-wide perf events
>> without exclude_guest = 1, which must be disabled to enable a VM with the
>> passthrough PMU.
> I still much prefer my idea of making the mediated PMU opt-in.  I haven't seen
> any argument against that approach.
>
> https://lore.kernel.org/all/ZiFGRFb45oZrmqnJ@google.com

Yeah, I agree this looks like a more flexible method, and it gives the VMM the
right to control whether the mediated vPMU should be enabled, instead of
enabling it statically.

The original issue is that this requires the VMM to be involved, and not all
VMMs call the KVM_CAP_PMU_CAPABILITY ioctl, e.g., QEMU.

But I see there is already a patch
(https://lore.kernel.org/qemu-devel/20241104094119.4131-3-dongli.zhang@oracle.com/)
which tries to add KVM_CAP_PMU_CAPABILITY support to QEMU, although it's not
complete (only disable, but no enable).

Yeah, we would follow this suggestion and make the mediated vPMU opt-in. If
needed, we would add corresponding changes for QEMU as well.



^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 43/58] KVM: x86/pmu: Introduce PMU operator for setting counter overflow
  2024-11-20 18:48     ` Sean Christopherson
@ 2024-11-21  2:05       ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-11-21  2:05 UTC (permalink / raw)
  To: Sean Christopherson, Zide Chen
  Cc: Mingwei Zhang, Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang,
	Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users


On 11/21/2024 2:48 AM, Sean Christopherson wrote:
> On Fri, Oct 25, 2024, Zide Chen wrote:
>>
>> On 7/31/2024 9:58 PM, Mingwei Zhang wrote:
>>> Introduce a PMU operator for setting counter overflow. When emulating counter
>>> increments, multiple counters could overflow at the same time, i.e., during
>>> the execution of the same instruction. In the passthrough PMU, having a PMU
>>> operator makes it convenient to update the PMU global status in one shot,
>>> with the details hidden behind the vendor-specific implementation.
>> Since neither Intel nor AMD does implement this API, this patch should
>> be dropped.
> For all of these small APIs, please introduce and use the API in the same patch.
> That avoids goofs like this, where something is never used, and makes the patches
> far easier to review.

Sure.


>

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 44/58] KVM: x86/pmu: Implement emulated counter increment for passthrough PMU
  2024-11-20 20:13   ` Sean Christopherson
@ 2024-11-21  2:27     ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-11-21  2:27 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla,
	Sandipan Das, Jim Mattson, Stephane Eranian, Ian Rogers,
	Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 11/21/2024 4:13 AM, Sean Christopherson wrote:
> On Thu, Aug 01, 2024, Mingwei Zhang wrote:
>> Implement emulated counter increment for the passthrough PMU under KVM_REQ_PMU.
>> Defer the counter increment to the KVM_REQ_PMU handler because counter
>> increment requests come from kvm_pmu_trigger_event(), which can be triggered
>> within the KVM_RUN inner loop or outside of the inner loop. This means the
>> counter increment could happen before or after the PMU context switch.
>>
>> So processing the counter increment in one place keeps the implementation
>> simple.
>>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> ---
>>  arch/x86/kvm/pmu.c | 41 +++++++++++++++++++++++++++++++++++++++--
>>  1 file changed, 39 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>> index 5cc539bdcc7e..41057d0122bd 100644
>> --- a/arch/x86/kvm/pmu.c
>> +++ b/arch/x86/kvm/pmu.c
>> @@ -510,6 +510,18 @@ static int reprogram_counter(struct kvm_pmc *pmc)
>>  				     eventsel & ARCH_PERFMON_EVENTSEL_INT);
>>  }
>>  
>> +static void kvm_pmu_handle_event_in_passthrough_pmu(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> +
>> +	static_call_cond(kvm_x86_pmu_set_overflow)(vcpu);
>> +
>> +	if (atomic64_read(&pmu->__reprogram_pmi)) {
>> +		kvm_make_request(KVM_REQ_PMI, vcpu);
>> +		atomic64_set(&pmu->__reprogram_pmi, 0ull);
>> +	}
>> +}
>> +
>>  void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
>>  {
>>  	DECLARE_BITMAP(bitmap, X86_PMC_IDX_MAX);
>> @@ -517,6 +529,9 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
>>  	struct kvm_pmc *pmc;
>>  	int bit;
>>  
>> +	if (is_passthrough_pmu_enabled(vcpu))
>> +		return kvm_pmu_handle_event_in_passthrough_pmu(vcpu);
>> +
>>  	bitmap_copy(bitmap, pmu->reprogram_pmi, X86_PMC_IDX_MAX);
>>  
>>  	/*
>> @@ -848,6 +863,17 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
>>  	kvm_pmu_reset(vcpu);
>>  }
>>  
>> +static void kvm_passthrough_pmu_incr_counter(struct kvm_vcpu *vcpu, struct kvm_pmc *pmc)
>> +{
>> +	if (static_call(kvm_x86_pmu_incr_counter)(pmc)) {
> This is absurd.  It's the same ugly code in both Intel and AMD.
>
> static bool intel_incr_counter(struct kvm_pmc *pmc)
> {
> 	pmc->counter += 1;
> 	pmc->counter &= pmc_bitmask(pmc);
>
> 	if (!pmc->counter)
> 		return true;
>
> 	return false;
> }
>
> static bool amd_incr_counter(struct kvm_pmc *pmc)
> {
> 	pmc->counter += 1;
> 	pmc->counter &= pmc_bitmask(pmc);
>
> 	if (!pmc->counter)
> 		return true;
>
> 	return false;
> }
>
>> +		__set_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->global_status);
> Using __set_bit() is unnecessary, ugly, and dangerous.  KVM uses set_bit(), no
> underscores, for things like reprogram_pmi because the updates need to be atomic.
>
> The downside of __set_bit() and friends is that if pmc->idx is garbage, KVM will
> clobber memory, whereas BIT_ULL(pmc->idx) is "just" undefined behavior.  But
> dropping the update is far better than clobbering memory, and can be detected by
> UBSAN (though I doubt anyone is hitting this code with UBSAN).
>
> For this code, a regular ol' bitwise-OR will suffice.  
>
>> +		kvm_make_request(KVM_REQ_PMU, vcpu);
>> +
>> +		if (pmc->eventsel & ARCH_PERFMON_EVENTSEL_INT)
>> +			set_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->reprogram_pmi);
> This is badly in need of a comment, and the ordering is unnecessarily weird.
> Set bits in reprogram_pmi *before* making the request.  It doesn't matter here
> since this is all on the same vCPU, but it's good practice since KVM_REQ_XXX
> provides the necessary barriers to allow for safe, correct cross-CPU updates.
>
> That said, why on earth is the mediated PMU using KVM_REQ_PMU?  Set global_status
> and KVM_REQ_PMI, done.
>
>> +	}
>> +}
>> +
>>  static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
>>  {
>>  	pmc->emulated_counter++;
>> @@ -880,7 +906,8 @@ static inline bool cpl_is_matched(struct kvm_pmc *pmc)
>>  	return (static_call(kvm_x86_get_cpl)(pmc->vcpu) == 0) ? select_os : select_user;
>>  }
>>  
>> -void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel)
>> +static void __kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel,
>> +				    bool is_passthrough)
>>  {
>>  	DECLARE_BITMAP(bitmap, X86_PMC_IDX_MAX);
>>  	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> @@ -914,9 +941,19 @@ void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel)
>>  		    !pmc_event_is_allowed(pmc) || !cpl_is_matched(pmc))
>>  			continue;
>>  
>> -		kvm_pmu_incr_counter(pmc);
>> +		if (is_passthrough)
>> +			kvm_passthrough_pmu_incr_counter(vcpu, pmc);
>> +		else
>> +			kvm_pmu_incr_counter(pmc);
>>  	}
>>  }
>> +
>> +void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel)
>> +{
>> +	bool is_passthrough = is_passthrough_pmu_enabled(vcpu);
>> +
>> +	__kvm_pmu_trigger_event(vcpu, eventsel, is_passthrough);
> Using an inner helper for this is silly, even if the mediated information were
> snapshot per-vCPU.  Just grab the snapshot in a local variable.  Using a param
> adds no value and unnecessarily obfuscates the code.
>
> That's all a moot point though, because (a) KVM can check enable_mediated_pmu
> directly and (b) pivoting on behavior belongs in kvm_pmu_incr_counter(), not here.
>
> And I am leaning towards having the mediated vs. perf-based code live in the same
> function, unless one or both is "huge", so that it's easier to understand and
> appreciate the differences in the implementations.
>
> Not an action item for y'all, but this is also a great time to add comments, which
> are sorely lacking in the code.  I am more than happy to do that, as it helps me
> understand (and thus review) the code.  I'll throw in suggestions here and there
> as I review.
>
> Anyways, this?
>
> static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
> {
> 	/*
> 	 * For perf-based PMUs, accumulate software-emulated events separately
> 	 * from pmc->counter, as pmc->counter is offset by the count of the
> 	 * associated perf event.  Request reprogramming, which will consult
> 	 * both emulated and hardware-generated events to detect overflow.
> 	 */
> 	if (!enable_mediated_pmu) {
> 		pmc->emulated_counter++;
> 		kvm_pmu_request_counter_reprogram(pmc);
> 		return;
> 	}
>
> 	/*
> 	 * For mediated PMUs, pmc->counter is updated when the vCPU's PMU is
> 	 * put, and will be loaded into hardware when the PMU is loaded.  Simply
> 	 * increment the counter and signal overflow if it wraps to zero.
> 	 */
> 	pmc->counter = (pmc->counter + 1) & pmc_bitmask(pmc);
> 	if (!pmc->counter) {
> 		pmc_to_pmu(pmc)->global_status |= BIT_ULL(pmc->idx);
> 		kvm_make_request(KVM_REQ_PMI, pmc->vcpu);
> 	}
> }

Yes, thanks.




^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 45/58] KVM: x86/pmu: Update pmc_{read,write}_counter() to disconnect perf API
  2024-11-20 20:19   ` Sean Christopherson
@ 2024-11-21  2:52     ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-11-21  2:52 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla,
	Sandipan Das, Jim Mattson, Stephane Eranian, Ian Rogers,
	Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 11/21/2024 4:19 AM, Sean Christopherson wrote:
> On Thu, Aug 01, 2024, Mingwei Zhang wrote:
>> Update pmc_{read,write}_counter() to disconnect from the perf API because the
>> passthrough PMU does not use the host PMU as a backend. Because of that,
>> pmc->counter directly holds the guest's actual counter value when it is set
>> from the host (VMM) side.
>>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> ---
>>  arch/x86/kvm/pmu.c | 5 +++++
>>  arch/x86/kvm/pmu.h | 4 ++++
>>  2 files changed, 9 insertions(+)
>>
>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>> index 41057d0122bd..3604cf467b34 100644
>> --- a/arch/x86/kvm/pmu.c
>> +++ b/arch/x86/kvm/pmu.c
>> @@ -322,6 +322,11 @@ static void pmc_update_sample_period(struct kvm_pmc *pmc)
>>  
>>  void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
>>  {
>> +	if (pmc_to_pmu(pmc)->passthrough) {
>> +		pmc->counter = val;
> This needs to mask the value with pmc_bitmask(pmc), otherwise emulated events
> will operate on a bad value, and loading the PMU state into hardware will #GP
> if the PMC is written through the sign-extended MSRs, i.e. if val = -1 and the
> CPU supports full-width writes.

Sure.
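
i.e., something like this at the top of pmc_write_counter() (a sketch, using
the same names as the patch above):

	if (pmc_to_pmu(pmc)->passthrough) {
		/*
		 * Mask to the counter width so that emulated events and
		 * full-width writes (e.g. val = -1) don't load a bad value
		 * into hardware later.
		 */
		pmc->counter = val & pmc_bitmask(pmc);
		return;
	}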



>
>> +		return;
>> +	}
>> +
>>  	/*
>>  	 * Drop any unconsumed accumulated counts, the WRMSR is a write, not a
>>  	 * read-modify-write.  Adjust the counter value so that its value is
>> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
>> index 78a7f0c5f3ba..7e006cb61296 100644
>> --- a/arch/x86/kvm/pmu.h
>> +++ b/arch/x86/kvm/pmu.h
>> @@ -116,6 +116,10 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
>>  {
>>  	u64 counter, enabled, running;
>>  
>> +	counter = pmc->counter;
> Using a local variable is pointless, the perf-based path immediately clobbers it.

Sure. would drop it and directly return pmc->counter.
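
i.e., at the top of pmc_read_counter() (sketch):

	if (pmc_to_pmu(pmc)->passthrough)
		return pmc->counter;

	counter = pmc->counter + pmc->emulated_counter;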


>
>> +	if (pmc_to_pmu(pmc)->passthrough)
>> +		return counter & pmc_bitmask(pmc);
> And then this can simply return pmc->counter.  We _could_ add a WARN on pmc->counter
> overlapping with pmc_bitmask(), but IMO that's unnecessary.  If anything, WARN and
> mask pmc->counter when loading state into hardware.
>
>> +
>>  	counter = pmc->counter + pmc->emulated_counter;
>>  
>>  	if (pmc->perf_event && !pmc->is_paused)
>> -- 
>> 2.46.0.rc1.232.g9752f9e123-goog
>>

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 46/58] KVM: x86/pmu: Disconnect counter reprogram logic from passthrough PMU
  2024-11-20 20:40   ` Sean Christopherson
@ 2024-11-21  3:02     ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-11-21  3:02 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla,
	Sandipan Das, Jim Mattson, Stephane Eranian, Ian Rogers,
	Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 11/21/2024 4:40 AM, Sean Christopherson wrote:
> On Thu, Aug 01, 2024, Mingwei Zhang wrote:
>> Disconnect the counter reprogram logic because the passthrough PMU never uses
>> the host PMU, nor does it use the perf API for anything. Instead, when the
>> passthrough PMU is enabled, reaching any of the counter reprogram paths should
>> be treated as an error.
>>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> ---
>>  arch/x86/kvm/pmu.c | 3 +++
>>  arch/x86/kvm/pmu.h | 8 ++++++++
>>  2 files changed, 11 insertions(+)
>>
>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>> index 3604cf467b34..fcd188cc389a 100644
>> --- a/arch/x86/kvm/pmu.c
>> +++ b/arch/x86/kvm/pmu.c
>> @@ -478,6 +478,9 @@ static int reprogram_counter(struct kvm_pmc *pmc)
>>  	bool emulate_overflow;
>>  	u8 fixed_ctr_ctrl;
>>  
>> +	if (WARN_ONCE(pmu->passthrough, "Passthrough PMU never reprogram counter\n"))
> Eh, a WARN_ON_ONCE() is enough.
>
> That said, this isn't entirely correct.  The mediated PMU needs to "reprogram"
> event selectors if the event filter is changed. KVM currently handles this by
> setting  all __reprogram_pmi bits and blasting KVM_REQ_PMU.
>
> LOL, and there's even a reprogram_fixed_counters_in_passthrough_pmu(), so the
> above message is a lie.
>
> There is also far too much duplicate code in things like reprogram_fixed_counters()
> versus reprogram_fixed_counters_in_passthrough_pmu().
>
> Reprogramming on each write is also technically suboptimal, as _very_ theoretically
> KVM could emulate multiple WRMSRs without re-entering the guest.
>
> So, I think the mediated PMU should use the reprogramming infrastructure, and
> handle the bulk of the different behavior in reprogram_counter(), not in a half
> dozen different paths.

Sure, would handle the reprogram-counter behavior of the mediated vPMU in
reprogram_counter(), and your comment reminds me that the current mediated vPMU
code misses handling the case of the event filter changing. Would add that in
the next version.
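
Something along these lines in reprogram_counter() (a rough sketch reusing the
eventsel_hw field from the AMD patches; hypothetical placement, details to be
sorted out in the next version):

	if (pmc_to_pmu(pmc)->passthrough) {
		/*
		 * No perf event to reprogram; re-apply the event filter (it
		 * may have changed) and refresh the hardware event selector.
		 * Vendor code would adjust vendor-specific bits, e.g. the
		 * AMD HostOnly/GuestOnly bits.
		 */
		if (!check_pmu_event_filter(pmc))
			pmc->eventsel_hw = 0;
		else
			pmc->eventsel_hw = pmc->eventsel;
		return 0;
	}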


>
>> +		return 0;
>> +
>>  	emulate_overflow = pmc_pause_counter(pmc);
>>  
>>  	if (!pmc_event_is_allowed(pmc))
>> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
>> index 7e006cb61296..10553bc1ae1d 100644
>> --- a/arch/x86/kvm/pmu.h
>> +++ b/arch/x86/kvm/pmu.h
>> @@ -256,6 +256,10 @@ static inline void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops)
>>  
>>  static inline void kvm_pmu_request_counter_reprogram(struct kvm_pmc *pmc)
>>  {
>> +	/* Passthrough PMU never reprogram counters via KVM_REQ_PMU. */
>> +	if (pmc_to_pmu(pmc)->passthrough)
>> +		return;
>> +
>>  	set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi);
>>  	kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
>>  }
>> @@ -264,6 +268,10 @@ static inline void reprogram_counters(struct kvm_pmu *pmu, u64 diff)
>>  {
>>  	int bit;
>>  
>> +	/* Passthrough PMU never reprogram counters via KVM_REQ_PMU. */
>> +	if (pmu->passthrough)
>> +		return;
> Make up your mind :-)  Either handle the mediated PMU here, or handle it in the
> callers.  Don't do both.
>
> I vote to handle it here, i.e. drop the check in kvm_pmu_set_msr() when handling
> MSR_CORE_PERF_GLOBAL_CTRL, and then add a comment that reprogramming GP counters
> doesn't need to happen on control updates because KVM enforces the event filters
> when emulating writes to event selectors (and because the guest can write
> PERF_GLOBAL_CTRL directly).

:) Yeah, we found this and removed them internally. Sure. would follow your
suggestion.


>
>> +
>>  	if (!diff)
>>  		return;
>>  
>> -- 
>> 2.46.0.rc1.232.g9752f9e123-goog
>>

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 47/58] KVM: nVMX: Add nested virtualization support for passthrough PMU
  2024-11-20 20:52   ` Sean Christopherson
@ 2024-11-21  3:14     ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-11-21  3:14 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla,
	Sandipan Das, Jim Mattson, Stephane Eranian, Ian Rogers,
	Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 11/21/2024 4:52 AM, Sean Christopherson wrote:
> On Thu, Aug 01, 2024, Mingwei Zhang wrote:
>> Add nested virtualization support for passthrough PMU by combining the MSR
>> interception bitmaps of vmcs01 and vmcs12. Readers may argue that even without
>> this patch, nested virtualization works for the passthrough PMU because L1 will
>> see Perfmon v2 and will have to use the legacy vPMU implementation if it is
>> Linux. However, any assumption made about L1 may be invalid, e.g., L1 may not
>> even be Linux.
>>
>> If both L0 and L1 pass through the PMU MSRs, the correct behavior is to let
>> MSR accesses from L2 directly touch the HW MSRs, since both L0 and L1 pass
>> through the access.
>>
>> However, in the current implementation, without adding anything for nested,
>> KVM always sets the MSR interception bits in vmcs02. As a result, L0 will
>> emulate all MSR reads/writes for L2, leading to errors, since the current
>> passthrough vPMU never implements set_msr() and get_msr() for any counter
>> access except accesses from the VMM side.
>>
>> So fix the issue by setting up the correct MSR interception for PMU MSRs.
>>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>  arch/x86/kvm/vmx/nested.c | 52 +++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 52 insertions(+)
>>
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index 643935a0f70a..ef385f9e7513 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -612,6 +612,55 @@ static inline void nested_vmx_set_intercept_for_msr(struct vcpu_vmx *vmx,
>>  						   msr_bitmap_l0, msr);
>>  }
>>  
>> +/* Pass PMU MSRs to nested VM if L0 and L1 are set to passthrough. */
>> +static void nested_vmx_set_passthru_pmu_intercept_for_msr(struct kvm_vcpu *vcpu,
>> +							  unsigned long *msr_bitmap_l1,
>> +							  unsigned long *msr_bitmap_l0)
>> +{
>> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>> +	int i;
>> +
>> +	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
>> +		nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
>> +						 msr_bitmap_l0,
>> +						 MSR_ARCH_PERFMON_EVENTSEL0 + i,
>> +						 MSR_TYPE_RW);
>> +		nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
>> +						 msr_bitmap_l0,
>> +						 MSR_IA32_PERFCTR0 + i,
>> +						 MSR_TYPE_RW);
>> +		nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
>> +						 msr_bitmap_l0,
>> +						 MSR_IA32_PMC0 + i,
>> +						 MSR_TYPE_RW);
> I think we should add (gross) macros to dedup the bulk of this boilerplate, by
> referencing the local variables in the macros.  Like I said, gross.  But I think
> it'd be less error prone and easier to read than the copy+paste mess we have today.
> E.g. it's easy to miss that only writes are allowed for MSR_IA32_FLUSH_CMD and
> MSR_IA32_PRED_CMD, because there's so much boilerplate.
>
> Something like:
>
> #define nested_vmx_merge_msr_bitmaps(msr, type)	\
> 	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0, msr, type)
>
> #define nested_vmx_merge_msr_bitmaps_read(msr)	\
> 	nested_vmx_merge_msr_bitmaps(msr, MSR_TYPE_R);
>
> #define nested_vmx_merge_msr_bitmaps_write(msr)	\
> 	nested_vmx_merge_msr_bitmaps(msr, MSR_TYPE_W);
>
> #define nested_vmx_merge_msr_bitmaps_rw(msr)	\
> 	nested_vmx_merge_msr_bitmaps(msr, MSR_TYPE_RW);
>
>
> 	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> 		nested_vmx_merge_msr_bitmaps_rw(MSR_ARCH_PERFMON_EVENTSEL0 + i);
> 		nested_vmx_merge_msr_bitmaps_rw(MSR_IA32_PERFCTR0 + i);
> 		nested_vmx_merge_msr_bitmaps_rw(MSR_IA32_PMC0 + i);
> 	}
>
> 	for (i = 0; i < vcpu_to_pmu(vcpu)->nr_arch_fixed_counters; i++)
> 		nested_vmx_merge_msr_bitmaps_rw(MSR_CORE_PERF_FIXED_CTR0 + i);
>
> 	blah blah blah

Sure. Thanks.

>
>> +	}
>> +
>> +	for (i = 0; i < vcpu_to_pmu(vcpu)->nr_arch_fixed_counters; i++) {
> Curly braces are unnecessary.

Sure.


>
>> +		nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
>> +						 msr_bitmap_l0,
>> +						 MSR_CORE_PERF_FIXED_CTR0 + i,
>> +						 MSR_TYPE_RW);
>> +	}
>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
>> +					 msr_bitmap_l0,
>> +					 MSR_CORE_PERF_FIXED_CTR_CTRL,
>> +					 MSR_TYPE_RW);
>> +
>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
>> +					 msr_bitmap_l0,
>> +					 MSR_CORE_PERF_GLOBAL_STATUS,
>> +					 MSR_TYPE_RW);
>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
>> +					 msr_bitmap_l0,
>> +					 MSR_CORE_PERF_GLOBAL_CTRL,
>> +					 MSR_TYPE_RW);
>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
>> +					 msr_bitmap_l0,
>> +					 MSR_CORE_PERF_GLOBAL_OVF_CTRL,
>> +					 MSR_TYPE_RW);
>> +}
>> +
>>  /*
>>   * Merge L0's and L1's MSR bitmap, return false to indicate that
>>   * we do not use the hardware.
>> @@ -713,6 +762,9 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
>>  	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>>  					 MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
>>  
>> +	if (is_passthrough_pmu_enabled(vcpu))
>> +		nested_vmx_set_passthru_pmu_intercept_for_msr(vcpu, msr_bitmap_l1, msr_bitmap_l0);
> Please wrap.  Or better yet:
>
> 	nested_vmx_merge_pmu_msr_bitmaps(vmx, msr_bitmap_1, msr_bitmap_l0);
>
> and handle the enable_mediated_pmu check in the helper.

Sure.


>
>> +
>>  	kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
>>  
>>  	vmx->nested.force_msr_bitmap_recalc = false;
>> -- 
>> 2.46.0.rc1.232.g9752f9e123-goog
>>

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 52/58] KVM: x86/pmu/svm: Implement callback to disable MSR interception
  2024-11-20 21:02   ` Sean Christopherson
@ 2024-11-21  3:24     ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-11-21  3:24 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla,
	Sandipan Das, Jim Mattson, Stephane Eranian, Ian Rogers,
	Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users, Aaron Lewis


On 11/21/2024 5:02 AM, Sean Christopherson wrote:
> +Aaron
>
> On Thu, Aug 01, 2024, Mingwei Zhang wrote:
>> +static void amd_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> +	struct vcpu_svm *svm = to_svm(vcpu);
>> +	int msr_clear = !!(is_passthrough_pmu_enabled(vcpu));
>> +	int i;
>> +
>> +	for (i = 0; i < min(pmu->nr_arch_gp_counters, AMD64_NUM_COUNTERS); i++) {
>> +		/*
>> +		 * Legacy counters are always available irrespective of any
>> +		 * CPUID feature bits and when X86_FEATURE_PERFCTR_CORE is set,
>> +		 * PERF_LEGACY_CTLx and PERF_LEGACY_CTRx registers are mirrored
>> +		 * with PERF_CTLx and PERF_CTRx respectively.
>> +		 */
>> +		set_msr_interception(vcpu, svm->msrpm, MSR_K7_EVNTSEL0 + i, 0, 0);
>> +		set_msr_interception(vcpu, svm->msrpm, MSR_K7_PERFCTR0 + i, msr_clear, msr_clear);
>> +	}
>> +
>> +	for (i = 0; i < kvm_pmu_cap.num_counters_gp; i++) {
>> +		/*
>> +		 * PERF_CTLx registers require interception in order to clear
>> +		 * HostOnly bit and set GuestOnly bit. This is to prevent the
>> +		 * PERF_CTRx registers from counting before VM entry and after
>> +		 * VM exit.
>> +		 */
>> +		set_msr_interception(vcpu, svm->msrpm, MSR_F15H_PERF_CTL + 2 * i, 0, 0);
>> +
>> +		/*
>> +		 * Pass through counters exposed to the guest and intercept
>> +		 * counters that are unexposed. Do this explicitly since this
>> +		 * function may be set multiple times before vcpu runs.
>> +		 */
>> +		if (i >= pmu->nr_arch_gp_counters)
>> +			msr_clear = 0;
> Similar to my comments on the Intel side, explicitly enable interception for
> MSRs that don't exist in the guest model in a separate for-loop, i.e. don't
> toggle msr_clear in the middle of a loop.

Sure.
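
i.e., split it into two loops, e.g. (a sketch based on the code above; the
second loop explicitly intercepts the counters that are not exposed to the
guest):

	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
		set_msr_interception(vcpu, svm->msrpm, MSR_F15H_PERF_CTL + 2 * i, 0, 0);
		set_msr_interception(vcpu, svm->msrpm, MSR_F15H_PERF_CTR + 2 * i, msr_clear, msr_clear);
	}

	for ( ; i < kvm_pmu_cap.num_counters_gp; i++) {
		set_msr_interception(vcpu, svm->msrpm, MSR_F15H_PERF_CTL + 2 * i, 0, 0);
		set_msr_interception(vcpu, svm->msrpm, MSR_F15H_PERF_CTR + 2 * i, 0, 0);
	}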


>
> I would also love to de-dup the bulk of this code, which is very doable since
> the base+shift for the MSRs is going to be stashed in kvm_pmu.  All that's needed
> on top is unified MSR interception logic, which is something that's been on my
> wish list for some time.  SVM's inverted polarity needs to die a horrible death.
>
> Lucky for me, Aaron is picking up that torch.
>
> Aaron, what's your ETA on the MSR unification?  No rush, but if you think it'll
> be ready in the next month or so, I'll plan on merging that first and landing
> this code on top.

Is there a public link to Aaron's patches? If so, we can rebase the next
version of this series on top of them.


>
>> +		set_msr_interception(vcpu, svm->msrpm, MSR_F15H_PERF_CTR + 2 * i, msr_clear, msr_clear);
>> +	}
>> +
>> +	/*
>> +	 * In mediated passthrough vPMU, intercept global PMU MSRs when guest
>> +	 * PMU only owns a subset of counters provided in HW or its version is
>> +	 * less than 2.
>> +	 */
>> +	if (is_passthrough_pmu_enabled(vcpu) && pmu->version > 1 &&
> kvm_pmu_has_perf_global_ctrl(), no?

Yes.
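
i.e. (sketch):

	if (is_passthrough_pmu_enabled(vcpu) && kvm_pmu_has_perf_global_ctrl(pmu) &&
	    pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp)
		msr_clear = 1;
	else
		msr_clear = 0;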


>
>> +	    pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp)
>> +		msr_clear = 1;
>> +	else
>> +		msr_clear = 0;
>> +
>> +	set_msr_interception(vcpu, svm->msrpm, MSR_AMD64_PERF_CNTR_GLOBAL_CTL, msr_clear, msr_clear);
>> +	set_msr_interception(vcpu, svm->msrpm, MSR_AMD64_PERF_CNTR_GLOBAL_STATUS, msr_clear, msr_clear);
>> +	set_msr_interception(vcpu, svm->msrpm, MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR, msr_clear, msr_clear);
>> +	set_msr_interception(vcpu, svm->msrpm, MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET, msr_clear, msr_clear);
>> +}
>> +
>>  struct kvm_pmu_ops amd_pmu_ops __initdata = {
>>  	.rdpmc_ecx_to_pmc = amd_rdpmc_ecx_to_pmc,
>>  	.msr_idx_to_pmc = amd_msr_idx_to_pmc,
>> @@ -258,6 +312,7 @@ struct kvm_pmu_ops amd_pmu_ops __initdata = {
>>  	.refresh = amd_pmu_refresh,
>>  	.init = amd_pmu_init,
>>  	.is_rdpmc_passthru_allowed = amd_is_rdpmc_passthru_allowed,
>> +	.passthrough_pmu_msrs = amd_passthrough_pmu_msrs,
>>  	.EVENTSEL_EVENT = AMD64_EVENTSEL_EVENT,
>>  	.MAX_NR_GP_COUNTERS = KVM_AMD_PMC_MAX_GENERIC,
>>  	.MIN_NR_GP_COUNTERS = AMD64_NUM_COUNTERS,
>> -- 
>> 2.46.0.rc1.232.g9752f9e123-goog
>>

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 53/58] KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest write to event selectors
  2024-11-20 21:38   ` Sean Christopherson
@ 2024-11-21  3:26     ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-11-21  3:26 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla,
	Sandipan Das, Jim Mattson, Stephane Eranian, Ian Rogers,
	Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 11/21/2024 5:38 AM, Sean Christopherson wrote:
> On Thu, Aug 01, 2024, Mingwei Zhang wrote:
>> From: Sandipan Das <sandipan.das@amd.com>
>>
>> On AMD platforms, there is no way to restore PerfCntrGlobalCtl at
>> VM-Entry or clear it at VM-Exit. Since the register states will be
>> restored before entering and saved after exiting guest context, the
>> counters can keep ticking and even overflow leading to chaos while
>> still in host context.
>>
>> To avoid this, the PERF_CTLx MSRs (event selectors) are always
>> intercepted. KVM will always set the GuestOnly bit and clear the
>> HostOnly bit so that the counters run only in guest context even if
>> their enable bits are set. Intercepting these MSRs is also necessary
>> for guest event filtering.
>>
>> Signed-off-by: Sandipan Das <sandipan.das@amd.com>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>  arch/x86/kvm/svm/pmu.c | 7 ++++++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
>> index cc03c3e9941f..2b7cc7616162 100644
>> --- a/arch/x86/kvm/svm/pmu.c
>> +++ b/arch/x86/kvm/svm/pmu.c
>> @@ -165,7 +165,12 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>  		data &= ~pmu->reserved_bits;
>>  		if (data != pmc->eventsel) {
>>  			pmc->eventsel = data;
>> -			kvm_pmu_request_counter_reprogram(pmc);
>> +			if (is_passthrough_pmu_enabled(vcpu)) {
>> +				data &= ~AMD64_EVENTSEL_HOSTONLY;
>> +				pmc->eventsel_hw = data | AMD64_EVENTSEL_GUESTONLY;
> Do both in a single statment, i.e.
>
> 				pmc->eventsel_hw = (data & ~AMD64_EVENTSEL_HOSTONLY) |
> 						   AMD64_EVENTSEL_GUESTONLY;
>
> Though per my earlier comments, this likely needs to end up in reprogram_counter().

It looks like we need to add a PMU callback and call it from reprogram_counter().
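
A rough sketch of what such a callback could look like on the AMD side
(hypothetical name; same bit handling as in the patch above):

	/* Hypothetical kvm_pmu_ops callback, invoked from reprogram_counter(). */
	static void amd_pmu_write_eventsel_hw(struct kvm_pmc *pmc)
	{
		pmc->eventsel_hw = (pmc->eventsel & ~AMD64_EVENTSEL_HOSTONLY) |
				   AMD64_EVENTSEL_GUESTONLY;
	}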


>
>> +			} else {
>> +				kvm_pmu_request_counter_reprogram(pmc);
>> +			}
>>  		}
>>  		return 0;
>>  	}
>> -- 
>> 2.46.0.rc1.232.g9752f9e123-goog
>>

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 56/58] KVM: x86/pmu/svm: Wire up PMU filtering functionality for passthrough PMU
  2024-11-20 21:39   ` Sean Christopherson
@ 2024-11-21  3:29     ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2024-11-21  3:29 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla,
	Sandipan Das, Jim Mattson, Stephane Eranian, Ian Rogers,
	Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 11/21/2024 5:39 AM, Sean Christopherson wrote:
> On Thu, Aug 01, 2024, Mingwei Zhang wrote:
>> From: Manali Shukla <manali.shukla@amd.com>
>>
>> With the Passthrough PMU enabled, the PERF_CTLx MSRs (event selectors) are
>> always intercepted and the event filter checking can be directly done
>> inside amd_pmu_set_msr().
>>
>> Add a check to allow writes to the event selectors of GP counters if and only
>> if the event is allowed by the filter.
> This belongs in the patch that adds AMD support for setting pmc->eventsel_hw.
> E.g. reverting just this patch would leave KVM in a very broken state.  And it's
> unnecessarily difficult to review.

Sure. would merge them into one.


>
>> Signed-off-by: Manali Shukla <manali.shukla@amd.com>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>  arch/x86/kvm/svm/pmu.c | 9 +++++++++
>>  1 file changed, 9 insertions(+)
>>
>> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
>> index 86818da66bbe..9f3e910ee453 100644
>> --- a/arch/x86/kvm/svm/pmu.c
>> +++ b/arch/x86/kvm/svm/pmu.c
>> @@ -166,6 +166,15 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>  		if (data != pmc->eventsel) {
>>  			pmc->eventsel = data;
>>  			if (is_passthrough_pmu_enabled(vcpu)) {
>> +				if (!check_pmu_event_filter(pmc)) {
>> +					/*
>> +					 * When guest request an invalid event,
>> +					 * stop the counter by clearing the
>> +					 * event selector MSR.
>> +					 */
>> +					pmc->eventsel_hw = 0;
>> +					return 0;
>> +				}
>>  				data &= ~AMD64_EVENTSEL_HOSTONLY;
>>  				pmc->eventsel_hw = data | AMD64_EVENTSEL_GUESTONLY;
>>  			} else {
>> -- 
>> 2.46.0.rc1.232.g9752f9e123-goog
>>

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 09/58] perf: Add a EVENT_GUEST flag
  2024-08-01  4:58 ` [RFC PATCH v3 09/58] perf: Add a EVENT_GUEST flag Mingwei Zhang
                     ` (2 preceding siblings ...)
  2024-10-14 11:14   ` Peter Zijlstra
@ 2024-12-13  9:37   ` Sandipan Das
  2024-12-13 16:26     ` Liang, Kan
  3 siblings, 1 reply; 183+ messages in thread
From: Sandipan Das @ 2024-12-13  9:37 UTC (permalink / raw)
  To: Mingwei Zhang, Kan Liang
  Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Zhenyu Wang, Manali Shukla, Jim Mattson, Stephane Eranian,
	Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
	Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On 8/1/2024 10:28 AM, Mingwei Zhang wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> Current perf doesn't explicitly schedule out all exclude_guest events
> while the guest is running. This is not a problem for the current emulated
> vPMU, because perf owns all the PMU counters. It can mask the counter that is
> assigned to an exclude_guest event when a guest is running (the Intel way), or
> set the corresponding HOSTONLY bit in the eventsel (the AMD way). The counter
> doesn't count when a guest is running.
> 
> However, neither way works with the introduced passthrough vPMU. A guest owns
> all the PMU counters when it's running. The host should not mask any counters:
> a counter may be in use by the guest, and its eventsel may be overwritten.
> 
> Perf should explicitly schedule out all exclude_guest events to release
> the PMU resources when entering a guest, and resume the counting when
> exiting the guest.
> 
> It's possible that an exclude_guest event is created when a guest is
> running. The new event should not be scheduled in either.
> 
> The ctx time is shared among different PMUs, so it cannot be stopped while a
> guest is running; it is still required to calculate the time for events from
> other PMUs, e.g., uncore events. Add timeguest to track the guest run time.
> For an exclude_guest event, the elapsed time equals the ctx time minus the
> guest time. Cgroups have dedicated times; use the same method to deduct the
> guest time from the cgroup time as well.
> 
> Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  include/linux/perf_event.h |   6 ++
>  kernel/events/core.c       | 178 +++++++++++++++++++++++++++++++------
>  2 files changed, 155 insertions(+), 29 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index e22cdb6486e6..81a5f8399cb8 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -952,6 +952,11 @@ struct perf_event_context {
>  	 */
>  	struct perf_time_ctx		time;
>  
> +	/*
> +	 * Context clock, runs when in the guest mode.
> +	 */
> +	struct perf_time_ctx		timeguest;
> +
>  	/*
>  	 * These fields let us detect when two contexts have both
>  	 * been cloned (inherited) from a common ancestor.
> @@ -1044,6 +1049,7 @@ struct bpf_perf_event_data_kern {
>   */
>  struct perf_cgroup_info {
>  	struct perf_time_ctx		time;
> +	struct perf_time_ctx		timeguest;
>  	int				active;
>  };
>  
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index c25e2bf27001..57648736e43e 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -376,7 +376,8 @@ enum event_type_t {
>  	/* see ctx_resched() for details */
>  	EVENT_CPU = 0x8,
>  	EVENT_CGROUP = 0x10,
> -	EVENT_FLAGS = EVENT_CGROUP,
> +	EVENT_GUEST = 0x20,
> +	EVENT_FLAGS = EVENT_CGROUP | EVENT_GUEST,
>  	EVENT_ALL = EVENT_FLEXIBLE | EVENT_PINNED,
>  };
>  
> @@ -407,6 +408,7 @@ static atomic_t nr_include_guest_events __read_mostly;
>  
>  static atomic_t nr_mediated_pmu_vms;
>  static DEFINE_MUTEX(perf_mediated_pmu_mutex);
> +static DEFINE_PER_CPU(bool, perf_in_guest);
>  
>  /* !exclude_guest event of PMU with PERF_PMU_CAP_PASSTHROUGH_VPMU */
>  static inline bool is_include_guest_event(struct perf_event *event)
> @@ -706,6 +708,10 @@ static bool perf_skip_pmu_ctx(struct perf_event_pmu_context *pmu_ctx,
>  	if ((event_type & EVENT_CGROUP) && !pmu_ctx->nr_cgroups)
>  		return true;
>  
> +	if ((event_type & EVENT_GUEST) &&
> +	    !(pmu_ctx->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU))
> +		return true;
> +
>  	return false;
>  }
>  
> @@ -770,12 +776,21 @@ static inline int is_cgroup_event(struct perf_event *event)
>  	return event->cgrp != NULL;
>  }
>  
> +static inline u64 __perf_event_time_ctx(struct perf_event *event,
> +					struct perf_time_ctx *time,
> +					struct perf_time_ctx *timeguest);
> +
> +static inline u64 __perf_event_time_ctx_now(struct perf_event *event,
> +					    struct perf_time_ctx *time,
> +					    struct perf_time_ctx *timeguest,
> +					    u64 now);
> +
>  static inline u64 perf_cgroup_event_time(struct perf_event *event)
>  {
>  	struct perf_cgroup_info *t;
>  
>  	t = per_cpu_ptr(event->cgrp->info, event->cpu);
> -	return t->time.time;
> +	return __perf_event_time_ctx(event, &t->time, &t->timeguest);
>  }
>  
>  static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
> @@ -784,9 +799,9 @@ static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
>  
>  	t = per_cpu_ptr(event->cgrp->info, event->cpu);
>  	if (!__load_acquire(&t->active))
> -		return t->time.time;
> -	now += READ_ONCE(t->time.offset);
> -	return now;
> +		return __perf_event_time_ctx(event, &t->time, &t->timeguest);
> +
> +	return __perf_event_time_ctx_now(event, &t->time, &t->timeguest, now);
>  }
>  
>  static inline void update_perf_time_ctx(struct perf_time_ctx *time, u64 now, bool adv);
> @@ -796,6 +811,18 @@ static inline void __update_cgrp_time(struct perf_cgroup_info *info, u64 now, bo
>  	update_perf_time_ctx(&info->time, now, adv);
>  }
>  
> +static inline void __update_cgrp_guest_time(struct perf_cgroup_info *info, u64 now, bool adv)
> +{
> +	update_perf_time_ctx(&info->timeguest, now, adv);
> +}
> +
> +static inline void update_cgrp_time(struct perf_cgroup_info *info, u64 now)
> +{
> +	__update_cgrp_time(info, now, true);
> +	if (__this_cpu_read(perf_in_guest))
> +		__update_cgrp_guest_time(info, now, true);
> +}
> +
>  static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx, bool final)
>  {
>  	struct perf_cgroup *cgrp = cpuctx->cgrp;
> @@ -809,7 +836,7 @@ static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx,
>  			cgrp = container_of(css, struct perf_cgroup, css);
>  			info = this_cpu_ptr(cgrp->info);
>  
> -			__update_cgrp_time(info, now, true);
> +			update_cgrp_time(info, now);
>  			if (final)
>  				__store_release(&info->active, 0);
>  		}
> @@ -832,11 +859,11 @@ static inline void update_cgrp_time_from_event(struct perf_event *event)
>  	 * Do not update time when cgroup is not active
>  	 */
>  	if (info->active)
> -		__update_cgrp_time(info, perf_clock(), true);
> +		update_cgrp_time(info, perf_clock());
>  }
>  
>  static inline void
> -perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
> +perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx, bool guest)
>  {
>  	struct perf_event_context *ctx = &cpuctx->ctx;
>  	struct perf_cgroup *cgrp = cpuctx->cgrp;
> @@ -856,8 +883,12 @@ perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
>  	for (css = &cgrp->css; css; css = css->parent) {
>  		cgrp = container_of(css, struct perf_cgroup, css);
>  		info = this_cpu_ptr(cgrp->info);
> -		__update_cgrp_time(info, ctx->time.stamp, false);
> -		__store_release(&info->active, 1);
> +		if (guest) {
> +			__update_cgrp_guest_time(info, ctx->time.stamp, false);
> +		} else {
> +			__update_cgrp_time(info, ctx->time.stamp, false);
> +			__store_release(&info->active, 1);
> +		}
>  	}
>  }
>  
> @@ -1061,7 +1092,7 @@ static inline int perf_cgroup_connect(pid_t pid, struct perf_event *event,
>  }
>  
>  static inline void
> -perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
> +perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx, bool guest)
>  {
>  }
>  
> @@ -1488,16 +1519,34 @@ static inline void update_perf_time_ctx(struct perf_time_ctx *time, u64 now, boo
>   */
>  static void __update_context_time(struct perf_event_context *ctx, bool adv)
>  {
> -	u64 now = perf_clock();
> +	lockdep_assert_held(&ctx->lock);
> +
> +	update_perf_time_ctx(&ctx->time, perf_clock(), adv);
> +}
>  
> +static void __update_context_guest_time(struct perf_event_context *ctx, bool adv)
> +{
>  	lockdep_assert_held(&ctx->lock);
>  
> -	update_perf_time_ctx(&ctx->time, now, adv);
> +	/* must be called after __update_context_time(); */
> +	update_perf_time_ctx(&ctx->timeguest, ctx->time.stamp, adv);
>  }
>  
>  static void update_context_time(struct perf_event_context *ctx)
>  {
>  	__update_context_time(ctx, true);
> +	if (__this_cpu_read(perf_in_guest))
> +		__update_context_guest_time(ctx, true);
> +}
> +
> +static inline u64 __perf_event_time_ctx(struct perf_event *event,
> +					struct perf_time_ctx *time,
> +					struct perf_time_ctx *timeguest)
> +{
> +	if (event->attr.exclude_guest)
> +		return time->time - timeguest->time;
> +	else
> +		return time->time;
>  }
>  
>  static u64 perf_event_time(struct perf_event *event)
> @@ -1510,7 +1559,26 @@ static u64 perf_event_time(struct perf_event *event)
>  	if (is_cgroup_event(event))
>  		return perf_cgroup_event_time(event);
>  
> -	return ctx->time.time;
> +	return __perf_event_time_ctx(event, &ctx->time, &ctx->timeguest);
> +}
> +
> +static inline u64 __perf_event_time_ctx_now(struct perf_event *event,
> +					    struct perf_time_ctx *time,
> +					    struct perf_time_ctx *timeguest,
> +					    u64 now)
> +{
> +	/*
> +	 * The exclude_guest event time should be calculated from
> +	 * the ctx time -  the guest time.
> +	 * The ctx time is now + READ_ONCE(time->offset).
> +	 * The guest time is now + READ_ONCE(timeguest->offset).
> +	 * So the exclude_guest time is
> +	 * READ_ONCE(time->offset) - READ_ONCE(timeguest->offset).
> +	 */
> +	if (event->attr.exclude_guest && __this_cpu_read(perf_in_guest))
> +		return READ_ONCE(time->offset) - READ_ONCE(timeguest->offset);
> +	else
> +		return now + READ_ONCE(time->offset);
>  }
>  
>  static u64 perf_event_time_now(struct perf_event *event, u64 now)
> @@ -1524,10 +1592,9 @@ static u64 perf_event_time_now(struct perf_event *event, u64 now)
>  		return perf_cgroup_event_time_now(event, now);
>  
>  	if (!(__load_acquire(&ctx->is_active) & EVENT_TIME))
> -		return ctx->time.time;
> +		return __perf_event_time_ctx(event, &ctx->time, &ctx->timeguest);
>  
> -	now += READ_ONCE(ctx->time.offset);
> -	return now;
> +	return __perf_event_time_ctx_now(event, &ctx->time, &ctx->timeguest, now);
>  }
>  
>  static enum event_type_t get_event_type(struct perf_event *event)
> @@ -3334,9 +3401,15 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
>  	 * would only update time for the pinned events.
>  	 */
>  	if (is_active & EVENT_TIME) {
> +		bool stop;
> +
> +		/* vPMU should not stop time */
> +		stop = !(event_type & EVENT_GUEST) &&
> +		       ctx == &cpuctx->ctx;
> +
>  		/* update (and stop) ctx time */
>  		update_context_time(ctx);
> -		update_cgrp_time_from_cpuctx(cpuctx, ctx == &cpuctx->ctx);
> +		update_cgrp_time_from_cpuctx(cpuctx, stop);
>  		/*
>  		 * CPU-release for the below ->is_active store,
>  		 * see __load_acquire() in perf_event_time_now()
> @@ -3354,7 +3427,18 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
>  			cpuctx->task_ctx = NULL;
>  	}
>  
> -	is_active ^= ctx->is_active; /* changed bits */
> +	if (event_type & EVENT_GUEST) {
> +		/*
> +		 * Schedule out all !exclude_guest events of PMU
> +		 * with PERF_PMU_CAP_PASSTHROUGH_VPMU.
> +		 */
> +		is_active = EVENT_ALL;
> +		__update_context_guest_time(ctx, false);
> +		perf_cgroup_set_timestamp(cpuctx, true);
> +		barrier();
> +	} else {
> +		is_active ^= ctx->is_active; /* changed bits */
> +	}
>  
>  	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
>  		if (perf_skip_pmu_ctx(pmu_ctx, event_type))
> @@ -3853,10 +3937,15 @@ static inline void group_update_userpage(struct perf_event *group_event)
>  		event_update_userpage(event);
>  }
>  
> +struct merge_sched_data {
> +	int can_add_hw;
> +	enum event_type_t event_type;
> +};
> +
>  static int merge_sched_in(struct perf_event *event, void *data)
>  {
>  	struct perf_event_context *ctx = event->ctx;
> -	int *can_add_hw = data;
> +	struct merge_sched_data *msd = data;
>  
>  	if (event->state <= PERF_EVENT_STATE_OFF)
>  		return 0;
> @@ -3864,13 +3953,22 @@ static int merge_sched_in(struct perf_event *event, void *data)
>  	if (!event_filter_match(event))
>  		return 0;
>  
> -	if (group_can_go_on(event, *can_add_hw)) {
> +	/*
> +	 * Don't schedule in any exclude_guest events of PMU with
> +	 * PERF_PMU_CAP_PASSTHROUGH_VPMU, while a guest is running.
> +	 */
> +	if (__this_cpu_read(perf_in_guest) && event->attr.exclude_guest &&
> +	    event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU &&
> +	    !(msd->event_type & EVENT_GUEST))
> +		return 0;
> +

It is possible for event groups to have a mix of software and core PMU events.
If the group leader is a software event, event->pmu will point to the software
PMU but event->pmu_ctx->pmu will point to the core PMU. When perf_in_guest is
true for a CPU and the group leader is passed to merge_sched_in(), the condition
above fails as the software PMU does not have PERF_PMU_CAP_PASSTHROUGH_VPMU
capability. This can lead to group_sched_in() getting called later where all
the sibling events, which includes core PMU events that are not supposed to
be scheduled in, to be brought in. So event->pmu_ctx->pmu->capabilities needs
to be looked at instead.

> +	if (group_can_go_on(event, msd->can_add_hw)) {
>  		if (!group_sched_in(event, ctx))
>  			list_add_tail(&event->active_list, get_event_list(event));
>  	}
>  
>  	if (event->state == PERF_EVENT_STATE_INACTIVE) {
> -		*can_add_hw = 0;
> +		msd->can_add_hw = 0;
>  		if (event->attr.pinned) {
>  			perf_cgroup_event_disable(event, ctx);
>  			perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
> @@ -3889,11 +3987,15 @@ static int merge_sched_in(struct perf_event *event, void *data)
>  
>  static void pmu_groups_sched_in(struct perf_event_context *ctx,
>  				struct perf_event_groups *groups,
> -				struct pmu *pmu)
> +				struct pmu *pmu,
> +				enum event_type_t event_type)
>  {
> -	int can_add_hw = 1;
> +	struct merge_sched_data msd = {
> +		.can_add_hw = 1,
> +		.event_type = event_type,
> +	};
>  	visit_groups_merge(ctx, groups, smp_processor_id(), pmu,
> -			   merge_sched_in, &can_add_hw);
> +			   merge_sched_in, &msd);
>  }
>  
>  static void ctx_groups_sched_in(struct perf_event_context *ctx,
> @@ -3905,14 +4007,14 @@ static void ctx_groups_sched_in(struct perf_event_context *ctx,
>  	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
>  		if (perf_skip_pmu_ctx(pmu_ctx, event_type))
>  			continue;
> -		pmu_groups_sched_in(ctx, groups, pmu_ctx->pmu);
> +		pmu_groups_sched_in(ctx, groups, pmu_ctx->pmu, event_type);
>  	}
>  }
>  
>  static void __pmu_ctx_sched_in(struct perf_event_context *ctx,
>  			       struct pmu *pmu)
>  {
> -	pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu);
> +	pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu, 0);
>  }
>  
>  static void
> @@ -3927,9 +4029,11 @@ ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type)
>  		return;
>  
>  	if (!(is_active & EVENT_TIME)) {
> +		/* EVENT_TIME should be active while the guest runs */
> +		WARN_ON_ONCE(event_type & EVENT_GUEST);
>  		/* start ctx time */
>  		__update_context_time(ctx, false);
> -		perf_cgroup_set_timestamp(cpuctx);
> +		perf_cgroup_set_timestamp(cpuctx, false);
>  		/*
>  		 * CPU-release for the below ->is_active store,
>  		 * see __load_acquire() in perf_event_time_now()
> @@ -3945,7 +4049,23 @@ ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type)
>  			WARN_ON_ONCE(cpuctx->task_ctx != ctx);
>  	}
>  
> -	is_active ^= ctx->is_active; /* changed bits */
> +	if (event_type & EVENT_GUEST) {
> +		/*
> +		 * Schedule in all !exclude_guest events of PMU
> +		 * with PERF_PMU_CAP_PASSTHROUGH_VPMU.
> +		 */
> +		is_active = EVENT_ALL;
> +
> +		/*
> +		 * Update ctx time to set the new start time for
> +		 * the exclude_guest events.
> +		 */
> +		update_context_time(ctx);
> +		update_cgrp_time_from_cpuctx(cpuctx, false);
> +		barrier();
> +	} else {
> +		is_active ^= ctx->is_active; /* changed bits */
> +	}
>  
>  	/*
>  	 * First go through the list and put on any pinned groups


^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH v3 09/58] perf: Add a EVENT_GUEST flag
  2024-12-13  9:37   ` Sandipan Das
@ 2024-12-13 16:26     ` Liang, Kan
  0 siblings, 0 replies; 183+ messages in thread
From: Liang, Kan @ 2024-12-13 16:26 UTC (permalink / raw)
  To: Sandipan Das, Mingwei Zhang, Kan Liang
  Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
	Manali Shukla, Jim Mattson, Stephane Eranian, Ian Rogers,
	Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users



On 2024-12-13 4:37 a.m., Sandipan Das wrote:
> On 8/1/2024 10:28 AM, Mingwei Zhang wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> Current perf doesn't explicitly schedule out all exclude_guest events
>> while the guest is running. This is not a problem for the current emulated
>> vPMU, because perf owns all the PMU counters. It can mask the counter that is
>> assigned to an exclude_guest event when a guest is running (the Intel way), or
>> set the corresponding HOSTONLY bit in the eventsel (the AMD way). The counter
>> doesn't count when a guest is running.
>>
>> However, neither way works with the introduced passthrough vPMU. A guest owns
>> all the PMU counters when it's running. The host should not mask any counters:
>> a counter may be in use by the guest, and its eventsel may be overwritten.
>>
>> Perf should explicitly schedule out all exclude_guest events to release
>> the PMU resources when entering a guest, and resume the counting when
>> exiting the guest.
>>
>> It's possible that an exclude_guest event is created when a guest is
>> running. The new event should not be scheduled in either.
>>
>> The ctx time is shared among different PMUs, so it cannot be stopped while a
>> guest is running; it is still required to calculate the time for events from
>> other PMUs, e.g., uncore events. Add timeguest to track the guest run time.
>> For an exclude_guest event, the elapsed time equals the ctx time minus the
>> guest time. Cgroups have dedicated times; use the same method to deduct the
>> guest time from the cgroup time as well.
>>
>> Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>  include/linux/perf_event.h |   6 ++
>>  kernel/events/core.c       | 178 +++++++++++++++++++++++++++++++------
>>  2 files changed, 155 insertions(+), 29 deletions(-)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index e22cdb6486e6..81a5f8399cb8 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -952,6 +952,11 @@ struct perf_event_context {
>>  	 */
>>  	struct perf_time_ctx		time;
>>  
>> +	/*
>> +	 * Context clock, runs when in the guest mode.
>> +	 */
>> +	struct perf_time_ctx		timeguest;
>> +
>>  	/*
>>  	 * These fields let us detect when two contexts have both
>>  	 * been cloned (inherited) from a common ancestor.
>> @@ -1044,6 +1049,7 @@ struct bpf_perf_event_data_kern {
>>   */
>>  struct perf_cgroup_info {
>>  	struct perf_time_ctx		time;
>> +	struct perf_time_ctx		timeguest;
>>  	int				active;
>>  };
>>  
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index c25e2bf27001..57648736e43e 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -376,7 +376,8 @@ enum event_type_t {
>>  	/* see ctx_resched() for details */
>>  	EVENT_CPU = 0x8,
>>  	EVENT_CGROUP = 0x10,
>> -	EVENT_FLAGS = EVENT_CGROUP,
>> +	EVENT_GUEST = 0x20,
>> +	EVENT_FLAGS = EVENT_CGROUP | EVENT_GUEST,
>>  	EVENT_ALL = EVENT_FLEXIBLE | EVENT_PINNED,
>>  };
>>  
>> @@ -407,6 +408,7 @@ static atomic_t nr_include_guest_events __read_mostly;
>>  
>>  static atomic_t nr_mediated_pmu_vms;
>>  static DEFINE_MUTEX(perf_mediated_pmu_mutex);
>> +static DEFINE_PER_CPU(bool, perf_in_guest);
>>  
>>  /* !exclude_guest event of PMU with PERF_PMU_CAP_PASSTHROUGH_VPMU */
>>  static inline bool is_include_guest_event(struct perf_event *event)
>> @@ -706,6 +708,10 @@ static bool perf_skip_pmu_ctx(struct perf_event_pmu_context *pmu_ctx,
>>  	if ((event_type & EVENT_CGROUP) && !pmu_ctx->nr_cgroups)
>>  		return true;
>>  
>> +	if ((event_type & EVENT_GUEST) &&
>> +	    !(pmu_ctx->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU))
>> +		return true;
>> +
>>  	return false;
>>  }
>>  
>> @@ -770,12 +776,21 @@ static inline int is_cgroup_event(struct perf_event *event)
>>  	return event->cgrp != NULL;
>>  }
>>  
>> +static inline u64 __perf_event_time_ctx(struct perf_event *event,
>> +					struct perf_time_ctx *time,
>> +					struct perf_time_ctx *timeguest);
>> +
>> +static inline u64 __perf_event_time_ctx_now(struct perf_event *event,
>> +					    struct perf_time_ctx *time,
>> +					    struct perf_time_ctx *timeguest,
>> +					    u64 now);
>> +
>>  static inline u64 perf_cgroup_event_time(struct perf_event *event)
>>  {
>>  	struct perf_cgroup_info *t;
>>  
>>  	t = per_cpu_ptr(event->cgrp->info, event->cpu);
>> -	return t->time.time;
>> +	return __perf_event_time_ctx(event, &t->time, &t->timeguest);
>>  }
>>  
>>  static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
>> @@ -784,9 +799,9 @@ static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
>>  
>>  	t = per_cpu_ptr(event->cgrp->info, event->cpu);
>>  	if (!__load_acquire(&t->active))
>> -		return t->time.time;
>> -	now += READ_ONCE(t->time.offset);
>> -	return now;
>> +		return __perf_event_time_ctx(event, &t->time, &t->timeguest);
>> +
>> +	return __perf_event_time_ctx_now(event, &t->time, &t->timeguest, now);
>>  }
>>  
>>  static inline void update_perf_time_ctx(struct perf_time_ctx *time, u64 now, bool adv);
>> @@ -796,6 +811,18 @@ static inline void __update_cgrp_time(struct perf_cgroup_info *info, u64 now, bo
>>  	update_perf_time_ctx(&info->time, now, adv);
>>  }
>>  
>> +static inline void __update_cgrp_guest_time(struct perf_cgroup_info *info, u64 now, bool adv)
>> +{
>> +	update_perf_time_ctx(&info->timeguest, now, adv);
>> +}
>> +
>> +static inline void update_cgrp_time(struct perf_cgroup_info *info, u64 now)
>> +{
>> +	__update_cgrp_time(info, now, true);
>> +	if (__this_cpu_read(perf_in_guest))
>> +		__update_cgrp_guest_time(info, now, true);
>> +}
>> +
>>  static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx, bool final)
>>  {
>>  	struct perf_cgroup *cgrp = cpuctx->cgrp;
>> @@ -809,7 +836,7 @@ static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx,
>>  			cgrp = container_of(css, struct perf_cgroup, css);
>>  			info = this_cpu_ptr(cgrp->info);
>>  
>> -			__update_cgrp_time(info, now, true);
>> +			update_cgrp_time(info, now);
>>  			if (final)
>>  				__store_release(&info->active, 0);
>>  		}
>> @@ -832,11 +859,11 @@ static inline void update_cgrp_time_from_event(struct perf_event *event)
>>  	 * Do not update time when cgroup is not active
>>  	 */
>>  	if (info->active)
>> -		__update_cgrp_time(info, perf_clock(), true);
>> +		update_cgrp_time(info, perf_clock());
>>  }
>>  
>>  static inline void
>> -perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
>> +perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx, bool guest)
>>  {
>>  	struct perf_event_context *ctx = &cpuctx->ctx;
>>  	struct perf_cgroup *cgrp = cpuctx->cgrp;
>> @@ -856,8 +883,12 @@ perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
>>  	for (css = &cgrp->css; css; css = css->parent) {
>>  		cgrp = container_of(css, struct perf_cgroup, css);
>>  		info = this_cpu_ptr(cgrp->info);
>> -		__update_cgrp_time(info, ctx->time.stamp, false);
>> -		__store_release(&info->active, 1);
>> +		if (guest) {
>> +			__update_cgrp_guest_time(info, ctx->time.stamp, false);
>> +		} else {
>> +			__update_cgrp_time(info, ctx->time.stamp, false);
>> +			__store_release(&info->active, 1);
>> +		}
>>  	}
>>  }
>>  
>> @@ -1061,7 +1092,7 @@ static inline int perf_cgroup_connect(pid_t pid, struct perf_event *event,
>>  }
>>  
>>  static inline void
>> -perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
>> +perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx, bool guest)
>>  {
>>  }
>>  
>> @@ -1488,16 +1519,34 @@ static inline void update_perf_time_ctx(struct perf_time_ctx *time, u64 now, boo
>>   */
>>  static void __update_context_time(struct perf_event_context *ctx, bool adv)
>>  {
>> -	u64 now = perf_clock();
>> +	lockdep_assert_held(&ctx->lock);
>> +
>> +	update_perf_time_ctx(&ctx->time, perf_clock(), adv);
>> +}
>>  
>> +static void __update_context_guest_time(struct perf_event_context *ctx, bool adv)
>> +{
>>  	lockdep_assert_held(&ctx->lock);
>>  
>> -	update_perf_time_ctx(&ctx->time, now, adv);
>> +	/* must be called after __update_context_time(); */
>> +	update_perf_time_ctx(&ctx->timeguest, ctx->time.stamp, adv);
>>  }
>>  
>>  static void update_context_time(struct perf_event_context *ctx)
>>  {
>>  	__update_context_time(ctx, true);
>> +	if (__this_cpu_read(perf_in_guest))
>> +		__update_context_guest_time(ctx, true);
>> +}
>> +
>> +static inline u64 __perf_event_time_ctx(struct perf_event *event,
>> +					struct perf_time_ctx *time,
>> +					struct perf_time_ctx *timeguest)
>> +{
>> +	if (event->attr.exclude_guest)
>> +		return time->time - timeguest->time;
>> +	else
>> +		return time->time;
>>  }
>>  
>>  static u64 perf_event_time(struct perf_event *event)
>> @@ -1510,7 +1559,26 @@ static u64 perf_event_time(struct perf_event *event)
>>  	if (is_cgroup_event(event))
>>  		return perf_cgroup_event_time(event);
>>  
>> -	return ctx->time.time;
>> +	return __perf_event_time_ctx(event, &ctx->time, &ctx->timeguest);
>> +}
>> +
>> +static inline u64 __perf_event_time_ctx_now(struct perf_event *event,
>> +					    struct perf_time_ctx *time,
>> +					    struct perf_time_ctx *timeguest,
>> +					    u64 now)
>> +{
>> +	/*
>> +	 * The exclude_guest event time should be calculated from
>> +	 * the ctx time -  the guest time.
>> +	 * The ctx time is now + READ_ONCE(time->offset).
>> +	 * The guest time is now + READ_ONCE(timeguest->offset).
>> +	 * So the exclude_guest time is
>> +	 * READ_ONCE(time->offset) - READ_ONCE(timeguest->offset).
>> +	 */
>> +	if (event->attr.exclude_guest && __this_cpu_read(perf_in_guest))
>> +		return READ_ONCE(time->offset) - READ_ONCE(timeguest->offset);
>> +	else
>> +		return now + READ_ONCE(time->offset);
>>  }
>>  
>>  static u64 perf_event_time_now(struct perf_event *event, u64 now)
>> @@ -1524,10 +1592,9 @@ static u64 perf_event_time_now(struct perf_event *event, u64 now)
>>  		return perf_cgroup_event_time_now(event, now);
>>  
>>  	if (!(__load_acquire(&ctx->is_active) & EVENT_TIME))
>> -		return ctx->time.time;
>> +		return __perf_event_time_ctx(event, &ctx->time, &ctx->timeguest);
>>  
>> -	now += READ_ONCE(ctx->time.offset);
>> -	return now;
>> +	return __perf_event_time_ctx_now(event, &ctx->time, &ctx->timeguest, now);
>>  }
>>  
>>  static enum event_type_t get_event_type(struct perf_event *event)
>> @@ -3334,9 +3401,15 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
>>  	 * would only update time for the pinned events.
>>  	 */
>>  	if (is_active & EVENT_TIME) {
>> +		bool stop;
>> +
>> +		/* vPMU should not stop time */
>> +		stop = !(event_type & EVENT_GUEST) &&
>> +		       ctx == &cpuctx->ctx;
>> +
>>  		/* update (and stop) ctx time */
>>  		update_context_time(ctx);
>> -		update_cgrp_time_from_cpuctx(cpuctx, ctx == &cpuctx->ctx);
>> +		update_cgrp_time_from_cpuctx(cpuctx, stop);
>>  		/*
>>  		 * CPU-release for the below ->is_active store,
>>  		 * see __load_acquire() in perf_event_time_now()
>> @@ -3354,7 +3427,18 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
>>  			cpuctx->task_ctx = NULL;
>>  	}
>>  
>> -	is_active ^= ctx->is_active; /* changed bits */
>> +	if (event_type & EVENT_GUEST) {
>> +		/*
>> +		 * Schedule out all !exclude_guest events of PMU
>> +		 * with PERF_PMU_CAP_PASSTHROUGH_VPMU.
>> +		 */
>> +		is_active = EVENT_ALL;
>> +		__update_context_guest_time(ctx, false);
>> +		perf_cgroup_set_timestamp(cpuctx, true);
>> +		barrier();
>> +	} else {
>> +		is_active ^= ctx->is_active; /* changed bits */
>> +	}
>>  
>>  	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
>>  		if (perf_skip_pmu_ctx(pmu_ctx, event_type))
>> @@ -3853,10 +3937,15 @@ static inline void group_update_userpage(struct perf_event *group_event)
>>  		event_update_userpage(event);
>>  }
>>  
>> +struct merge_sched_data {
>> +	int can_add_hw;
>> +	enum event_type_t event_type;
>> +};
>> +
>>  static int merge_sched_in(struct perf_event *event, void *data)
>>  {
>>  	struct perf_event_context *ctx = event->ctx;
>> -	int *can_add_hw = data;
>> +	struct merge_sched_data *msd = data;
>>  
>>  	if (event->state <= PERF_EVENT_STATE_OFF)
>>  		return 0;
>> @@ -3864,13 +3953,22 @@ static int merge_sched_in(struct perf_event *event, void *data)
>>  	if (!event_filter_match(event))
>>  		return 0;
>>  
>> -	if (group_can_go_on(event, *can_add_hw)) {
>> +	/*
>> +	 * Don't schedule in any exclude_guest events of PMU with
>> +	 * PERF_PMU_CAP_PASSTHROUGH_VPMU, while a guest is running.
>> +	 */
>> +	if (__this_cpu_read(perf_in_guest) && event->attr.exclude_guest &&
>> +	    event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU &&
>> +	    !(msd->event_type & EVENT_GUEST))
>> +		return 0;
>> +
> 
> It is possible for event groups to have a mix of software and core PMU events.
> If the group leader is a software event, event->pmu will point to the software
> PMU but event->pmu_ctx->pmu will point to the core PMU. When perf_in_guest is
> true for a CPU and the group leader is passed to merge_sched_in(), the condition
> above fails as the software PMU does not have PERF_PMU_CAP_PASSTHROUGH_VPMU
> capability. This can lead to group_sched_in() getting called later, which
> brings in all the sibling events, including core PMU events that are not
> supposed to be scheduled in. So event->pmu_ctx->pmu->capabilities needs to
> be looked at instead.

Right, Thanks.
I will fix it in the next version.

Thanks,
Kan
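
For reference, an untested sketch of the corrected condition (only the pmu
pointer changes relative to the hunk above):

	/*
	 * Don't schedule in any exclude_guest events of a PMU with
	 * PERF_PMU_CAP_PASSTHROUGH_VPMU while a guest is running.
	 * Check pmu_ctx->pmu: for a software group leader with core-PMU
	 * siblings, event->pmu is the software PMU, but the group lives
	 * in the core PMU's pmu_ctx.
	 */
	if (__this_cpu_read(perf_in_guest) && event->attr.exclude_guest &&
	    event->pmu_ctx->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU &&
	    !(msd->event_type & EVENT_GUEST))
		return 0;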

> 
>> +	if (group_can_go_on(event, msd->can_add_hw)) {
>>  		if (!group_sched_in(event, ctx))
>>  			list_add_tail(&event->active_list, get_event_list(event));
>>  	}
>>  
>>  	if (event->state == PERF_EVENT_STATE_INACTIVE) {
>> -		*can_add_hw = 0;
>> +		msd->can_add_hw = 0;
>>  		if (event->attr.pinned) {
>>  			perf_cgroup_event_disable(event, ctx);
>>  			perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
>> @@ -3889,11 +3987,15 @@ static int merge_sched_in(struct perf_event *event, void *data)
>>  
>>  static void pmu_groups_sched_in(struct perf_event_context *ctx,
>>  				struct perf_event_groups *groups,
>> -				struct pmu *pmu)
>> +				struct pmu *pmu,
>> +				enum event_type_t event_type)
>>  {
>> -	int can_add_hw = 1;
>> +	struct merge_sched_data msd = {
>> +		.can_add_hw = 1,
>> +		.event_type = event_type,
>> +	};
>>  	visit_groups_merge(ctx, groups, smp_processor_id(), pmu,
>> -			   merge_sched_in, &can_add_hw);
>> +			   merge_sched_in, &msd);
>>  }
>>  
>>  static void ctx_groups_sched_in(struct perf_event_context *ctx,
>> @@ -3905,14 +4007,14 @@ static void ctx_groups_sched_in(struct perf_event_context *ctx,
>>  	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
>>  		if (perf_skip_pmu_ctx(pmu_ctx, event_type))
>>  			continue;
>> -		pmu_groups_sched_in(ctx, groups, pmu_ctx->pmu);
>> +		pmu_groups_sched_in(ctx, groups, pmu_ctx->pmu, event_type);
>>  	}
>>  }
>>  
>>  static void __pmu_ctx_sched_in(struct perf_event_context *ctx,
>>  			       struct pmu *pmu)
>>  {
>> -	pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu);
>> +	pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu, 0);
>>  }
>>  
>>  static void
>> @@ -3927,9 +4029,11 @@ ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type)
>>  		return;
>>  
>>  	if (!(is_active & EVENT_TIME)) {
>> +		/* EVENT_TIME should be active while the guest runs */
>> +		WARN_ON_ONCE(event_type & EVENT_GUEST);
>>  		/* start ctx time */
>>  		__update_context_time(ctx, false);
>> -		perf_cgroup_set_timestamp(cpuctx);
>> +		perf_cgroup_set_timestamp(cpuctx, false);
>>  		/*
>>  		 * CPU-release for the below ->is_active store,
>>  		 * see __load_acquire() in perf_event_time_now()
>> @@ -3945,7 +4049,23 @@ ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type)
>>  			WARN_ON_ONCE(cpuctx->task_ctx != ctx);
>>  	}
>>  
>> -	is_active ^= ctx->is_active; /* changed bits */
>> +	if (event_type & EVENT_GUEST) {
>> +		/*
>> +		 * Schedule in all !exclude_guest events of PMU
>> +		 * with PERF_PMU_CAP_PASSTHROUGH_VPMU.
>> +		 */
>> +		is_active = EVENT_ALL;
>> +
>> +		/*
>> +		 * Update ctx time to set the new start time for
>> +		 * the exclude_guest events.
>> +		 */
>> +		update_context_time(ctx);
>> +		update_cgrp_time_from_cpuctx(cpuctx, false);
>> +		barrier();
>> +	} else {
>> +		is_active ^= ctx->is_active; /* changed bits */
>> +	}
>>  
>>  	/*
>>  	 * First go through the list and put on any pinned groups
> 
> 



* Re: [RFC PATCH v3 18/58] KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter
  2024-11-20  3:21     ` Mi, Dapeng
  2024-11-20 17:06       ` Sean Christopherson
@ 2025-01-15  0:17       ` Mingwei Zhang
  2025-01-15  2:52         ` Mi, Dapeng
  1 sibling, 1 reply; 183+ messages in thread
From: Mingwei Zhang @ 2025-01-15  0:17 UTC (permalink / raw)
  To: Mi, Dapeng
  Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Kan Liang,
	Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
	Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
	Samantha Alt, Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users

On Wed, Nov 20, 2024, Mi, Dapeng wrote:
> 
> On 11/19/2024 10:30 PM, Sean Christopherson wrote:
> > As per my feedback in the initial RFC[*]:
> >
> >  2. The module param absolutely must not be exposed to userspace until all patches
> >     are in place.  The easiest way to do that without creating dependency hell is
> >     to simply not create the module param.
> >
> > [*] https://lore.kernel.org/all/ZhhQBHQ6V7Zcb8Ve@google.com
> 
> Sure. It looks like we missed this comment. Will address it.
> 

Dapeng, I just synced with Sean offline. His point is that we still need the
kernel parameter, but it should be introduced at the end of the series so
that a bisect never lands on a kernel with the new PMU logic only partially
in place. But I think you are right to create a global config and default it
to false.
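
Concretely, the idea would be something like the sketch below (illustration
only, not the final code): carry the internal flag through the whole series,
defaulting to false, and let only the final patch add the module_param()
that makes it user-visible, so a bisect never lands on a kernel where the
knob exists but the implementation is incomplete.

	/* kvm/x86.c (or vendor module): present from the start, default false */
	bool __read_mostly enable_passthrough_pmu;
	EXPORT_SYMBOL_GPL(enable_passthrough_pmu);

	/* added only by the last patch of the series */
	module_param(enable_passthrough_pmu, bool, 0444);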

> 
> >
> > On Thu, Aug 01, 2024, Mingwei Zhang wrote:
> >> Introduce enable_passthrough_pmu as a RO KVM kernel module parameter. This
> >> variable is true only when the following conditions are satisfied:
> >>  - set to true when module loaded.
> >>  - enable_pmu is true.
> >>  - is running on Intel CPU.
> >>  - supports PerfMon v4.
> >>  - host PMU supports passthrough mode.
> >>
> >> The value is always read-only because passthrough PMU currently does not
> >> support features like LBR and PEBS, while the emulated PMU does. This will end
> >> up with two different values for kvm_cap.supported_perf_cap, which is
> >> initialized at module load time. Maintaining two different perf
> >> capabilities will add complexity. Further, there is not enough motivation
> >> to support running two types of PMU implementations at the same time,
> >> although it is possible/feasible in reality.
> >>
> >> Finally, always propagate enable_passthrough_pmu and perf_capabilities into
> >> kvm->arch for each KVM instance.
> >>
> >> Co-developed-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> >> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> >> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> >> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
> >> ---
> >>  arch/x86/include/asm/kvm_host.h |  1 +
> >>  arch/x86/kvm/pmu.h              | 14 ++++++++++++++
> >>  arch/x86/kvm/vmx/vmx.c          |  7 +++++--
> >>  arch/x86/kvm/x86.c              |  8 ++++++++
> >>  arch/x86/kvm/x86.h              |  1 +
> >>  5 files changed, 29 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> >> index f8ca74e7678f..a15c783f20b9 100644
> >> --- a/arch/x86/include/asm/kvm_host.h
> >> +++ b/arch/x86/include/asm/kvm_host.h
> >> @@ -1406,6 +1406,7 @@ struct kvm_arch {
> >>  
> >>  	bool bus_lock_detection_enabled;
> >>  	bool enable_pmu;
> >> +	bool enable_passthrough_pmu;
> > Again, as I suggested/requested in the initial RFC[*], drop the per-VM flag as well
> > as kvm_pmu.passthrough.  There is zero reason to cache the module param.  KVM
> > should always query kvm->arch.enable_pmu prior to checking if the mediated PMU
> > is enabled, so I doubt we even need a helper to check both.
> >
> > [*] https://lore.kernel.org/all/ZhhOEDAl6k-NzOkM@google.com
> 
> Sure.
> 
> 
> >
> >>  
> >>  	u32 notify_window;
> >>  	u32 notify_vmexit_flags;
> >> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> >> index 4d52b0b539ba..cf93be5e7359 100644
> >> --- a/arch/x86/kvm/pmu.h
> >> +++ b/arch/x86/kvm/pmu.h
> >> @@ -208,6 +208,20 @@ static inline void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops)
> >>  			enable_pmu = false;
> >>  	}
> >>  
> >> +	/* Pass-through vPMU is only supported in Intel CPUs. */
> >> +	if (!is_intel)
> >> +		enable_passthrough_pmu = false;
> >> +
> >> +	/*
> >> +	 * Pass-through vPMU requires at least PerfMon version 4 because the
> >> +	 * implementation requires the usage of MSR_CORE_PERF_GLOBAL_STATUS_SET
> >> +	 * for counter emulation as well as PMU context switch.  In addition, it
> >> +	 * requires host PMU support on passthrough mode. Disable pass-through
> >> +	 * vPMU if any condition fails.
> >> +	 */
> >> +	if (!enable_pmu || kvm_pmu_cap.version < 4 || !kvm_pmu_cap.passthrough)
> > As is quite obvious by the end of the series, the v4 requirement is specific to
> > Intel.
> >
> > 	if (!enable_pmu || !kvm_pmu_cap.passthrough ||
> > 	    (is_intel && kvm_pmu_cap.version < 4) ||
> > 	    (is_amd && kvm_pmu_cap.version < 2))
> > 		enable_passthrough_pmu = false;
> >
> > Furthermore, there is zero reason to explicitly and manually check the vendor,
> > kvm_init_pmu_capability() takes kvm_pmu_ops.  Adding a callback is somewhat
> > undesirable as it would lead to duplicate code, but we can still provide separation
> > of concerns by adding const variables to kvm_pmu_ops, a la MAX_NR_GP_COUNTERS.
> >
> > E.g.
> >
> > 	if (enable_pmu) {
> > 		perf_get_x86_pmu_capability(&kvm_pmu_cap);
> >
> > 		/*
> > 		 * WARN if perf did NOT disable hardware PMU if the number of
> > 		 * architecturally required GP counters aren't present, i.e. if
> > 		 * there are a non-zero number of counters, but fewer than what
> > 		 * is architecturally required.
> > 		 */
> > 		if (!kvm_pmu_cap.num_counters_gp ||
> > 		    WARN_ON_ONCE(kvm_pmu_cap.num_counters_gp < min_nr_gp_ctrs))
> > 			enable_pmu = false;
> > 		else if (pmu_ops->MIN_PMU_VERSION > kvm_pmu_cap.version)
> > 			enable_pmu = false;
> > 	}
> >
> > 	if (!enable_pmu || !kvm_pmu_cap.passthrough ||
> > 	    pmu_ops->MIN_MEDIATED_PMU_VERSION > kvm_pmu_cap.version)
> > 		enable_mediated_pmu = false;
> 
> Sure.  would do.
> 
> 
> >> +		enable_passthrough_pmu = false;
> >> +
> >>  	if (!enable_pmu) {
> >>  		memset(&kvm_pmu_cap, 0, sizeof(kvm_pmu_cap));
> >>  		return;
> >> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> >> index ad465881b043..2ad122995f11 100644
> >> --- a/arch/x86/kvm/vmx/vmx.c
> >> +++ b/arch/x86/kvm/vmx/vmx.c
> >> @@ -146,6 +146,8 @@ module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
> >>  extern bool __read_mostly allow_smaller_maxphyaddr;
> >>  module_param(allow_smaller_maxphyaddr, bool, S_IRUGO);
> >>  
> >> +module_param(enable_passthrough_pmu, bool, 0444);
> > Hmm, we either need to put this param in kvm.ko, or move enable_pmu to vendor
> > modules (or duplicate it there if we need to for backwards compatibility?).
> >
> > There are advantages to putting params in vendor modules, when it's safe to do so,
> > e.g. it allows toggling the param when (re)loading a vendor module, so I think I'm
> > supportive of having the param live in vendor code.  I just don't want to split
> > the two PMU knobs.
> 
> Since enable_passthrough_pmu already lives in the vendor modules, we'd better
> duplicate the enable_pmu module parameter in the vendor modules as the first
> step.
> 
> 
> >
> >>  #define KVM_VM_CR0_ALWAYS_OFF (X86_CR0_NW | X86_CR0_CD)
> >>  #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST X86_CR0_NE
> >>  #define KVM_VM_CR0_ALWAYS_ON				\
> >> @@ -7924,7 +7926,8 @@ static __init u64 vmx_get_perf_capabilities(void)
> >>  	if (boot_cpu_has(X86_FEATURE_PDCM))
> >>  		rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_perf_cap);
> >>  
> >> -	if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR)) {
> >> +	if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR) &&
> >> +	    !enable_passthrough_pmu) {
> >>  		x86_perf_get_lbr(&vmx_lbr_caps);
> >>  
> >>  		/*
> >> @@ -7938,7 +7941,7 @@ static __init u64 vmx_get_perf_capabilities(void)
> >>  			perf_cap |= host_perf_cap & PMU_CAP_LBR_FMT;
> >>  	}
> >>  
> >> -	if (vmx_pebs_supported()) {
> >> +	if (vmx_pebs_supported() && !enable_passthrough_pmu) {
> > Checking enable_mediated_pmu belongs in vmx_pebs_supported(), not in here,
> > otherwise KVM will incorrectly advertise support to userspace:
> >
> > 	if (vmx_pebs_supported()) {
> > 		kvm_cpu_cap_check_and_set(X86_FEATURE_DS);
> > 		kvm_cpu_cap_check_and_set(X86_FEATURE_DTES64);
> > 	}
> 
> Sure. Thanks for pointing this out.
> 
> 
> >
> >>  		perf_cap |= host_perf_cap & PERF_CAP_PEBS_MASK;
> >>  		/*
> >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >> index f1d589c07068..0c40f551130e 100644
> >> --- a/arch/x86/kvm/x86.c
> >> +++ b/arch/x86/kvm/x86.c
> >> @@ -187,6 +187,10 @@ bool __read_mostly enable_pmu = true;
> >>  EXPORT_SYMBOL_GPL(enable_pmu);
> >>  module_param(enable_pmu, bool, 0444);
> >>  
> >> +/* Enable/disable mediated passthrough PMU virtualization */
> >> +bool __read_mostly enable_passthrough_pmu;
> >> +EXPORT_SYMBOL_GPL(enable_passthrough_pmu);
> >> +
> >>  bool __read_mostly eager_page_split = true;
> >>  module_param(eager_page_split, bool, 0644);
> >>  
> >> @@ -6682,6 +6686,9 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> >>  		mutex_lock(&kvm->lock);
> >>  		if (!kvm->created_vcpus) {
> >>  			kvm->arch.enable_pmu = !(cap->args[0] & KVM_PMU_CAP_DISABLE);
> >> +			/* Disable passthrough PMU if enable_pmu is false. */
> >> +			if (!kvm->arch.enable_pmu)
> >> +				kvm->arch.enable_passthrough_pmu = false;
> > And this code obviously goes away if the per-VM snapshot is removed.


* Re: [RFC PATCH v3 18/58] KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter
  2025-01-15  0:17       ` Mingwei Zhang
@ 2025-01-15  2:52         ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2025-01-15  2:52 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Kan Liang,
	Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
	Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
	Samantha Alt, Zhiyuan Lv, Yanfei Xu, Like Xu, Peter Zijlstra,
	Raghavendra Rao Ananta, kvm, linux-perf-users


On 1/15/2025 8:17 AM, Mingwei Zhang wrote:
> On Wed, Nov 20, 2024, Mi, Dapeng wrote:
>> On 11/19/2024 10:30 PM, Sean Christopherson wrote:
>>> As per my feedback in the initial RFC[*]:
>>>
>>>  2. The module param absolutely must not be exposed to userspace until all patches
>>>     are in place.  The easiest way to do that without creating dependency hell is
>>>     to simply not create the module param.
>>>
>>> [*] https://lore.kernel.org/all/ZhhQBHQ6V7Zcb8Ve@google.com
>> Sure. It looks like we missed this comment. Will address it.
>>
> Dapeng, I just synced with Sean offline. His point is that we still need the
> kernel parameter, but it should be introduced at the end of the series so
> that a bisect never lands on a kernel with the new PMU logic only partially
> in place. But I think you are right to create a global config and default it
> to false.

OK, I will add a patch to expose the kernel parameter and put it at the end
of the series. Thanks.


>
>>> On Thu, Aug 01, 2024, Mingwei Zhang wrote:
>>>> Introduce enable_passthrough_pmu as a RO KVM kernel module parameter. This
>>>> variable is true only when the following conditions are satisfied:
>>>>  - set to true when module loaded.
>>>>  - enable_pmu is true.
>>>>  - is running on Intel CPU.
>>>>  - supports PerfMon v4.
>>>>  - host PMU supports passthrough mode.
>>>>
>>>> The value is always read-only because passthrough PMU currently does not
>>>> support features like LBR and PEBS, while the emulated PMU does. This will end
>>>> up with two different values for kvm_cap.supported_perf_cap, which is
>>>> initialized at module load time. Maintaining two different perf
>>>> capabilities will add complexity. Further, there is not enough motivation
>>>> to support running two types of PMU implementations at the same time,
>>>> although it is possible/feasible in reality.
>>>>
>>>> Finally, always propagate enable_passthrough_pmu and perf_capabilities into
>>>> kvm->arch for each KVM instance.
>>>>
>>>> Co-developed-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>>>> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>>>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>>> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
>>>> ---
>>>>  arch/x86/include/asm/kvm_host.h |  1 +
>>>>  arch/x86/kvm/pmu.h              | 14 ++++++++++++++
>>>>  arch/x86/kvm/vmx/vmx.c          |  7 +++++--
>>>>  arch/x86/kvm/x86.c              |  8 ++++++++
>>>>  arch/x86/kvm/x86.h              |  1 +
>>>>  5 files changed, 29 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>>> index f8ca74e7678f..a15c783f20b9 100644
>>>> --- a/arch/x86/include/asm/kvm_host.h
>>>> +++ b/arch/x86/include/asm/kvm_host.h
>>>> @@ -1406,6 +1406,7 @@ struct kvm_arch {
>>>>  
>>>>  	bool bus_lock_detection_enabled;
>>>>  	bool enable_pmu;
>>>> +	bool enable_passthrough_pmu;
>>> Again, as I suggested/requested in the initial RFC[*], drop the per-VM flag as well
>>> as kvm_pmu.passthrough.  There is zero reason to cache the module param.  KVM
>>> should always query kvm->arch.enable_pmu prior to checking if the mediated PMU
>>> is enabled, so I doubt we even need a helper to check both.
>>>
>>> [*] https://lore.kernel.org/all/ZhhOEDAl6k-NzOkM@google.com
>> Sure.
>>
>>
>>>>  
>>>>  	u32 notify_window;
>>>>  	u32 notify_vmexit_flags;
>>>> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
>>>> index 4d52b0b539ba..cf93be5e7359 100644
>>>> --- a/arch/x86/kvm/pmu.h
>>>> +++ b/arch/x86/kvm/pmu.h
>>>> @@ -208,6 +208,20 @@ static inline void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops)
>>>>  			enable_pmu = false;
>>>>  	}
>>>>  
>>>> +	/* Pass-through vPMU is only supported in Intel CPUs. */
>>>> +	if (!is_intel)
>>>> +		enable_passthrough_pmu = false;
>>>> +
>>>> +	/*
>>>> +	 * Pass-through vPMU requires at least PerfMon version 4 because the
>>>> +	 * implementation requires the usage of MSR_CORE_PERF_GLOBAL_STATUS_SET
>>>> +	 * for counter emulation as well as PMU context switch.  In addition, it
>>>> +	 * requires host PMU support on passthrough mode. Disable pass-through
>>>> +	 * vPMU if any condition fails.
>>>> +	 */
>>>> +	if (!enable_pmu || kvm_pmu_cap.version < 4 || !kvm_pmu_cap.passthrough)
>>> As is quite obvious by the end of the series, the v4 requirement is specific to
>>> Intel.
>>>
>>> 	if (!enable_pmu || !kvm_pmu_cap.passthrough ||
>>> 	    (is_intel && kvm_pmu_cap.version < 4) ||
>>> 	    (is_amd && kvm_pmu_cap.version < 2))
>>> 		enable_passthrough_pmu = false;
>>>
>>> Furthermore, there is zero reason to explicitly and manually check the vendor,
>>> kvm_init_pmu_capability() takes kvm_pmu_ops.  Adding a callback is somewhat
>>> undesirable as it would lead to duplicate code, but we can still provide separation
>>> of concerns by adding const variables to kvm_pmu_ops, a la MAX_NR_GP_COUNTERS.
>>>
>>> E.g.
>>>
>>> 	if (enable_pmu) {
>>> 		perf_get_x86_pmu_capability(&kvm_pmu_cap);
>>>
>>> 		/*
>>> 		 * WARN if perf did NOT disable hardware PMU if the number of
>>> 		 * architecturally required GP counters aren't present, i.e. if
>>> 		 * there are a non-zero number of counters, but fewer than what
>>> 		 * is architecturally required.
>>> 		 */
>>> 		if (!kvm_pmu_cap.num_counters_gp ||
>>> 		    WARN_ON_ONCE(kvm_pmu_cap.num_counters_gp < min_nr_gp_ctrs))
>>> 			enable_pmu = false;
>>> 		else if (pmu_ops->MIN_PMU_VERSION > kvm_pmu_cap.version)
>>> 			enable_pmu = false;
>>> 	}
>>>
>>> 	if (!enable_pmu || !kvm_pmu_cap.passthrough ||
>>> 	    pmu_ops->MIN_MEDIATED_PMU_VERSION > kvm_pmu_cap.version)
>>> 		enable_mediated_pmu = false;
>> Sure.  would do.
>>
>>
>>>> +		enable_passthrough_pmu = false;
>>>> +
>>>>  	if (!enable_pmu) {
>>>>  		memset(&kvm_pmu_cap, 0, sizeof(kvm_pmu_cap));
>>>>  		return;
>>>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>>>> index ad465881b043..2ad122995f11 100644
>>>> --- a/arch/x86/kvm/vmx/vmx.c
>>>> +++ b/arch/x86/kvm/vmx/vmx.c
>>>> @@ -146,6 +146,8 @@ module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
>>>>  extern bool __read_mostly allow_smaller_maxphyaddr;
>>>>  module_param(allow_smaller_maxphyaddr, bool, S_IRUGO);
>>>>  
>>>> +module_param(enable_passthrough_pmu, bool, 0444);
>>> Hmm, we either need to put this param in kvm.ko, or move enable_pmu to vendor
>>> modules (or duplicate it there if we need to for backwards compatibility?).
>>>
>>> There are advantages to putting params in vendor modules, when it's safe to do so,
>>> e.g. it allows toggling the param when (re)loading a vendor module, so I think I'm
>>> supportive of having the param live in vendor code.  I just don't want to split
>>> the two PMU knobs.
>> Since enable_passthrough_pmu already lives in the vendor modules, we'd better
>> duplicate the enable_pmu module parameter in the vendor modules as the first
>> step.
>>
>>
>>>>  #define KVM_VM_CR0_ALWAYS_OFF (X86_CR0_NW | X86_CR0_CD)
>>>>  #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST X86_CR0_NE
>>>>  #define KVM_VM_CR0_ALWAYS_ON				\
>>>> @@ -7924,7 +7926,8 @@ static __init u64 vmx_get_perf_capabilities(void)
>>>>  	if (boot_cpu_has(X86_FEATURE_PDCM))
>>>>  		rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_perf_cap);
>>>>  
>>>> -	if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR)) {
>>>> +	if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR) &&
>>>> +	    !enable_passthrough_pmu) {
>>>>  		x86_perf_get_lbr(&vmx_lbr_caps);
>>>>  
>>>>  		/*
>>>> @@ -7938,7 +7941,7 @@ static __init u64 vmx_get_perf_capabilities(void)
>>>>  			perf_cap |= host_perf_cap & PMU_CAP_LBR_FMT;
>>>>  	}
>>>>  
>>>> -	if (vmx_pebs_supported()) {
>>>> +	if (vmx_pebs_supported() && !enable_passthrough_pmu) {
>>> Checking enable_mediated_pmu belongs in vmx_pebs_supported(), not in here,
>>> otherwise KVM will incorrectly advertise support to userspace:
>>>
>>> 	if (vmx_pebs_supported()) {
>>> 		kvm_cpu_cap_check_and_set(X86_FEATURE_DS);
>>> 		kvm_cpu_cap_check_and_set(X86_FEATURE_DTES64);
>>> 	}
>> Sure. Thanks for pointing this out.
>>
>>
>>>>  		perf_cap |= host_perf_cap & PERF_CAP_PEBS_MASK;
>>>>  		/*
>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>> index f1d589c07068..0c40f551130e 100644
>>>> --- a/arch/x86/kvm/x86.c
>>>> +++ b/arch/x86/kvm/x86.c
>>>> @@ -187,6 +187,10 @@ bool __read_mostly enable_pmu = true;
>>>>  EXPORT_SYMBOL_GPL(enable_pmu);
>>>>  module_param(enable_pmu, bool, 0444);
>>>>  
>>>> +/* Enable/disable mediated passthrough PMU virtualization */
>>>> +bool __read_mostly enable_passthrough_pmu;
>>>> +EXPORT_SYMBOL_GPL(enable_passthrough_pmu);
>>>> +
>>>>  bool __read_mostly eager_page_split = true;
>>>>  module_param(eager_page_split, bool, 0644);
>>>>  
>>>> @@ -6682,6 +6686,9 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>>>>  		mutex_lock(&kvm->lock);
>>>>  		if (!kvm->created_vcpus) {
>>>>  			kvm->arch.enable_pmu = !(cap->args[0] & KVM_PMU_CAP_DISABLE);
>>>> +			/* Disable passthrough PMU if enable_pmu is false. */
>>>> +			if (!kvm->arch.enable_pmu)
>>>> +				kvm->arch.enable_passthrough_pmu = false;
>>> And this code obviously goes away if the per-VM snapshot is removed.


* Re: [RFC PATCH v3 23/58] KVM: x86/pmu: Allow RDPMC pass through when all counters exposed to guest
  2024-11-20  5:31     ` Mi, Dapeng
@ 2025-01-22  5:08       ` Mi, Dapeng
  0 siblings, 0 replies; 183+ messages in thread
From: Mi, Dapeng @ 2025-01-22  5:08 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla,
	Sandipan Das, Jim Mattson, Stephane Eranian, Ian Rogers,
	Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
	Yanfei Xu, Like Xu, Peter Zijlstra, Raghavendra Rao Ananta, kvm,
	linux-perf-users


On 11/20/2024 1:31 PM, Mi, Dapeng wrote:
> On 11/20/2024 12:32 AM, Sean Christopherson wrote:
>> On Thu, Aug 01, 2024, Mingwei Zhang wrote:
>>> Clear RDPMC_EXITING in vmcs when all counters on the host side are exposed
>>> to guest VM. This gives performance to passthrough PMU. However, when guest
>>> does not get all counters, intercept RDPMC to prevent access to unexposed
>>> counters. Make decision in vmx_vcpu_after_set_cpuid() when guest enables
>>> PMU and passthrough PMU is enabled.
>>>
>>> Co-developed-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>>> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
>>> ---
>>>  arch/x86/kvm/pmu.c     | 16 ++++++++++++++++
>>>  arch/x86/kvm/pmu.h     |  1 +
>>>  arch/x86/kvm/vmx/vmx.c |  5 +++++
>>>  3 files changed, 22 insertions(+)
>>>
>>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>>> index e656f72fdace..19104e16a986 100644
>>> --- a/arch/x86/kvm/pmu.c
>>> +++ b/arch/x86/kvm/pmu.c
>>> @@ -96,6 +96,22 @@ void kvm_pmu_ops_update(const struct kvm_pmu_ops *pmu_ops)
>>>  #undef __KVM_X86_PMU_OP
>>>  }
>>>  
>>> +bool kvm_pmu_check_rdpmc_passthrough(struct kvm_vcpu *vcpu)
>> As suggested earlier, kvm_rdpmc_in_guest().
> Sure.
>
>
>>> +{
>>> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>>> +
>>> +	if (is_passthrough_pmu_enabled(vcpu) &&
>>> +	    !enable_vmware_backdoor &&
>> Please add a comment about the VMware backdoor, I doubt most folks know about
>> VMware's tweaks to RDPMC behavior.  It's somewhat obvious from the code and
>> comment in check_rdpmc(), but I think it's worth calling out here too.
> Sure.
>
>
>>> +	    pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp &&
>>> +	    pmu->nr_arch_fixed_counters == kvm_pmu_cap.num_counters_fixed &&
>>> +	    pmu->counter_bitmask[KVM_PMC_GP] == (((u64)1 << kvm_pmu_cap.bit_width_gp) - 1) &&
>>> +	    pmu->counter_bitmask[KVM_PMC_FIXED] == (((u64)1 << kvm_pmu_cap.bit_width_fixed)  - 1))
>> BIT_ULL?  GENMASK_ULL?
> Sure.
>
>
>>> +		return true;
>>> +
>>> +	return false;
>> Do this:
>>
>>
>> 	return <true>;
>>
>> not:
>>
>> 	if (<true>)
>> 		return true;
>>
>> 	return false;
>>
>> Short-circuiting on certain cases is fine, and I would probably vote for that so
>> it's easier to add comments, but that's obviously not what's done here.  E.g. either
>>
>> 	if (!enable_mediated_pmu)
>> 		return false;
>>
>> 	/* comment goes here */
>> 	if (enable_vmware_backdoor)
>> 		return false;
>>
>> 	return <counters checks>;
>>
>> or
>>
>> 	return <massive combined check>;
> Nice suggestion. Thanks.
>
>
>>> +}
>>> +EXPORT_SYMBOL_GPL(kvm_pmu_check_rdpmc_passthrough);
>> Maybe just make this an inline in a header?  enable_vmware_backdoor is exported,
>> and presumably enable_mediated_pmu will be too.
> Sure.
>
>
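
Putting those suggestions together, the reworked helper might look roughly
like the untested sketch below (it assumes the enable_mediated_pmu rename
and that the helper becomes a static inline in a header):

	static inline bool kvm_rdpmc_in_guest(struct kvm_vcpu *vcpu)
	{
		struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);

		if (!enable_mediated_pmu)
			return false;

		/*
		 * The VMware backdoor changes RDPMC behavior (pseudo
		 * performance counters), so RDPMC must stay intercepted
		 * in that case, see the comment in check_rdpmc().
		 */
		if (enable_vmware_backdoor)
			return false;

		return pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp &&
		       pmu->nr_arch_fixed_counters == kvm_pmu_cap.num_counters_fixed &&
		       pmu->counter_bitmask[KVM_PMC_GP] ==
				GENMASK_ULL(kvm_pmu_cap.bit_width_gp - 1, 0) &&
		       pmu->counter_bitmask[KVM_PMC_FIXED] ==
				GENMASK_ULL(kvm_pmu_cap.bit_width_fixed - 1, 0);
	}
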
>>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>>> index 4d60a8cf2dd1..339742350b7a 100644
>>> --- a/arch/x86/kvm/vmx/vmx.c
>>> +++ b/arch/x86/kvm/vmx/vmx.c
>>> @@ -7911,6 +7911,11 @@ void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>>>  		vmx->msr_ia32_feature_control_valid_bits &=
>>>  			~FEAT_CTL_SGX_LC_ENABLED;
>>>  
>>> +	if (kvm_pmu_check_rdpmc_passthrough(&vmx->vcpu))
>> No need to follow vmx->vcpu, @vcpu is readily available.
> Yes.
>
>
>>> +		exec_controls_clearbit(vmx, CPU_BASED_RDPMC_EXITING);
>>> +	else
>>> +		exec_controls_setbit(vmx, CPU_BASED_RDPMC_EXITING);
>> I wonder if it makes sense to add a helper to change a bit.  IIRC, the only reason
>> I didn't add one along with the set/clear helpers was because there weren't many
>> users and I couldn't think of a good alternative to "set".
>>
>> I still don't have a good name, but I think we're reaching the point where it's
>> worth forcing the issue to avoid common goofs, e.g. handling only the "clear"
>> case and not the "set" case.
>>
>> Maybe changebit?  E.g.
>>
>> static __always_inline void lname##_controls_changebit(struct vcpu_vmx *vmx, u##bits val,	\
>> 						       bool set)				\
>> {												\
>> 	if (set)										\
>> 		lname##_controls_setbit(vmx, val);						\
>> 	else											\
>> 		lname##_controls_clearbit(vmx, val);						\
>> }
>>
>>
>> and then vmx_refresh_apicv_exec_ctrl() can be:
>>
>> 	secondary_exec_controls_changebit(vmx,
>> 					  SECONDARY_EXEC_APIC_REGISTER_VIRT |
>> 					  SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY,
>> 					  kvm_vcpu_apicv_active(vcpu));
>> 	tertiary_exec_controls_changebit(vmx, TERTIARY_EXEC_IPI_VIRT,
>> 					 kvm_vcpu_apicv_active(vcpu) && enable_ipiv);
>>
>> and this can be:
>>
>> 	exec_controls_changebit(vmx, CPU_BASED_RDPMC_EXITING,
>> 				!kvm_rdpmc_in_guest(vcpu));
> Sure. Would add a separate patch to add these helpers.

Hi Sean,

The upcoming v4 patches will include some of the code you suggested in the
comments, like this one. I would like to add a Co-developed-by tag and the
corresponding SoB tag for you in any patch that includes code you suggested.
Is that OK? Or would you prefer a "Suggested-by" tag? Thanks.


>
>
>


Thread overview: 183+ messages
2024-08-01  4:58 [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Mingwei Zhang
2024-08-01  4:58 ` [RFC PATCH v3 01/58] sched/core: Move preempt_model_*() helpers from sched.h to preempt.h Mingwei Zhang
2024-08-01  4:58 ` [RFC PATCH v3 02/58] sched/core: Drop spinlocks on contention iff kernel is preemptible Mingwei Zhang
2024-08-01  4:58 ` [RFC PATCH v3 03/58] perf/x86: Do not set bit width for unavailable counters Mingwei Zhang
2024-08-01  4:58 ` [RFC PATCH v3 04/58] x86/msr: Define PerfCntrGlobalStatusSet register Mingwei Zhang
2024-08-01  4:58 ` [RFC PATCH v3 05/58] x86/msr: Introduce MSR_CORE_PERF_GLOBAL_STATUS_SET Mingwei Zhang
2024-08-01  4:58 ` [RFC PATCH v3 06/58] perf: Support get/put passthrough PMU interfaces Mingwei Zhang
2024-09-06 10:59   ` Mi, Dapeng
2024-09-06 15:40     ` Liang, Kan
2024-09-09 22:17       ` Namhyung Kim
2024-08-01  4:58 ` [RFC PATCH v3 07/58] perf: Skip pmu_ctx based on event_type Mingwei Zhang
2024-10-11 11:18   ` Peter Zijlstra
2024-08-01  4:58 ` [RFC PATCH v3 08/58] perf: Clean up perf ctx time Mingwei Zhang
2024-10-11 11:39   ` Peter Zijlstra
2024-08-01  4:58 ` [RFC PATCH v3 09/58] perf: Add a EVENT_GUEST flag Mingwei Zhang
2024-08-21  5:27   ` Mi, Dapeng
2024-08-21 13:16     ` Liang, Kan
2024-10-11 11:41     ` Peter Zijlstra
2024-10-11 13:16       ` Liang, Kan
2024-10-11 18:42   ` Peter Zijlstra
2024-10-11 19:49     ` Liang, Kan
2024-10-14 10:55       ` Peter Zijlstra
2024-10-14 11:14   ` Peter Zijlstra
2024-10-14 15:06     ` Liang, Kan
2024-12-13  9:37   ` Sandipan Das
2024-12-13 16:26     ` Liang, Kan
2024-08-01  4:58 ` [RFC PATCH v3 10/58] perf: Add generic exclude_guest support Mingwei Zhang
2024-10-14 11:20   ` Peter Zijlstra
2024-10-14 15:27     ` Liang, Kan
2024-08-01  4:58 ` [RFC PATCH v3 11/58] x86/irq: Factor out common code for installing kvm irq handler Mingwei Zhang
2024-08-01  4:58 ` [RFC PATCH v3 12/58] perf: core/x86: Register a new vector for KVM GUEST PMI Mingwei Zhang
2024-09-09 22:11   ` Colton Lewis
2024-09-10  4:59     ` Mi, Dapeng
2024-09-10 16:45       ` Colton Lewis
2024-08-01  4:58 ` [RFC PATCH v3 13/58] KVM: x86/pmu: Register KVM_GUEST_PMI_VECTOR handler Mingwei Zhang
2024-08-01  4:58 ` [RFC PATCH v3 14/58] perf: Add switch_interrupt() interface Mingwei Zhang
2024-09-19  6:02   ` Manali Shukla
2024-09-19 13:00     ` Liang, Kan
2024-09-20  5:09       ` Manali Shukla
2024-09-23 18:49         ` Mingwei Zhang
2024-09-24 16:55           ` Manali Shukla
2024-10-14 11:59           ` Peter Zijlstra
2024-10-14 16:15             ` Liang, Kan
2024-10-14 17:45               ` Peter Zijlstra
2024-10-15 15:59                 ` Liang, Kan
2024-10-14 11:56   ` Peter Zijlstra
2024-10-14 15:40     ` Liang, Kan
2024-10-14 17:47       ` Peter Zijlstra
2024-10-14 17:51       ` Peter Zijlstra
2024-10-14 12:03   ` Peter Zijlstra
2024-10-14 15:51     ` Liang, Kan
2024-10-14 17:49       ` Peter Zijlstra
2024-10-15 13:23         ` Liang, Kan
2024-10-14 13:52   ` Peter Zijlstra
2024-10-14 15:57     ` Liang, Kan
2024-08-01  4:58 ` [RFC PATCH v3 15/58] perf/x86: Support switch_interrupt interface Mingwei Zhang
2024-09-09 22:11   ` Colton Lewis
2024-09-10  5:00     ` Mi, Dapeng
2024-10-24 19:45     ` Chen, Zide
2024-10-25  0:52       ` Mi, Dapeng
2024-08-01  4:58 ` [RFC PATCH v3 16/58] perf/x86: Forbid PMI handler when guest own PMU Mingwei Zhang
2024-09-02  7:56   ` Mi, Dapeng
2024-08-01  4:58 ` [RFC PATCH v3 17/58] perf: core/x86: Plumb passthrough PMU capability from x86_pmu to x86_pmu_cap Mingwei Zhang
2024-08-01  4:58 ` [RFC PATCH v3 18/58] KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter Mingwei Zhang
2024-11-19 14:30   ` Sean Christopherson
2024-11-20  3:21     ` Mi, Dapeng
2024-11-20 17:06       ` Sean Christopherson
2025-01-15  0:17       ` Mingwei Zhang
2025-01-15  2:52         ` Mi, Dapeng
2024-08-01  4:58 ` [RFC PATCH v3 19/58] KVM: x86/pmu: Plumb through pass-through PMU to vcpu for Intel CPUs Mingwei Zhang
2024-11-19 14:54   ` Sean Christopherson
2024-11-20  3:47     ` Mi, Dapeng
2024-11-20 16:45       ` Sean Christopherson
2024-11-21  0:29         ` Mi, Dapeng
2024-08-01  4:58 ` [RFC PATCH v3 20/58] KVM: x86/pmu: Always set global enable bits in passthrough mode Mingwei Zhang
2024-11-19 15:37   ` Sean Christopherson
2024-11-20  5:19     ` Mi, Dapeng
2024-11-20 17:09       ` Sean Christopherson
2024-11-21  0:37         ` Mi, Dapeng
2024-08-01  4:58 ` [RFC PATCH v3 21/58] KVM: x86/pmu: Add a helper to check if passthrough PMU is enabled Mingwei Zhang
2024-08-01  4:58 ` [RFC PATCH v3 22/58] KVM: x86/pmu: Add host_perf_cap and initialize it in kvm_x86_vendor_init() Mingwei Zhang
2024-11-19 15:43   ` Sean Christopherson
2024-11-20  5:21     ` Mi, Dapeng
2024-08-01  4:58 ` [RFC PATCH v3 23/58] KVM: x86/pmu: Allow RDPMC pass through when all counters exposed to guest Mingwei Zhang
2024-11-19 16:32   ` Sean Christopherson
2024-11-20  5:31     ` Mi, Dapeng
2025-01-22  5:08       ` Mi, Dapeng
2024-08-01  4:58 ` [RFC PATCH v3 24/58] KVM: x86/pmu: Introduce macro PMU_CAP_PERF_METRICS Mingwei Zhang
2024-11-19 17:03   ` Sean Christopherson
2024-11-20  5:44     ` Mi, Dapeng
2024-11-20 17:21       ` Sean Christopherson
2024-08-01  4:58 ` [RFC PATCH v3 25/58] KVM: x86/pmu: Introduce PMU operator to check if rdpmc passthrough allowed Mingwei Zhang
2024-11-19 17:32   ` Sean Christopherson
2024-11-20  6:22     ` Mi, Dapeng
2024-08-01  4:58 ` [RFC PATCH v3 26/58] KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL Mingwei Zhang
2024-08-06  7:04   ` Mi, Dapeng
2024-10-24 20:26   ` Chen, Zide
2024-10-25  2:36     ` Mi, Dapeng
2024-11-19 18:16   ` Sean Christopherson
2024-11-20  7:56     ` Mi, Dapeng
2024-08-01  4:58 ` [RFC PATCH v3 27/58] KVM: x86/pmu: Create a function prototype to disable MSR interception Mingwei Zhang
2024-10-24 19:58   ` Chen, Zide
2024-10-25  2:50     ` Mi, Dapeng
2024-11-19 18:17   ` Sean Christopherson
2024-11-20  7:57     ` Mi, Dapeng
2024-08-01  4:58 ` [RFC PATCH v3 28/58] KVM: x86/pmu: Add intel_passthrough_pmu_msrs() to pass-through PMU MSRs Mingwei Zhang
2024-11-19 18:24   ` Sean Christopherson
2024-11-20 10:12     ` Mi, Dapeng
2024-11-20 18:32       ` Sean Christopherson
2024-08-01  4:58 ` [RFC PATCH v3 29/58] KVM: x86/pmu: Avoid legacy vPMU code when accessing global_ctrl in passthrough vPMU Mingwei Zhang
2024-08-01  4:58 ` [RFC PATCH v3 30/58] KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot() Mingwei Zhang
2024-09-02  7:51   ` Mi, Dapeng
2024-08-01  4:58 ` [RFC PATCH v3 31/58] KVM: x86/pmu: Add counter MSR and selector MSR index into struct kvm_pmc Mingwei Zhang
2024-11-19 18:58   ` Sean Christopherson
2024-11-20 11:50     ` Mi, Dapeng
2024-11-20 17:30       ` Sean Christopherson
2024-11-21  0:56         ` Mi, Dapeng
2024-08-01  4:58 ` [RFC PATCH v3 32/58] KVM: x86/pmu: Introduce PMU operation prototypes for save/restore PMU context Mingwei Zhang
2024-08-01  4:58 ` [RFC PATCH v3 33/58] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU Mingwei Zhang
2024-08-06  7:27   ` Mi, Dapeng
2024-08-01  4:58 ` [RFC PATCH v3 34/58] KVM: x86/pmu: Make check_pmu_event_filter() an exported function Mingwei Zhang
2024-08-01  4:58 ` [RFC PATCH v3 35/58] KVM: x86/pmu: Allow writing to event selector for GP counters if event is allowed Mingwei Zhang
2024-08-01  4:58 ` [RFC PATCH v3 36/58] KVM: x86/pmu: Allow writing to fixed counter selector if counter is exposed Mingwei Zhang
2024-09-02  7:59   ` Mi, Dapeng
2024-08-01  4:58 ` [RFC PATCH v3 37/58] KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary Mingwei Zhang
2024-10-24 20:26   ` Chen, Zide
2024-10-25  2:51     ` Mi, Dapeng
2024-11-19  1:46       ` Sean Christopherson
2024-11-19  5:20         ` Mi, Dapeng
2024-11-19 13:44           ` Sean Christopherson
2024-11-20  2:08             ` Mi, Dapeng
2024-10-31  3:14   ` Mi, Dapeng
2024-08-01  4:58 ` [RFC PATCH v3 38/58] KVM: x86/pmu: Exclude existing vLBR logic from the passthrough PMU Mingwei Zhang
2024-11-20 18:42   ` Sean Christopherson
2024-11-21  1:13     ` Mi, Dapeng
2024-08-01  4:58 ` [RFC PATCH v3 39/58] KVM: x86/pmu: Notify perf core at KVM context switch boundary Mingwei Zhang
2024-08-01  4:58 ` [RFC PATCH v3 40/58] KVM: x86/pmu: Grab x86 core PMU for passthrough PMU VM Mingwei Zhang
2024-11-20 18:46   ` Sean Christopherson
2024-11-21  2:04     ` Mi, Dapeng
2024-08-01  4:58 ` [RFC PATCH v3 41/58] KVM: x86/pmu: Add support for PMU context switch at VM-exit/enter Mingwei Zhang
2024-10-24 19:57   ` Chen, Zide
2024-10-25  2:55     ` Mi, Dapeng
2024-08-01  4:58 ` [RFC PATCH v3 42/58] KVM: x86/pmu: Introduce PMU operator to increment counter Mingwei Zhang
2024-08-01  4:58 ` [RFC PATCH v3 43/58] KVM: x86/pmu: Introduce PMU operator for setting counter overflow Mingwei Zhang
2024-10-25 16:16   ` Chen, Zide
2024-10-27 12:06     ` Mi, Dapeng
2024-11-20 18:48     ` Sean Christopherson
2024-11-21  2:05       ` Mi, Dapeng
2024-08-01  4:58 ` [RFC PATCH v3 44/58] KVM: x86/pmu: Implement emulated counter increment for passthrough PMU Mingwei Zhang
2024-11-20 20:13   ` Sean Christopherson
2024-11-21  2:27     ` Mi, Dapeng
2024-08-01  4:58 ` [RFC PATCH v3 45/58] KVM: x86/pmu: Update pmc_{read,write}_counter() to disconnect perf API Mingwei Zhang
2024-11-20 20:19   ` Sean Christopherson
2024-11-21  2:52     ` Mi, Dapeng
2024-08-01  4:58 ` [RFC PATCH v3 46/58] KVM: x86/pmu: Disconnect counter reprogram logic from passthrough PMU Mingwei Zhang
2024-11-20 20:40   ` Sean Christopherson
2024-11-21  3:02     ` Mi, Dapeng
2024-08-01  4:58 ` [RFC PATCH v3 47/58] KVM: nVMX: Add nested virtualization support for " Mingwei Zhang
2024-11-20 20:52   ` Sean Christopherson
2024-11-21  3:14     ` Mi, Dapeng
2024-08-01  4:58 ` [RFC PATCH v3 48/58] perf/x86/intel: Support PERF_PMU_CAP_PASSTHROUGH_VPMU Mingwei Zhang
2024-08-02 17:50   ` Liang, Kan
2024-08-01  4:58 ` [RFC PATCH v3 49/58] KVM: x86/pmu/svm: Set passthrough capability for vcpus Mingwei Zhang
2024-08-01  4:58 ` [RFC PATCH v3 50/58] KVM: x86/pmu/svm: Set enable_passthrough_pmu module parameter Mingwei Zhang
2024-08-01  4:59 ` [RFC PATCH v3 51/58] KVM: x86/pmu/svm: Allow RDPMC pass through when all counters exposed to guest Mingwei Zhang
2024-08-01  4:59 ` [RFC PATCH v3 52/58] KVM: x86/pmu/svm: Implement callback to disable MSR interception Mingwei Zhang
2024-11-20 21:02   ` Sean Christopherson
2024-11-21  3:24     ` Mi, Dapeng
2024-08-01  4:59 ` [RFC PATCH v3 53/58] KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest write to event selectors Mingwei Zhang
2024-11-20 21:38   ` Sean Christopherson
2024-11-21  3:26     ` Mi, Dapeng
2024-08-01  4:59 ` [RFC PATCH v3 54/58] KVM: x86/pmu/svm: Add registers to direct access list Mingwei Zhang
2024-08-01  4:59 ` [RFC PATCH v3 55/58] KVM: x86/pmu/svm: Implement handlers to save and restore context Mingwei Zhang
2024-08-01  4:59 ` [RFC PATCH v3 56/58] KVM: x86/pmu/svm: Wire up PMU filtering functionality for passthrough PMU Mingwei Zhang
2024-11-20 21:39   ` Sean Christopherson
2024-11-21  3:29     ` Mi, Dapeng
2024-08-01  4:59 ` [RFC PATCH v3 57/58] KVM: x86/pmu/svm: Implement callback to increment counters Mingwei Zhang
2024-08-01  4:59 ` [RFC PATCH v3 58/58] perf/x86/amd: Support PERF_PMU_CAP_PASSTHROUGH_VPMU for AMD host Mingwei Zhang
2024-09-11 10:45 ` [RFC PATCH v3 00/58] Mediated Passthrough vPMU 3.0 for x86 Ma, Yongwei
2024-11-19 14:00 ` Sean Christopherson
2024-11-20  2:31   ` Mi, Dapeng
2024-11-20 11:55 ` Mi, Dapeng
2024-11-20 18:34   ` Sean Christopherson
